What is the context window for Qwen3-Embedding-8B?

Qwen3-Embedding-8B supports a maximum context window of 32,768 tokens. However, for optimal retrieval performance in RAG applications, it is generally recommended to chunk documents into smaller segments (e.g., 512 to 2,048 tokens) before generating embeddings.
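As a rough illustration of the chunking step, the sketch below splits a token sequence into fixed-size windows with optional overlap. The function name `chunk_by_tokens` and the overlap parameter are hypothetical, and whitespace splitting stands in for the model's real tokenizer, so actual token counts will differ.

```python
def chunk_by_tokens(tokens, max_tokens=512, overlap=0):
    """Split a token list into windows of at most max_tokens, with optional overlap."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in the range [0, max_tokens)")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once the current window has reached the end of the input.
        if start + max_tokens >= len(tokens):
            break
    return chunks


# Assumption: whitespace-separated words approximate tokens for this demo only.
words = ("lorem ipsum " * 600).split()  # 1,200 pseudo-tokens
chunks = chunk_by_tokens(words, max_tokens=512, overlap=64)
texts = [" ".join(chunk) for chunk in chunks]  # strings ready to embed
```

A small overlap is one common way to avoid cutting a sentence's context at a chunk boundary; whether it helps depends on the corpus.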

How much does it cost to run Qwen3-Embedding-8B on Lyceum?

Lyceum Technology charges $0.01 per million input tokens for Qwen3-Embedding-8B on our Fast tier. Because it is an embedding model, there are no output tokens. We also do not charge any egress fees for transferring the generated vectors.

Is Qwen3-Embedding-8B GDPR compliant?

Yes, when accessed through Lyceum Technology. We host the model on our EU-sovereign infrastructure in the eu-north1 region. All data processing remains within European borders, which supports compliance with the GDPR and the EU AI Act.

How do I call Qwen3-Embedding-8B using the OpenAI SDK?

You can use the standard OpenAI Python or Node.js SDK. Initialize the client with your Lyceum API key and set the base URL to https://api.lyceum.technology/api/v2/external/serverless. Then, use the embeddings.create method with the model string Qwen/Qwen3-Embedding-8B.

What is the output dimension of the embeddings?

The model natively outputs vectors with 4,096 dimensions. Because it was trained using Matryoshka Representation Learning (MRL), you can safely truncate these vectors to lower dimensions (such as 1024 or 256) and re-normalize them to save vector database storage space.

Under what license is Qwen3-Embedding-8B released?

The Qwen3-Embedding-8B model weights are open-sourced by Alibaba under the Apache 2.0 license, which permits both research and commercial use. Lyceum Technology provides managed API access to this open-weights model.

Qwen3-Embedding-8B API: pricing, benchmarks & EU

Qwen3-Embedding-8B is an 8-billion-parameter text embedding model developed by Alibaba's Qwen team. Built as a decoder-only transformer, it specializes in dense retrieval, semantic search, and multilingual clustering across more than 100 languages. The model features a 32,768-token context window and outputs 4,096-dimensional vectors, with support for Matryoshka Representation Learning (MRL) to allow dimension truncation without significant performance loss. Lyceum Technology serves Qwen3-Embedding-8B through our OpenAI-compatible Serverless Inference API. Engineering teams can deploy this model on EU-sovereign infrastructure, supporting GDPR compliance while maintaining drop-in compatibility with existing RAG applications.

Get started: call Qwen3-Embedding-8B on Lyceum

Lyceum provides an OpenAI-compatible API for Qwen3-Embedding-8B. Because the endpoint mirrors the standard OpenAI specification, you can switch your embedding provider by updating the base URL and API key, requiring minimal changes to your downstream RAG logic.

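from openai import OpenAI

# Point the standard OpenAI client at Lyceum's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Request an embedding for a single input string.
response = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-8B",
    input="Your text to embed",
)

# The vector is returned as a list of floats on the first data item.
embedding = response.data[0].embedding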

Pricing and region for Qwen3-Embedding-8B

This model is available on the Fast tier, which optimizes for cost-efficient, high-throughput workloads. It is hosted in the eu-north1 region, ensuring that all data processing remains within European borders.

The pricing for Qwen3-Embedding-8B is $0.01 per million tokens. Because this is an embedding model, billing applies exclusively to input tokens; there are no output tokens generated. Lyceum charges no egress fees, allowing you to transfer large volumes of vector data to your vector database without incurring hidden network costs.

What Qwen3-Embedding-8B is good at

Multilingual retrieval and cross-lingual search

Qwen3-Embedding-8B supports over 100 natural and programming languages. It excels in cross-lingual retrieval tasks where the search query and the target document are in different languages. This makes it highly effective for global enterprise search systems and multilingual RAG pipelines, outperforming previous-generation models on the MMTEB (Multilingual Text Embedding Benchmark).
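Once the query and the documents are embedded into the same vector space, retrieval typically comes down to ranking documents by similarity to the query, regardless of language. The sketch below shows that ranking step with cosine similarity. The helper names and the toy 3-dimensional vectors are illustrative assumptions, not model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (document index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings of a query and three documents.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
ranking = rank_documents(query, docs)
```

In production this ranking is normally delegated to a vector database; the pure-Python version is only meant to make the scoring explicit.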

Instruction-aware embedding generation

The model architecture is instruction-aware, meaning it can adapt its vector representations based on task-specific prompts. By prepending an instruction (e.g., Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: {text}), developers can optimize the embeddings for specific downstream tasks like classification, clustering, or asymmetric retrieval. The Qwen team reports that using tailored instructions yields a 1% to 5% performance improvement across various benchmarks.

Matryoshka Representation Learning (MRL)

While the model natively outputs 4,096-dimensional vectors, it was trained using Matryoshka Representation Learning. This allows developers to truncate the output vectors to lower dimensions (e.g., 1024 or 256) and re-normalize them without a catastrophic drop in retrieval accuracy. This flexibility is critical for teams managing vector database storage costs, as it permits a trade-off between storage footprint and semantic precision.
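The truncate-then-re-normalize step described above can be sketched in a few lines. The function name `truncate_and_normalize` is hypothetical, and the constant vector is a placeholder for a real 4,096-dimensional embedding.

```python
import math


def truncate_and_normalize(vector, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and the vector length")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # an all-zero vector cannot be normalized
    return [x / norm for x in head]


# Placeholder for a real embedding returned by the API.
full = [0.5] * 4096
small = truncate_and_normalize(full, 1024)
```

Assuming vectors are stored as 32-bit floats, a 4,096-dimensional vector takes 16,384 bytes and a 1,024-dimensional one takes 4,096 bytes, so this truncation cuts raw vector storage by a factor of four.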

Benchmarks and how it compares

Qwen3-Embedding-8B benchmark results

Qwen3-Embedding-8B achieves state-of-the-art performance on the Massive Text Embedding Benchmark (MTEB), particularly in multilingual and retrieval-heavy subsets. The table below highlights its performance against both its smaller siblings and other prominent embedding models.

Model	Parameters	MTEB (Mean)	Retrieval	Bitext Mining
Qwen3-Embedding-8B	8B	70.58	70.88	80.89
Qwen3-Embedding-4B	4B	69.45	69.60	79.36
Qwen3-Embedding-0.6B	0.6B	64.33	64.64	72.22
gte-Qwen2-7B-instruct	7B	62.51	60.08	73.92

Source: Qwen3-Embedding GitHub Repository (MTEB Leaderboard).

Comparison to sibling models

Within the Qwen3 embedding family, the 8B model offers the highest semantic accuracy, scoring 70.58 on the MTEB mean compared to the 4B model's 69.45 and the 0.6B model's 64.33. The 8B variant is the optimal choice for complex enterprise RAG pipelines where retrieval precision is paramount. However, for latency-constrained applications or teams processing billions of tokens where cost is the primary driver, the Qwen3-Embedding-0.6B provides a highly efficient alternative, trading a few percentage points of accuracy for significantly lower computational overhead.

Using it in production

Production configuration for Qwen3-Embedding-8B

When integrating Qwen3-Embedding-8B into a production environment, proper configuration of the API request ensures optimal performance. The model is accessed via the standard embeddings.create endpoint. Because it is hosted on Lyceum Technology's Fast tier in the eu-north1 region, you benefit from high-throughput processing tailored for bulk document ingestion and real-time query embedding.

To maximize retrieval quality, prepend task-specific instructions to your queries. For example, when embedding a user's search query, format the input string as Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: {user_input}. When embedding the documents themselves for storage in your vector database, instructions are typically omitted, allowing the model to generate a neutral semantic representation of the text.
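A minimal sketch of this asymmetric formatting follows: the query gets the instruction prefix, while the document is passed through unchanged. The helper name `format_query` and the sample texts are illustrative assumptions.

```python
WEB_SEARCH_TASK = (
    "Given a web search query, retrieve relevant passages that answer the query."
)


def format_query(task, query):
    """Prefix a query with a task instruction in the Instruct/Query layout."""
    return f"Instruct: {task}\nQuery: {query}"


# Queries carry the instruction; documents are embedded as plain text.
query_input = format_query(WEB_SEARCH_TASK, "What is the capital of Sweden?")
document_input = "Stockholm is the capital of Sweden."
```

Both strings would then be sent through embeddings.create as the input value; keeping the formatting in one helper makes it harder for query-time and index-time conventions to drift apart.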

Calculating per-token pricing

Lyceum bills Qwen3-Embedding-8B at $0.01 per million tokens for input processing. Because embedding models do not generate text, there are no output token costs.

For a realistic production workload, consider a pipeline that ingests 50,000 documents, each averaging 800 tokens. The total input volume is 40 million tokens. At $0.01 per million tokens, the total compute cost to embed this entire dataset is $0.40. Furthermore, because Lyceum does not charge egress fees, transferring the resulting 4,096-dimensional vectors to your external vector database incurs no additional network costs, which keeps spend predictable for large-scale enterprise deployments.
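The same arithmetic can be reproduced with a small helper for other workload sizes. The function name `embedding_cost_usd` is hypothetical; the price constant is the $0.01 per million input tokens quoted above.

```python
from decimal import Decimal

# USD per one million input tokens, taken from the pricing stated above.
PRICE_PER_MILLION_TOKENS = Decimal("0.01")


def embedding_cost_usd(num_documents, avg_tokens_per_document,
                       price_per_million=PRICE_PER_MILLION_TOKENS):
    """Estimate embedding cost from document count and average token length."""
    total_tokens = num_documents * avg_tokens_per_document
    return Decimal(total_tokens) / Decimal(1_000_000) * price_per_million


cost = embedding_cost_usd(50_000, 800)  # 40 million tokens
print(f"${cost:.2f}")  # prints $0.40
```

Decimal is used instead of float so that small per-token prices do not pick up binary rounding error when multiplied across large volumes.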

Running Qwen3-Embedding-8B on EU-sovereign infrastructure

Why run Qwen3-Embedding-8B on Lyceum

For European enterprises and AI startups, data residency is a regulatory requirement. Processing sensitive corporate documents, medical records, or proprietary code through US-hosted APIs introduces compliance considerations. Lyceum addresses these requirements by hosting Qwen3-Embedding-8B on EU-sovereign infrastructure in our eu-north1 region.

By utilizing our Serverless Inference API, your data remains within European borders, supporting compliance with the GDPR and the EU AI Act. We own and operate our GPU infrastructure, which provides cost efficiencies over providers that rent compute from hyperscalers. This allows us to offer the model at $0.01 per million tokens while maintaining performance and security.

Furthermore, Lyceum is built on open-stack transparency. We utilize optimized open-source inference engines like vLLM and NVIDIA Dynamo rather than proprietary black-box systems. This supports customer portability; you are not locked into a proprietary ecosystem. With our drop-in OpenAI-compatible API, migrating your RAG application to GDPR-compliant LLM inference in Europe requires only updating the base URL and API key, allowing your engineering team to focus on building features rather than managing infrastructure.
