Qwen3-Embedding-8B: specs, benchmarks, and how to run it on Lyceum
A multilingual 8B-parameter embedding model with 32k context and Matryoshka representation learning.
Magnus Grünewald
June 27, 2026 · CEO at Lyceum Technology
Qwen3-Embedding-8B is an 8-billion parameter text embedding model developed by Alibaba's Qwen team. Built as a decoder-only transformer, it specializes in dense retrieval, semantic search, and multilingual clustering across more than 100 languages. The model features a 32,768-token context window and outputs 4,096-dimensional vectors, with support for Matryoshka Representation Learning (MRL) to allow dimension truncation without significant performance loss. Lyceum Technology serves Qwen3-Embedding-8B through our OpenAI-compatible Serverless Inference API. Engineering teams can deploy this model on EU-sovereign infrastructure, ensuring strict GDPR compliance while maintaining drop-in compatibility with existing RAG applications.
Get started: call Qwen3-Embedding-8B on Lyceum
Lyceum provides an OpenAI-compatible API for Qwen3-Embedding-8B. Because the endpoint mirrors the standard OpenAI specification, you can switch your embedding provider by updating the base URL and API key, requiring minimal changes to your downstream RAG logic.
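from openai import OpenAI

# Point the standard OpenAI client at Lyceum's serverless endpoint.
client = OpenAI(
    base_url="https://api.lyceum.technology/api/v2/external/serverless",
    api_key="<your lyceum api key>",
)

# Request an embedding for a single input string.
response = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-8B",
    input="Your text to embed",
)

# The vector is returned in the standard OpenAI response shape.
embedding = response.data[0].embedding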
Pricing and region for Qwen3-Embedding-8B
This model is available on the Fast tier, which optimizes for cost-efficient, high-throughput workloads. It is hosted in the eu-north1 region, ensuring that all data processing remains within European borders.
The pricing for Qwen3-Embedding-8B is $0.01 per million tokens. Because this is an embedding model, billing applies exclusively to input tokens; there are no output tokens generated. Lyceum charges no egress fees, allowing you to transfer large volumes of vector data to your vector database without incurring hidden network costs.
What Qwen3-Embedding-8B is good at
Multilingual retrieval and cross-lingual search
Qwen3-Embedding-8B supports over 100 natural and programming languages. It excels in cross-lingual retrieval tasks where the search query and the target document are in different languages. This makes it well suited to global enterprise search systems and multilingual RAG pipelines, and it outperforms previous-generation models on MMTEB (the Massive Multilingual Text Embedding Benchmark).
Instruction-aware embedding generation
The model architecture is instruction-aware, meaning it can adapt its vector representations based on task-specific prompts. By prepending an instruction (e.g., Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: {text}), developers can optimize the embeddings for specific downstream tasks like classification, clustering, or asymmetric retrieval. The Qwen team reports that using tailored instructions yields a 1% to 5% performance improvement across various benchmarks.
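The instruction template quoted above can be wrapped in a small helper. This is a minimal sketch: the function names `format_query` and `format_document` and the constant `DEFAULT_TASK` are illustrative and not part of any Lyceum or Qwen library. The layout simply mirrors the example string in the paragraph above.

```python
# Hypothetical helpers; names are illustrative, not from any official SDK.
DEFAULT_TASK = (
    "Given a web search query, retrieve relevant passages that answer the query."
)


def format_query(text: str, task: str = DEFAULT_TASK) -> str:
    # Prefix the query with a one-sentence task description, then the query
    # itself on a new line, following the template shown in the text above.
    return f"Instruct: {task}\nQuery: {text}"


def format_document(text: str) -> str:
    # Documents are passed through unchanged (no instruction prefix).
    return text
```

Swapping `task` for a classification or clustering description is how the same model would be steered toward a different downstream use; which wording works best is something to validate on your own evaluation set.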
Matryoshka Representation Learning (MRL)
While the model natively outputs 4,096-dimensional vectors, it was trained using Matryoshka Representation Learning. This allows developers to truncate the output vectors to lower dimensions (e.g., 1024 or 256) and re-normalize them without a catastrophic drop in retrieval accuracy. This flexibility is critical for teams managing vector database storage costs, as it permits a trade-off between storage footprint and semantic precision.
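Truncation followed by re-normalization can be done client-side on any returned vector. The sketch below uses only the standard library; `truncate_embedding` is a hypothetical helper name, and it assumes you want unit-length (L2-normalized) vectors so that dot product and cosine similarity agree after truncation.

```python
import math
from typing import List, Sequence


def truncate_embedding(vector: Sequence[float], dim: int) -> List[float]:
    # Keep only the first `dim` components (the Matryoshka prefix).
    if not 0 < dim <= len(vector):
        raise ValueError(f"dim must be between 1 and {len(vector)}, got {dim}")
    head = list(vector[:dim])

    # Re-normalize to unit length so similarity scores stay comparable.
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # An all-zero prefix cannot be normalized; return as-is.
    return [x / norm for x in head]
```

Truncating from 4,096 to 1,024 dimensions cuts per-vector storage to a quarter. The accuracy cost depends on your data, so it is worth measuring retrieval quality at each candidate dimension before committing to one.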
Benchmarks and how it compares
Qwen3-Embedding-8B benchmark results
According to the Qwen team, Qwen3-Embedding-8B ranked first on the MTEB multilingual leaderboard at release, with a mean score of 70.58 as of June 5, 2025. The table below compares its multilingual MTEB scores against its smaller siblings and a prominent previous-generation embedding model.
| Model | Parameters | MTEB Multilingual Mean (Task) | Retrieval | Bitext Mining |
|---|---|---|---|---|
| Qwen3-Embedding-8B | 8B | 70.58 | 70.88 | 80.89 |
| Qwen3-Embedding-4B | 4B | 69.45 | 69.60 | 79.36 |
| Qwen3-Embedding-0.6B | 0.6B | 64.33 | 64.64 | 72.22 |
| gte-Qwen2-7b-Instruct | 7B | 62.51 | 60.08 | 73.92 |
Source: Qwen3-Embedding GitHub Repository (MTEB Leaderboard).
Comparison to sibling models
Within the Qwen3 embedding family, the 8B model offers the highest semantic accuracy, scoring 70.58 on the MTEB mean compared to the 4B model's 69.45 and the 0.6B model's 64.33. The 8B variant is the strongest choice for complex enterprise RAG pipelines where retrieval precision is paramount. However, for latency-constrained applications or teams processing billions of tokens where cost is the primary driver, Qwen3-Embedding-0.6B is an efficient alternative, trading roughly six points on the MTEB mean for significantly lower computational overhead.
Using it in production
Production configuration for Qwen3-Embedding-8B
When integrating Qwen3-Embedding-8B into a production environment, proper configuration of the API request ensures optimal performance. The model is accessed via the standard embeddings.create endpoint. Because it is hosted on Lyceum Technology's Fast tier in the eu-north1 region, you benefit from high-throughput processing tailored for bulk document ingestion and real-time query embedding.
To maximize retrieval quality, prepend task-specific instructions to your queries. For example, when embedding a user's search query, format the input string as Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: {user_input}. When embedding the documents themselves for storage in your vector database, instructions are typically omitted, allowing the model to generate a neutral semantic representation of the text.
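The asymmetric pattern described above (instruction on the query, none on the documents) can be sketched end to end as follows. The `embed` argument is a hypothetical callable that takes a list of strings and returns one vector per string. In practice it might wrap `client.embeddings.create` and return `[item.embedding for item in response.data]`, assuming the endpoint accepts a list of inputs as the OpenAI specification does. In production the document vectors would normally be embedded once and stored in your vector database rather than recomputed per query.

```python
import math
from typing import Callable, List, Sequence, Tuple

QUERY_TASK = (
    "Given a web search query, retrieve relevant passages that answer the query."
)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Cosine similarity; returns 0.0 if either vector has zero length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query: str,
    documents: Sequence[str],
    embed: Callable[[List[str]], List[List[float]]],  # hypothetical embedder
) -> List[Tuple[str, float]]:
    # Only the query carries the instruction prefix; documents go in as-is.
    query_input = f"Instruct: {QUERY_TASK}\nQuery: {query}"
    vectors = embed([query_input] + list(documents))
    query_vec, doc_vecs = vectors[0], vectors[1:]

    # Score each document against the query and sort best-first.
    scored = [(doc, cosine(query_vec, vec)) for doc, vec in zip(documents, doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```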
Calculating per-token pricing
Lyceum bills Qwen3-Embedding-8B at $0.01 per million tokens for input processing. Because embedding models do not generate text, there are no output token costs.
For a realistic production workload, consider a pipeline that ingests 50,000 documents, each averaging 800 tokens. The total input volume is 50,000 × 800 = 40 million tokens. At $0.01 per million tokens, the total compute cost to embed this entire dataset is $0.40. Furthermore, because Lyceum does not charge egress fees, transferring the resulting 4,096-dimensional vectors to your external vector database incurs no additional network costs, which keeps total spend predictable for large-scale enterprise deployments.
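The arithmetic above generalizes to a one-line estimate. The helper name `embedding_cost_usd` is illustrative, and the default price is the $0.01 per million tokens quoted in this article. Real token counts depend on the tokenizer, so treat the average as an estimate.

```python
# Price quoted in this article; adjust if the published rate changes.
PRICE_PER_MILLION_TOKENS_USD = 0.01


def embedding_cost_usd(
    num_documents: int,
    avg_tokens_per_document: int,
    price_per_million: float = PRICE_PER_MILLION_TOKENS_USD,
) -> float:
    # Only input tokens are billed for an embedding model.
    total_tokens = num_documents * avg_tokens_per_document
    return total_tokens / 1_000_000 * price_per_million


# The workload from the text: 50,000 documents at 800 tokens each.
print(f"${embedding_cost_usd(50_000, 800):.2f}")
```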
Running Qwen3-Embedding-8B on EU-sovereign infrastructure
Why run Qwen3-Embedding-8B on Lyceum
For European enterprises and AI startups, data residency is a regulatory requirement. Processing sensitive corporate documents, medical records, or proprietary code through US-hosted APIs introduces compliance considerations. Lyceum addresses these requirements by hosting Qwen3-Embedding-8B on EU-sovereign infrastructure in our eu-north1 region.
When you use our Serverless Inference API, your data remains within European borders, supporting compliance with the GDPR and the EU AI Act. We own and operate our GPU infrastructure, which provides cost efficiencies over providers that rent compute from hyperscalers. This allows us to offer the model at $0.01 per million tokens while maintaining performance and security.
Furthermore, Lyceum is built on open-stack transparency. We use optimized open-source inference engines like vLLM and NVIDIA Dynamo rather than proprietary black-box systems. This supports customer portability; you are not locked into a proprietary ecosystem. With our drop-in OpenAI-compatible API, migrating your RAG application to GDPR-compliant inference in Europe requires only a change to the base URL and API key, allowing your engineering team to focus on building features rather than managing infrastructure.