LLM Inference & Model Serving Inference Optimization 14 min read read

LLM Inference Tokens Per Second: 2026 Hardware and Software Benchmarks

A technical breakdown of H100, B200, vLLM, and TensorRT-LLM performance for production workloads.

Maximilian Niroomand

Maximilian Niroomand

June 10, 2026 · CTO & Co-Founder at Lyceum Technology

Optimizing large language models for production requires a deep understanding of memory bandwidth and software execution. When you deploy a model, the inference process splits into two distinct phases. The prefill phase processes the input prompt and is heavily compute-bound. The decode phase generates the output tokens one by one and is almost entirely memory-bandwidth bound. This architectural reality makes tokens per second (TPS) the ultimate metric for evaluating inference performance in 2026. Raw teraflops look impressive on a spec sheet, but if your GPU cannot move data from memory to the compute cores fast enough, those cores sit idle.

The Physics of LLM Inference in 2026

Engineering teams must track two specific latency metrics alongside overall throughput to accurately measure user experience and system efficiency. two specific latency metrics alongside overall throughput. Time to First Token (TTFT) measures the delay before the model begins generating the response. This metric relies entirely on the compute-heavy prefill phase, where the GPU processes the input prompt in parallel. Inter-Token Latency (ITL) measures the time between each generated token, which relies on the memory-bound decode phase. Understanding the distinction between these two phases is critical for diagnosing performance bottlenecks in production environments.

The Autoregressive Bottleneck

Because large language models are autoregressive, every single token generated requires a full pass through the model weights. The GPU must load the entire model and the Key-Value (KV) cache from memory into the compute cores for every single step of the generation process. This architectural reality makes memory bandwidth the ultimate limiting factor for inference speed. Raw compute power matters for the initial prompt processing, but once generation begins, the speed at which data moves from High Bandwidth Memory (HBM) to the streaming multiprocessors dictates your tokens per second. If the memory bus is too narrow, the most powerful compute cores in the world will simply sit idle waiting for data to arrive.

Balancing Batch Size and Latency

Maximizing tokens per second while keeping ITL low requires a delicate balance of batch sizes, quantization levels, and KV cache management. If you scale your batch size too high, overall system throughput increases, but individual request latency spikes to unacceptable levels for real-time applications. Finding the optimal frontier demands the right combination of silicon and software. Engineering teams must carefully monitor their workload profiles. A chat application requires very low ITL to feel responsive to human users, whereas an offline batch processing job summarizing documents can tolerate higher latency in exchange for maximum throughput. Understanding these physical constraints is the first step in optimizing your 2026 inference architecture for both speed and cost efficiency.

Hardware Benchmarks: A100 vs H100 vs B200

The 2026 hardware landscape offers clear generational leaps in memory bandwidth. Because the decode phase is memory-bound, upgrading your silicon directly multiplies your tokens per second. Engineering teams must evaluate these hardware options based on their specific throughput requirements and budget constraints, ensuring they select the right GPU for their target model size.

The A100 and H100 Performance Gap

The NVIDIA A100 80GB provides 2.0 TB/s of memory bandwidth. It remains a reliable workhorse for smaller models, but it struggles with high-concurrency serving for models exceeding 70 billion parameters. The NVIDIA H100 80GB SXM5 increases that bandwidth to 3.35 TB/s and introduces the Transformer Engine for native FP8 computation. According to recent benchmarking data evaluating Llama 3.3 70B inference, the performance gap is significant. When running at a batch size of 32, the A100 INT4 achieves approximately 1,400 tokens per second. Under the same conditions, the H100 W4A8 hits roughly 4,800 tokens per second. You get more than triple the throughput from a single generation upgrade, making the H100 significantly more cost-effective for heavy workloads that require rapid generation.

The B200 Inference Leap

The newly deployed Blackwell B200 pushes the ceiling even further. Featuring 192GB of HBM3e memory and 8.0 TB/s of bandwidth, the B200 is built specifically for large-scale inference workloads. Recent performance analyses of Llama 4 Maverick on NVIDIA H200 versus B200 using vLLM show the B200 delivering up to 47% higher output token throughput than the H200 at peak concurrency. For engineering teams, the decision comes down to utilization. If you have a steady stream of high-concurrency traffic, the H100 or B200 will process requests much faster and cheaper than an array of older A100s. The high memory capacity of the B200 also allows for serving much larger models without resorting to complex multi-GPU tensor parallelism, further reducing latency overhead and simplifying deployment architecture.

Calculating the True Cost Per Million Tokens

Tokens per second directly dictates your unit economics. High throughput dilutes your hourly infrastructure costs across millions of generated tokens, making raw performance a critical financial metric for AI startups and enterprises alike. Understanding how to calculate and optimize this cost is essential for building a sustainable business model around large language models.

The Economics of High Throughput

Consider a typical production workload. If an H100 generates 4,800 tokens per second at a high batch size, it produces roughly 17.28 million tokens per hour. Your cost per million tokens is entirely dependent on what you pay for that hour of compute. Standard hyperscaler pricing for on-demand H100 virtual machines can quickly become unsustainable for startups running sustained inference or long-term training jobs. When you rely on public clouds, you are often paying a significant premium for their corporate overhead and marketing budgets, rather than the raw compute power you actually need to serve your users.

Structural Cost Advantages

Lyceum offers a structural cost advantage. Because we own our GPU infrastructure rather than renting from hyperscalers, we provide H100 VMs at a fraction of the cost of traditional clouds. We pair this with per-second billing across the board, meaning you never pay for idle minutes. This direct-to-metal approach eliminates the middleman markup that plagues the current AI infrastructure market, allowing you to achieve a significantly lower cost per million tokens.

The Pythia AI Scheduler

To drive costs down further, the Pythia AI Scheduler automatically predicts VRAM requirements and optimizes runtime estimation. Engineering teams using Pythia see an average of 30% to 34% cost savings per job by ensuring workloads are routed to the most efficient available hardware. Instead of over-provisioning expensive H100s for tasks that could run on smaller GPUs, Pythia intelligently matches the workload to the silicon. This level of orchestration ensures that your cost per million tokens remains highly competitive, even as model sizes and user demand continue to grow exponentially in 2026.

European Data Sovereignty and Production Deployment

Raw performance and low costs are irrelevant for European AI teams if the infrastructure fails compliance audits. The regulatory landscape in 2026 requires strict adherence to GDPR and the newly enforced AI Act. Deploying models without considering data sovereignty can lead to severe legal and financial penalties, making infrastructure location a primary concern for enterprise architects.

The Risks of US-Based Routing

Routing your proprietary data or customer information through US-based API providers is a significant compliance risk. Most existing inference platforms host their infrastructure outside the EU or rely on complex legal frameworks that do not guarantee true data isolation. This makes them a deal-breaker for healthcare, manufacturing, and enterprise applications where data privacy is paramount. Relying on foreign infrastructure also exposes your application to unpredictable latency spikes caused by transatlantic data transfers, which can severely degrade the user experience for European customers.

Sovereign Infrastructure Solutions

Lyceum is a dedicated EU-native inference platform built for enterprise scale. All data stays strictly within European data centers. When you deploy a model on our Inference Engine, the machine is exclusively yours. There is no shared tenancy and no black-box data routing. We guarantee that your prompts, weights, and generated tokens never leave the European Union, providing complete peace of mind for your legal and compliance teams.

Seamless API Integration

We provide a 100% OpenAI-compatible API endpoint. You simply change the base URL in your SDK and your application runs on sovereign infrastructure. With 40+ supply-side partners, we guarantee high availability even during global GPU shortages. You can provision a VM in 18 seconds, deploy your model, and scale to zero when traffic drops. This seamless integration allows European developers to build fully compliant applications without sacrificing the developer experience, rewriting their entire codebase, or compromising on tokens per second performance.

Avoiding Common Inference Bottlenecks

Scaling LLM inference exposes several common infrastructure bottlenecks. Recognizing these pitfalls early prevents significant cost overruns and degraded user experiences. Even with the fastest GPUs and the most optimized software engines, a poorly architected deployment will struggle to maintain high tokens per second under heavy load.

KV Cache Footprint Management

The KV cache stores the context of the conversation. As context windows grow to 128k tokens and beyond, the KV cache can consume more VRAM than the model weights themselves. If you do not utilize FP8 quantization for your cache, you will run out of memory long before you hit your compute limits. Engineering teams must implement advanced caching strategies, such as prompt caching or context window sliding, to keep memory usage under control during long conversational sessions. Failing to manage the KV cache will result in out-of-memory errors and dropped requests.

Reliable Scaling vs. Hyperscaler Auto-scaling

Auto-scaling GPUs on public clouds is notoriously unreliable. You often wait 20 minutes for a node to spin up, only to receive an out-of-capacity error. This cold start latency destroys the user experience for real-time applications. Lyceum solves this with dedicated endpoints that scale reliably and infrastructure designed for instant, per-token execution. Our orchestration layer ensures that warm nodes are always available to handle sudden spikes in traffic, maintaining consistent tokens per second regardless of user demand.

Eliminating Data Transfer Fees

Moving large datasets and model weights in and out of public clouds incurs heavy egress fees. These hidden costs can quickly dwarf your actual compute spend. Lyceum eliminates this friction by providing free S3-compatible storage with zero data transfer charges. You can experiment, train, and serve without worrying about hidden network costs. This predictable pricing model allows engineering teams to focus entirely on optimizing their tokens per second rather than constantly auditing their cloud bills for unexpected egress charges.

Evaluating Models on the LLM Leaderboard

As new models are released at a breakneck pace in 2026, tracking their real-world inference performance requires standardized benchmarks. Engineering teams cannot rely solely on theoretical hardware specifications to predict how a specific model will behave in production. This is where comprehensive tracking tools become essential for architectural planning and capacity management.

Tracking Performance Metrics

Resources like the Vellum LLM Leaderboard provide critical visibility into how different models perform across various hardware configurations. These leaderboards track essential metrics such as context window size, output tokens per second, and overall latency. By comparing models side-by-side, developers can make informed decisions about which architecture best suits their specific use case. For example, a smaller 8B parameter model might dominate the leaderboard in raw speed, making it ideal for real-time chat applications, while a large 70B model might offer superior reasoning capabilities at a lower tokens per second rate. Understanding these tradeoffs is vital for optimizing user experience.

The Impact of Model Architecture

The architecture of the model itself heavily influences its position on these performance leaderboards. Mixture of Experts (MoE) models, for instance, only activate a subset of their parameters during inference. This allows them to achieve much higher tokens per second compared to dense models of a similar total parameter count. However, MoE models require significantly more VRAM to store all the inactive experts, creating a complex tradeoff between memory capacity and generation speed. When consulting leaderboards, teams must look beyond the top-line speed and consider the memory footprint required to achieve those results. Lyceum provides the flexible infrastructure needed to deploy both dense and MoE architectures efficiently, allowing you to match the right model to the right hardware without overspending on unnecessary VRAM.

Quantization Strategies for Maximum Throughput

Maximum tokens per second in 2026 requires aggressive quantization. By reducing the precision of the model weights and activations, you can drastically reduce the memory bandwidth required for the decode phase. This allows the GPU to process tokens much faster, directly improving your overall throughput and significantly reducing your cost per million tokens in production environments.

Understanding Precision Formats

Historically, models were served in FP16 or BF16 precision. However, modern hardware like the NVIDIA H100 and B200 feature specialized tensor cores designed to accelerate lower-precision formats. As seen in the NVIDIA A100 vs H100 benchmarks, utilizing formats like W4A8 (4-bit weights, 8-bit activations) can yield significant performance gains. The H100 running W4A8 achieves roughly 4,800 tokens per second for a 70B model, far outpacing the older A100 running INT4. The Blackwell B200 takes this further with native FP4 support, pushing the boundaries of what is possible for high-speed inference. Upgrading to these newer formats is the most effective way to scale throughput.

The Accuracy Tradeoff

The primary concern with quantization is the potential degradation of model accuracy. Dropping from 16-bit to 4-bit precision can introduce rounding errors that affect the quality of the generated text. However, advanced calibration techniques and quantization-aware training have largely mitigated these issues in 2026. For most production workloads, the slight drop in theoretical accuracy is imperceptible to the end user, while the 3x to 4x increase in tokens per second is highly noticeable. Engineering teams must test their specific prompts against quantized models to ensure the output quality remains acceptable for their business logic. Lyceum supports all major quantization formats natively, allowing you to easily benchmark different precision levels and find the perfect balance between speed and accuracy for your specific application.

Frequently Asked Questions

What is the difference between Time to First Token (TTFT) and Inter-Token Latency (ITL)?

TTFT measures the time it takes to process the input prompt and generate the very first output token. This phase is heavily compute-bound, relying on the raw processing power of the GPU. ITL measures the time between each subsequent generated token, which is heavily dependent on memory bandwidth. Balancing both metrics is crucial for a responsive user experience.

How does the NVIDIA B200 improve LLM inference?

The Blackwell B200 features 192GB of HBM3e memory and 8.0 TB/s of memory bandwidth. This increase in bandwidth allows it to process up to 47% more output tokens per second than the H200 at peak concurrency. The B200 is specifically engineered to handle large-scale inference workloads and large parameter models efficiently.

Why should European teams avoid US-based inference APIs?

US-based API providers typically route data through servers outside the EU, which violates strict GDPR and AI Act compliance requirements for sensitive data. This exposes companies to significant legal risks. Lyceum ensures all data remains on owned, EU-sovereign infrastructure, providing complete data isolation and regulatory compliance for enterprise applications.

How does the Pythia AI Scheduler reduce inference costs?

The Pythia AI Scheduler automatically predicts VRAM requirements and estimates runtime for your specific workloads. By intelligently routing jobs to the most efficient available hardware rather than over-provisioning expensive GPUs, it reduces overall compute costs by 30% to 34%. This ensures you maximize your tokens per second without wasting budget on idle compute.

Can I use my existing OpenAI SDK with Lyceum?

Yes. Lyceum provides a 100% OpenAI-compatible API endpoint for seamless integration. You simply change the base URL in your existing code and insert your API key. This requires zero architectural changes, allowing you to instantly deploy your applications on sovereign European infrastructure while maintaining the exact same developer experience.

Related Resources

/magazine/vllm-production-deployment-guide-2026; /magazine/nvidia-dynamo-inference-orchestration-guide; /magazine/reduce-llm-inference-latency-gpu