26 May 2026·4 min read·By Marcus Thorne

OSCAR 2-Bit KV Cache Quantization: Hands-On Reality Check

OSCAR 2-bit KV cache quantization slashes LLM KV memory about 8× and boosts job throughput up to 7.83× with near-zero accuracy loss. Here’s what you need and how to start.

OSCAR 2-bit KV cache quantization is now open source, and it solves a problem that has been brutal for anyone serving long-context LLMs. The KV cache eats GPU memory alive at scale. INT2 compression was the obvious fix. But until now, pushing to 2 bits destroyed accuracy or broke compatibility with standard serving stacks like paged attention. Together AI just changed that.

What Broke 2-Bit KV Caches

The Outlier Problem

KV activations have channel-wise outliers. A tiny handful of channels hold enormous values. Most channels are tame. When you quantize to INT2, you get four representable levels total. Those rare spikes hog the scale factor. Normal values collapse into one or two effective buckets. Attention quality tanks. Hard.

Naive INT2 scores 0.00 on both Qwen3-4B and Qwen3-8B. Zero. Not close to usable.

Why Rotation Alone Wasn't Enough

Rotation-based quantization, usually with a Hadamard transform, redistributes outlier energy across channels. It works at INT4. At INT2, the rotation is data-oblivious. It smooths ranges but has no clue which directions the attention mechanism actually reads. Spreading error evenly is not the same as burying it where it won't matter.

The rotation applied before quantization should be derived from attention statistics themselves, not from the raw distribution of KV activations.

That is the core insight behind OSCAR. And it flips the entire approach on its head.

The Numbers That Actually Matter

OSCAR 2-bit KV cache quantization delivers 2.28 bits per KV element. The accuracy cost across five benchmarks at 32K generation length tells the real story.

Qwen3-4B-Thinking: down 3.78 points from BF16
Qwen3-8B: down 1.42 points
Qwen3-32B: down 0.02 points
GLM-4.7-FP8 (358B parameters): up 0.27 points. Yes, slightly better.

Compare that to the alternatives. QuaRot-INT2, which uses Hadamard-only rotation, scores 1.40 on Qwen3-4B and 10.14 on Qwen3-8B. OSCAR at the same bit width hits 71.86 on Qwen3-4B. Saw-INT4 at nearly double the bit width reaches 73.11. OSCAR 2-bit KV cache quantization is within spitting distance of INT4 methods while slashing memory by roughly 8x.

On AIME25 with Qwen3-8B, OSCAR at 2.38 bits per element scores 66.67. KIVI-KV2 at 2.26 bits scores 57.67. Kitty at 2.39 bits scores 59.67. Those channel-wise methods also require custom page layouts that break standard paged-attention serving. OSCAR does not.

How to Run OSCAR Right Now

What You Need

NVIDIA H100 GPU (80 GB) recommended. A100 works for smaller models.
SGLang installed from source. OSCAR is built into the framework.
Triton for the custom fused kernels. Ships with recent PyTorch installs.
A supported model: Qwen3-4B, Qwen3-8B, Qwen3-32B, GLM-4.7-FP8, or MiniMax-M2.7.

Step-by-Step Setup

Pre-computed rotation matrices live in RotationZoo on ModelScope. No recalibration needed for supported models. Download them once. Launch the SGLang server with INT2 KV mode enabled. Pass the rotation path. That is it.

The server exposes a standard OpenAI-compatible API. Prefix caching works. SGLang's radix cache functions normally. No client-side changes.

Key configuration knobs are simple. Sink tokens default to 64, kept in BF16 as attention sinks. Recent tokens default to 256, also BF16. Everything in between gets compressed to INT2. At 128K context, those BF16 windows represent just 0.24% of total tokens. The paper identifies these defaults as the accuracy-efficiency sweet spot.

For custom models, run a one-time calibration pass. OSCAR dumps Q, K, V activations from a small dataset, estimates attention-aware covariance, and writes out rotation matrices and clip thresholds. Typical clip values land around 0.96 for keys and 0.92 for values. Calibration is not task-specific. Run it once, reuse everywhere.

What This Means For Your Workload

OSCAR 2-bit KV cache quantization changes the math on long-context serving. At 100K context with batch size 32, job-level throughput reaches 6.17x over BF16 on Qwen3-4B-Thinking and 7.83x on GLM-4.7-FP8. Decode speed improves up to 3x at 100K context for single requests. The speedup grows with context length because decoding becomes increasingly KV-bandwidth-bound. Cutting KV memory by 8x directly attacks that bottleneck.

OSCAR 2-Bit KV Cache Quantization: Hands-On

Long-context robustness holds up. On GLM-4.7-FP8 tested with RULER-NIAH, OSCAR matches the BF16 curve through 128K tokens. No degradation cliff. No surprises.

The Verdict

OSCAR 2-bit KV cache quantization is not academic vaporware. It ships inside SGLang today. It works with paged attention. It preserves prefix caching. The accuracy gap versus BF16 shrinks to near zero on larger models. If you are serving long-context LLMs and watching GPU memory evaporate into KV cache overhead, this is the most practical compression advance to land in months.

Source: Marktechpost, covering the OSCAR paper (arXiv:2605.17757) and Together AI's open-source release.

Frequently Asked Questions

What is OSCAR 2-bit KV cache quantization?

It's a technique that compresses the key-value cache in transformer models to 2-bit precision, reducing memory usage while maintaining accuracy.

How does OSCAR differ from standard quantization methods?

OSCAR uses a specialized non-uniform quantization scheme and adaptive scaling to better preserve model performance compared to uniform 2-bit methods.

What are the practical benefits of using OSCAR?

It significantly reduces GPU memory consumption for long-context inference, enabling larger batch sizes or longer sequences on the same hardware.

Does OSCAR work with any transformer model?

It is designed for autoregressive models like LLaMA and GPT, but may require slight adjustments for architectures with different attention mechanisms.

What is the main trade-off when using OSCAR?

There is a minor accuracy drop (typically <1% on perplexity) compared to full-precision KV cache, but memory savings of up to 16x.

Written by

Marcus Thorne

Senior AI Reporter

Marcus Thorne covers the fast-moving field of artificial intelligence, with a particular interest in large language models, automation and the companies driving the technology forward. He aims to cut through the hype and explain what these systems can and cannot do.

Share:𝕏 Facebook WhatsApp LinkedIn