The default config is a tax
FP16 weights, default KV-cache, and no batching strategy mean every token costs more than it should — on hardware already paid for.
NavyaAI tunes production model serving: quantization, KV-cache management, continuous batching, serving framework fit, and GPU right-sizing. The published result of this playbook: 42% lower cost per million tokens and 2.3x throughput on Llama 3 70B, without quality loss you would notice in production.
$0.82 → $0.47: cost per million tokens
2.3x: throughput gain
1.3 MMLU points: the total quality cost
Teams buy GPUs to hit p95 targets that serving-stack tuning could meet on existing capacity.
INT8 quantization cost ~1.1 MMLU points in our benchmark, a loss that is imperceptible for most workloads, yet teams skip it for lack of an eval harness.
Deep Dive
Most production LLM deployments run close to default settings: FP16 weights, default KV-cache, conservative batching. Each default is individually reasonable and collectively expensive — the case study that produced our benchmark started as 4 GPUs at ~30% utilization serving a workload that, optimized, fit on 2 with headroom.
We published the full methodology in the Token Tax benchmark: INT8 (GPTQ) quantization alone delivered 1.8x throughput and 45% VRAM reduction; KV-cache pruning with Heavy Hitter Oracle at a 50% budget freed another ~30% of cache memory for batching. Combined, cost per million tokens fell from $0.82 to $0.47 — a 42% cut on identical traffic.
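The idea behind Heavy Hitter Oracle (H2O) pruning can be sketched in a few lines: keep the most recent tokens plus the older tokens that have accumulated the most attention, and evict the rest. This is a simplified, one-shot illustration in plain Python, not the implementation used in the benchmark. The function name `h2o_keep_indices`, the even split between recent and heavy-hitter slots, and the sample scores are all assumptions made for illustration.

```python
def h2o_keep_indices(acc_attention, budget_ratio=0.5, recent_ratio=0.5):
    """Return the sorted cache positions to keep under an H2O-style budget.

    acc_attention: accumulated attention score per cached token (list of floats).
    budget_ratio:  fraction of the cache to keep (0.5 mirrors the 50% budget).
    recent_ratio:  share of the budget reserved for the most recent tokens
                   (hypothetical default, not taken from the benchmark).
    """
    n = len(acc_attention)
    budget = int(n * budget_ratio)
    n_recent = int(budget * recent_ratio)
    n_heavy = budget - n_recent

    # Always keep the most recent tokens.
    recent = set(range(n - n_recent, n))

    # Among older tokens, keep those with the highest accumulated attention.
    older = [i for i in range(n) if i not in recent]
    heavy = sorted(older, key=lambda i: acc_attention[i], reverse=True)[:n_heavy]

    return sorted(recent | set(heavy))


# Illustrative scores for an 8-token cache (made-up values).
scores = [0.9, 0.1, 0.8, 0.05, 0.2, 0.3, 0.1, 0.4]
print(h2o_keep_indices(scores))  # [0, 2, 6, 7]
```

Half of the cache entries are dropped here, which is the memory that becomes available for larger batches. A production implementation applies this eviction incrementally during decoding and per attention head.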
The quality cost was measured, not assumed: 1.3 MMLU points total. That number is the reason optimization is safe to ship — every change runs against an eval harness before production, so the cost-quality trade is explicit.
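A quality gate of this kind can be sketched as a simple threshold check. The sketch below is an assumption about how such a gate might look, not the harness itself: the function names and the 1.5-point tolerance are hypothetical, while the MMLU scores are the ones reported in the worked example.

```python
def quality_drop(baseline_pct, candidate_pct):
    """Quality loss in points, rounded to avoid floating-point noise."""
    return round(baseline_pct - candidate_pct, 2)


def passes_gate(baseline_pct, candidate_pct, max_drop_points):
    """True if the candidate config stays within the allowed quality drop."""
    return quality_drop(baseline_pct, candidate_pct) <= max_drop_points


BASELINE_MMLU = 79.2   # FP16 baseline from the worked example
TOLERANCE = 1.5        # hypothetical tolerance, set per workload

for name, score in [("INT8 (GPTQ)", 78.1), ("INT8 + H2O 50%", 77.9)]:
    drop = quality_drop(BASELINE_MMLU, score)
    verdict = "ship" if passes_gate(BASELINE_MMLU, score, TOLERANCE) else "block"
    print(f"{name}: -{drop} points -> {verdict}")
```

The tolerance is the explicit part of the trade: a stricter workload would set it lower and block a change that a chat workload would accept.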
The decision this service replaces is buying more capacity. Before a GPU order or an API tier upgrade, the audit checks whether precision, caching, and batching can serve the same traffic on existing hardware — in our experience the answer is yes more often than procurement plans assume.
The same numbers feed the API-versus-private decision: optimized self-hosting at $0.47 per million tokens against $1.80-$2.50 for frontier APIs is the spread that makes break-even math interesting at sustained volume. Unoptimized self-hosting at $0.82 costs roughly 1.7x as much per token and narrows that spread, which is why optimization comes first, whatever the infrastructure direction.
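The break-even math can be sketched as follows. The per-million-token prices come from the figures above; the fixed monthly overhead of self-hosting (engineering, operations) is a hypothetical placeholder, and the function name is illustrative.

```python
def break_even_volume_m_tokens(fixed_monthly_cost, api_price_per_m, self_hosted_price_per_m):
    """Monthly volume (in millions of tokens) at which self-hosting matches the API bill.

    Returns None when self-hosting is not cheaper per token, since no volume breaks even.
    """
    spread = api_price_per_m - self_hosted_price_per_m
    if spread <= 0:
        return None
    return fixed_monthly_cost / spread


FIXED_OVERHEAD = 10_000.0  # hypothetical monthly overhead in USD
API_PRICE = 1.80           # low end of the frontier API range quoted above

optimized = break_even_volume_m_tokens(FIXED_OVERHEAD, API_PRICE, 0.47)
unoptimized = break_even_volume_m_tokens(FIXED_OVERHEAD, API_PRICE, 0.82)

print(f"optimized break-even:   {optimized:,.0f}M tokens/month")
print(f"unoptimized break-even: {unoptimized:,.0f}M tokens/month")
```

Whatever overhead figure is plugged in, the unoptimized stack needs more sustained volume to break even because its per-token spread against the API price is smaller.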
Audit Focus
The first pass is designed to identify the smallest useful intervention: routing, caching, prompt control, serving tuning, or a deeper break-even audit.
Measured levers from our Token Tax benchmark on Llama 3 70B — what each one buys and what it costs.
| Lever | Measured gain | Measured cost |
|---|---|---|
| INT8 quantization (GPTQ) | 1.8x throughput, 45% less VRAM | ~1.1 MMLU points |
| KV-cache pruning (H2O, 50%) | ~30% cache memory freed for batching | ~0.2 additional MMLU points |
| Combined optimization | 2.3x throughput, $0.82 → $0.47 per M tokens | 1.3 MMLU points total |
| GPU right-sizing after optimization | 4 GPUs → 2 for the same workload | Engineering time, no quality cost |
Worked Example
The measured progression from our published benchmark — each step, its gain, and its quality cost.
| Step | Throughput and cost effect | Quality effect (MMLU) |
|---|---|---|
| Baseline: FP16, default cache, vLLM defaults | 1.0x — $0.82/M tokens | 79.2% (reference) |
| + INT8 quantization (GPTQ) | 1.8x, 45% less VRAM | 78.1% (-1.1 points) |
| + KV-cache pruning (H2O, 50% budget) | 2.3x combined — $0.47/M tokens | 77.9% (-1.3 total) |
| GPU consolidation enabled by the above | Same traffic on 2 GPUs instead of 4 | No additional quality cost |
Full configs, hardware details, and methodology are in the Token Tax post; engagement context is in the Llama 3 70B case study.
How It Works
Step 1
Submit monthly spend range, provider mix, token volume, and what your serving stack actually does: chat, RAG, agents, extraction, or batch jobs. Takes minutes, no production access needed.
Step 2
We break the invoice into cost per completed user action: prompt and output tokens, retries, retrieval, tool calls, and orchestration overhead — the 72% of AI cost that hides outside the model call.
Step 3
Each leak gets a lever — routing, caching, prompt compression, retry control, quantization, or a private break-even case — ranked by expected savings against implementation effort.
Step 4
You receive the first cost-leak read in writing. If the economics justify deeper work, the next step is a scoped engagement; if not, you keep the findings.
The audit form routes teams spending below $20K/month toward self-serve estimators and routes teams with qualifying spend into a follow-up.
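The ranking in Step 3 can be sketched as a sort on savings per unit of effort. This is a minimal illustration under stated assumptions: the `Lever` type, the savings-per-day score, and every number in the sample list are hypothetical, not audit output.

```python
from dataclasses import dataclass


@dataclass
class Lever:
    name: str
    expected_monthly_savings: float  # USD, hypothetical estimate
    effort_days: float               # implementation effort, must be > 0


def rank_levers(levers):
    """Order levers by expected monthly savings per day of implementation effort."""
    return sorted(
        levers,
        key=lambda lever: lever.expected_monthly_savings / lever.effort_days,
        reverse=True,
    )


# Made-up candidates for illustration only.
candidates = [
    Lever("quantization", 8000.0, 10.0),
    Lever("caching", 3000.0, 2.0),
    Lever("retry control", 1000.0, 1.0),
]

for lever in rank_levers(candidates):
    print(lever.name)
```

A ratio like this favors the smallest useful intervention first; a large-savings lever with heavy effort still appears in the list, just lower down.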
$47K → $28K
Case study: a Llama 3 70B production workload moved from 4 GPUs to 2 with INT8 quantization, KV-cache pruning, and serving changes — a 42% monthly cost cut with 2.3x throughput.
Read the full audit
FAQ
Model inference optimization improves the speed, memory use, and cost of serving AI models in production — through quantization, KV-cache management, batching, serving framework choice, and GPU right-sizing — while holding output quality inside a measured tolerance.
Our published Llama 3 70B benchmark cut cost per million tokens 42%, from $0.82 to $0.47, and lifted throughput 2.3x on half the GPUs. Gains depend on how untuned the starting stack is — default-config deployments have the most headroom.
Measurably but rarely meaningfully: INT8 GPTQ cost ~1.1 MMLU points in our benchmark, and adding KV-cache pruning brought the total to 1.3 points. For summarization, RAG, code, and chat workloads that loss is imperceptible — and we gate every change behind an eval harness.
It depends on model, traffic shape, and latency floor. vLLM excels at continuous batching for variable traffic; TensorRT-LLM wins on dense, latency-critical serving. The audit benchmarks your workload on the candidates rather than picking by reputation.
Usually the opposite: optimization typically frees existing hardware. The case-study workload went from 4 GPUs to 2 while serving 2.3x the throughput. Buying capacity is the last resort after precision, caching, and batching are tuned.