Model Inference Optimization

Model inference optimization, measured in cost per million tokens.

NavyaAI tunes production model serving: quantization, KV-cache management, continuous batching, serving framework fit, and GPU right-sizing. The published result of this playbook: 42% lower cost per million tokens and 2.3x throughput on Llama 3 70B, without quality loss you would notice in production.

$0.82 → $0.47

cost per million tokens

Llama 3 70B before and after INT8 quantization and KV-cache pruning in our published benchmark. See the data

2.3x

throughput gain

Combined optimization more than doubled tokens per second on the same hardware class. See the data

1.3

MMLU points — the total quality cost

The full optimization stack's measured accuracy loss: imperceptible for most production workloads. See the data

The default config is a tax

FP16 weights, default KV-cache, and no batching strategy mean every token costs more than it should — on hardware already paid for.

Latency targets drive overprovisioning

Teams buy GPUs to hit p95 targets that serving-stack tuning could meet on existing capacity.

Quality fears block easy savings

INT8 quantization costs ~1 MMLU point in our benchmarks — imperceptible for most workloads, yet teams skip it for lack of an eval harness.

Deep Dive

The token tax: paying for the default config

Most production LLM deployments run close to default settings: FP16 weights, default KV-cache, conservative batching. Each default is individually reasonable and collectively expensive — the case study that produced our benchmark started as 4 GPUs at ~30% utilization serving a workload that, optimized, fit on 2 with headroom.
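To make the memory side of those defaults concrete, here is a rough sizing sketch. It assumes a nominal 70e9 parameters and the publicly documented Llama 3 70B attention layout (80 layers, 8 KV heads under grouped-query attention, head dimension 128); those figures are assumptions brought in for illustration, not values from the benchmark. It ignores activations, framework overhead, and quantization metadata, so treat the output as an order-of-magnitude estimate.

```python
# Rough memory sizing for a 70B-parameter model. Illustrative assumptions only.
PARAMS = 70e9  # nominal parameter count, rounded


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


fp16_weights = weight_gb(PARAMS, 2)  # 2 bytes per FP16 weight
int8_weights = weight_gb(PARAMS, 1)  # 1 byte per INT8 weight

# Assumed Llama 3 70B layout: 80 layers, 8 KV heads, head_dim 128, FP16 cache.
kv_tok = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)

print(f"FP16 weights: {fp16_weights:.0f} GB, INT8 weights: {int8_weights:.0f} GB")
print(f"KV cache per token: {kv_tok / 1024:.0f} KiB")
print(f"Tokens per GB of cache: {int(1e9 // kv_tok)}")
```

Halving the bytes per weight halves weight memory, but the cache and activations do not shrink with it, which is one plausible reason a measured VRAM reduction lands below 50%.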

We published the full methodology in the Token Tax benchmark: INT8 (GPTQ) quantization alone delivered 1.8x throughput and 45% VRAM reduction; KV-cache pruning with Heavy Hitter Oracle at a 50% budget freed another ~30% of cache memory for batching. Combined, cost per million tokens fell from $0.82 to $0.47 — a 42% cut on identical traffic.
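The arithmetic behind those figures can be checked in a few lines. The sketch below recomputes the relative cost cut and the incremental gain KV-cache pruning adds on top of INT8, then shows how cost per million tokens follows from GPU price and sustained throughput. The GPU hourly rate and token rate are hypothetical placeholders, not benchmark values.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


def cost_per_million_tokens(gpu_hourly_usd: float, num_gpus: int,
                            tokens_per_second: float) -> float:
    """Serving cost per million tokens for a fleet at a sustained token rate."""
    tokens_per_hour = tokens_per_second * 3600.0
    return gpu_hourly_usd * num_gpus / tokens_per_hour * 1_000_000


# Figures quoted in the text above.
cut = percent_reduction(0.82, 0.47)  # roughly 42.7, in line with the 42% headline
kv_pruning_gain = 2.3 / 1.8          # roughly 1.28x on top of INT8 alone

# Hypothetical fleet: $2.00 per GPU-hour, 4 GPUs, 2,000 tokens/s in aggregate.
baseline = cost_per_million_tokens(2.00, 4, 2000)
# Same spend at double the token rate: cost per token halves.
doubled = cost_per_million_tokens(2.00, 4, 4000)

print(f"cost cut: {cut:.1f}%  incremental KV gain: {kv_pruning_gain:.2f}x")
print(f"baseline ${baseline:.2f}/M tokens, doubled throughput ${doubled:.2f}/M tokens")
```

Cost per token is inversely proportional to sustained throughput at fixed spend, which is why throughput levers and GPU consolidation show up in the same metric.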

The quality cost was measured, not assumed: 1.3 MMLU points total. That number is the reason optimization is safe to ship — every change runs against an eval harness before production, so the cost-quality trade is explicit.
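One way to make that trade explicit is a gate that compares a candidate configuration's eval score against the baseline and blocks the rollout if the drop exceeds an agreed tolerance. This is a minimal sketch, not the harness used in the benchmark: `EvalResult`, `passes_gate`, and the 1.5-point tolerance are hypothetical names and values chosen for illustration. The scores are the MMLU figures quoted on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    name: str
    score: float  # accuracy in percent, e.g. MMLU


def passes_gate(baseline: EvalResult, candidate: EvalResult,
                max_drop_points: float) -> bool:
    """True if the candidate loses no more than `max_drop_points` vs. baseline."""
    drop = baseline.score - candidate.score
    return drop <= max_drop_points


baseline = EvalResult("FP16 baseline", 79.2)
int8_only = EvalResult("INT8 (GPTQ)", 78.1)
combined = EvalResult("INT8 + KV-cache pruning", 77.9)

# Hypothetical tolerance: ship only if the total drop stays within 1.5 points.
for candidate in (int8_only, combined):
    verdict = "ship" if passes_gate(baseline, candidate, 1.5) else "block"
    print(f"{candidate.name}: {baseline.score - candidate.score:.1f} points lost -> {verdict}")
```

A single benchmark score is a coarse proxy; a production gate would more likely run task-specific evals drawn from the workload itself, with the tolerance set per task.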

When optimization beats procurement

The decision this service replaces is buying more capacity. Before a GPU order or an API tier upgrade, the audit checks whether precision, caching, and batching can serve the same traffic on existing hardware — in our experience the answer is yes more often than procurement plans assume.

The same numbers feed the API-versus-private decision: optimized self-hosting at $0.47 per million tokens against $1.80-$2.50 for frontier APIs is the spread that makes break-even math interesting at sustained volume. Unoptimized self-hosting at $0.82 costs about 1.7x as much per token and narrows that spread from at least $1.33 to at least $0.98 per million tokens, which is why optimization comes first, whatever the infrastructure direction.
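A simple break-even model shows how the per-token spread translates into volume. The sketch assumes the self-hosted figure behaves as a per-token serving cost and that a separate fixed monthly overhead (engineering, on-call, tooling) must be recovered from the spread. The $10,000 overhead is a hypothetical placeholder, and `breakeven_million_tokens` is an illustrative helper, not part of any published methodology.

```python
def breakeven_million_tokens(fixed_monthly_usd: float,
                             api_price_per_m: float,
                             self_price_per_m: float) -> float:
    """Monthly volume (in millions of tokens) at which self-hosting matches the API."""
    spread = api_price_per_m - self_price_per_m
    if spread <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return fixed_monthly_usd / spread


FIXED_OVERHEAD = 10_000.0  # hypothetical fixed monthly cost of running the stack
API_PRICE = 1.80           # low end of the frontier API range quoted above

optimized = breakeven_million_tokens(FIXED_OVERHEAD, API_PRICE, 0.47)
unoptimized = breakeven_million_tokens(FIXED_OVERHEAD, API_PRICE, 0.82)

print(f"optimized:   {optimized:,.0f}M tokens/month to break even")
print(f"unoptimized: {unoptimized:,.0f}M tokens/month to break even")
```

Under these assumptions the unoptimized stack needs noticeably more monthly volume before self-hosting pays back, and the gap widens as the API price falls toward the self-hosted cost.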

Audit Focus

What we inspect before prescribing a platform change.

The first pass is designed to identify the smallest useful intervention: routing, caching, prompt control, serving tuning, or a deeper break-even audit.

Serving precision: FP16 vs INT8/FP8 quantization fit
KV-cache budget, pruning, and batch headroom
Continuous batching and serving framework (vLLM, TensorRT-LLM) fit
GPU utilization, sizing, and cost per million tokens
Eval harness so optimization never ships a quality regression

Inference optimization lever map (see the full map)

Measured levers from our Token Tax benchmark on Llama 3 70B — what each one buys and what it costs.

| Lever | Measured gain | Measured cost |
| --- | --- | --- |
| INT8 quantization (GPTQ) | 1.8x throughput, 45% less VRAM | ~1.1 MMLU points |
| KV-cache pruning (H2O, 50%) | ~30% cache memory freed for batching | ~0.2 additional MMLU points |
| Combined optimization | 2.3x throughput, $0.82 → $0.47 per M tokens | 1.3 MMLU points total |
| GPU right-sizing after optimization | 4 GPUs → 2 for the same workload | Engineering time, no quality cost |
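
For readers unfamiliar with the KV-cache pruning lever, the idea behind Heavy Hitter Oracle (H2O) is to keep the most recent tokens plus the older tokens that have accumulated the most attention, and evict the rest. The sketch below is a deliberately simplified, single-head illustration in pure Python. The function name, the score values, and the split between recent and heavy-hitter slots are assumptions for illustration, not the configuration used in the benchmark.

```python
from typing import List, Sequence


def h2o_keep_indices(acc_scores: Sequence[float], budget: int,
                     recent: int) -> List[int]:
    """Pick which cached token positions to keep under a fixed budget.

    acc_scores: accumulated attention each cached token has received so far.
    budget:     total number of positions to keep.
    recent:     how many of the newest positions are always kept.
    """
    n = len(acc_scores)
    if budget >= n:
        return list(range(n))
    recent = min(recent, budget)
    recent_idx = list(range(n - recent, n))
    older = range(n - recent)
    # Heavy hitters: older positions with the largest accumulated attention.
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[: budget - recent]
    return sorted(heavy + recent_idx)


# Hypothetical accumulated attention for 8 cached tokens; keep 50% of them.
scores = [0.9, 0.1, 0.05, 0.7, 0.2, 0.1, 0.3, 0.15]
kept = h2o_keep_indices(scores, budget=4, recent=2)
print(kept)  # [0, 3, 6, 7]
```

A real implementation applies this per attention head and per layer during decoding and updates the accumulated scores at every step; the eviction rule is the part shown here.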

Worked Example

Llama 3 70B optimization, step by step

The measured progression from our published benchmark — each step, its gain, and its quality cost.

| Step | Throughput effect | Quality effect (MMLU) |
| --- | --- | --- |
| Baseline: FP16, default cache, vLLM defaults | 1.0x, $0.82/M tokens | 79.2% (reference) |
| + INT8 quantization (GPTQ) | 1.8x, 45% less VRAM | 78.1% (-1.1 points) |
| + KV-cache pruning (H2O, 50% budget) | 2.3x combined, $0.47/M tokens | 77.9% (-1.3 total) |
| GPU consolidation enabled by the above | Same traffic on 2 GPUs instead of 4 | No additional cost |

Full configs, hardware details, and methodology in the Token Tax post; engagement context in the Llama 3 70B case study.

How It Works

How the audit works

  1. Share spend and workload shape

    Submit monthly spend range, provider mix, token volume, and what your serving stack actually does: chat, RAG, agents, extraction, or batch jobs. It takes minutes and needs no production access.

  2. Map cost to workflows

    We break the invoice into cost per completed user action: prompt and output tokens, retries, retrieval, tool calls, and orchestration overhead. This is the 72% of AI cost that hides outside the model call.

  3. Rank levers by friction and savings

    Each leak gets a lever (routing, caching, prompt compression, retry control, quantization, or a private break-even case), ranked by expected savings against implementation effort.

  4. Get the written read, then decide

    You receive the first cost-leak read in writing. If the economics justify deeper work, the next step is a scoped engagement; if not, you keep the findings.
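
The mapping and ranking steps above can be sketched as a small calculation. Everything in this block is hypothetical: the per-action cost components, the lever savings and effort figures, and the savings-per-effort-day ranking rule are illustrative assumptions, not audit outputs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


def cost_per_action(components: Dict[str, float]) -> Tuple[float, float]:
    """Return (total cost per user action, share spent outside token costs)."""
    total = sum(components.values())
    token_cost = components.get("prompt_tokens", 0.0) + components.get("output_tokens", 0.0)
    return total, (total - token_cost) / total


@dataclass(frozen=True)
class Lever:
    name: str
    monthly_savings_usd: float  # expected savings (hypothetical)
    effort_days: float          # implementation effort (hypothetical)


def rank_levers(levers: List[Lever]) -> List[Lever]:
    """Order levers by expected monthly savings per day of effort, best first."""
    return sorted(levers, key=lambda lv: lv.monthly_savings_usd / lv.effort_days,
                  reverse=True)


# Hypothetical cost of one completed user action, in USD.
action = {
    "prompt_tokens": 0.003, "output_tokens": 0.001, "retries": 0.001,
    "retrieval": 0.002, "tool_calls": 0.002, "orchestration": 0.001,
}
total, outside_share = cost_per_action(action)
print(f"${total:.3f} per action, {outside_share:.0%} outside token costs")

ranked = rank_levers([
    Lever("routing", 4000, 5),
    Lever("caching", 3000, 2),
    Lever("quantization", 6000, 10),
])
print([lv.name for lv in ranked])  # ['caching', 'routing', 'quantization']
```

Ranking by savings per effort day is one reasonable choice; a team with a hard budget deadline might instead rank by absolute savings within a fixed time window.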

Start with spend, provider, and workload shape.

The audit form routes teams below $20K/month toward self-serve estimators and routes qualified spend into follow-up.

Request Free Audit

$47K → $28K

Case study: a Llama 3 70B production workload moved from 4 GPUs to 2 with INT8 quantization, KV-cache pruning, and serving changes — a 42% monthly cost cut with 2.3x throughput.

Read the full audit

FAQ

Common questions

What is model inference optimization?

Model inference optimization improves the speed, memory use, and cost of serving AI models in production — through quantization, KV-cache management, batching, serving framework choice, and GPU right-sizing — while holding output quality inside a measured tolerance.

How much does inference optimization reduce costs?

Our published Llama 3 70B benchmark cut cost per million tokens 42%, from $0.82 to $0.47, and lifted throughput 2.3x on half the GPUs. Gains depend on how untuned the starting stack is — default-config deployments have the most headroom.

Does quantization hurt model quality?

Measurably but rarely meaningfully: INT8 GPTQ cost ~1.1 MMLU points in our benchmark, and adding KV-cache pruning brought the total to 1.3 points. For most summarization, RAG, code, and chat workloads that loss is imperceptible, and we gate every change behind an eval harness.

Which serving framework is best for LLM inference?

It depends on model, traffic shape, and latency floor. vLLM excels at continuous batching for variable traffic; TensorRT-LLM wins on dense, latency-critical serving. The audit benchmarks your workload on the candidates rather than picking by reputation.
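
Benchmarking on your own workload can start with a framework-agnostic harness that records throughput and p95 latency for any serving backend. The sketch below is a minimal, sequential version: `generate` is a hypothetical callable you would wrap around a vLLM or TensorRT-LLM client, and it is stubbed here so the example runs on its own. Because it sends one request at a time, it does not exercise continuous batching; a realistic comparison needs concurrent load shaped like production traffic.

```python
import time
from typing import Callable, Dict, List, Sequence


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def benchmark(generate: Callable[[str], int],
              prompts: Sequence[str]) -> Dict[str, float]:
    """Run prompts one at a time; `generate` returns the number of tokens produced."""
    latencies: List[float] = []
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        total_tokens += generate(prompt)
        latencies.append(time.perf_counter() - t0)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return {
        "tokens_per_second": total_tokens / elapsed,
        "p95_latency_s": p95(latencies),
    }


def stub_generate(prompt: str) -> int:
    """Stand-in backend: pretends to emit one token per input word."""
    time.sleep(0.001)
    return len(prompt.split())


result = benchmark(stub_generate, ["summarize this ticket", "extract the invoice total"] * 10)
print(result)
```

Running the same prompt set through each candidate backend, with identical hardware and quality gates, gives numbers that can be compared directly.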

Do I need new hardware to optimize inference?

Usually the opposite: optimization typically frees existing hardware. The case-study workload went from 4 GPUs to 2 while serving 2.3x the throughput. Buying capacity is the last resort after precision, caching, and batching are tuned.