Self-Hosted LLM Deployment

Last reviewed June 10, 2026

Self-hosted LLM deployment with break-even math first.

NavyaAI helps teams decide whether private LLM serving should exist, then designs the deployment path: model choice, quantization, vLLM or TensorRT-LLM serving, GPU sizing, latency targets, security boundaries, and operating cost.

Request Self-Hosted LLM Audit | Run On-Prem Estimator

$0.47

per million tokens, optimized 70B

Llama 3 70B with INT8 quantization and KV-cache pruning on a single H100, versus $2.50 for GPT-4o and $1.80 for Claude Sonnet via API at testing time.

4 → 2

GPUs for the same workload

INT8 cut weight memory ~45%, letting the case-study workload serve more traffic on half the hardware.

2.3x

throughput after optimization

Quantization plus KV-cache pruning more than doubled tokens per second, the gain that makes break-even math work.

Self-hosting is not automatically cheaper

GPU cost, utilization, maintenance, reliability, and engineering time decide whether private serving wins.

Latency and privacy shape the architecture

A private model can solve data concerns but still fail if throughput and response targets are wrong.

GPU sizing happens too early

Teams often choose hardware before measuring token volume, batchability, and model fit.

Deep Dive

Quantization economics: why the default config is the expensive one

The largest self-hosting cost decision is made before any hardware is bought: serving precision. In our Token Tax benchmark, Llama 3 70B in default FP16 cost $0.82 per million tokens; INT8 (GPTQ) quantization cut weight memory ~45% and lifted throughput 1.8x, and adding KV-cache pruning (H2O at 50% budget) freed another ~30% of cache memory for batching. Combined: 2.3x throughput and $0.47 per million tokens on the same hardware class — at a quality cost of 1.3 MMLU points, imperceptible in most production workloads.
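The link between throughput and cost per token can be sketched with a simple model, assuming cost is just GPU rental divided by sustained output. The hourly rate and token rates below are hypothetical placeholders, not the Token Tax benchmark's configuration, so the outputs will not match the published $0.82 and $0.47 figures; the function name is also illustrative.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, gpu_count: int,
                            tokens_per_second: float) -> float:
    """USD per one million tokens at a sustained aggregate throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd * gpu_count / tokens_per_hour * 1_000_000


# Hypothetical inputs: same GPU and hourly rate, 2.3x the throughput.
baseline = cost_per_million_tokens(gpu_hourly_usd=3.0, gpu_count=1,
                                   tokens_per_second=1000.0)
optimized = cost_per_million_tokens(gpu_hourly_usd=3.0, gpu_count=1,
                                    tokens_per_second=2300.0)

print(f"baseline:  ${baseline:.2f} per million tokens")
print(f"optimized: ${optimized:.2f} per million tokens")
```

In this idealized model, cost falls in direct proportion to throughput. Measured results depend on utilization and traffic shape, which is why the benchmark figures are the ones to plan against.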

Teams that size GPU fleets for FP16 buy roughly twice the hardware the workload needs. The deployment plan we produce fixes precision, batching, and serving stack first, then sizes hardware to the optimized profile.

GPU sizing from workload shape, not model size

Model parameters set a VRAM floor, but the real sizing inputs are traffic-shaped: tokens per day, concurrency at peak, context length distribution, and the p95 latency the product actually requires. A latency-tolerant batch workload and an interactive assistant with the same token volume need different GPU counts and sometimes different GPU classes.
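A rough way to see why traffic shape matters as much as parameter count is to estimate weight memory and KV-cache memory separately. The sketch below assumes the commonly published Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128) and a hypothetical peak of 32 concurrent sequences at 4,096 tokens each; verify the shape against the model's own config before relying on it. It ignores activations, framework overhead, and fragmentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    params_billions: float
    layers: int
    kv_heads: int
    head_dim: int


def weight_gb(shape: ModelShape, bytes_per_param: float) -> float:
    """Weight memory in GB (1e9 bytes) at a given serving precision."""
    return shape.params_billions * bytes_per_param


def kv_cache_gb(shape: ModelShape, context_tokens: int,
                concurrent_sequences: int, bytes_per_value: float = 2.0) -> float:
    """KV-cache memory in GB: one key and one value vector per layer per token."""
    bytes_per_token = 2 * shape.layers * shape.kv_heads * shape.head_dim * bytes_per_value
    return bytes_per_token * context_tokens * concurrent_sequences / 1e9


# Assumed shape for Llama 3 70B; the workload numbers are hypothetical.
LLAMA3_70B = ModelShape(params_billions=70.0, layers=80, kv_heads=8, head_dim=128)

fp16_weights = weight_gb(LLAMA3_70B, bytes_per_param=2.0)
int8_weights = weight_gb(LLAMA3_70B, bytes_per_param=1.0)
peak_kv = kv_cache_gb(LLAMA3_70B, context_tokens=4096, concurrent_sequences=32)

print(f"FP16 weights: {fp16_weights:.0f} GB")
print(f"INT8 weights: {int8_weights:.0f} GB")
print(f"KV cache at peak: {peak_kv:.1f} GB")
```

The arithmetic gives an ideal 50% weight reduction for INT8; real quantized checkpoints keep some tensors at higher precision, so measured savings are smaller. The KV-cache term scales with concurrency and context length, not model size, which is the part a parameter-count estimate misses.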

This is also where the L40S-versus-H100 question gets answered with numbers: lower-cost GPUs win when the model and concurrency fit inside their profile, premium GPUs win when serving density drives cost per token down. The deciding metric is cost per accepted output at your latency floor — never raw benchmark speed.
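The selection rule above can be sketched as a filter followed by a minimum: discard any option that misses the latency target, then pick the lowest cost per token among the rest. Cost per token stands in here for cost per accepted output, and every number and name below is hypothetical rather than a measured L40S or H100 result.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ServingOption:
    name: str
    hourly_usd: float
    tokens_per_second: float
    p95_latency_ms: float

    @property
    def usd_per_token(self) -> float:
        return self.hourly_usd / (self.tokens_per_second * 3600)


def cheapest_within_latency(options: Sequence[ServingOption],
                            max_p95_ms: float) -> Optional[ServingOption]:
    """Return the lowest cost-per-token option that meets the p95 target, if any."""
    eligible = [o for o in options if o.p95_latency_ms <= max_p95_ms]
    if not eligible:
        return None
    return min(eligible, key=lambda o: o.usd_per_token)


# Hypothetical profiles for illustration only.
options = [
    ServingOption("lower-cost GPU", hourly_usd=1.0, tokens_per_second=600.0, p95_latency_ms=900.0),
    ServingOption("premium GPU", hourly_usd=3.0, tokens_per_second=1500.0, p95_latency_ms=350.0),
]

batch_choice = cheapest_within_latency(options, max_p95_ms=1000.0)
interactive_choice = cheapest_within_latency(options, max_p95_ms=500.0)
print(batch_choice.name, "/", interactive_choice.name)
```

With these placeholder numbers the same two options produce different answers for a latency-tolerant workload and an interactive one, which is the point of fixing the latency floor before comparing prices.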

From data-center GPUs to the edge

Self-hosting does not start at a $30K GPU. For private, local workloads — internal QA over your own documents, field deployments, regulated environments where data cannot leave the building — quantized small models on Jetson Orin Nano-class edge devices are a real deployment tier: hundreds of dollars of hardware, single-digit watts, and zero per-token fees.

We run this tier ourselves: a RAG-backed QA bot serving answers entirely on a Jetson Orin Nano — embeddings, retrieval, and generation on-device, nothing leaving it. It is the same break-even-first discipline applied at the smallest scale, and we published the fully measured benchmark — throughput, latency, power draw, and cost per million tokens, from 1 to 16 concurrent users — on our blog.

The audit treats edge as a first-class option in the GPU sizing decision: when a workload is local, private, and latency-tolerant, the answer to 'which GPU?' is sometimes 'a device that fits in your hand' — and when it is not, the math says so before hardware is bought.

The operations line most break-even spreadsheets omit

Self-hosting fails in practice when the plan covers hardware but not operations: monitoring, upgrades, incident response, capacity management, and the utilization loss of real traffic versus theoretical peak. These costs are why self-hosting below roughly 1M tokens per day rarely makes sense — the API premium is cheaper than the operational floor.

Our deployment engagements treat operations as a designed deliverable: serving observability, cost telemetry per workflow, and an upgrade path. The break-even math presented up front includes that line, so the decision survives contact with production.
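One way to put the operations line into break-even math is to treat it as a fixed monthly cost that the per-token price gap has to pay back. The sketch below assumes the self-hosted per-million figure already covers GPU cost and that operations are a separate fixed overhead; the $4,000 monthly operations figure is a hypothetical placeholder, while the per-million prices are the ones quoted on this page.

```python
from typing import Optional


def break_even_tokens_per_month(fixed_monthly_usd: float,
                                api_usd_per_million: float,
                                selfhost_usd_per_million: float) -> Optional[float]:
    """Monthly token volume at which the avoided API spend covers fixed overhead.

    Returns None when self-hosting has no per-token advantage.
    """
    spread = api_usd_per_million - selfhost_usd_per_million
    if spread <= 0:
        return None
    return fixed_monthly_usd / spread * 1_000_000


# Hypothetical operations overhead; per-million prices from the page.
volume = break_even_tokens_per_month(fixed_monthly_usd=4000.0,
                                     api_usd_per_million=1.80,
                                     selfhost_usd_per_million=0.47)
print(f"break-even: {volume / 1e9:.2f} billion tokens per month")
```

Raising the fixed overhead moves the break-even volume up in direct proportion, which is why leaving operations out of the spreadsheet makes self-hosting look viable at volumes where it is not.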

Audit Focus

What we inspect before prescribing a platform change.

The first pass is designed to identify the smallest useful intervention: routing, caching, prompt control, serving tuning, or a deeper break-even audit.

Self-host LLM vs API cost comparison

vLLM, TensorRT-LLM, and serving architecture fit

GPU sizing for latency, throughput, and VRAM — edge to HPC

Quantization and model-size right-sizing

Edge deployment fit: Jetson-class devices for local private workloads

Private AI security, monitoring, and operations plan

Private LLM deployment map

The decision starts with economics, then moves to serving architecture.

Decision	Risk	Audit question
Model size	Larger model than task requires	What quality floor is needed?
GPU class	VRAM or throughput mismatch	What batch and context shape is real?
Serving stack	Framework choice limits throughput	Does vLLM fit the workload?
Operations	Hidden maintenance and uptime cost	Who owns incidents?
Break-even	Private serving loses at low volume	What monthly tokens justify migration?

Worked Example

Llama 3 70B serving cost: default vs optimized

Measured results from our Token Tax benchmark — the same model and traffic, before and after optimization, against API alternatives.

Configuration	Hardware	Cost per million tokens
Llama 3 70B, FP16 default	4 GPUs (case-study baseline)	$0.82 — with ~30% utilization waste
Llama 3 70B, INT8 + KV-cache pruning	2 GPUs, 2.3x throughput	$0.47
GPT-4o via API	None (managed)	$2.50 at testing time
Claude Sonnet via API	None (managed)	$1.80 at testing time

Benchmark methodology, quality measurements (1.3 MMLU points total loss), and full configs published in the Token Tax post. API prices as measured at testing time; break-even depends on your volume and operations cost.
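The per-token gaps in the table can be expressed as percentage reductions with a few lines of arithmetic. This sketch uses only the prices listed above; the function and variable names are illustrative, and the result covers per-token price alone, not operations cost.

```python
def percent_reduction(reference_usd: float, new_usd: float) -> float:
    """Percentage by which new_usd undercuts reference_usd."""
    return (reference_usd - new_usd) / reference_usd * 100


OPTIMIZED = 0.47  # Llama 3 70B, INT8 + KV-cache pruning, USD per million tokens

references = {
    "Llama 3 70B, FP16 default": 0.82,
    "GPT-4o via API": 2.50,
    "Claude Sonnet via API": 1.80,
}

reductions = {name: round(percent_reduction(price, OPTIMIZED), 1)
              for name, price in references.items()}

for name, pct in reductions.items():
    print(f"{name}: {pct}% lower per million tokens")
```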

How It Works

How the audit works

Step 1
Share spend and workload shape
Submit monthly spend range, provider mix, token volume, and what your serving workloads actually do: chat, RAG, agents, extraction, or batch jobs. This takes minutes and requires no production access.
Step 2
Map cost to workflows
We break the invoice into cost per completed user action: prompt and output tokens, retries, retrieval, tool calls, and orchestration overhead — the 72% of AI cost that hides outside the model call.
Step 3
Rank levers by friction and savings
Each leak gets a lever — routing, caching, prompt compression, retry control, quantization, or a private break-even case — ranked by expected savings against implementation effort.
Step 4
Get the written read, then decide
You receive the first cost-leak read in writing. If the economics justify deeper work, the next step is a scoped engagement; if not, you keep the findings.
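The ranking in Step 3 could be sketched as a sort on expected savings per unit of effort. The lever names come from the list above, but the savings and effort figures are hypothetical, and a single ratio is only one possible scoring rule, not necessarily the one the audit applies.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Lever:
    name: str
    expected_monthly_savings_usd: float
    effort_weeks: float

    @property
    def savings_per_week_of_effort(self) -> float:
        return self.expected_monthly_savings_usd / self.effort_weeks


def rank_levers(levers: List[Lever]) -> List[Lever]:
    """Order levers from highest to lowest savings per week of effort."""
    return sorted(levers, key=lambda l: l.savings_per_week_of_effort, reverse=True)


# Hypothetical estimates for illustration only.
levers = [
    Lever("routing", expected_monthly_savings_usd=6000.0, effort_weeks=2.0),
    Lever("caching", expected_monthly_savings_usd=4000.0, effort_weeks=1.0),
    Lever("quantization", expected_monthly_savings_usd=12000.0, effort_weeks=6.0),
]

ranked_names = [lever.name for lever in rank_levers(levers)]
print(ranked_names)
```

In this example the lever with the largest absolute savings ranks last because it also carries the most implementation effort.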

Start with spend, provider, and workload shape.

The audit form routes teams spending below $20K/month toward self-serve estimators and routes teams with qualifying spend to a follow-up.

Request Free Audit

$47K → $28K

Case study: a Llama 3 70B production workload moved from 4 GPUs to 2 with INT8 quantization, KV-cache pruning, and serving changes — a 42% monthly cost cut with 2.3x throughput.

Read the full audit

FAQ

Common questions

When should a team self-host an LLM?

A team should evaluate self-hosting when traffic is predictable, token volume is high, privacy or data residency matters, and GPU utilization can beat managed API economics.

What is included in self-hosted LLM deployment?

A self-hosted LLM deployment usually includes model selection, quantization, serving stack setup, GPU sizing, monitoring, security controls, and cost/performance tuning.

Can NavyaAI deploy vLLM?

Yes. NavyaAI can evaluate and implement vLLM-style serving where it fits the model, latency target, batching pattern, and operational constraints.

How much does it cost to self-host an LLM?

Hardware ranges from $5-10K for a single-GPU 7B deployment to $30K+ per H100 for 70B-class models, plus power, hosting, and operations. Per-token economics are what matter: our benchmark measured optimized Llama 3 70B at roughly $0.47 per million tokens on one H100, versus $0.82 unoptimized and $1.80-$2.50 for frontier APIs.

How many GPUs does Llama 3 70B need?

In FP16, 70B weights need multiple high-memory GPUs; the case study workload ran on 4. With INT8 quantization cutting weight memory by roughly 45% and KV-cache pruning freeing batch headroom, the same workload served 2.3x the throughput on 2 GPUs. Quantization strategy, not just model size, sets the GPU count.

What is the break-even point for self-hosting vs OpenAI?

Break-even is the monthly volume where your avoided API bill exceeds the full private serving cost: GPU amortization, power, hosting, and the operations time most spreadsheets omit. Below roughly 1M tokens per day APIs usually win; at sustained higher volume the $0.47 vs $1.80-$2.50 per-million spread can pay back hardware in months. The audit runs this math on your traffic.

Can LLMs run on edge devices like the Jetson Orin Nano?

Yes. Quantized small models run on Jetson Orin Nano-class hardware with a full local RAG pipeline alongside them. NavyaAI operates an edge QA bot built exactly this way — documents, embeddings, retrieval, and generation all on the device, no data leaving it — and our published benchmark measured 88-158 tokens/sec aggregate at 16 concurrent users on the exact board.
