Commercial guide - Last reviewed 2026-06-10
GPU Requirements for Hosting LLMs: Edge to HPC
GPU and hardware requirements for hosting LLMs on-prem: VRAM by model size, edge devices like Jetson Orin Nano, L40S and H100 servers, and HPC clusters.
Direct answer for GPU requirements for hosting an LLM
The short answer
GPU requirements scale with model size and precision: quantized 3B-8B models run on edge devices like the Jetson Orin Nano or a single workstation GPU, 7B-13B production serving fits one L40S-class card, and 70B-class models need 80GB-class GPUs (two optimized H100s in our published benchmark). INT8 quantization roughly halves weight memory relative to FP16 and INT4 roughly quarters it, so serving precision is the first sizing decision, ahead of the GPU order.
- Size from the workload, not the model card: token volume, concurrency, context length, and p95 latency set the real requirement.
- Quantize before you buy: INT8 serving roughly halves weight VRAM relative to FP16 and INT4 roughly quarters it, which can halve the GPU count, as in our Llama 3 70B audit.
- Match the tier to the job: edge devices for local private QA and RAG, single GPUs for departmental serving, HPC clusters only for training or very high concurrency.
Comparison table
| Factor | Requirement | Sizing note |
|---|---|---|
| VRAM rule of thumb | FP16 needs roughly 2GB per billion parameters, plus KV-cache headroom that grows with context and concurrency. | INT8 roughly halves weight memory; INT4 quarters it — that is how 3B-8B models fit edge devices. |
| Edge tier (Jetson Orin Nano class) | Runs quantized 3B-8B models with local RAG — our own edge QA bot serves answers entirely on a Jetson Orin Nano. | Best for private, local, latency-tolerant assistants over your own documents; not for high concurrency. |
| Single-server tier (L40S / H100) | 7B-13B production serving fits one L40S; 70B-class needs 80GB-class cards — 2 optimized H100s in our benchmark. | The workhorse tier for departmental and product workloads with measured break-even vs APIs. |
| HPC / multi-node tier | Needed for training, fine-tuning at scale, or very high-concurrency frontier serving with tensor parallelism. | Interconnect (NVLink/InfiniBand), power, and cooling dominate the requirement — not just GPU count. |
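The VRAM rule of thumb in the table can be sketched as simple arithmetic. This is a minimal illustration, not a sizing tool: the function name `weight_memory_gb` and the sample model sizes are our own choices, and the result covers weights only, so KV-cache and runtime overhead still need to be added on top.

```python
# Bytes per parameter for common serving precisions (rule of thumb only).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Excludes KV cache, activations, and serving-stack overhead.
    """
    return params_billion * BYTES_PER_PARAM[precision]


# Illustrative model sizes taken from the tiers discussed above.
for size in (3, 8, 13, 70):
    row = ", ".join(
        f"{precision}: {weight_memory_gb(size, precision):.1f} GB"
        for precision in BYTES_PER_PARAM
    )
    print(f"{size}B -> {row}")
```

Real quantized checkpoints often come out slightly larger than this estimate, because some layers are typically kept at higher precision and quantization scales are stored alongside the weights.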
Worked example
Hardware tiers for self-hosted LLMs
- Quantized serving (INT8/INT4) assumed per tier
- 70B figures from our published Token Tax benchmark
- Edge tier: our live Jetson Orin Nano QA bot; full measured benchmark published
| Tier | What it runs | Cost picture |
|---|---|---|
| Edge: Jetson Orin Nano class | Quantized 3B-8B models + local RAG (embeddings, retrieval, generation on-device) | Hundreds of dollars of hardware and single-digit watts — private local QA at near-zero marginal cost |
| Workstation / single GPU (L40S class) | 7B-13B production serving, embedding and rerank fleets | ~$8-10K per card; best capex per token for small-model serving |
| Server: 2x H100 class | 70B-class serving — Llama 3 70B at INT8 + KV-cache pruning | ~$60K+ capex; benchmark: ~$0.47 per million tokens, 2.3x baseline throughput |
| HPC cluster (multi-node) | Training, large-scale fine-tuning, frontier-model serving at high concurrency | Interconnect, power, and cooling dominate; justified by workloads single nodes cannot hold |
The requirement ladder is set by model size times precision, then by concurrency: quantization moves workloads down a tier, and most teams need one tier less hardware than their FP16 sizing suggests.
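To compare tiers on cost, one rough approach is to amortize capex and power over the tokens a server actually produces. The sketch below is an assumption-laden illustration: the function name, the amortization period, the power draw, the electricity price, the throughput, and the utilization are all hypothetical inputs, not figures from our benchmark, and it ignores staffing, hosting, and networking costs.

```python
def cost_per_million_tokens(
    capex_usd: float,
    amortization_years: float,
    power_kw: float,
    usd_per_kwh: float,
    sustained_tokens_per_sec: float,
    utilization: float,
) -> float:
    """Amortized hardware plus power cost per million generated tokens."""
    hours = amortization_years * 365 * 24
    hourly_capex = capex_usd / hours
    # Assumes the server draws power_kw around the clock, busy or idle.
    hourly_power = power_kw * usd_per_kwh
    # Only the utilized fraction of each hour produces tokens.
    tokens_per_hour = sustained_tokens_per_sec * utilization * 3600
    return (hourly_capex + hourly_power) / (tokens_per_hour / 1e6)


# Hypothetical inputs for a 2-GPU server; replace with measured values.
cost = cost_per_million_tokens(
    capex_usd=60_000,
    amortization_years=3,
    power_kw=2.0,
    usd_per_kwh=0.15,
    sustained_tokens_per_sec=2_000,
    utilization=0.5,
)
print(f"~${cost:.2f} per million tokens")
```

Utilization is the most sensitive input: halving it doubles the cost per token, which is why measured workload volume matters more than peak hardware specifications.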
Frequently asked questions
How much GPU memory does an LLM need?
Roughly 2GB per billion parameters at FP16, half that at INT8, and a quarter at INT4, plus KV-cache headroom that scales with context length and concurrent requests. A 70B model is ~140GB at FP16; at INT8 the weights drop to ~70GB, which fits two 80GB GPUs (160GB total) with room left for KV cache, especially with cache pruning.
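The KV-cache headroom can be estimated the same way. The sketch below uses the standard per-token formula (keys and values for every layer) and assumes the Llama 3 70B shape of 80 layers, 8 KV heads, and a head dimension of 128, with an FP16 cache. The 8,192-token context, 16 concurrent sequences, and 90% usable-VRAM margin are illustrative assumptions, and no cache pruning is modeled.

```python
def kv_cache_gb(
    layers: int,
    kv_heads: int,
    head_dim: int,
    context_tokens: int,
    concurrent_seqs: int,
    bytes_per_value: int = 2,  # FP16 cache entries
) -> float:
    """Approximate KV-cache size in GB (1 GB = 1e9 bytes)."""
    # Factor of 2: one key vector and one value vector per layer per token.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens * concurrent_seqs / 1e9


# Assumed model shape: Llama 3 70B (80 layers, 8 KV heads, head dim 128).
weights_gb = 70 * 1.0  # INT8: about 1 byte per parameter
cache_gb = kv_cache_gb(80, 8, 128, context_tokens=8192, concurrent_seqs=16)
total_gb = weights_gb + cache_gb

# Hypothetical safety margin: treat 90% of 2 x 80GB as usable.
usable_gb = 2 * 80 * 0.9
print(f"weights {weights_gb:.0f} GB + cache {cache_gb:.1f} GB = {total_gb:.1f} GB")
print("fits on 2 x 80GB" if total_gb <= usable_gb else "does not fit on 2 x 80GB")
```

Because the cache grows linearly with both context length and concurrency, doubling either one doubles the cache term while the weight term stays fixed.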
Can an LLM run on a Jetson Orin Nano?
Yes. Quantized small models run on the Orin Nano, and a full local RAG pipeline fits alongside them. NavyaAI runs an edge QA bot this way: documents, embeddings, retrieval, and generation all on the device, with no data leaving it. Our published benchmark measured 88-158 tokens/sec aggregate at 16 concurrent users on that same board.
What are the hosting requirements beyond GPUs?
Power and cooling sized to sustained load, NVMe storage for model weights and vector indexes, enough system RAM for the serving stack, monitoring, and an operations owner. For multi-GPU serving, interconnect bandwidth matters; for edge fleets, remote update and observability matter most.
How many GPUs do I need for production LLM serving?
Start from tokens per day, peak concurrency, and p95 latency — then quantize and measure. Our published audit served a Llama 3 70B workload on 2 H100s at 2.3x the throughput of the original 4-GPU FP16 deployment. Sizing before optimizing routinely doubles the hardware bill.
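A first-pass GPU count can be derived from the workload numbers before any purchase. The sketch below is a rough capacity estimate under stated assumptions: the function name, the peak-to-average ratio, the per-GPU throughput, and the 70% utilization target are all hypothetical placeholders. Per-GPU throughput in particular has to come from measuring your own model, precision, and context length, and the result still needs a p95 latency check under load.

```python
import math


def gpus_needed(
    tokens_per_day: float,
    peak_to_average: float,
    tokens_per_sec_per_gpu: float,
    utilization_target: float = 0.7,
) -> int:
    """Estimate GPU count from daily token volume and measured throughput."""
    average_tps = tokens_per_day / 86_400  # seconds per day
    peak_tps = average_tps * peak_to_average
    # Leave headroom so peak traffic does not run GPUs at 100%.
    effective_tps = tokens_per_sec_per_gpu * utilization_target
    return max(1, math.ceil(peak_tps / effective_tps))


# Hypothetical workload: 100M tokens/day with peaks at 3x the average rate.
before = gpus_needed(100_000_000, 3, tokens_per_sec_per_gpu=1_000)
after = gpus_needed(100_000_000, 3, tokens_per_sec_per_gpu=2_000)
print(f"unoptimized: {before} GPUs, optimized: {after} GPUs")
```

Running the estimate with throughput measured before and after quantization shows how far optimization moves the hardware requirement for the same workload.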
Apply this to your stack
Request a free AI inference audit before changing providers or buying GPUs.
Share your monthly spend, token volume, model stack, RAG or agent pattern, and latency target. NavyaAI will identify the first cost levers to inspect.