Commercial guide - Last reviewed 2026-06-10

GPU Requirements for Hosting LLMs: Edge to HPC

GPU and hardware requirements for hosting LLMs on-prem: VRAM by model size, edge devices like Jetson Orin Nano, L40S and H100 servers, and HPC clusters.


The short answer

GPU requirements scale with model size and precision. Quantized 3B-8B models run on edge devices like the Jetson Orin Nano or a single workstation GPU, 7B-13B production serving fits one L40S-class card, and 70B-class models need 80GB-class GPUs (two optimized H100s in our published benchmark). INT8 quantization roughly halves weight memory relative to FP16, so choose the serving precision before you place the GPU order.

Size from the workload, not the model card: token volume, concurrency, context length, and p95 latency set the real requirement.

Quantize before you buy: INT8 serving cuts weight VRAM roughly in half relative to FP16 and INT4 cuts it to roughly a quarter, which can halve the GPU count, as in our Llama 3 70B audit.

Match the tier to the job: edge devices for local private QA and RAG, single GPUs for departmental serving, HPC clusters only for training or very high concurrency.

Comparison table

Factor	Sizing guidance	Practical note
VRAM rule of thumb	FP16 needs roughly 2GB per billion parameters, plus KV-cache headroom that grows with context and concurrency.	INT8 roughly halves weight memory; INT4 quarters it — that is how 3B-8B models fit edge devices.
Edge tier (Jetson Orin Nano class)	Runs quantized 3B-8B models with local RAG — our own edge QA bot serves answers entirely on a Jetson Orin Nano.	Best for private, local, latency-tolerant assistants over your own documents; not for high concurrency.
Single-server tier (L40S / H100)	7B-13B production serving fits one L40S; 70B-class needs 80GB-class cards — 2 optimized H100s in our benchmark.	The workhorse tier for departmental and product workloads with measured break-even vs APIs.
HPC / multi-node tier	Needed for training, fine-tuning at scale, or very high-concurrency frontier serving with tensor parallelism.	Interconnect (NVLink/InfiniBand), power, and cooling dominate the requirement — not just GPU count.
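
The VRAM rule of thumb in the table can be sketched as a small calculator. This is a rough illustration, not a sizing tool: the bytes-per-parameter values follow the table, while the function names and the 30% headroom reserved for KV cache and runtime overhead are assumptions chosen for the example. The device capacities in the usage lines (8GB for a Jetson Orin Nano, 48GB for an L40S, 80GB for an H100) are the commonly published figures and should be checked against the exact SKU you are buying.

```python
# Bytes per parameter for each serving precision (rule of thumb from the table).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB: billions of parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str, device_memory_gb: float,
         headroom_fraction: float = 0.3) -> bool:
    """True if the weights fit after reserving headroom for KV cache and runtime.

    headroom_fraction is an illustrative assumption, not a measured value.
    """
    usable_gb = device_memory_gb * (1.0 - headroom_fraction)
    return weight_memory_gb(params_billion, precision) <= usable_gb


if __name__ == "__main__":
    # Print the weight footprint of a few model sizes at each precision.
    for size in (3, 8, 13, 70):
        row = ", ".join(
            f"{p}: {weight_memory_gb(size, p):.1f} GB" for p in BYTES_PER_PARAM
        )
        print(f"{size}B -> {row}")

    print("8B INT4 on an 8GB edge device:", fits(8, "int4", 8))
    print("13B FP16 on a 48GB card:", fits(13, "fp16", 48))
    print("70B INT8 on 2 x 80GB:", fits(70, "int8", 160))
```

Treat a positive result as a reason to benchmark, not as a guarantee: real headroom depends on context length, concurrency, and the serving stack. On an edge device, memory is shared with the operating system, so the usable share is smaller still.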

Worked example

Hardware tiers for self-hosted LLMs

Quantized serving (INT8/INT4) assumed per tier
70B figures from our published Token Tax benchmark
Edge tier: our live Jetson Orin Nano QA bot; full measured benchmark published

Tier	What it runs	Cost picture
Edge: Jetson Orin Nano class	Quantized 3B-8B models + local RAG (embeddings, retrieval, generation on-device)	Hundreds of dollars of hardware and a power envelope on the order of 10 W; private local QA at near-zero marginal cost
Workstation / single GPU (L40S class)	7B-13B production serving, embedding and rerank fleets	~$8-10K per card; best capex per token for small-model serving
Server: 2x H100 class	70B-class serving — Llama 3 70B at INT8 + KV-cache pruning	~$60K+ capex; benchmark: ~$0.47 per million tokens, 2.3x baseline throughput
HPC cluster (multi-node)	Training, large-scale fine-tuning, frontier-model serving at high concurrency	Interconnect, power, and cooling dominate; justified by workloads single nodes cannot hold

The requirement ladder is set by model size times precision, then by concurrency: quantization moves workloads down a tier, and most teams need one tier less hardware than their FP16 sizing suggests.
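
To compare tiers on cost, it can help to turn capex and throughput into a cost per million tokens. The sketch below is a simplified amortization model under stated assumptions: straight-line amortization, a flat monthly operating cost, a 30-day month, and a utilization factor for the share of time the hardware is actually serving. The function name and every input in the usage lines are hypothetical; the example does not reproduce the benchmark figure in the table, which comes from measured throughput.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, an approximation


def cost_per_million_tokens(capex_usd: float, amortization_months: int,
                            monthly_opex_usd: float, tokens_per_sec: float,
                            utilization: float = 0.5) -> float:
    """Amortized hardware plus operating cost, divided by tokens actually served."""
    monthly_cost = capex_usd / amortization_months + monthly_opex_usd
    tokens_per_month = tokens_per_sec * utilization * SECONDS_PER_MONTH
    return monthly_cost / tokens_per_month * 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs: $60K server, 36-month life, $500/month power and
    # hosting, 2,000 tokens/sec sustained when busy, busy half of the time.
    usd = cost_per_million_tokens(60_000, 36, 500, 2_000, utilization=0.5)
    print(f"~${usd:.2f} per million tokens")
```

Utilization is usually the most sensitive input: the same server at half the utilization costs twice as much per token, so measure real traffic before comparing against API pricing.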

Frequently asked questions

How much GPU memory does an LLM need?

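Roughly 2GB per billion parameters at FP16, half that at INT8, and a quarter at INT4, plus KV-cache headroom that scales with context length and concurrent requests. A 70B model needs ~140GB for weights at FP16, which nearly fills two 80GB GPUs before any KV cache is allocated. At INT8 the weights drop to ~70GB, so the same two GPUs hold them comfortably, and cache pruning stretches the remaining memory further.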
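
The KV-cache part of that answer can be estimated with the standard per-token formula: keys and values are stored for every layer and every KV head. The sketch below is illustrative only. The architecture values used for Llama 3 70B (80 layers, 8 KV heads, head dimension 128) are assumptions taken from the publicly released model configuration and should be checked against the config of the checkpoint you serve; the 8,192-token context and 16 concurrent requests are hypothetical workload numbers, and cache pruning or cache quantization is not modeled.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Keys plus values (the factor 2) for every layer and KV head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, concurrent_requests: int,
                bytes_per_value: int = 2) -> float:
    """Worst-case KV cache in GB if every request fills its whole context."""
    per_token = kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return per_token * context_tokens * concurrent_requests / 1e9


if __name__ == "__main__":
    # Assumed Llama 3 70B shape: 80 layers, 8 KV heads, head_dim 128, FP16 cache.
    per_token = kv_cache_bytes_per_token(80, 8, 128)
    print(f"KV cache per token: {per_token} bytes")  # 327680 bytes, i.e. 320 KiB

    cache_gb = kv_cache_gb(80, 8, 128, context_tokens=8192, concurrent_requests=16)
    weights_gb = 70.0  # INT8 weights, about 1 byte per parameter
    print(f"KV cache: {cache_gb:.1f} GB, weights + cache: {weights_gb + cache_gb:.1f} GB")
```

Under these assumptions the worst-case cache is comparable in size to the INT8 weights, which is why context length and concurrency belong in the sizing exercise alongside parameter count.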

Can an LLM run on a Jetson Orin Nano?

Yes. Quantized small models run on the Orin Nano, and a full local RAG pipeline fits alongside them. NavyaAI runs an edge QA bot this way: documents, embeddings, retrieval, and generation all stay on the device, with no data leaving it. Our published benchmark measured 88-158 tokens/sec aggregate at 16 concurrent users on that same board, which works out to roughly 5.5-10 tokens/sec per user if the load is spread evenly.

What are the hosting requirements beyond GPUs?

Power and cooling sized to sustained load, NVMe storage for model weights and vector indexes, enough system RAM for the serving stack, monitoring, and an operations owner. For multi-GPU serving, interconnect bandwidth matters; for edge fleets, remote update and observability matter most.

How many GPUs do I need for production LLM serving?

Start from tokens per day, peak concurrency, and p95 latency — then quantize and measure. Our published audit served a Llama 3 70B workload on 2 H100s at 2.3x the throughput of the original 4-GPU FP16 deployment. Sizing before optimizing routinely doubles the hardware bill.
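
The sizing order described above (workload first, then measured throughput) can be sketched as a short calculation. This is a back-of-the-envelope illustration under stated assumptions: the peak-to-average ratio, the target utilization, and the per-GPU throughput figures in the usage lines are all hypothetical placeholders. The per-GPU figure should come from your own benchmark at the context length and p95 latency you need, not from a model card.

```python
import math


def required_tokens_per_sec(tokens_per_day: float,
                            peak_to_average: float = 3.0) -> float:
    """Peak demand: daily average rate scaled by an assumed peak-to-average ratio."""
    return tokens_per_day / 86_400 * peak_to_average


def gpus_needed(tokens_per_day: float, measured_tokens_per_sec_per_gpu: float,
                peak_to_average: float = 3.0,
                target_utilization: float = 0.7) -> int:
    """GPUs required to meet peak demand while keeping latency headroom.

    measured_tokens_per_sec_per_gpu must be measured at the latency target;
    target_utilization keeps spare capacity so p95 latency does not collapse.
    """
    demand = required_tokens_per_sec(tokens_per_day, peak_to_average)
    usable_per_gpu = measured_tokens_per_sec_per_gpu * target_utilization
    return max(1, math.ceil(demand / usable_per_gpu))


if __name__ == "__main__":
    daily_tokens = 200_000_000  # hypothetical workload
    print("Before optimization:", gpus_needed(daily_tokens, 1_500), "GPUs")
    print("After optimization: ", gpus_needed(daily_tokens, 3_000), "GPUs")
```

Doubling the measured per-GPU throughput roughly halves the result, which is the arithmetic behind quantizing and measuring before buying.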

References & related

Edge LLM Sizing Agent
Self-Hosted LLM Deployment Services
On-Prem LLM Cost Calculator
L40S vs H100 ROI
NavyaAI Jetson Orin Nano benchmark
NavyaAI Token Tax benchmark
NVIDIA Jetson Orin Nano

Apply this to your stack

Request a free AI inference audit before changing providers or buying GPUs.

Share your monthly spend, token volume, model stack, RAG or agent pattern, and latency target. NavyaAI will identify the first cost levers to inspect.
