Edge LLM Deployment Sizing

Size a private, on-device LLM deployment — hardware tier, per-site cost, fleet totals, and break-even vs cloud APIs — grounded in throughput and power numbers physically measured on a Jetson Orin Nano.

Benchmark data measured June 2026 by NavyaAI Engineering. Read the full benchmark.

Measured, not modeled

Sizing grounded in a real bench, not a spreadsheet

Most edge AI calculators model everything. This agent's throughput, power, and concurrency envelopes come from NavyaAI's physical benchmark of a Jetson Orin Nano 8GB: Ollama serving with true GPU concurrency, load-tested from 1 to 16 simultaneous users, with power captured from the board during every run.

88-158 tok/s measured aggregate

At 16 concurrent users with true parallel GPU execution, a Gemma 3 1B-class model served 88 tok/s aggregate and a 270M-class model served 158 tok/s aggregate.
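To see what the aggregate figures mean for one user, they can be divided across the 16 concurrent sessions. The sketch below assumes throughput is shared evenly between users, which the summary above does not state. The model labels and the function name are illustrative.

```python
# Aggregate throughput at 16 concurrent users, taken from the summary above.
MEASURED_AGGREGATE_TOK_S = {"gemma3-1b-class": 88.0, "270m-class": 158.0}
CONCURRENT_USERS = 16


def per_user_tok_s(aggregate_tok_s: float, users: int) -> float:
    """Average per-user rate, ASSUMING the aggregate is split evenly."""
    if users <= 0:
        raise ValueError("users must be positive")
    return aggregate_tok_s / users


for model, aggregate in MEASURED_AGGREGATE_TOK_S.items():
    rate = per_user_tok_s(aggregate, CONCURRENT_USERS)
    print(f"{model}: {rate:.2f} tok/s per user at {CONCURRENT_USERS} users")
```

Under that even-split assumption, the 1B-class figure works out to 5.5 tok/s per user at full concurrency, which is the number to compare against your latency requirement.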

8-9W under sustained load

Measured with tegrastats during the runs; 1.5W idle. Power cost is computed from these readings, not TDP guesses.
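Turning those readings into a monthly power cost is simple arithmetic. The sketch below is a minimal illustration, not the agent's actual formula. Only the 9 W load figure (the upper end of the measured range) and the 1.5 W idle figure come from the benchmark. The electricity rate, busy hours per day, and 30-day month are placeholder assumptions.

```python
LOAD_WATTS = 9.0   # upper end of the measured 8-9 W under sustained load
IDLE_WATTS = 1.5   # measured idle draw


def monthly_power_cost(
    load_w: float,
    idle_w: float,
    busy_hours_per_day: float,
    usd_per_kwh: float,
    days_per_month: int = 30,  # assumption: 30-day month
) -> float:
    """Monthly electricity cost for one board, in USD."""
    if not 0 <= busy_hours_per_day <= 24:
        raise ValueError("busy_hours_per_day must be between 0 and 24")
    idle_hours = 24 - busy_hours_per_day
    wh_per_day = load_w * busy_hours_per_day + idle_w * idle_hours
    kwh_per_month = wh_per_day * days_per_month / 1000
    return kwh_per_month * usd_per_kwh


# Hypothetical inputs: 8 busy hours a day at a placeholder rate of 0.15 USD/kWh.
cost = monthly_power_cost(LOAD_WATTS, IDLE_WATTS, busy_hours_per_day=8, usd_per_kwh=0.15)
print(f"Estimated power cost per board: ${cost:.2f}/month")
```

Because the function takes measured wattage rather than a rated TDP, updating the estimate for a different board only requires new readings.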

Honest failure boundaries

4B-class models measurably failed under concurrent load on the 8GB board. When your workload needs more, the agent says so and routes you to GPU-server sizing.

Written read in 1 business day

Unlocking the full sizing also gets you a written hardware + break-even read from the team that ran the benchmark.

Edge LLM sizing FAQ

What does the edge LLM sizing agent do?

It sizes a private, on-device LLM deployment from six inputs: use case, data sensitivity, number of sites or devices, users per site, latency, and timeline. You get a feasibility verdict, recommended hardware, per-site monthly cost, fleet totals, and a break-even comparison against cloud APIs.
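A break-even comparison of this kind can be sketched as follows. This is an illustration of the general calculation, not the agent's implementation. The function names and every dollar and token figure in the usage lines are hypothetical placeholders, not benchmark data or real provider prices.

```python
from typing import Optional


def api_monthly_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Cloud API spend per month for a given token volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def break_even_months(
    hardware_cost: float,
    edge_monthly_cost: float,
    cloud_monthly_cost: float,
) -> Optional[float]:
    """Months until upfront hardware cost is repaid by monthly savings.

    Returns None when the edge deployment costs as much or more per month
    than the API, in which case it never breaks even.
    """
    monthly_saving = cloud_monthly_cost - edge_monthly_cost
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


# Hypothetical inputs for illustration only.
cloud = api_monthly_cost(tokens_per_month=50_000_000, usd_per_million_tokens=2.0)
months = break_even_months(hardware_cost=500.0, edge_monthly_cost=10.0, cloud_monthly_cost=cloud)
print("never breaks even" if months is None else f"break-even after {months:.1f} months")
```

Returning `None` rather than a negative number keeps the "edge is not cheaper" case explicit, so a caller cannot mistake it for a short payback period.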

Are the numbers real or modeled?

The throughput, power draw, and concurrency limits are physically measured on a Jetson Orin Nano 8GB in NavyaAI's June 2026 benchmark — including where the board fails (4B-class models under concurrent load). The full benchmark with all tables is published on our blog. Workload volume and API prices are modeled from your inputs and provider price pages, and every assumption is listed with the result.

What hardware does an edge LLM deployment need?

For document Q&A, support assistants, and classification workloads, a Jetson Orin Nano-class board running a quantized 1B-class model serves up to 16 concurrent users at a measured 88 tokens/sec aggregate. Larger models or heavier concurrency need a GPU server — the agent tells you when that is the honest answer and routes you to the on-prem calculator.
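The routing rule described above can be expressed as a check against the measured envelope. The sketch below is an assumption about how such a check might look. The agent's real logic is not published here, and the class labels and function name are illustrative. It is deliberately conservative: anything outside what was measured as working is sent to GPU-server sizing.

```python
# Envelope taken from the benchmark summary: 270M-class and 1B-class models
# were measured at up to 16 concurrent users on the 8GB board.
MEASURED_OK_CLASSES = {"270m", "1b"}
MAX_MEASURED_USERS = 16


def route(model_class: str, concurrent_users: int) -> str:
    """Return "edge" if the request sits inside the measured envelope,
    otherwise "gpu-server"."""
    if concurrent_users < 1:
        raise ValueError("concurrent_users must be at least 1")
    inside_envelope = (
        model_class.lower() in MEASURED_OK_CLASSES
        and concurrent_users <= MAX_MEASURED_USERS
    )
    return "edge" if inside_envelope else "gpu-server"


print(route("1b", 12))   # inside the measured envelope
print(route("4b", 4))    # 4B-class failed under concurrent load
print(route("1b", 32))   # beyond the tested concurrency
```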

What happens after I unlock the full sizing?

You see the complete fleet cost breakdown and API comparison immediately, and the NavyaAI team sends a written sizing read — hardware tier, per-site costs, fleet math, and a deployment checklist — within one business day.