
We Benchmarked LLMs on a $499 Jetson Orin Nano: 158 Tokens/sec at 8 Watts — and Where the Board Breaks

We load-tested Gemma 3 on a Jetson Orin Nano 8GB: measured throughput, latency, and power from 1 to 16 concurrent users — including the configs that failed.

Nithin Katta - NavyaAI Engineering Team

June 10, 2026


Edge AI, Jetson Orin Nano, LLM Benchmarks, Ollama, Gemma 3, Self-Hosted LLM, AI Cost Optimization

We run a document Q&A bot whose entire pipeline — embeddings, retrieval, and generation — executes on a single NVIDIA Jetson Orin Nano. Before recommending that pattern to anyone else, we wanted numbers we could defend: not vendor TOPS figures, not extrapolations from datacenter GPUs, but measured throughput, latency, and wall power under realistic multi-user load.

So we load-tested the board the way a real deployment would be used: 1 to 16 concurrent users, fixed prompt set, power captured from the device during every run. This post is the write-up. It includes the configurations that worked, the one serving setting that changes everything, and — just as important — the configurations that failed.

Everything labeled "measured" below was physically captured on the board. Everything labeled "modeled" is arithmetic on top of those measurements with stated assumptions. We are publishing both because the difference matters.

TL;DR

Gemma 3 270M: 158.1 tokens/sec aggregate at 16 concurrent users, at 8.0W. (measured)
Gemma 3 1B: 88.1 tokens/sec aggregate at 16 concurrent users, at 9.3W. (measured)
Gemma 3 4B and FunctionGemma: zero completed requests under concurrent load on the 8GB board. Single-user serialized only. (measured — the honest part)
Idle power: 1.5W. The board costs almost nothing when nobody is asking it anything. (measured)
Cost per million output tokens: ~$0.11 (270M) to ~$0.21 (1B) at full concurrency, including board amortization and electricity. (modeled from measured throughput)
Crossover vs gpt-4o-mini: the board's total monthly cost drops below the API bill at roughly 24M tokens/month. (modeled)

Test setup

Component	Detail
Hardware	NVIDIA Jetson Orin Nano 8GB Developer Kit ($499 list)
OS / stack	JetPack 6, L4T 36.4.4, 15W power mode
Inference engine	Ollama (llama.cpp backend)
Quantization	Q4_K_M (4-bit) GGUF
Models	Gemma 3 270M, Gemma 3 1B, Gemma 3 4B, FunctionGemma
Load generator	Locust, headless, from a separate host over HTTP
Concurrency levels	1, 2, 4, 8, 16 simultaneous users
Power capture	tegrastats logged on-device during every run
Concurrent serving config	`OLLAMA_NUM_PARALLEL=16`, `OLLAMA_NUM_CTX=1024`

Requests used a fixed prompt set; the concurrency suite measured generation at 64 output tokens per request so queueing and co-execution behavior dominate the signal rather than long-tail generation variance.
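For readers who want to try a similar load pattern without setting up Locust, it can be approximated with standard-library Python. This is a minimal sketch, not the harness behind the published numbers: the URL assumes Ollama's default local port, the `gemma3:1b` model tag is a placeholder, and `ollama_generate` and `run_load` are hypothetical helper names.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumption: default Ollama port


def ollama_generate(prompt, model="gemma3:1b", max_tokens=64):
    """Send one non-streaming request; return (output_tokens, seconds)."""
    body = json.dumps({
        "model": model,  # placeholder tag, swap in the model under test
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": max_tokens},  # cap output length per request
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    # eval_count is the number of generated tokens reported by the server.
    return reply.get("eval_count", 0), time.perf_counter() - start


def run_load(send, prompts, users):
    """Issue every prompt with at most `users` requests in flight; summarize."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        results = list(pool.map(send, prompts))
    wall = time.perf_counter() - start
    tokens = sum(count for count, _ in results)
    latencies = sorted(seconds for _, seconds in results)
    return {
        "aggregate_tok_per_s": tokens / wall,
        "avg_latency_s": sum(latencies) / len(latencies),
        "max_latency_s": latencies[-1],
    }
```

A call such as `run_load(ollama_generate, prompts, users=16)` with a list of fixed prompts gives a rough aggregate rate. Run it from a separate host, as in the setup above, so the client does not compete with the model for the board's CPU and memory.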

The setting that changes everything: serialization vs true concurrency

Out of the box, an Ollama server processes requests one at a time. Under a single user that is invisible. Under sixteen, it is fatal: every user's wait time becomes the sum of everyone ahead of them in the queue. In our serialized runs, latencies at high user counts ballooned into many seconds of pure queueing — the model wasn't slow, the line was long.
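The queueing effect is simple arithmetic. A minimal sketch, assuming all users submit at the same instant and every request takes the same service time; the 2-second service time is illustrative, not a measurement from these runs:

```python
def serialized_latencies(users, service_s):
    """Latency per user when requests run strictly one after another.

    The k-th request in line waits for the k-1 ahead of it, then runs.
    """
    return [service_s * position for position in range(1, users + 1)]


def summarize(latencies):
    """Average and worst-case latency for one batch of simultaneous users."""
    return {"avg_s": sum(latencies) / len(latencies), "worst_s": max(latencies)}


one = summarize(serialized_latencies(users=1, service_s=2.0))
sixteen = summarize(serialized_latencies(users=16, service_s=2.0))
print(one)      # avg 2.0 s, worst 2.0 s
print(sixteen)  # avg 17.0 s, worst 32.0 s
```

The model itself is equally fast in both cases. Only the position in line changes.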

Setting `OLLAMA_NUM_PARALLEL` changes the execution model: requests co-execute on the GPU, sharing compute cycles token by token. Aggregate throughput climbs with each added user instead of collapsing into a queue, and no single long request blocks everyone else. Every result below uses this true-concurrency configuration.

If you take one operational lesson from this post: a multi-user edge assistant without `OLLAMA_NUM_PARALLEL` tuned is misconfigured.
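One way to apply the setting when the server is started from a script is sketched below. The variable names and values are copied from the setup table; `serving_env` and `launch_server` are hypothetical helpers, and which variables a given Ollama release honors should be checked against its documentation.

```python
import os
import subprocess


def serving_env(parallel=16, ctx=1024, base=None):
    """Return a copy of the environment with the concurrency settings applied."""
    env = dict(os.environ if base is None else base)
    env["OLLAMA_NUM_PARALLEL"] = str(parallel)  # requests served at the same time
    env["OLLAMA_NUM_CTX"] = str(ctx)            # name as listed in the setup table
    return env


def launch_server():
    """Start `ollama serve` with those settings (assumes ollama is on PATH)."""
    return subprocess.Popen(["ollama", "serve"], env=serving_env())
```

Building the environment in one function keeps the serving configuration in version control next to the benchmark scripts, so a serialized default cannot slip back in unnoticed.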

Results: Gemma 3 270M (measured)

Concurrent users	Aggregate tokens/sec	Avg latency	P95 latency	Power	Peak RAM
1	12.8	1.6s	4.0s	5.7W	3.0 GB
2	34.4	2.4s	2.5s	4.7W	2.6 GB
4	62.4	1.6s	2.4s	7.4W	3.1 GB
8	112.6	1.8s	3.7s	8.4W	3.2 GB
16	158.1	3.5s	5.1s	8.0W	3.2 GB

Throughput scales ~12x from 1 to 16 users (12.8 to 158.1 tokens/sec), about 77% of ideal linear scaling. The bottleneck is memory bandwidth, and parallel decode keeps the memory bus saturated doing useful work instead of idling between serialized requests.
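The scaling figure can be recomputed directly from the table. A small sketch, where "ideal" means aggregate throughput growing in exact proportion to the user count (`scaling` is a hypothetical helper name):

```python
# Measured aggregate tokens/sec for Gemma 3 270M, keyed by concurrent users.
AGG_270M = {1: 12.8, 2: 34.4, 4: 62.4, 8: 112.6, 16: 158.1}


def scaling(table):
    """Speedup over the single-user run and fraction of ideal linear scaling."""
    base = table[1]
    return {
        users: (rate / base, rate / base / users)
        for users, rate in table.items()
    }


for users, (speedup, efficiency) in scaling(AGG_270M).items():
    print(f"{users:>2} users: {speedup:5.2f}x speedup, {efficiency:.0%} of linear")
```

At 16 users this gives a speedup of about 12.35x, or roughly 77% of ideal.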

Results: Gemma 3 1B (measured)

Concurrent users	Aggregate tokens/sec	Avg latency	P95 latency	Power	Peak RAM
1	11.8	6.3s	6.3s	7.5W	3.5 GB
2	33.0	3.1s	3.4s	8.8W	3.8 GB
4	54.1	4.2s	4.2s	8.7W	3.9 GB
8	78.6	6.0s	6.3s	10.0W	4.0 GB
16	88.1	7.9s	8.0s	9.3W	4.3 GB

The 1B is the interesting model for real assistants: meaningfully better language quality than the 270M, and it still serves 16 simultaneous users at 88 tokens/sec aggregate — about 5.5 tokens/sec per user under full load, which reads as a steady, usable stream in a chat interface. Throughput begins to plateau between 8 and 16 users as the 68 GB/s memory bus becomes the wall; latency, not throughput, is what degrades past that point.
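The per-user rate and the plateau can both be read off the table with a few lines. This sketch assumes the aggregate is split evenly across users, which the measurements do not guarantee; the helper names are hypothetical.

```python
# Measured aggregate tokens/sec for Gemma 3 1B, keyed by concurrent users.
AGG_1B = {1: 11.8, 2: 33.0, 4: 54.1, 8: 78.6, 16: 88.1}


def per_user_rate(table):
    """Average tokens/sec each user sees if the aggregate is split evenly."""
    return {users: rate / users for users, rate in table.items()}


def doubling_gain(table):
    """Aggregate throughput ratio each time the user count doubles."""
    levels = sorted(table)
    return {hi: table[hi] / table[lo] for lo, hi in zip(levels, levels[1:])}


print(round(per_user_rate(AGG_1B)[16], 1))  # 5.5
print(round(doubling_gain(AGG_1B)[8], 2))   # 1.45
print(round(doubling_gain(AGG_1B)[16], 2))  # 1.12
```

Going from 4 to 8 users still buys about 45% more aggregate throughput. Going from 8 to 16 buys only about 12%, which is the plateau described above.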

Where the board breaks (and why we're publishing it)

Gemma 3 4B and FunctionGemma completed zero requests under our concurrent load tests. Not slow — zero. With 8GB of unified memory shared between the OS, the weights, and a 16-slot KV cache, 4B-class models exhaust the budget and requests never complete.
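To see why the slot count matters, here is the usual back-of-envelope formula for KV cache size. It is a sketch under loud assumptions: the architecture numbers are placeholders rather than Gemma's real dimensions, the cache is assumed to be fp16, and a real runtime may allocate differently (for example with sliding-window attention or a quantized cache). No memory breakdown was captured in these runs.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, ctx_tokens, slots, bytes_per_value=2):
    """Memory for keys and values across all layers and parallel slots.

    The leading 2 counts one key tensor and one value tensor per layer.
    """
    return 2 * layers * kv_heads * head_dim * ctx_tokens * slots * bytes_per_value


GIB = 1024 ** 3

# Placeholder architecture, NOT Gemma's real dimensions: 32 layers,
# 8 KV heads of size 128, fp16 cache, 1024-token context per slot.
one_slot = kv_cache_bytes(32, 8, 128, ctx_tokens=1024, slots=1)
sixteen_slots = kv_cache_bytes(32, 8, 128, ctx_tokens=1024, slots=16)
print(one_slot / GIB, sixteen_slots / GIB)  # 0.125 2.0
```

The cache grows linearly with the number of slots, so a setting that is harmless for a small model can claim gigabytes for a larger one before the weights and the OS are counted.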

Single-user serialized runs did work: the 4B generated at ~9.5 tokens/sec (with ~54-second end-to-end latencies on our standard request) and FunctionGemma at ~15 tokens/sec. That is fine for a one-person batch workload and unusable for a multi-user assistant.

So the practical, measured ceiling of the Orin Nano 8GB for real concurrent serving is ~1B-class quantized models. Most edge benchmark content stops at "it runs" — single-user, best case. The failure boundary is the number that actually determines whether your deployment works, which is why it is in this post and not in a footnote.

Power and cost (modeled from measured inputs)

Measured inputs: 1.5W idle, 8.0–9.3W under sustained load, and the throughput tables above. Modeled assumptions: $499 board amortized over 3 years, $0.15/kWh, 30% active duty cycle.

That works out to roughly $14/month per board all-in — and a cost per million output tokens of ~$0.11 (270M) to ~$0.21 (1B) at full concurrency.
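The arithmetic can be reproduced in a few lines. This is one plausible reading of the model, not the worksheet itself: the 730-hour month (8760 / 12) and the way duty cycle blends load and idle power are assumptions, and the function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: 8760 hours per year / 12


def monthly_cost(board_usd=499, months=36, usd_per_kwh=0.15,
                 load_w=8.0, idle_w=1.5, duty=0.30):
    """Amortized hardware plus electricity for one board, in dollars per month."""
    avg_watts = duty * load_w + (1 - duty) * idle_w
    energy_kwh = avg_watts * HOURS_PER_MONTH / 1000
    return board_usd / months + energy_kwh * usd_per_kwh


def cost_per_million(tok_per_s, load_w, duty=0.30):
    """Dollars per million output tokens at the given sustained throughput."""
    tokens = tok_per_s * duty * HOURS_PER_MONTH * 3600
    return monthly_cost(load_w=load_w, duty=duty) / (tokens / 1e6)


print(round(monthly_cost(load_w=8.0), 2))      # 14.24 (board running the 270M)
print(round(cost_per_million(158.1, 8.0), 3))  # 0.114 (270M at 16 users)
print(round(cost_per_million(88.1, 9.3), 3))   # 0.206 (1B at 16 users)
```

Almost all of the monthly figure is hardware amortization (about $13.86). Electricity contributes well under a dollar, so the result is far more sensitive to the amortization period and duty cycle than to the price per kWh.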

Against gpt-4o-mini at $0.60 per million output tokens, the arithmetic crosses at roughly 24M tokens/month: below that volume the API is cheaper in raw dollars; above it the board wins, and keeps winning by more as volume grows. Two honest caveats on that comparison: a 1B local model is not gpt-4o-mini — this math applies to tasks where a small model passes your quality bar (document Q&A over a known corpus, classification, routing); and at small volumes the absolute dollars are tiny either way. The economics get interesting at fleet scale — 40 sites is ~$20K of hardware once versus a recurring per-query API bill on every site, plus the thing no API can offer: documents that never leave the device.
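The crossover is a single division: a flat monthly board cost against a per-token API price. A minimal sketch using the rounded $14/month figure; the function names are hypothetical, and the comparison ignores input-token pricing and any quality difference between the models.

```python
def crossover_tokens_m(board_monthly_usd, api_usd_per_m):
    """Monthly volume (millions of tokens) where the API bill equals the board cost."""
    return board_monthly_usd / api_usd_per_m


def cheaper_option(tokens_m, board_monthly_usd=14.0, api_usd_per_m=0.60):
    """Compare a flat board cost with a per-token API bill at one volume."""
    api_bill = tokens_m * api_usd_per_m
    return "board" if board_monthly_usd < api_bill else "api"


print(round(crossover_tokens_m(14.0, 0.60), 1))  # 23.3
print(cheaper_option(10), cheaper_option(50))    # api board
print(40 * 499)                                  # 19960, hardware for 40 sites
```

With the rounded $14 input the crossover lands near 23M tokens/month. A slightly higher unrounded monthly cost moves it toward the roughly 24M quoted above.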

What this means for edge QA bots

This benchmark exists because we run this exact deployment: a RAG-backed Q&A assistant on an Orin Nano, fully on-device. The numbers translate into three deployment rules we now apply:

Match the model to the measured envelope. 1B-class for assistants, 270M-class for classification and routing. Wanting a 4B on this board is wanting a different board.
Configure for concurrency on day one. The serialized default is a demo configuration, not a deployment configuration.
Size by latency, not just throughput. The board holds aggregate throughput well past 8 users, but per-user latency stretches. If your product needs sub-2-second responses at 16 concurrent users, that is a two-board or bigger-tier conversation.
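The third rule can be turned into a lookup against the measured P95 columns. A sketch, with the caveat that these latencies are for 64-token responses and only cover the tested concurrency levels; `max_users_within` is a hypothetical helper.

```python
# Measured P95 latency in seconds, keyed by concurrent users.
P95_270M = {1: 4.0, 2: 2.5, 4: 2.4, 8: 3.7, 16: 5.1}
P95_1B = {1: 6.3, 2: 3.4, 4: 4.2, 8: 6.3, 16: 8.0}


def max_users_within(p95_table, target_s):
    """Largest tested concurrency whose P95 latency meets the target, or None."""
    passing = [users for users, p95 in p95_table.items() if p95 <= target_s]
    return max(passing) if passing else None


print(max_users_within(P95_270M, 4.0))  # 8
print(max_users_within(P95_1B, 4.0))    # 2
print(max_users_within(P95_1B, 2.0))    # None
```

A `None` result is the signal to change the model, the latency target, or the hardware tier rather than to keep tuning the same board.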

If you want these rules applied to your own workload — sites, users, data sensitivity — our Edge LLM Sizing Agent runs on exactly the numbers in this post and tells you honestly when the answer is "this doesn't fit on edge hardware."

Limits of this benchmark

What we did not measure here, and won't claim: end-to-end RAG pipeline latency (embedding + retrieval + generation as one number — that benchmark is in progress on the same rig), thermal behavior over multi-hour sustained load, other quantization formats, and the Orin Nano Super power mode. API prices and the cloud-GPU comparisons are from provider price pages as of June 2026 and will drift.

The raw run data — per-request CSVs, tegrastats logs, and the cost-model worksheet with every assumption — comes from the same suite this post is built on, and we're packaging the run scripts and prompt set into a public repository so the results can be reproduced on your own board.

Benchmarked and written by Nithin Katta, NavyaAI Engineering Team, June 2026.

FAQ

Common questions

Can an LLM run on a Jetson Orin Nano?

Yes — with the right model size. In our measured benchmark, Gemma 3 270M served 158 tokens/sec aggregate and Gemma 3 1B served 88 tokens/sec aggregate at 16 concurrent users on the 8GB board, using Q4_K_M quantization via Ollama. 4B-class models loaded but failed to complete requests under concurrent load.

How many tokens per second does a Jetson Orin Nano produce?

Measured on our bench with true parallel GPU execution: Gemma 3 270M scaled from 12.8 tokens/sec at 1 user to 158.1 tokens/sec aggregate at 16 users; Gemma 3 1B scaled from 11.8 to 88.1 tokens/sec aggregate. Single-user serialized runs of larger models managed ~9.5 (4B) and ~15 (FunctionGemma) tokens/sec.

What does Jetson LLM inference cost vs cloud APIs?

Modeled from our measured throughput and power: roughly $0.11 per million output tokens for the 270M model and $0.21 for the 1B at full concurrency, including board amortization and electricity. Against gpt-4o-mini at $0.60 per million output tokens, the board's total monthly cost crosses below the API bill at roughly 24M tokens/month.

What fails on the 8GB Jetson Orin Nano?

Gemma 3 4B-class models and FunctionGemma completed zero requests under our concurrent load tests — unified memory is the wall. They ran single-user serialized only, at ~9.5 and ~15 tokens/sec with high latencies. The practical ceiling for real multi-user serving on this board is ~1B-class quantized models.