AI InfrastructureFeatured

OatRAG on Jetson Orin Nano: Enterprise RAG in Under 100MB

We ran a full RAG stack on an NVIDIA Jetson Orin Nano in 90 MB — an 84.8% memory cut vs a traditional stack, leaving 6.45 GB free for the LLM.

Katta Naga Nithin

June 22, 2026

12 min read

OatRAGRAGJetson Orin NanoEdge AIRustPostgreSQLLLMMemory OptimizationEdge ComputingNVIDIA JetsonEdge RAGRAG Memory OptimizationpgvectorOn-Device AIPrivate RAG

OatRAG on Jetson Orin Nano: Enterprise RAG in Under 100MB

Most RAG systems focus on optimizing models. We focused on optimizing everything around the model.

During our testing, a traditional RAG stack consumed nearly 600 MB of runtime memory before inference even began.

OatRAG reduced that footprint to just 90.36 MB.

That's an 84.8% reduction in infrastructure memory usage, leaving more resources available for language models, embeddings, context windows, and concurrent users.

This is the story of how we built a production-grade RAG platform that consumes less memory than many standalone web applications.

The Headline Result

Metric	Value
Total Jetson Memory	7.6 GB
Memory Used by OatRAG Stack	90.36 MB
Total System RAM Usage	917 MB
Available Memory for Models	6.45 GB
Available Memory Percentage	84.8%

Key Finding

The complete OatRAG platform—including backend APIs, vector database, retrieval engine, analytics, and frontend delivery—consumed only 90.36 MB of RAM.

That leaves the overwhelming majority of Jetson memory available for running language models.

For edge AI deployments, this distinction matters. Infrastructure overhead should never become the bottleneck that limits model size, context length, or concurrent users.

Why This Matters

On Jetson Orin Nano, CPU and GPU share the same 8GB unified memory pool.

That memory must simultaneously hold:

Operating system
Application services
Databases
Model weights
KV cache
Runtime buffers
Active inference workloads

Many AI applications consume hundreds of megabytes before inference even begins.

OatRAG was designed differently.

Instead of maximizing framework layers and runtime dependencies, the goal was to minimize overhead while maintaining enterprise-grade retrieval capabilities.

The result is a production-ready RAG platform that consumes less memory than many standalone API servers. If you're weighing edge or on-prem deployment against a managed API, the on-prem LLM cost estimator models the break-even.

Deployment Footprint

Before a single query is executed, infrastructure size matters.

Lightweight deployments reduce update times, improve portability, and leave more resources available for AI workloads.

The complete OatRAG deployment consists of only three lightweight application containers alongside Ollama.

OatRAG Deployment Architecture

Docker Image Sizes

Container Image	Disk Size
oat-rag	120 MB
Oat UI	67.5 MB
pgvector/pgvector	459 MB
ollama/ollama	4.16 GB

Key Observation

Excluding the inference runtime itself, the entire OatRAG application stack occupies less than 650 MB of storage.

This allows rapid deployment even in bandwidth-constrained environments and makes updates significantly easier than larger enterprise AI stacks.

Container-Level Memory Breakdown

One of the primary goals behind OatRAG was reducing infrastructure overhead to an absolute minimum.

The following measurements were captured directly from the live Jetson deployment.

Container Memory Usage

OatRAG vs Traditional RAG Architectures

A common question we receive is:

Why build OatRAG in Rust when existing stacks already work?

To answer this, we compared OatRAG against a typical modern RAG deployment consisting of:

Next.js Frontend
Express Backend
ChromaDB
Python Runtime
Ollama

This architecture is common across internal AI assistants, proof-of-concept RAG deployments, and many open-source projects.

The results highlight how infrastructure choices directly impact edge deployment efficiency. For a cost-and-control comparison of these approaches, see our guide on edge RAG vs the OpenAI API.

Runtime Memory Comparison

Component	Traditional Stack	OatRAG
Frontend	247.38 MB	6.93 MB
Backend	167.59 MB	50.05 MB
Vector Database	79.52 MB	33.38 MB
Ollama Runtime	100.86 MB	55.90 MB
Total Runtime Memory	595.35 MB	90.36 MB

Memory Reduction

Metric	Value
Traditional Stack	595.35 MB
OatRAG	90.36 MB
Memory Saved	504.99 MB
Reduction Percentage	84.8%
Efficiency Improvement	6.6×

Key Finding

OatRAG reduces infrastructure memory consumption by nearly 85%, freeing more than 500 MB of additional RAM for language models, embeddings, KV cache, and concurrent users.

Memory Comparison

What Makes the Difference?

The memory savings do not come from using a smaller model.

They come from eliminating unnecessary infrastructure overhead.

Traditional Approach	OatRAG Approach
Next.js Dev Server	Oat UI
Express + TSX Watchers	Rust + Axum
Python + ChromaDB	PostgreSQL + pgvector
Large Runtime Dependencies	Native Compiled Binaries
Multiple Runtime Layers	Single Lightweight Stack

Instead of asking:

"How can we make the model smaller?"

We asked:

"How can we make everything around the model smaller?"

That philosophy ultimately resulted in over 500 MB of additional memory being available for inference on an 8GB Jetson Orin Nano.

Memory Consumption by Service

Service	Memory Usage
PostgreSQL + pgvector	33.38 MB
OatRAG Backend (Rust + Axum)	50.05 MB
Oat Frontend	6.93 MB
Total	90.36 MB

What Stands Out?

The complete platform remained below 100 MB even while serving live requests.

For comparison:

Runtime	Typical Memory Usage
Node.js API Service	150–250 MB
Python API Service	150–300 MB
OatRAG Full Platform	90 MB

The entire OatRAG infrastructure stack consumes less memory than many standalone Python or Node.js services.

On an 8GB Jetson Orin Nano, that means more memory remains available for what actually matters: running language models.

The Secret: Rust Instead of Heavy Runtime Layers

One of the biggest contributors to OatRAG's efficiency is its backend architecture.

Rather than relying on interpreted runtimes, OatRAG uses a compiled Rust backend powered by Axum.

Why Rust Matters

Rust provides:

No garbage collection pauses
Minimal runtime overhead
Predictable memory allocation
High-performance asynchronous execution
Native machine code performance

The entire backend—including retrieval orchestration, SSE streaming, analytics, and API serving—operates in just 51 MB of resident memory.

This is one of the primary reasons OatRAG maintains such a small operational footprint.

Retrieval Is Not the Bottleneck

A common misconception is that vector search dominates RAG latency.

Our measurements show the opposite.

Average Pipeline Latency

Pipeline Stage	Average Time
Hybrid Retrieval	127.5 ms
Context Compilation	0.28 ms
Time-to-First-Token (TTFT)	3,329 ms
Token Generation	41,204 ms

Key Observation

Retrieval contributes less than 1% of overall query time.

The overwhelming majority of latency originates from model inference itself.

This validates one of OatRAG's core design principles:

Retrieval should be fast enough that the language model remains the dominant contributor to latency.

Hybrid Search Architecture

Modern enterprise knowledge bases require more than vector search alone.

OatRAG combines three complementary retrieval strategies.

Technique	Purpose
pgvector HNSW	Semantic Search
SQLite FTS5 BM25	Keyword Search
Reciprocal Rank Fusion (RRF)	Result Fusion

Instead of choosing between semantic search and keyword search, OatRAG uses both.

This hybrid approach improves:

Recall
Precision
Citation quality
Enterprise document retrieval accuracy

while maintaining sub-150ms retrieval times.

Chunk Structure

Layer	Size	Purpose
Child Chunk	100–200 Characters	Precise Retrieval
Parent Chunk	~1000 Characters	Rich Context

The retrieval engine searches small chunks but returns larger parent sections to the language model.

This architecture provides:

Better retrieval accuracy
More complete context
Reduced hallucinations
Improved answer quality

without increasing retrieval latency.

refer to our Jetson Orin Nano LLM Benchmark, which evaluates real-world inference performance under production workloads.

Memory-Aware Model Management

Large language models introduce another challenge on embedded devices.

Memory exhaustion.

To address this, OatRAG actively manages model lifecycles.

The platform:

Monitors active Ollama sessions
Detects idle models
Automatically unloads inactive models
Reclaims memory immediately after use

This proactive strategy helps prevent out-of-memory conditions while maintaining system responsiveness.

The result is a more stable deployment capable of supporting multiple models without permanently occupying valuable memory resources.

Built for Real Edge Deployments

OatRAG was designed around a simple philosophy:

Every megabyte saved by infrastructure is a megabyte available for inference.

To achieve this, the platform combines:

Rust + Axum
PostgreSQL + pgvector
SQLite FTS5
Parent-Child Chunking
Reciprocal Rank Fusion (RRF)
Automatic Model Unloading

Together, these design decisions maximize the usable memory available to language models while maintaining enterprise-grade retrieval quality.

Final Takeaway

Building efficient AI systems is no longer just about choosing the right model.

As models become smaller, faster, and more capable, the infrastructure around them increasingly determines what can actually run on edge hardware.

OatRAG was built around a simple idea:

Infrastructure should consume as little memory as possible so that models can consume as much as necessary.

By replacing heavyweight runtimes with native Rust services, reducing dependency overhead, and combining lightweight retrieval technologies, OatRAG turns an 8GB Jetson Orin Nano into a capable enterprise RAG platform.

The result isn't just lower memory usage.

It's more room for better models, larger context windows, and future growth on the same hardware.

FAQ

Common questions

What is OatRAG?

OatRAG is a lightweight Retrieval-Augmented Generation (RAG) platform built specifically for resource-constrained environments. It combines Rust, PostgreSQL + pgvector, SQLite FTS5, and hybrid retrieval techniques to deliver enterprise-grade AI search with an extremely small memory footprint.

How much memory does OatRAG consume on Jetson Orin Nano?

During benchmarking, the complete OatRAG infrastructure stack—including backend APIs, vector database, analytics, and frontend delivery—consumed only 90.36 MB of RAM, leaving approximately 84.8% of the Jetson Orin Nano's memory available for model inference.

Why does OatRAG use PostgreSQL + pgvector instead of ChromaDB?

PostgreSQL with pgvector provides enterprise-grade vector search while avoiding the large Python runtime dependencies typically required by ChromaDB. This significantly reduces memory overhead and simplifies deployment.

Why was Rust chosen instead of Node.js or Python?

Rust provides native compiled performance, predictable memory usage, zero garbage collection pauses, and highly efficient concurrency. These characteristics make it ideal for edge AI deployments where every megabyte matters.

Can OatRAG run outside Jetson Orin Nano?

Yes. While optimized for edge AI deployments, OatRAG can also be deployed on cloud servers, on-premise infrastructure, industrial gateways, and other Linux-based environments.

Can you run RAG on a Jetson Orin Nano?

Yes. We ran a full production RAG stack—vector database, hybrid retrieval, API, and UI—on an 8GB NVIDIA Jetson Orin Nano using only 90.36 MB of RAM, leaving about 6.45 GB (84.8%) free for the language model.

How much memory does an edge RAG stack need?

A conventional RAG stack (Next.js, Express, ChromaDB, Python) uses roughly 595 MB before inference even starts. A Rust + PostgreSQL/pgvector stack like OatRAG delivers the same retrieval in about 90 MB—an 84.8% reduction—so most of the device's memory stays available for the model.

OatRAG on Jetson Orin Nano: Enterprise RAG in Under 100MB

The Headline Result

Key Finding

Why This Matters

Deployment Footprint

Docker Image Sizes

Key Observation

Container-Level Memory Breakdown

OatRAG vs Traditional RAG Architectures

Runtime Memory Comparison

Memory Reduction

What Makes the Difference?

Memory Consumption by Service

What Stands Out?

The Secret: Rust Instead of Heavy Runtime Layers

Why Rust Matters

Retrieval Is Not the Bottleneck

Average Pipeline Latency

Key Observation

Hybrid Search Architecture

Chunk Structure

Memory-Aware Model Management

Built for Real Edge Deployments

Final Takeaway

Common questions

Keep going

Edge RAG vs OpenAI API

On-Prem LLM Cost Estimator

Jetson Orin Nano LLM Benchmark