Building Production-Ready GPU-Accelerated Transformer Summarization Services: Python vs Rust
A comprehensive comparison of Python (FastAPI + Hugging Face) versus Rust (Axum + rust-bert) for production transformer inference. Load testing reveals Rust delivers 30-50% lower latency and 35-81% higher throughput.

Table of Contents
- Executive Summary
- Introduction
- Why This Comparison Matters
- System Requirements
- Project Architecture
- Python Implementation Deep Dive
- Rust Implementation Deep Dive
- GPU Orchestration
- API Usage Examples
- Building and Deploying
- Performance Analysis
- Production Considerations
- Decision Framework
- Troubleshooting
- Conclusion
- Additional Resources
Executive Summary
TL;DR: Rust delivers 30-50% lower latency and 35-81% higher throughput compared to Python for GPU-accelerated transformer inference, with the advantage increasing under load and batch processing scenarios.
This comprehensive guide compares production-ready Python (FastAPI + Hugging Face) versus Rust (Axum + rust-bert) implementations for transformer inference services. Through extensive load testing on identical GPU infrastructure, we reveal when performance differences matter and provide complete deployment-ready solutions.
Key Findings:
- Latency: Rust consistently achieves 30-50% lower response times
- Throughput: 35-81% higher request processing capacity
- Scalability: Performance advantage increases with concurrent load
- Resource Efficiency: Better GPU utilization in batch scenarios
Introduction
The choice between Python and Rust for deploying machine learning models in production has sparked countless debates in the engineering community. Python, with its rich ecosystem and rapid development cycle, has been the undisputed champion for ML workflows. But as organizations scale their inference services to handle millions of requests, a critical question emerges: Is Python's convenience worth the performance trade-offs?
This comprehensive guide presents a head-to-head comparison of Python (FastAPI + Hugging Face Transformers) versus Rust (Axum + rust-bert) for production transformer inference services. We've built functionally equivalent microservices exposing a /summarize endpoint powered by BART-Large-CNN, deployed them on identical GPU infrastructure, and conducted extensive load testing to measure real-world performance differences.
The Results May Surprise You: Our benchmarks reveal that Rust consistently delivers 30-50% lower latency and 35-81% higher throughput across various workload patterns. But the story is more nuanced than raw numbersβwe'll explore when these advantages matter, when they don't, and how to make informed decisions for your specific use case.
Why This Comparison Matters
The ML inference landscape is evolving rapidly. As transformer models move from research labs to production systems serving millions of users, engineering teams face a critical decision: optimize for development velocity or runtime performance?
Python's Dominance
Python has been the default choice for ML deployments, offering:
- Vast ecosystem of ML libraries and frameworks
- Rapid prototyping and development
- Extensive community support and documentation
- Seamless integration with training pipelines
Rust's Promise
Rust is gaining traction in production ML systems, claiming:
- Significantly lower latency and higher throughput
- Better resource utilization and cost efficiency
- Memory safety without garbage collection overhead
- Predictable performance under load
The Reality: Most articles discussing this topic rely on microbenchmarks or theoretical analysis. This guide differs by providing complete, production-ready implementations and comprehensive load testing on the same hardware, same models, and same GPU infrastructure. The results reveal both the magnitude of performance differences and the scenarios where they matter most.
The Architecture We're Testing
Both implementations follow production-ready patterns:
- GPU-accelerated inference using NVIDIA CUDA
- RESTful API with health checks and error handling
- Rate limiting to prevent abuse
- Batch processing for improved throughput
- Docker containerization with multi-stage builds
- Comprehensive logging and observability
By keeping everything identical except the language and runtime, we can isolate the true performance impact of choosing Python versus Rust for ML inference services.
System Requirements and Platform Compatibility
Before diving into the implementation, it's crucial to understand the hardware and software requirements for running GPU-accelerated transformer models.
GPU Requirements
This guide requires an NVIDIA GPU with CUDA support. The services are designed for GPU acceleration and will not perform optimally on CPU-only systems.
Minimum Requirements
- NVIDIA GPU with CUDA Compute Capability 6.0 or higher
- 8GB+ GPU memory (VRAM) recommended for BART-Large-CNN
- NVIDIA Driver version compatible with CUDA 12.1 or higher
Check Your GPU:
# Linux/Windows with NVIDIA driver installed
nvidia-smi
# Expected output shows GPU model, driver version, and CUDA version
Platform Compatibility
Supported Platforms
Linux (Recommended)
- Native CUDA support
- Best performance and compatibility
- Required: NVIDIA Driver + NVIDIA Container Toolkit
- Tested on: Ubuntu 20.04+, CentOS 8+, RHEL 8+
Windows (with WSL2)
- CUDA support via WSL2 (Windows Subsystem for Linux 2)
- Requires: Windows 10/11 with WSL2 + NVIDIA Driver for Windows
- Docker Desktop with WSL2 backend
- Performance: ~5-10% slower than native Linux
- See: NVIDIA CUDA on WSL2 Guide
Not Supported
- macOS
- CUDA is NOT available on macOS (Apple discontinued NVIDIA GPU support after macOS 10.13)
- Even with an NVIDIA eGPU, CUDA drivers are not available
- Alternatives for Mac users:
- Use CPU-only mode (significantly slower, 10-50× latency increase)
- Deploy to cloud GPU instances (AWS, GCP, Azure)
- Use Apple Silicon with Metal Performance Shaders (requires a different implementation)
- Remote development on Linux/Windows machines
Verifying Your Setup
# Check NVIDIA driver
nvidia-smi
# Check Docker GPU access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
# Check CUDA version compatibility
nvcc --version # Should show CUDA 12.x or compatible
# Check available GPU memory
nvidia-smi --query-gpu=memory.free --format=csv
Project Architecture Overview
The project implements two functionally equivalent services with the following structure:
summarizer-service/
├── python-fastapi/       # Python implementation
│   ├── app.py            # Main FastAPI application
│   ├── models.py         # Pydantic models for validation
│   ├── requirements.txt  # Python dependencies
│   ├── Dockerfile        # GPU-enabled container
│   └── start.sh          # Entry point script
├── rust-axum/            # Rust implementation
│   ├── Cargo.toml        # Rust dependencies
│   ├── src/
│   │   └── main.rs       # Main Axum application
│   ├── Dockerfile        # GPU-enabled container
│   └── build.sh          # Build script
├── compose.gpu.yml       # Docker Compose for both services
├── README.md             # Deployment documentation
└── loadtest.md           # Load testing guide
Both services provide identical API endpoints:
- GET /health: Service health and status information
- POST /summarize: Text summarization with configurable parameters
Deep Dive: Python Implementation (FastAPI + Hugging Face Transformers)
Design Philosophy
The Python implementation leverages the Hugging Face Transformers library, which has become the de facto standard for working with transformer models. FastAPI provides a modern, high-performance framework with automatic API documentation, type validation, and async support.
Core Dependencies
The Python service relies on several key libraries:
View Dependencies
fastapi==0.115.0 # Modern web framework
transformers==4.44.0 # Hugging Face transformers
torch # PyTorch for model inference
uvicorn==0.30.0 # ASGI server
accelerate==0.33.0 # Training/inference optimization
pydantic==2.8.2 # Data validation
The full requirements.txt includes exact version pins for reproducibility, with PyTorch configured for CUDA 12.1:
--extra-index-url https://download.pytorch.org/whl/cu121
Important Note: The CUDA version in the wheel URL (cu121) must match your host GPU driver compatibility. For different CUDA versions, adjust both the index URL and version suffix accordingly.
Data Models and Validation
Pydantic models provide automatic request validation and API documentation:
View Python Models
from pydantic import BaseModel, Field
from typing import List, Optional
class SummarizeIn(BaseModel):
    texts: List[str] = Field(..., description="List of input documents to summarize")
    max_length: int = Field(128, ge=16, le=512)
    min_length: int = Field(32, ge=8, le=256)
    do_sample: bool = False
    temperature: float = Field(1.0, ge=0.1, le=2.0)
    top_p: float = Field(0.9, ge=0.1, le=1.0)

class Summary(BaseModel):
    text: str

class SummarizeOut(BaseModel):
    summaries: List[Summary]
    model: str
    device: str
    latency_ms: float

class HealthOut(BaseModel):
    status: str
    model: str
    device: str
These models ensure:
- Type Safety: Automatic type checking and conversion
- Constraint Validation: Length and range constraints on parameters
- Documentation: Automatic OpenAPI schema generation
- Client Generation: Easy client library creation from schemas
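To see the constraint validation at work, here is a small sketch (assuming the models.py shown above) that exercises the bounds:
# Sketch: exercising the Pydantic constraints defined above (assumes models.py)
from pydantic import ValidationError
from models import SummarizeIn

ok = SummarizeIn(texts=["A long article to condense."], max_length=128, min_length=32)
print(ok.max_length)  # 128

try:
    SummarizeIn(texts=["..."], max_length=600)  # violates le=512
except ValidationError as err:
    print(err.errors()[0]["loc"])  # ('max_length',)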
Application Implementation
View Python Application Code
import os
import time
import logging
from typing import List
import torch
from fastapi import FastAPI, HTTPException, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import HTMLResponse
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from models import SummarizeIn, SummarizeOut, Summary, HealthOut
# ---------------- Config ----------------
MODEL_NAME = os.environ.get("MODEL_NAME", "facebook/bart-large-cnn")
MAX_CONCURRENT_BATCH = int(os.environ.get("MAX_CONCURRENT_BATCH", "8"))
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = os.environ.get("DTYPE", "fp16") # "fp16" or "fp32"
GEN_KW = {"num_beams": int(os.environ.get("NUM_BEAMS", "4"))}
# --------------- Logging ----------------
logging.basicConfig(
level=os.environ.get("LOG_LEVEL", "INFO"),
format="%(asctime)s %(levelname)s %(name)s - %(message)s"
)
log = logging.getLogger("summarizer")
print("DEVICE", DEVICE)
# ----------- Load model/tokenizer -------
log.info(f"Loading model {MODEL_NAME} on {DEVICE} ...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
if DEVICE == "cuda":
    if DTYPE == "fp16":
        model = model.half()
model = model.to(DEVICE)
model.eval()
log.info("Model loaded.")

# FastAPI app instance (its creation and CORS middleware setup are omitted from the original excerpt)
app = FastAPI()
Key Implementation Details:
- Environment-Driven Configuration: All parameters (model name, batch size, precision) are configurable via environment variables, enabling easy deployment variations.
- Mixed Precision Support: The DTYPE=fp16 option reduces GPU memory usage by roughly 50% and increases throughput on modern GPUs with Tensor Cores, without significant accuracy loss.
- Device Detection: Automatic fallback to CPU if CUDA isn't available, useful for development and testing.
- Evaluation Mode: Setting model.eval() disables dropout and other training-only behavior, ensuring consistent inference results.
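To confirm which device and precision are actually in effect after startup, a quick check like the following (a sketch that assumes the model, DEVICE, and torch globals from app.py above) can be logged during initialization:
# Sketch: sanity-check the loaded model's device and precision
param = next(model.parameters())
print(param.device)  # e.g. cuda:0 when CUDA is available
print(param.dtype)   # torch.float16 when DTYPE=fp16 on GPU
if DEVICE == "cuda":
    print(torch.cuda.get_device_name(0))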
Rate Limiting Implementation
View Rate Limiting Code
from time import monotonic
from collections import deque
RATE_PER_SEC = float(os.environ.get("RATE_PER_SEC", "20"))
WINDOW_SEC = 1.0
bucket = deque()
def rate_limit_ok() -> bool:
    now = monotonic()
    while bucket and now - bucket[0] > WINDOW_SEC:
        bucket.popleft()
    if len(bucket) < RATE_PER_SEC:
        bucket.append(now)
        return True
    return False
This sliding-window rate limiter:
- Maintains a window of recent request timestamps
- Allows bursts up to the per-second limit
- Provides configurable limits per deployment environment
- Uses efficient deque operations for amortized O(1) updates
Because the deque lives in process memory, the limit applies per worker process rather than globally across replicas.
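To see the window semantics in isolation, a hypothetical standalone check (not part of the service code) behaves like this:
# Hypothetical check: with the default RATE_PER_SEC=20, a burst of 30 calls
# in the same second admits exactly 20 requests
allowed = sum(rate_limit_ok() for _ in range(30))
print(allowed)  # 20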
API Endpoints
View API Implementation
Health Check Endpoint
@app.get("/health", response_model=HealthOut)
def health():
    return HealthOut(status="ok", model=MODEL_NAME, device=DEVICE)
The health endpoint provides:
- Service Status: Quick readiness check
- Configuration Visibility: Model and device information
- Load Balancer Integration: Standard endpoint for orchestration tools
Summarization Endpoint
@app.post("/summarize", response_model=SummarizeOut)
def summarize(payload: SummarizeIn, request: Request):
    if not rate_limit_ok():
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    texts = payload.texts[:MAX_CONCURRENT_BATCH]
    if not texts:
        raise HTTPException(status_code=400, detail="No texts provided")
    start = time.time()
    with torch.inference_mode():
        batch = tokenizer(
            texts, padding=True, truncation=True, max_length=1024, return_tensors="pt"
        )
        if DEVICE == "cuda":
            batch = {k: v.to(DEVICE, non_blocking=True) for k, v in batch.items()}
        outputs = model.generate(
            **batch,
            max_length=payload.max_length,
            min_length=payload.min_length,
            do_sample=payload.do_sample,
            temperature=payload.temperature,
            top_p=payload.top_p,
            **GEN_KW
        )
    decoded: List[str] = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    latency_ms = (time.time() - start) * 1000.0
    return SummarizeOut(
        summaries=[Summary(text=s) for s in decoded],
        model=MODEL_NAME,
        device=DEVICE,
        latency_ms=round(latency_ms, 2),
    )
Performance Optimizations:
- Batch Processing: Processing multiple texts together maximizes GPU utilization
- Inference Mode: torch.inference_mode() disables gradient tracking, reducing memory use and computation
- Non-blocking Transfers: non_blocking=True overlaps CPU-GPU data transfer with computation
- Padding and Truncation: Ensures consistent tensor shapes for efficient batching
Docker Configuration
View Python Dockerfile
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive \
PIP_NO_CACHE_DIR=1 \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1
RUN apt-get update && apt-get install -y --no-install-recommends \
python3.10 python3.10-venv python3-pip git curl ca-certificates && \
ln -sf /usr/bin/python3.10 /usr/bin/python && \
rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt /app/
RUN python -m pip install --upgrade pip && pip install -r requirements.txt
COPY app.py models.py start.sh /app/
RUN chmod +x /app/start.sh
ENV MODEL_NAME=facebook/bart-large-cnn \
NUM_BEAMS=4 \
DTYPE=fp16 \
MAX_CONCURRENT_BATCH=8 \
RATE_PER_SEC=50
EXPOSE 8000
CMD ["/app/start.sh"]
Docker Best Practices:
- Base Image Selection: Using CUDA runtime (not devel) reduces image size while maintaining inference capability
- Layer Caching: Copying requirements before code enables faster rebuilds
- Cleanup: Removing apt lists reduces image size
- Default Configuration: Environment variables provide sensible defaults
Deep Dive: Rust Implementation (Axum + rust-bert)
Design Philosophy
The Rust implementation prioritizes performance, memory safety, and predictable latency. Using Axum for the web framework and rust-bert for model inference provides a type-safe, high-performance alternative to Python.
Dependencies Configuration
View Rust Dependencies
[package]
name = "summarizer-axum"
version = "0.1.0"
edition = "2021"
[dependencies]
axum = "0.7"
tokio = { version = "1.39", features = ["rt-multi-thread", "macros"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
tower = "0.5"
tower-http = { version = "0.5", features = ["trace", "cors"] }
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
anyhow = "1.0"
once_cell = "1.19"
# rust-bert + tch (libtorch) bindings
rust-bert = "0.23"
tch = "0.17"
# simple rate limiting
governor = { version = "0.6", features = ["std"] }
# Explicitly add console with std feature
console = { version = "0.16", features = ["std"] }
Key Library Choices:
- Axum: Modern, ergonomic web framework built on Tokio
- rust-bert: Rust bindings for transformer models
- tch: Rust bindings to libtorch (PyTorch's C++ backend)
- governor: High-performance rate limiting
- tokio: Async runtime for concurrent request handling
Application Structure
View Rust Application Code
use std::{net::SocketAddr, sync::Arc, time::Instant, num::NonZeroU32};
use anyhow::Result;
use axum::{
extract::State,
http::{HeaderValue, Method, StatusCode},
routing::{get, post},
Json, Router,
};
use governor::{
clock::DefaultClock,
state::{InMemoryState, NotKeyed},
Quota,
RateLimiter,
};
use once_cell::sync::Lazy;
use rust_bert::pipelines::summarization::SummarizationModel;
use serde::{Deserialize, Serialize};
use tokio::sync::Mutex;
use tower_http::{cors::CorsLayer, trace::TraceLayer};
use tracing::{error, info};
// Needed for the registry().with(...).init() chain in main():
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};
Rust Advantages:
- Zero-cost Abstractions: Serialization code is generated at compile time
- Type Safety: Impossible to construct invalid responses
- Performance: No runtime validation overhead
Request/Response Models
View Rust Models
#[derive(Debug, Deserialize)]
struct SummarizeIn {
texts: Vec<String>,
max_length: Option<i64>,
min_length: Option<i64>,
do_sample: Option<bool>,
temperature: Option<f64>,
top_p: Option<f64>,
}
#[derive(Debug, Serialize)]
struct Summary {
text: String,
}
#[derive(Debug, Serialize)]
struct SummarizeOut {
summaries: Vec<Summary>,
model: String,
device: String,
latency_ms: f64,
}
Application State and Initialization
View Rust Main Function
#[derive(Clone)]
struct AppState {
model: Arc<Mutex<SummarizationModel>>,
model_name: String,
device: String,
}
#[tokio::main]
async fn main() -> Result<()> {
// Setup logging
tracing_subscriber::registry()
.with(
tracing_subscriber::EnvFilter::try_from_default_env()
.unwrap_or_else(|_| "info,tower_http=info".into()),
)
.with(tracing_subscriber::fmt::layer())
.init();
println!("CUDA available: {}", tch::Cuda::is_available());
println!("CUDNN available: {}", tch::Cuda::cudnn_is_available());
// Load summarization model
let model_name =
std::env::var("MODEL_NAME").unwrap_or_else(|_| "facebook/bart-large-cnn".into());
info!("Loading model: {}", &model_name);
// Uses rust-bert's default summarization configuration; MODEL_NAME above is echoed
// in responses but does not change which weights are loaded here.
let model = SummarizationModel::new(Default::default())?;
let device = if tch::Cuda::is_available() {
"cuda".into()
} else {
"cpu".into()
};
info!("Device: {}", device);
// Shared application state
let state = AppState {
model: Arc::new(Mutex::new(model)),
model_name: model_name.clone(),
device,
};
// Setup CORS
let cors = CorsLayer::new()
.allow_origin(HeaderValue::from_static("*"))
.allow_methods([Method::GET, Method::POST])
.allow_headers([axum::http::header::CONTENT_TYPE]);
// Build routes
let app = Router::new()
.route("/health", get(health))
.route("/summarize", post(summarize))
.with_state(state)
.layer(TraceLayer::new_for_http())
.layer(cors);
// Start server
let port: u16 = std::env::var("PORT")
.ok()
.and_then(|s| s.parse().ok())
.unwrap_or(8000);
let addr = SocketAddr::from(([0, 0, 0, 0], port));
info!("Listening on {}", addr);
axum::serve(tokio::net::TcpListener::bind(addr).await?, app).await?;
Ok(())
}
Implementation Highlights:
- Structured Concurrency: Tokio's async runtime manages thousands of concurrent connections efficiently
- Shared State: Arc<Mutex<...>> allows safe concurrent access to the model
- Layer Composition: Middleware layers provide logging, CORS, and tracing
- CUDA Detection: Runtime checks confirm GPU availability
Rate Limiting
View Rust Rate Limiting
static RLIM: Lazy<RateLimiter<NotKeyed, InMemoryState, DefaultClock>> = Lazy::new(|| {
let q = std::env::var("RATE_PER_SEC")
.ok()
.and_then(|s| s.parse::<u32>().ok())
.unwrap_or(50);
RateLimiter::direct(Quota::per_second(NonZeroU32::new(q).unwrap()))
});
The governor crate provides:
- Low Overhead: Atomic operations for rate tracking
- Flexibility: Per-second, per-minute, or custom quotas
- Global State: Single rate limiter instance across all requests
API Endpoints Implementation
View Rust API Endpoints
async fn health(State(state): State<AppState>) -> Json<serde_json::Value> {
Json(serde_json::json!({
"status": "ok",
"model": state.model_name,
"device": state.device
}))
}
async fn summarize(
State(state): State<AppState>,
Json(input): Json<SummarizeIn>,
) -> Result<Json<SummarizeOut>, (StatusCode, String)> {
if RLIM.check().is_err() {
return Err((StatusCode::TOO_MANY_REQUESTS, "rate limit".into()));
}
if input.texts.is_empty() {
return Err((StatusCode::BAD_REQUEST, "no texts".into()));
}
let start = Instant::now();
let m = state.model.clone();
let texts = input.texts.clone();
let outputs = tokio::task::spawn_blocking(move || {
let mut m = m.blocking_lock();
m.summarize(&texts)
})
.await
.map_err(internal)?;
let summaries = outputs
.into_iter()
.flatten()
.map(|t| Summary { text: t })
.collect::<Vec<_>>();
let latency_ms = start.elapsed().as_secs_f64() * 1000.0;
Ok(Json(SummarizeOut {
summaries,
model: state.model_name.clone(),
device: state.device.clone(),
latency_ms: (latency_ms * 100.0).round() / 100.0,
}))
}
// `internal` is the error helper used in map_err above; its definition is not shown
// in the original excerpt, but one minimal version maps any displayable error to a 500:
fn internal<E: std::fmt::Display>(e: E) -> (StatusCode, String) {
(StatusCode::INTERNAL_SERVER_ERROR, e.to_string())
}
Critical Design Pattern: The spawn_blocking call is essential. Model inference is CPU/GPU-intensive and synchronous, which would block the async runtime. Moving it to a blocking thread pool prevents blocking other requests.
Rust Docker Configuration
View Rust Dockerfile
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04 AS builder
ARG LIBTORCH_URL=https://download.pytorch.org/libtorch/cu121/libtorch-cxx11-abi-shared-with-deps-2.4.0%2Bcu121.zip
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
curl unzip ca-certificates pkg-config build-essential git \
libssl-dev \
&& rm -rf /var/lib/apt/lists/*
RUN curl https://sh.rustup.rs -sSf | sh -s -- -y
ENV PATH=/root/.cargo/bin:$PATH
WORKDIR /opt
RUN curl -L "$LIBTORCH_URL" -o libtorch.zip && unzip libtorch.zip && rm libtorch.zip
ENV LIBTORCH=/opt/libtorch
ENV LD_LIBRARY_PATH=/opt/libtorch/lib:$LD_LIBRARY_PATH
WORKDIR /build
COPY Cargo.toml /build/
COPY src /build/src
RUN cargo build --release
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
ca-certificates libssl3 \
&& rm -rf /var/lib/apt/lists/*
COPY --from=builder /opt/libtorch /opt/libtorch
ENV LIBTORCH=/opt/libtorch
ENV LD_LIBRARY_PATH=/opt/libtorch/lib:$LD_LIBRARY_PATH
ENV RUST_LOG=info
ENV MODEL_NAME=facebook/bart-large-cnn
ENV RATE_PER_SEC=80
ENV PORT=8000
COPY --from=builder /build/target/release/summarizer-axum /usr/local/bin/summarizer-axum
EXPOSE 8000
CMD ["/usr/local/bin/summarizer-axum"]
Multi-Stage Build Benefits:
- Smaller Final Image: Runtime image excludes build tools (often 50% size reduction)
- Security: Fewer packages means smaller attack surface
- Performance: Only runtime dependencies affect startup time
- Clarity: Separation of build and runtime concerns
GPU Orchestration with Docker Compose
View Docker Compose Configuration
version: "3.9"
services:
  py-summarizer:
    build:
      context: ./python-fastapi
    image: py-summarizer:cuda
    environment:
      - MODEL_NAME=facebook/bart-large-cnn
      - DTYPE=fp16
      - NUM_BEAMS=4
      - MAX_CONCURRENT_BATCH=8
      - RATE_PER_SEC=100
    ports:
      - "8001:8000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    runtime: nvidia
  rs-summarizer:
    build:
      context: ./rust-axum
    image: rs-summarizer:cuda
    environment:
      - MODEL_NAME=facebook/bart-large-cnn
      - RATE_PER_SEC=120
      - RUST_LOG=info
    ports:
      - "8002:8000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    runtime: nvidia
Configuration Notes:
- Port Mapping: Different host ports (8001, 8002) allow simultaneous testing
- GPU Access: runtime: nvidia requires the NVIDIA Container Toolkit
- Resource Limits: The deploy section reserves GPU resources
- Environment-Specific Settings: Each service can have different rate limits
API Usage Examples
Making Requests
View Example API Calls
Both services expose identical APIs. Here's how to interact with them:
# Python service (port 8001)
curl -X POST http://localhost:8001/summarize \
-H "Content-Type: application/json" \
-d '{
"texts": [
"Transformers are fantastic. They allow attention-based sequence modeling
which has revolutionized natural language processing. These models can
process entire sequences in parallel, unlike recurrent neural networks."
],
"max_length": 80,
"min_length": 32,
"do_sample": false
}'
# Rust service (port 8002)
curl -X POST http://localhost:8002/summarize \
-H "Content-Type: application/json" \
-d '{
"texts": [
"Climate change is one of the most pressing challenges facing humanity.
Rising temperatures and extreme weather events are becoming more frequent."
],
"max_length": 60,
"min_length": 20
}'
Response Format
View Example Response
Both services return identical JSON structure:
{
"summaries": [
{
"text": "Transformers use attention-based sequence modeling. They process sequences in parallel, unlike RNNs."
}
],
"model": "facebook/bart-large-cnn",
"device": "cuda",
"latency_ms": 945.12
}
Health Checks
View Health Check Examples
# Check Python service
curl http://localhost:8001/health
# Check Rust service
curl http://localhost:8002/health
# Response format:
{
"status": "ok",
"model": "facebook/bart-large-cnn",
"device": "cuda"
}
Building and Deploying
Individual Service Deployment
Python Service Deployment
cd summarizer-service/python-fastapi
# Build image
docker build -t py-summarizer:cuda .
# Run with GPU
docker run --rm -it --gpus all -p 8001:8000 \
-e MODEL_NAME=facebook/bart-large-cnn \
-e DTYPE=fp16 \
-e NUM_BEAMS=4 \
-e MAX_CONCURRENT_BATCH=8 \
-e RATE_PER_SEC=100 \
py-summarizer:cuda
Rust Service Deployment
cd summarizer-service/rust-axum
# Build image
docker build -t rs-summarizer:cuda .
# Run with GPU
docker run --rm -it --gpus all -p 8002:8000 \
-e MODEL_NAME=facebook/bart-large-cnn \
-e RATE_PER_SEC=120 \
-e RUST_LOG=info \
rs-summarizer:cuda
Compose Deployment
docker compose -f compose.gpu.yml up --build
Comprehensive Load Testing Analysis
Load testing reveals the real-world performance characteristics of both implementations. The following results were obtained on identical hardware with an NVIDIA GPU.
Testing Methodology
All tests use hey, a modern HTTP load generator:
Installing Hey Load Generator
Installation
# macOS
brew install hey
# Linux
sudo apt update
sudo apt install golang-go -y
source ~/.profile
go install github.com/rakyll/hey@latest
echo 'export PATH=$PATH:$(go env GOPATH)/bin' >> ~/.bashrc
source ~/.bashrc
hey -version
# Windows
# Download Go from https://go.dev/dl/
# Install MSI, restart PowerShell
go install github.com/rakyll/hey@latest
# Add %USERPROFILE%\go\bin to PATH if needed
hey -version
Test parameters:
- Duration: 30-60 seconds per test
- Concurrency: 8-16 concurrent connections
- Request Body: Consistent 100+ word text samples
- Metrics: Throughput (req/sec), latency (percentiles), error rates
Performance Comparison Charts
The following charts visualize the performance differences across various test scenarios:
Latency Comparison
Figure 1: Average latency comparison across different test scenarios. Lower values indicate better performance.
Throughput Comparison
Figure 2: Request throughput comparison showing requests per second. Higher values indicate better performance.
Performance Advantages Overview
Figure 3: Summary of Rust's performance advantages over Python across all test scenarios.
Detailed Test Results
Test 1: Light Load - Single Text (8 concurrent, 30s)
Python FastAPI Service:
hey -z 30s -c 8 -m POST \
-H "Content-Type: application/json" \
-d '{
"texts": ["Artificial intelligence and machine learning have revolutionized the way we process and understand data. Deep learning models, particularly transformers, have shown remarkable performance across various natural language processing tasks including summarization, translation, and question answering."],
"max_length": 80,
"min_length": 32
}' \
http://localhost:8001/summarize
Results:
Summary:
Total: 39.1539 secs
Slowest: 13.2829 secs
Fastest: 12.7253 secs
Average: 13.0247 secs
Requests/sec: 0.6130
Response time histogram:
P50: 13.0765 secs
P75: 13.1394 secs
P90: 13.2120 secs
P95: 13.2829 secs
Status code distribution:
[200] 24 responses
Rust Axum Service:
hey -z 30s -c 8 -m POST \
-H "Content-Type: application/json" \
-d '{
"texts": ["Artificial intelligence and machine learning have revolutionized the way we process and understand data. Deep learning models, particularly transformers, have shown remarkable performance across various natural language processing tasks including summarization, translation, and question answering."],
"max_length": 80,
"min_length": 32
}' \
http://localhost:8002/summarize
Results:
Summary:
Total: 38.7155 secs
Slowest: 10.1396 secs
Fastest: 1.5438 secs
Average: 8.6167 secs
Requests/sec: 0.8265
Response time histogram:
P50: 9.4874 secs
P75: 9.6208 secs
P90: 9.8464 secs
P95: 10.1396 secs
Status code distribution:
[200] 32 responses
Analysis:
- Throughput: Rust achieved 35% higher throughput (0.83 vs 0.61 req/sec)
- Latency: Rust's average latency was 34% lower (8.6s vs 13.0s)
- Consistency: Rust showed better latency distribution (1.5s-10.1s vs 12.7s-13.3s)
- Completed Requests: Rust completed 33% more requests (32 vs 24)
Test 2: Moderate Load - Single Text (16 concurrent, 60s)
Python FastAPI Service:
hey -z 60s -c 16 -m POST -t 30 \
-H "Content-Type: application/json" \
-d '{
"texts": ["Artificial intelligence and machine learning have revolutionized the way we process and understand data. Deep learning models, particularly transformers, have shown remarkable performance across various natural language processing tasks including summarization, translation, and question answering."],
"max_length": 80,
"min_length": 32
}' \
http://localhost:8001/summarize
Results:
Summary:
Total: 77.3369 secs
Slowest: 26.3156 secs
Fastest: 25.1225 secs
Average: 25.7400 secs
Requests/sec: 0.6207
Response time histogram:
P50: 25.7552 secs
P75: 25.9583 secs
P90: 26.1405 secs
P95: 26.1553 secs
Status code distribution:
[200] 48 responses
Rust Axum Service:
hey -z 60s -c 16 -m POST -t 30 \
-H "Content-Type: application/json" \
-d '{
"texts": ["Artificial intelligence and machine learning have revolutionized the way we process and understand data. Deep learning models, particularly transformers, have shown remarkable performance across various natural language processing tasks including summarization, translation, and question answering."],
"max_length": 80,
"min_length": 32
}' \
http://localhost:8002/summarize
Results:
Summary:
Total: 78.7227 secs
Slowest: 21.4153 secs
Fastest: 3.0054 secs
Average: 17.6692 secs
Requests/sec: 0.8003
Response time histogram:
P50: 19.4342 secs
P75: 19.6269 secs
P90: 19.7422 secs
P95: 20.0248 secs
Status code distribution:
[200] 63 responses
Analysis:
- Throughput: Rust maintained 29% higher throughput (0.80 vs 0.62 req/sec)
- Latency: Rust's average latency was 31% lower (17.7s vs 25.7s)
- Scalability: Rust handled increased concurrency better (63 vs 48 completed)
- Latency Variance: Rust showed wider variance (3-21s vs 25-26s), suggesting better request parallelization
Test 3: Batch Processing (4 texts, 8 concurrent, 30s)
Python FastAPI Service:
hey -z 30s -c 8 -m POST \
-H "Content-Type: application/json" \
-d '{
"texts": [
"Artificial intelligence and machine learning have revolutionized the way we process and understand data. Deep learning models have shown remarkable performance.",
"Climate change is one of the most pressing challenges facing humanity. Rising temperatures and extreme weather events are becoming more frequent.",
"The global economy has seen significant changes in recent years with the rise of digital currencies and blockchain technology.",
"Space exploration continues to advance with new missions to Mars and plans for lunar bases in the coming decades."
],
"max_length": 60,
"min_length": 20
}' \
http://localhost:8001/summarize
Results:
Summary:
Total: 35.7881 secs
Slowest: 19.5253 secs
Fastest: 16.2402 secs
Average: 17.8389 secs
Requests/sec: 0.4471
Response time histogram:
P50: 18.8195 secs
P75: 19.2788 secs
P90: 19.5253 secs
Status code distribution:
[200] 16 responses
Total summaries: 64 (16 requests × 4 texts)
Rust Axum Service:
hey -z 30s -c 8 -m POST \
-H "Content-Type: application/json" \
-d '{
"texts": [
"Artificial intelligence and machine learning have revolutionized the way we process and understand data. Deep learning models have shown remarkable performance.",
"Climate change is one of the most pressing challenges facing humanity. Rising temperatures and extreme weather events are becoming more frequent.",
"The global economy has seen significant changes in recent years with the rise of digital currencies and blockchain technology.",
"Space exploration continues to advance with new missions to Mars and plans for lunar bases in the coming decades."
],
"max_length": 60,
"min_length": 20
}' \
http://localhost:8002/summarize
Results:
Summary:
Total: 38.3619 secs
Slowest: 10.9850 secs
Fastest: 2.6813 secs
Average: 8.8654 secs
Requests/sec: 0.8081
Response time histogram:
P50: 9.6433 secs
P75: 9.7069 secs
P90: 9.8297 secs
P95: 10.9850 secs
Status code distribution:
[200] 31 responses
Total summaries: 124 (31 requests × 4 texts)
Analysis:
- Batch Efficiency: Rust processed 94% more total summaries (124 vs 64)
- Throughput: Rust achieved 81% higher request throughput (0.81 vs 0.45 req/sec)
- Latency: Rust's average latency was 50% lower (8.9s vs 17.8s)
- GPU Utilization: Better parallelization in Rust translates to more efficient GPU usage
Performance Summary Table
| Metric | Python (Single) | Rust (Single) | Python (Batch) | Rust (Batch) |
|---|---|---|---|---|
| Avg Latency | 13.0s | 8.6s (-34%) | 17.8s | 8.9s (-50%) |
| P95 Latency | 13.3s | 10.1s (-24%) | 19.5s | 11.0s (-44%) |
| Throughput | 0.61 req/s | 0.83 req/s (+35%) | 0.45 req/s | 0.81 req/s (+81%) |
| Completed (30s) | 24 | 32 (+33%) | 16 | 31 (+94%) |
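The percentages in the table follow directly from the raw hey output above; a quick arithmetic check for the single-text scenario:
# Quick check of how the single-text deltas are derived from the hey output above
py_lat, rs_lat = 13.0247, 8.6167   # average latency in seconds
py_rps, rs_rps = 0.6130, 0.8265    # requests per second
print(round((py_lat - rs_lat) / py_lat * 100))  # 34 -> "-34%" latency
print(round((rs_rps - py_rps) / py_rps * 100))  # 35 -> "+35%" throughput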
Key Insights from Load Testing
1. Rust Performance Advantage:
- Consistently 30-50% lower latency across all test scenarios
- Better throughput, especially under increased load
- More efficient request handling and GPU utilization
2. Batching Benefits:
- Both services show improved efficiency with batched requests
- Rust's advantage is more pronounced with batching (81% vs 35% improvement)
- Proper batching can significantly increase throughput without additional hardware
3. Scalability Patterns:
- Python shows more consistent latency but lower throughput
- Rust shows wider latency distribution but higher overall throughput
- Both services maintain stability without errors under tested loads
4. Production Considerations:
- For high-throughput scenarios: Rust offers clear advantages
- For predictable latency requirements: Python's consistency may be valuable
- Batching should be enabled in production for both implementations
Production Deployment Considerations
Model Selection Strategy
Available Model Options
The default facebook/bart-large-cnn provides excellent summarization quality but isn't the only option:
Standard Models:
- facebook/bart-large-cnn: Best quality, slower inference (~13s Python, ~9s Rust)
- sshleifer/distilbart-cnn-12-6: Distilled version, ~40% faster, ~95% of the quality
- facebook/bart-base: Smaller model, ~60% faster, ~90% of the quality
- google/pegasus-cnn_dailymail: Alternative architecture, comparable quality
Switching Models:
Simply set the MODEL_NAME environment variable:
docker run --gpus all -e MODEL_NAME=sshleifer/distilbart-cnn-12-6 ...
Model Selection Criteria:
- Quality Priority: BART-Large-CNN or PEGASUS
- Speed Priority: DistilBART or BART-Base
- Balanced: DistilBART-CNN-12-6 offers best quality/speed tradeoff
- Domain-Specific: Fine-tune on your corpus or use domain-specific models
Optimization Strategies
Mixed Precision & Quantization
Mixed Precision Inference (Python):
FP16 reduces memory usage by ~50% and increases throughput:
# Enable FP16
docker run -e DTYPE=fp16 ...
# Verify GPU supports FP16 efficiently:
nvidia-smi --query-gpu=compute_cap --format=csv
# Compute capability >= 7.0 recommended for efficient FP16
Benefits:
- 50% memory reduction allows larger batches
- 1.5-2× faster inference on Tensor Core GPUs
- Negligible quality impact for most summarization tasks
Quantization (Advanced):
For even faster inference:
- INT8 quantization: 4× memory reduction, 2-3× speedup
- Tools: PyTorch quantization API, ONNX Runtime, TensorRT
- Tradeoff: 1-2% quality degradation
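As noted above, GPU INT8 typically goes through ONNX Runtime or TensorRT; for CPU deployments, PyTorch's dynamic quantization is a low-effort starting point. A minimal sketch, assuming the same Hugging Face model used throughout this guide:
# Sketch: dynamic INT8 quantization of the Linear layers (CPU inference path)
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# quantized.generate(...) keeps the same API; expect a smaller memory footprint and
# faster CPU inference at the cost of a small quality drop.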
Batching Configuration
# Python
MAX_CONCURRENT_BATCH = 8 # Process up to 8 texts together
# Rust
// Processes entire input array as batch
let summaries = model.summarize(&texts);
Batching Best Practices:
- Start with batch size 4-8 for testing
- Monitor GPU memory usage (nvidia-smi)
- Increase until GPU memory is ~80% utilized
- Balance latency vs throughput requirements
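If clients may send more texts than the configured batch size, one option beyond simple truncation is to process the request in chunks. A sketch follows; run_batch is a hypothetical helper wrapping the tokenizer and model.generate code shown earlier:
# Sketch: process oversized requests in GPU-friendly chunks (run_batch is hypothetical)
from typing import Iterable, List

def chunked(texts: List[str], batch_size: int = 8) -> Iterable[List[str]]:
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

# summaries = [s for chunk in chunked(payload.texts) for s in run_batch(chunk)]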
Cold Start Optimization
Pre-download Models in Docker
# Add to Dockerfile before COPY
RUN python -c "from transformers import AutoTokenizer, AutoModelForSeq2SeqLM; \
    AutoTokenizer.from_pretrained('facebook/bart-large-cnn'); \
    AutoModelForSeq2SeqLM.from_pretrained('facebook/bart-large-cnn')"
Benefits:
- No download delay on first request
- Faster container startup
- More predictable deployment
Tradeoffs:
- Larger image size (~3-4GB)
- Rebuild required for model updates
- Storage costs in registry
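An alternative to baking weights into the image is mounting a shared volume and pointing the Hugging Face cache at it. A sketch, where the /models path is an assumed mount point:
# Sketch: load from a mounted cache volume instead of baking weights into the image
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CACHE_DIR = "/models"  # assumed volume mount path
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn", cache_dir=CACHE_DIR)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn", cache_dir=CACHE_DIR)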
Security Considerations
Input Sanitization
# Limit text length to prevent abuse
if any(len(text) > 10000 for text in payload.texts):
    raise HTTPException(400, "Text too long")
# Limit batch size
if len(payload.texts) > MAX_CONCURRENT_BATCH:
    payload.texts = payload.texts[:MAX_CONCURRENT_BATCH]
When to Choose Python vs Rust
Choose Python When:
Python Use Cases
1. Rapid Prototyping and Research:
- Experimenting with different models and architectures
- Quick iterations during development
- Research team primarily uses Python
2. Rich Ecosystem Requirements:
- Need latest model releases from Hugging Face
- Extensive data preprocessing with pandas/numpy
- Integration with ML training pipelines
3. Team Expertise:
- Team is proficient in Python
- Limited Rust experience
- Need to onboard new developers quickly
4. Simple Deployment:
- Low to moderate traffic
- Not latency-critical
- Infrastructure already Python-based
Python Example Use Case: A content management system that summarizes articles for preview cards. Traffic is moderate (100 req/min), and consistency matters more than raw speed.
Choose Rust When:
Rust Use Cases
1. High-Performance Requirements:
- Serving thousands of requests per second
- Strict latency SLAs (P99 < 100ms)
- Maximum throughput per GPU
2. Resource Efficiency:
- Cost optimization through better hardware utilization
- Edge deployment with limited resources
- Multiple services on same hardware
3. Production Stability:
- Long-running services requiring stability
- Memory safety is critical
- Predictable performance under load
4. Microservice Architecture:
- Service mesh with multiple small services
- Need for low-overhead HTTP/gRPC services
- Container orchestration at scale
Rust Example Use Case: A real-time news aggregation platform that summarizes breaking news articles. Must handle traffic spikes (1000+ req/min) with sub-second P95 latency.
Hybrid Approach
Hybrid Strategies
Many organizations use both:
Pattern 1: Development in Python, Deployment in Rust
- Prototype and train models in Python
- Export to ONNX or safetensors format
- Deploy with Rust for production inference
Pattern 2: Different Services Different Languages
- Complex preprocessing: Python (pandas, spaCy)
- Core inference: Rust (performance-critical)
- Post-processing: Python (flexibility)
Pattern 3: Progressive Migration
- Start with Python for all components
- Profile and identify bottlenecks
- Rewrite only performance-critical paths in Rust
Troubleshooting Common Issues
GPU Not Detected
Symptoms:
- Service logs show device: cpu
- Slow inference times
Solutions:
# Verify NVIDIA driver
nvidia-smi
# Check CUDA version compatibility
nvcc --version
# Verify Docker GPU support
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
# Check container runtime
docker info | grep -i runtime
# Should show: Runtimes: nvidia runc
Common Issues:
- NVIDIA Container Toolkit not installed
- Docker daemon not configured for GPU
- CUDA version mismatch between host and container
Out of Memory Errors
Symptoms:
CUDA out of memoryerrors- Container crashes under load
Solutions:
# Monitor GPU memory
nvidia-smi dmon -s mu
# Reduce batch size
docker run -e MAX_CONCURRENT_BATCH=4 ...
# Enable FP16 (Python)
docker run -e DTYPE=fp16 ...
# Use smaller model
docker run -e MODEL_NAME=sshleifer/distilbart-cnn-12-6 ...
Memory Usage Guide:
- BART-Large-CNN FP32: ~1.6GB
- BART-Large-CNN FP16: ~800MB
- DistilBART FP16: ~400MB
- Plus ~200MB per batch item
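To correlate these figures with what the Python service actually holds, PyTorch's memory counters can be logged from inside the process; a minimal sketch:
# Sketch: report GPU memory from inside the Python service using PyTorch counters
import torch

if torch.cuda.is_available():
    allocated_gb = torch.cuda.memory_allocated() / 1024**3
    reserved_gb = torch.cuda.memory_reserved() / 1024**3
    print(f"allocated={allocated_gb:.2f} GiB reserved={reserved_gb:.2f} GiB")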
Conclusion
Building production-ready transformer inference services requires careful consideration of performance, reliability, and operational concerns. Through our comprehensive exploration of both Python and Rust implementations, we've demonstrated:
Key Takeaways:
-
Both Languages Have Merit: Python excels in rapid development and ecosystem maturity, while Rust provides superior performance and resource efficiency.
-
Performance Matters: Real-world load testing shows Rust can deliver 30-50% lower latency and 35-81% higher throughput, with the advantage increasing under load and with batching.
-
Production Readiness Requires More Than Code: Health checks, rate limiting, proper error handling, monitoring, and security are essential components often overlooked in tutorials.
-
GPU Optimization is Critical: Proper FP16 configuration, batching strategies, and CUDA optimization can double throughput without additional hardware costs.
-
Architecture Decisions Impact Operations: Multi-stage Docker builds, thoughtful concurrency models, and autoscaling strategies significantly affect deployment complexity and costs.
Decision Framework:
- Start with Python if your team is Python-focused, you need rapid iteration, or you're still in research/prototyping phase
- Choose Rust if you have strict performance requirements, need maximum efficiency, or are building large-scale production systems
- Consider a hybrid approach for the best of both worlds: develop in Python, deploy in Rust
Ready to Optimize Your AI/ML Infrastructure?
At NavyaAI, we specialize in building production-grade, high-performance AI/ML solutions that deliver real business impact. If the performance optimizations in this guide resonate with your needs, we can help take your infrastructure to the next level.
Our Core Services
Model Inference Optimization
Transform your ML models from research prototypes to production powerhouses. We optimize inference pipelines for maximum throughput and minimum latency, leveraging advanced techniques like quantization, model pruning, and hardware-specific acceleration.
Key Capabilities:
- GPU/TPU optimization and acceleration
- Multi-model serving architectures
- Real-time inference pipeline design
- Cost reduction through efficient resource utilization
Model Surgery & Optimization
We perform deep architectural optimizations on your models without sacrificing accuracy. Our model surgery techniques reduce computational overhead while maintaining or even improving model performance.
Services Include:
- Model compression and distillation
- Architecture-specific optimizations
- Custom CUDA kernel development
- Inference-optimized model creation and training
DevOps & MLOps Excellence
Build robust, scalable infrastructure for your AI/ML workloads. We design and implement DevOps and MLOps pipelines that enable rapid iteration while maintaining production stability.
Expertise:
- CI/CD for ML pipelines
- Containerization and orchestration (Docker, Kubernetes)
- Infrastructure as Code (Terraform, Ansible)
- Monitoring, logging, and observability for ML systems
Sprint AI Applications
Need to go from concept to production fast? Our sprint methodology delivers fully-functional AI applications in record time, without compromising on quality or performance.
Rapid Delivery:
- 2-4 week MVP development cycles
- Production-ready architectures from day one
- Scalable foundations for future growth
- Performance benchmarking and optimization
Technology Stack
We work across the full spectrum of modern AI/ML technologies:
Languages: Rust, Python, Mojo, Golang, C
Domains: AI/ML, HPC (High-Performance Computing), MLOps, DevOps
Focus: Production-grade, optimized solutions for enterprise-scale deployments
Whether you're building real-time inference systems, distributed training infrastructure, or edge AI applications, we deliver production-grade, performance-optimized HPC and AI/ML solutions tailored to your specific requirements.
Let's Build Together
Ready to optimize your AI infrastructure? Whether you need help with model optimization, infrastructure scaling, or building production ML systems from scratch, we're here to help.
Get in touch to discuss how we can accelerate your AI/ML initiatives.
Additional Resources
Related Sections:
- Python Implementation - Detailed Python code walkthrough
- Rust Implementation - Complete Rust implementation guide
- Decision Framework - Choosing the right technology
- Production Considerations - Deployment best practices