Engineering · Featured

Building Production-Ready GPU-Accelerated Transformer Summarization Services: Python vs Rust

A comprehensive comparison of Python (FastAPI + Hugging Face) versus Rust (Axum + rust-bert) for production transformer inference. Load testing reveals Rust delivers 30-50% lower latency and 35-81% higher throughput.

Srujan Landeri - AI ML Engineer
25 min read
Machine Learning · Rust · Python · Performance · GPU · Transformers · DevOps

Executive Summary

TL;DR: Rust delivers 30-50% lower latency and 35-81% higher throughput compared to Python for GPU-accelerated transformer inference, with the advantage increasing under load and batch processing scenarios.

This comprehensive guide compares production-ready Python (FastAPI + Hugging Face) versus Rust (Axum + rust-bert) implementations for transformer inference services. Through extensive load testing on identical GPU infrastructure, we reveal when performance differences matter and provide complete deployment-ready solutions.

Key Findings:

  • Latency: Rust consistently achieves 30-50% lower response times
  • Throughput: 35-81% higher request processing capacity
  • Scalability: Performance advantage increases with concurrent load
  • Resource Efficiency: Better GPU utilization in batch scenarios

Introduction

The choice between Python and Rust for deploying machine learning models in production has sparked countless debates in the engineering community. Python, with its rich ecosystem and rapid development cycle, has been the undisputed champion for ML workflows. But as organizations scale their inference services to handle millions of requests, a critical question emerges: Is Python's convenience worth the performance trade-offs?

This comprehensive guide presents a head-to-head comparison of Python (FastAPI + Hugging Face Transformers) versus Rust (Axum + rust-bert) for production transformer inference services. We've built functionally equivalent microservices exposing a /summarize endpoint powered by BART-Large-CNN, deployed them on identical GPU infrastructure, and conducted extensive load testing to measure real-world performance differences.

The Results May Surprise You: Our benchmarks reveal that Rust consistently delivers 30-50% lower latency and 35-81% higher throughput across various workload patterns. But the story is more nuanced than raw numbers: we'll explore when these advantages matter, when they don't, and how to make informed decisions for your specific use case.


Why This Comparison Matters

The ML inference landscape is evolving rapidly. As transformer models move from research labs to production systems serving millions of users, engineering teams face a critical decision: optimize for development velocity or runtime performance?

Python's Dominance

Python has been the default choice for ML deployments, offering:

  • Vast ecosystem of ML libraries and frameworks
  • Rapid prototyping and development
  • Extensive community support and documentation
  • Seamless integration with training pipelines

Rust's Promise

Rust is gaining traction in production ML systems, claiming:

  • Significantly lower latency and higher throughput
  • Better resource utilization and cost efficiency
  • Memory safety without garbage collection overhead
  • Predictable performance under load

The Reality: Most articles discussing this topic rely on microbenchmarks or theoretical analysis. This guide differs by providing complete, production-ready implementations and comprehensive load testing on the same hardware, same models, and same GPU infrastructure. The results reveal both the magnitude of performance differences and the scenarios where they matter most.

The Architecture We're Testing

Both implementations follow production-ready patterns:

  • GPU-accelerated inference using NVIDIA CUDA
  • RESTful API with health checks and error handling
  • Rate limiting to prevent abuse
  • Batch processing for improved throughput
  • Docker containerization with multi-stage builds
  • Comprehensive logging and observability

By keeping everything identical except the language and runtime, we can isolate the true performance impact of choosing Python versus Rust for ML inference services.


System Requirements and Platform Compatibility

Before diving into the implementation, it's crucial to understand the hardware and software requirements for running GPU-accelerated transformer models.

GPU Requirements

This guide requires an NVIDIA GPU with CUDA support. The services are designed for GPU acceleration and will not perform optimally on CPU-only systems.

Minimum Requirements
  • NVIDIA GPU with CUDA Compute Capability 6.0 or higher
  • 8GB+ GPU memory (VRAM) recommended for BART-Large-CNN
  • NVIDIA Driver version compatible with CUDA 12.1 or higher

Check Your GPU:

# Linux/Windows with NVIDIA driver installed
nvidia-smi

# Expected output shows GPU model, driver version, and CUDA version

Platform Compatibility

✅ Supported Platforms
  1. Linux (Recommended)

    • Native CUDA support
    • Best performance and compatibility
    • Required: NVIDIA Driver + NVIDIA Container Toolkit
    • Tested on: Ubuntu 20.04+, CentOS 8+, RHEL 8+
  2. Windows (with WSL2)

    • CUDA support via WSL2 (Windows Subsystem for Linux 2)
    • Requires: Windows 10/11 with WSL2 + NVIDIA Driver for Windows
    • Docker Desktop with WSL2 backend
    • Performance: ~5-10% slower than native Linux
    • See: NVIDIA CUDA on WSL2 Guide

❌ Not Supported
  1. macOS
    • CUDA is NOT available on macOS (Apple discontinued NVIDIA GPU support after macOS 10.13)
    • Even with NVIDIA eGPU, CUDA drivers are not available
    • Alternatives for Mac users:
      • Use CPU-only mode (significantly slower, 10-50× latency increase)
      • Deploy to cloud GPU instances (AWS, GCP, Azure)
      • Use Apple Silicon with Metal Performance Shaders (requires different implementation)
      • Remote development on Linux/Windows machines

Verifying Your Setup

# Check NVIDIA driver
nvidia-smi

# Check Docker GPU access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# Check CUDA version compatibility
nvcc --version  # Should show CUDA 12.x or compatible

# Check available GPU memory
nvidia-smi --query-gpu=memory.free --format=csv

Project Architecture Overview

The project implements two functionally equivalent services with the following structure:

summarizer-service/
β”œβ”€ python-fastapi/          # Python implementation
β”‚  β”œβ”€ app.py                # Main FastAPI application
β”‚  β”œβ”€ models.py             # Pydantic models for validation
β”‚  β”œβ”€ requirements.txt      # Python dependencies
β”‚  β”œβ”€ Dockerfile            # GPU-enabled container
β”‚  └─ start.sh              # Entry point script
β”œβ”€ rust-axum/               # Rust implementation
β”‚  β”œβ”€ Cargo.toml            # Rust dependencies
β”‚  β”œβ”€ src/
β”‚  β”‚  └─ main.rs            # Main Axum application
β”‚  β”œβ”€ Dockerfile            # GPU-enabled container
β”‚  └─ build.sh              # Build script
β”œβ”€ compose.gpu.yml          # Docker Compose for both services
β”œβ”€ README.md                # Deployment documentation
└─ loadtest.md              # Load testing guide

Both services provide identical API endpoints:

  • GET /health: Service health and status information
  • POST /summarize: Text summarization with configurable parameters

Deep Dive: Python Implementation (FastAPI + Hugging Face Transformers)

Design Philosophy

The Python implementation leverages the Hugging Face Transformers library, which has become the de facto standard for working with transformer models. FastAPI provides a modern, high-performance framework with automatic API documentation, type validation, and async support.

Core Dependencies

The Python service relies on several key libraries:

View Dependencies
fastapi==0.115.0              # Modern web framework
transformers==4.44.0          # Hugging Face transformers
torch                         # PyTorch for model inference
uvicorn==0.30.0              # ASGI server
accelerate==0.33.0           # Training/inference optimization
pydantic==2.8.2              # Data validation

The full requirements.txt includes exact version pins for reproducibility, with PyTorch configured for CUDA 12.1:

--extra-index-url https://download.pytorch.org/whl/cu121

Important Note: The CUDA version in the wheel URL (cu121) must match your host GPU driver compatibility. For different CUDA versions, adjust both the index URL and version suffix accordingly.
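
For example (illustrative only, not a configuration tested in this guide), a host whose driver stack supports only CUDA 11.8 would point pip at the cu118 wheel index instead:

--extra-index-url https://download.pytorch.org/whl/cu118

The Dockerfile's base image tag should track the same CUDA minor version as the wheel index.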

Data Models and Validation

Pydantic models provide automatic request validation and API documentation:

View Python Models
from pydantic import BaseModel, Field
from typing import List, Optional

class SummarizeIn(BaseModel):
    texts: List[str] = Field(..., description="List of input documents to summarize")
    max_length: int = Field(128, ge=16, le=512)
    min_length: int = Field(32, ge=8, le=256)
    do_sample: bool = False
    temperature: float = Field(1.0, ge=0.1, le=2.0)
    top_p: float = Field(0.9, ge=0.1, le=1.0)

class Summary(BaseModel):
    text: str

class SummarizeOut(BaseModel):
    summaries: List[Summary]
    model: str
    device: str
    latency_ms: float

class HealthOut(BaseModel):
    status: str
    model: str
    device: str

These models ensure:

  • Type Safety: Automatic type checking and conversion
  • Constraint Validation: Length and range constraints on parameters
  • Documentation: Automatic OpenAPI schema generation
  • Client Generation: Easy client library creation from schemas
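
For instance, assuming FastAPI's default validation behavior, a request that violates these constraints is rejected with HTTP 422 before any GPU work happens:

# Hypothetical request with an out-of-range max_length (the schema caps it at 512)
curl -X POST http://localhost:8001/summarize \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Some article text."], "max_length": 9999}'
# Expected: 422 Unprocessable Entity with a validation error describing the max_length constraint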

Application Implementation

View Python Application Code
import os
import time
import logging
from typing import List

import torch
from fastapi import FastAPI, HTTPException, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import HTMLResponse
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

from models import SummarizeIn, SummarizeOut, Summary, HealthOut

# ---------------- Config ----------------
MODEL_NAME = os.environ.get("MODEL_NAME", "facebook/bart-large-cnn")
MAX_CONCURRENT_BATCH = int(os.environ.get("MAX_CONCURRENT_BATCH", "8"))
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = os.environ.get("DTYPE", "fp16")   # "fp16" or "fp32"
GEN_KW = {"num_beams": int(os.environ.get("NUM_BEAMS", "4"))}

# --------------- Logging ----------------
logging.basicConfig(
    level=os.environ.get("LOG_LEVEL", "INFO"),
    format="%(asctime)s %(levelname)s %(name)s - %(message)s"
)
log = logging.getLogger("summarizer")
print("DEVICE", DEVICE)

# ----------- Load model/tokenizer -------
log.info(f"Loading model {MODEL_NAME} on {DEVICE} ...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

if DEVICE == "cuda":
    if DTYPE == "fp16":
        model = model.half()
    model = model.to(DEVICE)

model.eval()
log.info("Model loaded.")

Key Implementation Details:

  1. Environment-Driven Configuration: All parameters (model name, batch size, precision) are configurable via environment variables, enabling easy deployment variations.

  2. Mixed Precision Support: The DTYPE=fp16 option reduces GPU memory usage by ~50% and increases throughput on modern GPUs with Tensor Cores, without significant accuracy loss.

  3. Device Detection: Automatic fallback to CPU if CUDA isn't available, useful for development and testing.

  4. Evaluation Mode: Setting model.eval() disables dropout and batch normalization training behavior, ensuring consistent inference results.

Rate Limiting Implementation

View Rate Limiting Code
from time import monotonic
from collections import deque

RATE_PER_SEC = float(os.environ.get("RATE_PER_SEC", "20"))
WINDOW_SEC = 1.0
bucket = deque()

def rate_limit_ok() -> bool:
    now = monotonic()
    while bucket and now - bucket[0] > WINDOW_SEC:
        bucket.popleft()
    if len(bucket) < RATE_PER_SEC:
        bucket.append(now)
        return True
    return False

This sliding-window rate limiter:

  • Maintains a sliding window of request timestamps
  • Allows bursts up to the rate limit
  • Provides configurable limits per deployment environment
  • Uses efficient deque operations for O(1) updates

API Endpoints

View API Implementation
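
The endpoint snippets below reference an app object whose construction is not shown in the listing above. A minimal setup consistent with the imports in app.py (the title and the permissive CORS policy here are assumptions) might look like:

app = FastAPI(title="Summarization Service")

# Permissive CORS for demo purposes; tighten allow_origins in production
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)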

Health Check Endpoint

@app.get("/health", response_model=HealthOut)
def health():
    return HealthOut(status="ok", model=MODEL_NAME, device=DEVICE)

The health endpoint provides:

  • Service Status: Quick readiness check
  • Configuration Visibility: Model and device information
  • Load Balancer Integration: Standard endpoint for orchestration tools

Summarization Endpoint

@app.post("/summarize", response_model=SummarizeOut)
def summarize(payload: SummarizeIn, request: Request):
    if not rate_limit_ok():
        raise HTTPException(status_code=429, detail="Rate limit exceeded")

    texts = payload.texts[:MAX_CONCURRENT_BATCH]
    if not texts:
        raise HTTPException(status_code=400, detail="No texts provided")

    start = time.time()
    with torch.inference_mode():
        batch = tokenizer(
            texts, padding=True, truncation=True, max_length=1024, return_tensors="pt"
        )
        if DEVICE == "cuda":
            batch = {k: v.to(DEVICE, non_blocking=True) for k, v in batch.items()}
        outputs = model.generate(
            **batch,
            max_length=payload.max_length,
            min_length=payload.min_length,
            do_sample=payload.do_sample,
            temperature=payload.temperature,
            top_p=payload.top_p,
            **GEN_KW
        )
        decoded: List[str] = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    latency_ms = (time.time() - start) * 1000.0

    return SummarizeOut(
        summaries=[Summary(text=s) for s in decoded],
        model=MODEL_NAME,
        device=DEVICE,
        latency_ms=round(latency_ms, 2),
    )

Performance Optimizations:

  1. Batch Processing: Processing multiple texts together maximizes GPU utilization
  2. Inference Mode: torch.inference_mode() disables gradient tracking, reducing memory and computation
  3. Non-blocking Transfers: non_blocking=True overlaps CPU-GPU data transfer with computation
  4. Padding and Truncation: Ensures consistent tensor shapes for efficient batching

Docker Configuration

View Python Dockerfile
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive \
    PIP_NO_CACHE_DIR=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.10 python3.10-venv python3-pip git curl ca-certificates && \
    ln -sf /usr/bin/python3.10 /usr/bin/python && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt /app/
RUN python -m pip install --upgrade pip && pip install -r requirements.txt

COPY app.py models.py start.sh /app/
RUN chmod +x /app/start.sh

ENV MODEL_NAME=facebook/bart-large-cnn \
    NUM_BEAMS=4 \
    DTYPE=fp16 \
    MAX_CONCURRENT_BATCH=8 \
    RATE_PER_SEC=50

EXPOSE 8000
CMD ["/app/start.sh"]

Docker Best Practices:

  • Base Image Selection: Using CUDA runtime (not devel) reduces image size while maintaining inference capability
  • Layer Caching: Copying requirements before code enables faster rebuilds
  • Cleanup: Removing apt lists reduces image size
  • Default Configuration: Environment variables provide sensible defaults
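
The CMD above hands off to start.sh, which isn't reproduced in this article. A minimal version consistent with the pinned uvicorn dependency (an assumption, not the repository's exact script) would be:

#!/usr/bin/env bash
# Hypothetical start.sh: run uvicorn with a single worker so BART is loaded into GPU memory once
set -euo pipefail
exec uvicorn app:app --host 0.0.0.0 --port "${PORT:-8000}"

Keeping a single worker per container avoids loading duplicate copies of the model on one GPU; scale horizontally with more containers instead.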

Deep Dive: Rust Implementation (Axum + rust-bert)

Design Philosophy

The Rust implementation prioritizes performance, memory safety, and predictable latency. Using Axum for the web framework and rust-bert for model inference provides a type-safe, high-performance alternative to Python.

Dependencies Configuration

View Rust Dependencies
[package]
name = "summarizer-axum"
version = "0.1.0"
edition = "2021"

[dependencies]
axum = "0.7"
tokio = { version = "1.39", features = ["rt-multi-thread", "macros"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
tower = "0.5"
tower-http = { version = "0.5", features = ["trace", "cors"] }
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
anyhow = "1.0"
once_cell = "1.19"

# rust-bert + tch (libtorch) bindings
rust-bert = "0.23"
tch = "0.17"

# simple rate limiting
governor = { version = "0.6", features = ["std"] }

# Explicitly add console with std feature
console = { version = "0.16", features = ["std"] }

Key Library Choices:

  • Axum: Modern, ergonomic web framework built on Tokio
  • rust-bert: Rust bindings for transformer models
  • tch: Rust bindings to libtorch (PyTorch's C++ backend)
  • governor: High-performance rate limiting
  • tokio: Async runtime for concurrent request handling

Application Structure

View Rust Application Code
use std::{net::SocketAddr, sync::Arc, time::Instant, num::NonZeroU32};
use anyhow::Result;
use axum::{
    extract::State,
    http::{HeaderValue, Method, StatusCode},
    routing::{get, post},
    Json, Router,
};
use governor::{
    clock::DefaultClock,
    state::{InMemoryState, NotKeyed},
    Quota,
    RateLimiter,
};
use once_cell::sync::Lazy;
use rust_bert::pipelines::summarization::SummarizationModel;
use serde::{Deserialize, Serialize};
use tokio::sync::Mutex;
use tower_http::{cors::CorsLayer, trace::TraceLayer};
use tracing::{error, info};
// Trait imports needed for the registry().with(...).init() calls in main()
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

Rust Advantages:

  • Zero-cost Abstractions: Serialization code is generated at compile time
  • Type Safety: Impossible to construct invalid responses
  • Performance: No runtime validation overhead

Request/Response Models

View Rust Models
#[derive(Debug, Deserialize)]
struct SummarizeIn {
    texts: Vec<String>,
    max_length: Option<i64>,
    min_length: Option<i64>,
    do_sample: Option<bool>,
    temperature: Option<f64>,
    top_p: Option<f64>,
}

#[derive(Debug, Serialize)]
struct Summary {
    text: String,
}

#[derive(Debug, Serialize)]
struct SummarizeOut {
    summaries: Vec<Summary>,
    model: String,
    device: String,
    latency_ms: f64,
}

Application State and Initialization

View Rust Main Function
#[derive(Clone)]
struct AppState {
    model: Arc<Mutex<SummarizationModel>>,
    model_name: String,
    device: String,
}

#[tokio::main]
async fn main() -> Result<()> {
    // Setup logging
    tracing_subscriber::registry()
        .with(
            tracing_subscriber::EnvFilter::try_from_default_env()
                .unwrap_or_else(|_| "info,tower_http=info".into()),
        )
        .with(tracing_subscriber::fmt::layer())
        .init();

    println!("CUDA available: {}", tch::Cuda::is_available());
    println!("CUDNN available: {}", tch::Cuda::cudnn_is_available());
        
    // Load summarization model
    let model_name =
        std::env::var("MODEL_NAME").unwrap_or_else(|_| "facebook/bart-large-cnn".into());
    info!("Loading model: {}", &model_name);
    // Note: the model is built from rust-bert's default summarization configuration;
    // MODEL_NAME is reported in /health and responses but does not change the loaded weights here.
    let model = SummarizationModel::new(Default::default())?;
    let device = if tch::Cuda::is_available() {
        "cuda".into()
    } else {
        "cpu".into()
    };
    info!("Device: {}", device);

    // Shared application state
    let state = AppState {
        model: Arc::new(Mutex::new(model)),
        model_name: model_name.clone(),
        device,
    };

    // Setup CORS
    let cors = CorsLayer::new()
        .allow_origin(HeaderValue::from_static("*"))
        .allow_methods([Method::GET, Method::POST])
        .allow_headers([axum::http::header::CONTENT_TYPE]);

    // Build routes
    let app = Router::new()
        .route("/health", get(health))
        .route("/summarize", post(summarize))
        .with_state(state)
        .layer(TraceLayer::new_for_http())
        .layer(cors);

    // Start server
    let port: u16 = std::env::var("PORT")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(8000);
    let addr = SocketAddr::from(([0, 0, 0, 0], port));
    info!("Listening on {}", addr);
    axum::serve(tokio::net::TcpListener::bind(addr).await?, app).await?;

    Ok(())
}

Implementation Highlights:

  1. Structured Concurrency: Tokio's async runtime manages thousands of concurrent connections efficiently
  2. Shared State: Arc<Mutex<>> allows safe concurrent access to the model
  3. Layer Composition: Middleware layers provide logging, CORS, and tracing
  4. CUDA Detection: Runtime checks ensure GPU availability

Rate Limiting

View Rust Rate Limiting
static RLIM: Lazy<RateLimiter<NotKeyed, InMemoryState, DefaultClock>> = Lazy::new(|| {
    let q = std::env::var("RATE_PER_SEC")
        .ok()
        .and_then(|s| s.parse::<u32>().ok())
        .unwrap_or(50);
    RateLimiter::direct(Quota::per_second(NonZeroU32::new(q).unwrap()))
});

The governor crate provides:

  • Low Overhead: Atomic operations for rate tracking
  • Flexibility: Per-second, per-minute, or custom quotas
  • Global State: Single rate limiter instance across all requests

API Endpoints Implementation

View Rust API Endpoints
async fn health(State(state): State<AppState>) -> Json<serde_json::Value> {
    Json(serde_json::json!({
        "status": "ok",
        "model": state.model_name,
        "device": state.device
    }))
}

async fn summarize(
    State(state): State<AppState>,
    Json(input): Json<SummarizeIn>,
) -> Result<Json<SummarizeOut>, (StatusCode, String)> {
    if RLIM.check().is_err() {
        return Err((StatusCode::TOO_MANY_REQUESTS, "rate limit".into()));
    }
    if input.texts.is_empty() {
        return Err((StatusCode::BAD_REQUEST, "no texts".into()));
    }

    let start = Instant::now();
    let m = state.model.clone();
    let texts = input.texts.clone();

    let outputs = tokio::task::spawn_blocking(move || {
        let mut m = m.blocking_lock();
        m.summarize(&texts)
    })
    .await
    .map_err(internal)?;

    let summaries = outputs
        .into_iter()
        .flatten()
        .map(|t| Summary { text: t })
        .collect::<Vec<_>>();
    let latency_ms = start.elapsed().as_secs_f64() * 1000.0;

    Ok(Json(SummarizeOut {
        summaries,
        model: state.model_name.clone(),
        device: state.device.clone(),
        latency_ms: (latency_ms * 100.0).round() / 100.0,
    }))
}

Critical Design Pattern: The spawn_blocking call is essential. Model inference is CPU/GPU-intensive and synchronous, so running it directly in an async handler would stall the Tokio runtime. Moving it onto Tokio's blocking thread pool keeps the other request handlers responsive.
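
One piece not shown in the listing is the internal helper passed to map_err. A minimal version matching the handler's error type (an assumption about the original code, not a verbatim excerpt) could be:

// Hypothetical helper: log the failure and surface it to the client as a 500
fn internal<E: std::fmt::Display>(err: E) -> (StatusCode, String) {
    error!("inference task failed: {}", err);
    (StatusCode::INTERNAL_SERVER_ERROR, err.to_string())
}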

Rust Docker Configuration

View Rust Dockerfile
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04 AS builder

ARG LIBTORCH_URL=https://download.pytorch.org/libtorch/cu121/libtorch-cxx11-abi-shared-with-deps-2.4.0%2Bcu121.zip
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y --no-install-recommends \
    curl unzip ca-certificates pkg-config build-essential git \
    libssl-dev \
    && rm -rf /var/lib/apt/lists/*

RUN curl https://sh.rustup.rs -sSf | sh -s -- -y
ENV PATH=/root/.cargo/bin:$PATH

WORKDIR /opt
RUN curl -L "$LIBTORCH_URL" -o libtorch.zip && unzip libtorch.zip && rm libtorch.zip
ENV LIBTORCH=/opt/libtorch
ENV LD_LIBRARY_PATH=/opt/libtorch/lib:$LD_LIBRARY_PATH

WORKDIR /build
COPY Cargo.toml /build/
COPY src /build/src

RUN cargo build --release

FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates libssl3 \
    && rm -rf /var/lib/apt/lists/*

COPY --from=builder /opt/libtorch /opt/libtorch
ENV LIBTORCH=/opt/libtorch
ENV LD_LIBRARY_PATH=/opt/libtorch/lib:$LD_LIBRARY_PATH

ENV RUST_LOG=info
ENV MODEL_NAME=facebook/bart-large-cnn
ENV RATE_PER_SEC=80
ENV PORT=8000

COPY --from=builder /build/target/release/summarizer-axum /usr/local/bin/summarizer-axum

EXPOSE 8000
CMD ["/usr/local/bin/summarizer-axum"]

Multi-Stage Build Benefits:

  1. Smaller Final Image: Runtime image excludes build tools (often 50% size reduction)
  2. Security: Fewer packages means smaller attack surface
  3. Performance: Only runtime dependencies affect startup time
  4. Clarity: Separation of build and runtime concerns

GPU Orchestration with Docker Compose

View Docker Compose Configuration
version: "3.9"
services:
  py-summarizer:
    build:
      context: ./python-fastapi
    image: py-summarizer:cuda
    environment:
      - MODEL_NAME=facebook/bart-large-cnn
      - DTYPE=fp16
      - NUM_BEAMS=4
      - MAX_CONCURRENT_BATCH=8
      - RATE_PER_SEC=100
    ports:
      - "8001:8000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    runtime: nvidia

  rs-summarizer:
    build:
      context: ./rust-axum
    image: rs-summarizer:cuda
    environment:
      - MODEL_NAME=facebook/bart-large-cnn
      - RATE_PER_SEC=120
      - RUST_LOG=info
    ports:
      - "8002:8000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    runtime: nvidia

Configuration Notes:

  • Port Mapping: Different host ports (8001, 8002) allow simultaneous testing
  • GPU Access: runtime: nvidia requires NVIDIA Container Toolkit
  • Resource Limits: Deploy section reserves GPU resources
  • Environment-Specific Settings: Each service can have different rate limits
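
Since both containers expose /health, Compose can also gate traffic on container-level health checks. A sketch for the Python service (curl is already installed in that image; the Rust runtime image would need curl added first):

    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:8000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 120s   # model download and load can take a while on first start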

API Usage Examples

Making Requests

View Example API Calls

Both services expose identical APIs. Here's how to interact with them:

# Python service (port 8001)
curl -X POST http://localhost:8001/summarize \
  -H "Content-Type: application/json" \
  -d '{
    "texts": [
      "Transformers are fantastic. They allow attention-based sequence modeling 
       which has revolutionized natural language processing. These models can 
       process entire sequences in parallel, unlike recurrent neural networks."
    ],
    "max_length": 80,
    "min_length": 32,
    "do_sample": false
  }'

# Rust service (port 8002)
curl -X POST http://localhost:8002/summarize \
  -H "Content-Type: application/json" \
  -d '{
    "texts": [
      "Climate change is one of the most pressing challenges facing humanity. 
       Rising temperatures and extreme weather events are becoming more frequent."
    ],
    "max_length": 60,
    "min_length": 20
  }'

Response Format

View Example Response

Both services return identical JSON structure:

{
  "summaries": [
    {
      "text": "Transformers use attention-based sequence modeling. They process sequences in parallel, unlike RNNs."
    }
  ],
  "model": "facebook/bart-large-cnn",
  "device": "cuda",
  "latency_ms": 945.12
}

Health Checks

View Health Check Examples
# Check Python service
curl http://localhost:8001/health

# Check Rust service
curl http://localhost:8002/health

# Response format:
{
  "status": "ok",
  "model": "facebook/bart-large-cnn",
  "device": "cuda"
}

Building and Deploying

Individual Service Deployment

Python Service Deployment
cd summarizer-service/python-fastapi

# Build image
docker build -t py-summarizer:cuda .

# Run with GPU
docker run --rm -it --gpus all -p 8001:8000 \
  -e MODEL_NAME=facebook/bart-large-cnn \
  -e DTYPE=fp16 \
  -e NUM_BEAMS=4 \
  -e MAX_CONCURRENT_BATCH=8 \
  -e RATE_PER_SEC=100 \
  py-summarizer:cuda

Rust Service Deployment
cd summarizer-service/rust-axum

# Build image
docker build -t rs-summarizer:cuda .

# Run with GPU
docker run --rm -it --gpus all -p 8002:8000 \
  -e MODEL_NAME=facebook/bart-large-cnn \
  -e RATE_PER_SEC=120 \
  -e RUST_LOG=info \
  rs-summarizer:cuda

Compose Deployment

docker compose -f compose.gpu.yml up --build

Comprehensive Load Testing Analysis

Load testing reveals the real-world performance characteristics of both implementations. The following results were obtained on identical hardware with an NVIDIA GPU.

Testing Methodology

All tests use hey, a modern HTTP load generator:

Installing Hey Load Generator

Installation

# macOS
brew install hey

# Linux
sudo apt update
sudo apt install golang-go -y
source ~/.profile
go install github.com/rakyll/hey@latest
echo 'export PATH=$PATH:$(go env GOPATH)/bin' >> ~/.bashrc
source ~/.bashrc
hey -version

# Windows
# Download Go from https://go.dev/dl/
# Install MSI, restart PowerShell
go install github.com/rakyll/hey@latest
# Add %USERPROFILE%\go\bin to PATH if needed
hey -version

Test parameters:

  • Duration: 30-60 seconds per test
  • Concurrency: 8-16 concurrent connections
  • Request Body: Consistent 100+ word text samples
  • Metrics: Throughput (req/sec), latency (percentiles), error rates

Performance Comparison Charts

The following charts visualize the performance differences across various test scenarios:

Latency Comparison

Figure 1: Average latency comparison across different test scenarios. Lower values indicate better performance.

Throughput Comparison

Figure 2: Request throughput comparison showing requests per second. Higher values indicate better performance.

Performance Advantages Overview

Figure 3: Summary of Rust's performance advantages over Python across all test scenarios.

Detailed Test Results

Test 1: Light Load - Single Text (8 concurrent, 30s)

Python FastAPI Service:

hey -z 30s -c 8 -m POST \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Artificial intelligence and machine learning have revolutionized the way we process and understand data. Deep learning models, particularly transformers, have shown remarkable performance across various natural language processing tasks including summarization, translation, and question answering."],
    "max_length": 80,
    "min_length": 32
  }' \
  http://localhost:8001/summarize

Results:

Summary:
  Total:        39.1539 secs
  Slowest:      13.2829 secs
  Fastest:      12.7253 secs
  Average:      13.0247 secs
  Requests/sec: 0.6130

Response time histogram:
  P50: 13.0765 secs
  P75: 13.1394 secs
  P90: 13.2120 secs
  P95: 13.2829 secs

Status code distribution:
  [200] 24 responses

Rust Axum Service:

hey -z 30s -c 8 -m POST \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Artificial intelligence and machine learning have revolutionized the way we process and understand data. Deep learning models, particularly transformers, have shown remarkable performance across various natural language processing tasks including summarization, translation, and question answering."],
    "max_length": 80,
    "min_length": 32
  }' \
  http://localhost:8002/summarize

Results:

Summary:
  Total:        38.7155 secs
  Slowest:      10.1396 secs
  Fastest:      1.5438 secs
  Average:      8.6167 secs
  Requests/sec: 0.8265

Response time histogram:
  P50: 9.4874 secs
  P75: 9.6208 secs
  P90: 9.8464 secs
  P95: 10.1396 secs

Status code distribution:
  [200] 32 responses

Analysis:

  • Throughput: Rust achieved 35% higher throughput (0.83 vs 0.61 req/sec)
  • Latency: Rust's average latency was 34% lower (8.6s vs 13.0s)
  • Consistency: Rust showed better latency distribution (1.5s-10.1s vs 12.7s-13.3s)
  • Completed Requests: Rust completed 33% more requests (32 vs 24)

Test 2: Moderate Load - Single Text (16 concurrent, 60s)

Python FastAPI Service:

hey -z 60s -c 16 -m POST -t 30 \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Artificial intelligence and machine learning have revolutionized the way we process and understand data. Deep learning models, particularly transformers, have shown remarkable performance across various natural language processing tasks including summarization, translation, and question answering."],
    "max_length": 80,
    "min_length": 32
  }' \
  http://localhost:8001/summarize

Results:

Summary:
  Total:        77.3369 secs
  Slowest:      26.3156 secs
  Fastest:      25.1225 secs
  Average:      25.7400 secs
  Requests/sec: 0.6207

Response time histogram:
  P50: 25.7552 secs
  P75: 25.9583 secs
  P90: 26.1405 secs
  P95: 26.1553 secs

Status code distribution:
  [200] 48 responses

Rust Axum Service:

hey -z 60s -c 16 -m POST -t 30 \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Artificial intelligence and machine learning have revolutionized the way we process and understand data. Deep learning models, particularly transformers, have shown remarkable performance across various natural language processing tasks including summarization, translation, and question answering."],
    "max_length": 80,
    "min_length": 32
  }' \
  http://localhost:8002/summarize

Results:

Summary:
  Total:        78.7227 secs
  Slowest:      21.4153 secs
  Fastest:      3.0054 secs
  Average:      17.6692 secs
  Requests/sec: 0.8003

Response time histogram:
  P50: 19.4342 secs
  P75: 19.6269 secs
  P90: 19.7422 secs
  P95: 20.0248 secs

Status code distribution:
  [200] 63 responses

Analysis:

  • Throughput: Rust maintained 29% higher throughput (0.80 vs 0.62 req/sec)
  • Latency: Rust's average latency was 31% lower (17.7s vs 25.7s)
  • Scalability: Rust handled increased concurrency better (63 vs 48 completed)
  • Latency Variance: Rust showed wider variance (3-21s vs 25-26s), suggesting better request parallelization

Test 3: Batch Processing (4 texts, 8 concurrent, 30s)

Python FastAPI Service:

hey -z 30s -c 8 -m POST \
  -H "Content-Type: application/json" \
  -d '{
    "texts": [
      "Artificial intelligence and machine learning have revolutionized the way we process and understand data. Deep learning models have shown remarkable performance.",
      "Climate change is one of the most pressing challenges facing humanity. Rising temperatures and extreme weather events are becoming more frequent.",
      "The global economy has seen significant changes in recent years with the rise of digital currencies and blockchain technology.",
      "Space exploration continues to advance with new missions to Mars and plans for lunar bases in the coming decades."
    ],
    "max_length": 60,
    "min_length": 20
  }' \
  http://localhost:8001/summarize

Results:

Summary:
  Total:        35.7881 secs
  Slowest:      19.5253 secs
  Fastest:      16.2402 secs
  Average:      17.8389 secs
  Requests/sec: 0.4471

Response time histogram:
  P50: 18.8195 secs
  P75: 19.2788 secs
  P90: 19.5253 secs

Status code distribution:
  [200] 16 responses
  Total summaries: 64 (16 requests × 4 texts)

Rust Axum Service:

hey -z 30s -c 8 -m POST \
  -H "Content-Type: application/json" \
  -d '{
    "texts": [
      "Artificial intelligence and machine learning have revolutionized the way we process and understand data. Deep learning models have shown remarkable performance.",
      "Climate change is one of the most pressing challenges facing humanity. Rising temperatures and extreme weather events are becoming more frequent.",
      "The global economy has seen significant changes in recent years with the rise of digital currencies and blockchain technology.",
      "Space exploration continues to advance with new missions to Mars and plans for lunar bases in the coming decades."
    ],
    "max_length": 60,
    "min_length": 20
  }' \
  http://localhost:8002/summarize

Results:

Summary:
  Total:        38.3619 secs
  Slowest:      10.9850 secs
  Fastest:      2.6813 secs
  Average:      8.8654 secs
  Requests/sec: 0.8081

Response time histogram:
  P50: 9.6433 secs
  P75: 9.7069 secs
  P90: 9.8297 secs
  P95: 10.9850 secs

Status code distribution:
  [200] 31 responses
  Total summaries: 124 (31 requests × 4 texts)

Analysis:

  • Batch Efficiency: Rust processed 94% more total summaries (124 vs 64)
  • Throughput: Rust achieved 81% higher request throughput (0.81 vs 0.45 req/sec)
  • Latency: Rust's average latency was 50% lower (8.9s vs 17.8s)
  • GPU Utilization: Better parallelization in Rust translates to more efficient GPU usage

Performance Summary Table

Metric             Python (Single)   Rust (Single)        Python (Batch)   Rust (Batch)
Avg Latency        13.0s             8.6s (-34%)          17.8s            8.9s (-50%)
P95 Latency        13.3s             10.1s (-24%)         19.5s            11.0s (-44%)
Throughput         0.61 req/s        0.83 req/s (+35%)    0.45 req/s       0.81 req/s (+81%)
Completed (30s)    24                32 (+33%)            16               31 (+94%)

Key Insights from Load Testing

1. Rust Performance Advantage:

  • Consistently 30-50% lower latency across all test scenarios
  • Better throughput, especially under increased load
  • More efficient request handling and GPU utilization

2. Batching Benefits:

  • Both services show improved efficiency with batched requests
  • Rust's advantage is more pronounced with batching (81% vs 35% improvement)
  • Proper batching can significantly increase throughput without additional hardware

3. Scalability Patterns:

  • Python shows more consistent latency but lower throughput
  • Rust shows wider latency distribution but higher overall throughput
  • Both services maintain stability without errors under tested loads

4. Production Considerations:

  • For high-throughput scenarios: Rust offers clear advantages
  • For predictable latency requirements: Python's consistency may be valuable
  • Batching should be enabled in production for both implementations

Production Deployment Considerations

Model Selection Strategy

Available Model Options

The default facebook/bart-large-cnn provides excellent summarization quality but isn't the only option:

Standard Models:

  • facebook/bart-large-cnn: Best quality, slower inference (~13s Python, ~9s Rust)
  • sshleifer/distilbart-cnn-12-6: Distilled version, 40% faster, 95% quality
  • facebook/bart-base: Smaller model, 60% faster, 90% quality
  • google/pegasus-cnn_dailymail: Alternative architecture, comparable quality

Switching Models: Simply set the MODEL_NAME environment variable:

docker run --gpus all -e MODEL_NAME=sshleifer/distilbart-cnn-12-6 ...

Model Selection Criteria:

  • Quality Priority: BART-Large-CNN or PEGASUS
  • Speed Priority: DistilBART or BART-Base
  • Balanced: DistilBART-CNN-12-6 offers best quality/speed tradeoff
  • Domain-Specific: Fine-tune on your corpus or use domain-specific models

Optimization Strategies

Mixed Precision & Quantization

Mixed Precision Inference (Python):

FP16 reduces memory usage by ~50% and increases throughput:

# Enable FP16
docker run -e DTYPE=fp16 ...

# Verify GPU supports FP16 efficiently:
nvidia-smi --query-gpu=compute_cap --format=csv
# Compute capability >= 7.0 recommended for FP16

Benefits:

  • 50% memory reduction allows larger batches
  • 1.5-2× faster inference on Tensor Core GPUs
  • Negligible quality impact for most summarization tasks

Quantization (Advanced):

For even faster inference:

  • INT8 quantization: 4× memory reduction, 2-3× speedup
  • Tools: PyTorch quantization API, ONNX Runtime, TensorRT
  • Tradeoff: 1-2% quality degradation
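
As a rough illustration (not part of the services above), PyTorch's dynamic quantization converts the linear layers to INT8 for CPU inference; GPU INT8 deployments typically go through ONNX Runtime or TensorRT instead:

import torch
# 'model' is the AutoModelForSeq2SeqLM loaded earlier; dynamic INT8 quantization
# of nn.Linear layers benefits CPU inference only
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)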

Batching Configuration
# Python
MAX_CONCURRENT_BATCH = 8  # Process up to 8 texts together

# Rust
// Processes entire input array as batch
let summaries = model.summarize(&texts);

Batching Best Practices:

  • Start with batch size 4-8 for testing
  • Monitor GPU memory usage (nvidia-smi)
  • Increase until GPU memory ~80% utilized
  • Balance latency vs throughput requirements
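
For example, GPU memory and utilization can be watched while MAX_CONCURRENT_BATCH is increased under load:

# Refresh GPU memory and utilization once per second during a load test
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv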

Cold Start Optimization

Pre-download Models in Docker
# Add to Dockerfile before COPY
RUN python -c "
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM;
AutoTokenizer.from_pretrained('facebook/bart-large-cnn');
AutoModelForSeq2SeqLM.from_pretrained('facebook/bart-large-cnn');
"

Benefits:

  • No download delay on first request
  • Faster container startup
  • More predictable deployment

Tradeoffs:

  • Larger image size (~3-4GB)
  • Rebuild required for model updates
  • Storage costs in registry
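
An alternative that avoids baking weights into the image is mounting the Hugging Face cache from the host, so the model is downloaded once and reused across container restarts (paths assume the default cache location and a root user inside the container):

# Reuse the host's Hugging Face cache inside the Python container
docker run --rm -it --gpus all -p 8001:8000 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  py-summarizer:cuda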

Security Considerations

Input Sanitization
# Limit text length to prevent abuse
if any(len(text) > 10000 for text in payload.texts):
    raise HTTPException(400, "Text too long")

# Limit batch size
if len(payload.texts) > MAX_CONCURRENT_BATCH:
    payload.texts = payload.texts[:MAX_CONCURRENT_BATCH]

When to Choose Python vs Rust

Choose Python When:

Python Use Cases

1. Rapid Prototyping and Research:

  • Experimenting with different models and architectures
  • Quick iterations during development
  • Research team primarily uses Python

2. Rich Ecosystem Requirements:

  • Need latest model releases from Hugging Face
  • Extensive data preprocessing with pandas/numpy
  • Integration with ML training pipelines

3. Team Expertise:

  • Team is proficient in Python
  • Limited Rust experience
  • Need to onboard new developers quickly

4. Simple Deployment:

  • Low to moderate traffic
  • Not latency-critical
  • Infrastructure already Python-based

Python Example Use Case: A content management system that summarizes articles for preview cards. Traffic is moderate (100 req/min), and consistency matters more than raw speed.

Choose Rust When:

Rust Use Cases

1. High-Performance Requirements:

  • Serving thousands of requests per second
  • Strict latency SLAs (P99 < 100ms)
  • Maximum throughput per GPU

2. Resource Efficiency:

  • Cost optimization through better hardware utilization
  • Edge deployment with limited resources
  • Multiple services on same hardware

3. Production Stability:

  • Long-running services requiring stability
  • Memory safety is critical
  • Predictable performance under load

4. Microservice Architecture:

  • Service mesh with multiple small services
  • Need for low-overhead HTTP/gRPC services
  • Container orchestration at scale

Rust Example Use Case: A real-time news aggregation platform that summarizes breaking news articles. Must handle traffic spikes (1000+ req/min) with sub-second P95 latency.

Hybrid Approach

Hybrid Strategies

Many organizations use both:

Pattern 1: Development in Python, Deployment in Rust

  • Prototype and train models in Python
  • Export to ONNX or safetensors format (see the export sketch after this list)
  • Deploy with Rust for production inference

Pattern 2: Different Services Different Languages

  • Complex preprocessing: Python (pandas, spaCy)
  • Core inference: Rust (performance-critical)
  • Post-processing: Python (flexibility)

Pattern 3: Progressive Migration

  • Start with Python for all components
  • Profile and identify bottlenecks
  • Rewrite only performance-critical paths in Rust
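
For Pattern 1, a minimal export sketch using Hugging Face Optimum (an assumption; this step is not part of the repository above) might look like:

# Hypothetical ONNX export of the summarization model with Hugging Face Optimum
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

model_id = "facebook/bart-large-cnn"
ort_model = ORTModelForSeq2SeqLM.from_pretrained(model_id, export=True)
ort_model.save_pretrained("bart-large-cnn-onnx")
AutoTokenizer.from_pretrained(model_id).save_pretrained("bart-large-cnn-onnx")

The exported graph can then be served from a Rust process, for example via ONNX Runtime bindings, while Python remains the training and experimentation environment.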

Troubleshooting Common Issues

GPU Not Detected

Symptoms:

  • Service logs show device: cpu
  • Slow inference times

Solutions:

# Verify NVIDIA driver
nvidia-smi

# Check CUDA version compatibility
nvcc --version

# Verify Docker GPU support
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# Check container runtime
docker info | grep -i runtime
# Should show: Runtimes: nvidia runc

Common Issues:

  • NVIDIA Container Toolkit not installed
  • Docker daemon not configured for GPU
  • CUDA version mismatch between host and container

Out of Memory Errors

Symptoms:

  • CUDA out of memory errors
  • Container crashes under load

Solutions:

# Monitor GPU memory
nvidia-smi dmon -s mu

# Reduce batch size
docker run -e MAX_CONCURRENT_BATCH=4 ...

# Enable FP16 (Python)
docker run -e DTYPE=fp16 ...

# Use smaller model
docker run -e MODEL_NAME=sshleifer/distilbart-cnn-12-6 ...

Memory Usage Guide:

  • BART-Large-CNN FP32: ~1.6GB
  • BART-Large-CNN FP16: ~800MB
  • DistilBART FP16: ~400MB
  • Plus ~200MB per batch item

Conclusion

Building production-ready transformer inference services requires careful consideration of performance, reliability, and operational concerns. Through our comprehensive exploration of both Python and Rust implementations, we've demonstrated:

Key Takeaways:

  1. Both Languages Have Merit: Python excels in rapid development and ecosystem maturity, while Rust provides superior performance and resource efficiency.

  2. Performance Matters: Real-world load testing shows Rust can deliver 30-50% lower latency and 35-81% higher throughput, with the advantage increasing under load and with batching.

  3. Production Readiness Requires More Than Code: Health checks, rate limiting, proper error handling, monitoring, and security are essential components often overlooked in tutorials.

  4. GPU Optimization is Critical: Proper FP16 configuration, batching strategies, and CUDA optimization can double throughput without additional hardware costs.

  5. Architecture Decisions Impact Operations: Multi-stage Docker builds, thoughtful concurrency models, and autoscaling strategies significantly affect deployment complexity and costs.

Decision Framework:

  • Start with Python if your team is Python-focused, you need rapid iteration, or you're still in research/prototyping phase
  • Choose Rust if you have strict performance requirements, need maximum efficiency, or are building large-scale production systems
  • Consider a hybrid approach for the best of both worlds: develop in Python, deploy in Rust

Ready to Optimize Your AI/ML Infrastructure?

At NavyaAI, we specialize in building production-grade, high-performance AI/ML solutions that deliver real business impact. If the performance optimizations in this guide resonate with your needs, we can help take your infrastructure to the next level.

Our Core Services

🚀 Model Inference Optimization

Transform your ML models from research prototypes to production powerhouses. We optimize inference pipelines for maximum throughput and minimum latency, leveraging advanced techniques like quantization, model pruning, and hardware-specific acceleration.

Key Capabilities:

  • GPU/TPU optimization and acceleration
  • Multi-model serving architectures
  • Real-time inference pipeline design
  • Cost reduction through efficient resource utilization

🔧 Model Surgery & Optimization

We perform deep architectural optimizations on your models without sacrificing accuracy. Our model surgery techniques reduce computational overhead while maintaining or even improving model performance.

Services Include:

  • Model compression and distillation
  • Architecture-specific optimizations
  • Custom CUDA kernel development
  • Inference-optimized model creation and training
⚑ DevOps & MLOps Excellence

Build robust, scalable infrastructure for your AI/ML workloads. We design and implement DevOps and MLOps pipelines that enable rapid iteration while maintaining production stability.

Expertise:

  • CI/CD for ML pipelines
  • Containerization and orchestration (Docker, Kubernetes)
  • Infrastructure as Code (Terraform, Ansible)
  • Monitoring, logging, and observability for ML systems

🏃 Sprint AI Applications

Need to go from concept to production fast? Our sprint methodology delivers fully-functional AI applications in record time, without compromising on quality or performance.

Rapid Delivery:

  • 2-4 week MVP development cycles
  • Production-ready architectures from day one
  • Scalable foundations for future growth
  • Performance benchmarking and optimization

Technology Stack

We work across the full spectrum of modern AI/ML technologies:

Languages: Rust, Python, Mojo, Golang, C
Domains: AI/ML, HPC (High-Performance Computing), MLOps, DevOps
Focus: Production-grade, optimized solutions for enterprise-scale deployments

Whether you're building real-time inference systems, distributed training infrastructure, or edge AI applications, we deliver production-grade, performance-optimized HPC and AI/ML solutions on sprint timelines, tailored to your specific requirements.

Let's Build Together

Ready to optimize your AI infrastructure? Whether you need help with model optimization, infrastructure scaling, or building production ML systems from scratch, we're here to help.

Get in touch to discuss how we can accelerate your AI/ML initiatives.

