Deploying a Production-Grade AI Service on Azure Kubernetes with GPU

A comprehensive guide covering AKS GPU cluster setup, Python API architecture, Docker GPU builds, Kubernetes deployment, async queue with Service Bus, 3-tier auto-scaling with KEDA, and the complete end-to-end request flow.

Deploying AI models to production is fundamentally different from running them in a notebook. GPU memory management, container orchestration, async processing, and auto-scaling all introduce challenges that most tutorials skip entirely.

This article is a comprehensive, end-to-end guide based on a real production system. It covers everything from AKS cluster creation to the moment a processed result lands in the user's hands — seven layers deep.


1. Setting Up AKS for GPU Workloads

The most critical decision when creating an AKS cluster for AI workloads is the node pool strategy. GPU workloads must run on a dedicated user pool — running GPUs in the system pool wastes money and creates resource contention.

Cluster Configuration

  • VM: Standard_NC8as_T4_v3 (NVIDIA T4, 16 GB VRAM)
  • Min nodes: 3 (prevents cold start — GPU spin-up takes 3–5 minutes)
  • Max nodes: 12 (cost ceiling)
  • Region: France Central (GPU availability varies by region)
  • Autoscaler: Cluster autoscaler with least-waste expander

The least-waste expander picks the node pool that will leave the least idle CPU after the scale-up, using unused memory as the tie-breaker. It only matters when the autoscaler can choose between several node pools with different VM sizes.
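
To make the selection rule concrete, here is a minimal sketch of a least-waste style choice. It is not the cluster autoscaler's code: the `NodeOption` type, the second VM size, and the pod request numbers are assumptions for illustration, and the scoring is simplified to idle CPU with idle memory as the tie-breaker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeOption:
    """A candidate node size (illustrative type, not an autoscaler API)."""
    name: str
    cpu: float         # vCPUs per node
    memory_gib: float  # memory per node


def wasted_fraction(option, cpu_request, memory_request_gib):
    """Return (idle CPU fraction, idle memory fraction) after placing the pod."""
    return (
        1 - cpu_request / option.cpu,
        1 - memory_request_gib / option.memory_gib,
    )


def pick_least_waste(options, cpu_request, memory_request_gib):
    """Pick the option with the least idle CPU; idle memory breaks ties."""
    fitting = [
        o for o in options
        if o.cpu >= cpu_request and o.memory_gib >= memory_request_gib
    ]
    if not fitting:
        raise ValueError("no node option can fit the pending pod")
    return min(
        fitting,
        key=lambda o: wasted_fraction(o, cpu_request, memory_request_gib),
    )


options = [
    NodeOption("Standard_NC8as_T4_v3", cpu=8, memory_gib=56),
    NodeOption("Standard_NC16as_T4_v3", cpu=16, memory_gib=110),
]
# A pending pod asking for 6 vCPUs and 8 GiB fits both sizes,
# but leaves far less idle CPU on the smaller one.
best = pick_least_waste(options, cpu_request=6, memory_request_gib=8)
print(best.name)
```

With a single GPU node pool, as in the configuration above, the expander has nothing to choose between; it starts to matter once a second pool is added.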

Common Pitfalls

  • GPU nodes aren't available in every region. Run az vm list-skus before you commit.
  • Spot instances for GPU? Risky. A preempted node mid-inference means lost work.
  • Min 3 nodes is not optional. Users won't wait 3–5 minutes for a cold GPU to spin up.

2. Turning an AI Model into a Production API

We use FastAPI + Uvicorn — async support, auto-generated OpenAPI docs, and faster than Flask. But the framework isn't where people fail. It's the GPU concurrency model.

Four Critical Decisions

Single Worker: uvicorn --workers 1

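Multiple workers mean multiple processes, each loading its own copy of the model into the same GPU's memory. Concurrent inference on one GPU then competes for VRAM and tends to end in out-of-memory errors and crashes. Scale with more containers, not more workers.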

Semaphore(1)

Only one inference per container at a time. A second request waits. Simple, predictable, stable.
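
A minimal sketch of the semaphore guard, using only the standard library. The `fake_inference` function and the `stats` dictionary are placeholders for illustration; in a real service the handler would be a FastAPI route and the model call would replace the sleep.

```python
import asyncio


async def fake_inference(job_id, stats):
    """Stand-in for the real GPU call; sleeps instead of running a model."""
    stats["active"] += 1
    stats["max_active"] = max(stats["max_active"], stats["active"])
    await asyncio.sleep(0.01)
    stats["active"] -= 1
    return f"result-{job_id}"


async def handle_request(job_id, gpu_semaphore, stats):
    """Request handler: waits for the GPU, then runs exactly one inference."""
    async with gpu_semaphore:
        return await fake_inference(job_id, stats)


async def main():
    gpu_semaphore = asyncio.Semaphore(1)  # created inside the running loop
    stats = {"active": 0, "max_active": 0}
    # Three requests arrive at once; the semaphore serializes them.
    results = await asyncio.gather(
        *(handle_request(i, gpu_semaphore, stats) for i in range(3))
    )
    return results, stats


results, stats = asyncio.run(main())
print(results, stats["max_active"])
```

One caveat: a real model call is usually blocking, so it would need to run in a thread (for example via `loop.run_in_executor`) to keep the event loop free for health checks while the semaphore is held.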

Model Loading at Startup

Load models once at container start, not per request. ONNX Runtime GPU gives 30–40% faster inference compared to raw PyTorch.

Temp File Management

UUID-based file names per request with cleanup after processing. Skip this and /tmp fills up — the container dies silently.
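
One way to get both properties (unique names and guaranteed cleanup) is a small context manager. This is a sketch under assumptions: the `request_workdir` helper and the `job-` prefix are made up for illustration, and it uses one directory per request rather than individual files.

```python
import os
import shutil
import tempfile
import uuid
from contextlib import contextmanager


@contextmanager
def request_workdir(base_dir=None):
    """Yield a unique per-request directory and always delete it afterwards."""
    base = base_dir or tempfile.gettempdir()
    path = os.path.join(base, f"job-{uuid.uuid4().hex}")
    os.makedirs(path)
    try:
        yield path
    finally:
        # Runs on success and on exceptions, so failed jobs do not leak files.
        shutil.rmtree(path, ignore_errors=True)


with request_workdir() as workdir:
    input_path = os.path.join(workdir, "input.bin")
    with open(input_path, "wb") as f:
        f.write(b"example payload")
    existed_during = os.path.exists(input_path)

exists_after = os.path.exists(workdir)
print(existed_during, exists_after)
```

Because cleanup sits in a `finally` block, a job that raises halfway through still removes its files.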

Health Endpoints — Keep Them Separate

  • /work → Liveness (is the container alive?)
  • /health → Readiness (is it busy or free?)

If you combine them, Kubernetes can't distinguish dead from busy. It will restart healthy containers mid-inference.
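
The split can be sketched without any framework. The `ServiceState` class and the response bodies below are assumptions for illustration; the two functions correspond to the /work (liveness) and /health (readiness) endpoints and would be wired to routes in the actual API.

```python
class ServiceState:
    """Flags the API process would update (illustrative, not a framework type)."""

    def __init__(self):
        self.model_loaded = False  # set to True once startup loading finishes
        self.busy = False          # set to True while an inference is running


def liveness(state):
    """GET /work: answers 200 whenever the process can respond at all."""
    return 200, {"status": "alive"}


def readiness(state):
    """GET /health: answers 200 only when the pod can take a new job."""
    if not state.model_loaded:
        return 503, {"status": "loading"}
    if state.busy:
        return 503, {"status": "busy"}
    return 200, {"status": "ready"}


state = ServiceState()
during_startup = (liveness(state)[0], readiness(state)[0])
state.model_loaded = True
idle = (liveness(state)[0], readiness(state)[0])
state.busy = True
mid_inference = (liveness(state)[0], readiness(state)[0])
print(during_startup, idle, mid_inference)
```

A busy pod fails readiness, so it stops receiving traffic, but it keeps passing liveness, so Kubernetes leaves it running until the job finishes.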

3. Building a GPU Docker Container

A GPU container is not a normal Docker image. CUDA, cuDNN, PyTorch, ONNX Runtime — all interdependent. Version mismatch is the #1 failure mode.

Base Image and Dependencies

  • Base: nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04
  • Python: 3.10
  • PyTorch: 2.0.1+cu118
  • ONNX Runtime GPU: Separate install (the default onnxruntime package on pip is CPU-only; install onnxruntime-gpu)

CUDA Version Mismatch — Why It Works

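Does a CUDA 11.8 build of PyTorch run in a CUDA 12.1 base image? Yes. The PyTorch pip wheels bundle their own CUDA runtime libraries, so they do not depend on the toolkit version in the image. What must be compatible is the NVIDIA driver on the host node, and a driver recent enough for CUDA 12.1 also supports 11.8. This trips up a lot of people.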

The Most Important Rule: Download Models at Build Time

Add RUN python download.py in your Dockerfile. Bake model weights into the image.

  • Without: Every pod spends 2–5 min downloading before serving.
  • With: Cold start drops to under 10 seconds.
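
The article does not show download.py, so the following is only a sketch of what such a script might look like. The environment variable names MODEL_URL and MODEL_PATH, the default path, and the `opener` parameter are assumptions; the dry run uses an in-memory stand-in so the code can be tried without a network.

```python
import hashlib
import io
import os
import tempfile
import urllib.request


def download_model(url, dest, opener=urllib.request.urlopen):
    """Stream `url` to `dest` and return the SHA-256 of the downloaded bytes."""
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    partial = dest + ".part"
    digest = hashlib.sha256()
    with opener(url) as response, open(partial, "wb") as out:
        while True:
            chunk = response.read(1024 * 1024)
            if not chunk:
                break
            out.write(chunk)
            digest.update(chunk)
    # Rename only after a complete download, so a failed build step
    # never leaves a truncated weights file in the image.
    os.replace(partial, dest)
    return digest.hexdigest()


# Offline dry run: an in-memory opener stands in for the real model host.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "models", "model.onnx")
    checksum = download_model(
        "ignored", target, opener=lambda _url: io.BytesIO(b"weights")
    )
    with open(target, "rb") as f:
        saved = f.read()

# During `docker build`, MODEL_URL and MODEL_PATH (hypothetical names)
# would be supplied as build-time environment variables.
if __name__ == "__main__" and "MODEL_URL" in os.environ:
    model_path = os.environ.get("MODEL_PATH", "models/model.onnx")
    print(download_model(os.environ["MODEL_URL"], model_path))
```

Printing the checksum in the build log gives a cheap way to confirm which weights ended up in a given image.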

Image size will be 12–18 GB. Normal for GPU containers. Multi-stage builds help remove build tools, but GPU libraries are large. Don't fight it.

docker build --platform linux/amd64 -t registry/app:latest .

4. GPU-Specific Kubernetes Deployment

Your readiness probe fires at second 10. Your model finishes loading at second 55. Traffic arrives. Pod crashes. This is where most GPU deployments fail.

Resource Requests

  • nvidia.com/gpu: 1 — guarantee one GPU per pod
  • Memory: 4Gi request, 8Gi limit
  • nodeSelector: target your GPU node pool

Probes (Where Most People Fail)

  • Readiness: GET /health, initialDelay: 60s
  • Liveness: GET /work, initialDelay: 120s, period: 30s

Why 60s readiness? The model needs time to load into GPU memory. If the probe passes before loading finishes, Kubernetes routes traffic to an unready pod and the first requests fail with 500 errors.

Why 120s liveness? Same reason, plus a buffer. A failing liveness probe restarts the container, so don't let Kubernetes kill a healthy pod that's still initializing.
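
The resource and probe settings can be put together as a Deployment manifest. The sketch below builds it as a Python dictionary and prints JSON, which `kubectl apply -f` accepts just like YAML. The `gpu_deployment` helper, the pool name "gpupool", and the `agentpool` node label are assumptions to adapt to your cluster; the probe paths are parameters, and the example call follows the liveness/readiness split from the health endpoints section.

```python
import json


def gpu_deployment(name, image, node_pool, readiness_path, liveness_path, port=8000):
    """Build a Deployment manifest (as a dict) for a one-GPU-per-pod service."""
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        "resources": {
            "requests": {"memory": "4Gi", "nvidia.com/gpu": "1"},
            "limits": {"memory": "8Gi", "nvidia.com/gpu": "1"},
        },
        "readinessProbe": {
            "httpGet": {"path": readiness_path, "port": port},
            "initialDelaySeconds": 60,
        },
        "livenessProbe": {
            "httpGet": {"path": liveness_path, "port": port},
            "initialDelaySeconds": 120,
            "periodSeconds": 30,
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    # Assumed label: pins pods to the dedicated GPU node pool.
                    "nodeSelector": {"agentpool": node_pool},
                    "containers": [container],
                },
            },
        },
    }


manifest = gpu_deployment(
    "app", "registry/app:latest", "gpupool",
    readiness_path="/health", liveness_path="/work",
)
print(json.dumps(manifest, indent=2))
```

Generating the manifest from code is optional; the same fields can be written directly in a YAML file.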

Other Critical Settings

  • Rolling update: maxSurge 25%, maxUnavailable 25%
  • terminationGracePeriodSeconds: 30 (let in-flight requests finish)
  • tmpfs volume: emptyDir with medium: Memory for fast temp file I/O
  • Service: LoadBalancer, port 80 → 8000

docker push registry/app:latest
kubectl rollout restart deployment/app
kubectl rollout status deployment/app

5. Why Synchronous HTTP Fails — Service Bus Architecture

User sends a request. Waits 30 seconds. Times out. Retries. Now you have 3 identical jobs burning GPU resources.

GPU inference takes 10–30 seconds. Synchronous HTTP is the wrong pattern.

The Problem

  1. Client sends HTTP request
  2. GPU processes for 15–25 seconds
  3. Client timeout (usually 30s)
  4. Client retries automatically
  5. 2–3 identical jobs running simultaneously
  6. GPU waste + duplicate results

The Fix: Azure Service Bus + Async Queue

  1. Client sends request → instant 202 Accepted + task ID
  2. Job goes to Service Bus queue
  3. Function App picks it up → sends to GPU
  4. Client polls with task ID until done
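
The four steps above can be sketched in a few lines. This is an in-process illustration only: an `asyncio.Queue` stands in for Azure Service Bus, a dictionary stands in for the database, and the function names and the "queued" and "processing" status values are assumptions.

```python
import asyncio
import uuid


async def submit(queue, tasks, payload):
    """Accept a job: record it, enqueue it, and answer immediately with 202."""
    task_id = uuid.uuid4().hex
    tasks[task_id] = {"status": "queued", "result": None}
    await queue.put((task_id, payload))
    return 202, {"task_id": task_id}


async def worker(queue, tasks):
    """Consumer: takes one job at a time and runs the (fake) GPU work."""
    while True:
        task_id, payload = await queue.get()
        tasks[task_id]["status"] = "processing"
        await asyncio.sleep(0.01)  # stand-in for GPU inference
        tasks[task_id].update(status="finished", result=payload.upper())
        queue.task_done()


def get_result(tasks, task_id):
    """Polling endpoint: reports the current state of a task."""
    task = tasks.get(task_id)
    if task is None:
        return 404, {"error": "unknown task"}
    return 200, dict(task)


async def main():
    queue, tasks = asyncio.Queue(), {}
    consumer = asyncio.create_task(worker(queue, tasks))
    status, body = await submit(queue, tasks, "hello")
    first_poll = get_result(tasks, body["task_id"])
    await queue.join()  # wait until the worker has finished the job
    consumer.cancel()
    final_poll = get_result(tasks, body["task_id"])
    return status, first_poll, final_poll


status, first_poll, final_poll = asyncio.run(main())
print(status, first_poll, final_poll)
```

The client gets its task ID before any GPU work starts, so a slow inference can no longer trigger a timeout and an automatic retry.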

Why Service Bus?

  • Guaranteed delivery — messages don't get lost
  • KEDA integration — 1 message = 1 pod request
  • Dead-letter queue — failed jobs auto-separated
  • Peek lock — message locked during processing

The queue also provides natural backpressure. When GPUs are overwhelmed, jobs wait in the queue instead of crashing pods.

6. 3-Tier Auto-Scaling: Zero GPU Waste

A single HPA doesn't cut it for GPU workloads. We built a 3-tier system orchestrated by an Azure Function App — not just a queue consumer, but the brain of the entire operation.

Tier 1 — KEDA (Event-Driven)

Queue depth triggers pods. 1 message = 1 pod request. Response time: seconds.

Tier 2 — Reactive Monitor (Every 30s)

  • Reads active job count from the database
  • If jobs > pods → immediately patches KEDA minReplicaCount
  • New nodes spin up within minutes
  • Updates daily peak metric

Tier 3 — Baseline Adjuster (Every 3h)

  • Analyzes peak load over the last 3 hours
  • Calculates new baseline (min 3, max 12 nodes)
  • Updates KEDA minReplicaCount via kubectl patch
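
The article gives the inputs and limits of each tier but not the formulas, so the following is a guess at the arithmetic rather than the production logic. The function names are hypothetical, and pods and nodes are treated as interchangeable because each pod requests one GPU and a Standard_NC8as_T4_v3 node has a single T4.

```python
import math

MIN_REPLICAS = 3   # baseline floor from the cluster configuration
MAX_REPLICAS = 12  # cost ceiling


def keda_desired(queue_length, messages_per_pod=1):
    """Tier 1: one pod per queued message, capped at the ceiling."""
    return min(MAX_REPLICAS, math.ceil(queue_length / messages_per_pod))


def reactive_min_replicas(active_jobs, current_pods, current_min):
    """Tier 2: if jobs exceed pods, raise the floor to the job count."""
    if active_jobs > current_pods:
        return max(current_min, min(MAX_REPLICAS, active_jobs))
    return current_min


def baseline_min_replicas(peak_samples):
    """Tier 3: new floor is the peak over the window, clamped to [3, 12]."""
    peak = max(peak_samples, default=0)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, peak))


print(keda_desired(5), keda_desired(40))
print(reactive_min_replicas(active_jobs=7, current_pods=4, current_min=3))
print(baseline_min_replicas([1, 2, 9, 4]), baseline_min_replicas([]))
```

The clamp in Tier 3 is what keeps the floor at 3 during quiet periods and stops a one-off spike from pushing it past 12.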

Why 3 Tiers?

  • KEDA reacts to queue depth in seconds
  • Reactive Monitor catches spikes the queue misses
  • Baseline Adjuster prevents unnecessary scale-downs

Mechanism: Function App → Azure Management REST API → kubectl patch. A SemaphoreSlim(1,1) in the Function App serializes the patch calls, which prevents race conditions.

Result: Peak → scale up. Off-peak → 3 nodes. Zero GPU waste.


7. End-to-End: The Complete Request Flow

Here's exactly what happens when a request hits the system:

  1. CLIENT → sends images + parameters to API
  2. .NET API GATEWAY → validates auth, checks credits
  3. API → creates task in MongoDB + enqueues to Service Bus → returns 202 + task ID instantly
  4. FUNCTION APP → picks up message → checks GPU capacity
  5. GPU POD → processes the job:
    • Downloads input files
    • Runs detection
    • Runs inference (ONNX GPU)
    • Enhances output quality
    • Returns stream
  6. FUNCTION APP → receives result → CDN upload → DB status: finished
  7. CLIENT → polls GET /result/{taskId} → gets CDN URL
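
From the client's side, step 7 is a polling loop. The sketch below assumes a response shape the article does not specify: a `status` field (with "finished" as the terminal state from step 6, plus assumed "processing" and "failed" values) and a `url` field for the CDN link. `fetch_status` is an injected callable standing in for the HTTP call to GET /result/{taskId}.

```python
import time


def poll_result(fetch_status, task_id, interval_s=2.0, timeout_s=120.0,
                sleep=time.sleep, clock=time.monotonic):
    """Poll until the task finishes; return the result URL."""
    deadline = clock() + timeout_s
    while True:
        body = fetch_status(task_id)
        if body["status"] == "finished":
            return body["url"]
        if body["status"] == "failed":
            raise RuntimeError(f"task {task_id} failed")
        if clock() >= deadline:
            raise TimeoutError(f"task {task_id} not finished after {timeout_s}s")
        sleep(interval_s)


# Fake backend: two "processing" answers, then the finished result.
responses = iter([
    {"status": "processing"},
    {"status": "processing"},
    {"status": "finished", "url": "https://cdn.example.com/result.png"},
])
waits = []  # records the requested sleep intervals instead of sleeping
url = poll_result(lambda task_id: next(responses), "task-123", sleep=waits.append)
print(url, waits)
```

Polling costs a few cheap reads per job, while the expensive GPU work is submitted exactly once.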

Total time: 15–30 seconds.

Background Jobs Running Continuously

  • KEDA watches queue depth
  • Reactive Monitor checks every 30s
  • Baseline Adjuster recalculates every 3h

The entire pipeline is model-agnostic. Swap the weights, adjust the endpoint, everything else stays. First model: 2 weeks to deploy. Second model: 2 days. Same Dockerfile. Same manifests. Different weights.


Conclusion

Building a production AI service on Kubernetes isn't just about getting a model to run. It's about building a system that handles failures gracefully, scales precisely, and doesn't burn money on idle GPUs.

The key architectural decisions:

  • Separate GPU node pools with proper autoscaler configuration
  • Single-worker, semaphore-guarded API containers
  • Models baked into Docker images at build time
  • Generous probe delays that respect GPU initialization time
  • Async queue architecture that eliminates timeout cascades
  • 3-tier scaling that reacts in seconds and optimizes over hours

Build the pipeline once. Reuse it forever.