Deploying a Production-Grade AI Service on Azure Kubernetes with GPU

Job preferences

Utku Olcar

is open to work

Job titles

Robotics Engineer · Robotics Software Engineer · Robotics Specialist · Mechatronics Engineer · Embedded Software Engineer

Start date

Immediately, I'm actively applying

Job types

Full-time (I can work remotely)

If you are hiring, leave an email address, please.

Deploying a Production-Grade AI Service on Azure Kubernetes with GPU

Details

Date : March 24, 2026
Category : Cloud & DevOps

🇹🇷 Türkçe Oku

A comprehensive guide covering AKS GPU cluster setup, Python API architecture, Docker GPU builds, Kubernetes deployment, async queue with Service Bus, 3-tier auto-scaling with KEDA, and the complete end-to-end request flow.

Visual Overview

Architecture Flow Diagrams

Deploying AI models to production is fundamentally different from running them in a notebook. GPU memory management, container orchestration, async processing, and auto-scaling all introduce challenges that most tutorials skip entirely.

This article is a comprehensive, end-to-end guide based on a real production system. It covers everything from AKS cluster creation to the moment a processed result lands in the user's hands — seven layers deep.

1. Setting Up AKS for GPU Workloads

Part 1/7 - Setting Up AKS for GPU Workloads

The most critical decision when creating an AKS cluster for AI workloads is the node pool strategy. GPU workloads must run on a dedicated user pool — running GPUs in the system pool wastes money and creates resource contention.

Cluster Configuration

VM: Standard_NC8as_T4_v3 (NVIDIA T4, 16 GB VRAM)
Min nodes: 3 (prevents cold start — GPU spin-up takes 3–5 minutes)
Max nodes: 12 (cost ceiling)
Region: France Central (GPU availability varies by region)
Autoscaler: Cluster autoscaler with least-waste expander

The least-waste expander picks the node type that wastes the fewest resources. When you have multiple VM options, it chooses the most efficient one.

Common Pitfalls

GPU nodes aren't available in every region. Run az vm list-skus before you commit.
Spot instances for GPU? Risky. A preempted node mid-inference means lost work.
Min 3 nodes is not optional. Users won't wait 3–5 minutes for a cold GPU to spin up.

2. Turning an AI Model into a Production API

Part 2/7 - Turning an AI Model into a Production API

We use FastAPI + Uvicorn — async support, auto-generated OpenAPI docs, and faster than Flask. But the framework isn't where people fail. It's the GPU concurrency model.

Four Critical Decisions

Single Worker: `uvicorn --workers 1`

GPU memory isn't thread-safe. Concurrent inference causes memory collisions and crashes. Scale with more containers, not more threads.

Semaphore(1)

Only one inference per container at a time. A second request waits. Simple, predictable, stable.

Model Loading at Startup

Load models once at container start, not per request. ONNX Runtime GPU gives 30–40% faster inference compared to raw PyTorch.

Temp File Management

UUID-based file names per request with cleanup after processing. Skip this and /tmp fills up — the container dies silently.

Health Endpoints — Keep Them Separate

/work → Liveness (is the container alive?)
/health → Readiness (is it busy or free?)

If you combine them, Kubernetes can't distinguish dead from busy. It will restart healthy containers mid-inference.

3. Building a GPU Docker Container

Part 3/7 - Building a GPU Docker Container

A GPU container is not a normal Docker image. CUDA, cuDNN, PyTorch, ONNX Runtime — all interdependent. Version mismatch is the #1 failure mode.

Base Image and Dependencies

Base: nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04
Python: 3.10
PyTorch: 2.0.1+cu118
ONNX Runtime GPU: Separate install (pip default is CPU-only!)

CUDA Version Mismatch — Why It Works

CUDA 11.8 PyTorch on a CUDA 12.1 runtime? Yes. PyTorch bundles its own CUDA libraries. The host runtime handles driver-level operations. They coexist. This trips up a lot of people.

The Most Important Rule: Download Models at Build Time

Add RUN python download.py in your Dockerfile. Bake model weights into the image.

Without: Every pod spends 2–5 min downloading before serving.
With: Cold start drops to under 10 seconds.

Image size will be 12–18 GB. Normal for GPU containers. Multi-stage builds help remove build tools, but GPU libraries are large. Don't fight it.

docker build --platform linux/amd64 -t registry/app:latest .

4. GPU-Specific Kubernetes Deployment

Part 4/7 - GPU-Specific Kubernetes Config

Your readiness probe fires at second 10. Your model finishes loading at second 55. Traffic arrives. Pod crashes. This is where most GPU deployments fail.

Resource Requests

nvidia.com/gpu: 1 — guarantee one GPU per pod
Memory: 4Gi request, 8Gi limit
nodeSelector: target your GPU node pool

Probes (Where Most People Fail)

Readiness: GET /work, initialDelay: 60s
Liveness: GET /work, initialDelay: 120s, period: 30s

Why 60s readiness? The model needs time to load into GPU memory. An early probe routes traffic to an unready pod — instant 500 errors. Why 120s liveness? Same reason, plus a buffer. Don't let Kubernetes kill a healthy pod that's still initializing.

Other Critical Settings

Rolling update: maxSurge 25%, maxUnavailable 25%
terminationGracePeriod: 30s (let in-flight requests finish)
tmpfs volume: emptyDir for fast temp file I/O
Service: LoadBalancer, port 80 → 8000

docker push registry/app:latest
kubectl rollout restart deployment/app
kubectl rollout status deployment/app

5. Why Synchronous HTTP Fails — Service Bus Architecture

User sends a request. Waits 30 seconds. Times out. Retries. Now you have 3 identical jobs burning GPU resources.

GPU inference takes 10–30 seconds. Synchronous HTTP is the wrong pattern.

The Problem

Client sends HTTP request
GPU processes for 15–25 seconds
Client timeout (usually 30s)
Client retries automatically
2–3 identical jobs running simultaneously
GPU waste + duplicate results

The Fix: Azure Service Bus + Async Queue

Client sends request → instant 202 Accepted + task ID
Job goes to Service Bus queue
Function App picks it up → sends to GPU
Client polls with task ID until done

Why Service Bus?

Guaranteed delivery — messages don't get lost
KEDA integration — 1 message = 1 pod request
Dead-letter queue — failed jobs auto-separated
Peek lock — message locked during processing

The queue also provides natural backpressure. When GPUs are overwhelmed, jobs wait in the queue instead of crashing pods.

6. 3-Tier Auto-Scaling: Zero GPU Waste

A single HPA doesn't cut it for GPU workloads. We built a 3-tier system orchestrated by an Azure Function App — not just a queue consumer, but the brain of the entire operation.

Tier 1 — KEDA (Event-Driven)

Queue depth triggers pods. 1 message = 1 pod request. Response time: seconds.

Tier 2 — Reactive Monitor (Every 30s)

Reads active job count from the database
If jobs > pods → immediately patches KEDA minReplica
New nodes spin up within minutes
Updates daily peak metric

Tier 3 — Baseline Adjuster (Every 3h)

Analyzes peak load over the last 3 hours
Calculates new baseline (min 3, max 12 nodes)
Updates KEDA minReplicaCount via kubectl patch

Why 3 Tiers?

KEDA reacts to queue depth in seconds
Reactive Monitor catches spikes the queue misses
Baseline Adjuster prevents unnecessary scale-downs

Mechanism: Function App → Azure Management REST API → kubectl patch. SemaphoreSlim(1,1) prevents race conditions.

Result: Peak → scale up. Off-peak → 3 nodes. Zero GPU waste.

7. End-to-End: The Complete Request Flow

Here's exactly what happens when a request hits the system:

CLIENT → sends images + parameters to API
.NET API GATEWAY → validates auth, checks credits
API → creates task in MongoDB + enqueues to Service Bus → returns 202 + task ID instantly
FUNCTION APP → picks up message → checks GPU capacity
GPU POD → processes the job:
- Downloads input files
- Runs detection
- Runs inference (ONNX GPU)
- Enhances output quality
- Returns stream
FUNCTION APP → receives result → CDN upload → DB status: finished
CLIENT → polls GET /result/{taskId} → gets CDN URL

Total time: 15–30 seconds.

Background Jobs Running Continuously

KEDA watches queue depth
Reactive Monitor checks every 30s
Baseline Adjuster recalculates every 3h

The entire pipeline is model-agnostic. Swap the weights, adjust the endpoint, everything else stays. First model: 2 weeks to deploy. Second model: 2 days. Same Dockerfile. Same manifests. Different weights.

Conclusion

Building a production AI service on Kubernetes isn't just about getting a model to run. It's about building a system that handles failures gracefully, scales precisely, and doesn't burn money on idle GPUs.

The key architectural decisions:

Separate GPU node pools with proper autoscaler configuration
Single-worker, semaphore-guarded API containers
Models baked into Docker images at build time
Generous probe delays that respect GPU initialization time
Async queue architecture that eliminates timeout cascades
3-tier scaling that reacts in seconds and optimizes over hours

Build the pipeline once. Reuse it forever.

Job preferences

Utku Olcar

is open to work

Job titles

Start date

Job types

If you are hiring, leave an email address, please.

Deploying a Production-Grade AI Service on Azure Kubernetes with GPU

Details

Visual Overview

Architecture Flow Diagrams

1. Setting Up AKS for GPU Workloads

Cluster Configuration

Common Pitfalls

2. Turning an AI Model into a Production API

Four Critical Decisions

Single Worker: uvicorn --workers 1

Semaphore(1)

Model Loading at Startup

Temp File Management

Health Endpoints — Keep Them Separate

3. Building a GPU Docker Container

Base Image and Dependencies

CUDA Version Mismatch — Why It Works

The Most Important Rule: Download Models at Build Time

4. GPU-Specific Kubernetes Deployment

Resource Requests

Probes (Where Most People Fail)

Other Critical Settings

5. Why Synchronous HTTP Fails — Service Bus Architecture

The Problem

The Fix: Azure Service Bus + Async Queue

Why Service Bus?

6. 3-Tier Auto-Scaling: Zero GPU Waste

Tier 1 — KEDA (Event-Driven)

Tier 2 — Reactive Monitor (Every 30s)

Tier 3 — Baseline Adjuster (Every 3h)

Why 3 Tiers?

7. End-to-End: The Complete Request Flow

Background Jobs Running Continuously

Conclusion

Single Worker: `uvicorn --workers 1`