
# O5: AI Infrastructure

AI workloads are memory-bound, not CPU-bound. This single fact changes everything about how you provision, scale, and budget infrastructure. This module covers the hardware, sizing, and deployment patterns specific to AI. For the platform that manages AI deployments, see O4: Azure AI Foundry. For cost optimization strategies, see the Cost Optimization WAF pillar.

:::warning The Biggest Mistake
Treating AI inference like a traditional web service. Web apps scale with CPU cores and RAM. AI inference scales with VRAM, a completely different resource with different provisioning rules, costs, and constraints.
:::

## Traditional App vs AI Workload

| Dimension | Traditional Web App | AI Inference | AI Training |
|---|---|---|---|
| Bottleneck | CPU, network I/O | VRAM, memory bandwidth | VRAM, interconnect |
| Scaling unit | CPU cores / instances | GPU VRAM | Multi-GPU clusters |
| Latency | 10–100 ms | 500 ms–10 s (depends on token count) | Hours to weeks |
| Cost driver | Compute hours | GPU hours + token count | GPU-hours × cluster size |
| State | Mostly stateless | KV-cache per request | Checkpoints, gradients |
| Cold start | Milliseconds | Seconds to minutes (model loading) | N/A (long-running) |

## CPU vs GPU vs TPU vs NPU

| Processor | Design | AI Role | Analogy |
|---|---|---|---|
| CPU | Few powerful cores, general-purpose | Preprocessing, light inference | Master chef (1 person, any dish) |
| GPU | Thousands of simple cores, parallel math | Training + inference | Kitchen brigade (1,000 cooks, one recipe) |
| TPU | Google's custom AI chip, matrix-optimized | Google Cloud training/inference | Purpose-built pasta machine |
| NPU | On-device neural processor | Edge AI, mobile inference | Microwave (quick, limited, local) |

For AI workloads, the GPU is the default choice. CPUs suffice for small models (under ~1B parameters) or when latency isn't critical.

## NVIDIA GPU Lineup for AI

| GPU | VRAM | Memory BW | FP16 TFLOPS | Best For | Azure VM |
|---|---|---|---|---|---|
| A10 | 24 GB | 600 GB/s | 125 | Small models, inference | NC A10 v4 |
| A100 | 40/80 GB | 2 TB/s | 312 | Training + large inference | ND A100 v4 |
| H100 | 80 GB | 3.35 TB/s | 990 | Large-scale training + inference | ND H100 v5 |
| H200 | 141 GB | 4.8 TB/s | 990 | Very large models, high throughput | Coming |
| B200 | 192 GB | 8 TB/s | 2,250 | Next-gen training | Coming |
:::info
Memory bandwidth matters as much as VRAM capacity. A model that fits in VRAM but starves on bandwidth will be slow. The H100's 3.35 TB/s bandwidth is why it dominates inference.
:::
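The bandwidth point can be made concrete with a back-of-envelope model: during autoregressive decoding, every generated token streams all model weights from VRAM once, so memory bandwidth caps single-stream throughput. A rough sketch using figures from the table above (this is the standard memory-bound estimate; it ignores KV-cache reads and batching, so real throughput is lower):

```python
def max_decode_tokens_per_sec(bandwidth_gb_s: float, params_b: float,
                              bytes_per_param: float) -> float:
    """Upper bound on single-stream decode throughput for a memory-bound LLM.

    Each generated token must read all weights from VRAM once, so:
    tokens/s <= memory bandwidth / model size in bytes.
    """
    model_size_gb = params_b * bytes_per_param
    return bandwidth_gb_s / model_size_gb

# A 70B model in FP16 (140 GB of weights) on an H100 (3,350 GB/s):
print(round(max_decode_tokens_per_sec(3350, 70, 2)))  # ~24 tokens/s ceiling
```

The same model quantized to INT8 halves the bytes read per token, which is why quantization speeds up inference even when compute is unchanged.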

## VRAM Sizing Formula

VRAM (GB) ≈ Parameters (billions) × Bytes per parameter

| Precision | Bytes/Param | 7B Model | 13B Model | 70B Model |
|---|---|---|---|---|
| FP32 | 4 | 28 GB | 52 GB | 280 GB |
| FP16 / BF16 | 2 | 14 GB | 26 GB | 140 GB |
| INT8 | 1 | 7 GB | 13 GB | 70 GB |
| INT4 | 0.5 | 3.5 GB | 6.5 GB | 35 GB |

Add ~20% overhead for KV-cache, activations, and framework memory.
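The formula plus the overhead factor can be wrapped in a small helper (a sketch; the 20% overhead is the rule of thumb from this section, not a measured value):

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: float,
                     overhead: float = 0.20) -> float:
    """Estimate serving VRAM: weights plus ~20% for KV-cache,
    activations, and framework memory."""
    weights_gb = params_billions * bytes_per_param
    return weights_gb * (1 + overhead)

# 70B model served in INT8: 70 GB of weights + 20% overhead
print(estimate_vram_gb(70, 1))  # 84.0 -> overflows a single 80 GB GPU
```

Note that the overhead is what pushes a 70B INT8 model past a single 80 GB card; tight fits depend on how much KV-cache the workload actually needs.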

## Precision Formats

| Format | Bits | Quality | Speed | When to Use |
|---|---|---|---|---|
| FP32 | 32 | Perfect | Slowest | Training (loss calculation) |
| FP16 | 16 | Near-perfect | 2× faster | Standard inference |
| BF16 | 16 | Near-perfect (better range than FP16) | 2× faster | Training + inference (preferred) |
| INT8 | 8 | Minimal loss | 4× faster | Production inference |
| INT4 | 4 | Some quality loss | 8× faster | Edge/mobile, cost-optimized inference |
:::tip
For most production workloads: train in BF16, serve in INT8. That halves memory versus BF16 (a 4× saving versus FP32) with minimal quality loss. Always benchmark against your evaluation suite before deploying quantized models (see O4: Azure AI Foundry).
:::

## PTU vs PAYG on Azure OpenAI

| Dimension | PAYG (Pay-As-You-Go) | PTU (Provisioned Throughput Units) |
|---|---|---|
| Billing | Per 1K tokens | Fixed monthly per PTU |
| Latency | Variable (shared infra) | Consistent (reserved capacity) |
| Throughput | Rate-limited (tokens/min) | Guaranteed minimum |
| Cost at scale | Linear with usage | Flat; cheaper above breakeven |
| Best for | Dev/test, variable workloads | Production with predictable volume |
| Commitment | None | Monthly or annual reservation |

Breakeven rule of thumb: if you consistently spend >$5K/month on PAYG for a single model, evaluate PTU.
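The breakeven comparison is simple arithmetic: PAYG cost grows linearly with token volume while PTU cost is flat, so PTU wins once volume pushes PAYG past the reservation price. A sketch (all prices here are illustrative placeholders, not actual Azure rates; check the Azure pricing calculator for current figures):

```python
def monthly_paygo_cost(tokens_per_month: float, price_per_1k_tokens: float) -> float:
    """PAYG cost scales linearly with token volume."""
    return tokens_per_month / 1_000 * price_per_1k_tokens

def ptu_is_cheaper(tokens_per_month: float, price_per_1k_tokens: float,
                   ptu_monthly_cost: float) -> bool:
    """PTU wins once its flat reserved cost drops below the linear PAYG cost."""
    return ptu_monthly_cost < monthly_paygo_cost(tokens_per_month, price_per_1k_tokens)

# Illustrative: 500M tokens/month at a hypothetical $0.015 per 1K tokens
# is $7,500 PAYG, versus a hypothetical $6,000/month PTU reservation.
print(ptu_is_cheaper(500_000_000, 0.015, 6_000))  # True -> evaluate PTU
```

Beyond raw cost, remember from the table that PTU also buys consistent latency, which can justify the reservation even near breakeven.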

## GPU Sizing Guide

| Model Size | Min VRAM (INT8) | Recommended GPU | Azure VM Series | Est. Cost/hr |
|---|---|---|---|---|
| 1–3B | 3 GB | T4 (16 GB) | NC4as T4 v3 | ~$0.53 |
| 7B | 7 GB | A10 (24 GB) | NC A10 v4 | ~$0.91 |
| 13B | 13 GB | A10 (24 GB) | NC A10 v4 | ~$0.91 |
| 34B | 34 GB | A100 40 GB | ND A100 v4 | ~$3.67 |
| 70B | 70 GB | A100 80 GB | ND A100 v4 | ~$3.67 |
| 70B (fast) | 70 GB | H100 80 GB | ND H100 v5 | ~$10.32 |
| 405B | ~200 GB (INT4) | Multi-GPU H100 | ND H100 v5 ×4 | ~$41.28 |
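The sizing logic behind the table can be expressed as a small lookup (a sketch combining the INT8 byte count with the 20% overhead from the VRAM formula; GPU names and VRAM figures come from the tables above, everything else is illustrative):

```python
# (GPU name, VRAM in GB) - single-GPU options from the lineup table above
GPUS = [("A10", 24), ("A100 40GB", 40), ("A100 80GB", 80), ("H100 80GB", 80)]

def pick_gpu(params_billions: float, bytes_per_param: float = 1.0) -> str:
    """Pick the smallest single GPU whose VRAM covers weights + 20% overhead.

    Defaults to INT8 (1 byte/param), the recommended serving precision.
    """
    required_gb = params_billions * bytes_per_param * 1.2
    for name, vram_gb in GPUS:
        if vram_gb >= required_gb:
            return name
    return "multi-GPU required"

print(pick_gpu(7))   # A10: 7 GB of weights * 1.2 = 8.4 GB fits in 24 GB
print(pick_gpu(34))  # A100 80GB: 34 * 1.2 = 40.8 GB overflows the 40 GB card
print(pick_gpu(70))  # multi-GPU required: 70 * 1.2 = 84 GB > 80 GB
```

Note that once the 20% overhead is applied, 34B and 70B models overflow the cards that fit their bare INT8 weights; in practice, tight fits like the table's 70B-on-80GB row depend on keeping KV-cache and framework overhead small.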

## Container Patterns for AI

### AKS with GPU Node Pools

```yaml
# Pod requesting a single GPU on an AKS GPU node pool
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1  # Request 1 GPU
      env:
        - name: MODEL
          value: "meta-llama/Llama-3.1-8B-Instruct"
  nodeSelector:
    kubernetes.io/os: linux
    gpu-type: a100
```
### Container Apps (Serverless GPU, Preview)

Azure Container Apps now supports GPU workloads without managing node pools, which makes it ideal for inference endpoints with variable traffic.

| Pattern | AKS + GPU | Container Apps GPU | Azure OpenAI |
|---|---|---|---|
| Control | Full Kubernetes | Managed serverless | Fully managed |
| Scaling | Manual/KEDA | Automatic | Automatic |
| Model choice | Any model | Any model | OpenAI catalog |
| Cost model | VM hours | Per-second GPU | Per-token |
| Complexity | High | Medium | Low |

## Key Takeaways

1. AI workloads are VRAM-bound: plan infrastructure around GPU memory, not CPU
2. Use the VRAM formula: Params (B) × Bytes/Param + 20% overhead
3. Train in BF16, serve in INT8: the sweet spot for quality vs cost
4. PTU beats PAYG above ~$5K/month for consistent workloads
5. Start with Azure OpenAI (managed); graduate to AKS + GPU only when you need custom models or full control
6. Always benchmark quantized models against your evaluation suite before deploying