Play 12: Model Serving AKS ⚙️

Deploy and serve LLMs on AKS with GPU nodes, vLLM, and auto-scaling.

Host your own models on Kubernetes. AKS with NVIDIA GPU node pools runs vLLM for high-throughput inference, with auto-scaling based on GPU utilization, health checks, and rolling deployments. Quantized models (GPTQ, AWQ) are supported for cost efficiency.

Quick Start

    cd solution-plays/12-model-serving-aks
    # Provision AKS with GPU node pool
    az deployment group create -g $RG -f infra/main.bicep -p infra/parameters.json
    az aks get-credentials --resource-group $RG --name $CLUSTER
    code .
    # Use @builder for K8s/GPU, @reviewer for security audit, @tuner for cost
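The Quick Start commands read `$RG` and `$CLUSTER` without defining them, so an empty value only surfaces as an `az` error. A minimal preflight sketch, assuming both are plain shell variables set by the operator; `require_vars` is a hypothetical helper, not part of this play:

```shell
# require_vars NAME...
# Hypothetical helper: returns non-zero if any named variable is unset or empty.
require_vars() {
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
}

# Usage (run before the az commands):
#   require_vars RG CLUSTER && echo "ready to deploy"
```

Running the check first keeps a missing resource group or cluster name from turning into a partially applied deployment.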

Architecture

πŸ“ Full architecture details

| Service | Purpose |
|---|---|
| AKS (GPU nodes) | Kubernetes cluster with NVIDIA A100/V100 |
| vLLM / TGI | High-throughput LLM inference serving |
| Azure Container Registry | Model container image storage |
| HPA + Cluster Autoscaler | Auto-scaling on GPU utilization |
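To make the HPA row concrete: the Kubernetes Horizontal Pod Autoscaler computes its target as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below applies that formula to average GPU utilization percentages. `desired_replicas` is a hypothetical helper, the 80% target is an illustrative value inside the play's 70-90% range, and the sketch ignores the HPA tolerance band and min/max replica bounds. How a GPU utilization metric is exposed to the HPA is not described here, so that part is assumed to be already in place.

```shell
# desired_replicas CURRENT_REPLICAS CURRENT_UTIL_PCT TARGET_UTIL_PCT
# Integer form of ceil(current_replicas * current_util / target_util).
desired_replicas() {
  replicas=$1
  current=$2
  target=$3
  echo $(( (replicas * current + target - 1) / target ))
}

# Illustrative inputs: 2 vLLM pods averaging 95% GPU utilization, 80% target.
desired_replicas 2 95 80
```

When the HPA asks for more pods than the GPU nodes can hold, the extra pods stay pending until the Cluster Autoscaler adds a node, which is why the two are listed together.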

Key Performance Targets

  • Throughput: ≥500 tokens/sec per GPU
  • TTFT (time to first token): <500 ms
  • GPU utilization: 70-90%
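These targets can be turned into a pass/fail gate for a load test. A minimal sketch, assuming throughput, TTFT, and GPU utilization have already been measured by some benchmark tool; `meets_targets` and the sample numbers are hypothetical:

```shell
# meets_targets TOKENS_PER_SEC_PER_GPU TTFT_MS GPU_UTIL_PCT
# Succeeds only if all three targets from this play are met.
meets_targets() {
  awk -v tps="$1" -v ttft="$2" -v util="$3" 'BEGIN {
    ok = (tps >= 500) && (ttft < 500) && (util >= 70) && (util <= 90)
    exit (ok ? 0 : 1)
  }'
}

# Illustrative measurement: 620 tokens/sec, 310 ms TTFT, 82% utilization.
meets_targets 620 310 82 && echo "targets met"
```

Utilization above 90% fails the check on purpose: the target is a band, and a saturated GPU leaves no headroom while new replicas start.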

GPU Options

| VM | GPU | VRAM | Best For | Cost/hr |
|---|---|---|---|---|
| NC6s_v3 | 1× V100 | 16GB | Models <7B | ~$3.06 |
| NC24ads_A100_v4 | 1× A100 | 80GB | Models 7B-70B | ~$3.67 |
| ND96asr_v4 | 8× A100 | 320GB | Models 70B+ | ~$27.20 |
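A rough way to read the "Best For" column is to estimate weight memory as parameters × bits per weight / 8. The sketch below does that arithmetic. `weight_gb` is a hypothetical helper, and the 4-bit figure assumes the common 4-bit GPTQ/AWQ configuration. The estimate covers weights only: KV cache, activations, and quantization metadata need additional headroom.

```shell
# weight_gb PARAMS_BILLION BITS_PER_WEIGHT
# Approximate weight memory in GB (1e9 params * bits / 8 bytes).
weight_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weight_gb 7 16    # 7B model in FP16
weight_gb 70 16   # 70B model in FP16
weight_gb 70 4    # 70B model, 4-bit quantized (assumed GPTQ/AWQ setting)
```

Under these assumptions a 70B model needs about 140 GB in FP16 but about 35 GB at 4 bits, which is how a quantized 70B model can fit on a single 80GB A100.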

DevKit (GPU/Kubernetes-Focused)

| Primitive | What It Does |
|---|---|
| 3 agents | Builder (AKS/GPU/vLLM), Reviewer (limits/security/health), Tuner (GPU util/spot/cost) |
| 3 skills | Deploy (142 lines), Evaluate (101 lines), Tune (112 lines) |
| 4 prompts | /deploy (GPU cluster), /test (inference endpoints), /review (security), /evaluate (throughput) |

Note: This is a GPU infrastructure + ML serving play. TuneKit covers quantization selection, vLLM config, K8s scaling rules, spot node pools, and cost per inference token, not AI quality metrics.
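Cost per inference token follows from the hourly VM price and sustained throughput: cost per million tokens = hourly cost / (tokens/sec × 3600) × 1,000,000. A minimal sketch using the hourly prices from the GPU Options table and the 500 tokens/sec per GPU target; `cost_per_million_tokens` is a hypothetical helper, and the result assumes the node is fully busy for the whole hour:

```shell
# cost_per_million_tokens HOURLY_USD TOKENS_PER_SEC
# USD per one million generated tokens at a sustained throughput.
cost_per_million_tokens() {
  awk -v c="$1" -v t="$2" 'BEGIN { printf "%.2f\n", c / (t * 3600) * 1000000 }'
}

cost_per_million_tokens 3.67 500    # NC24ads_A100_v4, 1 GPU at 500 tokens/sec
cost_per_million_tokens 27.20 4000  # ND96asr_v4, 8 GPUs at 500 tokens/sec each
```

Idle time raises the effective figure, so the 70-90% GPU utilization target and spot node pools both bear directly on this number.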

Cost Estimate

| Service | Dev/PoC | Production | Enterprise |
|---|---|---|---|
| Azure Kubernetes Service | $550/mo | $3,200/mo | $9,500/mo |
| Azure Container Registry | $5/mo | $20/mo | $50/mo |
| Azure OpenAI | $30/mo | $200/mo | $800/mo |
| Blob Storage | $5/mo | $30/mo | $100/mo |
| Azure Monitor | $0/mo | $40/mo | $120/mo |
| Key Vault | $1/mo | $3/mo | $10/mo |
| Virtual Network | $0/mo | $35/mo | $80/mo |
| Load Balancer | $0/mo | $25/mo | $60/mo |
| Total | $591/mo | $3,553/mo | $10,720/mo |

💰 Full cost breakdown

📖 Full docs · 🌐 frootai.dev/solution-plays/12-model-serving-aks

FAI Manifest

| Field | Value |
|---|---|
| Play | 12-model-serving-aks |
| Version | 1.0.0 |
| Knowledge | O5-GPU-Infra, T1-Fine-Tuning-MLOps, T3-Production-Patterns |
| WAF Pillars | security, reliability, cost-optimization, performance-efficiency |
| Groundedness | ≥ 85% |
| Safety | 0 violations max |