Play 12 β Model Serving AKS βοΈ
Deploy and serve LLMs on AKS with GPU nodes, vLLM, and auto-scaling.
Host your own models on Kubernetes. AKS with NVIDIA GPU node pools runs vLLM for high-throughput inference. Auto-scaling based on GPU utilization, health checks, and rolling deployments. Supports quantized models (GPTQ, AWQ) for cost efficiency.
Quick Start
cd solution-plays/12-model-serving-aks
# Provision AKS with GPU node pool
az deployment group create -g $RG -f infra/main.bicep -p infra/parameters.json
az aks get-credentials --resource-group $RG --name $CLUSTER
code . # Use @builder for K8s/GPU, @reviewer for security audit, @tuner for costArchitecture
| Service | Purpose |
|---|---|
| AKS (GPU nodes) | Kubernetes cluster with NVIDIA A100/V100 |
| vLLM / TGI | High-throughput LLM inference serving |
| Azure Container Registry | Model container image storage |
| HPA + Cluster Autoscaler | Auto-scaling on GPU utilization |
Key Performance Targets
- Throughput: β₯500 tokens/sec per GPU Β· TTFT: <500ms Β· GPU utilization: 70-90%
GPU Options
| VM | GPU | VRAM | Best For | Cost/hr |
|---|---|---|---|---|
| NC6s_v3 | 1Γ V100 | 16GB | Models <7B | ~$3.06 |
| NC24ads_A100_v4 | 1Γ A100 | 80GB | Models 7B-70B | ~$3.67 |
| ND96asr_v4 | 8Γ A100 | 640GB | Models 70B+ | ~$27.20 |
DevKit (GPU/Kubernetes-Focused)
| Primitive | What It Does |
|---|---|
| 3 agents | Builder (AKS/GPU/vLLM), Reviewer (limits/security/health), Tuner (GPU util/spot/cost) |
| 3 skills | Deploy (142 lines), Evaluate (101 lines), Tune (112 lines) |
| 4 prompts | /deploy (GPU cluster), /test (inference endpoints), /review (security), /evaluate (throughput) |
Note: This is a GPU infrastructure + ML serving play. TuneKit covers quantization selection, vLLM config, K8s scaling rules, spot node pools, and cost per inference token β not AI quality metrics.
Cost Estimate
| Service | Dev/PoC | Production | Enterprise |
|---|---|---|---|
| Azure Kubernetes Service | $550/mo | $3,200/mo | $9,500/mo |
| Azure Container Registry | $5/mo | $20/mo | $50/mo |
| Azure OpenAI | $30/mo | $200/mo | $800/mo |
| Blob Storage | $5/mo | $30/mo | $100/mo |
| Azure Monitor | $0/mo | $40/mo | $120/mo |
| Key Vault | $1/mo | $3/mo | $10/mo |
| Virtual Network | $0/mo | $35/mo | $80/mo |
| Load Balancer | $0/mo | $25/mo | $60/mo |
| Total | $591/mo | $3,553/mo | $10,720/mo |
π° Full cost breakdown
π Full docs Β· π frootai.dev/solution-plays/12-model-serving-aksΒ
FAI Manifest
| Field | Value |
|---|---|
| Play | 12-model-serving-aks |
| Version | 1.0.0 |
| Knowledge | O5-GPU-Infra, T1-Fine-Tuning-MLOps, T3-Production-Patterns |
| WAF Pillars | security, reliability, cost-optimization, performance-efficiency |
| Groundedness | β₯ 85% |
| Safety | 0 violations max |
Last updated on