# Play 44: Foundry Local On-Device
On-device AI inference with the Azure AI Foundry Local SDK: hardware-aware model selection, hybrid cloud/local routing, offline caching, complexity-based query classification, and cost optimization through local-first inference.
## Architecture
| Component | Technology | Purpose |
|---|---|---|
| Local Inference | Foundry Local SDK | On-device model loading and inference |
| Local Models | Phi-4, Phi-4-mini, Phi-3-mini | SLMs optimized for device hardware |
| Cloud Fallback | Azure OpenAI (GPT-4o) | Complex queries beyond local capability |
| Complexity Router | Python classifier | Route simple → local, complex → cloud |
| Model Cache | Local disk (~2-8GB) | Cached models for instant offline inference |
| Telemetry | Local JSONL logs | Track local vs cloud usage and costs |
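The complexity router above can be sketched as a small heuristic classifier. This is an illustrative assumption, not the play's actual implementation: the real routing rules live in `config/agents.json`, and the keyword list and length threshold here are made up for the example.

```python
import re

# Hypothetical complexity hints; the real rules live in config/agents.json.
COMPLEX_HINTS = re.compile(
    r"\b(analyze|compare|summarize|multi[- ]step|reason|prove)\b", re.I
)

def classify(query: str, max_local_words: int = 60) -> str:
    """Return 'local' for simple queries, 'cloud' for complex ones."""
    if len(query.split()) > max_local_words or COMPLEX_HINTS.search(query):
        return "cloud"  # beyond local SLM capability -> Azure OpenAI fallback
    return "local"      # handled by the on-device Phi model, $0 API cost

print(classify("What is the capital of France?"))                 # -> local
print(classify("Analyze these ten reports and compare trends."))  # -> cloud
```

A production router would add a confidence score and log each decision to the JSONL telemetry so routing accuracy can be measured against the 85% target.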
Full architecture details
## How It Differs from Related Plays
| Aspect | Play 19 (Edge AI) | Play 44 (Foundry Local) | Play 34 (Edge Deployment) |
|---|---|---|---|
| Runtime | Custom ONNX container | Foundry Local SDK | IoT Hub + ONNX Runtime |
| Devices | IoT/edge devices | Developer PCs + laptops | IoT fleet (sensors, gateways) |
| Model Source | Custom fine-tuned | Foundry model catalog | Custom ONNX models |
| Management | IoT Hub fleet mgmt | Single-device self-managed | IoT Hub device twin |
| Network | Can be intermittent | Local-first, cloud optional | Cloud sync required |
| Use Case | Industrial/IoT | Developer productivity, privacy | Manufacturing, retail |
## DevKit Structure
```
44-foundry-local-on-device/
├── agent.md                     # Root orchestrator with handoffs
├── .github/
│   ├── copilot-instructions.md  # Domain knowledge (<150 lines)
│   ├── agents/
│   │   ├── builder.agent.md     # SDK setup + hybrid router
│   │   ├── reviewer.agent.md    # Hardware compat + offline
│   │   └── tuner.agent.md       # Model selection + cost
│   ├── prompts/
│   │   ├── deploy.prompt.md     # Configure local models
│   │   ├── test.prompt.md       # Test local + fallback
│   │   ├── review.prompt.md     # Audit hardware + offline
│   │   └── evaluate.prompt.md   # Compare local vs cloud
│   ├── skills/
│   │   ├── deploy-foundry-local-on-device/    # SDK setup + model download + router
│   │   ├── evaluate-foundry-local-on-device/  # Quality, latency, cost, offline
│   │   └── tune-foundry-local-on-device/      # Model profiles, router, prompts, cost
│   └── instructions/
│       └── foundry-local-on-device-patterns.instructions.md
├── config/                      # TuneKit
│   ├── openai.json              # Model profiles, cloud fallback
│   ├── guardrails.json          # Offline mode, hardware limits
│   └── agents.json              # Routing rules, fallback config
├── infra/                       # Bicep IaC (cloud fallback only)
│   ├── main.bicep
│   └── parameters.json
└── spec/                        # SpecKit
    └── fai-manifest.json
```

## Quick Start
```
# 1. Install SDK and download models
/deploy

# 2. Test local inference and offline mode
/test

# 3. Audit hardware compatibility
/review

# 4. Compare local vs cloud quality and cost
/evaluate
```

## Key Metrics
| Metric | Target | Description |
|---|---|---|
| Local Accuracy | > 80% | Response correctness for simple queries |
| Quality Parity | > 0.75 | Local quality / cloud quality ratio |
| Local Inference Rate | > 60% | Queries handled locally (free) |
| Offline Success | > 95% | Queries answered without network |
| Routing Accuracy | > 85% | Correct source for query complexity |
| Cost Savings | > 50% | Reduction vs cloud-only inference |
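Local Inference Rate and Cost Savings can be computed straight from the local JSONL telemetry. The record fields (`source`, `cloud_cost_usd`) and the $0.02/query cloud baseline below are illustrative assumptions, not the play's actual schema:

```python
import json

# Hypothetical telemetry lines; field names are illustrative assumptions.
LOG_LINES = [
    '{"source": "local", "cloud_cost_usd": 0.0}',
    '{"source": "local", "cloud_cost_usd": 0.0}',
    '{"source": "cloud", "cloud_cost_usd": 0.02}',
]

records = [json.loads(line) for line in LOG_LINES]
local = sum(1 for r in records if r["source"] == "local")
local_rate = local / len(records)

# Savings vs a cloud-only baseline where every query is billed at cloud price.
cloud_only_cost = len(records) * 0.02  # assumed $0.02/query baseline
actual_cost = sum(r["cloud_cost_usd"] for r in records)
savings = 1 - actual_cost / cloud_only_cost

print(f"local inference rate: {local_rate:.0%}")  # -> 67%
print(f"cost savings: {savings:.0%}")             # -> 67%
```

With a 67% local rate this toy sample already clears both the 60% local-inference and 50% cost-savings targets.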
## Estimated Cost
| Service | Dev/mo | Prod/mo | Enterprise/mo |
|---|---|---|---|
| Azure OpenAI | $30 | $200 | $800 |
| Azure IoT Hub | $0 | $25 | $250 |
| Azure Monitor | $0 | $30 | $100 |
| Blob Storage | $2 | $15 | $50 |
| Azure Container Registry | $5 | $20 | $50 |
| Key Vault | $1 | $5 | $15 |
| Azure Functions | $0 | $10 | $120 |
| Total | $38 | $305 | $1,385 |
Estimates based on Azure retail pricing. Actual costs vary by region, usage, and enterprise agreements.
Full cost breakdown
## WAF Alignment
| Pillar | Implementation |
|---|---|
| Cost Optimization | Local inference = $0 API cost, target 60%+ local rate |
| Performance Efficiency | Hardware-aware model selection, INT4/INT8/FP16 quantization |
| Reliability | Offline capability, graceful degradation, cloud fallback |
| Security | Data stays on device for local queries, no network exposure |
| Operational Excellence | Telemetry logging, model cache management, auto warmup |
| Responsible AI | Same quality standards for local and cloud responses |
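The reliability row (offline capability, graceful degradation, cloud fallback) amounts to a local-first fallback chain. A minimal sketch, where `run_local` and `run_cloud` are hypothetical stand-ins for the real inference calls:

```python
# Local-first inference with graceful degradation; callables are stand-ins.
def answer(query: str, run_local, run_cloud, online: bool) -> tuple[str, str]:
    try:
        return run_local(query), "local"      # data stays on device
    except RuntimeError:                      # model missing, OOM, etc.
        if online:
            return run_cloud(query), "cloud"  # fallback to Azure OpenAI
        return "Local model unavailable offline.", "degraded"

def good_local(q): return "local answer"
def broken_local(q): raise RuntimeError("model not loaded")
def cloud(q): return "cloud answer"

print(answer("hi", good_local, cloud, online=True))    # ('local answer', 'local')
print(answer("hi", broken_local, cloud, online=True))  # ('cloud answer', 'cloud')
print(answer("hi", broken_local, cloud, online=False)) # degraded message
```

Logging the returned source tag to the JSONL telemetry is what feeds the Offline Success and Local Inference Rate metrics above.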
## FAI Manifest
| Field | Value |
|---|---|
| Play | 44-foundry-local-on-device |
| Version | 1.0.0 |
| Knowledge | O5-GPU-Infra, F2-LLM-Selection, T3-Production-Patterns, R3-Deterministic-AI, F1-GenAI-Foundations |
| WAF Pillars | security, reliability, performance-efficiency, cost-optimization |
| Groundedness | ≥ 85% |
| Safety | 0 violations max |