Play 96 — Realtime Voice Agent V2 📞

Next-gen voice AI — WebSocket streaming STT/TTS, function calling mid-conversation, barge-in detection, emotion analysis, multi-language switching.

Build a real-time voice agent with sub-500ms time-to-first-byte. The GPT-4o Realtime API streams responses token by token, and Azure Speech provides streaming STT/TTS with SSML. Barge-in detection lets callers interrupt naturally, prosody-based emotion analysis adapts the agent's tone to the caller's state, and live switching across six languages reconfigures STT/TTS per utterance.
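To make the streaming path more concrete, here is a minimal sketch of phrase-level buffering: streamed text tokens are grouped into short phrases so that speech synthesis can start before the full response exists. The `phrases` function, the punctuation rule, and the `max_chars` limit are illustrative assumptions, not part of the play's code.

```python
import re
from typing import Iterable, Iterator

# Assumption: a phrase ends at sentence or clause punctuation.
_BOUNDARY = re.compile(r"[.!?,;:]\s*$")


def phrases(tokens: Iterable[str], max_chars: int = 80) -> Iterator[str]:
    """Group streamed text tokens into phrases that can be sent to TTS."""
    buf = ""
    for tok in tokens:
        buf += tok
        # Flush on a punctuation boundary, or when the buffer grows too long.
        if _BOUNDARY.search(buf) or len(buf) >= max_chars:
            yield buf.strip()
            buf = ""
    if buf.strip():
        # Flush whatever is left when the token stream ends.
        yield buf.strip()


demo_tokens = ["Sure", ",", " I", " can", " check", " that", ".", " One", " moment"]
demo_phrases = list(phrases(demo_tokens))
print(demo_phrases)  # ['Sure,', 'I can check that.', 'One moment']
```

Shorter phrases reduce the delay before the first audio but can make prosody choppier, so the boundary rule is a tuning decision rather than a fixed choice.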

Quick Start

    cd solution-plays/96-realtime-voice-agent-v2
    az deployment group create -g $RG -f infra/main.bicep -p infra/parameters.json
    code .  # Use @builder to implement, @reviewer to audit, @tuner to optimize

Architecture

📐 Full architecture details

| Service | Purpose |
| --- | --- |
| Azure OpenAI (Realtime) | GPT-4o Realtime API for streaming conversation |
| Azure Speech Service | Streaming STT + Neural TTS with SSML |
| Azure Communication Services | Phone/PSTN integration |
| Azure Redis Cache | Session state + function call cache |
| Azure Content Safety | Real-time content moderation |
| Container Apps | WebSocket server (HTTP/2 + WS) |
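The Azure Speech row lists Neural TTS with SSML, and the play describes adapting the agent's tone to the caller's state. One way this could look is a small helper that wraps reply text in an SSML `prosody` element. The emotion-to-prosody mapping, the `to_ssml` helper, and the default voice name are assumptions chosen for illustration; the play's actual values are not shown in this document.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from detected caller state to prosody adjustments.
TONE = {
    "frustrated": {"rate": "-10%", "pitch": "-5%"},
    "confused": {"rate": "-15%", "pitch": "+0%"},
    "angry": {"rate": "-10%", "pitch": "-10%"},
}
NEUTRAL = {"rate": "+0%", "pitch": "+0%"}


def to_ssml(text: str, emotion: str = "neutral",
            voice: str = "en-US-JennyNeural", lang: str = "en-US") -> str:
    """Wrap reply text in SSML, slowing and lowering speech for tense callers."""
    p = TONE.get(emotion, NEUTRAL)
    return (
        f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">'
        f'<voice name="{voice}">'
        f'<prosody rate="{p["rate"]}" pitch="{p["pitch"]}">'
        f'{escape(text)}'  # escape &, <, > so the SSML stays well-formed
        '</prosody></voice></speak>'
    )


print(to_ssml("I understand, let me fix that right away.", "frustrated"))
```

Because the same function takes `voice` and `lang`, per-utterance language switching could reuse it by passing a different voice and locale for each reply.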

Pre-Tuned Defaults

  • Latency: TTFT < 500ms · STT 200ms · LLM 200ms · TTS 100ms · phrase-level buffering
  • Barge-in: VAD sensitivity 0.5 · 300ms min speech · escalate after 3 interruptions
  • Functions: 4 tools · filler phrases during API calls · 3s timeout · 60s result cache
  • Emotion: Prosody-based (pitch, rate, volume, pauses) · adapt tone on frustrated/confused/angry
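The barge-in defaults above can be sketched as a small per-frame state machine. This is an illustration under stated assumptions: "VAD sensitivity 0.5" is interpreted as a speech-probability threshold, audio arrives in 20 ms frames, and the `BargeInDetector` class and its event names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BargeInDetector:
    vad_threshold: float = 0.5  # assumption: sensitivity == probability threshold
    min_speech_ms: int = 300    # minimum continuous speech before interrupting
    escalate_after: int = 3     # hand off after this many interruptions
    frame_ms: int = 20          # assumption: 20 ms audio frames
    interruptions: int = 0
    _speech_ms: int = 0
    _fired: bool = False

    def on_frame(self, speech_prob: float, agent_speaking: bool) -> str:
        """Return 'none', 'barge_in', or 'escalate' for one VAD frame."""
        if speech_prob < self.vad_threshold or not agent_speaking:
            # Silence, or the agent is not talking: nothing to interrupt.
            self._speech_ms = 0
            self._fired = False
            return "none"
        self._speech_ms += self.frame_ms
        if self._speech_ms < self.min_speech_ms or self._fired:
            return "none"
        # Enough continuous caller speech over the agent: count one interruption.
        self._fired = True
        self.interruptions += 1
        if self.interruptions >= self.escalate_after:
            return "escalate"
        return "barge_in"


detector = BargeInDetector()
events = [detector.on_frame(0.9, agent_speaking=True) for _ in range(15)]
print(events[-1])  # barge_in (15 frames x 20 ms = 300 ms of speech)
```

On `barge_in` the server would stop TTS playback and discard queued audio; on `escalate` it would route the call to a human or a fallback flow.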

DevKit (AI-Assisted Development)

| Primitive | What It Does |
| --- | --- |
| agent.md | Root orchestrator with builder→reviewer→tuner handoffs |
| copilot-instructions.md | Voice V2 domain (streaming, barge-in, function calling, emotion) |
| 3 agents | Builder (gpt-4o), Reviewer (gpt-4o-mini), Tuner (gpt-4o-mini) |
| 3 skills | Deploy (230+ lines), Evaluate (115+ lines), Tune (240+ lines) |
| 4 prompts | /deploy, /test, /review, /evaluate with agent routing |

Cost Estimate

| Service | Dev/mo | Prod/mo | Enterprise/mo |
| --- | --- | --- | --- |
| Azure AI Voice Live | $30 (PAYG) | $400 (PAYG) | $1,500 (Committed) |
| Azure OpenAI | $35 (PAYG) | $500 (PAYG) | $1,800 (PTU Reserved) |
| Container Apps | $15 (Consumption) | $300 (Dedicated) | $900 (Dedicated HA) |
| Azure Functions | $0 (Consumption) | $200 (Premium EP2) | $500 (Premium EP3) |
| Cosmos DB | $5 (Serverless) | $280 (5000 RU/s) | $750 (15000 RU/s) |
| Key Vault | $1 (Standard) | $5 (Standard) | $20 (Premium HSM) |
| Application Insights | $0 (Free) | $50 (Pay-per-GB) | $160 (Pay-per-GB) |
| Total | $86 | $1,735 | $5,630 |

💰 Full cost breakdown

vs. Play 04 (Call Center Voice AI)

| Aspect | Play 04 | Play 96 |
| --- | --- | --- |
| Architecture | Request-response STT→LLM→TTS | Full streaming WebSocket |
| Latency | 2-3s per turn | < 500ms TTFT |
| Barge-in | Not supported | Full duplex with VAD |
| Function calling | N/A | Mid-conversation API calls |
| Emotion | N/A | Prosody-based detection + adaptation |
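Mid-conversation API calls are what most distinguishes Play 96 from a request-response pipeline. A rough sketch of how a tool call might be wrapped, using the 3s timeout, 60s result cache, and filler phrase from the pre-tuned defaults, is shown below. The `ToolRunner` class, the `lookup_order` tool, and the in-memory dict (standing in for the Redis cache) are hypothetical.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable


class ToolRunner:
    """Runs tool calls with a filler phrase, a timeout, and a short-lived cache."""

    def __init__(self, say: Callable[[str], Any],
                 timeout_s: float = 3.0, cache_ttl_s: float = 60.0) -> None:
        self.say = say                  # callback that speaks text to the caller
        self.timeout_s = timeout_s
        self.cache_ttl_s = cache_ttl_s
        self._cache: dict = {}          # in-memory stand-in for a shared cache

    async def call(self, name: str,
                   fn: Callable[..., Awaitable[Any]], *args: Any) -> Any:
        key = (name, args)              # args must be hashable in this sketch
        hit = self._cache.get(key)
        if hit is not None and time.monotonic() - hit[0] < self.cache_ttl_s:
            return hit[1]               # fresh cached result, no filler needed
        self.say("One moment while I check that.")
        try:
            result = await asyncio.wait_for(fn(*args), timeout=self.timeout_s)
        except asyncio.TimeoutError:
            return {"error": "timeout"}  # let the model apologize or retry
        self._cache[key] = (time.monotonic(), result)
        return result


async def lookup_order(order_id: str) -> dict:  # hypothetical tool
    await asyncio.sleep(0.01)                   # simulate a backend round trip
    return {"order_id": order_id, "status": "shipped"}


spoken: list = []
runner = ToolRunner(say=spoken.append)


async def demo() -> tuple:
    first = await runner.call("lookup_order", lookup_order, "A1")
    second = await runner.call("lookup_order", lookup_order, "A1")  # cache hit
    return first, second


first, second = asyncio.run(demo())
```

Timeouts are deliberately not cached here, so a later retry of the same call still reaches the backend.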

📖 Full documentation · 🌐 frootai.dev/solution-plays/96-realtime-voice-agent-v2  · 📦 FAI Protocol

FAI Manifest

| Field | Value |
| --- | --- |
| Play | 96-realtime-voice-agent-v2 |
| Version | 1.0.0 |
| Knowledge | O2-AI-Agents, T3-Production-Patterns, O3-MCP-Tools |
| WAF Pillars | performance-efficiency, reliability, responsible-ai, security |