
Play 36: Multimodal Agent

Unified agent that processes text + images + audio with cross-modal synthesis.

A general-purpose multimodal AI agent that handles any combination of text, images, and audio input. GPT-4o vision analyzes images, Azure Speech transcribes audio, and the agent synthesizes information across modalities to produce coherent responses. Content safety covers all input and output modalities.
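The flow described above (route each input by modality, reduce it to text, check safety, then merge everything into one context for the model) might be sketched roughly as follows. This is a minimal illustration, not the play's actual code: `Part`, `describe_image`, `transcribe_audio`, `is_safe`, and `build_context` are hypothetical names, and the handlers are placeholders standing in for GPT-4o vision, Azure Speech, and Content Safety calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Part:
    modality: str  # "text", "image", or "audio"
    payload: str   # raw text, or a file reference for image/audio (hypothetical)


def describe_image(ref: str) -> str:
    # Placeholder: a real agent would call a vision model here.
    return f"[image description of {ref}]"


def transcribe_audio(ref: str) -> str:
    # Placeholder: a real agent would call a speech-to-text service here.
    return f"[transcript of {ref}]"


# Modality router: every input is reduced to text before synthesis.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "text": lambda s: s,
    "image": describe_image,
    "audio": transcribe_audio,
}


def is_safe(text: str) -> bool:
    # Placeholder for a per-modality content safety check.
    return "BLOCKED" not in text


def build_context(parts: List[Part]) -> str:
    """Route each part to its handler and merge the results into one prompt context."""
    lines = []
    for part in parts:
        handler = HANDLERS.get(part.modality)
        if handler is None:
            raise ValueError(f"unsupported modality: {part.modality}")
        text = handler(part.payload)
        if not is_safe(text):
            raise ValueError(f"{part.modality} input failed the safety check")
        lines.append(f"{part.modality}: {text}")
    return "\n".join(lines)
```

Calling `build_context([Part("text", "What is in this photo?"), Part("image", "photo.png")])` yields a two-line context that could be passed to the model as a single prompt. Keeping the router as a plain dictionary makes it easy to add or remove a modality without touching the synthesis step.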

Quick Start

    cd solution-plays/36-multimodal-agent
    az deployment group create -g $RG -f infra/main.bicep -p infra/parameters.json
    code .
    # Use @builder for multimodal pipeline, @reviewer for safety audit, @tuner for cost

How It Differs from Play 15 (DocProc)

| Aspect | Play 15 (DocProc) | Play 36 (Multimodal Agent) |
|---|---|---|
| Focus | Document processing | Any modality combination |
| Input | PDFs/images (documents) | Text + images + audio + video |
| Output | Structured JSON extraction | Conversational responses |
| Agent type | Batch pipeline | Interactive conversational |

Architecture

| Service | Purpose |
|---|---|
| Azure OpenAI (gpt-4o) | Vision analysis, text generation, cross-modal synthesis |
| Azure Speech Service | Audio transcription (STT) + voice output (TTS) |
| Content Safety | Per-modality content filtering |
| Container Apps | Multimodal agent runtime |

πŸ“ Full architecture details

Key Metrics

  • Image accuracy: ≥85% · Audio WER: <10% · Cross-modal: ≥80% · Safety: 100%
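For readers unfamiliar with the audio metric: word error rate (WER) is the word-level edit distance between a reference transcript and the recognized transcript, divided by the number of reference words. A minimal sketch follows, assuming simple lowercase whitespace tokenization; the play's own evaluation skill may normalize text differently, and `word_error_rate` is a name chosen here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

With the reference "the cat sat on the mat" and the hypothesis "the cat sat on a mat", there is one substitution across six reference words, so the WER is 1/6 (about 16.7%), which would miss the <10% target.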

DevKit (Multimodal-Focused)

| Primitive | What It Does |
|---|---|
| 3 agents | Builder (modality routing/vision/synthesis), Reviewer (safety across modalities), Tuner (detail level/parallel/cost) |
| 3 skills | Deploy (103 lines), Evaluate (104 lines), Tune (107 lines) |
| 4 prompts | /deploy (multimodal pipeline), /test (cross-modal), /review (per-modality safety), /evaluate (accuracy) |

Cost

| Service | Dev | Prod | Enterprise |
|---|---|---|---|
| Azure OpenAI | $60 (PAYG) | $400 (PAYG) | $1,200 (PTU) |
| Azure AI Vision | $0 (Free) | $50 (Standard S1) | $150 (Standard S1) |
| Container Apps | $10 (Consumption) | $100 (Dedicated) | $300 (Dedicated HA) |
| Azure AI Speech | $0 (Free) | $40 (Standard S0) | $120 (Standard S0) |
| Blob Storage | $3 (Hot LRS) | $25 (Hot LRS) | $80 (Hot GRS) |
| Redis Cache | $15 (Basic C0) | $55 (Standard C1) | $200 (Premium P1) |
| Key Vault | $1 (Standard) | $3 (Standard) | $10 (Premium HSM) |
| Application Insights | $0 (Free) | $25 (Pay-per-GB) | $100 (Pay-per-GB) |
| Total | $89/mo | $698/mo | $2,160/mo |
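The totals can be checked by summing each tier's column. The short sketch below copies the monthly figures from the table; the `COSTS` dictionary and `monthly_totals` helper are illustrative names, not part of the play.

```python
TIERS = ("Dev", "Prod", "Enterprise")

# Monthly cost in USD per service, in the order (Dev, Prod, Enterprise).
COSTS = {
    "Azure OpenAI": (60, 400, 1200),
    "Azure AI Vision": (0, 50, 150),
    "Container Apps": (10, 100, 300),
    "Azure AI Speech": (0, 40, 120),
    "Blob Storage": (3, 25, 80),
    "Redis Cache": (15, 55, 200),
    "Key Vault": (1, 3, 10),
    "Application Insights": (0, 25, 100),
}


def monthly_totals(costs):
    """Sum each tier's column across all services."""
    return {
        tier: sum(row[i] for row in costs.values())
        for i, tier in enumerate(TIERS)
    }


print(monthly_totals(COSTS))
# → {'Dev': 89, 'Prod': 698, 'Enterprise': 2160}
```

Azure OpenAI is the largest single line item in every tier, at 60 of 89 dollars in Dev and 1,200 of 2,160 dollars in Enterprise.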

Full cost breakdown

Full docs · frootai.dev/solution-plays/36-multimodal-agent

FAI Manifest

| Field | Value |
|---|---|
| Play | 36-multimodal-agent |
| Version | 1.0.0 |
| Knowledge | F1-GenAI-Foundations, O2-Agent-Coding, R2-RAG-Architecture |
| WAF Pillars | security, reliability, cost-optimization, responsible-ai |
| Groundedness | ≥ 85% |
| Safety | 0 violations max |
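The two guardrail rows (groundedness and safety) lend themselves to an automated gate on evaluation results. The sketch below is only an assumption about how such a check could look: the `GUARDRAILS` keys and the `passes_guardrails` function are hypothetical and do not reflect the actual FAI manifest schema.

```python
# Hypothetical in-memory form of the manifest guardrails from the table above.
GUARDRAILS = {
    "groundedness_min": 0.85,    # Groundedness ≥ 85%
    "safety_violations_max": 0,  # Safety: 0 violations max
}


def passes_guardrails(groundedness: float, safety_violations: int,
                      guardrails: dict = GUARDRAILS) -> bool:
    """Return True only if both manifest guardrails are satisfied."""
    return (
        groundedness >= guardrails["groundedness_min"]
        and safety_violations <= guardrails["safety_violations_max"]
    )
```

For example, `passes_guardrails(0.91, 0)` is true, while a single safety violation fails the gate regardless of the groundedness score.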