Play 95 — Multimodal Search V2 🔎

Unified multimodal search — text+image+audio+video indexing, cross-modal retrieval, late fusion ranking, personalized results.

Build a unified multimodal search engine. Per-modality models (text-embedding-3-large for text, CLIP for visual content, Whisper for audio transcription) produce embeddings for 4 content types. Cross-modal retrieval lets text queries find images (and vice versa). Late fusion with reciprocal rank fusion (RRF) merges and re-ranks the per-modality result lists, and user preference models personalize the final ordering.
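To make the cross-modal step concrete, here is a minimal sketch of text-to-image retrieval by cosine similarity in a shared embedding space. The three-dimensional vectors, file names, and the `search` helper are made up for illustration. In the play, the vectors would come from the CLIP text and image encoders, and the lookup would run against an Azure AI Search vector index rather than an in-memory dictionary.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query_vec, index, top_k=2):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical image embeddings (toy 3-d vectors, not real CLIP output).
IMAGE_INDEX = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a text query, in the same space as the images.
query = [0.8, 0.2, 0.1]
results = search(query, IMAGE_INDEX)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Because text and image vectors live in the same space, the same function serves image-to-text search by swapping which side is the query and which is the index.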

Quick Start


cd solution-plays/95-multimodal-search-v2
# $RG must hold the name of an existing resource group
az deployment group create -g "$RG" -f infra/main.bicep -p infra/parameters.json
code .
# Use @builder to implement, @reviewer to audit, @tuner to optimize

Architecture

📐 Full architecture details

Service	Purpose
Azure OpenAI	Text embedding + CLIP visual encoding
Azure AI Search (Standard)	4 vector indices (text, image, audio, video)
Azure Speech (Whisper)	Audio → transcript for embedding
Azure AI Vision	Video key frame extraction
Azure Storage + CDN	Media files + delivery
Cosmos DB (Serverless)	User preferences, search analytics
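The video row depends on key frame extraction, and the defaults in this play sample at 1 fps. As a rough sketch of what fixed-rate sampling means, the function below picks which frame indices to keep for a clip. The `keyframe_indices` name and its parameters are hypothetical, and it ignores scene detection entirely. The actual extraction is done by Azure AI Vision.

```python
import math

def keyframe_indices(total_frames, native_fps, sample_fps=1.0):
    """Frame indices to keep when sampling a clip at sample_fps.

    Illustrative only: fixed-rate sampling, no scene detection.
    """
    if total_frames <= 0:
        return []
    if native_fps <= 0 or sample_fps <= 0:
        raise ValueError("frame rates must be positive")
    duration_s = total_frames / native_fps
    n_samples = math.ceil(duration_s * sample_fps)
    step = native_fps / sample_fps  # source frames between two samples
    # Clamp to the last frame in case of rounding at the end of the clip.
    return [min(total_frames - 1, int(i * step)) for i in range(n_samples)]

# A 3-second clip at 30 fps yields one frame per second.
print(keyframe_indices(90, 30.0))
```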

Pre-Tuned Defaults

Encoding: text-embedding-3-large (1536d, reduced from the native 3072d via the `dimensions` parameter) · CLIP ViT-L/14 (768d) · Whisper-large-v3
Fusion: Reciprocal rank fusion (RRF) · weights: text 40% / image 30% / audio 20% / video 10% · bonus for items matched in multiple modalities
Cross-modal: CLIP shared space for text↔image · transcript embedding for text→audio/video
Video: 1 fps key frames · scene detection · dual embedding (visual + transcript)
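The fusion defaults above can be sketched as a weighted reciprocal rank fusion over the four per-modality result lists. The modality weights come from the defaults. The smoothing constant `k=60` is a commonly used RRF value, not one stated by this play, and the bonus formula (10% extra per additional modality a document appears in) is a hypothetical stand-in for the multi-modal bonus, whose exact form is not documented here.

```python
from collections import defaultdict

# Modality weights from the Pre-Tuned Defaults.
WEIGHTS = {"text": 0.40, "image": 0.30, "audio": 0.20, "video": 0.10}

def fuse(rankings, weights=WEIGHTS, k=60, bonus=0.1):
    """Weighted RRF: score(d) = sum over modalities of w_m / (k + rank_m(d)).

    k and bonus are assumptions, not values taken from the play.
    Each ranked list is expected to contain a document at most once.
    """
    scores = defaultdict(float)
    hits = defaultdict(int)  # number of modalities that returned the doc
    for modality, ranked_ids in rankings.items():
        weight = weights[modality]
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
            hits[doc_id] += 1
    # Hypothetical multi-modal bonus: boost docs found by several modalities.
    fused = {d: s * (1 + bonus * (hits[d] - 1)) for d, s in scores.items()}
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

# Made-up per-modality rankings, best match first.
rankings = {
    "text": ["doc_a", "doc_b", "doc_c"],
    "image": ["doc_b", "doc_d"],
    "audio": ["doc_b", "doc_a"],
    "video": ["doc_e"],
}
fused = fuse(rankings)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

RRF works on ranks rather than raw similarity scores, so the four indices do not need comparable score scales. Here `doc_b` ends up first because it ranks well in three modalities, even though `doc_a` leads the text list.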

DevKit (AI-Assisted Development)

Primitive	What It Does
`agent.md`	Root orchestrator with builder→reviewer→tuner handoffs
`copilot-instructions.md`	Multimodal domain (cross-modal, late fusion, video processing)
3 agents	Builder (gpt-4o), Reviewer (gpt-4o-mini), Tuner (gpt-4o-mini)
3 skills	Deploy (225+ lines), Evaluate (110+ lines), Tune (225+ lines)
4 prompts	`/deploy`, `/test`, `/review`, `/evaluate` with agent routing

Cost Estimate

Service	Dev/mo	Prod/mo	Enterprise/mo
Azure AI Search	$75 (Basic)	$500 (Standard S2)	$2,000 (Standard S3)
Azure AI Vision	$5 (Free tier)	$200 (Standard S1)	$600 (Standard S1)
Azure AI Speech	$5 (Free tier)	$150 (Standard)	$500 (Standard + Custom)
Azure OpenAI	$25 (PAYG)	$350 (PAYG)	$1,200 (PTU Reserved)
Container Apps	$10 (Consumption)	$200 (Dedicated)	$600 (Dedicated HA)
Blob Storage	$5 (Hot LRS)	$80 (Hot LRS)	$250 (Hot GRS)
Key Vault	$1 (Standard)	$5 (Standard)	$15 (Premium HSM)
Application Insights	$0 (Free)	$45 (Pay-per-GB)	$150 (Pay-per-GB)
Total	$126	$1,530	$5,315
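The totals row can be reproduced by summing each tier column. The short sketch below copies the per-service figures from the table and is handy for re-checking the totals after swapping a SKU. The `COSTS` dictionary is just a transcription of the table, not a pricing API.

```python
# Monthly cost per service in USD: (Dev, Prod, Enterprise), from the table.
COSTS = {
    "Azure AI Search": (75, 500, 2000),
    "Azure AI Vision": (5, 200, 600),
    "Azure AI Speech": (5, 150, 500),
    "Azure OpenAI": (25, 350, 1200),
    "Container Apps": (10, 200, 600),
    "Blob Storage": (5, 80, 250),
    "Key Vault": (1, 5, 15),
    "Application Insights": (0, 45, 150),
}

# Transpose rows into tier columns and sum each one.
totals = tuple(sum(column) for column in zip(*COSTS.values()))
for tier, total in zip(("Dev", "Prod", "Enterprise"), totals):
    print(f"{tier}: ${total:,}/mo")
```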

💰 Full cost breakdown

vs. Play 88 (Visual Product Search)

Aspect	Play 88	Play 95
Focus	Product image matching (retail)	Universal content search (all types)
Modalities	Image + text (2)	Text + image + audio + video (4)
Cross-Modal	Image↔text only	All combinations (text→any, image→video)
Fusion	Reranking weights	Late fusion with reciprocal rank

📖 Full documentation · 🌐 frootai.dev/solution-plays/95-multimodal-search-v2 · 📦 FAI Protocol

FAI Manifest

Field	Value
Play	`95-multimodal-search-v2`
Version	`1.0.0`
Knowledge	R2-RAG, F1-GenAI-Foundations, T3-Production-Patterns
WAF Pillars	performance-efficiency, reliability, responsible-ai, cost-optimization