Play 95 — Multimodal Search V2 🔎

Unified multimodal search — text+image+audio+video indexing, cross-modal retrieval, late fusion ranking, personalized results.

Build a unified multimodal search engine. Per-modality models (text-embedding-3-large for text, CLIP for images, Whisper for audio transcription) produce embeddings for four content types. Cross-modal retrieval lets text queries search images (and vice versa). Late fusion with reciprocal rank fusion merges and re-ranks the per-modality results, and user preference models personalize the final ordering.
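To make the cross-modal step concrete, here is a minimal sketch of text-to-image retrieval in a shared embedding space. It assumes the text query and the images have already been embedded into the same space (as CLIP does for text and images); the 3-dimensional vectors, the document IDs, and the `search` helper are hypothetical stand-ins for illustration, not part of this play's code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, index, top_k=3):
    """Rank every item in `index` by similarity to the query vector."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-d vectors standing in for real image embeddings (hypothetical data).
image_index = {
    "img_cat": [0.9, 0.1, 0.0],
    "img_car": [0.0, 0.2, 0.9],
    "img_dog": [0.7, 0.6, 0.1],
}

# Pretend this is the embedding of the text query "a cat".
text_query_vec = [1.0, 0.2, 0.0]

results = search(text_query_vec, image_index, top_k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

In a deployed system the brute-force loop would be replaced by a vector index query; the ranking principle (nearest neighbors in the shared space) stays the same.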

Quick Start

cd solution-plays/95-multimodal-search-v2
az deployment group create -g $RG -f infra/main.bicep -p infra/parameters.json
code .  # Use @builder to implement, @reviewer to audit, @tuner to optimize

Architecture

📐 Full architecture details

| Service | Purpose |
| --- | --- |
| Azure OpenAI | Text embedding + CLIP visual encoding |
| Azure AI Search (Standard) | 4 vector indices (text, image, audio, video) |
| Azure Speech (Whisper) | Audio → transcript for embedding |
| Azure AI Vision | Video key frame extraction |
| Azure Storage + CDN | Media files + delivery |
| Cosmos DB (Serverless) | User preferences, search analytics |

Pre-Tuned Defaults

  • Encoding: text-embedding-3-large (1536d, which implies the `dimensions` parameter is set, since the model's native output is 3072d) · CLIP ViT-L/14 (768d) · Whisper-large-v3
  • Fusion: Reciprocal rank · text 40% / image 30% / audio 20% / video 10% · multi-modal bonus
  • Cross-modal: CLIP shared space for text↔image · transcript embedding for text→audio/video
  • Video: 1 fps key frames · scene detection · dual embedding (visual + transcript)
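The fusion defaults above can be sketched as a weighted reciprocal rank fusion. This is a minimal illustration, not the play's actual implementation: the modality weights come from the defaults listed here, while the smoothing constant `RRF_K = 60`, the size of `MULTIMODAL_BONUS`, and the way the bonus is applied (a multiplier per additional modality that returns the same document) are assumptions chosen for the example.

```python
from collections import defaultdict

# Weights taken from the defaults above.
MODALITY_WEIGHTS = {"text": 0.40, "image": 0.30, "audio": 0.20, "video": 0.10}

RRF_K = 60              # assumption: a commonly used RRF smoothing constant
MULTIMODAL_BONUS = 0.1  # assumption: +10% per extra modality that matched


def fuse(rankings, weights=MODALITY_WEIGHTS, k=RRF_K, bonus=MULTIMODAL_BONUS):
    """Merge per-modality ranked ID lists into one fused ranking.

    `rankings` maps a modality name to a list of document IDs, best first.
    Each list is assumed to contain a given ID at most once.
    """
    scores = defaultdict(float)
    hits = defaultdict(int)
    for modality, ranked_ids in rankings.items():
        weight = weights.get(modality, 0.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)  # weighted RRF term
            hits[doc_id] += 1
    # Boost documents that were retrieved by more than one modality.
    fused = {d: s * (1 + bonus * (hits[d] - 1)) for d, s in scores.items()}
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical per-modality results for one query.
rankings = {
    "text": ["a", "b", "c"],
    "image": ["b", "d"],
    "audio": ["c"],
    "video": ["b"],
}

fused = fuse(rankings)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Because RRF uses ranks rather than raw similarity scores, the four indices do not need comparable score scales, which is the usual reason to prefer it for late fusion across different encoders.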

DevKit (AI-Assisted Development)

| Primitive | What It Does |
| --- | --- |
| agent.md | Root orchestrator with builder→reviewer→tuner handoffs |
| copilot-instructions.md | Multimodal domain (cross-modal, late fusion, video processing) |
| 3 agents | Builder (gpt-4o), Reviewer (gpt-4o-mini), Tuner (gpt-4o-mini) |
| 3 skills | Deploy (225+ lines), Evaluate (110+ lines), Tune (225+ lines) |
| 4 prompts | /deploy, /test, /review, /evaluate with agent routing |

Cost Estimate

| Service | Dev/mo | Prod/mo | Enterprise/mo |
| --- | --- | --- | --- |
| Azure AI Search | $75 (Basic) | $500 (Standard S2) | $2,000 (Standard S3) |
| Azure AI Vision | $5 (Free tier) | $200 (Standard S1) | $600 (Standard S1) |
| Azure AI Speech | $5 (Free tier) | $150 (Standard) | $500 (Standard + Custom) |
| Azure OpenAI | $25 (PAYG) | $350 (PAYG) | $1,200 (PTU Reserved) |
| Container Apps | $10 (Consumption) | $200 (Dedicated) | $600 (Dedicated HA) |
| Blob Storage | $5 (Hot LRS) | $80 (Hot LRS) | $250 (Hot GRS) |
| Key Vault | $1 (Standard) | $5 (Standard) | $15 (Premium HSM) |
| Application Insights | $0 (Free) | $45 (Pay-per-GB) | $150 (Pay-per-GB) |
| Total | $126 | $1,530 | $5,315 |

💰 Full cost breakdown

| Aspect | Play 88 | Play 95 |
| --- | --- | --- |
| Focus | Product image matching (retail) | Universal content search (all types) |
| Modalities | Image + text (2) | Text + image + audio + video (4) |
| Cross-Modal | Image↔text only | All combinations (text→any, image→video) |
| Fusion | Reranking weights | Late fusion with reciprocal rank |

📖 Full documentation · 🌐 frootai.dev/solution-plays/95-multimodal-search-v2  · 📦 FAI Protocol

FAI Manifest

| Field | Value |
| --- | --- |
| Play | 95-multimodal-search-v2 |
| Version | 1.0.0 |
| Knowledge | R2-RAG, F1-GenAI-Foundations, T3-Production-Patterns |
| WAF Pillars | performance-efficiency, reliability, responsible-ai, cost-optimization |