Skip to Content
Solution PlaysPlay 15: Play 15 β€” Multi-Modal DocProc πŸ–ΌοΈ

Play 15 β€” Multi-Modal DocProc πŸ–ΌοΈ

Process documents with text + images using GPT-4o multi-modal vision.

GPT-4o’s vision capability processes documents containing images, charts, tables, and text together. Document Intelligence handles OCR, then GPT-4o interprets visual elements like graphs, stamps, and signatures. Outputs structured JSON. Intelligent routing sends text pages to OCR (cheap) and visual pages to GPT-4o vision (accurate).

Quick Start

cd solution-plays/15-multi-modal-docproc az deployment group create -g $RG -f infra/main.bicep -p infra/parameters.json code . # Use @builder for vision pipeline, @reviewer for accuracy audit, @tuner for routing/cost

Architecture

πŸ“ Full architecture details

ServicePurpose
Azure OpenAI (gpt-4o vision)Chart/graph/stamp/signature interpretation
Document IntelligenceOCR text extraction + table recognition
Azure FunctionsPage-level processing pipeline
Blob StorageDocument upload + page image cache
Cosmos DBStructured extraction results

Page Routing (Key Differentiator from Play 06)

Page TypeProcessing PathCost
Text-onlyDocument Intelligence OCR1x (cheapest)
TableDocument Intelligence table extraction1x
Chart/graphGPT-4o Vision10x
Photo/stampGPT-4o Vision10x
MixedOCR (text) + Vision (visual)5x

Key Metrics

  • Text extraction: β‰₯95% Β· Chart data: β‰₯85% Β· Page classification: β‰₯95% Β· Processing: <15s/page

DevKit (Multi-Modal Vision-Focused)

PrimitiveWhat It Does
3 agentsBuilder (vision+OCR pipeline), Reviewer (accuracy/PII in images), Tuner (routing/resolution/cost)
3 skillsDeploy (124 lines), Evaluate (100 lines), Tune (112 lines)
4 prompts/deploy (GPT-4o vision + Doc Intel), /test (vision pipeline), /review (multi-modal quality), /evaluate (accuracy)

Cost Estimate

ServiceDev/PoCProductionEnterprise
Azure Document Intelligence$0/mo$150/mo$500/mo
Azure OpenAI (GPT-4o)$50/mo$350/mo$1,200/mo
Blob Storage$3/mo$25/mo$80/mo
Azure Functions$0/mo$60/mo$150/mo
Cosmos DB$5/mo$60/mo$250/mo
Azure AI Search$0/mo$75/mo$250/mo
Key Vault$1/mo$3/mo$10/mo
Application Insights$0/mo$25/mo$80/mo
Total$59/mo$748/mo$2,520/mo

πŸ’° Full cost breakdown

πŸ“– Full docs Β· 🌐 frootai.dev/solution-plays/15-multi-modal-docprocΒ 

FAI Manifest

FieldValue
Play15-multi-modal-docproc
Version1.0.0
KnowledgeF1-GenAI-Foundations, R2-RAG-Architecture
WAF Pillarssecurity, reliability, responsible-ai
Groundednessβ‰₯ 85%
Safety0 violations max
Last updated on