Skip to Content
Solution PlaysPlay 98: Play 98 β€” Agent Evaluation Platform πŸ†

Play 98 β€” Agent Evaluation Platform πŸ†

AI agent benchmarking β€” multi-dimensional scoring (7 dimensions), LLM-as-judge with calibration, adversarial safety testing, leaderboard comparison.

Build an agent evaluation platform. Score agents across 7 dimensions (task completion, accuracy, tool use, safety, latency, cost, conversation quality), calibrate LLM-as-judge against human annotations, generate diverse test suites with adversarial probes, and publish leaderboard rankings with baseline comparison.

Quick Start

cd solution-plays/98-agent-evaluation-platform az deployment group create -g $RG -f infra/main.bicep -p infra/parameters.json code . # Use @builder to implement, @reviewer to audit, @tuner to optimize

Architecture

πŸ“ Full architecture details

ServicePurpose
Azure OpenAI (gpt-4o)LLM-as-judge + test case generation
Cosmos DB (Serverless)Test suites, eval results, leaderboard
Azure FunctionsEvaluation pipeline orchestration
Container AppsDashboard API + leaderboard UI

Pre-Tuned Defaults

  • Dimensions: 7 weighted (task 20%, accuracy 20%, tools 15%, safety 15%, latency 10%, cost 10%, quality 10%)
  • Judge: gpt-4o Β· temperature 0 Β· calibrated against 50+ human annotations Β· safety veto rule
  • Tests: 5 types (single-turn, multi-turn, tool use, adversarial, edge cases) Β· monthly rotation
  • Leaderboard: Per-dimension breakdown Β· baseline comparison Β· version tracking

DevKit (AI-Assisted Development)

PrimitiveWhat It Does
agent.mdRoot orchestrator with builder→reviewer→tuner handoffs
copilot-instructions.mdEval domain (multi-dimensional scoring, judge bias, adversarial testing)
3 agentsBuilder (gpt-4o), Reviewer (gpt-4o-mini), Tuner (gpt-4o-mini)
3 skillsDeploy (240+ lines), Evaluate (115+ lines), Tune (240+ lines)
4 prompts/deploy, /test, /review, /evaluate with agent routing

Cost Estimate

ServiceDev/moProd/moEnterprise/mo
Azure OpenAI$40 (PAYG)$600 (PAYG)$2,000 (PTU Reserved)
Container Apps$10 (Consumption)$350 (Dedicated)$1,000 (Dedicated HA)
Azure Cosmos DB$5 (Serverless)$280 (5000 RU/s)$750 (15000 RU/s)
Azure Machine Learning$0 (Basic)$300 (Standard)$900 (Standard HA)
Azure Functions$0 (Consumption)$200 (Premium EP2)$500 (Premium EP3)
Key Vault$1 (Standard)$5 (Standard)$20 (Premium HSM)
Application Insights$0 (Free)$45 (Pay-per-GB)$150 (Pay-per-GB)
Total$56$1,780$5,320

πŸ’° Full cost breakdown

vs. Play 32 (Testing Expert)

AspectPlay 32Play 98
FocusCode test generation (unit/integration)AI agent quality benchmarking
TargetSource codeAgent endpoints
ScoringTest pass/fail + coverage7-dimension weighted score + leaderboard
SafetySecurity scanningAdversarial probes (jailbreak, injection)

πŸ“– Full documentation Β· 🌐 frootai.dev/solution-plays/98-agent-evaluation-platformΒ  Β· πŸ“¦ FAI Protocol

FAI Manifest

FieldValue
Play98-agent-evaluation-platform
Version1.0.0
KnowledgeT3-Production-Patterns, T2-Responsible-AI, O2-AI-Agents
WAF Pillarsresponsible-ai, reliability, operational-excellence, cost-optimization
Last updated on