Play 47: Synthetic Data Factory

A privacy-safe synthetic data generation factory that combines LLM-based text generation, CTGAN statistical tabular generation, differential privacy, schema-driven pipelines, distribution fidelity validation, PII marker enforcement, and downstream ML utility testing.

Architecture

| Component | Technology | Purpose |
| --- | --- | --- |
| Text Generation | Azure OpenAI (GPT-4o) | Schema-driven synthetic text/records |
| Tabular Generation | CTGAN / SDV | Statistical synthetic tabular data |
| Differential Privacy | SDV + DP mechanisms | Mathematical privacy guarantees |
| Rule-Based | Faker + custom rules | Structured PII-like fields (names, addresses) |
| Validation | SciPy + Presidio | Distribution fidelity + PII leakage scan |
| Storage | Azure Blob Storage | Dataset output (CSV, Parquet, JSONL) |
| Orchestrator | Azure Container Apps | Generation pipeline hosting |
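
The "DP mechanisms" row can be made concrete with the textbook Laplace mechanism. The sketch below is only an illustration under stated assumptions: it uses the Python standard library rather than SDV, and `laplace_noise` and `dp_count` are hypothetical helper names, not part of this play. It releases a count with noise scaled to `1 / epsilon`, which is the standard calibration for a counting query (sensitivity 1).

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def dp_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a noisy count under epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1,
    so the Laplace scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    ages = [23, 31, 35, 42, 58, 61, 67]  # illustrative values only
    rng = random.Random(2024)            # seeded so the run is repeatable
    noisy = dp_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
    print(f"noisy count of ages >= 40: {noisy:.2f}")
```

A smaller `epsilon` gives stronger privacy and noisier answers. Every released statistic spends part of the privacy budget, so a real pipeline would also need to track the cumulative epsilon across queries.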


| Aspect | Play 13 (Fine-Tuning) | Play 47 (Synthetic Data) | Play 46 (Healthcare AI) |
| --- | --- | --- | --- |
| Purpose | Train models on real data | Generate privacy-safe training data | Clinical decision support |
| Input | Real labeled dataset | Schema + constraints (or real sample) | Patient records (FHIR) |
| Output | Fine-tuned model | Synthetic dataset (CSV/Parquet) | Clinical recommendations |
| Privacy | Data access controls | Formal privacy (DP, PII-free) | HIPAA de-identification |
| Validation | Model eval metrics | Distribution fidelity vs real data | Clinical accuracy |
| Use Case | When real data exists | When real data is sensitive/scarce | Healthcare compliance |

DevKit Structure

    47-synthetic-data-factory/
    ├── agent.md                              # Root orchestrator with handoffs
    ├── .github/
    │   ├── copilot-instructions.md           # Domain knowledge (<150 lines)
    │   ├── agents/
    │   │   ├── builder.agent.md              # LLM + CTGAN + validation pipelines
    │   │   ├── reviewer.agent.md             # Privacy + PII + re-identification
    │   │   └── tuner.agent.md                # Diversity + fidelity + cost
    │   ├── prompts/
    │   │   ├── deploy.prompt.md              # Deploy generation pipelines
    │   │   ├── test.prompt.md                # Generate + validate sample
    │   │   ├── review.prompt.md              # Privacy audit
    │   │   └── evaluate.prompt.md            # Fidelity + utility metrics
    │   ├── skills/
    │   │   ├── deploy-synthetic-data-factory/    # LLM gen + CTGAN + validation
    │   │   ├── evaluate-synthetic-data-factory/  # Fidelity, privacy, diversity, utility
    │   │   └── tune-synthetic-data-factory/      # Temperature, epochs, DP budget, cost
    │   └── instructions/
    │       └── synthetic-data-factory-patterns.instructions.md
    ├── config/                               # TuneKit
    │   ├── openai.json                       # Generation model, temperature, batching
    │   ├── guardrails.json                   # Privacy controls, PII markers, DP budget
    │   └── agents.json                       # Output format, storage, schema definitions
    ├── infra/                                # Bicep IaC
    │   ├── main.bicep
    │   └── parameters.json
    └── spec/                                 # SpecKit
        └── fai-manifest.json

Quick Start

    # 1. Deploy generation pipelines
    /deploy

    # 2. Generate sample dataset + validate
    /test

    # 3. Run privacy audit
    /review

    # 4. Measure fidelity + downstream utility
    /evaluate

Key Metrics

| Metric | Target | Description |
| --- | --- | --- |
| KS Statistic | < 0.1 | Distribution similarity per column |
| PII Leakage | 0% | No real PII in synthetic output |
| Re-identification Risk | < 0.1% | Can't link synthetic to real records |
| Uniqueness Rate | > 95% | Low duplicate synthetic records |
| TSTR Accuracy Parity | > 85% | ML model trained on synthetic ≈ real-trained |
| Cost per 1K Records | < $0.30 | LLM generation + validation |
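
To show how three of these targets could be checked, here is a minimal standard-library sketch. It is not the play's evaluation code: the architecture names SciPy for validation, and this hand-rolled `ks_statistic` only illustrates what the two-sample KS statistic measures. The function names are hypothetical, and the TSTR parity formula (accuracy of the synthetic-trained model divided by accuracy of the real-trained model, both scored on the same real test set) is an assumed definition.

```python
from bisect import bisect_right

# Thresholds copied from the Key Metrics table.
KS_MAX = 0.1
UNIQUENESS_MIN = 0.95
TSTR_PARITY_MIN = 0.85


def ks_statistic(real, synthetic) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def uniqueness_rate(records) -> float:
    """Fraction of distinct rows; rows must be hashable (for example tuples)."""
    rows = list(records)
    if not rows:
        raise ValueError("records must be non-empty")
    return len(set(rows)) / len(rows)


def tstr_parity(acc_trained_on_synthetic: float, acc_trained_on_real: float) -> float:
    """Assumed definition: ratio of the two accuracies on a shared real test set."""
    return acc_trained_on_synthetic / acc_trained_on_real


def meets_targets(ks_by_column: dict, uniqueness: float, parity: float) -> bool:
    """True only if every column and every aggregate metric is within target."""
    return (
        max(ks_by_column.values()) < KS_MAX
        and uniqueness > UNIQUENESS_MIN
        and parity > TSTR_PARITY_MIN
    )


if __name__ == "__main__":
    real_age = [23, 31, 35, 42, 58]    # illustrative values only
    synth_age = [24, 30, 36, 41, 57]
    print("KS(age):", ks_statistic(real_age, synth_age))
```

The KS statistic is computed per column, so `meets_targets` takes a mapping from column name to statistic and gates on the worst column rather than the average.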

Estimated Cost

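| Service | Dev/mo | Prod/mo | Enterprise/mo |
| --- | --- | --- | --- |
| Azure OpenAI | $60 | $500 | $1,800 |
| Azure Machine Learning | $0 | $200 | $800 |
| Azure Blob Storage | $5 | $40 | $150 |
| Azure Functions | $0 | $120 | $350 |
| Azure Data Factory | $10 | $100 | $300 |
| Key Vault | $1 | $5 | $15 |
| Application Insights | $0 | $25 | $80 |
| Total | $76 | $990 | $3,495 |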

Estimates based on Azure retail pricing. Actual costs vary by region, usage, and enterprise agreements.
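
The totals can be re-derived from the per-service figures, which is handy when adjusting the estimate for a different service mix. The sketch below copies the numbers from the table above. The names `COSTS`, `tier_totals`, and `service_share` are illustrative, not part of the play.

```python
TIERS = ("Dev", "Prod", "Enterprise")

# Monthly USD estimates per service, in TIERS order (copied from the table).
COSTS = {
    "Azure OpenAI": (60, 500, 1800),
    "Azure Machine Learning": (0, 200, 800),
    "Azure Blob Storage": (5, 40, 150),
    "Azure Functions": (0, 120, 350),
    "Azure Data Factory": (10, 100, 300),
    "Key Vault": (1, 5, 15),
    "Application Insights": (0, 25, 80),
}


def tier_totals(costs: dict) -> dict:
    """Sum each tier's column across all services."""
    return {tier: sum(row[i] for row in costs.values()) for i, tier in enumerate(TIERS)}


def service_share(costs: dict, service: str, tier: str) -> float:
    """Fraction of a tier's total spend attributable to one service."""
    i = TIERS.index(tier)
    return costs[service][i] / tier_totals(costs)[tier]


if __name__ == "__main__":
    print(tier_totals(COSTS))
    print(f"Azure OpenAI share of Prod: {service_share(COSTS, 'Azure OpenAI', 'Prod'):.0%}")
```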


WAF Alignment

| Pillar | Implementation |
| --- | --- |
| Security | PII markers, no real data in output, differential privacy |
| Responsible AI | Privacy-preserving generation, re-identification risk testing |
| Reliability | Statistical validation (KS test), downstream utility testing |
| Cost Optimization | CTGAN for tabular (free), gpt-4o-mini for simple schemas |
| Performance Efficiency | Batch generation, cached schemas, Faker for structured PII |
| Operational Excellence | Schema-driven pipelines, reproducible with seed, metadata audit |
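
The "schema-driven, reproducible with seed" and "PII markers" items might look roughly like the following sketch. Everything here is an assumption made for illustration: the schema layout, the `pii` flag, and the `[SYNTHETIC] ` marker string are hypothetical and are not taken from `agents.json` or `guardrails.json`, and plain `random` stands in for Faker so the example has no dependencies.

```python
import random

PII_MARKER = "[SYNTHETIC] "  # hypothetical marker; the play's real marker is not documented here

# Hypothetical schema format: field name -> generation rule.
SCHEMA = {
    "customer_id": {"type": "int", "min": 1000, "max": 9999},
    "full_name": {"type": "choice", "values": ["Alex Doe", "Sam Roe", "Kim Poe"], "pii": True},
    "balance": {"type": "float", "min": 0.0, "max": 5000.0},
}


def generate_value(spec: dict, rng: random.Random):
    """Produce one value for a field according to its rule."""
    kind = spec["type"]
    if kind == "int":
        value = rng.randint(spec["min"], spec["max"])
    elif kind == "float":
        value = round(rng.uniform(spec["min"], spec["max"]), 2)
    elif kind == "choice":
        value = rng.choice(spec["values"])
    else:
        raise ValueError(f"unsupported field type: {kind}")
    # Tag PII-like fields so downstream checks can tell they are synthetic.
    if spec.get("pii"):
        value = f"{PII_MARKER}{value}"
    return value


def generate_records(schema: dict, n: int, seed: int) -> list:
    """Generate n records; the same schema, n, and seed give the same output."""
    rng = random.Random(seed)
    return [{name: generate_value(spec, rng) for name, spec in schema.items()} for _ in range(n)]


def missing_markers(records: list, schema: dict) -> list:
    """Return (row index, field) pairs where a PII-like field lacks the marker."""
    pii_fields = [name for name, spec in schema.items() if spec.get("pii")]
    return [
        (i, name)
        for i, row in enumerate(records)
        for name in pii_fields
        if not str(row[name]).startswith(PII_MARKER)
    ]


if __name__ == "__main__":
    rows = generate_records(SCHEMA, n=3, seed=42)
    for row in rows:
        print(row)
    print("marker violations:", missing_markers(rows, SCHEMA))
```

Using a dedicated `random.Random(seed)` instance instead of the module-level functions keeps runs reproducible even when other code in the same process also draws random numbers.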

FAI Manifest

| Field | Value |
| --- | --- |
| Play | 47-synthetic-data-factory |
| Version | 1.0.0 |
| Knowledge | F1-GenAI-Foundations, F2-LLM-Selection, T1-Fine-Tuning-MLOps, T2-Responsible-AI, T3-Production-Patterns |
| WAF Pillars | security, responsible-ai, cost-optimization, operational-excellence |
| Groundedness | ≥ 85% |
| Safety | 0 violations max |