
T1: Fine-Tuning & MLOps

Fine-tuning is the most misused tool in AI. Most teams reach for it when prompt engineering + RAG would solve 90% of their problems at 1% of the cost. This module covers the full customization spectrum so you pick the right technique, and when you do need fine-tuning, you do it efficiently with LoRA/QLoRA. For prompting techniques, see R1: Prompt Engineering. For RAG patterns, see R2: RAG Architecture.

:::tip Always Try Prompt Engineering First
Before investing in fine-tuning (days of work, thousands of examples, GPU costs), exhaust simpler options. The progression: better prompts → few-shot examples → RAG grounding → system instructions. Only fine-tune when these fail to meet quality requirements.
:::

The Customization Spectrum

Each technique trades effort/cost for control. Start from the left; move right only when needed:

| Technique | Data Needed | Cost | Time | When to Use |
| --- | --- | --- | --- | --- |
| Prompt Engineering | 0 examples | ~$0 | Minutes | First approach for everything |
| Few-Shot Prompting | 3-10 examples | ~$0 | Minutes | Format/style consistency |
| RAG | Your docs | $50-500/mo | Hours | Domain knowledge grounding |
| System Instructions | 0-5 examples | ~$0 | Minutes | Behavioral constraints |
| LoRA / QLoRA | 100-10K examples | $10-100 | Hours | Style, format, domain adaptation |
| Full Fine-Tuning | 10K-100K examples | $100-10K | Days | Deep behavioral changes |
| Continued Pre-Training | 1M+ tokens | $10K+ | Days-Weeks | New domain vocabulary |
| Training From Scratch | Billions of tokens | $1M+ | Months | Novel architecture needs |

When to Fine-Tune: Decision Framework

Fine-tune when all of these are true:

1. Prompt engineering + RAG tried and insufficient: documented evidence of failures
2. Consistent format/style needed: the model must always output in a specific structure
3. Latency matters: you need shorter prompts (fine-tuned models need less instruction)
4. You have quality training data: minimum 100 curated examples, ideally 1K+
5. The task is narrow: classification, extraction, or domain-specific generation

Don't fine-tune when:

- You need up-to-date knowledge → use RAG
- You need factual accuracy → use grounding + citations
- You have fewer than 50 examples → use few-shot prompting
- The base model already performs well with good prompts
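The decision framework above can be encoded as a small gate. `recommend_customization` and its argument names are hypothetical, a sketch of the checklist's logic rather than any library API:

```python
def recommend_customization(
    prompting_and_rag_insufficient: bool,  # criterion 1: documented failures
    needs_consistent_format: bool,         # criterion 2
    latency_sensitive: bool,               # criterion 3
    num_curated_examples: int,             # criterion 4
    task_is_narrow: bool,                  # criterion 5
) -> str:
    """Return a hedged recommendation based on the five-point checklist."""
    if num_curated_examples < 50:
        return "few-shot prompting"        # too little data to fine-tune
    if not prompting_and_rag_insufficient:
        return "prompt engineering + RAG"  # simpler options not yet exhausted
    if all([needs_consistent_format, latency_sensitive,
            num_curated_examples >= 100, task_is_narrow]):
        return "fine-tune with LoRA/QLoRA"
    return "prompt engineering + RAG"      # default when any criterion fails
```

Fine-tuning is recommended only when every criterion holds, mirroring the "all of these are true" rule above.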

LoRA & QLoRA: Parameter-Efficient Fine-Tuning

Instead of updating all model parameters (billions), LoRA freezes the original weights and trains small rank-decomposition matrices injected into attention layers:

Original Weight Matrix W (d × d): Frozen ❄️
LoRA Adapter: W' = W + B × A
where B is (d × r) and A is (r × d), r << d (rank 4-64)
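To see why the adapters are so small, count trainable parameters for an illustrative 4096 × 4096 projection at rank 8 (the sizes are chosen for illustration; real layers vary):

```python
d, r = 4096, 8                    # hidden size and LoRA rank (illustrative)
full_params = d * d               # frozen W: 16,777,216 parameters
lora_params = d * r + r * d       # B (d x r) plus A (r x d): 65,536 parameters
ratio = lora_params / full_params
print(f"trainable fraction: {ratio:.2%}")   # -> trainable fraction: 0.39%
```

This is where the 0.1-1% "parameters trained" figure in the table below comes from: the fraction shrinks further as d grows or r shrinks.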
| Aspect | Full Fine-Tuning | LoRA | QLoRA |
| --- | --- | --- | --- |
| Parameters trained | 100% | 0.1-1% | 0.1-1% |
| GPU memory | 80-160 GB | 16-24 GB | 4-8 GB |
| Adapter size | Full model (10-100 GB) | 10-100 MB | 10-100 MB |
| Training cost | $100-10K | $10-100 | $5-50 |
| Quality | Best | 95-99% of full | 90-97% of full |

QLoRA adds 4-bit quantization of the base model during training, cutting memory usage further while maintaining quality through double quantization and paged optimizers.
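A toy sketch of the underlying idea, using plain absmax quantization to the signed 4-bit range. QLoRA itself uses the NF4 data type plus double quantization, so this only conveys the intuition of the round-trip:

```python
def quantize_4bit(block):
    """Absmax-quantize a block of floats to signed 4-bit integers in [-7, 7]."""
    scale = max(abs(x) for x in block) / 7 or 1.0   # guard against all-zero blocks
    return [round(x / scale) for x in block], scale

def dequantize_4bit(q, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [v * scale for v in q]

weights = [0.12, -0.34, 0.07, 0.91]
q, scale = quantize_4bit(weights)
restored = dequantize_4bit(q, scale)
# restored values only approximate the originals; the LoRA adapters
# themselves stay in 16-bit precision, which is why quality holds up
```

The frozen base weights live in 4-bit form and are dequantized on the fly during forward passes; only the small 16-bit adapters receive gradients.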

Data Preparation

Training data uses the JSONL format, one JSON object per line, each holding conversation turns (pretty-printed here for readability):

```json
{"messages": [
  {"role": "system", "content": "You are a legal contract analyzer."},
  {"role": "user", "content": "Summarize the key obligations in this clause: ..."},
  {"role": "assistant", "content": "Key obligations:\n1. Payment within 30 days\n2. ..."}
]}
```
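A minimal script (illustrative, not an official SDK utility) that validates records of this shape before writing them out:

```python
import json

example = {
    "messages": [
        {"role": "system", "content": "You are a legal contract analyzer."},
        {"role": "user", "content": "Summarize the key obligations in this clause: ..."},
        {"role": "assistant", "content": "Key obligations:\n1. Payment within 30 days\n2. ..."},
    ]
}

def validate(record):
    """Basic sanity checks before a record enters the training set."""
    roles = [m["role"] for m in record["messages"]]
    assert roles[-1] == "assistant", "each example must end with the target completion"
    assert all(r in {"system", "user", "assistant"} for r in roles)
    assert all(m["content"].strip() for m in record["messages"]), "no empty turns"

with open("train.jsonl", "w", encoding="utf-8") as f:
    validate(example)
    f.write(json.dumps(example) + "\n")   # one JSON object per line
```

Catching malformed records here is far cheaper than discovering them mid-training-job.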

Data quality checklist:

- ✅ Minimum 100 examples (1K+ recommended for production)
- ✅ Diverse inputs covering edge cases
- ✅ Consistent output format across all examples
- ✅ No PII unless required and compliant
- ✅ 80/10/10 split: training / validation / test
- ✅ Human-reviewed: never use synthetic data alone
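The 80/10/10 split can be made reproducible with a seeded shuffle, sketched here:

```python
import random

def split_dataset(records, seed=42, train=0.8, val=0.1):
    """Shuffle with a fixed seed, then cut into train/validation/test."""
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(n * train), int(n * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])   # test set gets the remainder

train_set, val_set, test_set = split_dataset(list(range(1000)))
print(len(train_set), len(val_set), len(test_set))   # -> 800 100 100
```

Recording the seed alongside the data version (see the MLOps section below) keeps the split auditable.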

Azure AI Foundry Fine-Tuning Workflow

```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Upload Data  │────▶│ Configure    │────▶│ Train        │
│ (JSONL)      │     │ (model,      │     │ (managed     │
│              │     │  epochs,     │     │  compute)    │
│              │     │  LR, batch)  │     │              │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
┌──────────────┐     ┌──────────────┐     ┌──────▼───────┐
│ Deploy       │◀────│ Evaluate     │◀────│ Review       │
│ (endpoint)   │     │ (metrics,    │     │ (loss curve, │
│              │     │  A/B test)   │     │  samples)    │
└──────────────┘     └──────────────┘     └──────────────┘
```

Key hyperparameters:

| Parameter | Default | Guidance |
| --- | --- | --- |
| Epochs | 3-5 | More data → fewer epochs; watch for overfitting |
| Learning rate | 2e-5 | Lower (1e-5) for larger models |
| Batch size | 4-16 | Larger = faster but more memory |
| LoRA rank | 8-16 | Higher rank = more capacity, more memory |

Evaluation Metrics

Track these during and after training:

- Training/validation loss: should decrease smoothly; divergence signals overfitting
- Task-specific metrics: accuracy, F1, BLEU/ROUGE depending on task
- A/B testing: compare fine-tuned vs base model on held-out test set
- Human evaluation: expert review of output quality on real-world queries
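For classification-style tasks, accuracy and binary F1 need no extra dependencies; a minimal sketch (label names are illustrative):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_binary(y_true, y_pred, positive="yes"):
    """Harmonic mean of precision and recall for one positive class."""
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    fp = sum(p == positive and t != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

truth = ["yes", "no", "yes", "yes"]
pred  = ["yes", "no", "no",  "yes"]
print(accuracy(truth, pred))   # -> 0.75
```

Run the same functions on both the fine-tuned and base model's outputs over the held-out test set to get the A/B comparison above.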

:::warning Overfitting Detection
If validation loss starts increasing while training loss continues decreasing, stop training. The model is memorizing training data rather than learning patterns. Reduce epochs or increase training data diversity.
:::
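The signal in the warning above can be automated with a patience-based early-stopping check, a common heuristic sketched here:

```python
def should_stop(val_losses, patience=2):
    """Stop when validation loss has not improved for `patience` straight epochs."""
    if len(val_losses) <= patience:
        return False                       # not enough history yet
    best = min(val_losses[:-patience])     # best loss before the patience window
    return all(v >= best for v in val_losses[-patience:])

history = [1.20, 0.95, 0.80, 0.82, 0.85]   # validation loss rising after epoch 3
print(should_stop(history))                # -> True
```

With `patience=2`, training halts after two consecutive epochs without improvement, before the memorization described above gets worse.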

MLOps for LLMs

Production fine-tuned models need the same rigor as traditional ML:

| Practice | What | Why |
| --- | --- | --- |
| Model Registry | Version every fine-tuned model with metadata | Rollback, audit, comparison |
| Data Versioning | Track training data versions alongside models | Reproducibility |
| Automated Evaluation | Run eval suite on every new model version | Prevent regressions |
| Deployment Pipelines | Blue-green deployment with traffic splitting | Safe rollout |
| Monitoring | Track drift, latency, quality metrics post-deploy | Catch degradation early |
| Retraining Triggers | Quality drops below threshold → retrain | Maintain accuracy over time |
```bash
# Example: Azure AI Foundry CLI workflow
az ai model fine-tune create \
  --model gpt-4o-mini \
  --training-data ./train.jsonl \
  --validation-data ./val.jsonl \
  --hyperparameters '{"n_epochs": 3, "learning_rate_multiplier": 1.0}'

# Monitor training
az ai model fine-tune show --id <job-id>

# Deploy when ready
az ai model deployment create \
  --model <fine-tuned-model-id> \
  --name my-fine-tuned-deployment
```
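The Automated Evaluation and Retraining Triggers practices can be combined into a simple promotion gate in the pipeline; `deploy_gate` and its regression threshold are illustrative, not a platform API:

```python
def deploy_gate(candidate_metrics, baseline_metrics, max_regression=0.02):
    """Block deployment if any metric regresses beyond `max_regression`."""
    failures = {
        name: (baseline_metrics[name], value)   # (baseline, candidate) for reporting
        for name, value in candidate_metrics.items()
        if value < baseline_metrics[name] - max_regression
    }
    return len(failures) == 0, failures

ok, failures = deploy_gate(
    candidate_metrics={"accuracy": 0.91, "f1": 0.84},
    baseline_metrics={"accuracy": 0.89, "f1": 0.88},
)
print(ok, failures)   # -> False {'f1': (0.88, 0.84)}
```

A gate like this runs after the CLI training job completes and before the deployment step, turning "prevent regressions" from a policy into an enforced check.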

Key Takeaways

1. Exhaust simpler options first: prompt engineering + RAG solve 90% of use cases
2. LoRA/QLoRA is the default: 95%+ quality at 1% of the cost of full fine-tuning
3. Data quality > data quantity: 500 excellent examples beat 10K noisy ones
4. Evaluate rigorously: A/B test against the base model before deploying
5. Treat models as artifacts: version, register, monitor, retrain

Next: T2: Responsible AI, on ensuring your AI systems are safe, fair, and transparent.