
T1: Fine-Tuning & MLOps

Fine-tuning is the most misused tool in AI. Most teams reach for it when prompt engineering + RAG would solve 90% of their problems at 1% of the cost. This module covers the full customization spectrum so you pick the right technique, and when you do need fine-tuning, you do it efficiently with LoRA/QLoRA. For prompting techniques, see R1: Prompt Engineering. For RAG patterns, see R2: RAG Architecture.

:::tip Always Try Prompt Engineering First
Before investing in fine-tuning (days of work, thousands of examples, GPU costs), exhaust simpler options. The progression: better prompts → few-shot examples → RAG grounding → system instructions. Only fine-tune when these fail to meet quality requirements.
:::

The Customization Spectrum

Each technique trades effort/cost for control. Start from the left; move right only when needed:

| Technique | Data Needed | Cost | Time | When to Use |
| --- | --- | --- | --- | --- |
| Prompt Engineering | 0 examples | ~$0 | Minutes | First approach for everything |
| Few-Shot Prompting | 3-10 examples | ~$0 | Minutes | Format/style consistency |
| RAG | Your docs | $50-500/mo | Hours | Domain knowledge grounding |
| System Instructions | 0-5 examples | ~$0 | Minutes | Behavioral constraints |
| LoRA / QLoRA | 100-10K examples | $10-100 | Hours | Style, format, domain adaptation |
| Full Fine-Tuning | 10K-100K examples | $100-10K | Days | Deep behavioral changes |
| Continued Pre-Training | 1M+ tokens | $10K+ | Days-Weeks | New domain vocabulary |
| Training From Scratch | Billions of tokens | $1M+ | Months | Novel architecture needs |

When to Fine-Tune: Decision Framework

Fine-tune when all of these are true:

1. Prompt engineering + RAG tried and insufficient: documented evidence of failures
2. Consistent format/style needed: the model must always output in a specific structure
3. Latency matters: you need shorter prompts (fine-tuned models need less instruction)
4. You have quality training data: minimum 100 curated examples, ideally 1K+
5. The task is narrow: classification, extraction, or domain-specific generation

Don't fine-tune when:

- You need up-to-date knowledge → use RAG
- You need factual accuracy → use grounding + citations
- You have fewer than 50 examples → use few-shot prompting
- The base model already performs well with good prompts
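The decision framework above can be encoded as a small gate. `recommend_customization` and its argument names are hypothetical, a sketch of the checklist's logic rather than any library API:

```python
def recommend_customization(
    prompting_and_rag_insufficient: bool,  # criterion 1: documented failures
    needs_consistent_format: bool,         # criterion 2
    latency_sensitive: bool,               # criterion 3
    num_curated_examples: int,             # criterion 4
    task_is_narrow: bool,                  # criterion 5
) -> str:
    """Return a hedged recommendation based on the five-point checklist."""
    if num_curated_examples < 50:
        return "few-shot prompting"        # too little data to fine-tune
    if not prompting_and_rag_insufficient:
        return "prompt engineering + RAG"  # simpler options not yet exhausted
    if all([needs_consistent_format, latency_sensitive,
            num_curated_examples >= 100, task_is_narrow]):
        return "fine-tune with LoRA/QLoRA"
    return "prompt engineering + RAG"      # default when any criterion fails
```

Fine-tuning is recommended only when every criterion holds, mirroring the "all of these are true" rule above.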

LoRA & QLoRA: Parameter-Efficient Fine-Tuning

Instead of updating all model parameters (billions), LoRA freezes the original weights and trains small rank-decomposition matrices injected into attention layers:

Original Weight Matrix W (d × d): Frozen ❄️
LoRA Adapter: W' = W + B × A
where B is (d × r) and A is (r × d), r << d (rank 4-64)
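To see why the adapters are so small, count trainable parameters for an illustrative 4096 × 4096 projection at rank 8 (the sizes are chosen for illustration; real layers vary):

```python
d, r = 4096, 8                    # hidden size and LoRA rank (illustrative)
full_params = d * d               # frozen W: 16,777,216 parameters
lora_params = d * r + r * d       # B (d x r) plus A (r x d): 65,536 parameters
ratio = lora_params / full_params
print(f"trainable fraction: {ratio:.2%}")   # -> trainable fraction: 0.39%
```

This is where the 0.1-1% "parameters trained" figure in the table below comes from: the fraction shrinks further as d grows or r shrinks.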
| Aspect | Full Fine-Tuning | LoRA | QLoRA |
| --- | --- | --- | --- |
| Parameters trained | 100% | 0.1-1% | 0.1-1% |
| GPU memory | 80-160 GB | 16-24 GB | 4-8 GB |
| Adapter size | Full model (10-100 GB) | 10-100 MB | 10-100 MB |
| Training cost | $100-10K | $10-100 | $5-50 |
| Quality | Best | 95-99% of full | 90-97% of full |

QLoRA adds 4-bit quantization of the base model during training, cutting memory usage further while maintaining quality through double quantization and paged optimizers.
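A toy sketch of the underlying idea, using plain absmax quantization to the signed 4-bit range. QLoRA itself uses the NF4 data type plus double quantization, so this only conveys the intuition of the round-trip:

```python
def quantize_4bit(block):
    """Absmax-quantize a block of floats to signed 4-bit integers in [-7, 7]."""
    scale = max(abs(x) for x in block) / 7 or 1.0   # guard against all-zero blocks
    return [round(x / scale) for x in block], scale

def dequantize_4bit(q, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [v * scale for v in q]

weights = [0.12, -0.34, 0.07, 0.91]
q, scale = quantize_4bit(weights)
restored = dequantize_4bit(q, scale)
# restored values only approximate the originals; the LoRA adapters
# themselves stay in 16-bit precision, which is why quality holds up
```

The frozen base weights live in 4-bit form and are dequantized on the fly during forward passes; only the small 16-bit adapters receive gradients.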

Data Preparation

Training data uses the JSONL format, one JSON object per line, each holding conversation turns (pretty-printed here for readability):

```json
{"messages": [
  {"role": "system", "content": "You are a legal contract analyzer."},
  {"role": "user", "content": "Summarize the key obligations in this clause: ..."},
  {"role": "assistant", "content": "Key obligations:\n1. Payment within 30 days\n2. ..."}
]}
```
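A minimal script (illustrative, not an official SDK utility) that validates records of this shape before writing them out:

```python
import json

example = {
    "messages": [
        {"role": "system", "content": "You are a legal contract analyzer."},
        {"role": "user", "content": "Summarize the key obligations in this clause: ..."},
        {"role": "assistant", "content": "Key obligations:\n1. Payment within 30 days\n2. ..."},
    ]
}

def validate(record):
    """Basic sanity checks before a record enters the training set."""
    roles = [m["role"] for m in record["messages"]]
    assert roles[-1] == "assistant", "each example must end with the target completion"
    assert all(r in {"system", "user", "assistant"} for r in roles)
    assert all(m["content"].strip() for m in record["messages"]), "no empty turns"

with open("train.jsonl", "w", encoding="utf-8") as f:
    validate(example)
    f.write(json.dumps(example) + "\n")   # one JSON object per line
```

Catching malformed records here is far cheaper than discovering them mid-training-job.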

Data quality checklist:

- ✅ Minimum 100 examples (1K+ recommended for production)
- ✅ Diverse inputs covering edge cases
- ✅ Consistent output format across all examples
- ✅ No PII unless required and compliant
- ✅ 80/10/10 split: training / validation / test
- ✅ Human-reviewed: never use synthetic data alone
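The 80/10/10 split can be made reproducible with a seeded shuffle, sketched here:

```python
import random

def split_dataset(records, seed=42, train=0.8, val=0.1):
    """Shuffle with a fixed seed, then cut into train/validation/test."""
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(n * train), int(n * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])   # test set gets the remainder

train_set, val_set, test_set = split_dataset(list(range(1000)))
print(len(train_set), len(val_set), len(test_set))   # -> 800 100 100
```

Recording the seed alongside the data version (see the MLOps section below) keeps the split auditable.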

Azure AI Foundry Fine-Tuning Workflow

```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Upload Data  │────▶│ Configure    │────▶│ Train        │
│ (JSONL)      │     │ (model,      │     │ (managed     │
│              │     │  epochs,     │     │  compute)    │
│              │     │  LR, batch)  │     │              │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
┌──────────────┐     ┌──────────────┐     ┌──────▼───────┐
│ Deploy       │◀────│ Evaluate     │◀────│ Review       │
│ (endpoint)   │     │ (metrics,    │     │ (loss curve, │
│              │     │  A/B test)   │     │  samples)    │
└──────────────┘     └──────────────┘     └──────────────┘
```

Key hyperparameters:

| Parameter | Default | Guidance |
| --- | --- | --- |
| Epochs | 3-5 | More data → fewer epochs; watch for overfitting |
| Learning rate | 2e-5 | Lower (1e-5) for larger models |
| Batch size | 4-16 | Larger = faster but more memory |
| LoRA rank | 8-16 | Higher rank = more capacity, more memory |

Evaluation Metrics

Track these during and after training:

- Training/validation loss: should decrease smoothly; divergence signals overfitting
- Task-specific metrics: accuracy, F1, BLEU/ROUGE depending on task
- A/B testing: compare fine-tuned vs base model on held-out test set
- Human evaluation: expert review of output quality on real-world queries
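For classification-style tasks, accuracy and binary F1 need no extra dependencies; a minimal sketch (label names are illustrative):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_binary(y_true, y_pred, positive="yes"):
    """Harmonic mean of precision and recall for one positive class."""
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    fp = sum(p == positive and t != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

truth = ["yes", "no", "yes", "yes"]
pred  = ["yes", "no", "no",  "yes"]
print(accuracy(truth, pred))   # -> 0.75
```

Run the same functions on both the fine-tuned and base model's outputs over the held-out test set to get the A/B comparison above.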

:::warning Overfitting Detection
If validation loss starts increasing while training loss continues decreasing, stop training. The model is memorizing training data rather than learning patterns. Reduce epochs or increase training data diversity.
:::
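The signal in the warning above can be automated with a patience-based early-stopping check, a common heuristic sketched here:

```python
def should_stop(val_losses, patience=2):
    """Stop when validation loss has not improved for `patience` straight epochs."""
    if len(val_losses) <= patience:
        return False                       # not enough history yet
    best = min(val_losses[:-patience])     # best loss before the patience window
    return all(v >= best for v in val_losses[-patience:])

history = [1.20, 0.95, 0.80, 0.82, 0.85]   # validation loss rising after epoch 3
print(should_stop(history))                # -> True
```

With `patience=2`, training halts after two consecutive epochs without improvement, before the memorization described above gets worse.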

MLOps for LLMs

Production fine-tuned models need the same rigor as traditional ML:

| Practice | What | Why |
| --- | --- | --- |
| Model Registry | Version every fine-tuned model with metadata | Rollback, audit, comparison |
| Data Versioning | Track training data versions alongside models | Reproducibility |
| Automated Evaluation | Run eval suite on every new model version | Prevent regressions |
| Deployment Pipelines | Blue-green deployment with traffic splitting | Safe rollout |
| Monitoring | Track drift, latency, quality metrics post-deploy | Catch degradation early |
| Retraining Triggers | Quality drops below threshold → retrain | Maintain accuracy over time |
```bash
# Example: Azure AI Foundry CLI workflow
az ai model fine-tune create \
  --model gpt-4o-mini \
  --training-data ./train.jsonl \
  --validation-data ./val.jsonl \
  --hyperparameters '{"n_epochs": 3, "learning_rate_multiplier": 1.0}'

# Monitor training
az ai model fine-tune show --id <job-id>

# Deploy when ready
az ai model deployment create \
  --model <fine-tuned-model-id> \
  --name my-fine-tuned-deployment
```
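The Automated Evaluation and Retraining Triggers practices can be combined into a simple promotion gate in the pipeline; `deploy_gate` and its regression threshold are illustrative, not a platform API:

```python
def deploy_gate(candidate_metrics, baseline_metrics, max_regression=0.02):
    """Block deployment if any metric regresses beyond `max_regression`."""
    failures = {
        name: (baseline_metrics[name], value)   # (baseline, candidate) for reporting
        for name, value in candidate_metrics.items()
        if value < baseline_metrics[name] - max_regression
    }
    return len(failures) == 0, failures

ok, failures = deploy_gate(
    candidate_metrics={"accuracy": 0.91, "f1": 0.84},
    baseline_metrics={"accuracy": 0.89, "f1": 0.88},
)
print(ok, failures)   # -> False {'f1': (0.88, 0.84)}
```

A gate like this runs after the CLI training job completes and before the deployment step, turning "prevent regressions" from a policy into an enforced check.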

Key Takeaways

1. Exhaust simpler options first: prompt engineering + RAG solve 90% of use cases
2. LoRA/QLoRA is the default: 95%+ quality at 1% of the cost of full fine-tuning
3. Data quality > data quantity: 500 excellent examples beat 10K noisy ones
4. Evaluate rigorously: A/B test against the base model before deploying
5. Treat models as artifacts: version, register, monitor, retrain

Next: T2: Responsible AI, on ensuring your AI systems are safe, fair, and transparent.