
# R3: Deterministic AI

The core challenge: you want the same input to produce the same output, but LLMs are inherently probabilistic. True bit-for-bit determinism is impossible, but functional determinism (consistent, accurate, verifiable outputs) is achievable. This module shows you how. See R1: Prompt Engineering for the prompting techniques that underpin these strategies.

## The Determinism Spectrum

Not all tasks need the same level of consistency. Choose the right target:

| Level | Definition | Example | Acceptable? |
|---|---|---|---|
| Exact | Bit-for-bit identical output every time | 2 + 2 = 4 | ✅ Math, lookups |
| Semantic | Same meaning, different wording | "The sky is blue" / "Blue is the color of the sky" | ✅ Most production tasks |
| Intent | Same intent and action, varying detail | Two valid but differently worded emails | ✅ Creative, drafting |
| Approximate | Roughly correct, may vary in accuracy | Summarizations that emphasize different points | ⚠️ Acceptable with review |
| Chaotic | Contradictory or fabricated across runs | Different answers to the same factual question | ❌ Never acceptable |

:::info GPU Floating-Point Non-Determinism

Even with temperature=0 and a fixed seed, identical API calls can produce different outputs. GPU floating-point arithmetic is non-associative: (a + b) + c ≠ a + (b + c) at float precision. Different GPU routing, batching, or hardware can shift token probabilities enough to change the selected token at decision boundaries. This is a hardware-level limitation, not a bug.

:::
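The non-associativity is easy to reproduce on any CPU with ordinary Python floats; on a GPU the same effect appears whenever batching or kernel scheduling changes the order of a reduction:

```python
# Floating-point addition is not associative: the grouping of operands
# changes the rounding error, so the two sums differ in the last bit.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6

print(left == right)  # False
```

A model's logits accumulate thousands of such sums, so a different reduction order can nudge two near-tied token probabilities past each other.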

## Why AI Hallucinates

Hallucination is a feature of autoregressive generation, not a bug. Understanding the mechanics helps you defend against it:

1. Next-token prediction: the model picks the most probable continuation, not the most truthful
2. No knowledge boundary: the model doesn't know what it doesn't know
3. Training data noise: conflicting or outdated information in the training data
4. Context window pressure: as conversations grow, earlier context gets compressed
5. Sycophancy bias: models tend to agree with the user's stated position

Key insight: You can't eliminate hallucination; you can only reduce its frequency and catch it before it reaches users.

## The Determinism Toolkit

### Control Lever 1: Temperature & Sampling

| Parameter | Value | Effect | Use Case |
|---|---|---|---|
| temperature | 0 | Greedy decoding: always picks the highest-probability token | Factual Q&A, classification, extraction |
| temperature | 0.1–0.3 | Near-deterministic with slight variation | RAG, structured output |
| temperature | 0.7–1.0 | Balanced creativity and coherence | Drafting, brainstorming |
| top_p | 0.1 | Considers only the top 10% of probability mass | Very focused output |
| top_k | 5–10 | Considers only the top-k tokens at each step | Not available in Azure OpenAI (use top_p instead) |
| seed | integer | Best-effort reproducibility (same seed → same output) | A/B testing, regression testing |
```python
from openai import OpenAI

client = OpenAI()

# Maximum-determinism settings
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,   # greedy decoding
    seed=42,         # best-effort reproducibility
    max_tokens=500,
    messages=[...],
)
# Check response.system_fingerprint: if it changes between calls,
# the backend configuration changed and outputs may differ.
```
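What temperature actually does can be shown in a few lines: it divides the logits before the softmax, so as T approaches 0 the distribution collapses onto the single highest-probability token (greedy decoding). A minimal sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, scaled by temperature."""
    if temperature == 0:
        # Greedy decoding: all probability mass on the argmax token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [logit / temperature for logit in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                       # hypothetical next-token logits
print(softmax_with_temperature(logits, 1.0))   # soft distribution over all tokens
print(softmax_with_temperature(logits, 0.1))   # nearly one-hot
print(softmax_with_temperature(logits, 0))     # exactly one-hot: [1.0, 0.0, 0.0]
```

Low but nonzero temperatures (0.1–0.3) leave a thin tail of probability on other tokens, which is why they are "near-deterministic" rather than deterministic.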

### Control Lever 2: Structured Output

Force the model into a constrained output space where malformed or out-of-schema output is structurally impossible:

```python
from typing import Literal

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class ClassificationResult(BaseModel):
    # Hypothetical categories; Literal is emitted as an enum in the JSON Schema
    category: Literal["billing", "technical", "account"]
    confidence: float
    reasoning: str

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    temperature=0,
    response_format=ClassificationResult,
    messages=[
        {"role": "system", "content": "Classify the support ticket."},
        {"role": "user", "content": "My VM won't start after resizing."},
    ],
)
result = response.choices[0].message.parsed  # a validated ClassificationResult instance
```
:::tip

JSON Schema constraints act as guardrails at the decoding level: the model physically cannot output tokens that violate the schema. This is stronger than prompting "respond in JSON", which is a suggestion, not a constraint.

:::

### Control Lever 3: Grounding

Connect the model to real data so it generates from evidence, not memory:

| Strategy | How It Works | Determinism Gain |
|---|---|---|
| RAG | Retrieve relevant docs, inject into context | High: answer bounded by retrieved content |
| System message constraints | "Only use information from the provided context" | Medium: relies on model compliance |
| Citation requirements | "Cite sources as [1], [2]" | Medium: forces traceability |
| Tool/function calling | Model calls APIs for real-time data | High: answers come from verified sources |
| Knowledge cutoff awareness | "If you don't know, say 'I don't know'" | Medium: reduces fabrication |
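A minimal sketch combining the first three strategies (context injection, a system-message constraint, and a citation requirement) into one prompt. The function name, delimiters, and document format are illustrative assumptions, not a fixed API:

```python
def build_grounded_prompt(question: str, documents: list[str]) -> list[dict]:
    """Assemble a RAG-style prompt: retrieved docs injected as numbered
    context, plus a grounding constraint and a citation requirement."""
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(documents))
    system = (
        "Answer using ONLY the information in the provided context. "
        "Cite sources as [1], [2]. "
        "If the context does not contain the answer, say 'I don't know'."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_grounded_prompt(
    "Why won't my VM start after resizing?",
    ["VMs must be deallocated before a resize takes effect."],
)
```

The numbered context makes the citation requirement checkable after generation: every `[n]` in the answer should map to an injected document.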

See R2: RAG Architecture for full implementation details.

## Multi-Layer Defense Architecture

No single technique is sufficient. Production systems stack multiple layers:

```text
Layer 1: INPUT VALIDATION
→ Sanitize prompts, detect injection attempts, enforce length limits

Layer 2: PROMPT DESIGN
→ System message constraints, role prompting, few-shot examples

Layer 3: GENERATION CONTROLS
→ temperature=0, seed, max_tokens, structured output / JSON mode

Layer 4: GROUNDING
→ RAG retrieval, tool calling, citation requirements

Layer 5: OUTPUT VALIDATION
→ Groundedness check (≥4.0/5.0), content safety filter, schema validation

Layer 6: HUMAN OVERSIGHT
→ Confidence thresholds for escalation, feedback loops, audit logging
```
:::warning

Never rely on a single layer. Even with temperature=0, RAG, and structured output combined, the model can still produce incorrect-but-plausible content. Layer 5 (output validation) is your last line of defense before the response reaches the user.

:::
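The six layers can be sketched as one request pipeline. Every helper below is a deliberately trivial stub with a hypothetical name; a real system would back each one with the techniques from this module:

```python
def validate_input(text: str) -> str:                   # Layer 1
    """Reject oversized or injection-like input (crude heuristics)."""
    if len(text) > 2000 or "ignore previous" in text.lower():
        raise ValueError("rejected input")
    return text.strip()

def retrieve(query: str) -> list[str]:                  # Layer 4 (stub)
    return ["[1] VMs must be deallocated before a resize takes effect."]

def generate(query: str, docs: list[str]) -> str:       # Layers 2+3 (stub)
    # A real implementation calls the model with a constrained system
    # message, temperature=0, a seed, and structured output.
    return "Per [1], deallocate the VM, resize it, then restart."

def groundedness_score(answer: str, docs: list[str]) -> float:  # Layer 5 (stub)
    return 5.0 if "[1]" in answer else 1.0

def handle_request(query: str) -> str:
    cleaned = validate_input(query)          # Layer 1: input validation
    docs = retrieve(cleaned)                 # Layer 4: grounding
    answer = generate(cleaned, docs)         # Layers 2+3: prompt + controls
    if groundedness_score(answer, docs) < 4.0:
        return "ESCALATED"                   # Layer 6: human oversight
    return answer                            # Layer 5 passed

print(handle_request("How do I resize a VM?"))
```

The point of the shape, not the stubs: every response either passes every layer or is escalated; nothing reaches the user on the strength of generation alone.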

## Evaluation: Measuring Determinism

| Metric | Target | Measurement |
|---|---|---|
| Groundedness | ≥ 4.0 / 5.0 | Does the response use only retrieved evidence? |
| Consistency | ≥ 95% semantic match | Run the same query 10×, compare outputs |
| Accuracy | ≥ 90% factual correctness | Human review against known-correct answers |
| Refusal rate | ≤ 5% false refusals | Model says "I don't know" when it should answer |
| Hallucination rate | ≤ 2% | Model fabricates facts not in context |
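The consistency check can be approximated without special tooling. This sketch uses `difflib.SequenceMatcher` as a crude lexical stand-in for semantic matching; a production harness would compare embedding similarity or use an LLM judge instead:

```python
from difflib import SequenceMatcher
from itertools import combinations

def consistency_score(outputs: list[str]) -> float:
    """Mean pairwise similarity across repeated runs of the same query.
    1.0 means bit-for-bit identical; lower values mean wording drift."""
    pairs = list(combinations(outputs, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

# outputs = [call_model(query) for _ in range(10)]  # run the same query 10x
outputs = [
    "The sky is blue.",
    "The sky is blue.",
    "Blue is the color of the sky.",
]
print(f"consistency: {consistency_score(outputs):.2f}")
```

Note that a lexical ratio penalizes semantically equivalent rewordings, so scores from this sketch are a lower bound on true semantic consistency.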

FrootAI Play 03 (Deterministic Agent) implements all six defense layers with automated evaluation. Play 01 (Enterprise RAG) applies grounding-focused determinism with groundedness scoring.

## Key Takeaways

1. True determinism is impossible: target functional determinism (semantic consistency)
2. Hallucination is inherent to autoregressive generation: defend in depth
3. Stack all six layers: no single technique is sufficient alone
4. Structured output is the strongest single lever: it constrains at the decoding level
5. Measure with groundedness scores and consistency checks, not vibes

For orchestration patterns that manage these controls at scale, see O1: Semantic Kernel.