
F1: GenAI Foundations

This module covers the core mechanics of Generative AI. Understanding these fundamentals is essential before diving into RAG, agents, or any FrootAI solution play.

What Is a Large Language Model?

An LLM is a statistical next-token predictor. Given a sequence of tokens, it calculates probability distributions over the entire vocabulary and samples the next token. Repeat this thousands of times and you get coherent text, code, or structured output.

```text
Input:  "The capital of France is"
Output: " Paris" (probability: 0.97)
```
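A single decoding step can be sketched in a few lines of plain Python. This is a toy distribution with invented logits for a four-token vocabulary, not a real model:

```python
import math

# Toy logits for a tiny vocabulary (numbers invented for illustration)
logits = {" Paris": 8.0, " Lyon": 3.5, " a": 2.0, " located": 1.0}

# Softmax turns raw logits into a probability distribution
z = max(logits.values())  # subtract the max for numerical stability
exps = {tok: math.exp(v - z) for tok, v in logits.items()}
total = sum(exps.values())
probs = {tok: e / total for tok, e in exps.items()}

# Greedy decoding: pick the most likely next token
next_token = max(probs, key=probs.get)
print(next_token, round(probs[next_token], 2))
```

Real models repeat this loop, appending each sampled token to the input and predicting again.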

:::info Key Insight
LLMs don't "understand"; they learn statistical patterns from trillions of tokens of training data. This is why grounding (connecting to real data) and guardrails (constraining outputs) are critical. See the AI Glossary for formal definitions.
:::

Tokens: The Currency of AI

Tokens are sub-word units produced by Byte-Pair Encoding (BPE). They are how models read, think, and charge.

| Text | Token Count | Ratio |
| --- | --- | --- |
| "Hello, world!" | 4 tokens | 1 word ≈ 1.3 tokens |
| "Antidisestablishmentarianism" | 6 tokens | 1 token ≈ 0.17 words |
| `{"name": "Alice"}` | 7 tokens | JSON is token-expensive |
| Average English prose | ~100 tokens | 1 token ≈ 0.75 words |

Cost formula: `cost = (input_tokens × input_price) + (output_tokens × output_price)`
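In code, the formula is a one-liner. The prices below are hypothetical placeholders per 1M tokens, not current list prices:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """cost = input_tokens * input_price + output_tokens * output_price,
    with prices quoted per 1M tokens."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# Hypothetical pricing: $2.50 per 1M input tokens, $10.00 per 1M output tokens
print(request_cost(3_000, 500, 2.50, 10.00))  # 0.0125
```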

:::tip Token Budget
Always set `max_tokens` in production. An unbounded response can burn through your budget on a single runaway generation. FrootAI solution plays configure this in `config/openai.json`.
:::

Key Generation Parameters

| Parameter | Range | Default | Effect |
| --- | --- | --- | --- |
| `temperature` | 0–2 | 1.0 | Controls randomness: 0 = deterministic, 1 = balanced, 2 = creative chaos |
| `top_p` | 0–1 | 1.0 | Nucleus sampling: considers only tokens within cumulative probability *p* |
| `max_tokens` | 1–128K | Model limit | Hard cap on output length |
| `seed` | int | None | Enables reproducible outputs (same seed + temperature 0 = same result) |
| `frequency_penalty` | -2–2 | 0 | Reduces repetition of already-used tokens |
:::warning
Never set both `temperature` and `top_p` to non-default values simultaneously; they interact unpredictably. Pick one.
:::
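Temperature works by dividing the logits before the softmax, which flattens or sharpens the resulting distribution. A quick sketch with toy numbers:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax.
    (Temperature 0 is handled as greedy argmax in real samplers.)"""
    scaled = [l / temperature for l in logits]
    z = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - z) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]
print([round(p, 2) for p in softmax_with_temperature(logits, 1.0)])  # baseline
print([round(p, 2) for p in softmax_with_temperature(logits, 2.0)])  # flatter: more random
print([round(p, 2) for p in softmax_with_temperature(logits, 0.5)])  # sharper: more deterministic
```

Higher temperature spreads probability mass across more tokens; lower temperature concentrates it on the top token.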

Context Windows: Model Memory

The context window is the maximum number of tokens a model can process in a single request (input + output combined).

| Model | Context Window | ~Pages of Text |
| --- | --- | --- |
| GPT-4o | 128K | ~200 pages |
| GPT-4o-mini | 128K | ~200 pages |
| GPT-4.1 | 1M | ~1,500 pages |
| Claude Sonnet 4 | 200K | ~300 pages |
| Llama 3.1 405B | 128K | ~200 pages |
| Gemini 2.5 Pro | 1M | ~1,500 pages |

Exceeding the context window causes truncation: the model silently drops older tokens. RAG (see F2) solves this by retrieving only relevant chunks.
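A common defensive pattern is to trim the oldest conversation turns yourself before the model truncates silently. A naive sketch using a rough characters-per-token estimate (a real implementation would count tokens with a proper tokenizer):

```python
def trim_to_budget(messages, max_tokens, chars_per_token=4):
    """Keep the system message plus the newest turns that fit the token budget."""
    system, turns = messages[0], messages[1:]
    budget = max_tokens - len(system["content"]) // chars_per_token
    kept = []
    for msg in reversed(turns):  # walk from newest to oldest
        cost = len(msg["content"]) // chars_per_token + 1  # +1 for message overhead
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))

history = [{"role": "system", "content": "You are helpful."}] + [
    {"role": "user", "content": "x" * 400} for _ in range(10)
]
print(len(trim_to_budget(history, 300)))  # 3: system message + the 2 newest turns that fit
```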

Model Parameters & VRAM

When someone says "a 7B model," they mean 7 billion trainable weights. More parameters generally mean better reasoning but also higher infrastructure cost.

VRAM formula: `VRAM ≈ params × bytes_per_param × 1.2 (overhead)`

| Model Size | FP32 | FP16 | INT8 | INT4 |
| --- | --- | --- | --- | --- |
| 7B | 28 GB | 14 GB | 7 GB | 3.5 GB |
| 13B | 52 GB | 26 GB | 13 GB | 6.5 GB |
| 70B | 280 GB | 140 GB | 70 GB | 35 GB |
| 405B | 1.6 TB | 810 GB | 405 GB | 203 GB |
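The table values are just `params × bytes_per_param` (weights only); a small helper can also apply the ~1.2 serving overhead factor from the formula above:

```python
def vram_gb(params_billion: float, bytes_per_param: float,
            overhead: float = 1.0) -> float:
    """Approximate VRAM in GB to load model weights (1 GB = 1e9 bytes here)."""
    return params_billion * bytes_per_param * overhead

print(vram_gb(7, 2))        # 14.0 -> matches the 7B FP16 cell
print(vram_gb(7, 2, 1.2))   # 16.8 -> weights plus serving overhead
print(vram_gb(70, 0.5))     # 35.0 -> matches the 70B INT4 cell
```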

Quantization: Shrinking Models

Quantization reduces precision of model weights to lower VRAM and increase throughput:

- FP32: full precision, baseline quality, 4 bytes/param
- FP16/BF16: half precision, negligible quality loss, 2 bytes/param (production standard)
- INT8: 8-bit integers, ~1% quality loss, 1 byte/param
- INT4 (GPTQ/AWQ): aggressive compression, noticeable quality loss on complex reasoning
:::tip
For self-hosted models, start with INT8 quantization; it offers the best quality-to-cost ratio. Only go to INT4 if VRAM is severely constrained. See FrootAI Play 12 for AKS model serving patterns.
:::
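The core idea can be illustrated with simple absmax quantization of a weight vector. This is a toy sketch of the principle, not a production scheme like GPTQ or AWQ:

```python
def quantize_int8(weights):
    """Absmax quantization: map floats to int8 range [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from int8 codes."""
    return [v * scale for v in q]

weights = [0.12, -0.57, 0.33, 1.01, -0.08]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)        # integer codes, 1 byte each instead of 4
print(max_err)  # reconstruction error bounded by scale/2
```

Storage drops 4× versus FP32 at the cost of a small, bounded rounding error per weight; real schemes quantize per-channel or per-group to keep that error low.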

Embeddings: Semantic Vectors

Embeddings convert text into dense vectors (e.g., 1536 or 3072 dimensions) where semantic similarity = vector proximity.

`embed("king") - embed("man") + embed("woman") ≈ embed("queen")`

Used for: semantic search, RAG retrieval, clustering, anomaly detection, recommendation. See cosine similarity in the glossary.
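Cosine similarity is the normalized dot product of two vectors. A minimal sketch with made-up 3-dimensional vectors (real embeddings have 1536+ dimensions):

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of vector magnitudes: 1 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors: "king" and "queen" point in similar directions, "car" does not
king, queen, car = [0.9, 0.8, 0.1], [0.85, 0.82, 0.12], [0.1, 0.2, 0.95]
print(round(cosine_similarity(king, queen), 3))  # close to 1.0 (semantically near)
print(round(cosine_similarity(king, car), 3))    # much lower (semantically far)
```

Semantic search and RAG retrieval rank candidate chunks by exactly this score against the query embedding.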

Training vs Inference

| Aspect | Training | Inference |
| --- | --- | --- |
| Goal | Learn weights from data | Generate outputs from learned weights |
| Compute | Thousands of GPUs, weeks/months | Single GPU or API call, milliseconds |
| Cost | $2M–$100M+ per frontier model | $0.15–$60 per 1M tokens |
| Who does it | OpenAI, Meta, Google, Anthropic | You, via API or self-hosted |
:::info
99% of FrootAI solution plays use inference only: calling pre-trained models via API. Plays 13 (Fine-Tuning) and 12 (Model Serving) cover the exceptions.
:::

Practical Example: Azure OpenAI Chat Completion

```python
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Entra ID auth via DefaultAzureCredential - never hardcode API keys
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_version="2024-12-01-preview",
    azure_ad_token_provider=token_provider,
)

response = client.chat.completions.create(
    model="gpt-4o",  # your Azure deployment name
    temperature=0.7,
    max_tokens=500,
    seed=42,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain transformers in 3 sentences."},
    ],
)

print(response.choices[0].message.content)
# Token accounting for cost tracking:
# response.usage.prompt_tokens, response.usage.completion_tokens
```

Key Takeaways

  1. Tokens are the universal unit: understand them for cost, latency, and context management
  2. Temperature 0 + seed gives deterministic outputs for reproducible pipelines
  3. Context window ≠ quality: more context doesn't mean better answers (noise hurts)
  4. Quantization makes self-hosting viable: INT8 is the sweet spot
  5. Always set max_tokens: unbounded generation is a cost and safety risk

Next: F2: LLM Landscape →