
Large Language Models (LLMs) such as Llama, GPT, Mistral, and Gemma are pretrained on massive datasets containing trillions of tokens. These models already understand language, grammar, reasoning patterns, coding concepts, and a significant amount of world knowledge.

However, enterprises often need models that specialize in specific domains such as insurance, healthcare, legal services, finance, or customer support. This is where fine-tuning comes into play.

A common question in AI and GenAI interviews is:

“What actually changes inside a model during fine-tuning?”

In this article, we’ll explore what happens internally when a pretrained model is adapted to a new task and understand the roles of optimizers, learning-rate schedulers, layer freezing, and modern techniques like LoRA.


Starting Point: A Pretrained Model

Consider a model such as Llama 3.

Before fine-tuning, it has already learned:

  • Language structure
  • Grammar and syntax
  • Reasoning patterns
  • Programming concepts
  • General world knowledge

Instead of training from scratch, we continue training the model using domain-specific data such as:

  • Insurance policies
  • Medical records
  • Legal contracts
  • Customer support conversations
  • Enterprise documentation

The goal is not to teach language again, but to adapt existing knowledge to a specific use case.


What Parameters Exist Inside an LLM?

A transformer-based LLM contains billions of parameters spread across:

  • Token embeddings
  • Attention layers
  • Feed-forward neural networks (MLPs)
  • Output projection layers
  • Bias terms

These parameters collectively determine how the model processes inputs and predicts the next token.

Fine-tuning modifies some or all of these parameters.


Full Fine-Tuning

The most straightforward approach is Full Fine-Tuning.

In this approach, every trainable parameter is updated.

This includes:

  • Embedding layers
  • Attention weights
  • Feed-forward layers
  • Output head

For example, a 70-billion-parameter model would have all 70 billion parameters updated during training.

Advantages

  • Maximum adaptation to the target domain
  • Potentially highest task performance

Disadvantages

  • Extremely expensive
  • Requires significant GPU memory
  • Longer training times
  • Higher risk of catastrophic forgetting

Because of these limitations, full fine-tuning is becoming less common in enterprise environments.
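To make the memory cost concrete, here is a rough back-of-the-envelope sketch. The per-parameter byte counts are assumptions based on a common mixed-precision AdamW setup (16-bit weights and gradients plus 32-bit optimizer copies), not measurements of any particular framework, and activation memory is ignored entirely.

```python
# Rough memory estimate for full fine-tuning with AdamW in mixed precision.
# All byte counts below are assumptions for illustration.
BYTES_PER_PARAM = {
    "bf16_weights": 2,         # weights used in the forward/backward pass
    "bf16_gradients": 2,       # one gradient per trainable weight
    "fp32_master_weights": 4,  # full-precision copy kept for the update
    "fp32_adam_momentum": 4,   # AdamW first-moment estimate
    "fp32_adam_variance": 4,   # AdamW second-moment estimate
}


def full_finetune_memory_gb(num_params: int) -> float:
    """Approximate training-state memory in GB (1 GB = 1e9 bytes)."""
    bytes_total = num_params * sum(BYTES_PER_PARAM.values())
    return bytes_total / 1e9


memory_gb = full_finetune_memory_gb(70_000_000_000)
print(f"~{memory_gb:.0f} GB before activations")
```

Under these assumptions a 70B model needs on the order of a terabyte of training state, which is why full fine-tuning at this scale is usually spread across many GPUs.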


Parameter-Efficient Fine-Tuning (PEFT)

Today, most organizations use Parameter-Efficient Fine-Tuning techniques.

Popular approaches include:

  • LoRA
  • QLoRA
  • Adapters
  • Prefix Tuning

Instead of updating billions of parameters, PEFT updates only a small fraction of the model.

For example:

  • Base Model: 70B parameters
  • Trainable Parameters: 20M–100M

This dramatically reduces:

  • GPU requirements
  • Training costs
  • Storage requirements

while maintaining strong performance.
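The figures above can be turned into a quick calculation. This is only a sketch: the helper names are made up for illustration, and the 2 bytes per parameter used for checkpoint size assumes the adapter weights are saved in a 16-bit format.

```python
BASE_PARAMS = 70_000_000_000  # 70B base model from the example above


def trainable_percent(trainable_params: int, base_params: int = BASE_PARAMS) -> float:
    """Share of the base model that is actually trained, in percent."""
    return 100.0 * trainable_params / base_params


def checkpoint_mb(num_params: int, bytes_per_param: int = 2) -> float:
    """Approximate checkpoint size in MB (assumes 16-bit storage)."""
    return num_params * bytes_per_param / 1e6


for trainable in (20_000_000, 100_000_000):
    print(
        f"{trainable:,} trainable params: "
        f"{trainable_percent(trainable):.3f}% of base, "
        f"~{checkpoint_mb(trainable):.0f} MB checkpoint"
    )
```

Even at the upper end of the range, well under 1% of the model is being trained, and the saved adapter is measured in megabytes rather than hundreds of gigabytes.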


The Role of an Optimizer

Once the model performs a forward pass and calculates a loss, it must determine how to update its weights.

This is the optimizer’s job.

The training loop typically looks like this:

  1. Forward Pass
  2. Loss Calculation
  3. Backpropagation
  4. Optimizer Updates Weights

The optimizer decides:

“How much should each parameter change?”
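The four steps of the training loop can be sketched on a toy problem. This is not an LLM: the "model" is a single weight w, the data and learning rate are made up for illustration, and the gradient is written out by hand instead of using an autograd library.

```python
# Toy data generated from y = 3 * x; the "model" is one weight w.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = 0.0
learning_rate = 0.05
losses = []

for step in range(50):
    # 1. Forward pass: compute predictions with the current weight.
    predictions = [w * x for x, _ in data]

    # 2. Loss calculation: mean squared error against the targets.
    loss = sum((p - y) ** 2 for p, (_, y) in zip(predictions, data)) / len(data)
    losses.append(loss)

    # 3. Backpropagation: dLoss/dw = mean(2 * (prediction - y) * x).
    grad = sum(2 * (p - y) * x for p, (x, y) in zip(predictions, data)) / len(data)

    # 4. Optimizer update: plain gradient descent in this sketch.
    w -= learning_rate * grad

print(f"learned w = {w:.4f}, first loss = {losses[0]:.2f}, last loss = {losses[-1]:.2e}")
```

Fine-tuning an LLM runs the same loop, just with billions of weights, a cross-entropy loss over tokens, and a more sophisticated optimizer in step 4.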


Why AdamW Is the Industry Standard

Most LLM fine-tuning pipelines use AdamW.

AdamW improves traditional gradient descent through:

  • Adaptive learning rates
  • Momentum
  • Weight decay regularization

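At a high level, plain gradient descent updates each weight as:

New Weight = Old Weight − Learning Rate × Gradient

AdamW enhances this rule by scaling each parameter’s update using running averages of its past gradients and squared gradients, and by applying weight decay directly to the weights.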

Benefits of AdamW

  • Stable convergence
  • Faster training
  • Better performance on large transformer models
  • Reduced overfitting through weight decay

This makes AdamW the default optimizer for most LLM training and fine-tuning workflows.
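The simple rule "New Weight = Old Weight − Learning Rate × Gradient" leaves out the momentum, adaptive scaling, and weight decay that AdamW adds. Below is a minimal sketch of one AdamW update for a single scalar weight. The function name adamw_step is hypothetical, and the default values for beta1, beta2, and eps are commonly used settings chosen here as assumptions.

```python
import math


def adamw_step(w, grad, m, v, step, lr=1e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar weight. Returns (w, m, v)."""
    # Decoupled weight decay: shrink the weight directly, independent of the gradient.
    w = w - lr * weight_decay * w

    # Momentum: running average of gradients.
    m = beta1 * m + (1 - beta1) * grad
    # Adaptive scaling: running average of squared gradients.
    v = beta2 * v + (1 - beta2) * grad * grad

    # Bias correction, because m and v start at zero (step counts from 1).
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)

    # Final update: the step size adapts to the gradient history of this weight.
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


w, m, v = 1.0, 0.0, 0.0
w, m, v = adamw_step(w, grad=0.5, m=m, v=v, step=1, lr=0.1)
print(w, m, v)
```

Real optimizers apply the same arithmetic to every trainable tensor and keep m and v for each parameter, which is where the extra optimizer memory comes from.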


Why Learning Rate Schedulers Matter

A common mistake is assuming that the learning rate should remain constant throughout training.

For example:

Learning Rate = 0.0001

for every training step.

In practice, this often leads to:

  • Training instability
  • Overshooting optimal solutions
  • Poor convergence

A scheduler dynamically adjusts the learning rate during training.


Warmup and Decay Strategy

Most LLM fine-tuning jobs use:

  1. Warmup Phase
  2. Peak Learning Rate
  3. Gradual Decay

The learning rate starts near zero, increases gradually, and then slowly decreases.

This approach provides:

  • Stable early training
  • Faster convergence
  • Better final performance

Warmup is particularly important because gradients can be highly unstable during the first few training steps.

Without warmup, large updates can damage pretrained knowledge.
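The warmup-then-decay shape can be written as a small function. This sketch uses linear warmup followed by cosine decay to zero; the function name lr_at_step and the default peak learning rate and warmup ratio are assumptions for illustration, not values prescribed by any library.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)

    # Warmup phase: ramp linearly from 0 up to the peak learning rate.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps

    # Decay phase: follow half a cosine wave from the peak down to 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000
for s in (0, 50, 100, 550, 1000):
    print(s, f"{lr_at_step(s, total):.2e}")
```

With 1000 total steps and a 10% warmup ratio, the rate starts at zero, reaches the peak at step 100, passes half the peak midway through the decay, and returns to zero at the final step.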


Understanding Layer Freezing

Another common fine-tuning strategy is Layer Freezing.

Consider a transformer model containing:

  • Embedding Layer
  • Transformer Block 1
  • Transformer Block 2
  • …
  • Transformer Block 32
  • Output Layer

[Figure: transformer model architecture with an embedding layer, transformer blocks 1 to 32, and an output layer; each block contains layer normalization, multi-head self-attention, and a feed-forward network.]

Instead of training every layer, we can freeze some layers.

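In PyTorch, a layer is frozen by setting the following flag on each of its parameters:

requires_grad = False

This means:

  • No gradients are computed for the layer’s parameters
  • No parameter updates
  • No optimizer state (such as AdamW’s momentum and variance) is stored for them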

The layer remains unchanged throughout training.
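The effect of the requires_grad flag can be imitated in a few lines of plain Python. This is a framework-free stand-in, not PyTorch itself: the Param class and sgd_step function are hypothetical, and the frozen parameter is given a gradient value here only to show that the update step ignores it (a real framework would not compute that gradient at all).

```python
from dataclasses import dataclass


@dataclass
class Param:
    value: float
    grad: float = 0.0
    requires_grad: bool = True  # False means the parameter is frozen


def sgd_step(params, lr):
    """Update only the parameters that are not frozen."""
    for p in params.values():
        if not p.requires_grad:
            continue  # frozen: skip the update entirely
        p.value -= lr * p.grad


params = {
    "block_1.weight": Param(value=0.5, grad=0.2),
    "block_32.weight": Param(value=0.5, grad=0.2),
}

# Freeze the early block; leave the late block trainable.
params["block_1.weight"].requires_grad = False

sgd_step(params, lr=0.1)
print(params["block_1.weight"].value, params["block_32.weight"].value)
```

The frozen weight keeps its original value while the trainable one moves, which is exactly the behavior layer freezing relies on.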


Why Freeze Layers?

Research and practical experience show that lower transformer layers often learn:

  • Grammar
  • Syntax
  • General language patterns

Higher layers capture:

  • Domain knowledge
  • Task-specific behavior
  • Specialized reasoning

Because language fundamentals are already learned during pretraining, retraining them is often unnecessary.

A common strategy is:

  • Freeze early layers
  • Fine-tune later layers

For example:

  • Freeze first 20 layers
  • Train last 12 layers

Benefits

  • Faster training
  • Lower memory consumption
  • Reduced overfitting
  • Better preservation of general capabilities
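The "freeze the first 20, train the last 12" split can be expressed as a simple plan over block names. The names block_1 through block_32 and the helper build_freeze_plan are hypothetical; real models use their own parameter naming, so treat this as a sketch of the bookkeeping only.

```python
NUM_BLOCKS = 32   # transformer blocks in the example model
NUM_FROZEN = 20   # how many early blocks to freeze


def build_freeze_plan(num_blocks, num_frozen):
    """Map each block name to True if it is trainable, False if frozen."""
    return {f"block_{i}": i > num_frozen for i in range(1, num_blocks + 1)}


plan = build_freeze_plan(NUM_BLOCKS, NUM_FROZEN)
trainable = [name for name, is_trainable in plan.items() if is_trainable]

print(f"{len(trainable)} trainable blocks: {trainable[0]} to {trainable[-1]}")
print(f"trainable share of blocks: {100 * len(trainable) / NUM_BLOCKS:.1f}%")
```

In a real training script, a plan like this would drive which parameters get their gradient flag turned off before the optimizer is created.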

Real-World Example: Insurance Chatbot

Imagine building an insurance assistant.

The pretrained model already understands:

  • English language
  • General reasoning
  • Conversation flow

What it lacks is deep insurance knowledge.

Instead of retraining the entire model:

  • Freeze lower layers
  • Adapt upper layers

The model learns:

  • Policy terminology
  • Claims processes
  • Insurance-specific workflows

while preserving general language understanding.


LoRA: The Most Important Modern Fine-Tuning Technique

LoRA (Low-Rank Adaptation) is one of the most widely adopted fine-tuning approaches today.

Instead of modifying the original weight matrix W directly, LoRA learns a small update matrix:

ΔW = A × B

The original weight matrix remains frozen.

The final effective weight becomes:

W_final = W + ΔW

Only the small matrices A and B are trained.
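A tiny numeric sketch of W_final = W + A × B, using plain Python lists instead of a tensor library. The matrix sizes (3×3 with rank 1), the values, and the choice to start B at zero so that the update begins as a no-op are all assumptions made for illustration.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Frozen pretrained weight (3x3). It is never modified below.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

# Low-rank factors with rank r = 1: A is 3x1, B is 1x3.
A = [[0.1], [0.2], [0.3]]
B = [[0.0, 0.0, 0.0]]  # zero start, so delta_W = 0 before training

W_initial = add(W, matmul(A, B))  # identical to W at the start

# Pretend training has changed B (hypothetical values).
B = [[1.0, 0.0, -1.0]]
delta_W = matmul(A, B)
W_final = add(W, delta_W)

for row in W_final:
    print([round(x, 2) for x in row])
```

Only A and B (6 numbers here) would receive gradients; the 9 entries of W stay exactly as pretrained, and the adapted behavior comes entirely from the added product.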

Why LoRA Works

LoRA assumes that task-specific knowledge can be represented as a low-rank update rather than rewriting the entire model.
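The saving from a low-rank update is easy to quantify. For a weight matrix of shape d × k, a full update has d × k entries, while the factors A (d × r) and B (r × k) together have r × (d + k). The dimensions below (4096 × 4096 with rank 8) are assumed values chosen for illustration.

```python
def full_update_params(d: int, k: int) -> int:
    """Number of entries in a full d x k weight update."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Number of entries in the two LoRA factors: A (d x r) and B (r x k)."""
    return r * (d + k)


d, k, r = 4096, 4096, 8
full = full_update_params(d, k)
low_rank = lora_params(d, k, r)

print(f"full update: {full:,} parameters")
print(f"LoRA (r={r}): {low_rank:,} parameters")
print(f"reduction:   {full // low_rank}x fewer trainable parameters")
```

Raising the rank r increases capacity and cost linearly, so rank is the main knob for trading adaptation strength against trainable parameter count.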

Advantages

  • Millions of trainable parameters instead of billions
  • Smaller checkpoints
  • Faster training
  • Lower GPU requirements
  • Reduced catastrophic forgetting

This is why LoRA has become the preferred approach for enterprise fine-tuning.


Typical Fine-Tuning Hyperparameters

Some commonly used settings include:

Hyperparameter    Typical Range
Learning Rate     1e-5 to 5e-5
Batch Size        8–128
Epochs            1–5
Warmup Ratio      5–10%
Weight Decay      0.01
Optimizer         AdamW
Scheduler         Linear or Cosine Decay

The exact values depend on:

  • Model size
  • Dataset size
  • Hardware constraints
  • Task complexity
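These settings interact: batch size and epochs determine the total number of optimizer steps, and the warmup ratio is applied to that total. The sketch below picks one value from each range in the table; the config keys, the schedule_lengths helper, and the dataset size of 12,800 examples are assumptions for illustration and do not correspond to any specific library's configuration format.

```python
import math

# One hypothetical choice from each range in the table above.
config = {
    "learning_rate": 2e-5,
    "batch_size": 32,
    "epochs": 3,
    "warmup_ratio": 0.05,
    "weight_decay": 0.01,
    "optimizer": "adamw",
    "scheduler": "cosine",
}


def schedule_lengths(num_examples: int, cfg: dict) -> tuple:
    """Return (total_steps, warmup_steps) implied by the config."""
    steps_per_epoch = math.ceil(num_examples / cfg["batch_size"])
    total_steps = steps_per_epoch * cfg["epochs"]
    warmup_steps = round(total_steps * cfg["warmup_ratio"])
    return total_steps, warmup_steps


total_steps, warmup_steps = schedule_lengths(12_800, config)
print(f"total steps: {total_steps}, warmup steps: {warmup_steps}")
```

Working these numbers out before launching a job helps confirm that the warmup phase is a sensible length for the dataset at hand.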

What Changes Internally During Fine-Tuning?

The following components may be updated:

Updated Components

  • Attention matrices (Q, K, V projections)
  • Feed-forward network weights
  • Output prediction layer
  • LoRA adapter weights (when using LoRA)

These updates help the model specialize for the target domain.


What Usually Does Not Change?

Unless performing advanced model surgery, the following remain unchanged:

  • Tokenizer
  • Model architecture
  • Number of transformer layers
  • Hidden dimensions
  • Attention heads

Fine-tuning adapts behavior without redesigning the model itself.


Understanding Catastrophic Forgetting

One risk of aggressive fine-tuning is catastrophic forgetting.

Imagine a model originally skilled at:

  • Programming
  • Science
  • History
  • Mathematics

After excessive legal-domain fine-tuning, it may become highly specialized in legal reasoning while losing performance in those unrelated domains, because the new updates overwrite weights that supported the earlier skills.

Techniques such as:

  • LoRA
  • Layer freezing
  • Smaller learning rates

help preserve the model’s original capabilities while adding new knowledge.


Final Thoughts

Fine-tuning is fundamentally about adapting a pretrained model to a specialized task without rebuilding its knowledge from scratch.

During fine-tuning:

  • Optimizers such as AdamW determine how weights are updated.
  • Learning-rate schedulers control the pace of learning.
  • Layer freezing preserves foundational language capabilities.
  • LoRA and other PEFT methods dramatically reduce training costs.

Modern enterprise AI systems increasingly rely on parameter-efficient approaches because they provide an excellent balance between performance, cost, and scalability.

Understanding these concepts is essential not only for building production-grade GenAI systems but also for succeeding in LLM, MLOps, and AI engineering interviews.
