Large Language Models (LLMs) such as Llama, GPT, Mistral, and Gemma are pretrained on massive datasets containing trillions of tokens. These models already understand language, grammar, reasoning patterns, coding concepts, and a significant amount of world knowledge.
However, enterprises often need models that specialize in specific domains such as insurance, healthcare, legal services, finance, or customer support. This is where fine-tuning comes into play.
A common question in AI and GenAI interviews is:
“What actually changes inside a model during fine-tuning?”
In this article, we’ll explore what happens internally when a pretrained model is adapted to a new task and understand the roles of optimizers, learning-rate schedulers, layer freezing, and modern techniques like LoRA.
Starting Point: A Pretrained Model
Consider a model such as Llama 3.
Before fine-tuning, it has already learned:
- Language structure
- Grammar and syntax
- Reasoning patterns
- Programming concepts
- General world knowledge
Instead of training from scratch, we continue training the model using domain-specific data such as:
- Insurance policies
- Medical records
- Legal contracts
- Customer support conversations
- Enterprise documentation
The goal is not to teach language again, but to adapt existing knowledge to a specific use case.
What Parameters Exist Inside an LLM?
A transformer-based LLM contains billions of parameters spread across:
- Token embeddings
- Attention layers
- Feed-forward neural networks (MLPs)
- Output projection layers
- Bias terms
These parameters collectively determine how the model processes inputs and predicts the next token.
Fine-tuning modifies some or all of these parameters.
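To get a feel for where those billions of parameters live, here is a rough count for a hypothetical decoder-only configuration. The sizes below are illustrative assumptions rather than the specification of any particular model, and the formulas assume bias-free projections, a gated three-matrix MLP, and separate input and output embeddings. Normalization weights are ignored.

```python
def count_parameters(vocab_size, hidden_size, num_layers, mlp_size):
    """Approximate parameter counts per component group (assumed architecture)."""
    embeddings = vocab_size * hidden_size
    # Q, K, V and output projections, each hidden_size x hidden_size, no biases.
    attention_per_layer = 4 * hidden_size * hidden_size
    # Gated MLP: gate, up and down projections.
    mlp_per_layer = 3 * hidden_size * mlp_size
    output_head = vocab_size * hidden_size
    return {
        "embeddings": embeddings,
        "attention": num_layers * attention_per_layer,
        "mlp": num_layers * mlp_per_layer,
        "output_head": output_head,
    }


# Hypothetical configuration, chosen only for illustration.
counts = count_parameters(vocab_size=32_000, hidden_size=4096,
                          num_layers=32, mlp_size=11_008)
total = sum(counts.values())

for name, value in counts.items():
    print(f"{name:12s} {value:>14,d}  ({value / total:.1%})")
print(f"{'total':12s} {total:>14,d}")
```

Under these assumptions, the feed-forward blocks hold the largest share of the weights, followed by attention, with the embedding and output layers making up a small remainder.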
Full Fine-Tuning
The most straightforward approach is Full Fine-Tuning.
In this approach, every trainable parameter is updated.
This includes:
- Embedding layers
- Attention weights
- Feed-forward layers
- Output head
For example, a 70-billion-parameter model would have all 70 billion parameters updated during training.
Advantages
- Maximum adaptation to the target domain
- Potentially highest task performance
Disadvantages
- Extremely expensive
- Requires significant GPU memory
- Longer training times
- Higher risk of catastrophic forgetting
Because of these limitations, full fine-tuning is becoming less common in enterprise environments.
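A back-of-the-envelope estimate shows why memory is the main obstacle. The sketch below assumes everything is kept in 32-bit precision and that AdamW stores two extra values per parameter. It ignores activations, which add more on top. Real setups with mixed precision or sharding will differ, so treat the byte counts as assumptions.

```python
def full_finetune_memory_gb(num_params, bytes_weights=4, bytes_grads=4,
                            bytes_optimizer=8):
    """Memory for weights, gradients and AdamW state (activations excluded).

    bytes_optimizer=8 assumes two 32-bit moment estimates per parameter.
    """
    bytes_per_param = bytes_weights + bytes_grads + bytes_optimizer
    return num_params * bytes_per_param / 1e9


memory_gb = full_finetune_memory_gb(70_000_000_000)
print(f"Approximate training memory: {memory_gb:,.0f} GB")
```

At 16 bytes per parameter, a 70B model needs on the order of a terabyte of accelerator memory before a single activation is stored, which means the job has to be spread across many GPUs.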
Parameter-Efficient Fine-Tuning (PEFT)
Today, most organizations use Parameter-Efficient Fine-Tuning techniques.
Popular approaches include:
- LoRA
- QLoRA
- Adapters
- Prefix Tuning
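Instead of updating billions of parameters, PEFT trains only a small number of them, either a subset of the model's own weights or small added modules, while the rest of the model stays frozen.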
For example:
- Base Model: 70B parameters
- Trainable Parameters: 20M–100M
This dramatically reduces:
- GPU requirements
- Training costs
- Storage requirements
while maintaining strong performance.
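The numbers above are easy to put in perspective with a little arithmetic. This sketch computes the trainable share and the storage needed for the trained weights, assuming 2 bytes per parameter (16-bit values). The helper names are made up for illustration.

```python
def trainable_fraction(trainable_params, base_params):
    """Share of the model's parameters that receive updates."""
    return trainable_params / base_params


def checkpoint_size_gb(num_params, bytes_per_param=2):
    """Storage needed to save the parameters (2 bytes assumes 16-bit values)."""
    return num_params * bytes_per_param / 1e9


base = 70_000_000_000
for trainable in (20_000_000, 100_000_000):
    print(f"{trainable:,} trainable: "
          f"{trainable_fraction(trainable, base):.3%} of the model, "
          f"{checkpoint_size_gb(trainable):.2f} GB to store")
print(f"full model: {checkpoint_size_gb(base):.0f} GB to store")
```

Even at the upper end of the range, well under 1% of the model is trained, and the resulting checkpoint is a few hundred megabytes instead of more than a hundred gigabytes.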
The Role of an Optimizer
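After the forward pass produces a loss, backpropagation computes the gradient of that loss with respect to every trainable parameter. The model then needs a rule for turning those gradients into weight updates.
This is the optimizer’s job.
The training loop typically looks like this: forward pass, loss calculation, backward pass (gradient computation), and optimizer step, repeated for every batch.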
The optimizer decides:
“How much should each parameter change?”
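The loop is easier to see on a deliberately tiny problem. The sketch below fits a single weight to the relationship y = 2x using plain gradient descent. It is a toy stand-in for a real framework loop, with the gradient written out by hand, and all values are made up for illustration.

```python
# Toy data: the target relationship is y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

weight = 0.0           # the single trainable parameter
learning_rate = 0.05

for epoch in range(100):
    for x, y_true in data:
        y_pred = weight * x                  # 1. forward pass
        loss = (y_pred - y_true) ** 2        # 2. loss calculation
        grad = 2 * (y_pred - y_true) * x     # 3. backward pass: d(loss)/d(weight)
        weight -= learning_rate * grad       # 4. optimizer step (plain SGD)

print(f"learned weight: {weight:.4f}")
```

An LLM fine-tuning loop has the same four steps. The differences are scale and the fact that the framework computes the gradients automatically.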
Why AdamW Is the Industry Standard
Most LLM fine-tuning pipelines use AdamW.
AdamW improves traditional gradient descent through:
- Adaptive learning rates
- Momentum
- Weight decay regularization
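At a high level, plain gradient descent updates each weight as:
New Weight = Old Weight − Learning Rate × Gradient
AdamW enhances this rule by scaling each parameter's step using running averages of its past gradients and squared gradients, and by applying weight decay directly to the weights rather than through the gradient.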
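As a rough illustration of how AdamW uses gradient history, here is one update step for a single scalar parameter. The default values are commonly used settings and should be read as assumptions. Real implementations apply the same arithmetic to whole tensors.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    # Decoupled weight decay: shrink the weight directly, not via the gradient.
    theta = theta - lr * weight_decay * theta
    # Momentum: running average of gradients.
    m = beta1 * m + (1 - beta1) * grad
    # Adaptive scaling: running average of squared gradients.
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adamw_step(theta, grad=0.5, m=m, v=v, t=1, lr=0.1,
                         weight_decay=0.0)
print(f"weight after one step: {theta:.6f}")
```

On the very first step, the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly the learning rate regardless of how large the gradient is. That normalization is what "adaptive learning rates" refers to.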
Benefits of AdamW
- Stable convergence
- Faster training
- Better performance on large transformer models
- Reduced overfitting through weight decay
This makes AdamW the default optimizer for most LLM training and fine-tuning workflows.
Why Learning Rate Schedulers Matter
A common mistake is assuming that the learning rate should remain constant throughout training.
For example:
Learning Rate = 0.0001
for every training step.
In practice, this often leads to:
- Training instability
- Overshooting optimal solutions
- Poor convergence
A scheduler dynamically adjusts the learning rate during training.
Warmup and Decay Strategy
Most LLM fine-tuning jobs use:
- Warmup Phase
- Peak Learning Rate
- Gradual Decay
The learning rate starts near zero, increases gradually, and then slowly decreases.
This approach provides:
- Stable early training
- Faster convergence
- Better final performance
Warmup is particularly important because gradients can be highly unstable during the first few training steps.
Without warmup, large updates can damage pretrained knowledge.
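A warmup-then-decay schedule can be written as a small function of the step number. This is a minimal sketch assuming linear warmup followed by cosine decay to zero. The function name and the example values are illustrative, not taken from a specific library.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then cosine decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    lr = lr_at_step(step, total_steps=1000, warmup_steps=100, peak_lr=1e-4)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

Swapping the cosine term for a straight line gives linear decay. The warmup portion stays the same either way.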
Understanding Layer Freezing
Another common fine-tuning strategy is Layer Freezing.
Consider a transformer model containing:
- Embedding Layer
- Transformer Block 1
- Transformer Block 2
- …
- Transformer Block 32
- Output Layer

Instead of training every layer, we can freeze some layers.
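In PyTorch, a layer is frozen by setting this flag on each of its parameters:
requires_grad = False
This means:
- No gradients computed for those parameters
- No parameter updates
- No optimizer state (such as AdamW's moment estimates) stored for them
The layer remains unchanged throughout training.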
Why Freeze Layers?
Research and practical experience show that lower transformer layers often learn:
- Grammar
- Syntax
- General language patterns
Higher layers capture:
- Domain knowledge
- Task-specific behavior
- Specialized reasoning
Because language fundamentals are already learned during pretraining, retraining them is often unnecessary.
A common strategy is:
- Freeze early layers
- Fine-tune later layers
For example:
- Freeze first 20 layers
- Train last 12 layers
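The 20/12 split can be expressed as a rule over parameter names. The sketch below assumes names of the form `model.layers.<index>.…`, which is a naming pattern used here for illustration and may differ in your model. It also makes one design choice of its own: the input embeddings are frozen along with the early blocks, while the output head stays trainable.

```python
import re

LAYER_PATTERN = re.compile(r"\blayers\.(\d+)\.")


def is_trainable(param_name, num_frozen_layers=20):
    """Return True if a parameter should receive updates."""
    match = LAYER_PATTERN.search(param_name)
    if match:
        # Blocks 0..num_frozen_layers-1 are frozen; later blocks are trained.
        return int(match.group(1)) >= num_frozen_layers
    # Parameters outside the blocks: freeze embeddings, train the rest.
    return "embed" not in param_name


# Hypothetical parameter names for a 32-block model.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.19.mlp.up_proj.weight",
    "model.layers.20.self_attn.q_proj.weight",
    "model.layers.31.mlp.down_proj.weight",
    "lm_head.weight",
]
for name in names:
    state = "train" if is_trainable(name) else "frozen"
    print(f"{state:6s} {name}")

# With a PyTorch model, the rule could be applied along these lines:
#   for name, param in model.named_parameters():
#       param.requires_grad = is_trainable(name)
```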
Benefits
- Faster training
- Lower memory consumption
- Reduced overfitting
- Better preservation of general capabilities
Real-World Example: Insurance Chatbot
Imagine building an insurance assistant.
The pretrained model already understands:
- English language
- General reasoning
- Conversation flow
What it lacks is deep insurance knowledge.
Instead of retraining the entire model:
- Freeze lower layers
- Adapt upper layers
The model learns:
- Policy terminology
- Claims processes
- Insurance-specific workflows
while preserving general language understanding.
LoRA: The Most Important Modern Fine-Tuning Technique
LoRA (Low-Rank Adaptation) is one of the most widely adopted fine-tuning approaches today.
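Instead of modifying the original weight matrix W directly, LoRA learns a small update expressed as the product of two low-rank matrices:
ΔW = A × B
If W has shape d × k, then A has shape d × r and B has shape r × k, where the rank r is much smaller than d and k.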
The original weight matrix remains frozen.
The final effective weight becomes:
W_final = W + ΔW
Only the small matrices A and B are trained.
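Here is a toy version of that arithmetic in plain Python, using a 4 × 4 weight matrix and rank 1. A has shape 4 × 1 and B has shape 1 × 4, so their product has the same shape as W. One factor starts at zero so that ΔW is zero before training. Which factor is zeroed is an arbitrary choice in this sketch, and all values are made up.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def add(X, Y):
    """Element-wise sum of two matrices with the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Frozen pretrained weight (identity matrix as a stand-in).
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

A = [[0.1], [0.2], [0.3], [0.4]]   # 4 x 1, trainable
B = [[0.0, 0.0, 0.0, 0.0]]         # 1 x 4, trainable, starts at zero

W_start = add(W, matmul(A, B))     # equals W because delta_W is all zeros

# Pretend training moved one entry of B away from zero.
B_trained = [[1.0, 0.0, 0.0, 0.0]]
W_final = add(W, matmul(A, B_trained))

full_params = len(W) * len(W[0])                  # entries in W
lora_params = len(A) * len(A[0]) + len(B) * len(B[0])  # entries in A and B
print(f"full: {full_params} parameters, LoRA: {lora_params} parameters")
```

The saving looks modest at this size, but it grows quickly: the full matrix scales with d × k, while the two factors scale with r × (d + k).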
Why LoRA Works
LoRA assumes that task-specific knowledge can be represented as a low-rank update rather than rewriting the entire model.
Advantages
- Millions of trainable parameters instead of billions
- Smaller checkpoints
- Faster training
- Lower GPU requirements
- Reduced catastrophic forgetting
This is why LoRA has become the preferred approach for enterprise fine-tuning.
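To see where "millions instead of billions" comes from, we can count the adapter parameters for a hypothetical setup. The example assumes 32 layers, 4096 × 4096 projection matrices, rank 8, and adapters on two matrices per layer. All of these values are assumptions for illustration.

```python
def lora_trainable_params(d_in, d_out, rank, num_layers, matrices_per_layer):
    """Count the parameters in the low-rank factors across the whole model."""
    per_matrix = rank * (d_in + d_out)   # one d_in x rank and one rank x d_out factor
    return num_layers * matrices_per_layer * per_matrix


adapter_params = lora_trainable_params(d_in=4096, d_out=4096, rank=8,
                                       num_layers=32, matrices_per_layer=2)
full_matrix_params = 32 * 2 * 4096 * 4096   # the same matrices, fully trained

print(f"LoRA adapters: {adapter_params:,} parameters")
print(f"Same matrices, full update: {full_matrix_params:,} parameters")
print(f"Ratio: {adapter_params / full_matrix_params:.2%}")
```

Raising the rank or adapting more matrices per layer increases the count linearly, which makes the trainable budget easy to plan.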
Typical Fine-Tuning Hyperparameters
Some commonly used settings include:
| Hyperparameter | Typical Range |
|---|---|
| Learning Rate | 1e-5 to 5e-5 (full fine-tuning); often higher for LoRA, e.g. 1e-4 to 3e-4 |
| Batch Size | 8–128 |
| Epochs | 1–5 |
| Warmup Ratio | 5–10% |
| Weight Decay | 0.01 |
| Optimizer | AdamW |
| Scheduler | Linear or Cosine Decay |
The exact values depend on:
- Model size
- Dataset size
- Hardware constraints
- Task complexity
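Several of these settings interact. For instance, the warmup ratio only becomes a concrete number of steps once the dataset size, batch size, and epoch count are known. The sketch below works through that arithmetic for a hypothetical dataset of 10,000 examples, with values picked from the ranges in the table.

```python
import math

config = {
    "learning_rate": 2e-5,
    "batch_size": 32,
    "epochs": 3,
    "warmup_ratio": 0.05,
    "weight_decay": 0.01,
}
num_examples = 10_000   # hypothetical dataset size

# The last batch of an epoch may be partial, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / config["batch_size"])
total_steps = steps_per_epoch * config["epochs"]
warmup_steps = int(total_steps * config["warmup_ratio"])

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"warmup steps:    {warmup_steps}")
```

With a small dataset, a 5% warmup can amount to only a few dozen steps, so it is worth checking the absolute number rather than relying on the ratio alone.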
What Changes Internally During Fine-Tuning?
The following components may be updated:
Updated Components
- Attention matrices (Q, K, V projections)
- Feed-forward network weights
- Output prediction layer
- LoRA adapter weights (when using LoRA)
These updates help the model specialize for the target domain.
What Usually Does Not Change?
Unless performing advanced model surgery, the following remain unchanged:
- Tokenizer
- Model architecture
- Number of transformer layers
- Hidden dimensions
- Attention heads
Fine-tuning adapts behavior without redesigning the model itself.
Understanding Catastrophic Forgetting
One risk of aggressive fine-tuning is catastrophic forgetting.
Imagine a model originally skilled at:
- Programming
- Science
- History
- Mathematics
After excessive legal-domain fine-tuning, it may become highly specialized in legal reasoning while losing performance in unrelated domains.
This loss of previously learned capabilities is what the term catastrophic forgetting describes.
Techniques such as:
- LoRA
- Layer freezing
- Smaller learning rates
help preserve the model’s original capabilities while adding new knowledge.
Final Thoughts
Fine-tuning is fundamentally about adapting a pretrained model to a specialized task without rebuilding its knowledge from scratch.
During fine-tuning:
- Optimizers such as AdamW determine how weights are updated.
- Learning-rate schedulers control the pace of learning.
- Layer freezing preserves foundational language capabilities.
- LoRA and other PEFT methods dramatically reduce training costs.
Modern enterprise AI systems increasingly rely on parameter-efficient approaches because they provide an excellent balance between performance, cost, and scalability.
Understanding these concepts is essential not only for building production-grade GenAI systems but also for succeeding in LLM, MLOps, and AI engineering interviews.