
What Actually Happens Inside LLMs When You Use RL?


We peeked under the hood to see how reinforcement learning changes what's going on inside language models. Spoiler: it's way cooler than we thought.

So here's the thing: everyone knows that reinforcement learning (RL) makes language models better at reasoning. Models like OpenAI's o1 and DeepSeek-R1 absolutely crush math problems after RL training. But here's what bugged me: what's actually happening inside the model?

We know RL works, but we don't really know why it works. Does it completely rewrite the model's brain? Does it just tweak a few things? Is it memorizing everything or actually learning to reason?

My collaborator Rahul and I decided to find out. We took two open-source models (Qwen3-1.7B and LLaMA-3.2-1B), fine-tuned them on GSM8K math problems using both supervised fine-tuning (SFT) and RL (specifically GRPO), and then we went full detective mode on their internal representations.

The Context: Why This Matters

Large language models have become the foundation of modern NLP. Beyond standard pretraining, two main strategies have emerged to specialize them: supervised fine-tuning (SFT) and reinforcement learning (RL). SFT uses next-token prediction from human demonstrations, while RL methods like RLHF optimize behavior via scalar rewards, enabling models to follow instructions better, reduce harmful outputs, and reason more effectively (Ouyang et al., 2022).

Recent RL-trained reasoning models have shown massive improvements. OpenAI's o1 uses RL to generate long latent reasoning chains before responding and achieves breakthroughs on STEM benchmarks (OpenAI, 2024). DeepSeek-R1-Zero applies a critic-free RL algorithm (GRPO) directly to the base model, without any supervised fine-tuning, and reaches reasoning capabilities comparable to o1 (DeepSeek-AI, 2025).

Despite these advances, the inner workings of RL fine-tuning remain opaque. Existing research primarily evaluates final outputs (accuracy, latency, etc.) and neglects how internal representations evolve. This gap is critical because RLHF algorithms typically include KL-divergence penalties intended to constrain drift from a base model, suggesting that internal structure is changing but in unclear ways.

The Setup: Math Problems and Two Training Methods

We chose GSM8K (grade school math word problems) for a few reasons: (1) you can verify if answers are correct, which makes RL training straightforward, and (2) it's way easier to interpret what the model is doing when it's solving math vs. something subjective like human preferences.
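To make the "verifiable" part concrete: the reward for GRPO can be as simple as parsing the model's \boxed answer and comparing it to the gold answer. Here's a minimal sketch; the helper names and regex are illustrative, not our exact training code:

```python
import re

def extract_boxed_answer(completion):
    """Pull the last \\boxed{...} value out of a model completion."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None

def gsm8k_gold_answer(answer_field):
    """GSM8K stores the final numeric answer after '####' in the answer field."""
    return answer_field.split("####")[-1].strip().replace(",", "")

def correctness_reward(completion, answer_field):
    """1.0 if the boxed answer matches the gold answer, else 0.0."""
    pred = extract_boxed_answer(completion)
    gold = gsm8k_gold_answer(answer_field)
    return float(pred is not None and pred.replace(",", "") == gold)
```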

We used two base models:

  • Qwen/Qwen3-1.7B: A 1.7B parameter model
  • unsloth/Llama-3.2-1B-Instruct: An instruction-finetuned version of LLaMA-3.2-1B (we used the instruction-tuned version because the base model is poor at instruction following, making it hard to verify answers with GRPO)

We trained both models for 4 epochs using:

  • SFT (Supervised Fine-Tuning): The classic approach: just predict the next token from example solutions using cross-entropy loss
  • RL with GRPO: Group Relative Policy Optimization estimates advantages by comparing sampled outputs within a group, avoiding the need for a separate critic model
  • RL with and without KL regularization: To see how much that "stay close to the base model" constraint matters

GRPO is particularly cool because it cuts memory usage roughly in half compared to PPO and simplifies training dynamics. Instead of training a separate value function, it estimates advantages by comparing outputs within a group:

$$A_i = \frac{R(\gamma_i) - \mathrm{mean}(G)}{\mathrm{std}(G)}, \quad \gamma_i \in G$$

This group-based comparison aligns well with reward-based ranking and enabled DeepSeek-R1-Zero to learn reasoning strategies via RL alone without supervised warm-up (DeepSeek-AI, 2025).
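To make the formula concrete, here's a minimal sketch of the group-relative advantage computation, plus the per-token KL penalty we toggle on and off. The function names are ours, and the KL term uses the standard unbiased estimator common in GRPO-style trainers (an assumption about the exact implementation):

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO advantage: standardize each completion's reward against the
    mean/std of its own group G, so no critic model is needed.
    rewards: tensor of shape (num_prompts, group_size) holding R(gamma_i)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def kl_penalty(policy_logprobs, ref_logprobs):
    """Per-token KL(policy || reference) via the unbiased estimator
    exp(log_ratio) - log_ratio - 1, where log_ratio = ref - policy."""
    log_ratio = ref_logprobs - policy_logprobs
    return torch.exp(log_ratio) - log_ratio - 1.0

# Example: 2 prompts, 4 sampled completions each, binary correctness rewards
rewards = torch.tensor([[1.0, 0.0, 1.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```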

The Results: RL Crushes It (But Why?)

First, the obvious stuff: RL performed way better. On Qwen, RL with KL hit 83.55% accuracy vs. 62.70% for SFT. That's a gap of roughly 21 percentage points! On LLaMA, the gap was even bigger: about 23 points. The base model scores were 20.77% for Qwen and 48.67% for LLaMA, so RL improved both models substantially, while SFT only helped Qwen (more on that below).

| Method | Epoch 1 | Epoch 2 | Epoch 3 | Epoch 4 |
| --- | --- | --- | --- | --- |
| Qwen3-1.7B RL (KL) | 80.74 | 81.88 | 83.55 | 81.43 |
| Qwen3-1.7B RL (no KL) | 82.64 | 82.87 | 83.47 | 81.40 |
| Qwen3-1.7B SFT | 54.40 | 63.00 | 62.90 | 62.70 |
| LLaMA-3.2-1B RL (KL) | 54.36 | 58.38 | 58.30 | 58.07 |
| LLaMA-3.2-1B RL (no KL) | 28.35 | 50.11 | 38.89 | 51.10 |
| LLaMA-3.2-1B SFT | 26.99 | 32.15 | 34.80 | 35.63 |

Eval score (%) by training epoch. Base model scores: Qwen = 20.77%, LLaMA = 48.67%.

RL models also generated longer completions, often using the full 512-token budget. At first, I thought this was wasteful, but it turns out they're being thorough: exploring different solution paths and then confidently landing on the answer. This is a phenomenon that's been observed before in the DeepSeek-R1 paper: GRPO incentivizes exploration, which often leads to longer and better answers.

For Qwen, training with and without KL divergence produced no meaningful difference in final performance. For LLaMA, training with KL divergence led to better results (a gap of roughly 7 percentage points). Interestingly, for LLaMA-3.2 1B, SFT actually made the model worse than the base. This is likely because the base model we used is already an extensively instruction-tuned checkpoint, so further SFT on GSM8K disrupts it.

But here's where it gets interesting: when we looked at what changed inside the models, RL barely moved anything. SFT, on the other hand, went wild.

High-Level Analysis: What Changed in the Weights?

L2 Distance: How Far Did We Drift?

We measured how far the fine-tuned weights drifted from the base model using L2 (Frobenius) distance. The results were shocking:

[Figure: L2 distance comparison showing RL stays closer to the base model]

SFT caused massive drift. The weights moved way further from the original model. RL models stayed surprisingly close to the base. Even more interesting: RL with KL regularization stayed even closer, but RL without KL still did way better than SFT while staying closer to the base.
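For reference, the measurement itself is straightforward. Here's a minimal sketch using Hugging Face checkpoints; the checkpoint path and the per-matrix aggregation are illustrative choices, not our exact analysis script:

```python
import torch
from transformers import AutoModelForCausalLM

def l2_drift(base_name, tuned_name):
    """Per-parameter Frobenius distance ||W_tuned - W_base||_F."""
    base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float32)
    tuned = AutoModelForCausalLM.from_pretrained(tuned_name, torch_dtype=torch.float32)
    base_params = dict(base.named_parameters())
    drift = {}
    for name, p_tuned in tuned.named_parameters():
        if name in base_params and p_tuned.shape == base_params[name].shape:
            drift[name] = torch.linalg.norm(
                p_tuned.detach() - base_params[name].detach()
            ).item()
    return drift

# e.g. drift = l2_drift("Qwen/Qwen3-1.7B", "path/to/grpo-checkpoint")
#      total_drift = sum(drift.values())
```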

Rank Change: How Much Information Are We Storing?

We also looked at the "rank" of the weight matrices (roughly, how much information they store). A higher rank means more "information", whereas a lower rank means the matrix stores less, since it can be well approximated by a smaller factorization with only minor loss.

To measure this, we first used SVD to find the rank K of each base-model weight matrix that accounts for 99% of its information. For example, for a Q (query) matrix (2048 × 2048), K might be 2000. We then measured how much information the top K singular values preserve in the corresponding fine-tuned (RL/SFT) weight and subtracted the two values. A negative value means the effective rank increased during training: the same top K directions now hold less of the information than they did in the base model.
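In code, the per-matrix measurement looks roughly like this. We treat squared singular values as the "information" (energy); whether one uses σ or σ² is a detail, so take this as a sketch under that assumption:

```python
import torch

def topk_energy_fraction(W, k):
    """Fraction of squared-singular-value energy captured by the top-k singular values."""
    energy = torch.linalg.svdvals(W.float()).pow(2)
    return (energy[:k].sum() / energy.sum()).item()

def rank_at_threshold(W, threshold=0.99):
    """Smallest K whose top-K singular values capture `threshold` of the energy."""
    energy = torch.linalg.svdvals(W.float()).pow(2)
    cumulative = torch.cumsum(energy, dim=0) / energy.sum()
    return int((cumulative < threshold).sum().item()) + 1

def rank_change_score(W_base, W_tuned):
    """Base top-K energy (~0.99) minus tuned top-K energy at the same K.
    Negative => the tuned matrix spreads information over more directions (higher rank)."""
    k = rank_at_threshold(W_base)
    return topk_energy_fraction(W_base, k) - topk_energy_fraction(W_tuned, k)
```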

[Figure: Rank comparison showing SFT increases rank while RL preserves it]

SFT training led to weights with higher rank than the base models. RL models preserved or even reduced the rank. This suggests SFT is memorizing more information, while RL is learning more general patterns.

Think of it like this: imagine you're teaching someone to add numbers. SFT is like making them memorize "2+2=4, 4+4=8, 8+10=18..." for every possible combination. RL is like teaching them the general concept of addition so they can figure out any problem. The first approach needs way more storage (higher rank), but the second generalizes better.

Key Insight

Base models already have the reasoning ability to solve these problems. They have the circuits and knowledge built in. RL just gives them a gentle nudge in the right direction, activating the specific circuits/weights that matter. SFT, on the other hand, tries to hardcode everything.

Token-Level Analysis: Where Does the Model Look?

Self-Attention Matrices: The Big Picture

We plotted the difference between the self-attention matrix of the fine-tuned model (RL/SFT) and the base model for a particular prompt. Similar to the weight analysis, both the self-attention pattern and the scores remained much closer in "RL vs. Pre" than in "SFT vs. Pre".
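Here's a minimal sketch of how these attention matrices can be extracted and differenced with transformers. Averaging over heads in a single layer is an illustrative choice, and output_attentions=True requires the eager attention implementation in recent versions, both assumptions about the exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def attention_matrix(model_name, prompt, layer=-1):
    """Head-averaged self-attention matrix (N x N) for one prompt at one layer."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
    inputs = tok(prompt, return_tensors="pt")
    out = model(**inputs, output_attentions=True)
    # out.attentions: one (batch, heads, N, N) tensor per layer
    return out.attentions[layer][0].mean(dim=0)

# Fine-tuned checkpoints share the base tokenizer, so the matrices are token-aligned
prompt = "<a GSM8K word problem goes here>"
diff = attention_matrix("path/to/rl-checkpoint", prompt) - attention_matrix("Qwen/Qwen3-1.7B", prompt)
```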

Important Note

The color scales for "SFT vs. Pre" span a much wider range (−0.1 to 0.1) than the scales for the RL rows. So visually it might look like "RL vs. Pre" shows similar differences, but that's not the case: SFT causes far bigger changes.

Interestingly, the difference between KL and no KL is more prominent here (especially in LLaMA), where self-attention scores of models trained with KL, as expected, remain closer to the base model.

Per-Token Attention: What Gets Focused On?

We visualized the difference in attention received by each token in the prompt between the base and fine-tuned models. For a given self-attention matrix $A \in \mathbb{R}^{N \times N}$, where $A_{ij}$ is the attention paid by token $i$ to token $j$, we computed the total attention received by each token $j$ as the column-wise sum $s_j = \sum_{i=1}^{N} A_{ij}$. To mitigate the influence of the first token, which often acts as an attention sink (Barbero et al., 2025), we discarded $s_1$ and normalized the remaining vector.
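Given an attention matrix A like the one above, the per-token score is just a column sum with the sink column dropped; a minimal sketch:

```python
import torch

def attention_received(A):
    """Normalized attention received per token for an (N x N) attention matrix A.
    The first column is dropped to remove the attention-sink token."""
    s = A.sum(dim=0)    # s_j = sum_i A_ij (column sums)
    s = s[1:]           # discard s_1 (attention sink)
    return s / s.sum()  # renormalize the remaining vector
```

Comparing these vectors between the base and fine-tuned models gives the per-token differences discussed below.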

One observation remains consistent across models and examples: RL pays much more attention to the formatting token (\boxed) than SFT does. In fact, SFT pays less attention to it than even the base model (after normalization). This is likely due to two reasons:

  1. For GRPO, the reward strictly depends on the right answer appearing inside \boxed, so the model learns to always output it, and often outputs it multiple times (repeating the answer)
  2. Given the other results (e.g., the small L2 difference) and our hypothesis that RL only slightly nudges the model in the right direction, this per-token attention pattern suggests that RL shifts attention toward just a few tokens, and those few tokens are sufficient to improve performance

Another observation supports this: if you look at the attention difference for the tokens that make up the actual content of the math problem (e.g., "Grace weights 25 pounds, Alex weights 2 pounds less than 4 times..."), the difference between RL and the base model ("Pre") is small, smaller than between SFT and the base. Ideally, since the RL model performs much better than the base, you'd expect it to pay more attention to these "logical tokens" or attend to them in a different pattern, but that's not the case. This further supports the theory that base models already contain the circuits/knowledge to solve these problems, and RL only slightly nudges them in the correct direction.

The Entropy Story: Exploration vs. Overconfidence

This was my favorite finding. We tracked the "entropy" (uncertainty) of the model as it generates each token. High entropy = the model is exploring, low entropy = it's confident.

More specifically, $e_t$ is calculated as:

$$e_t = \sum_{v=1}^{V} p_{t,v} \log p_{t,v}$$

where $p_{t,v}$ is the softmax probability of the $v$-th vocabulary token at generation step $t$. A flatter, more uniform distribution means a larger negative value of $e_t$. A sharply peaked distribution (where only one token is highly probable) means $e_t$ will be closer to 0.
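Concretely, $e_t$ can be read directly off the logits at each generation step; a minimal sketch that keeps the sign convention above (values are ≤ 0):

```python
import torch
import torch.nn.functional as F

def per_token_entropy(logits):
    """e_t = sum_v p_{t,v} * log p_{t,v} for each position t.
    logits: (T, V) raw scores over the vocabulary; returns (T,) values <= 0."""
    log_p = F.log_softmax(logits.float(), dim=-1)
    return (log_p.exp() * log_p).sum(dim=-1)
```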

[Figure: RL model output with entropy overlay; entropy plot shows broad exploration followed by confidence]

RL: Explores broadly, then gets confident

[Figure: SFT model output with entropy overlay; entropy plot shows early overconfidence]

SFT: Overconfident early, then uncertain

There are two key things we observe:

  1. RL-trained models are more exploratory in the initial phase (they have higher negative values) while SFT models start out being much more confident. This is due to the fundamental difference in how RL and SFT work: RL encourages exploration of different paths during training and rewards all paths that lead to correct answers, while SFT forces the model to collapse to one mode.
  2. As RL models get closer to the answer, they gradually become more confident (the line gets closer to 0 and flatter). On the other hand, for SFT models, there's a dip in confidence just before they answer. For harder tasks, having higher entropy (larger negative values of e_t) becomes important since it helps the model explore and backtrack during inference to come to the correct solution. RL helping the model avoid a mode/entropy collapse is likely one of the most important reasons why RL-trained models are much better at harder tasks.

Discussion & Limitations

Our study focuses on two open-source models and a single math benchmark, which may limit generalization to other domains or model scales. While our structural analyses provide clear trends, they do not capture all possible forms of internal drift. Future work could expand to additional architectures, tasks, and interpretability tools for broader validation.

A broader impact of the project is that most RL research usually focuses on better algorithms, evaluations, and environments. Our project moves in a different direction and helps us understand why RL works.

What This All Means

So here's the big picture:

  • RL doesn't rewrite the model. It makes small, targeted changes that preserve the base model's reasoning abilities.
  • SFT memorizes. It stores more information (higher rank) but doesn't generalize as well.
  • RL encourages exploration. Models stay uncertain longer, try different paths, and then confidently commit to answers.
  • Base models already know how to reason. RL just activates the right circuits and sharpens focus on what matters.
  • RL preserves attention structure. Especially with KL regularization, attention patterns stay close to the base model.
  • RL focuses on key tokens. It pays more attention to formatting tokens like \boxed without needing to change attention to problem content.

This has huge implications for alignment and interpretability. If RL makes small, interpretable changes rather than massive rewrites, we can better understand and control what models are doing. We can design training that preserves the good stuff while fixing the bad stuff.

The Takeaway

RL fine-tuning is like a precision tool: it makes surgical changes rather than swinging a sledgehammer. It nudges models in the right direction without breaking what already works. And that's probably why it's so effective at improving reasoning while maintaining generalization.

These results suggest that RL fine-tuning encourages efficient and selective adaptation, while SFT drives more global but less structured changes. Understanding these internal differences is key to designing future alignment strategies that balance performance with interpretability.

References & Resources

Papers & Resources:

  • Ouyang et al. (2022). Training language models to follow instructions with human feedback.
  • OpenAI (2024). Learning to Reason with LLMs (o1 announcement).
  • DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
  • Barbero et al. (2025). Why do LLMs attend to the first token?

Code & Plots:

This work was done with Rahul Chand as part of CS224R at Stanford. Shoutout to the open-source community for making models like Qwen and LLaMA available for research!