On 'fantastic reasoning behaviors and where to find them'
I've read 'Fantastic Reasoning Behaviors and Where to Find Them' by Google DeepMind :) (arXiv) on my old Kobo.
I liked the title ('cause Harry Potter universe, obsvly). Tbh, I said to myself that until next week, I won't read anything except fiction (I have to do my Goodreads challenge :)), 50% discipline, 50% fatigue, 0% availability. Dang, sometimes I break my own rules when I accidentally see papers about ~discovering reasoning behaviors~ & stuff like that.
My notes:
'Building on the linear representation hypothesis (Park et al., 2023), we define fine-grained reasoning behaviors as linear directions in the activation space, which we refer to as Reasoning Vectors.'[p.2]
'atomic reasoning behavior can be mapped to a specific direction in the activation space, which we define as a reasoning vector.' [p.4]
=> Park et al., 2023 also used 'concepts' generated by GPT-4 and some CSV files(?) to 'demonstrate' the 'linear representation'.
They used pairs, for ex.: king/queen, male/female, leader/lady and/or wife, monarch/her, etc.
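To make the pair trick concrete, this is roughly the test the linear representation idea implies (a toy sketch with made-up embeddings; nothing here comes from Park et al.'s actual data or code):

import numpy as np

# Toy sketch: if 'gender' were a linear direction, the difference vectors
# of concept pairs (king-queen, man-woman, ...) should be nearly parallel.
# These embeddings are invented for illustration only.
emb = {
    "king":  np.array([0.9, 0.1, 0.4]),
    "queen": np.array([0.9, 0.8, 0.4]),
    "man":   np.array([0.2, 0.1, 0.1]),
    "woman": np.array([0.2, 0.8, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d1 = emb["king"] - emb["queen"]   # difference along the hypothesised direction
d2 = emb["man"] - emb["woman"]
print(cosine(d1, d2))  # close to 1.0 => consistent with a linear 'gender' direction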
Then they've discovered that languages are complicated:
'Therefore, a word can have another meaning other than the meaning of the exact word' [p.16]
also,
'In the experiment for intervention notion, for a concept W,Z, we sample texts which Y(0,0) (e.g., “king”) should follow, via ChatGPT-4. We discard the contexts such that Y(0,0) is not the top 1 next word. Table 4 present the contexts we use.' [p.17]
Seems to me that this is more data contamination (internet Common Crawl, Wikipedia, etc.) than a proof of 'linear representation'.
This is the base on which the DeepMind paper builds. Ok.
Returning to the paper:
=> 'Reasoning vectors' are supposedly 'directions in the latent space' that correspond to distinct reasoning behaviors; by training a 'Sparse Auto-Encoder' on step-level activations, one gets 'disentangled' features.
=> 'reflection' (revising earlier steps) or 'backtracking' (changing the current approach) => 'atomic reasoning behaviours' in the hypothetical linear representation (rough sketch of the 'direction' idea below).
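Here's the 'direction' idea as I understand it (my own minimal sketch; the difference-of-means construction and the random activations are stand-ins, not the paper's method):

import torch

# Sketch: a 'reasoning vector' as a direction in activation space.
# acts_reflect / acts_other would be step-level hidden states (n_steps x d_model);
# here they are random stand-ins.
d_model = 64
acts_reflect = torch.randn(100, d_model) + 0.5   # steps labelled 'reflection'
acts_other   = torch.randn(100, d_model)         # everything else

# One naive way to get a direction: difference of means, normalised.
v = acts_reflect.mean(0) - acts_other.mean(0)
v = v / v.norm()

# 'How much reflection' a new step carries = its projection onto v.
new_step = torch.randn(d_model)
score = new_step @ v
print(float(score))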
Bc 'atomic' reminded me of this article [arXiv:2410.19750v2 = 'The Geometry of Concepts']: it says that concept representations in neural networks are multidimensional, and if you try to project them onto a single direction, you always lose something: 'We find that this concept universe has interesting structure at three levels: (1) The “atomic” small-scale structure contains “crystals” whose faces are parallelograms or trapezoids, generalizing well-known examples such as (man:woman::king:queen).' [p.1]
Since it's unclear why the linear representation hypothesis was picked, it's debatable: some say representations are multidimensional, some say they're linear. Sure, let's talk about complex concepts with 98% confidence under a hypothesis we're not sure about.
If we look at the Park et al., 2023 methodology (LLM-generated labels), we see that the DeepMind authors use it too:
'The classification is performed using an LLM-as-a-judge approach, where for each reasoning step associated with , we prompt the LLM (i.e., GPT-5) with precise definitions of the behaviors: reflection (re-examining earlier steps), backtracking (switching to a new approach), and other.' [p.6]
=> Up to this point (the decoder matrix), 'the discovery' is unsupervised. Then the authors introduce the 'supervised' reflection/backtracking: they use GPT-5 as a 'judge' to label each step as reflection, backtracking, or 'other' (uuups :). Meanwhile, in the abstract:
'By segmenting chain-of-thought traces into sentence-level 'steps' and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking.' [p.1]
=> So 'other' is everything else?:)
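The labelling pipeline, as I read it, boils down to this loop (my own sketch; the prompt wording, the behavior definitions, and the call_llm helper are placeholders, not the paper's):

# Sketch of an LLM-as-a-judge step labeller. call_llm stands in for whatever
# API you use (the paper uses GPT-5); the definitions are paraphrased.
BEHAVIORS = ("reflection", "backtracking", "other")

PROMPT = """You are given one step of a chain-of-thought trace.
- reflection: the step re-examines earlier steps.
- backtracking: the step abandons the current approach and switches to a new one.
- other: anything else.
Answer with exactly one word: reflection, backtracking, or other.

Step: {step}"""

def label_step(step: str, call_llm) -> str:
    answer = call_llm(PROMPT.format(step=step)).strip().lower()
    return answer if answer in BEHAVIORS else "other"   # everything unparseable -> 'other' :)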
'In contrast, for the other category, which potentially encompasses a mixture of behaviors, the top-active vectors are less centered and more dispersed across the entire SAE space.' [p.6]
'This suggests that reflection and backtracking occupy more overlapping representational subspaces, while both are more clearly distinguished from the residual “other” category.' [p.6]
=> 'other' category = 'residual category' = everything we can't explain
While reading it, I started thinking about the 'objective' of the paper:
'Since our goal is to model various reasoning behaviors, whose complexity is much lower than modeling raw language structure, we adopt a relatively small hidden dimension of D = 2048. The SAE is trained with a batch size of 1024 and a learning rate of 1 × 10^-4, with a warm-up over the initial 10% of training. We use the Adam optimizer (Kingma, 2014) with cosine annealing learning rate decay. To encourage sparsity, we apply a sparsity strength of λ = 2 × 10^-3.'
It's unclear to me: Adam + cosine decay + warmup = the mathematical constraints stated in the paper,
vs the implementation:
loss = ((h_hat - h) ** 2).mean() + lambda_ * torch.abs(z).mean()
??? :) Probably an L1 penalty with standard backprop, but the paper lacks details of the actual implementation.
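For what it's worth, put together end to end the quoted setup would look roughly like this (a minimal sketch under the quoted hyperparameters; d_model, the random activations, and the number of steps are my stand-ins, not the authors' code):

import math
import torch
import torch.nn as nn

# Minimal sparse auto-encoder sketch matching the quoted setup:
# hidden dim D = 2048, batch size 1024, Adam, lr 1e-4, 10% warm-up,
# cosine decay, L1 sparsity strength 2e-3.
d_model, D = 1024, 2048
lam, lr, total_steps = 2e-3, 1e-4, 1000
warmup = int(0.1 * total_steps)

enc = nn.Linear(d_model, D)
dec = nn.Linear(D, d_model)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)

def lr_scale(step):
    if step < warmup:                                        # linear warm-up
        return step / max(1, warmup)
    t = (step - warmup) / max(1, total_steps - warmup)       # cosine decay afterwards
    return 0.5 * (1 + math.cos(math.pi * t))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_scale)

for _ in range(total_steps):
    h = torch.randn(1024, d_model)        # batch of step-level activations (stand-in)
    z = torch.relu(enc(h))                # sparse codes
    h_hat = dec(z)                        # reconstruction
    loss = ((h_hat - h) ** 2).mean() + lam * z.abs().mean()   # MSE + L1 sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()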
Anyway, they use small models and bring a bigger one to be the judge, labelling steps with 'human-defined concepts'. Without GPT-5 putting labels, there would be no interpretation of the 'discovered vectors' :)
The funny thing is that they intervene through ablation & editing. The model does more reflection if they increase the projection onto that 'wait, let me think again' direction; if they decrease that 'vector', the model doesn't 'reflect' anymore (steering sketch below). If you cut the 'fantastic reasoning' and still get the same correct result => maybe the 'fantastic reasoning' is less causal to how the model reaches the correct answer? The 'other' label stays undistinguished, and reflection/backtracking projected into 2D look the same as 'confidence' in the latent space.
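The steering sketch, for reference: conceptually the intervention is just adding or subtracting a scaled direction in the hidden states (standard activation steering; v, alpha, and the layer index are placeholders, not the paper's exact recipe):

import torch

# Sketch: steer or ablate a 'reasoning vector' v at one layer via a forward hook.
# alpha > 0 pushes the model toward the behavior ('wait, let me think again'),
# alpha < 0 suppresses it; model/layer/alpha/v are all placeholders.
def make_steering_hook(v: torch.Tensor, alpha: float):
    v = v / v.norm()
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h + alpha * v                      # add (or subtract) the direction
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return hook

# handle = model.model.layers[12].register_forward_hook(make_steering_hook(v, alpha=4.0))
# ... generate text, observe more/less reflection ...
# handle.remove()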
'Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner.'[p1]
They criticise 'analysing reasoning in a supervised manner', but they do the same: reflection/backtracking etc. are identified by GPT-generated labels, just as humans would select the columns in a CSV :).
'report the agreement ratio (defined as the proportion of steps receiving the same annotation relative to all steps) exceeding 85% for each pair of methods. (…) Overall, the consistency is high across most comparisons, with GPT-5 and GPT-4o exhibiting particularly strong agreement at approximately 94%. These results validate the reliability and consistency of our annotation methodology.' [p.17]
That means there is 6%-15% disagreement in the labelling, because each model has its own bias. When the model uses 'wait, let me think again', it's explicit, in token space. Cultural things: people say 'wait, let me think again' all the time because it's socially more acceptable than being blunt (which offends people more).
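For the record, the 'agreement ratio' they report is just this (trivial sketch, my own names):

# Agreement ratio: fraction of steps where two annotators give the same label.
def agreement_ratio(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)

# e.g. 94 identical labels out of 100 steps -> 0.94, the ~94% GPT-5/GPT-4o figure
print(agreement_ratio(["reflection"] * 94 + ["other"] * 6, ["reflection"] * 100))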
'While we observe a slight performance drop from 23.33% to 20.00%, this corresponds to only a single question out of 30 and is therefore not statistically significant. More importantly, the intervention substantially alters the reasoning style, with large reductions in reflection and backtracking.' [p.10]
When the model lacks confidence, the ~interventions~ mask it. With steering, they did some ~cherry picking~. They say that the way they ~tune~ it in real time is more efficient than other approaches. Ok. They 'discovered' how the model mimics ~thinking~.
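Quick sanity check on the 'not statistically significant' claim: 23.33% vs 20.00% on 30 questions is 7 vs 6 correct answers, and e.g. a Fisher exact test (my check, not the paper's) agrees there's nothing there:

from scipy.stats import fisher_exact

# '23.33% -> 20.00% on 30 questions' = 7/30 vs 6/30 correct.
table = [[7, 30 - 7],   # correct / incorrect before intervention
         [6, 30 - 6]]   # correct / incorrect after intervention
_, p_value = fisher_exact(table)
print(p_value)          # close to 1.0, so the drop really isn't statistically significant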
Even if, let's say, RISE discovers 'vectors'
=> it's practically inefficient on existing GPUs (too much overhead, isn't it?). In any case, the model still gets the correct result.