
Inside a Language Model’s Mind: Curved Inference as a New “AI Interpretability” Paradigm

New Evidence of the Shape of Thought

13 min read · May 11, 2025

© The Quantastic Journal based on a Stock image.

Large Language Models (LLMs) do not just predict text one token at a time. They literally bend their internal state in response to shifts in meaning. This isn’t a speculative claim. This is now an empirical, cross-validated statement. Recent experiments show that LLM inference is actually curved.

But what does that really mean?
Curved Inference: A New “AI Interpretability” Paradigm

The images below might look like abstract art — but they’re showing what actually happens inside an LLM’s Residual Stream when it changes its mind. The Residual Stream is the internal pathway that accumulates and integrates information as a prompt flows through the model during inference. You can think of it as a live working memory — a dynamic trace of what the model is currently “thinking,” layered with context, meaning, and intent. It’s where the model writes, rewrites, and merges token-level representations across each layer, shaping how it reaches an output.

These 3D plots show the motion of tokens through the model’s internal representational space during inference. The four plots are shown together to contrast the results from each prompt in this set (this set is “Emotional Instructional” — see the label below each plot).

Each point within a plot, starting from the darkest point to the lightest point, is a layer-wise projection of a single token. This creates a trajectory. The axes (labeled here as UMAP Dimensions 1, 2, and 3) are reduced coordinates from the model’s high-dimensional latent space, compressed for visualisation using UMAP. In other words, these are snapshots of how a token’s representation changes as it flows through each model layer.
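
As a rough illustration of how such layer-wise trajectories can be captured, here is a minimal sketch assuming the Hugging Face `transformers` library and the `umap-learn` package. The model name is illustrative, and treating the per-layer hidden states as the Residual Stream trace is my assumption rather than the experiment’s exact pipeline.

```python
# A rough sketch (not the experiment's exact pipeline): capture per-layer
# hidden states for one prompt and reduce them to 3D for trajectory plots.
import torch
import umap
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-3B"  # illustrative; any small open-weight causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Before presenting your findings, practice your delivery nervously"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors, each [1, seq_len, d_model]:
# the embedding output plus the post-block state at every layer.
hidden = torch.stack(out.hidden_states, dim=0).squeeze(1)   # [layers+1, seq, d_model]
n_layers, n_tokens, d_model = hidden.shape
points = hidden.reshape(-1, d_model).float().numpy()

# Reduce to 3D; each token's consecutive layer points then form its trajectory.
coords = umap.UMAP(n_components=3, random_state=0).fit_transform(points)
trajectories = coords.reshape(n_layers, n_tokens, 3).transpose(1, 0, 2)  # [token, layer, 3]
```

Plotting each row of `trajectories` as a 3D line, darkest at the embedding layer and lightest at the final layer, gives the kind of view shown below.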

3D Plots of Token Trajectories through a Latent Space — reduced using UMAP

The experiment involved pairs of prompts. One was a “control” — a neutral version of a sentence. The other was almost identical, but with one word changed to subtly alter the meaning in an emotionally, morally, or personally significant way. These are referred to as “concern-shifted” prompts.

In each image, the green-ish trajectories show the token paths for the control prompt. The red-ish ones show the same prompt structure, minimally altered with the concern-shifted word. The bright red line marks the trajectory of the single token where meaning changed — which often serves as the backbone for the overall curvature that follows.

Example Prompt Pair:
Emotional Instructional 02

Control (neutral):
Before presenting your findings, practice your delivery repeatedly

Negative / Moderate:
Before presenting your findings, practice your delivery nervously

This experiment was run using two open-weight models (Gemma 3 and LLaMA 3.2). What emerged was a consistent internal deformation in the Residual Stream — the model’s dynamic memory — driven not by syntax or grammar, but by shifts in semantic concern.

To be precise, ‘concern’ here is defined as the latent weight of a shift in meaning — such as emotional, moral, or identity-based significance — that alters how the model integrates information, even if the surface tokens are nearly the same. When a model processes a prompt that carries heightened concern, it doesn’t simply change its output — it bends its internal representational trajectory. This ‘bending’ refers to a measurable deformation in the path of token representations as they move through the model’s layers. When plotted, these paths deviate from the straight-line accumulation seen in neutral prompts, and instead curve, fork, spiral, or compress depending on the nature of the semantic shift. This is what we call,

‘curvature’: a signature of the model’s internal reconfiguration in response to meaningful difference.
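
To make the idea of ‘bending’ concrete, here is one simple, hypothetical way to quantify it: the turning angle between consecutive layer-to-layer steps of a single token’s path. This is an illustrative measure of my own, not the specific curvature metric used in the study; `trajectory` is assumed to be an array of per-layer points (for example, one row of the `trajectories` array from the sketch above, or the raw residual-stream vectors before any reduction).

```python
import numpy as np

def turning_angles(trajectory: np.ndarray) -> np.ndarray:
    """Angle (radians) between successive layer-to-layer steps of one path."""
    steps = np.diff(trajectory, axis=0)                 # layer-to-layer deltas
    norms = np.linalg.norm(steps, axis=1, keepdims=True)
    units = steps / np.clip(norms, 1e-12, None)         # unit step directions
    cosines = np.sum(units[:-1] * units[1:], axis=1)    # dot of consecutive steps
    return np.arccos(np.clip(cosines, -1.0, 1.0))

# A near-straight path gives angles close to zero; a path that bends, forks,
# or curls produces large angles at the layers where reorientation happens.
```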

When this data is analysed, a clear pattern emerges — one that reveals not just token-level changes, but structural transformations in how meaning is integrated. These findings converge on a compelling conclusion:

Language models don’t just predict. They move, through a representational space shaped by concern.

This is a bold claim that has broad implications. But it does appear that LLM inference has a geometry.

AI Interpretability

AI is often described as a “black box” because, despite its ability to produce impressively coherent and accurate outputs, we often lack visibility into how those outputs are generated internally. Neural networks, especially large-scale models like LLMs, operate using billions of parameters. These parameters interact in ways that are not easily reducible to human-interpretable logic. As a result, it’s difficult to trace how a given input leads to a particular output, or to understand whether the system is reasoning, memorising, or simply pattern-matching in unexpected ways.

“Relationship of TAI, EAI, IAI, and XAI” diagram

This form of opacity becomes a problem in high-stakes settings — when we ask whether a model is biased, whether it understood a prompt, or whether it is pursuing a latent objective. The traditional tools used to analyse these questions, while useful, offer only fragments of insight.

Interpretability has long promised to open up the black box of AI. But most current tools offer a narrow window — Saliency Maps, Attention Weights, Feature Attributions. These methods ask:

Which parts of the input caused which outputs?

That’s helpful, but it’s also limited. It treats the model like a lookup table. As if thinking were just a matter of counting which neuron fired most.

This limitation is especially problematic in the alignment and safety community, where understanding a model’s internal goals or reasoning is critical. One major challenge is the issue of superposition — a phenomenon where the model encodes more features than it has available dimensions, causing different concepts to be entangled in the same neurons. This makes it harder to isolate and interpret specific behaviours, and further obscures the true structure of inference from traditional Interpretability tools.

Figure from the “Scaling Monosemanticity” paper. This figure illustrates how different tokens activate features within a language model — in this case, one strongly associated with “The Golden Gate Bridge.” The top half shows the distribution of activation levels across many inputs, colour-coded by how specific or relevant the token is to the concept. The bottom half displays example tokens and images sampled from different activation intervals. Together, the figure makes a powerful case for how monosemantic features — i.e., individual directions in representation space that correspond to specific concepts — become more precise and interpretable as activation increases. This is a leading example of current feature attribution techniques. It also highlights a key contrast: while this method isolates fixed, concept-specific activations, the approach explored in this article focuses on how the model’s internal representation moves — bending in response to changing meaning, rather than activating around static concepts.

This struggle has led to a growing number of voices calling for a more structured and faithful approach to Interpretability. Several recent papers have taken up this call, each identifying critical shortcomings in existing tools while proposing more faithful, model-native alternatives.

One such paper, Interpretability Needs a New Paradigm, argues that methods grounded in real model behaviour are essential — moving beyond post-hoc explanations that often fail to reflect the actual mechanics of inference. Building on this, Rethinking Interpretability in the Era of Large Language Models highlights the widening gap between the capabilities of modern LLMs and the simplicity of current Interpretability frameworks, calling for new methods that evolve alongside these systems.

Further scrutiny of popular techniques is offered in How Interpretable are Reasoning Explanations from Prompting Large Language Models?, which puts Chain-of-Thought prompting under empirical evaluation, revealing both its strengths and its failure modes. In a similar vein, Can Large Language Models Explain Themselves? questions the faithfulness of LLM-generated self-explanations, raising doubts about whether models can accurately report on their own reasoning processes.

Finally, Mechanistic Interpretability of Large Language Models with Applications to the Financial Services Industry demonstrates how structural analysis tools can be used in applied, high-stakes settings, reinforcing the importance of methods that not only explain but also generalise across contexts.

Together, these works reflect an emerging consensus that current Interpretability methods, while useful, may not be sufficient. There is a growing recognition that to understand how large language models actually work, we need approaches that move beyond surface-level indicators and instead engage with the deeper structure and semantics of model behaviour. These calls aren’t just conceptual — they stem from concrete evaluations showing where current tools fall short and how richer, more faithful alternatives could be built.

“We do not yet know how to tell whether a model is pursuing a goal” — Dario Amodei, CEO of Anthropic

But it seems cognition in machines might not be a series of discrete steps. It might be a curved path. Knowing what activated isn’t the same as knowing how a model is moving through inference — what it’s attending to and integrating. What concern and meaning it’s bending toward. We need Interpretability methods that capture motion. Methods that show not just where attention landed, but how concern shaped its journey.

We need to really see inference in action.

An Experiment To Reveal This Motion

To test whether LLMs exhibit this internal motion (a path shaped by meaning), I designed a simple but precise experiment.

The core idea was this:

Present the model with prompts that are nearly identical in surface structure, but that differ in concern.

Then we could empirically measure whether the model reacts differently when the meaning shifts, even if the tokens in the prompt barely change.

I used two open-weight LLMs (Gemma 3:1B and LLaMA 3.2:3B) and measured the activation data in three places (a capture sketch follows the list below). First, the Attention Output. Second, the MLP Output. And third, the Residual Stream.

  • The Attention Output reflects where the model is “looking” during inference — which tokens influence others at each layer. It encodes focus, but not necessarily integration.
  • The MLP (Multi-Layer Perceptron) Output contains the layer-wise transformations that operate independently of token relationships. These are where local, non-contextual updates to representation happen.
  • The Residual Stream is where the model accumulates and integrates information across tokens and layers — like a running trace of its internal state. It’s where context, meaning, and recursive-like inference compound.
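
For the Attention and MLP outputs, per-layer hidden states alone are not enough, so forward hooks are needed. The sketch below assumes a LLaMA-style model from `transformers` (reusing `model`, `tokenizer`, and `prompt` from the earlier sketch), where each decoder block exposes `.self_attn` and `.mlp` submodules; module paths differ between architectures, so treat this as illustrative rather than the experiment’s exact instrumentation.

```python
import torch

captured = {"attn": [], "mlp": [], "residual": []}

def first_tensor(output):
    # Some submodules return tuples (e.g. attention also returns weights).
    return output[0] if isinstance(output, tuple) else output

def make_hook(site):
    def hook(module, inputs, output):
        captured[site].append(first_tensor(output).detach().cpu())
    return hook

handles = []
for layer in model.model.layers:   # one set of hooks per decoder block
    handles.append(layer.self_attn.register_forward_hook(make_hook("attn")))
    handles.append(layer.mlp.register_forward_hook(make_hook("mlp")))
    handles.append(layer.register_forward_hook(make_hook("residual")))

with torch.no_grad():
    model(**tokenizer(prompt, return_tensors="pt"))

for h in handles:                  # always remove hooks after capture
    h.remove()

# captured["residual"][i] holds layer i's post-block hidden states (the Residual
# Stream trace); captured["attn"] and captured["mlp"] hold the other two sites.
```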

I used a structured set of prompts to analyse how the models’ internal representations respond to controlled semantic shifts. The prompts covered five domains:

  • Emotional — shifts in affective tone or sentiment
  • Moral — ethical dilemmas or reversals of normative framing
  • Identity — changes in self-description or social role framing
  • Logical — structured reasoning with modified conditions or implications
  • Nonsense — syntactically valid but semantically incoherent inputs

Each base prompt had a matched neutral control and four concern-shifted variants (positive/negative × moderate/strong). This setup allowed us to isolate how concern (rather than just wording) alters the geometry of the model’s Residual Stream.
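
A hypothetical encoding of one prompt family under this design might look like the sketch below. Only the control and the Negative/Moderate variant are taken from the article; the field names and the remaining cells of the 2×2 grid are placeholders of my own.

```python
# Illustrative structure only; field names and the empty slots are assumptions.
prompt_family = {
    "domain": "emotional",
    "id": "emotional_instructional_02",
    "control": "Before presenting your findings, practice your delivery repeatedly",
    "variants": {
        ("negative", "moderate"): "Before presenting your findings, practice your delivery nervously",
        ("negative", "strong"): None,    # the remaining cells of the positive/negative x
        ("positive", "moderate"): None,  # moderate/strong design would be filled in with
        ("positive", "strong"): None,    # the matching concern-shifted wordings
    },
}
```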

This figure from “Exploring the Residual Stream of Transformers” shows where the Residual Stream fits within a transformer’s architecture. Each layer includes components like attention and feedforward networks, but it’s the Residual Stream that integrates their outputs — acting as a running trace of the model’s internal state. In our experiment, we measured activations at three points: the Attention Output, the MLP Output, and the Residual Stream. It wasn’t preselected — but what emerged was striking: only the Residual Stream showed consistent, structured curvature in response to changes in meaning. This makes it the clearest lens for observing how models integrate semantic concern over time.

The captured data formed high-dimensional trajectories, which were processed using standard dimensionality reduction techniques (UMAP and t-SNE). What emerged was both startling and measurable. The image below shows a 3D UMAP projection of the model’s Residual Stream as it processes two versions of the same prompt — the green-ish one stable/neutral, the red-ish one with a semantic shift.

A 3D UMAP Plot comparing token trajectories from a Strongly Negative Emotional Analytical prompt against the token trajectories from a Neutral prompt. Each token trajectory represents that token’s progress through the individual layers of the LLM.

This visualisation shows the trajectory of each token’s representation as it moves through the layers of the model. The axes represent three dimensions of a UMAP-projected space, reduced from the high-dimensional Residual Stream. Tokens from the control prompt are shown in cooler colours, generally forming smooth and compact trajectories. The red path represents the token where the meaning of the prompt was altered — the “concern shift”. What stands out is that this red trajectory could be seen as forming a spine that pulls subsequent token paths into a new arc, bending the overall geometry of inference.

In contrast, when we plot the same tokens using activation data from the Attention Output or MLP Output, we don’t see this kind of coherent structure. Their trajectories appear scattered or flat, lacking consistent patterns.

This particular figure comes from an emotionally strong concern-shift prompt (e.g. “Emotional Analytical 03 — Negative/Strong”). This shows even more transformation than the “moderate” trajectories in the “4 plot” image at the top of this article. The sharp bend and unfolding curl are not random — they reflect how the model internally integrates this change in meaning. That bending is what we call curvature:

A structured, internal reorientation of thought.

And it was this Residual Stream curvature (not present elsewhere) that showed up reliably across different domains and across both models tested.

To validate what we saw visually, we applied three quantitative metrics to the trajectory data. These metrics were designed to capture both the degree and the timing of divergence between concern-shifted and control prompts:

  • Cosine similarity: This measures how much the concern-shift and control prompts “point” in the same direction.
  • Direction deviation: This captures how sharply they diverge at each layer.
  • Layer-wise deviation: This shows when (not just how much) that divergence happens.

Together, these metrics confirmed the visual findings. Concern-shift prompts didn’t just shift slightly — they bent, split, and unfurled in structured ways that reflected the nature of the semantic difference. Control prompts, by contrast, tended to follow smoother, more linear paths, showing less internal reconfiguration. This gave us solid empirical footing: the curvature we observed wasn’t just an interpretive flourish. It was a measurable, consistent signal of how meaning gets integrated inside the model.
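
As a rough sketch of how these three comparisons might be computed (not the paper’s exact formulas), assume `control` and `shifted` are matched per-layer state arrays of shape `[layers, dim]` for the aligned token:

```python
import numpy as np

def layerwise_metrics(control: np.ndarray, shifted: np.ndarray):
    eps = 1e-12
    # Cosine similarity: do the two states point the same way at each layer?
    cos_sim = np.sum(control * shifted, axis=1) / (
        np.linalg.norm(control, axis=1) * np.linalg.norm(shifted, axis=1) + eps
    )
    # Direction deviation: how sharply the two paths' step directions diverge
    # (one value per layer transition, so one fewer entry than the others).
    c_step, s_step = np.diff(control, axis=0), np.diff(shifted, axis=0)
    step_cos = np.sum(c_step * s_step, axis=1) / (
        np.linalg.norm(c_step, axis=1) * np.linalg.norm(s_step, axis=1) + eps
    )
    direction_dev = np.arccos(np.clip(step_cos, -1.0, 1.0))
    # Layer-wise deviation: when (at which layers) the separation grows.
    layer_dev = np.linalg.norm(control - shifted, axis=1)
    return cos_sim, direction_dev, layer_dev
```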

The model wasn’t just reacting to surface features. It was actively bending under the weight of meaning.

These trajectories weren’t just visual artefacts — they traced how concern reshaped the integration of meaning. It wasn’t linear. It was geometric. It had motion.

What This Means for Interpretability

This Curved Inference experiment may provide a first reproducible geometric signal of staged semantic integration in language models. And this has important implications:

You can’t align a model if you don’t know how it bends.

This curvature isn’t just a visual curiosity. It could provide a new lens for Interpretability.

By showing that models bend their internal representational state in response to meaning, we can begin to understand how inference is actually structured. This offers a path beyond attribution:

Not just asking what influenced an output, but tracing how thought flows inside the model.

It could also allow us to distinguish between meaningful reasoning and surface mimicry. If a prompt causes no deformation, it may suggest that the model is treating it as rote pattern-matching. But if the internal trajectory bends (and bends differently depending on the semantic shift), then we’re seeing evidence of recursive-like integration. Evidence of something like understanding.

This also opens up new possibilities for evaluation. We can begin to ask:

  • Does a model deform coherently when challenged with contradiction?
  • Do its curves reflect the structure of the problem?
  • Can we detect when the model is simulating reasoning versus actually reconfiguring its internal state?

And critically:

If a model can curve, can it remember how it curved?

How does this curvature shape its inference over time?

This Curved Inference approach opens the door to a new kind of Interpretability — one that doesn’t just identify where a model is looking, but shows how its internal state evolves as it thinks. Instead of isolated attributions or post-hoc rationales, we get a continuous, live-state view of inference in motion — possibly even real-time visualisation of the model’s geometry. Traditional Interpretability methods rarely offer this kind of visibility; this geometric perspective provides a rich temporal unfolding of inference.

It also offers an Interpretability method that scales. As AI systems grow in complexity, stepwise attribution and circuit tracing become fragile and unwieldy. Curvature, by contrast, reflects the global structure of the model’s reasoning. It scales naturally because it reveals transformation, not just correlation.

What’s most exciting is that this approach doesn’t require building a new theory from scratch. It draws from physics, information theory, and dynamical systems — disciplines already equipped to reason about trajectories, flows, and curvature. If LLMs are already operating within manifold-like structures, we can borrow from these sciences to see how inference actually unfolds.

In this light, the Residual Stream becomes not just an implementation detail, but a canvas — a space where semantic pressure leaves geometric traces. Tools from geometry and physics can help us read those traces, revealing not just what the model knows, but how it moves through what it knows.

This might be a turning point. Interpretability, not as dissection, but as motion. Not as heatmaps, but as geometry in time. It’s a different paradigm. And it might just let us glimpse how machines think.

Open Questions and Methodological Caveats

This is early-stage work, and while the results are promising, they raise important questions.

Dimensionality reduction methods like UMAP and t-SNE can introduce artefacts or distortions, even when care is taken. Likewise, the concept of “concern” is powerful, but still informally defined, and its formal mapping to model behaviour remains a subject for further research. Finally, while internal curvature signals structure, it should not be mistaken for capability or comprehension — it’s a diagnostic, not a proof of understanding.

A fuller discussion of these limitations, along with technical considerations and open questions, is included in the lab report and accompanying paper.

If you’re working on Alignment, Introspection, or Interpretability — this method is open, documented, and ready for exploration. I welcome challenge, replication, and new perspectives.

References

1 — Yu, Z., et al. (2023)

2 — Google (2025)

3 — Meta (2024)

4 — Manson, R. (2025)

5 — Sengretsi, T. (2024)

6 — Ding, S., & Koehn, P. (2021)

7 — Vaswani, A., et al. (2017)

8 — Azarkhalili, B., & Libbrecht, M. (2025)

9 — Cunningham, H., et al. (2023)

10 — Madsen, A., et al. (2024)

11 — Singh, C., et al. (2024)

12 — Yeo, W., et al. (2024)

13 — Huang, S., et al. (2023)

14 — Golgoon, A., et al. (2024)

15 — Amodei, D. (2025)

16 — Manson, R. (2025)

Published in The Quantastic Journal

At Quantastic, we love to explore science, tech, and math vis-à-vis humanity. Our mission is to bring scientific knowledge, exploration, and debate through compelling stories to interested readers. Each story seeks to educate, inspire curiosity, and motivate critical thinking.

Written by Rob Manson

Rob is defining a Geometric approach to Cognition & Consciousness - see his research here and blog here