Neural Networks, Strange Attractors, and Orderliness in Chaos

first published in August, 2025

In language models, individual neuron activations exhibit apparent randomness, while linear probing often finds that these representations consistently encode information about specific concepts [DLK]. It's difficult to completely interpret individual circuits or small groups of activations, but at an appropriate scale of analysis, grasping coherent computational structures becomes tractable [ActAdd, LRH]. Multiple training runs initialized with different random seeds yield models with vastly different microscopic activation patterns, yet they converge toward similar downstream capabilities and potentially similar representations [PRH], though the extent of this convergence remains an active area of investigation. What's going on?

Strange attractors demonstrate how deterministic systems can produce behavior that seems random locally (when you look closely) while maintaining stable statistical properties globally (when you look from afar). (Why are strange attractors deterministic if they are not predictable? They're deterministic because the rules map each state to a unique next state, and there is no randomness in the dynamics. They're unpredictable long-term because tiny uncertainty in the initial condition grows exponentially. You can predict short-term trajectories and long-run statistics, but not the exact path far ahead.) They exist in phase space, the set of all possible states of the system under a given parameterization, where trajectories exhibit extreme sensitivity to initial conditions yet remain confined to well-defined geometric structures. We tend to think a system is either deterministic or stochastic [Laplace's Demon], but this dichotomy fails to capture the full spectrum of possible dynamics. Iterating deterministic nonlinear systems, like neural networks or a simple system of equations, can produce behavior that is unpredictable yet structured.

Lorenz Attractor with ML precision: The first trajectory starts at \((x, y, z) = (1, 1, 1)\). Additional trajectories begin with \(x\) offset by floating-point precision errors typical in ML systems. Use the selector to choose different precisions: bfloat16 (\(\pm 7.8 \times 10^{-3}\)), float16 (\(\pm 9.8 \times 10^{-4}\)), or float32 (\(\pm 1.2 \times 10^{-7}\)). To observe divergence: Select a precision, click "Clear", then "Add 3 Trajectories". Watch how ML-scale numerical differences lead to completely different paths. The left panel shows chaotic time series, while the right reveals all trajectories remain on the same butterfly attractor. Tiny initialization differences in neural networks can lead to subtly different training dynamics while converging to similar results.
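The offsets quoted above are simply the machine epsilons of the three formats. A minimal sketch in Python (bfloat16 is not a native NumPy dtype, so its epsilon is computed directly from its 7 explicit mantissa bits):

```python
import numpy as np

# Machine epsilon: the gap between 1.0 and the next representable number.
eps_float32 = float(np.finfo(np.float32).eps)   # 2**-23, about 1.2e-7
eps_float16 = float(np.finfo(np.float16).eps)   # 2**-10, about 9.8e-4
eps_bfloat16 = 2.0 ** -7                        # 7 mantissa bits, about 7.8e-3

print(eps_float32, eps_float16, eps_bfloat16)
```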

Strange attractors are minimal examples of order from chaos

In 1963, Edward Lorenz attempted to simplify a weather model and developed a 3-dimensional differential equation that exhibits chaotic behavior. These equations are deterministic. Given initial values for \(x\), \(y\), and \(z\), along with parameters \(\sigma\), \(\rho\), and \(\beta\), the entire future evolution is mathematically determined. No randomness appears anywhere. Note that nonlinearity enters through \(xz\) in the second equation and \(xy\) in the third. (Strange attractors are impossible without nonlinearity. Linear systems can only do three things: settle to a fixed point, repeat in regular cycles, or grow/shrink forever. That is, linear systems cannot generate chaos in finite dimensions. Linear transformations preserve the relative distances between points. They can scale, rotate, or shear space, but they cannot create the stretching and folding mechanism essential for chaos.)

\begin{align} \dot{x} &= \sigma(y - x) \\ \dot{y} &= x(\rho - z) - y \\ \dot{z} &= xy - \beta z \end{align}

In Lorenz's simplified weather model, these variables represent physical quantities: \(x\) captures the rate of convective flow, \(y\) the horizontal temperature variation, and \(z\) the vertical temperature variation. Something unexpected happens when \(\sigma = 10\), \(\rho = 28\), and \(\beta = 8/3\). The system never reaches equilibrium, never settles into a periodic cycle, yet never escapes to infinity. Each variable oscillates irregularly. Looking at \(x(t)\), \(y(t)\), or \(z(t)\) individually reveals no pattern, no predictability beyond a few time steps. The signals appear random, jumping between positive and negative values with no discernible rhythm.
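This behavior is easy to reproduce numerically. A minimal sketch using explicit Euler integration (the step size and horizon here are arbitrary choices, not taken from the demo): two trajectories that start a float32-epsilon apart end up macroscopically separated, yet both stay bounded on the attractor.

```python
import numpy as np

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # One explicit Euler step of the Lorenz system.
    x, y, z = state
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return state + dt * np.array([dx, dy, dz])

# Two trajectories differing by roughly float32 machine epsilon in x.
a = np.array([1.0, 1.0, 1.0])
b = np.array([1.0 + 1.2e-7, 1.0, 1.0])
for _ in range(5000):  # 50 time units
    a = lorenz_step(a)
    b = lorenz_step(b)

# Macroscopic separation, yet both states remain on the bounded attractor.
print(np.linalg.norm(a - b), np.linalg.norm(a), np.linalg.norm(b))
```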

But when plotted in three-dimensional phase space, with \(x\), \(y\), and \(z\) as coordinates, a structure emerges. The trajectory traces out a distinctive butterfly-shaped object, looping around two regions of attraction without ever repeating its path. This is the Lorenz attractor. Despite the apparent randomness in each dimension, the system is confined to this specific geometric structure. The trajectory may be unpredictable, but it will always remain on the attractor.

The interactive demonstrations (Lorenz Attractor with ML Precision above, Thomas Attractor with ML Precision below) illustrate this duality. Start with "Clear" and "Add 3 Trajectories" to launch particles whose initial x-coordinates differ by amounts comparable to machine learning floating-point precision; you can switch between float32-, float16-, and bfloat16-analogous errors. In the left panel, the time series eventually diverge into chaos (roughly speaking, a system is called chaotic if its trajectory is non-periodic and exhibits sensitive dependence on initial conditions), even though all we introduced was a small offset analogous to precision error (float32 takes some time to start diverging). Yet in the right panel, the diverging trajectories remain confined to similar structures once you let them evolve for long enough.

Thomas Attractor with ML precision: Using trigonometric rather than polynomial nonlinearity, the Thomas system creates an interesting knotted structure. With damping parameter \(b = 0.208186\), trajectories weave through phase space forming a tangled, torus-like attractor. The same precision controls apply as above. Notice how the time series (left) show smoother oscillations due to the sine functions, yet still exhibit sensitive dependence on initial conditions.

The Thomas attractor, discovered by René Thomas, provides another minimal example of chaos arising from simple deterministic processes. The system uses only trigonometric nonlinearity:

\begin{align} \dot{x} &= \sin(y) - bx \\ \dot{y} &= \sin(z) - by \\ \dot{z} &= \sin(x) - bz \end{align}

In the demo, we use \(b = 0.208186\), which produces a particularly beautiful attractor. The sine functions implement stretching and folding through periodic nonlinearity rather than Lorenz's polynomial terms. Where Lorenz trajectories loop between two distinct regions, Thomas trajectories weave through phase space in a continuous knot, never quite repeating yet always remaining within the same tangled structure. The emergence of qualitatively similar chaotic dynamics from such different equations suggests that chaos arises generically in iterative nonlinear systems with appropriate parameter values and sufficient dimensionality.
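A rough way to see the confinement is that \(|\sin(\cdot)| \le 1\), so each coordinate is eventually trapped in \([-1/b, 1/b] \approx [-4.8, 4.8]\). A minimal sketch with explicit Euler integration (step size and initial condition are arbitrary choices):

```python
import numpy as np

def thomas_step(state, dt=0.05, b=0.208186):
    # One Euler step of the Thomas system; sin() provides the folding.
    x, y, z = state
    return state + dt * np.array([np.sin(y) - b * x,
                                  np.sin(z) - b * y,
                                  np.sin(x) - b * z])

state = np.array([0.1, 0.0, 0.0])
max_coord = 0.0
for _ in range(20000):  # 1000 time units
    state = thomas_step(state)
    max_coord = max(max_coord, float(np.abs(state).max()))

# Chaotic wandering, yet confined: coordinates stay below 1/b ~ 4.8.
print(max_coord)
```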

What creates orderliness in chaotic systems?

The paradox of strange attractors lies in their dual nature. Trajectories diverge exponentially from their neighbors yet remain forever confined to specific geometric structures. This orderliness emerges from a mechanism found in many chaotic systems that exhibit long-term bounded behavior. The repeated application of stretching and folding operations creates both the unpredictability and the structure.

Stretching and folding is the geometric mechanism that generates chaos in bounded systems. The stretching operation pulls nearby trajectories apart exponentially. This creates sensitive dependence on initial conditions. The folding operation prevents escape by bending the stretched space back onto itself. Without folding, trajectories would diverge to infinity. Without stretching, the system would settle into regular periodic behavior. Together, they create bounded yet unpredictable dynamics.
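A one-dimensional analogue makes the mechanism concrete. In the logistic map \(x \mapsto 4x(1-x)\) (a sketch not drawn from the systems above), the slope of up to 4 stretches nearby points apart, while the parabola folds the unit interval back onto itself:

```python
def logistic(x):
    # Stretch: slope up to 4 separates nearby points exponentially.
    # Fold: the parabola maps [0, 1] back onto [0, 1].
    return 4.0 * x * (1.0 - x)

a, b = 0.2, 0.2 + 1e-9   # two nearby initial conditions
gaps = []
for _ in range(60):
    a, b = logistic(a), logistic(b)
    gaps.append(abs(a - b))

# The gap grows exponentially (stretching), yet both orbits
# remain inside the unit interval forever (folding).
print(gaps[0], max(gaps))
```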

The Rössler system demonstrates this mechanism:

\begin{align} \dot{x} &= -y - z \\ \dot{y} &= x + ay \\ \dot{z} &= b + z(x - c) \end{align}

With parameters \(a = 0.2\), \(b = 0.2\), and \(c = 5.7\), trajectories spiral outward in the x-y plane (stretching). When \(x\) exceeds \(c\), the nonlinear \(z(x-c)\) term becomes positive, causing \(z\) to grow. This increased \(z\) then feeds back through the \(\dot{x} = -y - z\) equation to fold the trajectory back down. This creates a single-lobe structure where the continuous stretching and folding are visually apparent.
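This division of labor shows up numerically: \(z\) stays near zero most of the time while the trajectory spirals outward, spiking only during the brief reinjection events when \(x\) exceeds \(c\). A sketch with explicit Euler integration (step size, transient cutoff, and the threshold of 1.0 are arbitrary choices):

```python
import numpy as np

def rossler_step(state, dt=0.01, a=0.2, b=0.2, c=5.7):
    # One explicit Euler step of the Rossler system.
    x, y, z = state
    return state + dt * np.array([-y - z, x + a * y, b + z * (x - c)])

state = np.array([1.0, 1.0, 0.0])
zs = []
for i in range(60000):
    state = rossler_step(state)
    if i >= 20000:          # discard the initial transient
        zs.append(state[2])
zs = np.array(zs)

# Stretching dominates most of the time (z small); folding is brief
# but large (z spikes upward, then the trajectory is reinjected).
print(float(np.mean(zs < 1.0)), float(zs.max()))
```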

Rössler Attractor with stretching and folding shown in colors: Red highlights show stretching phases where the linear terms \(-y\) and \(x + ay\) create outward spiraling motion, exponentially separating nearby trajectories. Blue highlights indicate folding phases where the nonlinear \(z(x-c)\) term activates, ejecting trajectories upward and reinjecting them to prevent escape to infinity.

Therefore, nonlinearity serves a dual mathematical role. It enables sensitive dependence on initial conditions by creating regions where trajectories diverge exponentially. Yet the same nonlinear terms that amplify small differences also bound the dynamics when parameters are "tamed" properly. This nonlinearity grants systems far greater expressiveness than linear dynamics, which can only produce fixed points, regular oscillations, or exponential growth. The price is analytical tractability. We cannot solve these equations in closed form or predict long-term behavior. But we gain the ability to generate infinitely complex, never-repeating patterns from simple rules.

Neural networks create their own stretching and folding

A neural network is fundamentally a large deterministic function. It transforms input through compositions of layers, each applying an affine transformation \(Wx + b\) followed by a nonlinearity (the activation function). Layer by layer, these simple operations compound into remarkably complicated behaviors. The affine transformation \(Wx + b\) stretches and rotates the input space: the weight matrix \(W\) amplifies differences along certain directions while compressing others. This stretching creates the capacity for complex representations. The nonlinearity then helps keep activations and gradients well-behaved while enabling the network to learn finer decision boundaries. ReLU clips negative values to zero; sigmoid and tanh compress extreme values. Through many layers, these operations work together to create rich representations from simple building blocks. (Early deep networks with sigmoid and tanh struggled with vanishing gradients: the nonlinearities that enabled expressiveness also killed gradient flow. Modern techniques like ReLU activations, Xavier initialization, batch normalization, and residual connections helped solve this. Together, these techniques "tamed" nonlinearity.)
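As a loose sketch (not a claim about any particular trained network), both halves are visible in a single random layer: the weight matrix has singular values above and below one, so it stretches some directions while compressing others, and ReLU "folds" by collapsing every negative coordinate to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# One layer: an affine stretch/rotate followed by a ReLU fold.
W = rng.normal(scale=0.5, size=(16, 16))
b = rng.normal(scale=0.1, size=16)

# Stretching: singular values measure how much each direction
# is amplified or compressed by the affine map.
svals = np.linalg.svd(W, compute_uv=False)
print(float(svals.max()), float(svals.min()))

# Folding: ReLU collapses the negative half-space of every coordinate.
x = rng.normal(size=16)
h = np.maximum(0.0, W @ x + b)
print(int((h == 0).sum()), int((h > 0).sum()))
```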

Through these transformations, information becomes distributed across populations of neurons. No single neuron knows where New York is. Instead, the concept emerges from patterns of activation across many units. This mirrors a famous puzzle in neuroscience where Karl Lashley spent decades systematically destroying parts of rat brains trying to find where specific memories were stored, only to discover they seemed to be nowhere and everywhere at once [Lashley, In search of the engram]. Each layer's transformations spread and recombine information, creating increasingly abstract representations.

The repeated application of these simple operations produces something unexpected. Early layers might detect edges. Middle layers combine these into shapes and textures. Deeper layers assemble these into objects and concepts. With enough layers and parameters, these networks learn to recognize speech, generate text, and understand images. The representations emerge from the dynamics of training. We don't design them, they arise from the interplay of data, scale, and optimization.

This brings us back to our opening puzzle. Different training runs produce different microscopic patterns because small differences in initialization lead to different trajectories through weight space. Yet they converge to similar capabilities because the task and data create consistent pressures. The apparent randomness of individual neurons coexists with stable, interpretable representations at the population level. Local chaos enables expressiveness and global order ensures usefulness. Strange attractors offer a lens for understanding this duality. They show us that deterministic systems can be locally unpredictable yet globally structured. Simple rules iterated many times can produce infinite complexity within bounded regions. Chaos and order are not opposites but complementary aspects of nonlinear systems. Neural networks embody similar principles, maybe not through literal dynamical iteration, but through the emergent properties of large-scale nonlinearity.

Discussion: On finding that right level of abstraction in mech interp

The connection between neural networks and chaotic systems may be most useful as a lens for thinking about levels of analysis. In chaotic attractors, examining individual coordinate trajectories reveals apparent randomness, while the phase space view reveals geometric structure. Similarly, individual neurons or circuits in trained networks often appear uninterpretable (to humans), yet linear probing consistently finds population-level representations of semantic concepts. Perhaps the central challenge in mechanistic interpretability is determining at what scale meaningful computational structures become visible.

Different scientific domains choose their abstractions based on what proves useful rather than what captures all detail. Astronomers classify all elements heavier than helium as "metals," even though oxygen, carbon, and nitrogen are chemically non-metals. In neural networks, the optimization pressure from training might act primarily on macroscopic computational properties rather than microscopic implementation details (useful thought experiment: if you change two neurons in a network, does the behavior change?). There are many ways to arrange individual units to achieve the same population-level computation.

The repeated application of nonlinearities in deep networks does create a form of iterative dynamics, though this probably doesn't produce genuinely chaotic behavior in the technical sense. The stretching from weight matrices and folding from activation functions occur in parameter space rather than through temporal evolution. Still, the observation that neural networks learn hierarchical representations, edges combining into textures, textures into shapes, shapes into objects, suggests that some form of multi-scale structure emerges from these operations.

The implication for interpretability research may be that we need to identify the natural scales at which meaningful computational structures emerge, rather than imposing our preferred level of analysis. Just as strange attractors reveal their structure only when viewed in the appropriate phase space, neural representations might become interpretable only at certain emergent scales that we have yet to systematically characterize.

Suggested Further Resources

Thank you Jeffrey Heninger and Su Hyeong Lee for feedback!