Frontier Research — Anthropic

Cross-Architecture Model Diffing with Crosscoders

Anthropic on a method for surfacing — without telling it where to look — what's actually different between two LLMs. The operating-model implication for every team about to swap a model in production.

Anthropic Research, 2026 · 7 min read · 1 primary source

01 Why "Cross-Architecture Model Diffing with Crosscoders" matters operationally

"Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs" is interpretability research. The deeper interest, for the operator, is that it's the first method we've seen that can surface — without being told what to look for — where two LLMs differ in their internal representations.

Today, when a team decides to upgrade a model in production, the question "what actually changed?" is answered with held-out evals, anecdotal spot-checks, and faith. The paper offers a principled answer to that question at the level of the features the model represents, not just the outputs it produces. That is a different category of evidence from anything an operator currently has access to.

02 What "Cross-Architecture Model Diffing with Crosscoders" actually does

At the technical level, crosscoders extend sparse-autoencoder methodology to operate across two models simultaneously, learning a shared feature space and surfacing where the two models' internal representations diverge. The word "unsupervised" in the title is the load-bearing one: the operator doesn't have to specify which capabilities or behaviors to compare. The technique surfaces them.
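To make the "shared feature space" idea concrete, here is a minimal sketch of a toy two-model crosscoder in NumPy. It is an illustration under stated assumptions, not the paper's implementation: the parameter names, the ReLU encoder, the decoder-norm-weighted sparsity penalty, and the relative-decoder-norm score are all choices made for this example, borrowed from common sparse-autoencoder practice. Training is omitted; the sketch only shows the forward pass, the loss, and one way a learned dictionary could be read for model-specific features.

```python
import numpy as np

def init_params(d_a, d_b, n_feats, seed=0):
    """Random parameters for a toy crosscoder (all names are hypothetical)."""
    rng = np.random.default_rng(seed)
    scale = 0.1
    return {
        "enc_a": rng.normal(0, scale, (d_a, n_feats)),  # model A -> shared features
        "enc_b": rng.normal(0, scale, (d_b, n_feats)),  # model B -> shared features
        "b_enc": np.zeros(n_feats),
        "dec_a": rng.normal(0, scale, (n_feats, d_a)),  # shared features -> model A
        "dec_b": rng.normal(0, scale, (n_feats, d_b)),  # shared features -> model B
        "b_dec_a": np.zeros(d_a),
        "b_dec_b": np.zeros(d_b),
    }

def crosscoder_forward(act_a, act_b, params):
    """Encode both models' activations into ONE feature vector, then decode each."""
    pre = act_a @ params["enc_a"] + act_b @ params["enc_b"] + params["b_enc"]
    feats = np.maximum(pre, 0.0)  # ReLU keeps feature activations non-negative
    recon_a = feats @ params["dec_a"] + params["b_dec_a"]
    recon_b = feats @ params["dec_b"] + params["b_dec_b"]
    return feats, recon_a, recon_b

def crosscoder_loss(act_a, act_b, params, l1_coeff=1e-3):
    """Reconstruction error on both models plus a sparsity penalty (assumed form)."""
    feats, recon_a, recon_b = crosscoder_forward(act_a, act_b, params)
    mse = np.mean(
        np.sum((recon_a - act_a) ** 2, axis=1)
        + np.sum((recon_b - act_b) ** 2, axis=1)
    )
    # Weight each feature's activation by the summed norms of its two decoder rows.
    dec_norms = (
        np.linalg.norm(params["dec_a"], axis=1)
        + np.linalg.norm(params["dec_b"], axis=1)
    )
    sparsity = np.mean(feats @ dec_norms)
    return mse + l1_coeff * sparsity

def relative_decoder_norm(params, eps=1e-8):
    """Per-feature score in [0, 1]: near 0 = mostly model A, near 1 = mostly model B."""
    norm_a = np.linalg.norm(params["dec_a"], axis=1)
    norm_b = np.linalg.norm(params["dec_b"], axis=1)
    return norm_b / (norm_a + norm_b + eps)

# Toy usage: the two models have different activation widths (16 vs 24).
params = init_params(d_a=16, d_b=24, n_feats=64)
rng = np.random.default_rng(1)
act_a = rng.normal(size=(8, 16))
act_b = rng.normal(size=(8, 24))
loss = crosscoder_loss(act_a, act_b, params)
rel = relative_decoder_norm(params)
```

In a trained dictionary of this shape, features whose score sits near 0.5 would be ones both models represent, while features near 0 or 1 would be candidates for "something only one model has." That reading is an assumption of this sketch, and random weights like the ones above carry no such meaning.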

That matters because the failure modes of a model swap are almost always the ones nobody thought to check. Held-out eval suites are designed against the failure modes the team has already imagined. "Cross-Architecture Model Diffing with Crosscoders" is the first method we've read about that addresses the unimagined ones.

03 Why most production model migrations are run blind

Take the typical model-upgrade decision in a real company. Engineering runs a regression suite, product runs a vibe check on a few prompts, leadership signs off. Six weeks later something subtle is off — tone has shifted, edge-case handling has changed, refusal patterns are different — and nobody can tell whether the model changed or the workflow drifted around it.

That conversation has played out in nearly every production AI deployment we've watched in the past eighteen months. The team doesn't have an instrument that can answer the question. "Cross-Architecture Model Diffing with Crosscoders" is the first paper we've read that points to where that instrument will eventually come from.

"Most production model migrations are run on the assumption that the new model is the old model plus quality. The interpretability research is starting to give operators a way to actually check that assumption."

How this maps to the work

We don't expect clients to be running crosscoders inside their own ML org in 2026 or 2027. The technique is still maturing and the tooling isn't packaged for operators yet. The discipline the paper implies, though — treating a model swap as a structured before/after comparison rather than a point-in-time eval pass — is portable today, and we install it.

In practice that looks like documented behavior baselines before any model migration, an explicit catalog of the qualitative dimensions the team is monitoring for change, and a structured follow-up review at four to six weeks in production rather than just at deploy. None of it requires crosscoders. All of it is the same posture toward model change that "Cross-Architecture Model Diffing with Crosscoders" is implicitly arguing for.
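A structured before/after comparison of this kind can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the `Observation` record, the dimension names, and the labels are hypothetical, and it assumes a reviewer or rubric has already assigned a label to each prompt on each dimension for both the old and the new model.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    prompt_id: str
    dimension: str  # hypothetical catalog entry, e.g. "tone" or "refusal"
    label: str      # reviewer- or rubric-assigned category for this output

def _index(observations):
    """Map (prompt_id, dimension) to its label for quick lookup."""
    return {(o.prompt_id, o.dimension): o.label for o in observations}

def diff_baselines(before, after):
    """List (prompt_id, dimension, old_label, new_label) for every changed pair."""
    before_map = _index(before)
    changes = []
    for obs in after:
        old = before_map.get((obs.prompt_id, obs.dimension))
        if old is not None and old != obs.label:
            changes.append((obs.prompt_id, obs.dimension, old, obs.label))
    return sorted(changes)

def change_rate_by_dimension(before, after):
    """Fraction of compared prompts whose label changed, per dimension."""
    before_map = _index(before)
    compared, changed = Counter(), Counter()
    for obs in after:
        old = before_map.get((obs.prompt_id, obs.dimension))
        if old is None:
            continue  # no baseline for this pair, so nothing to compare
        compared[obs.dimension] += 1
        if old != obs.label:
            changed[obs.dimension] += 1
    return {dim: changed[dim] / compared[dim] for dim in compared}

# Baseline captured before the migration (illustrative data only).
before = [
    Observation("p1", "tone", "neutral"),
    Observation("p1", "refusal", "answers"),
    Observation("p2", "tone", "neutral"),
    Observation("p2", "refusal", "refuses"),
]
# Same prompts and dimensions, labeled again on the new model.
after = [
    Observation("p1", "tone", "neutral"),
    Observation("p1", "refusal", "refuses"),
    Observation("p2", "tone", "formal"),
    Observation("p2", "refusal", "refuses"),
]

changes = diff_baselines(before, after)
rates = change_rate_by_dimension(before, after)
```

Running the same comparison again on fresh samples at the four-to-six-week review, against the same pre-migration baseline, is what distinguishes a change in the model from drift in the workflow around it.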
