Frontier Research — Anthropic

Long-Running Claude

Anthropic on what changes when an agent runs for hours instead of seconds — and why most workflows haven't caught up to it.

Anthropic Research

01

Why "Long-Running Claude" matters

Most discussion of agents has, until recently, assumed second-scale interactions: a user prompts, the agent responds, the loop closes. Anthropic's "Long-Running Claude" is part of a documented shift to hour- and day-scale autonomy: an agent that takes on a goal, plans toward it, executes across many tools, and reports back later.

That shift breaks most of the assumptions built into existing workflows. Permission models that worked for a single tool call don't work for a hundred. Logging that worked for a single response doesn't work for a multi-hour trace. The human reviewing the work needs to do so without re-doing it. "Long-Running Claude" is the lab itself naming the shift; the operating-model implications are downstream.

02

What the time horizon implies for org design

A second-scale agent fits inside an existing reporting structure — someone owns it, supervises it, escalates it. An hour-scale agent doesn't. The supervisor can't watch it in real time. The accountability surface is fundamentally different.

Most organizations haven't thought about who, exactly, is responsible for an agent that ran for six hours overnight and produced an outcome the human reviewer disagrees with the next morning. Whose decision was it? What does "approval" even mean when the human is reviewing a compressed summary of decisions the agent already executed? These are organizational questions, not technical ones, and they have to be answered before the agent ships, not after the first incident.

03

The trust progression we use

We don't deploy long-running agents at full autonomy on day one. The progression we use with clients is staged: read-only autonomous (the agent does the work but doesn't write), gated-write (the agent proposes, the human approves), bounded-write (the agent writes within a defined scope), then autonomous-with-summary (the agent writes and reports). Each stage builds the org's trust before the next one is unlocked.
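The four stages form a strict ordering, and each one adds a single new permission. As a minimal sketch (the stage names are from the text; the function and its parameters are illustrative, not a real API), the gating logic looks like this:

```python
from enum import IntEnum


class TrustStage(IntEnum):
    """The staged trust progression, in the order the org walks it up."""
    READ_ONLY = 0           # agent does the work but doesn't write
    GATED_WRITE = 1         # agent proposes; a human approves each write
    BOUNDED_WRITE = 2       # agent writes within a defined scope
    AUTONOMOUS_SUMMARY = 3  # agent writes and reports back afterwards


def write_allowed(stage: TrustStage, human_approved: bool, in_scope: bool) -> bool:
    """Return True if a single write action is permitted at this stage.

    Reads are permitted at every stage; only writes are gated.
    """
    if stage == TrustStage.READ_ONLY:
        return False
    if stage == TrustStage.GATED_WRITE:
        return human_approved
    if stage == TrustStage.BOUNDED_WRITE:
        return in_scope
    return True  # AUTONOMOUS_SUMMARY: write now, summarize later
```

The point of encoding it this way is that skipping a stage becomes a visible code change rather than a quiet policy drift.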

Skipping stages is the most common cause of a team rolling back an agent deployment in month two. The deployment didn't fail technically; it failed because the organization wasn't ready to trust it at the autonomy level it had been granted. The technical problem is the easy part. The organizational problem is harder, and it is the one the lab post mostly leaves to the operator to solve.

"An agent running for an hour is not a longer version of an agent running for a second. It's a different system, with a different accountability surface — and most organizations haven't redesigned their accountability to match."

How this maps to the work

Almost every agentic engagement we've taken recently has been pulled toward longer time horizons. Clients want the agent to take on multi-hour or overnight work — research, document drafting, codebase changes — not just respond to prompts.

Our work is to install the operating discipline and the org structure around that longer horizon: defined goal hierarchies, compressed traces for human review, a clear answer to who owns the outcome, and a staged trust progression that the team can actually walk up. The result is a deployment the team trusts to leave running — and an organization that knows what to do when the trust gets tested.

Four engagements we run against this thesis.

None of these require a multi-year transformation. Each is scoped to land specific operating-model improvements with a measurable result.

01

Trust-progression staging

We design the staged path from read-only to autonomous-with-summary, with explicit criteria for when each gate opens. The deployment matures alongside the organization's trust in it, not ahead of it.

02

Accountability mapping for hour-scale runs

We document — before the agent ships — who owns the outcome of a long autonomous run, what "approval" means when the human is reviewing a summary, and what the escalation path is when an overnight run produces something the morning reviewer disagrees with.
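The deliverable is small and concrete. As an illustrative sketch (the field names and example values are hypothetical, not a standard schema), an accountability record for a single long-running agent could be as simple as:

```python
from dataclasses import dataclass


@dataclass
class RunAccountability:
    """Who answers for a long autonomous run, written down before it ships."""
    outcome_owner: str        # the person accountable for the run's result
    approval_scope: str       # what "approval" covers when reviewing a summary
    escalation_path: list     # who to contact, in order, when the reviewer disagrees


# Hypothetical example for one overnight research agent.
record = RunAccountability(
    outcome_owner="team-lead@example.com",
    approval_scope="summary-level sign-off; individual writes pre-bounded",
    escalation_path=["team-lead@example.com", "platform-owner@example.com"],
)
```

The value isn't the data structure; it's that every field has to be filled in by a named human before the first run.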

03

Compressed trace review

We design the human-review surface so the reviewer sees what mattered, not every step. Multi-hour runs require summary-level review; full-trace review doesn't scale, and the deployment falls over the moment the volume grows.
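The simplest version of that review surface is a filter over the trace. As a minimal sketch (the event schema and kind names are illustrative assumptions, not a real trace format), compression can start as keeping only the event kinds a reviewer must see:

```python
def compress_trace(events):
    """Reduce a multi-hour trace to the events a reviewer must see.

    Keeps decisions, writes, escalations, and errors; drops routine
    reads and retries. The event schema here is illustrative.
    """
    keep = {"decision", "write", "escalation", "error"}
    return [e for e in events if e["kind"] in keep]


# Hypothetical four-event trace: only the decision and the write survive.
trace = [
    {"kind": "read", "detail": "fetched ticket"},
    {"kind": "decision", "detail": "chose refactor plan B"},
    {"kind": "read", "detail": "scanned 300 files"},
    {"kind": "write", "detail": "opened pull request"},
]
summary = compress_trace(trace)
```

Real deployments layer summarization on top of this, but even a kind-based filter makes the review load proportional to decisions made rather than steps taken.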

04

Goal-hierarchy and abandon-criteria design

We document which sub-goals the agent can abandon if the top-level goal isn't achievable — and what "abandon" looks like in practice. Without this, long-running agents grind on impossible work and burn budget.
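Abandon criteria are most useful when they are mechanical. As an illustrative sketch (the thresholds and field names are hypothetical, not a prescribed format), each sub-goal can carry explicit limits that the loop checks on every iteration:

```python
def should_abandon(subgoal, spent_usd, elapsed_min, consecutive_failures):
    """True when a sub-goal has exhausted its budget, deadline, or retries.

    Every limit lives on the sub-goal itself, so abandoning is a
    documented decision made before the run, not a judgment call mid-run.
    """
    return (
        spent_usd >= subgoal["budget_usd"]
        or elapsed_min >= subgoal["deadline_min"]
        or consecutive_failures >= subgoal["max_failures"]
    )


# Hypothetical sub-goal: at most $5, two hours, and three straight failures.
subgoal = {"budget_usd": 5.0, "deadline_min": 120, "max_failures": 3}
```

What "abandon" means downstream — escalate, skip, or replan — still has to be documented per sub-goal; the predicate only decides when that path is taken.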

If this maps to what you're carrying — let's talk.

Most engagements start with a 30-minute conversation about the specific operating-model question on your desk this quarter.