In May 2026 a 102-page survey came out — authors from UIUC, Meta, and Stanford pulled into one frame everything that’s happening right now in agent systems. They call it code as agent harness: code stops being only what an LLM generates and starts being what an LLM operates through. An operational environment, not a product.

This is a review. I’ll try to give weight to the central idea, walk through all four chapters, leave references to specific systems (Claude Code, Voyager, SWE-agent, Cursor, and dozens of others mentioned in the paper), and end with what you can take into work today.

The paper is not about which model is better. It’s about what’s around the model.

What harness means and why the authors shift focus

In an engineering context the word harness covers everything surrounding a running engine: mounts, fuel feed, wiring, sensors, the casing. The engine alone doesn’t make a car. The authors give a precise definition:

An agent harness refers to the software layer that surrounds an LLM with tools, APIs, sandboxes, memory, validators, permission boundaries, execution loops, and feedback channels, thereby turning a stateless model into a functional agent capable of long-running task execution.

Stateless model → functional agent. Between them — a software layer. The bottleneck of autonomy, the authors argue, has long stopped being the base model’s reasoning ability and become the reliability of the system that connects the model’s outputs to long-horizon actions and persistent state.

The main conceptual move of the paper is to distinguish three coupled elements of any long-running agentic system:

  1. Model-internal capabilities — what’s inside the weights: reasoning, perception, planning, simulation, evaluation. This is what improves when a new model version ships.

  2. System-provided harness infrastructure — the predefined tools, APIs, sandboxes, memory systems, validators, permission boundaries, telemetry, and workflows. What connects model outputs to external actions and feedback. The main focus of harness engineering.

  3. Agent-initiated code artifacts — interactive code objects that agents themselves create, execute, observe, revise, persist, and share within the task execution loop. Regression tests, temporary tools, DSL programs, executable workflows, reusable skills, intermediate program states.

The third category is the most under-explored, and it’s the survey’s focus. If the first two are “what we gave the agent at the start,” the third is “what the agent builds for itself along the way.” Claude Code, Codex, LangChain — they work because all three layers come together.

Why code at all, instead of text? The authors list three properties that make code the chosen substrate:

Unlike natural language, code is executable, meaning model outputs become operations with formally verifiable outcomes; inspectable, meaning intermediate computation is exposed as structured traces that the harness can read, store, and act upon; and stateful, meaning the evolving program represents task progress in a persistent, modifiable form across steps.

Executable — you can verify what actually happened. Inspectable — you can look at what the model intended to do and pass that back into context. Stateful — history isn’t lost between steps because state is encoded in files, tests, the repository.

The survey divides the literature into three coupled layers: the harness interface (how code connects the model to its environment), harness mechanisms (how the agent stays reliable over long horizons), and scaling the harness (how code becomes a shared substrate for multiple agents at once). The rest of the review walks through these layers.

[Figure: three layers of a long-running agentic system]
Three elements of a long-running agentic system. Model-internal capabilities — the model itself. System-provided harness — the predefined infrastructure. Agent-initiated code artifacts — what the agent builds for itself during the run. The survey focuses on the third category.

§2. Harness Interface: code as the interface between model and world

The first layer answers: what connects the model’s outputs to its task environment? The authors say — code, in three roles.

Code for Reasoning

Pure chain-of-thought is unreliable, especially for symbolic and arithmetic operations. More importantly, textual reasoning doesn’t give the harness any way to verify intermediate states. No sensors. Code fixes both.

Three paradigms. The first is program-delegated reasoning: the model writes a program and an external interpreter executes it. PAL (Program-Aided Language Models) and PoT (Program-of-Thoughts) are the classic examples: the model decomposes the task into an executable program, and arithmetic, search, and tree traversal are handed off to Python. Chain of Code goes further: for semantic code that cannot be executed, it uses an “LMulator”, with the model simulating execution step by step. CodeI/O transforms programs into input/output prediction tasks, exposing reasoning primitives as a training signal.
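Here is a minimal sketch of program-delegated reasoning in the PAL/PoT spirit, assuming the model's output is already available as a string of Python source. The function name `run_delegated_program`, the convention that the program stores its result in `answer`, and the hard-coded "model output" are all illustrative assumptions, not code from the papers. Note that in-process `exec` is not a sandbox; a real harness would run this in isolation.

```python
def run_delegated_program(source: str) -> object:
    """Execute model-written source and return the value it binds to `answer`.

    Assumption: the prompt asked the model to store its result in `answer`.
    exec() is NOT a sandbox; a real harness would isolate this step.
    """
    namespace: dict = {}
    exec(source, namespace)
    if "answer" not in namespace:
        raise ValueError("program did not define `answer`")
    return namespace["answer"]


# Stand-in for model output: the arithmetic is done by Python, not by the LLM.
model_output = """
apples_per_crate = 48
crates = 17
spoiled = 29
answer = apples_per_crate * crates - spoiled
"""

result = run_delegated_program(model_output)
print(result)  # prints 787
```

The point of the pattern is the division of labor: the model only has to get the decomposition right, and the interpreter is responsible for the computation.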

The second paradigm is formal verification and symbolic reasoning: code as machine-verifiable contract. DeepSeek-Prover and DeepSeek-Prover-V2 combine LLMs with proof assistants (Lean, Isabelle, Coq) for mathematical proofs. Lean4Agent makes an interesting extension — applying Lean4 not to math but to modeling and verifying agent workflows themselves. CodeSteer adds explicit switching between symbolic code and neural text.

The third — iterative code-grounded reasoning, closed loops of generate → execute → verify → refine. RLEF formalizes this as policy optimization based on multi-step execution feedback. R1-Code-Interpreter interleaves reasoning and multiple rounds of execution in persistent interactive sessions. EG-CFG injects execution signals directly during generation for step-level correction.
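The generate → execute → verify → refine loop can be sketched in a few lines. This is an illustration of the loop shape only, not RLEF or any other system from the survey: `refine_loop`, `fake_generate`, and `check_add` are hypothetical names, the "model" is a canned list of two attempts, and `exec` here is not sandboxed.

```python
from typing import Callable, Optional, Tuple


def refine_loop(
    generate: Callable[[Optional[str]], str],
    check: Callable[[dict], Optional[str]],
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Return (accepted_source, rounds_used); accepted_source is None on failure."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        source = generate(feedback)        # generate, conditioned on feedback
        namespace: dict = {}
        try:
            exec(source, namespace)        # execute (not sandboxed in this sketch)
            feedback = check(namespace)    # verify: None means all checks passed
        except Exception as exc:
            feedback = f"{type(exc).__name__}: {exc}"
        if feedback is None:
            return source, round_no
    return None, max_rounds


# Hypothetical stand-in for a model: the first attempt is buggy, the second is fixed.
attempts = iter([
    "def add(a, b):\n    return a - b\n",
    "def add(a, b):\n    return a + b\n",
])


def fake_generate(feedback: Optional[str]) -> str:
    return next(attempts)


def check_add(ns: dict) -> Optional[str]:
    got = ns["add"](2, 3)
    return None if got == 5 else f"add(2, 3) returned {got}, expected 5"


accepted, rounds = refine_loop(fake_generate, check_add)
print(rounds)  # prints 2
```

The feedback string is the only thing carried between rounds, which is exactly the channel the execution-feedback methods above try to exploit.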

Code for Acting

The central challenge here is grounding: how to map an abstract linguistic intent into executable behaviors that respect environmental constraints, APIs, and physics. The authors call out a separate system, AutoHarness — it makes the harness view explicit, automatically synthesizing a code harness between LLM and environment that filters invalid actions before execution.

Grounded skill selection — the agent picks from a library of executable capabilities. SayCan ties linguistic planning to embodiment feasibility through affordance scores. KnowNo adds conformal prediction for uncertainty-aware control: it detects ambiguous states before unsafe execution. BOSS synthesizes new executable skill chains through language-guided practice — the action space grows over time.

Programmatic policy generation — code itself is the control interface. CaP (Code as Policies) — Python programs as executable robot policies. RoboCodeX — multimodal tree-structured code generation for manipulation. GenSwarm coordinates policy generation across multiple robots. NormCode adds a semi-formal programming interface with enforced data isolation — auditability as a first-class property.

Lifelong code-based agents — code as persistent memory substrate. The paradigmatic example is Voyager in Minecraft: an automatic curriculum and a continuously growing executable skill library for open-ended interaction. LYRA converts human corrections into reusable executable skills. ViReSkill adds vision-grounded replanning and skill caching on environment failures. UI-Voyager carries the idea to GUI: a self-evolving agent through failure-driven adaptation and self-distillation.
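To make "a continuously growing executable skill library" concrete, here is a toy sketch. It is not Voyager's implementation: `SkillLibrary`, the Minecraft-flavored skill names, and the crafting ratios are invented for illustration, and the keyword search is a placeholder where a real system would likely use semantic retrieval.

```python
from typing import Callable, Dict, List


class SkillLibrary:
    """Stores named callables with descriptions so later tasks can reuse them."""

    def __init__(self) -> None:
        self._skills: Dict[str, Callable] = {}
        self._docs: Dict[str, str] = {}

    def add(self, name: str, fn: Callable, description: str) -> None:
        self._skills[name] = fn
        self._docs[name] = description

    def search(self, keyword: str) -> List[str]:
        """Naive keyword lookup over descriptions (placeholder for retrieval)."""
        kw = keyword.lower()
        return sorted(n for n, d in self._docs.items() if kw in d.lower())

    def run(self, name: str, *args, **kwargs):
        return self._skills[name](*args, **kwargs)


library = SkillLibrary()
# Hypothetical skills with made-up ratios.
library.add("craft_planks", lambda logs: logs * 4, "craft wooden planks from logs")
library.add("craft_sticks", lambda planks: (planks // 2) * 4, "craft sticks from planks")

# A later skill composes earlier ones, so the action space grows over time.
library.add(
    "sticks_from_logs",
    lambda logs: library.run("craft_sticks", library.run("craft_planks", logs)),
    "craft sticks directly from logs",
)

found = library.search("sticks")
sticks = library.run("sticks_from_logs", 2)
print(found, sticks)
```

The memory here is executable: what the agent recalls is not a description of how it once made sticks but a callable that makes them.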

Code for Environment

The third role is to present the environment as an explicit, executable, queryable object rather than an opaque external process accessible only through textual observation. Two benefits: verifiable state transitions (execution, not linguistic judgment) and persistent modifiable state (you can query, simulate, edit).

The approaches: structured world representations (ViStruct, FactoredScenes, Code2World — GUI state prediction as renderable HTML); execution-trace world modeling (SemCoder trains models to reason about statement-level execution effects; WorldCoder agents explicitly write and update Python programs as world models); code-grounded evaluation environments (SWE-bench with unit-test execution as objective oracle, AgentBench, InterCode); verifiable environment construction (SWE-smith, EnvScaler — programmatic synthesis of tool-interactive environments).

The main shift of §2 — code becomes a three-dimensional interface. Not “call tool X,” but “through code make reasoning executable, action programmable, and environment state inspectable.”

§3. Harness Mechanisms: what keeps an agent reliable over long horizons

This is the heart of the paper. If §2 is about the interface, §3 is about the mechanisms that keep an agent in a sane state beyond a single generation step. Planning, memory, tools, the execution loop, and evolution of the harness itself.

[Figure: the code-agent loop, with the model inside the harness]
The code-agent loop. The model inside the harness: planning produces a contract, execution happens in a sandbox, verification gives a deterministic signal that loops back into memory and the next planning step.

Planning: not a capability but a form of control

Planning in the harness perspective isn’t an internal LLM capability — it’s a form of harness control. It structures how the agent externalizes intent into executable steps. Four types:

Linear decomposition planning — the agent produces a single sequence of steps and then follows it. Self-Planning first decomposes into numbered steps, then generates against them. WebAgent does the same for web tasks. A current practice the authors point at directly: files like PLAN.md and Implement.md as filesystem-backed control objects. They can be reviewed by humans, versioned through Git, consumed by subagents. Planning becomes a durable artifact, not an ephemeral reasoning trace. This resonates strongly with how modern coding agent harnesses actually work.

Structure-grounded planning — planning is grounded in explicit structural representations. CodePlan builds a plan graph from edit obligations and dependency analysis. VerilogCoder uses a Task and Circuit Relation Graph to enrich subtasks with signals and transitions. Files like AGENTS.md, architecture notes, API specs become persistent harness objects — the agent’s shared picture of the world.

Search-based planning — allocates compute for systematic exploration of candidates. SWE-Search applies Monte Carlo Tree Search to software engineering repair trajectories. CodeTree unifies strategy, generation, and refinement into a single tree. Especially interesting is Meta-Harness — the system performs search over the harness code itself, giving the agent access to previous sources and traces through the filesystem. This is already self-modification territory.

Orchestration-based planning — implemented through systemic coordination. MapCoder does role orchestration: example recall → planning → generation → debugging. Anthropic splits planning, generation, evaluation into separate roles; Cursor uses planner–worker coordination for many parallel agents. Natural-Language Agent Harnesses with an Intelligent Harness Runtime is a direction where the harness logic itself (roles, stages, contracts, adapters) becomes editable natural language, interpreted by a runtime.

Memory: not a context window, but state management

Memory in code agents isn’t “bigger context window” and isn’t just a vector database. It’s a state-management layer: which information stays in the model’s active context, which is compacted, which is offloaded to external storage. The authors call out six types.

Working memory — state along the current task trajectory. SWE-agent: structured state tracking — files, commands, tests. CodeMem uses budgeted memory slots to stabilize multi-step edits. RepairAgent maintains dynamic prompt-state updates so evidence isn’t lost between autonomous cycles.

Semantic memory — task-relevant external evidence; the codebase as a queryable evidence space. AutoCodeRover does structure-aware retrieval with AST-based chunking. RepoCoder — iterative repo retrieval for context-aware generation. CodeRAG — multi-path retrieval with reranking.

Experiential memory: reusable experience accumulated across tasks, such as repair trajectories, failure cases, and strategy patterns. One work worth singling out here is MemGovern, on governed experience replay: the quality of stored experience matters more than its scale, and unmanaged records introduce semantic noise. A rare sober note in a field where the default is “let’s just save everything.”

Long-term memory — focuses on memory governance: when to write, when to compress, when to retrieve. MemCoder uses structured historical commits as persistent memory.

Multi-agent memory — shared blackboard or collaborative state graph. ChatDev does phase-level context passing, MIRIX adds cross-agent memory routing for specialized roles, G-Memory — shared memory across planning, testing, and review.

Context compaction and state offloading — the boundary between active context and durable task state. A failing test turns into failing test name + key stack frames + suspected files + link to full log. State moves into files, databases, MCP-style servers — the agent gets compact summaries and resource identifiers instead of raw logs.
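The failing-test example above maps directly onto a small data structure. Below is a sketch under stated assumptions: the log is a standard Python traceback, `CompactFailure` and `compact_failure` are hypothetical names, the log path is made up, and treating anything outside `site-packages` as a suspected file is a crude heuristic of mine rather than something the survey prescribes.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class CompactFailure:
    test_name: str
    key_frames: List[str]
    suspected_files: List[str]
    full_log_ref: str   # pointer to the raw log kept outside the context window


def compact_failure(test_name: str, raw_log: str, log_ref: str,
                    max_frames: int = 3) -> CompactFailure:
    """Reduce a raw traceback to the fields the agent needs in active context."""
    # Python traceback frames look like: File "path", line N, in func
    frames = re.findall(r'File "([^"]+)", line (\d+), in (\S+)', raw_log)
    # Keep the innermost frames: they are closest to the failure.
    key = [f"{path}:{line} in {func}" for path, line, func in frames[-max_frames:]]
    # Heuristic assumption: third-party frames are unlikely to be the culprit.
    suspected: List[str] = []
    for path, _, _ in frames:
        if "site-packages" not in path and path not in suspected:
            suspected.append(path)
    return CompactFailure(test_name, key, suspected, log_ref)


raw = '''Traceback (most recent call last):
  File "tests/test_cart.py", line 12, in test_total
    assert cart.total() == 30
  File "shop/cart.py", line 41, in total
    return sum(i.price for i in self.items)
  File "/venv/lib/site-packages/attr/_make.py", line 7, in price
    raise AttributeError("price")
AttributeError: price
'''

summary = compact_failure("test_total", raw, "logs/run-17/test_total.log", max_frames=2)
print(summary)
```

The agent sees four short fields; the full log stays addressable through `full_log_ref` if a later step needs it.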

The principle running through this section: memory isn’t “more,” it’s “cut correctly and put in the right place.” The same principle that separates a competent engineer from someone who dumps everything into one file.

Tool Use: governed interface

Tool use is the governed interface between model intent and external systems. Four categories.

Function-oriented tool use fills gaps in the model’s knowledge — APIs, libraries, documentation. ToolCoder integrates API search tools and trains models to decide when to query. Retrieval-oriented methods reduce reliance on parametric memory and adapt to long-tail APIs.

Environment-interaction tool use — the agent acts inside a software engineering environment. CodeAgent: navigation, editing, validation over real repositories. SWE-agent formalizes an agent-computer interface: shell commands, file editing, test execution as the primary channel. RepairAgent does repair-specific tools: read code, search ingredients, apply patches, run tests.

Verification-driven tool use — tools as deterministic sensors (tests, linters, type checkers, static analyzers). AgentCoder implements a closed loop of programmer + test designer + test executor. VeriGuard — a verifier-guided tool loop: gates and repairs code via verification.

Workflow-orchestration tool use — how multiple tools, roles, and policies organize into a coherent workflow. OpenHands — a unified software-agent workspace with reusable harness components. Lifecycle hooks (pre-use: permission checks, argument validation, block risky commands; post-use: sanitize outputs, compact logs, update memory, trigger verification) — a concrete pattern worth borrowing into any serious agent system.
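The lifecycle-hook pattern is easy to prototype. The sketch below wraps a tool with pre-use and post-use hooks; it is not OpenHands code, and every name in it (`with_hooks`, `block_risky`, `compact_log`, `fake_shell`) is hypothetical. The denylist is deliberately naive and the "shell" executes nothing.

```python
from typing import Callable, List

PreHook = Callable[[str], None]    # may raise to block the call
PostHook = Callable[[str], str]    # transforms the tool output


class BlockedToolCall(Exception):
    """Raised by a pre-use hook to stop a call before it runs."""


def with_hooks(tool: Callable[[str], str],
               pre: List[PreHook], post: List[PostHook]) -> Callable[[str], str]:
    """Wrap a tool so every call passes pre-use checks and post-use cleanup."""
    def wrapped(arg: str) -> str:
        for hook in pre:
            hook(arg)
        output = tool(arg)
        for hook in post:
            output = hook(output)
        return output
    return wrapped


def block_risky(command: str) -> None:
    # Illustrative denylist only; a real policy would be far more careful.
    for fragment in ("rm -rf", "sudo ", "curl "):
        if fragment in command:
            raise BlockedToolCall(f"blocked: {fragment.strip()!r} in command")


def compact_log(output: str, max_lines: int = 3) -> str:
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return output
    return "\n".join(lines[:max_lines] + [f"... ({len(lines) - max_lines} more lines)"])


def fake_shell(command: str) -> str:
    # Hypothetical stand-in for a real shell tool; nothing is executed.
    return "\n".join(f"line {i}" for i in range(1, 6))


shell = with_hooks(fake_shell, pre=[block_risky], post=[compact_log])
print(shell("ls"))
```

Because the hooks live in the wrapper and not in the prompt, the policy holds regardless of what the model decides to ask for.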

Plan-Execute-Verify Loop: harness as cybernetic governor

The authors reframe iterative debugging as the PEV loop — and offer a strong metaphor:

The harness acts as a cybernetic governor: a control layer that observes the effects of agent actions and regulates subsequent state transitions.

Plan — contract formation. The plan is not a list of steps but an explicit contract over the next state transition: which files are relevant, which invariants are expected, which validation commands, rollback points, risky operations. Files like AGENTS.md, MCP server registries, typed tool schemas make available actions inspectable before execution.

Execute — sandboxed state transition. Isolated filesystem, dependency state, shell, language runtime, browser/IDE interface. Here the authors introduce a three-tier permissions model:

  • Read-only tier — browsing, retrieval, static inspection. Safe by default.
  • Sandbox-edit tier — local patching, test execution, dependency installation. Isolated from production.
  • Full-access tier — network, credentials, deployment, destructive operations. Mandatory HITL gate.
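A rough sketch of how the three tiers could be enforced in code. The tier names follow the list above; the action names, the `REQUIRED_TIER` mapping, and the `authorize` function are assumptions made for illustration, and treating unknown actions as full-access is my own conservative default.

```python
from enum import IntEnum


class Tier(IntEnum):
    READ_ONLY = 1
    SANDBOX_EDIT = 2
    FULL_ACCESS = 3


# Hypothetical mapping from action kinds to the tier they require.
REQUIRED_TIER = {
    "read_file": Tier.READ_ONLY,
    "search_repo": Tier.READ_ONLY,
    "apply_patch": Tier.SANDBOX_EDIT,
    "run_tests": Tier.SANDBOX_EDIT,
    "deploy": Tier.FULL_ACCESS,
    "use_credentials": Tier.FULL_ACCESS,
}


def authorize(action: str, granted: Tier, human_approved: bool = False) -> bool:
    """Allow an action only if the session tier covers it.

    Full-access actions additionally need explicit human approval (the HITL gate).
    Unknown actions are treated as full-access, the strictest case.
    """
    required = REQUIRED_TIER.get(action, Tier.FULL_ACCESS)
    if required > granted:
        return False
    if required == Tier.FULL_ACCESS and not human_approved:
        return False
    return True


print(authorize("apply_patch", Tier.SANDBOX_EDIT))                  # True
print(authorize("deploy", Tier.FULL_ACCESS))                        # False
print(authorize("deploy", Tier.FULL_ACCESS, human_approved=True))   # True
```

Note that granting the full-access tier is not enough on its own: the human gate is a separate condition, which matches the "mandatory" wording above.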

Verify — deterministic sensors. Compilation and static analysis as cheap pre-execution checks, runtime signals, test-based feedback (unit/integration/regression/fuzzing). Verification supplies evidence for repair, reflection, and termination. Reflexion stores verbal feedback memory, Self-Debugging uses explanation-guided repair.

The key principle: termination governed by verification, not model confidence. The cycle stops when required checks pass, when additional attempts don’t improve state, when the risk tier changes, or when human review is needed. Not when the model says “I’m done.”
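The four stopping conditions can be written down as a single function. This is a sketch, not the authors' formulation: `stop_reason`, its parameters, and the `patience` threshold for "additional attempts don't improve state" are my assumptions. What matters is what is absent from the signature: the model's own claim of being done.

```python
from typing import List, Optional


def stop_reason(checks_passed: bool,
                failing_counts: List[int],
                risk_tier_changed: bool,
                human_review_needed: bool,
                patience: int = 2) -> Optional[str]:
    """Return why the loop should stop, or None to keep iterating.

    failing_counts holds the number of failing checks after each attempt.
    The model's self-reported confidence is deliberately not an input.
    """
    if human_review_needed:
        return "human review needed"
    if risk_tier_changed:
        return "risk tier changed"
    if checks_passed:
        return "required checks pass"
    # Stagnation: the last `patience` attempts never beat the attempt before them.
    recent = failing_counts[-(patience + 1):]
    if len(recent) == patience + 1 and min(recent[1:]) >= recent[0]:
        return "no improvement"
    return None


print(stop_reason(False, [5, 3, 3, 3], False, False))  # no improvement
print(stop_reason(False, [5, 3, 2], False, False))     # None
```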

[Figure: Plan-Execute-Verify loop with three permission tiers. Figure note: the same command is safe in a sandbox and unsafe in a production repo; permission depends on context, not just on the command.]
Plan-Execute-Verify loop with three permission tiers. Read-only — safe by default. Sandbox-edit — isolated. Full-access — mandatory human gate. Termination is determined by verification, not by the model's confidence.

Agentic Harness Engineering: evolution of the harness itself

The newest and most interesting idea in the whole paper. Agentic Harness Engineering (AHE) answers one question: how do you measure and revise the software substrate that turns an LLM into a coding agent? This isn’t prompt engineering and isn’t context engineering. The object of change is the operational environment itself: tool schemas, planning artifacts, memory policies, retrieval strategies, sandbox configuration, permission tiers, routing rules, workflow.

The key observation that raises the bar for the whole discussion: many observable failures in code agents come not from model generation but from missing repository context, brittle tool interfaces, weak validators, excessive token cost, poor retry policies, mismatched permission boundaries. When an agent stalls — it usually isn’t the model stalling, it’s the environment around it.

Deep Telemetry as the Optimization Substrate. A shallow log records only the final answer. Deep telemetry records prompts and retrieved context, token usage, latency, tool arguments, permission requests, edited files, sandbox snapshots, command outputs, test results, stack traces, lint warnings, branch decisions, rejected alternatives, human interventions, and the final outcome. Without this, harness revision stays anecdotal debugging. With it, comparative diagnosis becomes possible: token-cost traces show when retrieval or reflection consumes budget without improving verification outcomes; decision-tree traces show where the agent repeatedly picks unproductive tools.
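One way to picture a token-cost trace is as a list of per-step records that can be aggregated by tool. The sketch below is a simplification under my own assumptions: `StepRecord`, its fields, and `wasted_tokens_by_tool` are hypothetical, the numbers are invented, and crediting an improvement to the single step that preceded it is cruder than real attribution would need to be.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StepRecord:
    tool: str
    tokens: int
    failing_before: int   # failing checks before the step
    failing_after: int    # failing checks after the step


def wasted_tokens_by_tool(trace: List[StepRecord]) -> Dict[str, int]:
    """Sum tokens spent on steps that did not reduce the failing-check count."""
    wasted: Dict[str, int] = defaultdict(int)
    for step in trace:
        if step.failing_after >= step.failing_before:
            wasted[step.tool] += step.tokens
    return dict(wasted)


# Invented trace for illustration.
trace = [
    StepRecord("retrieve", 1200, 3, 3),
    StepRecord("edit", 400, 3, 1),
    StepRecord("reflect", 900, 1, 1),
    StepRecord("retrieve", 800, 1, 1),
    StepRecord("edit", 300, 1, 0),
]

waste = wasted_tokens_by_tool(trace)
print(waste)  # {'retrieve': 2000, 'reflect': 900}
```

With only a final-answer log, none of this is recoverable: the run above simply "succeeded", and the 2,900 tokens that bought nothing stay invisible.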

The Evolution Agent. This is conceptually a new object:

An Evolution Agent is a meta-level agent that uses deep telemetry to propose, evaluate, and promote revisions to harness components. Unlike a task agent, which edits the target repository, the Evolution Agent edits the operating conditions under which later task agents work.

A five-step cycle: (1) observes trajectories — collects telemetry from PEV executions; (2) diagnoses failure modes — attributes cost, latency, invalid actions to specific harness components; (3) proposes candidate revisions — rewrites tool descriptions, changes context packing rules, adds a linter, modifies retry limits, inserts a HITL gate; (4) evaluates on held-out tasks or replayed traces with deterministic sensors; (5) promotes only changes that improve reliability, cost, or safety without regressing previously solved cases.
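Step (5) of the cycle, promotion without regression, is the part that is simplest to pin down in code. Here is a sketch of such a gate, assuming each harness version has been evaluated on the same held-out tasks. `should_promote` and the task IDs are hypothetical, and the sketch covers only reliability and cost; the safety criterion from the text is left out.

```python
from typing import Dict


def should_promote(baseline: Dict[str, bool], candidate: Dict[str, bool],
                   baseline_cost: float, candidate_cost: float) -> bool:
    """Promote a harness revision only if nothing previously solved regresses
    and it improves either the solve count or the cost."""
    regressions = [
        task for task, solved in baseline.items()
        if solved and not candidate.get(task, False)
    ]
    if regressions:
        return False
    more_solved = sum(candidate.values()) > sum(baseline.values())
    cheaper = candidate_cost < baseline_cost
    return more_solved or cheaper


# Invented evaluation results: task id -> solved?
baseline = {"t1": True, "t2": False, "t3": True}
fixes_t2 = {"t1": True, "t2": True, "t3": True}
breaks_t1 = {"t1": False, "t2": True, "t3": True}

print(should_promote(baseline, fixes_t2, 10.0, 10.0))   # True
print(should_promote(baseline, breaks_t1, 10.0, 5.0))   # False
```

The second call is the interesting one: the candidate is cheaper and solves a new task, yet it is rejected because it lost a previously solved case.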

Example systems: EvoMAC (evolution agents for workflow DAG), SEW (self-evolving workflow), Live-SWE-Agent (online evolution through live issue trajectories).

Governed Harness Mutation. AHE is not unconstrained self-modification. The Evolution Agent is itself subject to the PEV loop: it plans the harness mutation, executes it in an isolated evaluation environment, verifies through telemetry and regression tests, escalates risky changes to humans. Changes to permission boundaries, network access, credential handling, deployment behavior, or human-review requirements require HITL approval before activation.

Tools the authors mention as a practical AHE base: Langfuse (observability platform), OpenLLMetry (trace instrumentation), Promptfoo (evaluation harness), LiteLLM (gateway governance), Meta-Harness, AutoHarness.

§4. Scaling: code-centric harness for multiple agents

When a harness scales from one agent to many, three problems appear that a single-agent system doesn’t have to solve: one context window doesn’t fit the codebase plus interaction history plus execution traces; no effective specialization (planning + synthesis + testing + review at once is too much); no independent verification channels, and the agent can’t reliably detect its own mistakes.

[Figure: four levels of shared-state formalization in multi-agent systems, ordered from low to high formality of shared state]
Four levels of shared-state formalization in multi-agent systems. Most current systems sit in the Implicit/File-only category with no persistent queryable representation. The higher the formalization (repository → execution → blackboard), the less state divergence.

Roles and interaction modes

Roles mirror human software development: Manager, Planner, Coder (Synthesizer), Reviewer, Tester, Executor. Program synthesis: Coder in Self-Collaboration, Programmer in AgentCoder, Engineer in MetaGPT, Developer in ChatDev. Verification: the Test Designer in AgentCoder operates independently of the code, to avoid circular reasoning. AutoSafeCoder adds a Static Analyzer and Fuzzing Agent. Planning: Architect and Project Manager in MetaGPT, Manager in MAGIS; in SoA, Mother agents dynamically spawn Child agents at runtime based on inferred complexity.

A unique case is EvoMAC: it introduces two meta-roles (the Gradient Agent reads execution logs to attribute failures to agents; the Updating Agent revises prompts and the structure of the workflow DAG), and they operate at the level of the multi-agent system (MAS) as a whole.

Interaction modes: collaborative synthesis (PairCoder with Navigator-Driver); critique and repair (the dominant mode — verification agent inspects, synthesis agent revises; AgentCoder uses a real Python executor, not an LLM simulator); adversarial validation / red-teaming (AutoSafeCoder — Fuzzing Agent with type-aware mutation; MAGE — waveform-based simulation mismatch); reasoning debate (ChatDev does communicative de-hallucination; CANDOR — three independent Panelists plus a Curator via majority vote).

Topologies: pre-defined vs adaptive

Pre-defined heuristic: chain/waterfall (ChatDev, MetaGPT: design → coding → testing); cyclic/agile (AgentCoder: programmer → executor → fail → programmer); hierarchical (MAGIS: Manager plus a dynamic Developer pool; SoA: Mother → Child recursively).

Adaptive is where it gets interesting: EvoMAC — workflow as DAG, Gradient + Updating Agent structurally modify the graph after each iteration (the only system in the survey with structural modification rather than parameter tuning). SEW optimizes workflow structures: sequence of agent calls, routing logic, feedback paths. FlowReasoner — a query-level meta-agent generates a tailored MAS for each input query.

Position: shared code-centric harness substrate

The authors put forward a position on the next generation of multi-agent intelligence. They look at 30+ systems and see four levels of shared-state formalization:

  1. Implicit / File-only (most: ChatDev, MetaGPT, FlowGen) — no persistent queryable representation; state divergence is invisible to the system.
  2. Repository-based (MAGIS, HyperAgent, Lingma SWE-GPT) — a navigable repository with dependency graphs, call hierarchies; SyncMind formally defines a ground-truth state and measures divergence.
  3. Execution-based (AgentCoder, AutoSafeCoder, MAGE) — shared state through runtime behavior; objective oracle signals; MAGE reaches clock-edge granularity.
  4. Blackboard / Shared-State (L2MAC, Cogito, Self-Collaboration) — explicit globally accessible structure; L2MAC implements the most principled blackboard.

And here’s the central gap they articulate very directly:

The majority of the literature resides in the implicit/file-only category, lacking any formal model of the shared harness substrate. The program, uniquely among multi-agent domains, is an artifact that executes. It produces objective, non-linguistic signals that could in principle anchor a formal shared substrate. Yet most systems fail to exploit this property at the architectural level.

They identify six patterns of harness-state convergence: correctness (test-gated), security (AutoSafeCoder), performance (MACRO), score-based (MAGE), consensus (CANDOR), implicit (ChatDev, MetaGPT — where convergence is simply “we ran out of the fixed iteration budget”).

And one observed trend worth underlining: topology complexity inversely correlates with harness-state formality. Complex adaptive topologies are a symptom of the absence of a formal shared representation, not of its presence. When there’s no shared picture of state, you have to wire up ever cleverer coordination.

A curious side fact: Self-Collaboration and QualityFlow show that LLM-simulated execution reaches 98%+ precision/recall without running code — but execution is still needed exactly at corner cases (runtime crashes, resource exhaustion, boundary conditions). Simulation is good for the average case, real execution is good for the edges.

§5. Applications and open problems

The authors walk through five application domains. I’ll give a short overview so the context lands.

Code Assistants — evolution from inline completion to autonomous SWE agents through expansion of the development harness around the model. SWE-agent and OpenHands in research; Claude Code, GitHub Copilot Coding Agents, Codex in production. The repository becomes the operational substrate (RepoCoder, CodexGraph, AutoCodeRover). One interesting 2026 trend — harness as distillation surface:

Production harnesses are no longer only deployment infrastructure; they are becoming a dominant source of training data for the next generation of code-assistant models.

Cursor trains with continuous online RL on real usage traces. codex-1 (an o3 derivative) is trained on long-horizon multi-turn coding interactions of the harness. The boundary between “the agent” and “the harness around the agent” is itself becoming a learnable surface. Production numbers: LingmaAgent autonomously resolves 16.9% of in-house Alibaba Cloud issues, 43.3% with manual intervention.

GUI/OS Agents — every observation is the rendered output of executable code (HTML, CSS, accessibility APIs), every action is a call into other code. Code is the lingua franca for observations, actions, evaluation, memory, world model. Environments: WebArena, OSWorld, AndroidWorld. Native grounding models: CogAgent, SeeClick, OS-Atlas, UI-TARS. Production: Anthropic Claude Computer Use, OpenAI CUA (Operator), Google Project Mariner.

Embodied Agents — code as control boundary and safety boundary. Layered harness: foundation model (semantic layer) + typed robot APIs + motion planners (admissibility boundary) + physical controllers. SayCan, Voyager, CaP — the paradigmatic systems. Memory here = skill library, and these are executable artifacts, not text.

Scientific Discovery — the ideal domain for code-as-harness, because the scientific method itself is a closed loop of hypothesize → design → execute → observe → revise, and each step can be done through a program. AI Scientist v1/v2 writes a whole ML paper as a single executable trace. AI co-scientist (Google) — multi-agent debate, and drug-repurposing hypotheses were experimentally validated. Virtual Lab designed 92 SARS-CoV-2 nanobodies in virtual meetings. AlphaEvolve through an evolutionary loop (Gemini → evaluators) discovered new matrix multiplication algorithms. Coscientist autonomously planned and executed palladium-catalyzed Suzuki couplings on real lab equipment. ChemCrow integrates 18 expert chemistry tools.

Personalization — the environment is not only the software system but the user with partially observable preferences. Preference state as editable artifact (AMem, MemRec). Open challenges: no reliable oracle for true user satisfaction, preference memory creates privacy risks, multi-stakeholder conflicts.

Seven open problems

This is the most sober part of the paper. The authors say directly what’s left unsolved.

1. Harness-Level Evaluation and Oracle Adequacy. Existing evaluations measure end-task success, conflating model capabilities, harness quality, tool reliability, feedback informativeness, environment difficulty. Harness-level metrics are needed: trajectory efficiency, verification strength, recovery ability, state consistency, safety compliance, replayability. Concrete fact: best step-level failure attribution accuracy in production reaches only 14–53% (per the “Why do multi-agent systems fail?” study). We poorly understand where exactly the system erred.

2. Semantic Verification Beyond Executable Feedback. Execution feedback creates a false sense of correctness: unit tests can be incomplete, GUI checkers miss unsafe intermediate actions, scientific scripts can encode invalid assumptions. Future harnesses must compose multiple verification artifacts, and each must explicitly declare: what it verifies, what it cannot verify, what confidence it provides. A useful direction — each accepted action carries an evidence bundle.

3. Self-Evolving Harnesses without Regression. Harness mutation must be handled as a code change to a safety-critical runtime. Each proposed change carries a change contract: what’s modified, which failure mode is targeted, what improvement is predicted, which invariants must hold, how to roll back. The danger: a new retrieval policy improves benchmark accuracy but increases hallucinated evidence; a new verifier increases pass rate by accepting underspecified solutions. Canaries and rollback for the harness — we don’t have those yet.

4. Transactional Shared Program State. A missing abstraction: each action should declare its read set, write set, assumptions, version dependencies, verifier obligations, conflict policy. Conflicts should be detected not only at the level of file diffs but at the level of plans, tests, retrieved evidence, permissions, memory entries, latent user requirements. We need semantic merge, rollback, dependency-aware locking, belief-state reconciliation.

5. Human-in-the-Loop Safety as Harness State. HITL control shouldn’t appear only as an occasional prompt interruption — it should become durable harness state. Each approval, rejection, policy exception, or reviewer correction should update permission rules, escalation policy, verification criteria. High-stakes approvals should be auditable state transitions: what was proposed, what evidence was shown, what risks were surfaced, who approved, what responsibility boundary changed.

6. Multimodal Code-Harness Systems. GUI agents observe screenshots; embodied agents observe egocentric images, force, and tactile signals; scientific agents observe plots and molecular structures. The harness must manage multimodal observations as persistent, queryable, verifiable state. Central challenges: multimodal context compression (preserve task-relevant visual evidence, not just reduce token cost) and visual grounding contracts (each action carries a grounded reference such as a bounding box, UI element, or frame index).

7. Toward a Science of Harness Engineering. The object of study is not only the model or the program but the complete closed-loop system: context, memory, tools, execution, feedback, safety, coordination, evaluation. Four properties of future systems: executable, inspectable, stateful, governed — autonomy constrained by permissions, verification, and accountability.
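
The read-set and write-set idea from challenge 4 can be made concrete with optimistic concurrency control. The sketch below is an illustration under stated assumptions, not something the survey specifies: `SharedState`, `Action`, and `commit` are hypothetical names, and plain version counters stand in for the richer conflict policies the authors describe. It shows a conflict that a file diff would miss, because the two agents write different keys while one of them planned against a stale version of the plan.

```python
from dataclasses import dataclass, field


@dataclass
class SharedState:
    """Versioned key-value store; every key carries a version counter."""
    values: dict = field(default_factory=dict)
    versions: dict = field(default_factory=dict)

    def read(self, key):
        # Returns (value, version); keys never written are at version 0.
        return self.values.get(key), self.versions.get(key, 0)


@dataclass
class Action:
    """An agent action that declares what it read and what it will write."""
    agent: str
    read_set: dict   # key -> version observed when the agent planned
    write_set: dict  # key -> new value


def commit(state: SharedState, action: Action) -> bool:
    """Apply the action only if nothing it read has changed since."""
    for key, seen in action.read_set.items():
        if state.versions.get(key, 0) != seen:
            return False  # stale read: the caller must re-plan or merge
    for key, value in action.write_set.items():
        state.values[key] = value
        state.versions[key] = state.versions.get(key, 0) + 1
    return True


state = SharedState()
_, plan_version = state.read("PLAN.md")

# Two agents plan against the same version of the plan.
coder = Action("coder", {"PLAN.md": plan_version}, {"PLAN.md": "step 1 done"})
tester = Action("tester", {"PLAN.md": plan_version}, {"tests/test_api.py": "new test"})

first = commit(state, coder)    # succeeds and bumps the version of PLAN.md
second = commit(state, tester)  # rejected: the tester's view of the plan is stale
```

The tester's commit is rejected even though it touches a different file. A fuller harness would follow the rejection with a re-plan or a semantic merge instead of a plain refusal.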

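The evidence bundle from challenge 2 can be pictured as a small record attached to every accepted action. This is a minimal sketch under assumptions: the field names, the 0.0 to 1.0 confidence scale, and the `accept` policy are all hypothetical and chosen only to mirror the three declarations listed above (what a check verifies, what it cannot verify, and what confidence it provides).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    verifier: str       # e.g. "unit_tests", "type_checker"
    passed: bool
    verifies: str       # what this check establishes
    cannot_verify: str  # what it says nothing about
    confidence: float   # self-declared, on a hypothetical 0.0 to 1.0 scale


def accept(bundle: list[Evidence], required: set[str], min_conf: float) -> bool:
    """Accept an action only if every required verifier passed confidently."""
    by_name = {e.verifier: e for e in bundle}
    for name in required:
        ev = by_name.get(name)
        # A missing verifier fails closed, exactly like a failing one.
        if ev is None or not ev.passed or ev.confidence < min_conf:
            return False
    return True


bundle = [
    Evidence("unit_tests", True, "covered code paths return expected values",
             "untested inputs, side effects outside the sandbox", 0.7),
    Evidence("type_checker", True, "static type consistency",
             "runtime behavior", 0.9),
]

accepted = accept(bundle, required={"unit_tests", "type_checker"}, min_conf=0.6)
# Demanding a verifier that the bundle lacks is treated as a failure.
rejected = accept(bundle, required={"unit_tests", "security_scan"}, min_conf=0.6)
```

Keeping `cannot_verify` as an explicit field is the point of the design: a passing test suite then carries its own blind spots with it instead of implying full correctness.
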
What you can take into work

Strip away the terminology and these are the engineering takeaways.

1. Distinguish three layers. When something breaks in your agent system, don’t immediately blame the model. Ask: is this a model-internal limit, a fault in the system harness, or something the agent built for itself during the run and then broke? Most failures are in the middle layer.

2. Files over context. PLAN.md, AGENTS.md, repository notes, and MCP server registries are all filesystem-backed control objects. They outlive the context window, and they are versioned and reviewed. If your plan lives only in the reasoning trace, you lose it at compaction and you can’t reason about it systematically.

3. Memory as state management, not “more context.” The six types of memory from §3.2 aren’t theory — they’re a practical checklist. What do you have now — working, semantic, experiential, long-term, shared, or none? What’s compacted, what’s offloaded to external storage? MemGovern with its governed experience replay is a separate wake-up call: the quality of stored experience matters more than its scale.

4. Three permission tiers — the minimum standard. Read-only / sandbox-edit / full-access plus a HITL gate for the last. If your agent has a single access level for everything — that’s a future incident.

5. Termination governed by verification, not model confidence. Don’t trust the model when it says “all done.” The loop stops on a signal from a deterministic sensor (test passes, lint clean, type checker silent), not on self-assessment.

6. Deep telemetry as the foundation of any improvement. Without the full trace (prompts, retrieved context, tool arguments, permission requests, edited files, sandbox snapshots, command outputs, test results, branch decisions, rejected alternatives) you can’t reason about harness improvements. You’ll be running on anecdotes. Langfuse, OpenLLMetry, and Promptfoo make a working stack.

7. Lifecycle hooks around tool use. Pre-use: permission checks, argument validation, block risky commands. Post-use: sanitize outputs, compact logs, update memory, trigger verification. A concrete pattern that fits any agent system in a few hours and pays off immediately.

8. Test designer independent from coder. If verification is generated by the same agent that writes the code — that’s circular reasoning. AgentCoder makes Test Designer a separate role, and that removes a whole class of false “successes.”

9. If you go multi-agent, think about shared state before complicating the topology. Adaptive topologies are often a symptom of a missing formal, shared picture of state. A simple shared blackboard can remove half the coordination overhead.

10. The harness is code that also evolves. Not as “let’s let the agent rewrite itself,” but as “let’s apply to the harness the same principles we apply to safety-critical code”: change contracts, canary, rollback, regression tests.
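
Takeaways 4 and 7 fit together naturally: the permission tier check is one of the pre-use hooks. The sketch below is a hypothetical illustration, not the API of any system named in the survey. The tool registry, the `approve` callback standing in for the HITL gate, and the 200-character output cap are all assumptions made for the example.

```python
from enum import IntEnum


class Tier(IntEnum):
    READ_ONLY = 1
    SANDBOX_EDIT = 2
    FULL_ACCESS = 3


# Hypothetical tool registry: tool name -> (required tier, implementation).
TOOLS = {
    "read_file": (Tier.READ_ONLY, lambda args: f"contents of {args['path']}"),
    "write_file": (Tier.SANDBOX_EDIT, lambda args: f"wrote {args['path']}"),
    "deploy": (Tier.FULL_ACCESS, lambda args: "deployed"),
}

audit_log = []


def pre_use(tool, args, granted, approve):
    """Pre-use hook: permission check, argument validation, HITL gate."""
    required, _ = TOOLS[tool]
    if required > granted:
        raise PermissionError(f"{tool} needs {required.name}")
    if ".." in str(args.get("path", "")):
        raise ValueError("path traversal blocked")
    # Only the highest tier goes through the human reviewer.
    if required is Tier.FULL_ACCESS and not approve(tool, args):
        raise PermissionError(f"{tool} rejected by human reviewer")


def post_use(tool, output):
    """Post-use hook: truncate the output and record the call."""
    output = output[:200]
    audit_log.append((tool, output))
    return output


def call_tool(tool, args, granted, approve=lambda tool, args: False):
    """Run a tool between its hooks; approval defaults to 'deny'."""
    pre_use(tool, args, granted, approve)
    _, impl = TOOLS[tool]
    return post_use(tool, impl(args))


result = call_tool("read_file", {"path": "notes.md"}, Tier.READ_ONLY)
```

Defaulting `approve` to a function that always refuses means a full-access tool can never run just because someone forgot to wire up the review step.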

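Takeaway 5 comes down to a loop whose exit condition never consults the model. The sketch below assumes that each check is a zero-argument callable returning a boolean; in practice it would wrap the exit code of a test runner, linter, or type checker. `run_until_verified` and `fake_agent` are hypothetical names used only for illustration.

```python
def run_until_verified(agent_step, checks, max_steps=5):
    """Loop until every deterministic check passes or the step budget runs out."""
    for step in range(1, max_steps + 1):
        claims_done = agent_step(step)  # recorded by the caller if needed, never trusted
        if all(check() for check in checks):
            return {"verified": True, "steps": step}
    return {"verified": False, "steps": max_steps}


# Toy workspace: two tests fail at the start.
workspace = {"failing_tests": 2}


def fake_agent(step):
    """Stand-in for a model turn: claims success every time, fixes things late."""
    if step >= 2:
        workspace["failing_tests"] = max(0, workspace["failing_tests"] - 1)
    return True  # the model always says "all done"


result = run_until_verified(
    fake_agent,
    checks=[lambda: workspace["failing_tests"] == 0],
)
```

The agent claims completion on its first turn, yet the loop keeps going until the third, when the check finally reports no failing tests.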

The survey closes with a thesis worth holding in mind:

Code refers to executable or machine-checkable artifacts, including programs, scripts, formal specifications, proof scripts, API schemas, tool definitions, tests, repositories, simulators, configuration files, and code-adjacent execution artifacts such as traces and logs.

That’s a very broad understanding of code. And when you look at modern production agents — Claude Code, Cursor, Codex — you see they work precisely because each of these categories is correctly built into their harness. Not because the models got smarter. Because the environment around them got better.

That’s the work that isn’t in the headline papers but determines whether the system actually works in reality.

Based on Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems, by Ning, Tieu, Fu et al. (UIUC, Meta, Stanford), arxiv.org.
