Agent Harnesses for Coding: What Actually Matters

After building and iterating on several of these systems, one lesson stands out clearly. Once you have a reasonably capable model, the majority of reliability, safety, iteration speed, and user trust comes from the harness around it. The model supplies reasoning and tool-call intent. The harness supplies everything else that turns that intent into coherent, recoverable work on a real codebase.

Introduction: Why the Harness Dominates Outcomes

Most people building agentic coding tools eventually notice the same pattern: problems that feel like “the model is not smart enough” or “the model keeps guessing” are often harness problems in disguise.

A harness is the deterministic runtime and control layer around a nondeterministic model. It owns the agent loop, session state, tool execution, context governance, safety boundaries, recovery semantics, and observability. Without a solid harness, even strong models produce brittle, opaque, or unsafe behavior. With a good one, the same model becomes noticeably more reliable and useful.

The uncomfortable reality is that you will not one-shot a great harness. Even with strong patterns and clean architecture, real usage with specific models exposes weaknesses in tool contracts, prompt shaping, context behavior, permission edges, recovery logic, and failure handling. Models change quickly too. What worked well last month can start feeling wrong after a model update. The harness is never really “done.” It is a living system that improves through deliberate iteration against the friction you encounter while using it.

This piece focuses on harnesses for software engineering work: exploring codebases, planning changes, making edits, running tests, recovering from mistakes, and iterating over real projects.

What A Coding Agent Harness Actually Owns

A good harness does more than connect a model to tools. It creates a stable operating environment in which the model can do useful work over long sessions.

At its core, the harness owns the full agent loop — driving the observe, decide, act, observe cycle rather than treating the model as a fancy autocomplete. It governs context deliberately, deciding what gets injected, retrieved, pinned, compacted, or offloaded each turn. It defines tool contracts: how tools are shaped, when they are available, how their results are presented, and how failures are handled.

It enforces safety and permission boundaries through data, host validation, and tool behavior rather than relying primarily on prompt instructions. It maintains session durability so work can be inspected, resumed, branched, or explained later. It defines clear recovery semantics when model calls fail, tools are interrupted, or the process crashes. It makes behavior observable through structured events. And it handles mode orchestration, treating planning, research, implementation, and review as distinct runtime contracts rather than just UI labels.

If any of these areas is weak, the resulting friction is almost always blamed on the model.

The Core Architectural Patterns

These patterns address the places where agentic coding tools most commonly break down in practice.

The harness must own the tool loop.
When a model only sees conversation history and a system prompt, it guesses file paths, invents structure, and asks questions it could answer itself. A proper harness makes tool use the natural path: search before claiming, read before editing, run commands before declaring success. Prompting helps, but the real difference comes from shaping the environment so the model is guided into effective tool use.

Context is engineered governance, not a bigger bucket.
Early systems treat context as a retrieval problem and try to stuff more into the window. Stronger harnesses treat it as an active governance problem. They introduce budgets, pinning of key decisions, on-demand retrieval, summarization, and offloading. The practical question shifts from “how do we make the model remember?” to “what should be in the window right now for this turn?” This governance must be tuned per model and per type of task.

Turn snapshots create stability.
Each model turn should begin from a stable snapshot of the runtime state: current model, tools, system prompt, permissions, session messages, and resources. Changes during the turn affect only future turns, never the in-flight request. This simple discipline prevents a surprising number of flaky, hard-to-debug issues caused by mid-turn mutations.

Agent profiles beat one giant prompt.
Planning, implementation, research, and review are fundamentally different tasks. Strong harnesses define explicit agent profiles that carry different tool permissions, risk tolerances, and final-answer contracts. A planning profile can literally deny edit tools at the registry level. This enforcement is far more reliable than trying to encode the difference in prompt text alone.

Edit proposals need host validation.
Editing is where weak harnesses become painfully obvious — full-file rewrites, stale content, missing reads, ugly diffs, and formatting churn. Better systems treat edits as proposals. The model suggests changes, but the host enforces read-before-write, content hashing for staleness detection, normalization, and safety guards before anything is applied. The model proposes. The harness validates and records.

Modes should be an orchestration state machine.
Plan versus execute is not a UI toggle. It requires different tool availability, final-answer contracts, approval gates, and handoff logic. When this is properly implemented as a state machine, behavior stops leaking between modes.

Subagents should be isolated sessions.
Subagents become truly useful when they are real isolated workers with their own context, model selection, tool permissions, and result contracts. This enables clean delegation for research, exploration, or parallel work without polluting the main session.

Durable boundaries are more realistic than magical resume.
Long-running work fails in many ways. Strong harnesses design around durable artifacts — session logs, plans on disk, checkpoints, and turn records — so recovery is possible even after crashes, timeouts, or interruptions. They mark unfinished work clearly instead of pretending everything completed cleanly.

Observability is part of the product.
If neither the user nor the builder can clearly see what the agent is doing, trust and iteration both collapse. Good harnesses emit structured events for tool activity, proposals, state changes, and errors, while keeping a clean, deterministic session log.

Critical Design Decisions

Several recurring choices define the character of a harness.

You must decide between a minimal extensible core and opinionated defaults with strong profiles. You must choose how much to normalize across models versus maintaining per-model profiles for tool presentation, reasoning handling, and defaults. You must decide where safety lives — in prompts, tool contracts, permission data, host validation, or some combination. And you must accept the iteration imperative: even excellent designs require ongoing tuning as you use the system with real codebases and evolving models.

Common Failure Modes

These patterns appear again and again when building these systems:

Shipping a chat interface before a real tool-using agent loop exists.
Treating context as a retrieval problem instead of active governance.
Relying on prompting for behaviors that belong in tool contracts, permissions, or host validation.
Weak edit pipelines that allow stale or unsafe changes.
Mode distinctions that exist only in the UI.
Hiding tool activity and state transitions from the user.
Under-investing in observability and recovery design.
Treating the harness as static infrastructure that can be “finished.”

Most of these get misdiagnosed as model limitations until the harness gaps are closed.

What Actually Moves The Needle

If I had to compress the practical advice into a short list, it would be this:

Make the tool loop real and proactive.
Treat context as a governed resource, not a bucket.
Make modes enforceable through profiles and contracts.
Make edits proposals validated by the host.
Make permissions and safety explicit data.
Use turn snapshots for stability.
Design long-running work around durable boundaries.
Make behavior observable.
Accept model differences at the boundary.
Build the harness to be easy to observe and iterate on.

Conclusion: The Harness Is The System

In agentic coding work, the harness is the system you are building. The model is a powerful but nondeterministic engine. The harness is the runtime, controls, safety systems, memory, audit trail, and recovery process around it.

Strong patterns give you a solid foundation, but they are only the beginning. Real progress comes from using the system heavily, identifying friction that feels like model problems but is usually a harness gap, and iteratively closing those gaps.

If you are building one of these systems, treat the harness as a first-class engineering surface from day one. The teams that do this will spend far less time fighting their agents and far more time shipping useful capabilities.

Introduction: Why the Harness Dominates Outcomes

What A Coding Agent Harness Actually Owns

The Core Architectural Patterns

Critical Design Decisions

Common Failure Modes

What Actually Moves The Needle

Conclusion: The Harness Is The System

More field notes

Meet Ripple: A Local-First Task Manager for Coding Agents

Building Your First Agentic Coding Harness

Portfolio Page Lesson