Visibility Into the Black Box

A dark matte cube on a workbench has one panel cracked open, revealing glowing blue circuit-lattice patterns inside, lit by a hanging work lamp with tools nearby. Title text: Visibility Into the Black Box, with Black Box in cyan. Site identifier: MATGRETEN.DEV.

I ship production code through an AI pipeline. Not to a toy project — to a ten-year-old Rails monolith that serves real users, reviewed by teammates who genuinely care about craft. They love OOP. They spot code smells three abstractions deep. They'll tell you when you're building something today that makes tomorrow harder.

One thing I hear often: these tools are really good at adding code, but not very good at deleting it. The value isn't just in what we ship — it's in what we choose not to ship. That's a hard standard to codify for an AI agent, but it's the standard the code has to meet.

80% of the time the pipeline gets me 90-95% of the way there. I make a few tweaks, respond to feedback, and it ships. I've seen a literal 12x increase in my PR throughput moving from using coding agents as assistance to running this pipeline. The other 20% is where I'm blind. When a run fails, when a tweak doesn't help, when I change a model and the patch rate shifts — I'm guessing. The pipeline works in a high-ownership, high-craftsmanship context, and I need visibility that matches that context.

That's what drove the migration to swamp. Make every decision the pipeline makes observable, so I can iterate on it with the same care my team expects from the code it produces.

What ADW is and Why it Exists

ADW — Agentic Development Workflow — is a pipeline I built for my own workflow. It's not a team tool (yet). It's one developer's attempt to automate the mechanical parts of shipping code against a production codebase that serves real users.

This is all inspired by IndyDevDan's AI Developer Workflow, which he uses to teach agentic engineering principles in his course Tactical Agentic Engineering.

The pipeline takes a plan and runs it through a series of phases:

Ideation: I describe what I want to build (often a messy brain dump), and the system explores the codebase, scores its own confidence, generates a structured contract, and runs an adversarial challenge against the plan before any code is written
Convert: breaks the plan into atomic user stories with acceptance criteria
Worktree: creates a git worktree, sets up Docker containers, seeds the environment
Build: spawns AI agents to implement each story, commits the work, manages parallel task groups. Think Ralph loops, but far more fine-grained, with acceptance criteria (AC) validated by a separate agent.
Test: runs the relevant spec suite, identifies failures, loops back for fixes
Review: a separate agent reviews for correctness, OOP violations, query performance. Blockers trigger patch-and-re-review cycles.
Ship: submits PRs via Graphite, adds descriptions, handles submission failures
Artifact: creates a walkthrough review artifact to help me get back "in the loop" after the agents are done cooking.

The whole thing is about 10,000 lines of Ruby.

But it's not even close to done. I'm actively dogfooding every part of it, and the feedback loop is constant. Every PR teaches me something — a standard that isn't codified, a pattern the agents miss, a review heuristic that catches real issues. Part of building ADW has been codifying our team's standards so that agents can follow them more consistently. The agents surface gaps I didn't even know existed in our documented conventions.

One big tension I see between frontier agentic software development and the kind of codebase I work on is our standard of atomic, bite-sized PRs that are easy for humans to review. In the short term, this project can't just let agents ship gigantic PRs, so we need a way to break the work into logical chunks for review. That's where Graphite comes in: stacking PRs makes this much easier than it was before I used it.

All that to say: work that used to take me months, I can now produce at similar quality, if not better, in a morning ADW run. The code mostly adheres to our standards and awaits human review.

The Ideation Phase is Still the Most Critical Part

As more of the implementation becomes automated, the quality of the input matters more, not less. A bad plan implemented perfectly is still a bad plan. The ideation phase — where I describe what I want and the system explores the codebase, scores confidence, and challenges the approach before any code is written — is the highest-leverage part of the entire pipeline.

Right now, ideation happens in the Claude context and leaves no structured trace. I can see the contract it generates. I can read the adversarial challenge. But I can't query across 50 ideation sessions to ask "how often does the confidence score predict actual build success?" or "which types of problems consistently score low confidence and then fail?" That's a candidate for swamp — and maybe the most valuable one left.
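
To make the first of those questions concrete, here is a minimal sketch of the query, assuming each ideation session had been recorded with a confidence score and a build outcome. The `IdeationOutcome` shape, the bucket thresholds, and the sample data are all hypothetical; none of this exists in ADW or swamp today.

```typescript
// Hypothetical record shape: not an existing ADW or swamp schema.
interface IdeationOutcome {
  confidence: number; // self-scored confidence in [0, 1]
  buildSucceeded: boolean;
}

interface BucketStats {
  bucket: string;
  sessions: number;
  successRate: number;
}

// Group sessions into low / medium / high confidence and report the build
// success rate per group. The thresholds are arbitrary assumptions.
function successRateByConfidence(outcomes: IdeationOutcome[]): BucketStats[] {
  const buckets = [
    { label: "low", min: 0, max: 0.5 },
    { label: "medium", min: 0.5, max: 0.8 },
    { label: "high", min: 0.8, max: 1.0001 }, // upper bound includes 1.0
  ];
  return buckets.map(({ label, min, max }) => {
    const inBucket = outcomes.filter(
      (o) => o.confidence >= min && o.confidence < max,
    );
    const successes = inBucket.filter((o) => o.buildSucceeded).length;
    return {
      bucket: label,
      sessions: inBucket.length,
      successRate: inBucket.length === 0 ? 0 : successes / inBucket.length,
    };
  });
}

// Made-up sample data, purely to show the call shape.
const sampleOutcomes: IdeationOutcome[] = [
  { confidence: 0.3, buildSucceeded: false },
  { confidence: 0.4, buildSucceeded: true },
  { confidence: 0.9, buildSucceeded: true },
  { confidence: 0.95, buildSucceeded: true },
];
console.log(successRateByConfidence(sampleOutcomes));
```

If the low bucket fails noticeably more often than the high one across enough sessions, the confidence score is carrying real signal. If the rates look the same, it is decoration.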

The pattern holds: as agents handle more of the execution, the human decisions at the top of the funnel become the biggest lever. Getting ideation right — and being able to measure whether it's getting better — is where the real gains are.

The Black Box Problem

Here's a concrete example. The build phase picks an AI model for each task. When I changed the review phase from Opus to Sonnet to save cost, I thought the patch-attempt rate went up. But I couldn't prove it. The provider choice was buried in Ruby, logged as a line in a text file, and gone.

Same story everywhere:

How often does Graphite PR submission fail and fall back to gh pr create?
What's the actual phase failure rate? Build fails more than test, I think. But by how much?
When the review phase does a second patch round, does it usually succeed? Or am I wasting tokens?
I keep iterating on prompts, models, thresholds, heuristics — is any of it actually moving the numbers?

I'd wrapped deterministic layers around the pipeline — typed schemas, structured event emission, lifecycle tracking. But the data was locked in log files and JSON dumps that nothing queried systematically. The pipeline was reliable. It was also a black box when I needed to understand why or measure whether my changes helped.

What Swamp Provides

I already had the deterministic layer — that's what ADW is. The pipeline runs the same phases in the same order every time. It validates inputs, enforces constraints, produces structured output. The determinism wasn't the problem. The problem was I couldn't see inside it. I had no way to query what decisions were made across hundreds of runs, no way to compare before and after when I changed something.

After listening to Adam Jacob on Changelog & Friends talk about swamp, it clicked. Typed schemas, versioned data artifacts, methods that produce queryable output — that's the observability layer I was missing on top of the deterministic one I'd already built. Swamp gave me a way to instrument the pipeline I had.

Swamp is a structured data layer for automation. The core idea: every method execution produces a typed, versioned, immutable data artifact that's queryable with CEL expressions. The agent (or pipeline, in my case) still does the work. Swamp records what happened in a shape you can query later.

The question I asked about each piece of code was simple: does anyone need to observe this later? If a future dashboard, a downstream step, or me-at-2am-debugging-a-failure needs to see what happened and why — it should produce a swamp artifact. If it's a shell command where the side effect is the artifact (a git commit exists or it doesn't), opaque execution is fine.

The Migration Wasn't Hard, Just Some Tokens

Three days, about 155 commits. None of it was technically difficult. Let's be honest: I let Claude Code Opus 4.7 do this over the weekend while I approved or steered things from my phone whenever I remembered to check in.

The work fell into three categories:

1. Logic that genuinely moved. Provider resolution — which AI model and CLI tool to use for each phase — was a 268-line Ruby class. That logic now lives entirely in a TypeScript swamp extension with a 5-level priority hierarchy. The Ruby class was deleted. Not wrapped, not duplicated — replaced. Every resolution produces a typed artifact recording which source won and why:

// adw_session.ts — resolveProvider
// The full decision tree lives here now.
// Ruby's ProviderResolver (268 LOC) is gone.

const sources = [
  { check: policyOverrides, label: "policy_override" },
  { check: configPhaseProviders, label: "config_phase" },
  { check: configProviders, label: "config_project" },
  { check: PHASE_DEFAULTS, label: "phase_default" },
  { check: FALLBACK, label: "fallback_default" },
];
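
The list above only names the five levels. One way the walk over them might look is sketched below, under stated assumptions: the `Source` and `ResolutionArtifact` types, the example config values, and the `resolveProviderSketch` name are all invented for illustration, and the real `adw_session.ts` may differ.

```typescript
// Hypothetical types and values: the real extension is not shown in full.
type Phase = "build" | "test" | "review";
type PhaseLookup = Partial<Record<Phase, string>>;

interface Source {
  label: string;
  check: (phase: Phase) => string | undefined;
}

interface ResolutionArtifact {
  phase: Phase;
  provider: string;
  source: string; // label of the level that supplied the provider
}

// Walk the sources in priority order; the first one with an answer wins.
function resolveProviderSketch(phase: Phase, sources: Source[]): ResolutionArtifact {
  for (const { label, check } of sources) {
    const provider = check(phase);
    if (provider !== undefined) {
      return { phase, provider, source: label };
    }
  }
  throw new Error(`no provider configured for phase "${phase}"`);
}

// Example inputs, invented for illustration.
const examplePolicyOverrides: PhaseLookup = {};
const exampleConfigPhase: PhaseLookup = { review: "sonnet" };
const exampleConfigProject: { provider?: string } = {};
const examplePhaseDefaults: PhaseLookup = { build: "opus" };
const exampleFallback = "sonnet";

const exampleSources: Source[] = [
  { label: "policy_override", check: (p) => examplePolicyOverrides[p] },
  { label: "config_phase", check: (p) => exampleConfigPhase[p] },
  { label: "config_project", check: () => exampleConfigProject.provider },
  { label: "phase_default", check: (p) => examplePhaseDefaults[p] },
  { label: "fallback_default", check: () => exampleFallback },
];

console.log(resolveProviderSketch("review", exampleSources));
console.log(resolveProviderSketch("test", exampleSources));
```

The `source` field is the part that matters for observability: it lets a later query count how often each level, including the fallback, actually decided the outcome.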

2. Logic that lives in both places (for now). Error classification and branch strategy have the decision logic implemented in swamp, but Ruby keeps a fallback copy for resilience. The swamp path runs first and produces the queryable artifact. If swamp is unavailable, Ruby's local logic kicks in — same decision, but unrecorded. Over time, as the swamp path proves reliable, the Ruby fallbacks get removed. This is a migration pattern, not the end state.
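
The primary-then-fallback shape is easy to sketch. The version below is TypeScript for consistency with the extension code, although the real fallback lives in Ruby. The error classes, the regex rule, and every function name are hypothetical.

```typescript
// Hypothetical names throughout; the real fallback copy is Ruby.
type ErrorClass = "transient" | "permanent";

interface Decision {
  errorClass: ErrorClass;
  recorded: boolean; // true only when the primary (recording) path answered
}

// Local copy of the rule, kept for resilience. The rule itself is invented.
function classifyLocally(message: string): ErrorClass {
  return /timeout|rate limit/i.test(message) ? "transient" : "permanent";
}

// `primary` stands in for the swamp-backed classifier, which is assumed to
// throw when swamp is unavailable.
function classifyWithFallback(
  message: string,
  primary: (message: string) => ErrorClass,
): Decision {
  try {
    return { errorClass: primary(message), recorded: true };
  } catch {
    // Same decision logic, but nothing queryable is produced.
    return { errorClass: classifyLocally(message), recorded: false };
  }
}

// Simulate the primary path being down.
const unavailablePrimary = (_message: string): ErrorClass => {
  throw new Error("primary path unavailable");
};
console.log(classifyWithFallback("Request timeout after 30s", unavailablePrimary));
```

Carrying a `recorded` flag on the result is one way to know how often the fallback fires, which is the evidence needed before deleting it.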

3. Logic that stays in Ruby but now records its output. The build, test, and review phases still execute in Ruby — they spawn agents, manage subprocesses, handle retries. That's structural I/O that doesn't belong in swamp currently. But their results now get written to swamp as typed artifacts. The phase still runs in Ruby; the outcome becomes queryable. This isn't extraction — it's instrumentation. And it's where most of the observability value comes from.
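
Instrumentation of this kind usually reduces to a thin wrapper around the existing call. The sketch below is TypeScript rather than the pipeline's Ruby, and the `PhaseOutcome` shape and `Recorder` callback are assumptions, not the actual schema or a swamp API.

```typescript
// Hypothetical shapes: not the actual ADW or swamp schema.
interface PhaseOutcome {
  phase: string;
  status: "success" | "failure";
  durationMs: number;
  error?: string;
}

type Recorder = (outcome: PhaseOutcome) => void;

// Run the phase unchanged, then hand a structured outcome to the recorder.
function runInstrumented<T>(phase: string, run: () => T, record: Recorder): T {
  const startedAt = Date.now();
  try {
    const result = run();
    record({ phase, status: "success", durationMs: Date.now() - startedAt });
    return result;
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    record({ phase, status: "failure", durationMs: Date.now() - startedAt, error });
    throw err; // the caller still sees the failure
  }
}

// An in-memory array stands in for the real artifact store.
const recordedOutcomes: PhaseOutcome[] = [];
runInstrumented("build", () => "7 tasks committed", (o) => recordedOutcomes.push(o));
console.log(recordedOutcomes);
```

The phase body does not change at all. Only the boundary does, which is why this category is cheap to roll out.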

What I Can Query Now

Here's a single session record:

$ swamp data get adw-analytics session-adw-rif-193
{
  "issueClass": "feature",
  "finalVerdict": "proceed",
  "phasesCompleted": ["convert","worktree","build","test","review","refine"],
  "tasksTotal": 7,
  "tasksCompleted": 7,
  "totalCostUsd": 0.42,
  "totalAgentCalls": 23
}

There are 263 of those. The analyze method aggregates across all of them:

{
  "sessionCount": 263,
  "byIssueClass": {
    "feature": { "count": 89, "avgCostUsd": 0.51, "avgDurationMs": 482000 },
    "bug":     { "count": 74, "avgCostUsd": 0.38, "avgDurationMs": 341000 },
    "chore":   { "count": 100, "avgCostUsd": 0.22, "avgDurationMs": 195000 }
  },
  "phaseFailureRates": {
    "build": { "runs": 263, "failures": 31, "rate": 0.118 },
    "test":  { "runs": 232, "failures": 12, "rate": 0.052 },
    "review":{ "runs": 220, "failures": 8,  "rate": 0.036 }
  }
}

Note: these numbers are from test runs against ADW itself during the migration, not from production code-writing sessions.
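
The rates in that output are plain division. A minimal sketch of the arithmetic, using the run and failure counts shown above (the `withRates` helper is hypothetical, not the actual analyze method):

```typescript
interface PhaseCounts {
  runs: number;
  failures: number;
}

interface PhaseRate extends PhaseCounts {
  rate: number;
}

// Failure rate per phase, rounded to three decimal places.
function withRates(counts: Record<string, PhaseCounts>): Record<string, PhaseRate> {
  const out: Record<string, PhaseRate> = {};
  for (const [phase, { runs, failures }] of Object.entries(counts)) {
    const rate = runs === 0 ? 0 : Math.round((failures / runs) * 1000) / 1000;
    out[phase] = { runs, failures, rate };
  }
  return out;
}

const phaseRates = withRates({
  build: { runs: 263, failures: 31 },
  test: { runs: 232, failures: 12 },
  review: { runs: 220, failures: 8 },
});
console.log(phaseRates);
// build 0.118, test 0.052, review 0.036
```

The run counts also form a funnel: 263 builds minus 31 failures leaves the 232 sessions that reached test, and 232 minus 12 leaves the 220 that reached review.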

Build fails at 11.8%. That's the biggest bottleneck — not the review phase, which is what I assumed. Every patch round is tracked individually, so I can measure whether second attempts resolve findings or just burn tokens. Provider resolution records where each choice came from, so when I change a config I can see whether the fallback starts firing more.
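
The patch-round question comes down to a filter and a ratio once each round is its own record. A minimal sketch, assuming a hypothetical `PatchRound` shape and invented sample data:

```typescript
// Hypothetical record shape for one patch attempt in the review phase.
interface PatchRound {
  sessionId: string;
  round: number; // 1 = first patch attempt, 2 = second, and so on
  resolvedAllBlockers: boolean;
}

// Share of attempts at a given round number that cleared every blocker.
// Returns null when no session reached that round.
function resolutionRateForRound(rounds: PatchRound[], round: number): number | null {
  const matching = rounds.filter((r) => r.round === round);
  if (matching.length === 0) return null;
  const resolved = matching.filter((r) => r.resolvedAllBlockers).length;
  return resolved / matching.length;
}

// Invented sample data, purely to show the call shape.
const samplePatchRounds: PatchRound[] = [
  { sessionId: "a", round: 1, resolvedAllBlockers: false },
  { sessionId: "a", round: 2, resolvedAllBlockers: true },
  { sessionId: "b", round: 1, resolvedAllBlockers: true },
  { sessionId: "c", round: 1, resolvedAllBlockers: false },
  { sessionId: "c", round: 2, resolvedAllBlockers: false },
];
console.log(resolutionRateForRound(samplePatchRounds, 2));
```

A second-round rate near zero over many sessions would suggest capping the review loop at one patch attempt. A high rate would justify the extra tokens.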

These aren't hypotheticals. These are questions I've been asking myself for months and answering with gut feeling. Now I have data.

What I'm Building Next

The structured data is there. What's missing is the layer that turns it into continuous feedback:

Failure pattern detection — are build failures mostly "agent produced no changes" (bad plan) or "tests failed after implementation" (real bugs)? The artifacts have the data.
Regression alerts — if build failure rate creeps above 15% over a rolling week, something changed. Notify me.
A/B provider comparisons — run 20 features with Opus review, 20 with Sonnet, compare patch rates and cost. The resolution artifacts already exist.
Ideation observability — the biggest candidate left. Record confidence scores, adversarial challenge outcomes, and plan shapes in swamp so I can correlate them with build success. "Does low ideation confidence actually predict failure?" is a question I'd love to answer.
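
As one example from the list above, the regression alert can be sketched as a rolling-window check. The `BuildRun` shape, the function name, and the sample dates are assumptions, and delivering the notification is left out entirely.

```typescript
// Hypothetical record shape for one build-phase run.
interface BuildRun {
  finishedAt: Date;
  failed: boolean;
}

interface AlertResult {
  runs: number;
  failures: number;
  rate: number;
  alert: boolean;
}

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Failure rate over the seven days ending at `now`; alert when it exceeds
// the threshold (15% by default, per the rule described above).
function buildFailureAlert(runs: BuildRun[], now: Date, threshold = 0.15): AlertResult {
  const windowStart = now.getTime() - WEEK_MS;
  const recent = runs.filter((r) => {
    const t = r.finishedAt.getTime();
    return t > windowStart && t <= now.getTime();
  });
  const failures = recent.filter((r) => r.failed).length;
  const rate = recent.length === 0 ? 0 : failures / recent.length;
  return { runs: recent.length, failures, rate, alert: rate > threshold };
}

// Invented sample data: five runs inside the window, one outside it.
const alertNow = new Date("2025-01-15T00:00:00Z");
const daysAgo = (d: number): Date =>
  new Date(alertNow.getTime() - d * 24 * 60 * 60 * 1000);
const sampleBuildRuns: BuildRun[] = [
  { finishedAt: daysAgo(1), failed: true },
  { finishedAt: daysAgo(2), failed: false },
  { finishedAt: daysAgo(3), failed: false },
  { finishedAt: daysAgo(4), failed: true },
  { finishedAt: daysAgo(5), failed: false },
  { finishedAt: daysAgo(10), failed: true }, // outside the window, ignored
];
console.log(buildFailureAlert(sampleBuildRuns, alertNow));
```

With few runs per week the rate is noisy, so a real alert would probably also want a minimum run count before firing.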

I'm iterating on this system constantly — new prompts, model swaps, threshold changes, codified standards I didn't know were missing. Every change is a hypothesis. The pipeline ships code. Swamp lets me test the hypotheses instead of just hoping.

Thank you, Swamp.club, for helping me walk a little bit closer to Agentic Engineering than Vibe Engineering.