From Manual Checklist to Merge Gate: E2E Testing with Swamp

[Header image: a closed metal gate with a green signal light stands across a wet, foggy road at night, loose checklist papers scattered near its base. Site: MATGRETEN.DEV]

I kept shipping small changes to Moment Savor that quietly broke something I wasn't looking at. It's a family memory journal that my own family uses, and I'd push a fix for one thing and silently regress something else. Nothing catastrophic. Just the slow accumulation of embarrassing bugs that I'd catch days later.

I knew I needed end-to-end tests. I just didn't want to set up Cypress.

That's not a knock on Cypress. It's a real tool that solves a real problem. But for a side project I work on in the margins, the overhead felt wrong: a browser binary to maintain, a new framework to learn, another CI service to wire up and pay for. The cost-benefit didn't pencil out.

So I had a manual checklist instead. When I wanted to verify something before a deploy, I'd feed the checklist to an AI agent and let it drive a browser CLI through the app. Sign in, create a memory, check the API, log out. It worked, kind of. It wasn't automatic, it didn't gate anything, and I only did it when I remembered to. Which wasn't often enough.

Then I heard the Swamp team describe how they use UAT as a gate in their own workflows. Not as a theoretical best practice, but as something they actually ran on real deployments. Something clicked. This is exactly what I need. And I don't have to do it the way I've been doing it.

The insight: I already have an HTTP API

Moment Savor has a real REST JSON API that the iOS and Android clients use. If I wanted to test the app, I didn't need a browser. I needed HTTP.

The Swamp extension I built is TypeScript/Deno. It makes raw fetch() calls against a running Rails server. No Playwright, no headless Chrome, no Selenium. The one piece of browser behavior I needed to replicate was session management: CSRF tokens, redirect chains, Set-Cookie headers. So I wrote a CookieJar class to handle that. It's short:

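class CookieJar {
  // Cookie name -> value; attributes such as Path and Expires are ignored.
  private cookies: Map<string, string> = new Map();

  // Record every Set-Cookie header on a response.
  setCookies(headers: Headers) {
    for (const value of (headers.getSetCookie?.() ?? [])) {
      const [pair] = value.split(";");
      // Split on the first "=" only, since cookie values can contain "=".
      const eq = pair.indexOf("=");
      const k = (eq === -1 ? pair : pair.slice(0, eq)).trim();
      const v = eq === -1 ? "" : pair.slice(eq + 1).trim();
      if (k) this.cookies.set(k, v);
    }
  }

  // Build the Cookie request header for the next call.
  cookieHeader(): string {
    return [...this.cookies.entries()]
      .map(([k, v]) => `${k}=${v}`)
      .join("; ");
  }
}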

That's it. That replaces a browser for auth testing purposes. The suite hits /users/sign_in, extracts the CSRF token from the HTML, POSTs credentials, and holds the session cookie for the rest of the run.
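The sign-in flow described above might look roughly like the sketch below. It assumes a Devise-style form with a hidden authenticity_token input and user[email] / user[password] fields, and that a successful sign-in answers with a redirect. The Jar interface, extractCsrfToken, and signIn are hypothetical names for illustration, not the extension's actual code.

```typescript
// Anything with the CookieJar's two methods will do (hypothetical interface).
interface Jar {
  setCookies(headers: Headers): void;
  cookieHeader(): string;
}

// Pull the hidden authenticity_token value out of the sign-in page HTML.
function extractCsrfToken(html: string): string | null {
  const input = html.match(/<input[^>]*name="authenticity_token"[^>]*>/i);
  if (!input) return null;
  const value = input[0].match(/value="([^"]*)"/i);
  return value ? value[1] : null;
}

// Assumed field names: user[email] and user[password], as Devise uses by default.
async function signIn(
  baseUrl: string,
  jar: Jar,
  email: string,
  password: string,
): Promise<void> {
  // GET the form first: it sets the session cookie and embeds the CSRF token.
  const page = await fetch(`${baseUrl}/users/sign_in`);
  jar.setCookies(page.headers);
  const token = extractCsrfToken(await page.text());
  if (!token) throw new Error("no CSRF token found on sign-in page");

  const body = new URLSearchParams({
    authenticity_token: token,
    "user[email]": email,
    "user[password]": password,
  });
  // redirect: "manual" keeps the redirect response, and its Set-Cookie, visible.
  const res = await fetch(`${baseUrl}/users/sign_in`, {
    method: "POST",
    headers: { cookie: jar.cookieHeader() },
    body,
    redirect: "manual",
  });
  jar.setCookies(res.headers);
  if (res.status !== 302 && res.status !== 303) {
    throw new Error(`sign-in failed with status ${res.status}`);
  }
}
```

Following redirects manually matters here: if fetch followed the redirect on its own, the rotated session cookie on the intermediate response would never reach the jar.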

The structure: seven test sections, one method

The extension exports a single method: runAll. When called, it:

1. Signs in via web session and creates two API tokens (E2E-RW and E2E-RO)
2. Runs seven test sections: web auth, API auth, token CRUD, memory CRUD, family management, push tokens, and rate limiting
3. Tears down: deletes all [E2E]-prefixed memories, cleans up test invitations, revokes both tokens
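The three phases can be sketched as a setup call, a loop over sections, and a teardown in a finally block so cleanup runs even when a section throws. This is an illustration under assumptions: Lifecycle, Section, runAllSketch, and isE2eMemory are hypothetical names, and the real runAll may be organized differently.

```typescript
type Section = { name: string; run: () => Promise<void> };

interface Lifecycle {
  setup: () => Promise<void>; // sign in, create the E2E-RW and E2E-RO tokens
  sections: Section[]; // web auth, API auth, token CRUD, ...
  teardown: () => Promise<void>; // delete [E2E] memories, revoke tokens
}

// Returns a log of what happened, in order, so the flow is easy to inspect.
async function runAllSketch(l: Lifecycle): Promise<string[]> {
  const log: string[] = [];
  await l.setup();
  log.push("setup");
  try {
    for (const s of l.sections) {
      try {
        await s.run();
        log.push(`ok:${s.name}`);
      } catch {
        // One broken section should not stop the remaining sections.
        log.push(`failed:${s.name}`);
      }
    }
  } finally {
    // Always clean up, even if something above threw unexpectedly.
    await l.teardown();
    log.push("teardown");
  }
  return log;
}

// Teardown only touches records the suite created, identified by title prefix.
function isE2eMemory(title: string): boolean {
  return title.startsWith("[E2E]");
}
```

Scoping teardown to the [E2E] prefix is what makes it safe to point the suite at a server that also holds other data.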

Each individual test goes through a runTest() helper that catches exceptions and returns structured JSON:

{ "name": "create memory", "section": "api-memories", "status": "pass", "durationMs": 84 }

Status can be pass, fail, error, or blocked. That last one matters. If "create memory" fails, the downstream "delete memory" test isn't recorded as a second failure; it returns blocked. That tells me the root cause was upstream, not that delete is broken. Cascading failures hide real problems; blocked surfaces them.
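A minimal sketch of a runTest()-style helper with the four statuses follows. Two things here are assumptions rather than details from the extension: prerequisites are passed in as a dependsOn list of earlier results, and fail is separated from error by a dedicated AssertionFailure class.

```typescript
type Status = "pass" | "fail" | "error" | "blocked";

interface TestResult {
  name: string;
  section: string;
  status: Status;
  durationMs: number;
  message?: string;
}

// Thrown by assertions (hypothetical); any other exception counts as "error".
class AssertionFailure extends Error {}

async function runTest(
  section: string,
  name: string,
  body: () => Promise<void>,
  dependsOn: TestResult[] = [],
): Promise<TestResult> {
  // If any prerequisite did not pass, skip the body and report "blocked".
  const upstream = dependsOn.find((r) => r.status !== "pass");
  if (upstream) {
    const message = `blocked by "${upstream.name}"`;
    return { name, section, status: "blocked", durationMs: 0, message };
  }
  const start = Date.now();
  try {
    await body();
    return { name, section, status: "pass", durationMs: Date.now() - start };
  } catch (e) {
    const status: Status = e instanceof AssertionFailure ? "fail" : "error";
    const message = e instanceof Error ? e.message : String(e);
    return { name, section, status, durationMs: Date.now() - start, message };
  }
}
```

With this shape, "delete memory" would be called with the "create memory" result in dependsOn, so a broken create yields one fail and one blocked instead of two failures.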

The rate limit test is my favorite. I fire 65 concurrent fetch() calls from Deno at a local server with a Rack::Attack limit of 60 requests per minute. All 65 are in flight at once, and the test asserts that at least one came back 429. It passes consistently against the local CI server, which is something I'd never trust against production.
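The burst itself is a small amount of code. The sketch below starts every request before awaiting any of them, then checks the status codes. The names burst, assertRateLimited, and FetchLike are hypothetical, and the fetch function is injected so the logic can be exercised without a server.

```typescript
interface MinimalResponse {
  status: number;
  body?: { cancel(): Promise<void> } | null;
}

type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> },
) => Promise<MinimalResponse>;

// Start `count` requests up front so they are all in flight together,
// then wait for every response and collect the status codes.
async function burst(
  url: string,
  count: number,
  headers: Record<string, string>,
  doFetch: FetchLike,
): Promise<number[]> {
  const inFlight = Array.from({ length: count }, () => doFetch(url, { headers }));
  const responses = await Promise.all(inFlight);
  // Discard bodies so connections are released; only the status matters.
  await Promise.all(responses.map((r) => r.body?.cancel()));
  return responses.map((r) => r.status);
}

// The test passes if at least one request was rejected with 429.
function assertRateLimited(statuses: number[]): void {
  const limited = statuses.filter((s) => s === 429).length;
  if (limited === 0) {
    throw new Error(`expected at least one 429 in ${statuses.length} responses`);
  }
}
```

Against a real server, the global fetch would be passed as doFetch along with an Authorization header for one of the E2E tokens. Sending 65 requests against a limit of 60 leaves a margin of five, so the assertion does not depend on exact timing.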

The results are data, not stdout

This is the part that changed how I think about testing infrastructure.

Most test suites produce output. You run them, you get a pass/fail, it either blocks CI or it doesn't. That's useful, but it's ephemeral. The results don't accumulate. You can't query them. You can't hand them to an LLM and ask "what's been flaky this week?"

Swamp stores results in a data model. After every run:

swamp data get moment-savor-ci-e2e current --json

That's the same command my CI workflow uses to decide whether to block a merge. It's the same command I'd run locally to debug a failure. It's the same command an LLM can use when I paste it into a conversation and ask what went wrong.

I also built a history report extension that reads the last five stored suite results and renders a per-test trend table. When something starts intermittently failing, I can see it across runs instead of guessing.
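A per-test trend table can be built from stored runs with very little code. The sketch below assumes each stored run carries a list of test names and statuses. That StoredRun shape, the single-character symbols, and the trendTable name are illustrative assumptions, not the report extension's actual format.

```typescript
type Status = "pass" | "fail" | "error" | "blocked";

// Assumed shape of one stored suite result.
interface StoredRun {
  runId: string;
  tests: { name: string; status: Status }[];
}

const SYMBOL: Record<Status, string> = { pass: ".", fail: "F", error: "E", blocked: "B" };

// One row per test, one column per run (oldest first).
// "-" means the test did not exist in that run.
function trendTable(runs: StoredRun[], limit = 5): string {
  const recent = runs.slice(-limit);
  const names: string[] = [];
  for (const run of recent) {
    for (const t of run.tests) {
      if (names.indexOf(t.name) === -1) names.push(t.name);
    }
  }
  const width = Math.max(0, ...names.map((n) => n.length));
  return names
    .map((name) => {
      const cells = recent.map((run) => {
        const t = run.tests.find((x) => x.name === name);
        return t ? SYMBOL[t.status] : "-";
      });
      return `${name.padEnd(width)}  ${cells.join(" ")}`;
    })
    .join("\n");
}
```

A row that alternates between "." and "F" across the five columns points to an intermittent test, which a single pass/fail result cannot show.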

The CI setup: a server in a closet I never touch

I'm not spinning up anything in the cloud for this. I have a Linux box in a closet — always on, connected via Tailscale — registered as a GitHub Actions self-hosted runner. It runs the CI Rails server on a separate port from my dev server. Both are long-running Puma processes.

The workflow on every PR:

1. Checks out the PR branch to a dedicated CI clone of the repo
2. Runs scoped RSpec first, for fast unit-level feedback on the API layer
3. Restarts Puma with touch tmp/restart.txt (this relies on Puma's tmp_restart plugin, which the default Rails config/puma.rb enables)
4. Polls the health endpoint via Tailscale until the server is up
5. Runs swamp model method run moment-savor-ci-e2e runAll
6. Checks swamp data get moment-savor-ci-e2e current --json | jq '.failures > 0' and fails the workflow if the result is true
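The gate decision in the last step comes down to one predicate over the stored JSON. Here is a sketch of that predicate, assuming only what the jq filter implies: a top-level numeric failures field. The shouldBlockMerge name and the choice to treat unreadable output as a failure are assumptions.

```typescript
// Assumed shape: the jq filter only implies a numeric `failures` field.
interface SuiteSummary {
  failures: number;
}

// Decide whether the merge should be blocked, given the raw JSON output.
function shouldBlockMerge(rawJson: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    // Unreadable results block the merge rather than letting it through.
    return true;
  }
  const failures = (parsed as Partial<SuiteSummary> | null)?.failures;
  // A missing or non-numeric count is treated as a failure, not a pass.
  return typeof failures !== "number" || failures > 0;
}
```

One detail worth knowing when doing this with jq alone: jq exits 0 whether it prints true or false unless it is run with -e, so the workflow step has to inspect the printed value, or use a form such as jq -e '.failures == 0', which exits non-zero when there are failures.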

No Docker, no container orchestration, no ephemeral build environments. The restart takes a few seconds. The full suite runs in under a minute.
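Because the restart takes a few seconds, the polling step has to tolerate refused connections before it sees a healthy response. A sketch of that loop is below. The names waitUntilHealthy and httpProbe, the attempt count, and the delay are all assumptions, and the probe is injected so the loop can be tested without a server.

```typescript
type Probe = () => Promise<boolean>;

// Retry the probe until it reports healthy; returns the attempt that succeeded.
async function waitUntilHealthy(
  probe: Probe,
  attempts = 30,
  delayMs = 1000,
): Promise<number> {
  for (let i = 1; i <= attempts; i++) {
    // A refused connection while Puma restarts counts as "not up yet".
    const up = await probe().catch(() => false);
    if (up) return i;
    if (i < attempts) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(`server not healthy after ${attempts} attempts`);
}

// Hypothetical probe: healthy means any 2xx response from the given URL.
function httpProbe(url: string): Probe {
  return async () => {
    const res = await fetch(url);
    await res.body?.cancel(); // only the status matters
    return res.ok;
  };
}
```

In the workflow this would be called with something like httpProbe pointed at the CI server's health URL over Tailscale. Recent Rails versions ship a default health check at /up, though whether Moment Savor uses that path is an assumption.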

The server in the closet costs me nothing extra. I'm not paying for CI minutes. I'm using hardware I already own and a Tailscale connection I already had.

The composability I didn't expect

Here's the thing about Swamp extensions: they're just TypeScript modules. The batch-concurrent-fetch pattern I use for the rate limit test — I didn't write that for Moment Savor. I wrote it for a different project, pulled it out, dropped it in here. The same is true for some of the HTTP helper utilities.

This is different from having a Playwright config that lives in one repo. It's different from shell scripts that encode test logic in ephemeral CI steps. The extension model is typed, versioned, and managed by Swamp. I can extend it without breaking existing methods. I can reuse patterns across projects. I can describe its schema and let an agent interact with it.

When I started using Swamp, I thought I was adding a test runner. What I actually built was test infrastructure I own — that accumulates value instead of evaporating after each run.

The manual checklist still exists. But now Swamp runs it.