An Offline Meeting Transcriber Built on Swamp

I take "handwritten" meeting notes in Obsidian when I can. Some of these meetings I'd want transcribed as well, but they are high-trust, private, and sensitive: the exact kind of meeting you don't invite strangers to. So neither the notes nor the recordings should leave my laptop to be processed or stored by strangers.

On a whim, I had Claude Code build offline-meeting-transcriber: audio in, Granola-style markdown out, nothing leaves the MacBook. To be clear, I didn't write any of this by hand — Claude Code did the bash, the Python, and later the swamp models, while I steered, reviewed diffs, and made the design calls. The first version was a bash wrapper around four Python files. It worked. It was also a dead end the moment I wanted to use any piece of it for something else.

I had it migrated to swamp for the same reason I migrated ADW — but this time the lever was composability, not observability. The pieces wanted to be reused. Bash had them welded together.

What It Does

bin/meeting-process samples/standup.m4a "Engineering Standup" runs three stages locally:

mlx-whisper transcribes the audio to segment-level JSON. --no-speech-threshold 0.6 is in there because without it Whisper hallucinates "Thanks for watching!" on every silent stretch.
pyannote/speaker-diarization-3.1 tags each Whisper segment with the speaker whose timeline overlaps it the most. Segments reach the LLM as [SPEAKER_00]: …, [SPEAKER_01]: …. Label stability inside a meeting matters more than getting names right — renaming SPEAKER_00 → Alice is a one-line sed after the fact.
qwen3.6:35b-a3b-nvfp4 on local Ollama summarizes the transcript into a Granola-style note. Chunked at ~2500 tokens with 200-token overlap. The token estimator is len(text.split()) * 1.3 — no tiktoken, no extra dep, accurate enough for chunk boundaries.
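The chunking in the third stage can be sketched roughly as follows. This is a minimal illustration, not the published chunker: it splits on whitespace words rather than on transcript segments, and the function names estimate_tokens and chunk_words are hypothetical. Only the 1.3 multiplier, the ~2500-token budget, and the 200-token overlap come from the description above.

```python
def estimate_tokens(text: str) -> float:
    """Whitespace word count scaled by 1.3, the estimator quoted above."""
    return len(text.split()) * 1.3


def chunk_words(text: str, max_tokens: int = 2500, overlap_tokens: int = 200) -> list[str]:
    """Split text into chunks of roughly max_tokens, each overlapping the previous one.

    Hypothetical helper: works on words, not on Whisper segments.
    """
    words = text.split()
    # Convert token budgets into word counts with the same 1.3 ratio.
    size = int(max_tokens / 1.3)
    overlap = int(overlap_tokens / 1.3)
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the chunk size")

    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of summarizing a little text twice.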

Final markdown lands in ~/Obsidian/Meetings/Unsorted/ for now.

Why Swamp

Three things in the bash version were obviously reusable and obviously stuck:

Transcription. mlx_whisper doesn't care that it's transcribing a meeting. It would transcribe a podcast, an interview, a voice memo I left myself in the car. The bash script only knew about meetings.
Diarization. Same model, same merge logic, same [SPEAKER_xx]: … output. Useful anywhere I have multi-speaker audio.
Summarization with a Granola-style prompt. This one is meeting-shaped, but the chunking + merge infrastructure underneath it isn't. Different prompt, different downstream consumer, same machinery.

In bash, none of those were components. They were lines in a single script, with implicit data passing through filenames in out/, and the only way to "reuse" any of them was to copy-paste the script and edit it. That's the path I've been on for years and the path I keep regretting.

So I had the pipeline broken out into three swamp models under models/@mgreten/, all published on swamp.club:

@mgreten/mlx-whisper: wraps the binary and exposes transcribe(audioPath); the output is a typed transcript artifact.
@mgreten/pyannote-diarizer: takes the audio and the transcript artifact, returns a diarized artifact. Soft-fails when the HF token is missing; the workflow continues on the undiarized transcript. A bad diarization never blocks the note.
@mgreten/meeting-summarizer: takes the transcript artifact and a model tag and returns markdown. A separate write_note method lands the file in the vault. The combine_notes method (handwritten + analysis merge) lives here too.
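The diarizer's soft-fail behavior can be pictured with a small sketch. The function name diarize_or_passthrough, the dict-shaped transcript, and the injected diarize callable are all assumptions made for illustration; the real model's interface may look quite different.

```python
from typing import Callable, Optional

Transcript = dict  # hypothetical shape: {"segments": [...]}


def diarize_or_passthrough(
    transcript: Transcript,
    diarize: Callable[[Transcript, str], Transcript],
    token: Optional[str],
) -> tuple[Transcript, bool]:
    """Return (transcript, was_diarized). Never raises for a missing token or a failed run."""
    if not token:
        # No HF token: hand the undiarized transcript to the next step.
        return transcript, False
    try:
        return diarize(transcript, token), True
    except Exception as exc:  # a bad diarization must not block the note
        print(f"diarization failed, continuing undiarized: {exc}")
        return transcript, False
```

Returning a flag alongside the transcript lets a downstream step record whether speaker labels are present without inspecting the segments.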

Each one has a typed input, a typed output, and exactly one job. The output of each step is a data artifact the next step pulls by name, not a file path I have to remember to clean up:

- name: summarize
  steps:
    - name: run-summarize
      task:
        type: model_method
        modelIdOrName: meeting-summarizer
        methodName: summarize
        inputs:
          transcriptJson: ${{ data.latest("pyannote-diarizer", inputs.noteName).attributes.transcriptJson }}
          instanceName: ${{ inputs.noteName }}
  dependsOn:
    - job: diarize
      condition:
        type: succeeded

The workflow YAML is one way of wiring those three models. It's not the only way. That's the whole point.

What Composability Actually Buys Me

Hermes is the obvious next caller. When I want my agent to be able to transcribe a recording I just dropped into it, I don't expose bin/meeting-process to it as a shell command. I expose the model. Same typed inputs, same typed outputs, no shell quoting, no parsing stdout. The bash wrapper still exists for me at the terminal — it shells out to swamp now and gives me a progress counter — but it's no longer the only entry point.

The watch-folder daemon is the next one after that. v2 territory, not built. But when I build it, it's calling the workflow, not re-implementing it.

This is the part bash couldn't give me. Not "the pipeline is observable." The pipeline is separable. The day I want to reuse the diarizer in something that has nothing to do with meetings, I'm not copy-pasting anything.

The --notes Flag Is the Real Design Insight

The agents almost shipped a tool that overwrote my handwritten meeting notes.

The first version they built generated the meeting note. Beautiful, clean, full of action items. The problem: I take "handwritten" notes during meetings, and those notes have the context the audio doesn't — the side conversation, the link I scribbled, the thing I almost said. The generated note has different value. It's complete where mine is partial, but it's also confidently wrong in places my human note isn't. That's the kind of call no agent was going to make for me; I had to catch it in review and redirect.

So --notes was added. When you pass it an existing note, the pipeline:

Leaves the handwritten note untouched.
Writes the LLM summary as <base name> Analysis.md.
Writes a <base name> - Combined Notes.md that links to the original, embeds the handwritten content, nests the analysis underneath, and leaves placeholder headers for personal observations and next steps.

Three files instead of one. The human note stays canonical. The analysis is a sibling, not a replacement. This is also a composability story: the combine step is its own method (combine_notes), callable on any analysis + existing-note pair, not a special case buried in the meeting pipeline.
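A rough sketch of the three-file layout, under stated assumptions: the function names, the section headings, and the Obsidian-style [[wikilink]] are illustrative guesses, while the two output file names follow the patterns listed above.

```python
from pathlib import Path


def output_paths(note: Path) -> tuple[Path, Path]:
    """Sibling paths for the analysis and combined files; the original is never written."""
    base = note.stem
    return (
        note.with_name(f"{base} Analysis.md"),
        note.with_name(f"{base} - Combined Notes.md"),
    )


def render_combined(note: Path, handwritten: str, analysis: str) -> str:
    """Link to the original, embed it, nest the analysis, leave placeholder headers."""
    return "\n".join([
        f"Original: [[{note.stem}]]",  # wikilink syntax is an assumption
        "",
        "## Handwritten Notes",        # heading text is hypothetical
        handwritten.strip(),
        "",
        "## Analysis",
        analysis.strip(),
        "",
        "## Personal Observations",    # placeholder, left empty
        "",
        "## Next Steps",               # placeholder, left empty
        "",
    ])
```

Because neither function opens the original note for writing, the handwritten file stays canonical by construction.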

What I Got Wrong

Two things are in the friction log because they should be.

The v1.5 LOC budget overshot. The plan capped v1 at 250 lines. v1.5 is 383 across bin/ + src/ + prompts/. Most of the overage is wiring around pyannote — the gated HF model, the soft-fail path, the --turns-json test escape hatch that lets me feed synthetic turns when I don't want to download an 80 MB checkpoint just to run a unit test. None of that complexity is in the chunker or the summarizer. It's the price of treating pyannote as a real component instead of a shell call. Worth it, but I should stop being surprised by it.

I shipped v1.5 without exercising pyannote end-to-end on a real meeting. The HF token wasn't provisioned. The merge logic (Whisper segment ↔ pyannote turn, max-overlap assignment) was verified against a synthetic 2-speaker JSON. There's a 22-second 2-speaker fixture on disk (samples/test-2speakers.m4a, alternating say -v Alex and say -v Samantha, ffmpeg-concat) waiting for the first token-equipped run. The friction log entry is there partly so I don't forget that the test passed and the system is still unproven.
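For reference, max-overlap assignment against synthetic turns can be sketched like this. It is an illustration of the technique, not the shipped merge code: the dict keys (start, end, text, speaker) and the SPEAKER_UNKNOWN fallback label are assumptions.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: list[dict], turns: list[dict],
                    unknown: str = "SPEAKER_UNKNOWN") -> list[dict]:
    """Tag each transcript segment with the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in segments:
        best, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best, best_overlap = turn["speaker"], shared
        labeled.append({**seg, "speaker": best})
    return labeled


def render(labeled: list[dict]) -> list[str]:
    """Format segments the way they reach the LLM: [SPEAKER_xx]: text."""
    return [f"[{s['speaker']}]: {s['text'].strip()}" for s in labeled]


# Synthetic two-speaker turns, in the spirit of the test described above.
turns = [
    {"start": 0.0, "end": 5.0, "speaker": "SPEAKER_00"},
    {"start": 5.0, "end": 11.0, "speaker": "SPEAKER_01"},
]
segments = [
    {"start": 0.5, "end": 4.0, "text": " Morning, everyone."},
    {"start": 4.0, "end": 9.0, "text": " Quick status from me."},  # straddles both turns
]
lines = render(assign_speakers(segments, turns))
```

The second segment shares one second with SPEAKER_00 and four with SPEAKER_01, so it goes to SPEAKER_01. A test like this checks the arithmetic only; it says nothing about how well real pyannote turns line up with real Whisper segments.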

What I'm Building Next

Hermes as a caller. Wire the agent to invoke the workflow directly instead of shelling out. First real test of whether the composability story holds outside my own terminal.
Watch folder. Drop an .m4a in a directory, the workflow picks it up, processes it, writes the note. v2 territory.
Real multi-person meeting validation. The 4-person diarization test from the original plan is still owed. I won't trust the speaker labels until I've seen them hold on a 60-minute meeting with people I know. I'm testing this right now, actually.

The offline-by-default constraint forced the components to be the unit. There was no cloud API to lean on, and once I'd built local pieces for transcribe, diarize, and summarize, swamp let me keep them separate without paying the integration tax every time I wired them together. The same three models will power the watch folder, the agent caller, and whatever else I haven't thought of yet. That's the part I couldn't get from a script.

Addendum — May 29, 2026: MPS Acceleration and a 10× Diarization Speedup

After the first real end-to-end run on a two-hour Elder's Meeting recording, two problems surfaced and got fixed the same day.

The m4a sample-count bug. pyannote's pipeline crops audio into 10-second chunks and asserts the chunk contains exactly the expected number of samples. AAC-encoded m4a files have encoder-delay priming samples baked into the container, so the first decoded chunk comes back a few hundred samples short and pyannote throws a ValueError. The fix: transcode to a temporary 16 kHz mono WAV via ffmpeg before handing the file to pyannote. The WAV is cleaned up after diarization. The fix went into @mgreten/pyannote-diarizer v2026.05.28.4.
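A minimal sketch of that transcode step, assuming ffmpeg is on the PATH. The helper names wav_command and as_temp_wav are hypothetical, and the 16-bit PCM codec is an assumption: the text above only specifies 16 kHz mono WAV.

```python
import subprocess
import tempfile
from contextlib import contextmanager
from pathlib import Path


def wav_command(src: Path, dst: Path) -> list[str]:
    """Build the ffmpeg invocation: mono, 16 kHz, 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y", "-loglevel", "error",
        "-i", str(src),
        "-ac", "1",          # one channel (mono)
        "-ar", "16000",      # 16 kHz sample rate
        "-c:a", "pcm_s16le", # codec choice is an assumption
        str(dst),
    ]


@contextmanager
def as_temp_wav(src: Path):
    """Yield a temporary WAV copy of src; the file is removed when the block exits."""
    with tempfile.TemporaryDirectory() as tmp:
        dst = Path(tmp) / f"{src.stem}.wav"
        subprocess.run(wav_command(src, dst), check=True)
        yield dst
```

Usage would look like `with as_temp_wav(Path("meeting.m4a")) as wav: ...`, with the diarizer reading wav inside the block. Decoding to PCM up front sidesteps the container-level priming samples, so every chunk pyannote crops has the sample count it expects.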

CPU diarization was painfully slow. Diarizing two hours of audio on CPU took 61 minutes, about half the recording's length (a real-time factor of roughly 0.5). The fix was one line in the Python helper: pipe.to(torch.device("mps")), with an auto default that prefers MPS on Apple Silicon and falls back to CPU everywhere else. The result on an M5 Max:

Diarization: 61 min → 6 min (~10× faster)
Full pipeline (transcribe + diarize + summarize + write): ~67 min → ~12 min
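The auto device default can be sketched as below. The function name pick_device and the "auto" sentinel are assumptions about how the helper is shaped; torch.backends.mps.is_available() is the standard PyTorch check for the Metal backend.

```python
def pick_device(requested: str = "auto") -> str:
    """Resolve a device name: honor an explicit choice, else prefer MPS, else CPU."""
    if requested != "auto":
        return requested
    try:
        import torch
    except ImportError:
        # Without torch there is nothing to accelerate; report CPU.
        return "cpu"
    return "mps" if torch.backends.mps.is_available() else "cpu"


# Usage with a loaded pyannote pipeline (not constructed here):
#   pipe.to(torch.device(pick_device()))
```

Keeping an explicit override is useful when an operator turns out to be unsupported on MPS and a run needs to be forced back onto the CPU.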

Summarization also switched from qwen3.6:35b-a3b-nvfp4 to gemma4:26b-mlx-bf16. The MoE model has ~3.8B active parameters per token at nvfp4 precision; the Gemma 4 model is dense at 26B and runs in full bf16. The result is better prose and slightly longer output. Both run locally via Ollama.

All three changes went into the published swamp extensions and were pulled into the consumer repo the same session. That's the composability story paying off in maintenance terms, not just in new features: the fix lived in one place, got versioned, and was one swamp extension pull away from being live.