I take "handwritten" meeting notes in Obsidian when I can. Some of these meetings I'd want transcribed as well, but they are high-trust, private, and sensitive: exactly the kind of meeting you don't invite strangers to. So neither the notes nor the recordings should leave my laptop, whether to be processed or to be stored by strangers.
On a whim, I had Claude Code build offline-meeting-transcriber: audio in, Granola-style markdown out, nothing leaves the MacBook. To be clear, I didn't write any of this by hand — Claude Code did the bash, the Python, and later the swamp models, while I steered, reviewed diffs, and made the design calls. The first version was a bash wrapper around four Python files. It worked. It was also a dead end the moment I wanted to use any piece of it for something else.
I had it migrated to swamp for the same reason I migrated ADW — but this time the lever was composability, not observability. The pieces wanted to be reused. Bash had them welded together.
What it Does
bin/meeting-process samples/standup.m4a "Engineering Standup" runs three stages locally:
- mlx-whisper transcribes the audio to segment-level JSON. --no-speech-threshold 0.6 is in there because without it Whisper hallucinates "Thanks for watching!" on every silent stretch.
- pyannote/speaker-diarization-3.1 tags each Whisper segment with the speaker whose timeline overlaps it the most. Segments reach the LLM as [SPEAKER_00]: …, [SPEAKER_01]: …. Label stability inside a meeting matters more than getting names right; renaming SPEAKER_00 → Alice is a one-line sed after the fact.
- qwen3.6:35b-a3b-nvfp4 on local Ollama summarizes the transcript into a Granola-style note. Chunked at ~2500 tokens with 200-token overlap. The token estimator is len(text.split()) * 1.3: no tiktoken, no extra dep, accurate enough for chunk boundaries.
Final markdown lands in ~/Obsidian/Meetings/Unsorted/ for now.
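For illustration, here is a minimal sketch of how the token estimate and the overlapping chunker described above could fit together. It is not the code from the repo: the function names, the choice to chunk on whole transcript lines, and the truncation to an integer are all assumptions. Only the 1.3 multiplier, the ~2500-token budget, and the 200-token overlap come from the description.

```python
def estimate_tokens(text: str) -> int:
    """Rough token count: whitespace-separated words times 1.3 (truncated)."""
    return int(len(text.split()) * 1.3)


def chunk_transcript(lines, max_tokens=2500, overlap_tokens=200):
    """Group transcript lines into chunks of roughly max_tokens.

    Each new chunk starts with the trailing lines of the previous one,
    up to about overlap_tokens, so context carries across the boundary.
    Chunking on whole lines is an assumption made for this sketch.
    """
    chunks = []
    current = []
    current_tokens = 0
    for line in lines:
        line_tokens = estimate_tokens(line)
        if current and current_tokens + line_tokens > max_tokens:
            chunks.append(current)
            # Walk backwards through the finished chunk to collect the overlap.
            carried = []
            carried_tokens = 0
            for prev in reversed(current):
                prev_tokens = estimate_tokens(prev)
                if carried_tokens + prev_tokens > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_tokens += prev_tokens
            current = carried
            current_tokens = carried_tokens
        current.append(line)
        current_tokens += line_tokens
    if current:
        chunks.append(current)
    return chunks
```

Because the estimate only needs to be good enough to place boundaries, a chunk can run slightly over budget right after the overlap is carried in; the sketch accepts that rather than splitting a speaker's line in half.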
Why Swamp
Three things in the bash version were obviously reusable and obviously stuck:
- Transcription. mlx_whisper doesn't care that it's transcribing a meeting. It would transcribe a podcast, an interview, a voice memo I left myself in the car. The bash script only knew about meetings.
- Diarization. Same model, same merge logic, same [SPEAKER_xx]: … output. Useful anywhere I have multi-speaker audio.
- Summarization with a Granola-style prompt. This one is meeting-shaped, but the chunking + merge infrastructure underneath it isn't. Different prompt, different downstream consumer, same machinery.
In bash, none of those were components. They were lines in a single script, with implicit data passing through filenames in out/, and the only way to "reuse" any of them was to copy-paste the script and edit it. That's the path I've been on for years and the path I keep regretting.
So I had the pipeline broken out into three swamp models under models/@mgreten/, all published on swamp.club:
- @mgreten/mlx-whisper: wraps the binary, exposes transcribe(audioPath), output is a typed transcript artifact.
- @mgreten/pyannote-diarizer: takes the audio and the transcript artifact, returns a diarized artifact. Soft-fails when the HF token is missing; the workflow continues on the undiarized transcript. A bad diarization never blocks the note.
- @mgreten/meeting-summarizer: takes the transcript artifact and a model tag, returns markdown. A separate write_note method lands the file in the vault. The combine_notes method (handwritten + analysis merge) lives here too.
Each one has a typed input, a typed output, and exactly one job. The output of each step is a data artifact the next step pulls by name, not a file path I have to remember to clean up:
- name: summarize
  steps:
    - name: run-summarize
      task:
        type: model_method
        modelIdOrName: meeting-summarizer
        methodName: summarize
        inputs:
          transcriptJson: ${{ data.latest("pyannote-diarizer", inputs.noteName).attributes.transcriptJson }}
          instanceName: ${{ inputs.noteName }}
  dependsOn:
    - job: diarize
      condition:
        type: succeeded
The workflow YAML is one way of wiring those three models. It's not the only way. That's the whole point.
What Composability Actually Buys Me
Hermes is the obvious next caller. When I want my agent to be able to transcribe a recording I just dropped into it, I don't expose bin/meeting-process to it as a shell command. I expose the model. Same typed inputs, same typed outputs, no shell quoting, no parsing stdout. The bash wrapper still exists for me at the terminal — it shells out to swamp now and gives me a progress counter — but it's no longer the only entry point.
The watch-folder daemon is the next one after that. v2 territory, not built. But when I build it, it's calling the workflow, not re-implementing it.
This is the part bash couldn't give me. Not "the pipeline is observable." The pipeline is separable. The day I want to reuse the diarizer in something that has nothing to do with meetings, I'm not copy-pasting anything.
The --notes Flag is the Real Design Insight
The agents almost shipped a tool that overwrote my handwritten meeting notes.
The first version they built generated the meeting note. Beautiful, clean, full of action items. The problem: I take "handwritten" notes during meetings, and those notes have the context the audio doesn't — the side conversation, the link I scribbled, the thing I almost said. The generated note has different value. It's complete where mine is partial, but it's also confidently wrong in places my human note isn't. That's the kind of call no agent was going to make for me; I had to catch it in review and redirect.
So --notes was added. When you pass it an existing note, the pipeline:
- Leaves the handwritten note untouched.
- Writes the LLM summary as <base name> Analysis.md.
- Writes a <base name> - Combined Notes.md that links to the original, embeds the handwritten content, nests the analysis underneath, and leaves placeholder headers for personal observations and next steps.
Three files instead of one. The human note stays canonical. The analysis is a sibling, not a replacement. This is also a composability story: the combine step is its own method (combine_notes on the summarizer model), callable on any analysis + existing-note pair, not a special case buried in the meeting pipeline.
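To make the three-file layout concrete, here is a rough sketch of what a combine step along these lines might do. It is an illustration under assumptions, not the published combine_notes method: the function names, the section headings, and the use of an Obsidian wikilink for the link back to the original are all hypothetical. The two file name patterns are the ones described above.

```python
from pathlib import Path


def build_combined_note(base_name: str, handwritten: str, analysis: str) -> str:
    """Assemble the combined note body. Heading names are assumptions."""
    sections = [
        f"Original note: [[{base_name}]]",  # Obsidian-style internal link
        "## Handwritten Notes",
        handwritten.strip(),
        "## Analysis",
        analysis.strip(),
        "## Personal Observations",
        "## Next Steps",
    ]
    return "\n\n".join(sections) + "\n"


def write_note_outputs(handwritten_path: Path, analysis: str) -> dict:
    """Write the two sibling files; the handwritten note is only ever read."""
    base_name = handwritten_path.stem
    folder = handwritten_path.parent
    paths = {
        "original": handwritten_path,
        "analysis": folder / f"{base_name} Analysis.md",
        "combined": folder / f"{base_name} - Combined Notes.md",
    }
    handwritten = handwritten_path.read_text(encoding="utf-8")
    paths["analysis"].write_text(analysis, encoding="utf-8")
    paths["combined"].write_text(
        build_combined_note(base_name, handwritten, analysis), encoding="utf-8"
    )
    return paths
```

The property that matters is that the original path is never opened for writing, so nothing the summarizer produces can clobber the human note.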
What I Got Wrong
Two things are in the friction log because they should be.
The v1.5 LOC budget overshot. The plan capped v1 at 250 lines. v1.5 is 383 across bin/ + src/ + prompts/. Most of the overage is wiring around pyannote — the gated HF model, the soft-fail path, the --turns-json test escape hatch that lets me feed synthetic turns when I don't want to download an 80 MB checkpoint just to run a unit test. None of that complexity is in the chunker or the summarizer. It's the price of treating pyannote as a real component instead of a shell call. Worth it, but I should stop being surprised by it.
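A sketch of what that wiring might look like, to show why it costs lines. Everything named here is hypothetical: resolve_turns, the run_diarizer callable standing in for the real pyannote call, and the shape of the turns list are assumptions for illustration. The behaviour it models is the one described above: synthetic turns short-circuit the model, a missing token skips diarization, and a failure never blocks the note.

```python
import json
import sys
from typing import Callable, Optional


def resolve_turns(
    audio_path: str,
    hf_token: Optional[str],
    run_diarizer: Callable[[str, str], list],
    turns_json: Optional[str] = None,
) -> Optional[list]:
    """Return speaker turns, or None when diarization has to be skipped.

    run_diarizer is a stand-in for the real pyannote call (assumption).
    """
    # Test escape hatch: load synthetic turns instead of running the model.
    if turns_json is not None:
        with open(turns_json, encoding="utf-8") as handle:
            return json.load(handle)

    # Soft-fail: without a token the gated model cannot be downloaded.
    if not hf_token:
        print("diarization skipped: no Hugging Face token", file=sys.stderr)
        return None

    # A bad diarization should never block the note.
    try:
        return run_diarizer(audio_path, hf_token)
    except Exception as exc:
        print(f"diarization failed, continuing undiarized: {exc}", file=sys.stderr)
        return None
```

A caller treats None as "use the undiarized transcript" and carries on to summarization.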
I shipped v1.5 without exercising pyannote end-to-end on a real meeting. The HF token wasn't provisioned. The merge logic (Whisper segment ↔ pyannote turn, max-overlap assignment) was verified against a synthetic 2-speaker JSON. There's a 22-second 2-speaker fixture on disk (samples/test-2speakers.m4a, alternating say -v Alex and say -v Samantha, ffmpeg-concat) waiting for the first token-equipped run. The friction log entry is there partly so I don't forget that the test passed and the system is still unproven.
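For reference, max-overlap assignment can be sketched in a few lines. This is an illustrative version under assumptions, not the verified merge code: the function names and the dict shape of the turns are hypothetical, and the segments are assumed to carry the start, end, and text keys that Whisper-style output uses.

```python
def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each segment with the speaker whose turns overlap it the most."""
    labeled = []
    for seg in segments:
        totals = {}
        for turn in turns:
            shared = overlap_seconds(
                seg["start"], seg["end"], turn["start"], turn["end"]
            )
            if shared > 0:
                speaker = turn["speaker"]
                totals[speaker] = totals.get(speaker, 0.0) + shared
        # No overlapping turn at all: leave the segment unlabeled.
        best = max(totals, key=totals.get) if totals else None
        labeled.append({**seg, "speaker": best})
    return labeled


def format_line(segment):
    """Render a labeled segment the way it reaches the LLM."""
    label = segment["speaker"] or "UNKNOWN"  # fallback label is an assumption
    return f"[{label}]: {segment['text'].strip()}"


# Synthetic two-speaker check, similar in spirit to the JSON fixture.
turns = [
    {"start": 0.0, "end": 5.0, "speaker": "SPEAKER_00"},
    {"start": 5.0, "end": 10.0, "speaker": "SPEAKER_01"},
]
segments = [
    {"start": 0.5, "end": 4.0, "text": " hello"},
    {"start": 4.0, "end": 8.0, "text": " hi there"},
]
for seg in assign_speakers(segments, turns):
    print(format_line(seg))
```

A synthetic check like this exercises the interval arithmetic only. It says nothing about whether real pyannote turns are clean enough for the assignment to hold, which is exactly the gap the friction log entry records.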
What I'm Building Next
- Hermes as a caller. Wire the agent to invoke the workflow directly instead of shelling out. First real test of whether the composability story holds outside my own terminal.
- Watch folder. Drop an .m4a in a directory, the workflow picks it up, processes it, writes the note. v2 territory.
- Real multi-person meeting validation. The 4-person diarization test from the original plan is still owed. I won't trust the speaker labels until I've seen them hold on a 60-minute meeting with people I know. Testing this right now actually.
The offline-by-default constraint forced the components to be the unit. There was no cloud API to lean on, and once I'd built local pieces for transcribe, diarize, and summarize, swamp let me keep them separate without paying the integration tax every time I wired them together. The same three models will power the watch folder, the agent caller, and whatever else I haven't thought of yet. That's the part I couldn't get from a script.