WAL rotation without a coordinator
A design note on KMQ, a homelab message broker built from FIFOs, gawk and Kubernetes primitives. Earlier notes covered the original FIFO pipeline, the ring-buffer block policy, the out-of-band ACK sidecar and the dual-ACK feedback loop with TLS. The private OCI registry and pull-through cache sits behind all of them.
The ACK note queued log rotation as future work. The TLS note listed unbounded log growth as one of the remaining boundaries: both append.log and processed.log grew forever, the CRD had retention fields and the implementation was not in tree. This note closes that boundary for append.log and is honest about the part still deferred.
It covers two planes of the same work. The first is the mechanism: what was built and why, with Kleppmann as the conceptual frame. The second is the versatility: the claim that retention-by-consumption is a deliberate feature rather than a limitation, shown by two unrelated use cases running on the one mechanism.
1. The problem
append.log grew without bound under sustained throughput. The v0.4.4 throughput suite documented file-size-dependent variance: as the log got larger, benchmarks stopped being comparable across runs because the cost of touching the file changed underneath them. A SUBSCRIBE with FROM_START against a 272k-line topic took 14.7 seconds before the first byte reached the consumer, because the consumer-side reader walks the log from the top.
Three operational consequences, all measurable. Disk fills. Benchmarks lose comparability across runs. FROM_START latency degrades with log size. None of these is exotic. They are the standard reasons every log-based system eventually grows a rotation story.
2. What Kleppmann’s book gave us and what it did not
The conceptual scaffolding came from Designing Data-Intensive Applications by Martin Kleppmann, specifically Chapter 11 on stream processing and log-based message brokers and Chapter 3 on log-structured storage and segment-based reclamation.
The book provides the model KMQ already inhabits. A log-based message broker uses an append-only log as the durability substrate, consumers track offsets and retention is based on time or size rather than on delivery confirmation. Kafka and Amazon Kinesis are the canonical examples. Segmented logs split the log into segments, treat sealed segments as immutable and reclaim by deleting whole segments. Consumer offset tracking is the mechanism by which the broker knows what has been read.
The book also describes something KMQ deliberately does not do. Log compaction in the Kafka sense retains only the latest value per key in an update stream. KMQ’s append.log has no key-update semantics. Every message is an independent event, not a new value for an existing key, so there is no “latest value” to retain and nothing to compact. Compaction is the wrong primitive here.
This distinction matters enough to name explicitly, because a reader who knows Kleppmann will assume the Kafka meaning of the word otherwise. Rotation moves sealed segments aside. Compaction retains the latest value per key. KMQ does the first and not the second. By Kleppmann’s taxonomy it is a log-based broker with rotation but without compaction. The book named the design space. The implementation decided where on that map this particular broker sits.
3. The design pivot
The honest version, because these notes own their wrong turns.
The initial rotation plan was an external sidecar container watching file sizes and coordinating with the durability stage through a signal or a filesystem flag. Self-rotation inside durability.awk was flagged in planning as not recommended, on the grounds that it contaminates the hot path.
The implementation went with self-rotation anyway, and the justification turned out cleaner than the initial caution.
durability.awk is the single writer of append.log. That is a load-bearing invariant across the whole pipeline, enforced by replicas: 1 and the Recreate strategy. Self-rotation has zero coordination cost: no signal, no flag and no race window between a rotator and the writer, because the writer rotates itself. A sidecar would have had to coordinate with the single writer, which is exactly the complexity self-rotation eliminates. The size check happens at the batch boundary, where the file is already being closed and reopened for the tier-1 persistence flush, so it does not move the hot-path needle in any way the throughput suite can see.
The lesson is worth stating plainly. When a single-writer invariant already holds, the “do not contaminate the hot path” instinct can be overcautious. The contamination is real but minor. The coordination it would have cost is real and large. The reversal made the design simpler, not worse.
4. CUT and RECLAIM
The design separates two operations that look the same from outside but answer different questions.
| | CUT | RECLAIM |
|---|---|---|
| Question | When to seal the active segment | When to delete a sealed segment |
| Trigger | Size or age threshold | Consumption watermark |
| Fires | Always, when threshold is reached | Only when every route has consumed past the segment |
| Effect on consumers | None | A lagging consumer retains segments until it catches up |
| Failure if confused | n/a | Gating the cut here would starve the pipeline |
| Data safety | Loss-free | Deletes only fully consumed data |
CUT seals the active segment and starts a fresh one. The sequence is loss-free:
# CUT, fired at a batch boundary when the size or age threshold is reached
close(APPEND_LOG) # release the file descriptor
mv append.log -> append.log.<n> # seal the segment under a new name
gzip append.log.<n> & # compress in the background, off the hot path
reopen APPEND_LOG for append # fresh active segment, writing resumes
The cut keeps append.log from growing without bound regardless of what any consumer is doing. That is the whole point of separating it from reclaim.
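The sequence above can be sketched in plain shell. This is an illustration under stated assumptions, not the code in durability.awk: the function name cut_if_needed, the byte threshold argument and the text of the comment marker are all hypothetical, and the age threshold is left out for brevity. The six-digit segment suffix follows the listing in this note.

```shell
#!/bin/sh
# Sketch of CUT in shell. Assumption: in KMQ this logic lives inside
# durability.awk; cut_if_needed and the marker text are illustrative only.

# cut_if_needed DIR MAX_BYTES N
# Seals DIR/append.log as DIR/append.log.<N>.gz once it holds MAX_BYTES or
# more. Returns 0 when a cut happened, 1 when the threshold was not reached.
cut_if_needed() {
    log="$1/append.log"
    size=$(( $(wc -c < "$log") ))        # arithmetic strips any padding from wc
    [ "$size" -ge "$2" ] || return 1     # CUT never consults consumer state

    sealed=$(printf '%s.%06d' "$log" "$3")
    mv "$log" "$sealed"                  # seal: rename is atomic on one filesystem
    gzip "$sealed" &                     # compress in the background
    # Fresh active segment; the marker text here is a made-up placeholder.
    printf '# segment opened after cut %06d\n' "$3" > "$log"
    return 0
}
```

A caller would invoke this at each batch boundary and `wait` for the background gzip before exiting. Nothing in the function looks at a cursor, which is the property the table insists on.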
RECLAIM deletes a sealed .gz segment from disk. It is gated by the consumption watermark, which is the minimum per-route processed.cursor sequence number across the active routes. A segment is reclaimed only when its top sequence number is at or below the watermark, meaning every route has consumed past it.
append.log             active segment, being written
append.log.000003.gz   sealed, top seq 30000   reclaimable when watermark >= 30000
append.log.000002.gz   sealed, top seq 20000   reclaimable when watermark >= 20000
append.log.000001.gz   sealed, top seq 10000   reclaimed, watermark already passed it

watermark = min(processed.cursor seq) over active routes
The naive design gates the cut on consumption, or simply never cuts while a consumer lags. That fails in a specific way. A slow consumer turns rotation off, append.log grows without bound again and the original problem returns wearing a different hat. Gating only the reclaim is the correct decomposition: the cut bounds the active file unconditionally, the reclaim refuses to delete anything a route has not yet read. A stuck consumer costs disk, not data and not throughput.
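The reclaim gate reduces to one comparison per sealed segment. A minimal sketch, assuming the top sequence number of each segment is available as a "topseq name" line on standard input; how KMQ actually records the top sequence is not stated in this note, and the function name reclaimable is hypothetical.

```shell
#!/bin/sh
# reclaimable WATERMARK
# Reads "top_seq segment_name" lines on stdin and prints the names of the
# segments whose top sequence is at or below WATERMARK.
# Assumption: the caller supplies the top sequence per segment.
reclaimable() {
    awk -v wm="$1" '$1 + 0 <= wm + 0 { print $2 }'
}

# With the watermark at 20000, the first two segments are safe to delete
# and the third is retained for the lagging route.
printf '%s\n' \
    '10000 append.log.000001.gz' \
    '20000 append.log.000002.gz' \
    '30000 append.log.000003.gz' | reclaimable 20000
```

A watermark of 0, which is what a route that has consumed nothing would produce, selects no segments at all. That is the "costs disk, not data" case.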
5. The per-route cursor as the watermark
The per-route processed.cursor was not built for rotation. It was introduced earlier to give O(1) ACK reads, bypassing a slow path that scanned a log to answer “has sequence N been processed”. One small file per route, holding the last processed sequence number.
With rotation, the same file becomes the consumption watermark. Take the minimum cursor sequence across all active routes and that is the floor under which a reclaim is safe. No new bookkeeping, no new file and no second source of truth about what has been consumed.
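Computing the floor is a minimum over small files. The sketch below assumes one directory per route, each holding a processed.cursor file with a single integer; that directory layout and the function name watermark are assumptions made for illustration.

```shell
#!/bin/sh
# watermark ROUTES_DIR
# Prints the minimum sequence number across ROUTES_DIR/*/processed.cursor.
# Prints nothing when no cursor file exists.
# Assumption: one subdirectory per active route, one integer per cursor file.
watermark() {
    cat "$1"/*/processed.cursor 2>/dev/null | sort -n | head -n 1
}
```

Given cursors of 120, 45 and 300 on three routes, the watermark is 45, so the slowest route alone decides what may be reclaimed. An empty result is best treated by the caller as "reclaim nothing".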
The reuse is the point worth highlighting. A primitive built for one purpose turned out to be the natural primitive for another. That is the quiet payoff of composable files over time. When the durable state is just a line in a file readable with cat, a second consumer of that state costs nothing to add.
6. The behavior that looked like a bug
The retention behavior first showed up as an apparent leak. Segments were not being reclaimed, disk was growing and a route with no consumer kept its segments forever. The instinct was to treat it as a bug to fix.
It is not a leak. It is a choice. Reclamation is gated on consumption on purpose, because both alternatives are worse. Gating the cut starves the pipeline under load, as Section 4 shows. Reclaiming unconsumed data deletes messages nobody has read yet, which is data loss dressed up as housekeeping. Retaining a stuck route’s segments until they are consumed is the only option that loses nothing, and it is the one the design takes.
The verification surfaced one genuine bug, worth reporting because it is the kind that only appears under rotation. The cut writes a comment marker as the first line of each fresh segment. The O(1) ACK read took the tail line of the active segment to find the durable sequence, and under load the read sometimes crossed the window right after a cut, saw the comment marker and returned sequence 0. An end-to-end loop then locked on 0. The fix was small: read the last several lines and skip comments rather than trusting the literal tail. Verified on cluster on 2026-05-31, with rotation enabled, two consecutive contract runs clean. The bug was real, transient and now closed. The retention behavior was never the bug. The tail read was.
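The shape of that fix can be sketched as follows. The assumptions are loud ones: that a record line starts with its sequence number as the first whitespace-separated field, that comment markers start with `#`, and that eight lines is a wide enough window. The function name last_durable_seq is hypothetical. When the window holds no record at all, as in a segment that contains only the marker, the sketch fails rather than answering 0, so a caller can keep its previous value.

```shell
#!/bin/sh
# last_durable_seq FILE [WINDOW]
# Prints the sequence number of the newest non-comment line within the last
# WINDOW lines of FILE (default 8). Exits non-zero and prints nothing when
# the window contains no record.
# Assumptions: records begin with the sequence number; comments begin with '#'.
last_durable_seq() {
    tail -n "${2:-8}" "$1" | awk '
        /^#/ || NF == 0 { next }        # skip comment markers and blank lines
        { seq = $1; found = 1 }         # remember the newest record seen so far
        END { if (!found) exit 1; print seq }'
}
```

The literal-tail version corresponds to a window of one line with no comment filter, which is exactly the read that returned 0 after a cut.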
The proof that the boundary was chosen rather than tripped over is that the same mechanism, unchanged, serves two unrelated real use cases.
7. Use case one: an SMPTE timecode bus
SMPTE timecode in its longitudinal form, LTC, is a synchronization stream recorded on its own track parallel to the content tracks. It exists to be read by an external synchronizer that does not consume the content. The standard is SMPTE 12M. The timecode itself dates to the late 1960s, with a SMPTE committee formed around 1969 to end the chaos of incompatible editing-machine codes, and the formal ANSI/SMPTE 12M specification approved in 1975, later split into 12M-1 and 12M-2 in 2008. The relevant property for this note is structural, not historical: it is the canonical case of a side channel read by an external asynchronous device that is not the content consumer.
KMQ’s retained route is the same shape. Timecode frames go on a route named smpte, are retained in the WAL across rotations and an external reader consumes them at its own cadence. That reader is a sync slave, the kind of thing you would drive off a GPIO line on a small embedded board. The retention is the timecode-bus guarantee: the sync track has to survive until the reader has it, which is exactly retention-by-consumption.
The homelab angle is the honest one. This will never drive a real tape deck, and that is the point. The pattern is real, the use case is real and the broker carries it without knowing the bytes are timecode. A 1969 sync protocol carried by an AWK broker on Kubernetes and read out to a microcontroller is vaporwave that actually compiles.
The scenario, usecase-smpte-timecode.sh, verified on cluster: 750 frames in HH:MM:SS:FF form at 25 frames per second starting from 10:00:00:00, retained through a rotation, then read FROM_START by the synchronizer. It asserts coverage, meaning no dropped frames and monotonic first-appearance order, because a sync bus that delivers out of order cannot lock. Span verified from 10:00:00:00 to 10:00:29:24.
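The arithmetic behind that span is easy to reproduce. This is a sketch of a non-drop-frame generator, not the contents of usecase-smpte-timecode.sh; the function name tc_frames is hypothetical and the sketch does not wrap the hour past 23.

```shell
#!/bin/sh
# tc_frames COUNT FPS START_HOUR
# Prints COUNT consecutive HH:MM:SS:FF timecodes at FPS frames per second,
# starting at START_HOUR:00:00:00. Non-drop-frame counting only.
tc_frames() {
    awk -v n="$1" -v fps="$2" -v h0="$3" 'BEGIN {
        for (i = 0; i < n; i++) {
            f = i % fps                 # frame within the current second
            s = int(i / fps)            # whole seconds elapsed
            printf "%02d:%02d:%02d:%02d\n",
                h0 + int(s / 3600), int(s / 60) % 60, s % 60, f
        }
    }'
}

# First and last frame of the 750-frame run:
tc_frames 750 25 10 | sed -n '1p;$p'
# prints 10:00:00:00 then 10:00:29:24
```

Because every field is zero-padded to a fixed width, lexical order equals time order, so `sort -c` checks monotonicity and `sort -u | wc -l` checks coverage on a captured stream.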
8. Use case two: cold storage with verifiable bundles
A route named archive accumulates with no live consumer. Once a period, an archivist runs a job that consumes the route in full, acknowledges that consumption through the drainer (tools/drain-routes.sh, reused rather than reimplemented), packages the content into per-bundle tar.gz files and computes a sha256 for each. It then emits a manifest aligning every bundle to its sequence range and archival date. The manifest is the product. It is what makes the cold store verifiable on retrieval rather than a pile of opaque tarballs.
The standards lineage sits naturally at the doorstep of this output, and the wording here is deliberately cautious because compatibility is a claim that should be earned, not asserted. The bundles are TAR, the most primitive and most thoroughly deployed tape interchange form there is, standardized as the pax interchange format under POSIX.1-2001 (GNU tar writes that exact format only when invoked with --format=pax). The manifest plus per-bundle sha256 is a rudimentary archival information package in the sense of the OAIS reference model, ISO 14721 / CCSDS 650.0-M-2, which counts fixity information, the evidence that content has not been altered, among the preservation metadata an archive keeps. An optional ARCHIVE_DEST can point at an LTFS mount, ISO/IEC 20919:2021, which writes the bundles to LTO tape with the operating system seeing the tape as a disk. Adjacent and equally documented are the gateways that bridge tape and object storage: Spectra Logic BlackPearl presents an S3 interface over a tape library, and AWS Tape Gateway runs the other way, presenting a virtual tape library backed by S3. These are real, decade-deployed standards, used by CERN’s tape archive and by national archives.
KMQ implements none of them. It produces output that sits naturally where they begin. The right framing is “the WAL turned into a feed for tape”, not “KMQ is an archival system”.
The scenario, usecase-cold-storage.sh, verified: 1000 records sent to archive, the drainer invoked to acknowledge consumption, a full read, a split into 4 bundles of 250, each as tar.gz plus sha256 and a manifest aligning each bundle to its sequence range. Integrity is recomputed and matched. The bundles land in the PVC by default, with no external destination required.
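The packaging step can be sketched with standard tools. This is not usecase-cold-storage.sh: the function names, the bundle naming scheme and the tab-separated manifest layout are all assumptions, as is the premise that records arrive one per line with contiguous sequence numbers starting at 1. It relies on sha256sum from GNU coreutils.

```shell
#!/bin/sh
# bundle_records INPUT OUTDIR PER_BUNDLE
# Splits INPUT (one record per line) into bundles of PER_BUNDLE records,
# writes each as tar.gz and appends a manifest line per bundle:
#   bundle <TAB> first_seq <TAB> last_seq <TAB> sha256 <TAB> archived_on
# Assumption: sequence numbers are contiguous and start at 1.
bundle_records() {
    in=$1 out=$2 per=$3
    mkdir -p "$out"
    split -l "$per" "$in" "$out/part."          # part.aa, part.ab, ...
    : > "$out/manifest.tsv"
    first=1
    for part in "$out"/part.*; do
        count=$(( $(wc -l < "$part") ))
        last=$((first + count - 1))
        name=$(printf 'bundle-%d-%d.tar.gz' "$first" "$last")
        tar -czf "$out/$name" -C "$out" "$(basename "$part")"
        sum=$(sha256sum "$out/$name" | cut -d ' ' -f 1)
        printf '%s\t%s\t%s\t%s\t%s\n' \
            "$name" "$first" "$last" "$sum" "$(date -u +%Y-%m-%d)" \
            >> "$out/manifest.tsv"
        rm "$part"                              # keep only bundle and manifest
        first=$((last + 1))
    done
}

# verify_bundles OUTDIR
# Recomputes every checksum listed in the manifest; non-zero on any mismatch.
verify_bundles() {
    ( cd "$1" && awk -F '\t' '{ print $4 "  " $1 }' manifest.tsv \
        | sha256sum -c - > /dev/null )
}
```

Run against 1000 lines with a bundle size of 250, this yields four bundles covering 1 to 250, 251 to 500, 501 to 750 and 751 to 1000. The verify step is what turns the manifest from a listing into a retrieval check.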
9. The maturity argument
Pull it together.
One mechanism: a reclaim gated on the consumption watermark, an always-on cut and the per-route cursor doing double duty as that watermark.
Two unrelated use cases: a synchronization bus and an archive feed.
The behavior that looks like a bug for a naive queue, segments retained because a route has no consumer, is precisely the guarantee both use cases require. The sync bus needs the track to survive until the slave reads it. The archive needs the route to survive until the archivist packages it.
The discipline that made this legible matters as much as the mechanism. The retention is documented as a feature with paired white-box and black-box scenarios, one to one, not asserted in prose. The retention is shown surviving a rotation and then released by consumption, with the reclaim invariant checked from inside the broker. A design pays off when it does work the designer did not have to do. The CUT and RECLAIM split was sized for rotation. It turned out to be exactly what these two use cases needed, and neither was aimed at when the split was drawn.
10. What was deferred
The boundaries, stated plainly, because the value of a note like this is in being exact about where the work stops.
processed.log rotation is deferred. The ACK-ingress endpoint runs as a socat EXEC handler, a fresh gawk process per connection that reopens the file by path each time, so it never holds the long-lived descriptor that made append.log rotation necessary in the first place. processed.log most likely decouples from the broker logs into the ACK ceremony layer entirely. The right move was to not rotate it now and to decide its architectural home first, rather than copy a mechanism it does not need.
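The difference between a long-lived descriptor and a per-call open by path is ordinary Unix file semantics and can be shown in a few lines. The sketch uses append.log only as a convenient file name; the function name fd_vs_path is hypothetical and nothing here is KMQ code.

```shell
#!/bin/sh
# fd_vs_path DIR
# Shows that a held descriptor follows a renamed file, while a writer that
# reopens by path on every call lands in a fresh file.
fd_vs_path() {
    dir=$1
    exec 3>> "$dir/append.log"                       # resident writer: descriptor stays open
    echo '1 written before rename' >&3
    mv "$dir/append.log" "$dir/append.log.000001"
    echo '2 written after rename' >&3                # follows the inode into the renamed file
    echo '3 written by path' >> "$dir/append.log"    # per-call open: creates a fresh file
    exec 3>&-                                        # close the descriptor
}
```

After the call, the renamed file holds lines 1 and 2 and the new append.log holds line 3. A resident writer has to close and reopen around a rename, whereas a process spawned per connection already gets the second behavior.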
Segment-aware GET_STATUS is deferred as an edge case. When a queried sequence belonged to a segment that has since been reclaimed, GET_STATUS answers NOT_FOUND rather than pointing at the archive. A segment index recorded as semantic intent, in etcd or a ConfigMap and changing per rotation rather than per message, would fix this. It is not yet built.
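One possible shape for such an index, offered purely as a sketch of something that does not exist yet: a line per rotation holding "first_seq last_seq segment state". The field layout, the state names and the function name locate_seq are all invented for illustration.

```shell
#!/bin/sh
# locate_seq SEQ
# Reads index lines "first_seq last_seq segment state" on stdin and prints
# "state segment" for the segment covering SEQ, or NOT_FOUND.
# Assumption: this index format is hypothetical; KMQ has no such index today.
locate_seq() {
    awk -v q="$1" '
        q + 0 >= $1 + 0 && q + 0 <= $2 + 0 { print $4, $3; hit = 1; exit }
        END { if (!hit) print "NOT_FOUND" }'
}

# A reclaimed sequence now resolves to its former segment, not NOT_FOUND.
printf '%s\n' \
    '1 10000 append.log.000001.gz reclaimed' \
    '10001 20000 append.log.000002.gz sealed' | locate_seq 4242
# prints: reclaimed append.log.000001.gz
```

The index grows by one line per cut, not per message, which matches the "changing per rotation" property named above.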
Per-topic file rotation is not done. test.log, jobs.log and dead_letter.log are the per-route queue files, written one line per routing key and never reclaimed. They are real debris under heavy synthetic load, and they tie to the deferred processed.log question rather than to the WAL rotation this note describes.
What is queued
A segment index so GET_STATUS can answer for reclaimed sequences instead of returning NOT_FOUND. A decision on where processed.log lives before it gets a rotation policy of its own. And the ARCHIVE_DEST path actually pointed at an LTFS mount, so the cold-storage bundles land on tape rather than back on the same PVC they came from. None of these changes the mechanism in this note. They extend the surface it already exposes.