Closing the delivery gap: an out‑of‑band acknowledgment sidecar and retry loop
Previous write‑ups documented KMQ’s four scenarios going from two failures to five passes, and the local OCI registry behind them. The core pipeline – FIFO, ring buffers, append.log – was proven durable and inspectable. The boundary at the TCP ingress remained: a producer could not know which messages had been safely persisted, and under a stalled consumer the sender could outrun backpressure.
This post describes the addition of an out‑of‑band acknowledgment (ACK) sidecar and a retry loop that turns the broker’s append log into a pull‑based durability contract. The change required no modification to the hot‑path containers, no PVC migration, and only a single new container in the existing pod. The result is a scenario that delivers 20 000 messages with zero loss, even when the initial send loses some, proving that the combination of block‑mode ring buffers and an on‑demand ACK endpoint closes the delivery gap.
Why an out‑of‑band ACK?
Inside the pipeline, backpressure is enforced by kernel FIFO and ring‑buffer blocking. At the TCP boundary, however, the producer can push data faster than the ingress propagates that backpressure, resulting in message loss – a documented trade‑off of the phase‑2 design. Adding a per‑message acknowledgment inside the hot path would violate the “pure AWK, no shell” contract and increase latency. A separate query endpoint that reads the append log on demand gives the producer the information it needs to decide when to retry, without touching the pipeline.
This approach follows the inspectability principle of KMQ: every durable message is a line in append.log, and its existence can be checked with standard tools. The ACK sidecar simply automates that check over a TCP socket.
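As a concrete illustration of that principle, the durable high‑water mark can be read with nothing but coreutils. This sketch uses a stand‑in log file; the path and the seq|payload framing are assumptions based on the framer described below:

```shell
# Read the last durable sequence number straight from the log
# (path and "seq|payload" framing are illustrative).
printf '1|alpha\n2|beta\n3|gamma\n' > /tmp/append.log   # stand-in log
tail -n 1 /tmp/append.log | cut -d'|' -f1               # → 3
```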
The ack‑egress sidecar
A new container, ack‑egress, is added to the broker pod. It runs socat listening on port 5675, forking a single‑shot gawk process per connection. The AWK script accepts a keyword GET_ACK and responds with ACK <seq>, where <seq> is the last sequence number found in append.log. The script uses only built‑in getline and string functions – no system(), no external binaries.
# ack-egress.awk – KMQ out-of-band ACK query endpoint.
BEGIN {
    LOG = ENVIRON["APPEND_LOG"]
}
{
    if ($1 == "GET_ACK") {
        last_seq = 0
        while ((getline line < LOG) > 0) {
            split(line, f, "|")
            last_seq = f[1]
        }
        close(LOG)
        print "ACK " last_seq
        fflush()
    }
    exit  # one request per connection
}
The sidecar mounts the same PVC as the broker, read‑only. It is stateless and consumes negligible resources (< 10 MiB RSS). A dedicated NetworkPolicy allows ingress on port 5675 within the cluster, and a new port on the broker‑svc Service makes the endpoint reachable by DNS name.
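The endpoint can be exercised without the cluster by piping a request into the script directly, the same way socat hands a connection to it. This hand test assumes the seq|payload framing and uses awk in place of gawk for portability:

```shell
# Fake append.log with three framed messages (format assumed: seq|payload).
printf '1|alpha\n2|beta\n3|gamma\n' > /tmp/append.log

# Feed one GET_ACK request to the script, as socat would per connection.
printf 'GET_ACK\n' | APPEND_LOG=/tmp/append.log awk '
BEGIN { LOG = ENVIRON["APPEND_LOG"] }
{
    if ($1 == "GET_ACK") {
        last_seq = 0
        while ((getline line < LOG) > 0) { split(line, f, "|"); last_seq = f[1] }
        close(LOG)
        print "ACK " last_seq
    }
    exit
}'
# → ACK 3
```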
Retry logic
With the ACK endpoint in place, a producer can implement a simple retry loop:
- Send a batch of messages (e.g. 5 000 lines).
- Query the ACK endpoint to obtain the current durable sequence number.
- If the number of delivered messages is less than the number sent, resend the missing sequences starting from last_ack + 1.
- Repeat until the ACK confirms all messages are durable.
This loop turns the unreliable TCP boundary into an application‑level at‑least‑once guarantee. The producer may send duplicate messages, but the broker’s framer assigns a unique sequence number to each line, and the gap‑check tool verifies that the final log contains a contiguous, duplicate‑free sequence.
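The retry loop can be sketched in shell. This is a self‑contained simulation, not the real producer: a deliberately lossy send_chunk and a file‑backed get_ack stand in for the TCP ingress and the ACK sidecar, and all names and numbers are illustrative:

```shell
# Simulated durable store: the ACK is simply the line count of the log.
LOG=/tmp/kmq_sim.log; : > "$LOG"
TARGET=20; CHUNK=5

send_chunk() {   # append $2 messages starting at seq $1,
                 # "dropping" every 4th line to mimic loss at the boundary
    seq "$1" "$(($1 + $2 - 1))" | awk 'NR % 4 != 0 { print "msg-" $1 }' >> "$LOG"
}
get_ack() { awk 'END { print NR }' "$LOG"; }

while :; do
    ack=$(get_ack)
    [ "$ack" -ge "$TARGET" ] && break
    send_chunk "$((ack + 1))" "$CHUNK"   # resend from last_ack + 1
done
echo "delivered=$(get_ack) target=$TARGET"
# → delivered=20 target=20
```

Against the real broker, send_chunk would write to the ingress port and get_ack would query the sidecar (e.g. via socat); the control flow is identical.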
Producer (retry loop)
    │
    │ 1. send messages (TCP :5673)
    ▼
┌─────────────────────────────────┐
│ Broker Pod (worker node)        │
│   internal‑ingress :5673        │
│   … (pipeline) …                │
│   durability writes append.log  │
│   ack‑egress :5675 (sidecar)    │
│     └─ reads append.log         │
└─────────────────────────────────┘
    │                        ▲
    │ 2. GET_ACK (TCP :5675) │
    └────────────────────────┘
            ACK <seq>
The ack‑retry scenario
A new scenario script, scenario-ack-retry.sh, demonstrates the loop. It starts from a clean broker pod, queries the baseline ACK, and then sends 20 000 messages in chunks of 5 000. After each chunk, the ACK is checked. As seen in earlier backpressure tests, the first chunk may not deliver all 5 000 messages; the ACK reveals the shortfall immediately.
Excerpt from a typical run (with a log that already contained older data; the script uses a delta from the starting sequence):
ts=… step=send action=send_chunk seq=95771 count=5000 missing=20000
ts=… step=ack_check ack_seq=98270 delivered=2500 target=20000
ts=… step=retry action=continue
ts=… step=send action=send_chunk seq=98271 count=5000 missing=17500
ts=… step=ack_check ack_seq=102555 delivered=6785 target=20000
…
ts=… step=complete action=all_delivered
ts=… step=result outcome=PASS delivered=21100 total_sent=20000
The final gap check reports a contiguous sequence with zero gaps, confirming that every message eventually reached the append log.
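A gap check of this kind is a small scan over the log. A minimal sketch, assuming the seq|payload framing (the real tool's name and options are not shown here), run against a deliberately holed sample file:

```shell
# Report any non-contiguous sequence numbers in an append.log-style file.
printf '1|a\n2|b\n4|d\n5|e\n' > /tmp/gap_demo.log    # seq 3 is missing
awk -F'|' '
NR > 1 && $1 != prev + 1 { printf "gap after %d (next is %d)\n", prev, $1; gaps++ }
{ prev = $1 }
END { print (gaps ? gaps : 0) " gap(s)" }
' /tmp/gap_demo.log
# → gap after 2 (next is 4)
# → 1 gap(s)
```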
Relationship to the transactional outbox pattern
The transactional outbox pattern solves the problem of atomically updating a database and publishing a message. A message relay reads an outbox table and sends the messages to a broker. KMQ does not implement that pattern because the broker’s write‑ahead log is the database; there is no separate store to synchronise. The ACK endpoint fills a role similar to that of the outbox relay – it reads the durable store and provides a status signal – but it is pull‑based rather than push‑based. The producer polls the ACK, decides whether to retry, and thus retains control over delivery semantics. This design keeps the broker minimal and leaves policy decisions to the client.
Surgical change, no disruption
Adding the ACK sidecar required only:
- A new container definition in the broker Deployment manifest.
- An additional port in the Service and a NetworkPolicy rule.
- Rebuilding the broker image with the ack‑egress.awk file included (the same image is used for all broker containers).
No hot‑path pipelines were modified. The existing scenarios (resume, routing, backpressure measurement, dead‑letter replay) continue to pass unchanged. This demonstrates the operational lightness of the architecture: new capabilities can be grafted onto the pod without disrupting the running system, and without complex rollout procedures. Future additions (e.g. TLS on the ACK port, or a push‑based ack via a response FIFO) can follow the same pattern.
What comes next
With the ACK endpoint operational, the cluster can be considered lab‑production‑ready for the niche it serves. Next steps under consideration include:
- Mutual TLS on the internal‑ingress and ack ports, using a self‑signed certificate generated by a one‑time init container, to ensure only authorised producers can connect.
- A minimal producer library (AWK or shell) that wraps the retry loop, exposing a simple “send‑and‑confirm” interface.
- Log rotation for append.log, driven by the CRD’s retention policy, to bound disk usage over long‑running experiments.
KMQ remains a single‑pod, single‑node broker built from Unix primitives. Its reliability guarantees are now explicit, measurable, and under the control of the operator – exactly the right posture for a laboratory testbed.