Closing the delivery gap: an out-of-band acknowledgment sidecar and retry loop
Earlier posts documented KMQ’s four scenarios going from two failures to five passes and the local OCI registry behind them. The core pipeline (FIFO, ring buffers, append.log) was proven durable and inspectable. One boundary remained. A producer could not know which messages had been safely persisted. Under a stalled consumer the sender could outrun backpressure.
This post describes the addition of an out-of-band acknowledgment (ACK) sidecar and a retry loop that turns the broker’s append log into a pull-based durability contract. The change required no modification to the hot-path containers, no PVC migration and only one new container in the existing pod. The result is a scenario that delivers 20 000 messages with zero net loss even when the initial send drops some of them, showing that the combination of block-mode ring buffers and an on-demand ACK endpoint closes the delivery gap.
Why an out-of-band ACK?
Inside the pipeline, backpressure is enforced by kernel FIFO and ring-buffer blocking. At the TCP boundary, however, the producer can push data faster than the ingress propagates that backpressure, resulting in message loss. This is a documented trade-off of the previous iteration. Adding a per-message acknowledgment inside the hot path would violate the “pure AWK, no shell” contract and increase latency. A separate query endpoint that reads the append log on demand gives the producer the information it needs to decide when to retry, without touching the pipeline.
This approach follows the inspectability principle of KMQ: every durable message is a line in append.log and its existence can be checked with standard tools. The ACK sidecar automates that check over a TCP socket.
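As an illustration of that manual check, the sketch below reads the log with plain awk. It assumes the sequence number is the first `|`-delimited field of each line, matching the split on `|` used by the ACK script in this post. The function names `last_durable_seq` and `is_durable` are hypothetical and not part of KMQ.

```shell
#!/bin/sh
# Hypothetical helpers, not part of KMQ.
# Assumption: every append.log line starts with "<seq>|".

# Print the sequence number of the last line, or 0 for an empty or unreadable log.
last_durable_seq() {
    [ -r "$1" ] || { echo 0; return 0; }
    awk -F'|' '{ seq = $1 } END { print seq + 0 }' "$1"
}

# Succeed (exit 0) if a line with sequence number $2 exists in log $1.
is_durable() {
    [ -r "$1" ] || return 1
    awk -F'|' -v want="$2" '
        $1 == want { found = 1; exit }
        END { exit !found }
    ' "$1"
}
```

For example, `is_durable /path/to/append.log 98270 && echo durable` would report whether that sequence number has reached the log; the path is whatever the PVC mount exposes.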
The ack-egress sidecar
A new container, ack-egress, is added to the broker pod. It runs socat listening on port 5675, forking a single-shot gawk process per connection. The AWK script accepts the keyword GET_ACK and responds with ACK <seq>, where <seq> is the last sequence number found in append.log. The script uses only built-in getline and string functions. No system(). No external binaries.
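# ack-egress.awk - KMQ out-of-band ACK query endpoint.
# Run once per connection: reads one request line and replies "ACK <seq>".
BEGIN {
    LOG = ENVIRON["APPEND_LOG"]    # path to append.log on the read-only PVC mount
}
{
    if ($1 == "GET_ACK") {
        last_seq = 0
        # Scan the whole log; the first "|"-delimited field is the sequence number.
        while ((getline line < LOG) > 0) {
            split(line, f, "|")
            last_seq = f[1]
        }
        close(LOG)
        print "ACK " last_seq
        fflush()
    }
    exit    # one request per connection
}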
The sidecar mounts the same PVC as the broker, read-only. It is stateless and consumes negligible resources (under 10 MiB RSS). A dedicated NetworkPolicy allows ingress on port 5675 within the cluster. A new port on the broker-svc Service makes the endpoint reachable by DNS name.
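From the producer side, querying the endpoint could look like the sketch below. The use of `nc` as the TCP client, the `ACK_HOST` and `ACK_PORT` variables and the function names are assumptions made for illustration; only the `GET_ACK` request, the `ACK <seq>` reply, port 5675 and the broker-svc Service name come from the description above.

```shell
#!/bin/sh
# Sketch of a producer-side ACK query. The nc client and these variable
# names are assumptions, not part of KMQ.
ACK_HOST=${ACK_HOST:-broker-svc}
ACK_PORT=${ACK_PORT:-5675}

# Extract <seq> from a reply of the form "ACK <seq>".
# Fails (non-zero exit) on anything else, including an empty reply.
parse_ack() {
    printf '%s\n' "$1" | tr -d '\r' | awk '
        $1 == "ACK" && $2 ~ /^[0-9]+$/ { print $2; ok = 1; exit }
        END { exit !ok }
    '
}

# Send GET_ACK and print the last durable sequence number.
query_ack() {
    reply=$(printf 'GET_ACK\n' | nc -w 2 "$ACK_HOST" "$ACK_PORT") || return 1
    parse_ack "$reply"
}
```

Keeping the parsing in its own function means a malformed or empty reply is reported as a failure rather than being read as sequence number zero.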
Retry logic
With the ACK endpoint in place, a producer can implement a simple retry loop:
- Send a batch of messages (e.g. 5 000 lines).
- Query the ACK endpoint to obtain the current durable sequence number.
- If the number of delivered messages is less than the number sent, resend the missing sequences starting from last_ack + 1.
- Repeat until the ACK confirms all messages are durable.
This loop turns the unreliable TCP boundary into an application-level at-least-once guarantee. The producer may send duplicate messages, but the broker’s framer assigns a unique sequence number to every line it accepts, so a resent payload appears as a new line rather than colliding with an earlier one. The gap-check tool verifies that the final log contains a contiguous, duplicate-free run of sequence numbers.
producer                            broker pod
| 1. send messages                      |
|---------- TCP 5673 -----------------> | internal-ingress
|                                       |        |
|                                       |        v
|                                       | (pipeline: FIFO,
|                                       |  ring buffers,
|                                       |  framer, durability,
|                                       |  router)
|                                       |        |
|                                       |        v
|                                       | append.log (PVC)
|                                       |        ^
|                                       |        | reads only
| 2. GET_ACK                            |        |
|---------- TCP 5675 -----------------> | ack-egress (sidecar)
|                                       |
| 3. ACK <last_seq>                     |
|<--------------------------------------|
|                                       |
| 4. resend from last_ack+1 if needed   |
|    (loop until ACK == sent count)     |
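The four steps in the diagram can be sketched as a small shell loop. This is a minimal illustration under stated assumptions, not the scenario script itself: `query_ack` and `send_chunk` are hypothetical hooks, `nc` is an assumed TCP client, the `payload-N` lines are placeholders for real messages, and the cap on rounds is an added safety measure.

```shell
#!/bin/sh
# Sketch of the producer retry loop; all names here are illustrative.
BROKER_HOST=${BROKER_HOST:-broker-svc}

# Hook: print the last durable sequence number (steps 2 and 3).
query_ack() {
    printf 'GET_ACK\n' | nc -w 2 "$BROKER_HOST" 5675 |
        awk '$1 == "ACK" { print $2 + 0 }'
}

# Hook: send COUNT placeholder lines, numbered from FIRST (steps 1 and 4).
send_chunk() {
    awk -v first="$1" -v count="$2" '
        BEGIN { for (i = 0; i < count; i++) print "payload-" (first + i) }
    ' | nc -w 2 "$BROKER_HOST" 5673
}

# retry_until_durable TOTAL CHUNK [MAX_ROUNDS]
# Returns 0 when TOTAL messages are durable, 1 if MAX_ROUNDS is exhausted,
# 2 if the ACK endpoint gives no usable answer.
retry_until_durable() {
    total=$1 chunk=$2 max_rounds=${3:-100}
    baseline=$(query_ack)              # sequence number before the first send
    [ -n "$baseline" ] || return 2
    delivered=0 rounds=0
    while [ "$delivered" -lt "$total" ]; do
        [ "$rounds" -lt "$max_rounds" ] || return 1
        missing=$((total - delivered))
        count=$chunk
        if [ "$missing" -lt "$chunk" ]; then count=$missing; fi
        send_chunk "$((delivered + 1))" "$count"
        ack=$(query_ack)
        [ -n "$ack" ] || return 2
        delivered=$((ack - baseline))  # delta from the starting sequence
        rounds=$((rounds + 1))
    done
    echo "delivered=$delivered rounds=$rounds"
}
```

Because the two hooks are ordinary shell functions, they can be redefined to point at a local test double, which makes the loop easy to exercise without a running broker.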
The ack-retry scenario
A new scenario script, scenario-ack-retry.sh, demonstrates the loop. It starts from a clean broker pod, queries the baseline ACK and then sends 20 000 messages in chunks of 5 000. After each chunk the ACK is checked. As seen in earlier backpressure tests, the first chunk may not deliver all 5 000 messages. The ACK reveals the shortfall immediately.
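Excerpt from a typical run. The log already contained older data, so the script reports delivered as a delta from the starting sequence: the first send begins at seq=95771, which puts the baseline at 95770, and an ack_seq of 98270 therefore means 2500 delivered.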
ts=... step=send action=send_chunk seq=95771 count=5000 missing=20000
ts=... step=ack_check ack_seq=98270 delivered=2500 target=20000
ts=... step=retry action=continue
ts=... step=send action=send_chunk seq=98271 count=5000 missing=17500
ts=... step=ack_check ack_seq=102555 delivered=6785 target=20000
...
ts=... step=complete action=all_delivered
ts=... step=result outcome=PASS delivered=21100 total_sent=20000
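The final gap check reports a contiguous sequence with zero gaps, confirming that every message eventually reached the append log. The delivered count exceeds total_sent in this run; under at-least-once delivery that is expected whenever a retry resends payloads that also arrive from an earlier attempt.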
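The kind of check involved can be approximated with a few lines of awk. The sketch below is a hypothetical stand-in, not the actual gap-check tool: it assumes the sequence number is the first `|`-delimited field and that a healthy log increases by exactly one per line. The name `check_contiguous` and the output format are invented for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for a gap check; the real KMQ tool may differ.
# Prints a summary line and exits non-zero if any sequence number is
# skipped (gaps) or repeated / out of order (dups).
check_contiguous() {
    awk -F'|' '
        NR > 1 && $1 + 0 != prev + 1 {
            if ($1 + 0 <= prev) dups++            # repeated or out of order
            else gaps += $1 - prev - 1            # numbers skipped
        }
        { prev = $1 + 0 }
        END {
            printf "lines=%d gaps=%d dups=%d\n", NR, gaps, dups
            exit (gaps + dups > 0)
        }
    ' "$1"
}
```

The exit status makes the function usable as a pass/fail gate at the end of a scenario, while the summary line gives a quick indication of what went wrong.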
Relationship to the transactional outbox pattern
The transactional outbox pattern solves the problem of atomically updating a database and publishing a message. A message relay reads an outbox table and sends the messages to a broker. KMQ does not implement that pattern because the broker’s write-ahead log is the database. There is no separate store to synchronise. The ACK endpoint fulfills a role similar to the outbox relay. It reads the durable store and provides a status signal. The difference is that it is pull-based rather than push-based. The producer polls the ACK, decides whether to retry and retains control over delivery semantics. This design keeps the broker minimal and leaves policy decisions to the client.
Surgical change, no disruption
Adding the ACK sidecar required only:
- A new container definition in the broker Deployment manifest.
- An additional port in the Service and a NetworkPolicy rule.
- Rebuilding the broker image with the ack-egress.awk file included (the same image is used for all broker containers).
No hot-path pipelines were modified. The existing scenarios (resume, routing, backpressure measurement, dead-letter replay) continue to pass unchanged. This shows the operational lightness of the architecture. New capabilities can be grafted onto the pod without disrupting the running system and without complex rollout procedures. Future additions (e.g. TLS on the ACK port or a push-based ack via a response FIFO) can follow the same pattern.
What comes next
With the ACK endpoint operational, the broker has an inspectable reliability contract for the laboratory niche it serves. Next steps under consideration include:
- Mutual TLS on the internal-ingress and ack ports, using a self-signed certificate generated by a one-time init container, to ensure only authorised producers can connect.
- A minimal producer library (AWK or shell) that wraps the retry loop, exposing a simple “send-and-confirm” interface.
- Log rotation for append.log, driven by the CRD’s retention policy, to bound disk usage over long-running experiments.
KMQ remains a single-pod, single-node broker built from Unix primitives. Its reliability guarantees are now explicit, measurable and under the control of the operator.