
The 16KB Bug: How a Pipe Buffer Froze Our Entire Platform

A 16KB pipe buffer caused Caddy to freeze every 5 minutes. The debugging story of a classic Unix pipe deadlock that took us from confusion to a 5-line fix.

Thales & Claude | March 25, 2026 | 9 min read
Tags: debugging, caddy, unix, pipe-buffer, deadlock, rust, war-story

Every five to seven minutes, like clockwork, sh0's reverse proxy would freeze. Caddy's Admin API would stop responding. The health monitor would detect the failure, kill the process, restart it, re-apply all routes, and everything would work again -- for exactly five to seven minutes. Then it would freeze again.

The logs told a story of relentless self-healing:

ERROR sh0_proxy::manager: Caddy process is alive but admin API is unresponsive -- killing and restarting
INFO  sh0_proxy::manager: Caddy restarted -- re-applying 12 routes
...
ERROR sh0_proxy::manager: Caddy process is alive but admin API is unresponsive -- killing and restarting
INFO  sh0_proxy::manager: Caddy restarted -- re-applying 12 routes

The health monitor we had built (Article 5) was doing its job admirably -- no user-facing downtime occurred. But the pattern was maddening. Caddy was not crashing. The process was alive. It was just... frozen. Every single time.

This is the story of a bug that has existed since the earliest days of Unix, hiding in plain sight in our modern Rust codebase.

---

The Symptom

The failure was perfectly consistent:

  • Caddy process alive (PID present, not zombie)
  • Admin API unresponsive (HTTP timeout on localhost:2019)
  • HTTPS traffic to all hosted apps frozen (Caddy handles all TLS termination)
  • Interval between freezes: 5-7 minutes, varying slightly
  • After kill and restart: immediate recovery, routes re-applied in under a second

The varying interval was the first clue that something was filling up. A fixed interval would suggest a timer or cron-like trigger. A slowly varying interval suggests a buffer or queue reaching capacity, where the fill rate depends on activity.

---

The Investigation

We started with the obvious suspects.

Memory? No. Caddy's RSS was stable at around 30MB. No growth between restarts.

File descriptors? No. lsof showed a normal count of open sockets and files.

Caddy bug? Unlikely. Caddy is battle-tested software serving millions of sites. A bug that freezes the entire process every five minutes would not survive a single release cycle.

Our configuration? We inspected the JSON config we were sending to Caddy. Valid. Clean. The same config worked perfectly when loaded from a file with caddy run --config caddy.json.

That last observation was the breakthrough. The same Caddy binary, with the same configuration, worked fine when run standalone but froze when run as a child process of sh0. The difference was how we spawned it.

We looked at the spawn code in process.rs:

let child = Command::new(&self.caddy_path)
    .args(["run", "--config", "-"])
    .stdin(Stdio::null())
    .stdout(Stdio::null())
    .stderr(Stdio::piped())   // <-- piped, but never read
    .spawn()?;

Three standard I/O streams. Stdin: null (Caddy does not need interactive input). Stdout: null (we do not need its standard output). Stderr: piped.

Piped. To where?

---

The Root Cause

The answer is: to nowhere. We piped Caddy's stderr into our process but never read from the pipe. We had written .stderr(Stdio::piped()) with the intention of capturing error output, but never actually spawned a task to consume it.

Here is what happens when you pipe a child process's output and do not read it:

1. The child process writes to stderr (Caddy logs every request, every TLS handshake, every route update).
2. The data goes into a kernel pipe buffer.
3. On macOS, this buffer is approximately 16KB (64KB on Linux).
4. When the buffer is full, the child's next write() call blocks.
5. The write happens on Caddy's main thread (or a thread that holds a critical lock).
6. Caddy is now frozen -- it cannot process HTTP requests, cannot respond to the Admin API, cannot do anything until the pipe buffer has room.

This is not a bug in Caddy. It is not a bug in Rust. It is a fundamental property of Unix pipes that has existed since the 1970s. A pipe is a fixed-size buffer. When it is full, the writer blocks until the reader consumes data. If there is no reader, the writer blocks forever.

The 5-7 minute interval maps well to the time needed for Caddy's log output to fill 16KB. With moderate traffic (a dozen hosted apps, periodic health checks, TLS renewals), Caddy produces a few dozen bytes of log output per second on average. At that rate, 16,384 bytes fills in roughly 5-7 minutes.
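As a sanity check on that arithmetic (the per-second rates below are illustrative guesses, not measured figures from the incident):

```rust
fn main() {
    // macOS default pipe buffer in bytes (Linux defaults to 65,536).
    let buffer = 16_384.0_f64;
    // Illustrative average log output rates in bytes/second.
    for rate in [40.0_f64, 55.0] {
        let minutes = buffer / rate / 60.0;
        println!("{rate} B/s fills the buffer in {minutes:.1} min");
    }
}
```

At roughly 40-55 bytes per second, the buffer fills in about 5 to 7 minutes, matching the observed restart cadence; on Linux's 64KB buffer the same rates stretch the cycle to 20-30 minutes.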

---

Why Not Stdio::null()?

The natural question: why did we pipe stderr in the first place instead of sending it to null like stdout?

Because we wanted Caddy's error output for debugging. When Caddy fails to bind a port, rejects a configuration, or encounters a TLS error, that information appears on stderr. Discarding it with Stdio::null() would make debugging proxy issues nearly impossible.

The mistake was not piping stderr. The mistake was piping it without reading it.

---

The Fix

The fix is a handful of lines, added immediately after spawning the child process:

// Drain stderr in a background task to prevent pipe buffer deadlock.
// Requires spawning via tokio::process::Command so stderr is async-readable.
use tokio::io::AsyncBufReadExt;

if let Some(stderr) = child.stderr.take() {
    let reader = tokio::io::BufReader::new(stderr);
    let mut lines = reader.lines();
    tokio::spawn(async move {
        while let Ok(Some(line)) = lines.next_line().await {
            tracing::debug!(target: "caddy", "{}", line);
        }
    });
}

A background tokio task reads stderr line by line and forwards each line to the tracing logger at debug level. The pipe buffer never fills because it is continuously drained. Caddy's log output is preserved (visible when running with RUST_LOG=caddy=debug) but does not clog the pipe.

We also downgraded the health monitor's restart message from error! to warn!:

// Before
tracing::error!("Caddy process is alive but admin API is unresponsive -- killing and restarting");

// After
tracing::warn!("Caddy process is alive but admin API is unresponsive -- killing and restarting");

The restart is self-healing behavior, not a critical failure. The warn level is appropriate: something unexpected happened, but the system handled it automatically.

---

Confirming the Fix

After deploying the fix, we let the server run for over 15 minutes. Then an hour. Then overnight. The restart cycle was gone. Caddy ran continuously, the Admin API remained responsive, and the health monitor reported nothing but clean checks.

The restart logic we had built in Article 5 remained in place as a safety net. It simply stopped triggering. A system that was restarting every five minutes now ran indefinitely without intervention.

---

A Classic Bug in a Modern Codebase

This bug is documented in every Unix programming textbook. The POSIX specification for pipe(2) explicitly states that writes to a full pipe will block. The Python documentation warns about it. The Rust std::process documentation mentions it. And yet it caught us, two experienced builders (one human, one AI), because the symptom -- an unresponsive HTTP server -- looked nothing like its cause -- a full pipe buffer.

The indirection is what makes this bug insidious. The cause (a full 16KB buffer in a kernel pipe) and the effect (Caddy's Admin API not responding to HTTP requests) are separated by multiple layers of abstraction. You have to reason through the chain: full pipe leads to a blocked write, which leads to a blocked thread, which leads to a deadlocked process, which leads to unresponsive HTTP endpoints.

Several factors made this bug particularly tricky to diagnose:

The process was not dead. Traditional process monitoring (is the PID alive? is it a zombie?) reported everything as healthy. The process was alive, it was just unable to make progress.

The interval was variable. If it had been exactly 5 minutes every time, we might have searched for a timer. The 5-7 minute variation pointed toward a capacity-dependent trigger, but we initially looked at Caddy's internal caches rather than the OS-level pipe buffer.

The workaround masked the cause. Our health monitor (kill, restart, re-apply routes) kept the platform running. The urgency to find the root cause was lower because users were not affected. This is the double-edged sword of self-healing infrastructure: it buys you time, but it also lets you live with bugs longer than you should.

macOS versus Linux. On Linux, the default pipe buffer is 64KB, so the same bug would manifest with a longer interval -- perhaps 20-30 minutes. We were developing on macOS, where the 16KB buffer made the cycle faster and more noticeable. Had we been on Linux, this might have been mistaken for an intermittent network issue.

---

Rules for Child Process Management

This experience crystallized three rules we now follow for every child process in sh0:

Rule 1: Every piped stream must have a reader. If you pipe stdout or stderr, spawn a task to consume it. Always. Even if you think the child process will not produce much output. "Not much" eventually becomes "enough to fill the buffer."

Rule 2: Prefer async draining over synchronous reads. A blocking read() in a tokio runtime can starve the executor. Use tokio::io::BufReader and lines() to integrate child process I/O with the async runtime.

Rule 3: Log child process output, do not discard it. Sending stderr to Stdio::null() prevents the deadlock but destroys diagnostic information. Draining to a logger at debug level gives you both: no deadlock, and the ability to see the output when you need it.

---

The Broader Lesson

The 16KB bug is a reminder that systems programming is full of implicit contracts. A Unix pipe has a contract: the reader must keep up with the writer, or the writer will block. This contract is invisible in the API -- .stderr(Stdio::piped()) compiles and runs without complaint. The violation only manifests under load, after minutes of accumulated output, in a symptom that looks completely unrelated to the cause.

Every abstraction layer we use -- async runtimes, HTTP servers, process managers, container runtimes -- has contracts like these. The most dangerous bugs are not the ones that crash your program. They are the ones that make it stop making progress while appearing perfectly healthy from the outside.

---

What Comes Next

With the pipe deadlock fixed and Caddy running stably, we turned our attention to the other side of the proxy equation: SSL certificates. The next article covers how sh0 handles automatic HTTPS via ACME, supports custom certificate uploads for enterprise deployments, and encrypts private keys at rest with AES-256-GCM.

This is Part 7 of the "How We Built sh0.dev" series. sh0 is a PaaS platform built entirely by a CEO in Abidjan and an AI CTO, with zero human engineers.
