Building a PaaS in 14 days means you ship fast and break things at a rate that would make a QA engineer weep. We broke a lot of things. Some bugs were trivial -- a missing import, an off-by-one pagination offset. But a handful of bugs nearly derailed the entire project. They were the kind of failures that produce no error message, or worse, produce an error message that lies. They ate hours when we had none to spare.
This is a collection of war stories from the sh0 bug tracker: the symptoms we saw, the rabbit holes we went down, and the fixes that saved us. If you are building infrastructure software in Rust, or wrangling Docker and Caddy in production, some of these will feel painfully familiar.
BUG-009: Git Pull, Stale Objects, and the Fresh Clone Fallback
Symptom: Redeploying an app that had been deployed before would fail silently during the git pull phase. The deployment status would transition to "cloning" and then hang, eventually timing out. First deploys always worked fine.
Investigation: The deploy pipeline cached git repositories in a local directory keyed by app name. On the first deploy, it ran git clone --depth 1. On subsequent deploys, it ran git fetch followed by git reset --hard origin/{branch}. The failure happened in the fetch step -- libgit2 returned an opaque "object not found" error, which our error handling dutifully logged and then... did nothing useful with.
The root cause was shallow clone history. When a repository is cloned with --depth 1, the local repo contains exactly one commit. If the remote force-pushes, rebases, or the shallow history diverges in any way, git fetch cannot reconcile the object graph. libgit2 does not handle this gracefully -- it throws an error about missing objects rather than telling you "your shallow clone is stale."
Fix: We added a fallback mechanism in sh0-git/src/repo.rs. If a pull or fetch fails with an object error, the pipeline deletes the entire cached repository directory and performs a fresh clone:
```rust
pub async fn clone_or_pull(
    url: &str,
    path: &Path,
    branch: &str,
    creds: Option<&GitCredentials>,
) -> Result<String> {
    if path.join(".git").exists() {
        match pull(path, branch, creds).await {
            Ok(hash) => return Ok(hash),
            Err(e) => {
                tracing::warn!("Pull failed ({e}), removing stale repo for fresh clone");
                tokio::fs::remove_dir_all(path).await?;
            }
        }
    }
    clone(url, path, branch, creds).await
}
```

The retry adds a few seconds to the deploy, but it eliminates an entire class of "works the first time, breaks on redeploy" failures. Stale shallow clones are now a non-issue.
BUG-010: Caddy Permission Denied on macOS
Symptom: Running sh0 serve on macOS crashed immediately with "permission denied" when Caddy tried to write its data files. The same binary worked fine on Linux.
Investigation: Caddy stores its TLS certificates, OCSP staples, and configuration state in a data directory. Our default was /var/lib/caddy, which follows the Filesystem Hierarchy Standard on Linux. On macOS, /var/lib/ requires root permissions, and running a development PaaS as root is not something we wanted to encourage.
Root cause: We had hardcoded a Linux-specific path as the default for ProxyConfig::caddy_data.
Fix: Changed the default to ./sh0-data/caddy -- a relative path inside the sh0 working directory. This works everywhere without elevated permissions:
```rust
impl Default for ProxyConfig {
    fn default() -> Self {
        Self {
            caddy_bin: "caddy".into(),
            caddy_data: PathBuf::from("./sh0-data/caddy"),
            caddy_admin_url: "http://localhost:2019".into(),
            acme_email: None,
        }
    }
}
```

Later, we implemented a full platform-aware data path system (FHS on Linux root, XDG on Linux user, macOS Application Support, or --data-dir override), but this quick fix unblocked macOS development for three critical days.
BUG-012: CSRF Middleware vs. Body-Less POSTs
Symptom: After enabling CSRF protection, several dashboard actions stopped working -- rollback, restart, and stop all returned 415 Unsupported Media Type. The browser console showed no request body being sent, which was correct: these are POST endpoints that take no input.
Investigation: Our CSRF middleware enforced Content-Type: application/json on all POST, PATCH, and PUT requests. This is a standard defense against cross-site form submissions: browsers cannot send application/json from a plain HTML form, so requiring it blocks CSRF attacks without tokens.
The problem: when a POST request has no body, the browser does not set a Content-Type header. There is nothing to describe the type of. The CSRF middleware saw a missing Content-Type, decided it was not application/json, and rejected the request.
Fix: Skip the Content-Type check when the request body is empty:
```rust
async fn csrf_middleware(req: Request, next: Next) -> Response {
    if matches!(*req.method(), Method::POST | Method::PATCH | Method::PUT) {
        let content_length = req.headers()
            .get(header::CONTENT_LENGTH)
            .and_then(|v| v.to_str().ok())
            .and_then(|v| v.parse::<u64>().ok())
            .unwrap_or(0);

        if content_length > 0 {
            // Enforce Content-Type: application/json for requests with a body
            let content_type = req.headers().get(header::CONTENT_TYPE);
            if !content_type.map_or(false, |ct| ct.as_bytes().starts_with(b"application/json")) {
                return StatusCode::UNSUPPORTED_MEDIA_TYPE.into_response();
            }
        }
    }
    next.run(req).await
}
```
A request with no body cannot carry a CSRF payload, so skipping the check is safe. Simple logic, but it took us an embarrassingly long debugging session to reach that conclusion because the 415 error made us look at the wrong layer of the stack.
BUG-014: App Name Uniqueness Was Global, Not Per-Project
Symptom: Creating an app named "api" in Project B failed with a uniqueness constraint error, because Project A already had an app named "api". Users expected app names to be unique within their project, not across the entire platform.
Investigation: The apps table had a simple UNIQUE constraint on the name column. This was correct during the early phases when sh0 had no project concept, but Phase 19 introduced multi-project support without updating the constraint.
Fix: Migration 016 recreated the apps table with a composite unique index:
```sql
CREATE UNIQUE INDEX idx_apps_name_project ON apps(name, project_id);
```

But the fix did not stop there. Container names also needed to be project-scoped. Two apps named "api" in different projects would produce the same Docker container name sh0-api, causing collisions. We introduced a container_prefix() helper:
```rust
fn container_prefix(app_name: &str, project_id: Option<&str>) -> String {
    match project_id {
        Some(pid) => format!("sh0-{}-{}", &pid[..8], app_name),
        None => format!("sh0-{}", app_name),
    }
}
```

The first 8 characters of the project UUID serve as a namespace prefix. This change propagated through all six pipeline variants (git, Docker image, Dockerfile, upload, template, compose), the scaling system, and the stop/start/restart handlers. Every place that constructed a container name needed updating -- a reminder that naming is one of the two hard problems in computer science.
BUG-017: ZIP Upload Blocked by CSRF and Body Limits
Symptom: Uploading a ZIP file through the dashboard produced a silent failure. The upload appeared to start, then the API returned an error. No useful error message in the browser. The server logs showed a 415 status code.
Investigation: This bug had two root causes stacked on top of each other, which is why it was so difficult to diagnose.
Root cause 1: The CSRF middleware required Content-Type: application/json for all POST requests. ZIP uploads use Content-Type: multipart/form-data. Every upload was rejected before it even reached the handler.
Root cause 2: Even after exempting the upload route from CSRF, files larger than 10 MB were rejected. The global body size limit was 10 MB, applied as an Axum middleware layer. The upload handler declared a 100 MB limit in its own extractor, but the global limit hit first.
Fix: Two changes:
```rust
// 1. Exempt the upload route from CSRF (auth via Bearer token is sufficient)
let csrf_exempt = vec!["/api/v1/webhooks", "/api/v1/apps/upload"];

// 2. Per-route body limit override
Router::new()
    .route("/api/v1/apps/upload", post(upload_app))
    .layer(DefaultBodyLimit::max(500 * 1024 * 1024)) // 500 MB
```
This led to a broader audit of all rate limits and size caps. sh0 is a self-hosted platform -- users own their server. Aggressive throttling made the platform feel broken during normal usage. We relaxed API reads from 300/min to 1000/min, writes from 120/min to 500/min, and the global body limit from 10 MB to 50 MB. A self-hosted PaaS should not fight its own administrator.
The FTP IPv6/EPSV Nightmare
Symptom: FTP and FTPS connections to Hetzner Storage Box failed from the sh0 backup system. The exact same server worked perfectly when connected via Transmit, a macOS FTP client.
Investigation: This was a three-layer problem that took an entire session to untangle.
Layer 1: The Hetzner Storage Box DNS resolved to an IPv6 address (2a01:...). Our FTP library (suppaftp, via OpenDAL) used PASV mode, which is an IPv4-only command. When connected over IPv6, PASV returns an IPv4 address that the server cannot listen on. The server responded with: 421 Could not listen for passive connection: invalid passive IP "[2a01". That truncated address in the error message was the clue -- it was trying to parse an IPv6 address as an IPv4 passive-mode response.
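To see why the error message contains that truncated `[2a01` fragment, it helps to recall what a PASV reply actually encodes. Here is a minimal sketch -- not sh0 code, and pasv_reply is a hypothetical helper -- showing that the reply format has room for exactly four IPv4 octets and nothing else:

```rust
use std::net::{IpAddr, Ipv4Addr};

// A PASV reply encodes an IPv4 address and port as six decimal bytes:
// "227 Entering Passive Mode (h1,h2,h3,h4,p1,p2)". There is simply no
// slot for an IPv6 address, which is why EPSV (RFC 2428) exists.
fn pasv_reply(ip: IpAddr, port: u16) -> Option<String> {
    match ip {
        IpAddr::V4(v4) => {
            let [a, b, c, d] = v4.octets();
            Some(format!(
                "227 Entering Passive Mode ({a},{b},{c},{d},{},{})",
                port / 256,
                port % 256
            ))
        }
        // An IPv6 peer cannot be represented in PASV at all; a server
        // that tries anyway produces garbage like `invalid passive IP "[2a01`.
        IpAddr::V6(_) => None,
    }
}

fn main() {
    let v4 = pasv_reply(IpAddr::V4(Ipv4Addr::new(192, 0, 2, 10)), 2121);
    assert_eq!(
        v4.as_deref(),
        Some("227 Entering Passive Mode (192,0,2,10,8,73)")
    );
    assert!(pasv_reply("2a01:db8::1".parse().unwrap(), 2121).is_none());
}
```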
Layer 2: The fix for IPv6 FTP is EPSV (Extended Passive Mode), which is protocol-agnostic. But OpenDAL's FTP backend did not expose set_mode(Mode::ExtendedPassive). There was no configuration option, no escape hatch.
Layer 3: We considered resolving the hostname to IPv4 manually and connecting directly. But for FTPS (FTP over TLS), the TLS SNI hostname must match the certificate. If we connected to an IPv4 address but sent the hostname for SNI, it might work -- or might not, depending on the server's certificate configuration.
Fix: We bypassed OpenDAL entirely for FTP/FTPS and wrote a direct suppaftp client:
```rust
pub struct FtpClient {
    host: String,
    port: u16,
    username: String,
    password: String,
    use_tls: bool,
}

impl FtpClient {
    pub async fn connect(&self) -> Result<AsyncNativeTlsFtpStream> {
        // ... establish the plain connection first ...
        if self.use_tls {
            ftp = ftp.into_secure(
                AsyncNativeTlsConnector::from(tls_connector),
                &self.host, // Correct SNI hostname
            ).await?;
        }
        ftp.login(&self.username, &self.password).await?;
        ftp.set_mode(Mode::ExtendedPassive); // EPSV -- works with IPv4 and IPv6
        Ok(ftp)
    }
}
```
The StorageBackend was refactored with an Engine enum -- OpenDAL for S3/R2/SFTP/cloud providers, FtpClient for FTP/FTPS. We also fixed a UI bug where the default port field showed 22 (SFTP) even when the user selected FTP as the provider type.
Docker Network Aliases: The Silent Failure
Symptom: Template deployments with multiple services (like WordPress + MySQL) would start both containers, but the application container could not reach the database container by hostname. WordPress showed "Error establishing a database connection."
Investigation: Docker containers on the same network can reach each other by container name, but only if the network aliases are correctly configured. When we created containers with docker create, we attached them to the sh0-net network. But we were not setting the network alias to match the service name from the template.
The template defined depends_on: [mysql] and the WordPress environment referenced mysql:3306 as the database host. The container was named sh0-wordpress-mysql, but the network alias was also sh0-wordpress-mysql -- not mysql. The hostname resolution failed because nothing on the network was listening as just "mysql."
Fix: When connecting a container to the network, set the alias to the service name from the template:
```rust
docker.network_connect(
    "sh0-net",
    &container_id,
    Some(vec![service.name.clone()]), // Alias matches template service name
).await?;
```

This way, a container created from a template service named "mysql" is reachable as "mysql" on the Docker network, regardless of what the actual container name is. A two-line fix for a bug that made every multi-service template deployment fail.
The Lesson: Infrastructure Bugs Are Different
Application bugs crash your program. Infrastructure bugs break the programs you are supposed to be running for other people. The failure mode is almost always silence -- the container starts but cannot reach the database, the deploy succeeds but the route does not work, the upload goes through but the file is rejected by a middleware layer the user cannot see.
Every one of these bugs reinforced the same lesson: log everything, trust nothing, and always have a fallback. The fresh clone fallback for BUG-009. The platform-aware data path for BUG-010. The CSRF exemption for uploads. The direct FTP client bypassing the abstraction layer. The explicit network alias instead of relying on Docker's default behavior.
When you build a PaaS, you are building a platform that runs other people's software. Your bugs become their bugs, except they cannot see the source code. That is why we tracked every failure, wrote it up, and fixed it the same day.
---
This is Part 31 of the "How We Built sh0.dev" series. Next up: how we used utoipa to auto-generate an OpenAPI 3.1 spec from Rust handler annotations, then used that spec to power API docs, an interactive playground, and MCP tool definitions -- three outputs from a single source of truth.