
Multi-Server BYOS: SSH Tunnels, Image Transfer, and Trust On First Use

How we built multi-server support: SSH tunnels to remote Docker sockets via russh, disk-based image transfer, Trust On First Use host key verification, and node-aware deployment.

Thales & Claude | March 25, 2026 · 9 min read
multi-server · ssh · byos · docker · rust · infrastructure · enterprise

For the first 28 phases of sh0's development, every application ran on a single server. One Docker daemon. One set of resources. One point of failure. That was fine for individual developers and small teams. It was not fine for anyone running production workloads across multiple regions, or anyone who wanted to deploy to servers they already owned.

Multi-server support -- what we called BYOS (Bring Your Own Server) -- was the feature that transformed sh0 from a single-machine PaaS into a multi-node deployment platform. Users could register their own servers, and sh0 would deploy applications to any of them. The technical challenge was significant: we needed to control Docker daemons on remote machines without installing any agent software, transfer container images between servers, verify host keys on first connection, and make every existing API handler node-aware.

The Architecture

The design constraint was clear: zero agent installation. Users should be able to point sh0 at any server with Docker installed and SSH access. They should not need to install a daemon, open a port, or configure a VPN. The solution was SSH tunnels.

sh0 would SSH into the remote server and forward the Docker socket (/var/run/docker.sock) through the tunnel. From that point on, every Docker API call -- creating containers, pulling images, reading logs, collecting stats -- would travel through the SSH tunnel to the remote Docker daemon. The remote server did not need to expose Docker's TCP port to the internet.

```
sh0 binary  --SSH tunnel-->  Remote Server
  |                              |
  v                              v
Local Docker API             Remote Docker API
(Unix socket)                (/var/run/docker.sock via tunnel)
```

SSH Tunnels with russh

We used russh, a pure-Rust SSH implementation, to establish tunnels. Unlike shelling out to the ssh binary, russh gave us programmatic control over connection parameters, key verification, and tunnel lifecycle:

```rust
pub struct SshTunnel {
    session: Handle<SshHandler>,
    local_port: u16,
}

impl SshTunnel {
    pub async fn connect(
        host: &str,
        port: u16,
        username: &str,
        private_key: &str,
        expected_fingerprint: Option<&str>,
    ) -> Result<(Self, String)> {
        let key_pair = decode_secret_key(private_key, None)?;

        let handler = SshHandler {
            expected_fingerprint: expected_fingerprint.map(String::from),
            observed_fingerprint: Arc::new(Mutex::new(None)),
            hostname: host.to_string(),
        };

        let config = Arc::new(russh::client::Config::default());
        let mut session = connect(config, (host, port), handler).await?;
        session.authenticate_publickey(username, Arc::new(key_pair)).await?;

        // Forward a local TCP port to the remote Docker socket
        let local_port = find_available_port()?;
        session.channel_open_direct_tcpip(
            "/var/run/docker.sock",
            0,
            "127.0.0.1",
            local_port as u32,
        ).await?;

        let fingerprint = /* extract observed fingerprint */;

        Ok((SshTunnel { session, local_port }, fingerprint))
    }
}
```

Once the tunnel was established, the Docker client connected to 127.0.0.1:{local_port} instead of the Unix socket. Every Docker API request -- POST /containers/create, GET /containers/{id}/stats, DELETE /containers/{id} -- traveled through the SSH tunnel transparently.
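The find_available_port helper used in connect is not shown above. A common way to implement it -- sketched here with the standard library, not necessarily sh0's exact code -- is to bind port 0 and let the OS pick an ephemeral port:

```rust
use std::net::TcpListener;

// Ask the OS for an unused ephemeral port by binding port 0,
// then read back the port it assigned. The listener is dropped
// on return, releasing the port for the tunnel to bind.
fn find_available_port() -> std::io::Result<u16> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    Ok(listener.local_addr()?.port())
}
```

There is a small race between dropping the listener and the tunnel binding the port, but it is benign in practice because the bind happens immediately afterward.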

Trust On First Use (TOFU)

SSH host key verification is one of those security features that most people click "yes" on without reading. But for a deployment platform that manages remote servers, blindly accepting host keys would be a serious vulnerability. A man-in-the-middle attack could redirect Docker API calls to a malicious server, intercepting container images, environment variables, and secrets.

We implemented Trust On First Use (TOFU), the same model that SSH itself uses:

```rust
pub struct SshHandler {
    expected_fingerprint: Option<String>,
    observed_fingerprint: Arc<Mutex<Option<String>>>,
    hostname: String,
}

impl russh::client::Handler for SshHandler {
    async fn check_server_key(
        &mut self,
        server_public_key: &PublicKey,
    ) -> Result<bool> {
        let fingerprint = server_public_key.fingerprint();
        *self.observed_fingerprint.lock().unwrap() = Some(fingerprint.clone());

        match &self.expected_fingerprint {
            None => {
                // First connection: accept and store
                tracing::info!(
                    host = %self.hostname,
                    fingerprint = %fingerprint,
                    "TOFU: accepting host key on first connection"
                );
                Ok(true)
            }
            Some(expected) if expected == &fingerprint => {
                // Known host, matching key
                Ok(true)
            }
            Some(expected) => {
                // DANGER: key mismatch
                tracing::error!(
                    host = %self.hostname,
                    expected = %expected,
                    observed = %fingerprint,
                    "Host key mismatch -- possible MITM attack"
                );
                Ok(false) // Abort connection
            }
        }
    }
}
```

On first connection, the handler accepted any host key and the fingerprint was stored in the database alongside the node record. On subsequent connections, the handler compared the server's key against the stored fingerprint. A mismatch aborted the connection immediately and logged a security warning.
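The accept/reject decision itself is simple enough to isolate. A minimal sketch of the same three-way logic, written as a hypothetical free function (not sh0's actual code):

```rust
// TOFU decision: no stored fingerprint means first contact (trust and
// store); a stored fingerprint must match exactly on every later
// connection, otherwise the connection is rejected.
fn tofu_accept(expected: Option<&str>, observed: &str) -> bool {
    match expected {
        None => true,               // first connection: accept and store
        Some(e) => e == observed,   // subsequent connections: must match
    }
}
```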

The database migration added a host_key_fingerprint column to the nodes table:

```sql
ALTER TABLE nodes ADD COLUMN host_key_fingerprint TEXT;
```

The fingerprint was stored on every successful connection, whether it was the initial registration, a health check reconnection, or a deployment. This meant the stored fingerprint was always the most recently verified one.

The Node Registry

Managing multiple Docker clients required a registry that mapped node IDs to their corresponding clients. The NodeRegistry used DashMap for concurrent access from multiple API handlers:

```rust
pub struct NodeRegistry {
    local: Arc<DockerClient>,
    remotes: DashMap<String, Arc<DockerClient>>,
}

impl NodeRegistry {
    pub fn get(&self, node_id: Option<&str>) -> Arc<DockerClient> {
        match node_id {
            None => self.local.clone(),
            Some(id) => self.remotes
                .get(id)
                .map(|r| r.value().clone())
                .unwrap_or_else(|| self.local.clone()),
        }
    }

    pub async fn register(
        &self,
        node_id: &str,
        host: &str,
        port: u16,
        username: &str,
        private_key: &str,
        expected_fingerprint: Option<&str>,
    ) -> Result<String> {
        let (tunnel, fingerprint) = SshTunnel::connect(
            host, port, username, private_key, expected_fingerprint,
        ).await?;

        let client = DockerClient::with_tcp(
            &format!("127.0.0.1:{}", tunnel.local_port())
        )?;

        self.remotes.insert(node_id.to_string(), Arc::new(client));
        Ok(fingerprint)
    }
}
```
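The fallback behavior of get -- a missing or unknown node ID resolves to the local client -- can be sketched with a plain HashMap and stand-in types (DashMap behaves the same for this read path; Client here is a hypothetical placeholder for DockerClient):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in for DockerClient, carrying only a label for illustration.
struct Client(&'static str);

struct Registry {
    local: Arc<Client>,
    remotes: HashMap<String, Arc<Client>>,
}

impl Registry {
    // None, or an ID with no registered remote, falls back to local.
    fn get(&self, node_id: Option<&str>) -> Arc<Client> {
        node_id
            .and_then(|id| self.remotes.get(id).cloned())
            .unwrap_or_else(|| self.local.clone())
    }
}
```

The fallback-to-local choice is deliberate: apps created before multi-server support have no node_id, and they keep working without a migration.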

The DockerClient was refactored from a Unix-socket-only implementation to an enum dispatch that supported both Unix sockets and TCP connections:

```rust
enum DockerInner {
    Unix(UnixStream),
    Tcp(TcpStream),
}

impl DockerClient {
    pub fn new() -> Result<Self> { /* Unix socket */ }
    pub fn with_tcp(addr: &str) -> Result<Self> { /* TCP connection */ }

    async fn send(&self, request: &str) -> Result<Vec<u8>> {
        match &self.inner {
            DockerInner::Unix(stream) => /* send via Unix socket */,
            DockerInner::Tcp(stream) => /* send via TCP stream */,
        }
    }
}
```

This was a clean refactor: the 40+ existing Docker API methods (create container, start, stop, exec, stats, logs, etc.) continued to call self.send() without any changes. Only the transport layer was polymorphic.
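The pattern is ordinary enum dispatch. A stripped-down sketch, with Vec<u8> buffers standing in for the two stream types, shows why call sites stay unchanged -- they see one send method regardless of transport:

```rust
use std::io::Write;

// Stand-ins for UnixStream and TcpStream; both variants hold a byte
// buffer here so the example is runnable without real sockets.
enum Transport {
    Unix(Vec<u8>),
    Tcp(Vec<u8>),
}

impl Transport {
    // Callers never branch on the variant; the match lives in one place.
    fn send(&mut self, request: &[u8]) -> std::io::Result<usize> {
        match self {
            Transport::Unix(buf) => buf.write(request),
            Transport::Tcp(buf) => buf.write(request),
        }
    }
}
```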

Image Transfer Between Nodes

When an application was built on the local server and deployed to a remote node, the container image existed only locally. Docker's pull command could not help -- the image was custom-built, not available on any registry. We needed to transfer it.

The solution was disk-based image transfer using Docker's save and load APIs:

```rust
pub async fn transfer_image(
    source: &DockerClient,
    target: &DockerClient,
    image: &str,
) -> Result<()> {
    // Save image from source Docker daemon to a tar archive
    let tar_data = source.save_image(image).await?;

    // Load the tar archive into the target Docker daemon
    target.load_image(&tar_data).await?;

    Ok(())
}
```

The save_image method called GET /images/{name}/get on the source Docker daemon, which returned a tar stream containing all image layers. The load_image method called POST /images/load on the target Docker daemon with that tar stream. The entire image -- layers, metadata, tags -- was transferred in a single operation.

The deploy pipeline was modified to use this transfer mechanism:

```rust
async fn maybe_transfer_image(ctx: &DeployContext, image: &str) -> Result<()> {
    if ctx.node_id.is_some() {
        // Build happened on local Docker, deploy target is remote
        transfer_image(
            &ctx.local_docker,  // Source: local Docker daemon
            &ctx.docker,        // Target: remote Docker daemon (via SSH tunnel)
            image,
        ).await?;
    }
    Ok(())
}
```

Three build pipelines -- git push builds, Dockerfile builds, and file upload builds -- were updated to build with ctx.local_docker and then call maybe_transfer_image() before starting the container on the remote node.

Making Every Handler Node-Aware

The most labor-intensive part of multi-server support was not building the tunnel or the transfer mechanism. It was updating every existing API handler to use the correct Docker client for the app's assigned node. A helper function encapsulated the lookup:

```rust
pub async fn docker_for_app(
    db: &DbPool,
    nodes: &NodeRegistry,
    app_id: &str,
) -> Result<Arc<DockerClient>> {
    let app = App::find_by_id(db, app_id).await?;
    Ok(nodes.get(app.node_id.as_deref()))
}
```

This function was called in 14 handlers: app stop/start/restart/delete, the terminal WebSocket, log streaming, file and volume operations, service inspect/restart/stop/start, container stats, domain container IP inspection, and deployment context construction.

Each handler that previously used state.docker.clone() was changed to use docker_for_app(&state.db, &state.nodes, &app_id).await?. The change was mechanical but critical -- a single handler using the local Docker client for a remote app would silently fail, producing "container not found" errors that would be mystifying to debug.

The Node API

Nodes were managed through a CRUD API gated to Business plan users:

```
GET    /api/v1/nodes          -- List all registered nodes
POST   /api/v1/nodes          -- Register a new node
GET    /api/v1/nodes/:id      -- Get node details
PATCH  /api/v1/nodes/:id      -- Update node configuration
DELETE /api/v1/nodes/:id      -- Remove a node
POST   /api/v1/nodes/:id/test -- Test SSH connection
```

The create endpoint accepted the hostname, SSH port, username, and private key. It established an SSH tunnel, verified the Docker daemon was reachable, stored the host key fingerprint, and registered the node in both the database and the in-memory registry.

A background health monitor task ran every 30 seconds, checking each registered node's tunnel status and reconnecting if necessary. Node status was tracked as online, pending, error, or offline, with the last heartbeat timestamp stored in the database.
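The exact status policy is not spelled out above. As a hypothetical illustration only -- the thresholds are invented, and the error state (set on failed reconnect attempts) is left out of the logic -- a status function driven by heartbeat age might look like this:

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
#[allow(dead_code)]
enum NodeStatus { Online, Pending, Error, Offline }

// Hypothetical policy: within one check interval the node is online,
// within a small grace window it is pending reconnection, and beyond
// that it is marked offline. These thresholds are illustrative.
fn next_status(heartbeat_age: Duration, interval: Duration) -> NodeStatus {
    if heartbeat_age <= interval {
        NodeStatus::Online
    } else if heartbeat_age <= interval * 3 {
        NodeStatus::Pending
    } else {
        NodeStatus::Offline
    }
}
```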

The Dashboard: Node Management

The Settings page gained a "Nodes" section with a table showing each node's name, hostname, status badge (green for online, yellow for pending, red for error, gray for offline), Docker version, and last seen timestamp.

Adding a node opened a modal with fields for name, hostname, port, username, and SSH key. The "Test Connection" button established a tunnel, verified Docker access, and displayed the result without saving the node -- letting users verify their SSH credentials before committing.

The host key fingerprint was displayed read-only in the edit modal, serving as a visual confirmation that the node identity had not changed.

Most importantly, the NodeSelector component was integrated into all seven deploy forms (Git, Dockerfile, Docker Image, Framework, Upload, Service, Compose). When remote nodes existed, a dropdown appeared letting users choose where to deploy. When no remote nodes were registered, the selector was hidden entirely -- the single-server experience remained unchanged.

The Numbers

Multi-server BYOS touched 29 files in the first session and 27 more in the completion session. It added a database migration, a Node model with full CRUD, an SSH tunnel module, a node registry, an image transfer module, a node API with plan gating, a docker_for_app helper used in 14 handlers, a background health monitor, TOFU host key verification, deploy pipeline integration with image transfer, a dashboard node management page, and a node selector in all deploy forms -- with i18n in five languages.

All 53 API tests passed. Zero existing functionality was broken. The single-server experience was completely unaffected unless the user explicitly added a remote node.

That was the goal: additive complexity. Multi-server was a capability that existed when you needed it and was invisible when you did not.

---

Next in the series: Cron Jobs and Preview Environments: Two Features, Zero Downtime -- how we built cron scheduling with timeout enforcement and PR-based preview environments, developed in parallel using git worktree isolation.
