
Autoscaling in Rust: CPU Thresholds, Cooldowns, and Load Balancing

How we built horizontal scaling with Caddy load balancing, replica container management, and an autoscaler that evaluates CPU/memory thresholds with configurable cooldowns.

Thales & Claude | March 25, 2026 · 9 min read
autoscaling · load-balancing · caddy · rust · containers · devops

A PaaS that cannot scale is a PaaS with a ceiling. Below that ceiling, everything works. Above it, your users' applications start dropping requests, timing out, and losing customers. The ceiling is usually one container on one server, and when traffic spikes, the only option is to SSH in and manually start more instances.

We needed horizontal scaling: the ability to run multiple replicas of the same application, distribute traffic across them, and -- critically -- add or remove replicas automatically based on real-time CPU and memory metrics. Manual scaling for developers who know what they need. Autoscaling for everyone else.

This is how we built it: a replica manager, a Caddy-backed load balancer, and an autoscaler that evaluates two-minute rolling averages every 30 seconds with configurable cooldowns to prevent oscillation.

The Scaling Model

Horizontal scaling in sh0 operated at the container level. Each "replica" was an independent Docker container running the same image with the same environment variables, connected to the same network. The difference between one replica and five was four additional containers and a load balancer configuration that knew about all of them.

The data model was a ScalingConfig associated with each app:

pub struct ScalingConfig {
    pub app_id: String,
    pub min_replicas: i32,
    pub max_replicas: i32,
    pub current_replicas: i32,
    pub cpu_threshold: f64,     // Scale up when avg CPU exceeds this (e.g., 80.0)
    pub memory_threshold: f64,  // Scale up when avg memory exceeds this
    pub cooldown_seconds: i64,  // Minimum time between scaling events
    pub lb_policy: String,      // "round_robin", "least_conn", "ip_hash"
    pub autoscale_enabled: bool,
    pub last_scaled_at: Option<DateTime<Utc>>,
}

Manual scaling set current_replicas directly and disabled autoscaling. Autoscaling set the thresholds and let the background task manage the replica count within the configured bounds.

Manual Scaling: The CLI Path

The simplest scaling operation was explicit:

# Scale to 3 replicas
sh0 scale my-app 3

# Enable autoscaling
sh0 scale my-app --auto

# Check current scaling status
sh0 scale my-app --status

When a user ran sh0 scale my-app 3, the API handler performed the following sequence:

1. Validated that the requested replica count was between 1 and the maximum (default 10)
2. Determined the current replica count by listing containers with the app's label
3. If scaling up: created new containers from the same image, environment, and volume configuration
4. If scaling down: stopped and removed the excess containers, starting from the highest-numbered replica
5. Updated the Caddy load balancer configuration with the new set of upstream addresses
6. Updated the database record with the new replica count
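The bounds check in step 1 amounts to clamping the request into the configured range -- a minimal sketch (the helper name is ours, not the actual handler's):

```rust
/// Hypothetical helper: clamp a requested replica count into the allowed
/// range (1 to max_replicas, default 10), as described in step 1 above.
fn clamp_replicas(requested: i32, min: i32, max: i32) -> i32 {
    requested.max(min).min(max)
}
```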

Each replica container was named with a suffix: my-app-1, my-app-2, my-app-3. The naming convention was not just cosmetic -- it made it trivial to identify which replica was which in logs, metrics, and the Docker container list.
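The convention is simple enough to sketch; scaling down removes names from the end of this list first:

```rust
/// Sketch of the replica naming scheme: 1-based numeric suffixes on the
/// app name. Scale-down removes the highest-numbered replicas first.
fn replica_names(app: &str, count: i32) -> Vec<String> {
    (1..=count).map(|i| format!("{app}-{i}")).collect()
}
```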

Load Balancing via Caddy

sh0 already used Caddy as its reverse proxy (see earlier articles in this series). For single-container apps, Caddy routed traffic to one upstream. For scaled apps, Caddy needed to distribute traffic across multiple upstreams.

When the replica count changed, the proxy manager rebuilt the Caddy route configuration with all active replica addresses:

{
  "handle": [{
    "handler": "reverse_proxy",
    "upstreams": [
      { "dial": "172.18.0.5:8080" },
      { "dial": "172.18.0.6:8080" },
      { "dial": "172.18.0.7:8080" }
    ],
    "load_balancing": {
      "selection_policy": {
        "policy": "round_robin"
      }
    }
  }]
}

Three load balancing policies were supported:

  • round_robin -- each request goes to the next upstream in sequence. The default, and appropriate for stateless applications.
  • least_conn -- each request goes to the upstream with the fewest active connections. Better for applications with variable request durations.
  • ip_hash -- requests from the same client IP always go to the same upstream. Necessary for applications with server-side sessions that are not shared across instances.

The policy was configurable per app through the dashboard's scaling tab or the CLI. Changing the policy triggered an immediate Caddy configuration reload without dropping active connections.

The Autoscaler: A Background Task

The autoscaler was a background task that ran on a configurable interval (default 30 seconds). Its design prioritized stability over responsiveness -- in autoscaling, the worst outcome is oscillation, where the system rapidly scales up and down, creating more instability than the original load spike.

pub struct AutoScalerContext {
    db: Arc<DbPool>,
    docker: Arc<DockerClient>,
    proxy: Arc<ProxyManager>,
    master_key: Option<Arc<MasterKey>>,
}

impl AutoScalerContext {
    pub async fn tick(&self) -> Result<()> {
        let configs = ScalingConfig::list_autoscale_enabled(&self.db).await?;

        for config in configs {
            if let Err(e) = self.evaluate_app(&config).await {
                tracing::error!(app_id = %config.app_id, "Autoscale error: {e}");
            }
        }

        Ok(())
    }
}

The tick() function iterated all apps with autoscaling enabled. For each app, evaluate_app() performed a four-step decision:

Step 1: Cooldown Check

if let Some(last_scaled) = config.last_scaled_at {
    let elapsed = Utc::now() - last_scaled;
    if elapsed.num_seconds() < config.cooldown_seconds {
        return Ok(()); // Still in cooldown, skip
    }
}

If the app was scaled within the cooldown window (default 300 seconds / 5 minutes), the evaluator skipped it entirely. This prevented the "scale up, immediately scale down, immediately scale up again" oscillation that plagues naive autoscalers.
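The same guard, sketched with std::time types instead of chrono for a self-contained example:

```rust
use std::time::{Duration, Instant};

/// Skip evaluation while inside the cooldown window. An app that has
/// never been scaled (None) is always eligible.
fn past_cooldown(last_scaled_at: Option<Instant>, cooldown: Duration) -> bool {
    match last_scaled_at {
        None => true,
        Some(t) => t.elapsed() >= cooldown,
    }
}
```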

Step 2: Metric Aggregation

The evaluator fetched the last two minutes of CPU and memory metrics from the database and computed rolling averages:

let cpu_metrics = Metric::query_recent(
    &self.db, &config.app_id, "cpu", Duration::minutes(2)
).await?;

let memory_metrics = Metric::query_recent(
    &self.db, &config.app_id, "memory", Duration::minutes(2)
).await?;

// Require minimum data points to avoid reacting to noise
if cpu_metrics.len() < 10 {
    return Ok(()); // Not enough data, skip
}

let avg_cpu = cpu_metrics.iter().map(|m| m.value).sum::<f64>()
    / cpu_metrics.len() as f64;
let avg_memory = memory_metrics.iter().map(|m| m.value).sum::<f64>()
    / memory_metrics.len() as f64;

The minimum of 10 data points (approximately 100 seconds at the default 10-second collection interval) ensured that the autoscaler did not react to a single CPU spike. Only sustained load triggered scaling.
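Both rules compress into one pure function, where None means "skip this tick":

```rust
/// Rolling average over the collected samples, returned only when the
/// minimum data point count is met; otherwise None (skip this tick).
fn rolling_avg(samples: &[f64], min_points: usize) -> Option<f64> {
    if samples.len() < min_points {
        return None;
    }
    Some(samples.iter().sum::<f64>() / samples.len() as f64)
}
```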

Step 3: Scale Up Decision

if avg_cpu > config.cpu_threshold || avg_memory > config.memory_threshold {
    if config.current_replicas < config.max_replicas {
        let new_count = config.current_replicas + 1;
        self.scale_to(&config, new_count).await?;
        tracing::info!(
            app = %config.app_id,
            from = config.current_replicas,
            to = new_count,
            cpu = avg_cpu,
            "Autoscale UP"
        );
    }
}

Scale-up was conservative: one replica at a time. If the application needed three more replicas, it would take three evaluation cycles (plus cooldowns) to get there. This was deliberate. Adding several replicas at once could overwhelm the Docker daemon and the network, and made it harder to converge on the right final count.

Step 4: Scale Down Decision

else if avg_cpu < config.cpu_threshold * 0.5
    && avg_memory < config.memory_threshold * 0.5
{
    if config.current_replicas > config.min_replicas {
        let new_count = config.current_replicas - 1;
        self.scale_to(&config, new_count).await?;
        tracing::info!(
            app = %config.app_id,
            from = config.current_replicas,
            to = new_count,
            cpu = avg_cpu,
            "Autoscale DOWN"
        );
    }
}

Scale-down used a 50% hysteresis threshold. If the CPU threshold for scaling up was 80%, the system would only scale down when CPU dropped below 40%. This gap prevented the autoscaler from immediately undoing a scale-up when load decreased slightly. The application had to be genuinely underutilized before replicas were removed.
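Steps 3 and 4 together form a three-way decision with the hysteresis band in the middle -- a condensed sketch (replica bounds checks omitted):

```rust
#[derive(Debug, PartialEq)]
enum ScaleDecision {
    Up,
    Down,
    Hold,
}

/// Scale up above either threshold; scale down only below half of both
/// (the 50% hysteresis band); otherwise hold steady.
fn decide(avg_cpu: f64, avg_mem: f64, cpu_thr: f64, mem_thr: f64) -> ScaleDecision {
    if avg_cpu > cpu_thr || avg_mem > mem_thr {
        ScaleDecision::Up
    } else if avg_cpu < cpu_thr * 0.5 && avg_mem < mem_thr * 0.5 {
        ScaleDecision::Down
    } else {
        ScaleDecision::Hold
    }
}
```

With an 80% CPU threshold, an app at 60% average CPU lands in the band between 40% and 80% and holds its current replica count.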

The Dashboard: Scaling Tab

The dashboard added a "Scaling" tab to each application's detail page. The tab contained two panels:

Manual scaling presented a slider from 1 to 10 replicas with the current count displayed prominently. Moving the slider and clicking "Apply" sent the scale request immediately. This was the escape hatch for developers who knew exactly what they needed -- deploy day for a product launch, for instance, where you want five replicas ready before the traffic arrives.

Autoscaling configuration presented toggle controls for enabling autoscaling, input fields for min/max replicas and CPU/memory thresholds, a dropdown for the load balancing policy, and a cooldown duration input. Changes to the autoscaling configuration took effect on the next evaluator tick without restarting the background task.

Both panels showed the current state: how many replicas were running, the current average CPU and memory, and the timestamp of the last scaling event. This feedback loop let users verify that their autoscaling configuration was behaving as expected.

Ownership Challenges

The autoscaler was one of the trickier Rust implementations in the codebase. The background task needed access to the database pool, Docker client, and proxy manager -- all of which were also owned by the API server's AppState. Sharing Arc-wrapped resources across an Axum server and a background Tokio task required careful lifetime management.

The AutoScalerContext pattern solved this without leaking the full AppState (which contained DashMap and other types that complicated trait bounds):

// In main.rs
let autoscaler_ctx = AutoScalerContext {
    db: db_pool.clone(),
    docker: docker_client.clone(),
    proxy: proxy_manager.clone(),
    master_key: master_key.clone(),
};

let autoscale_handle = tokio::spawn(async move {
    let mut interval = tokio::time::interval(
        Duration::from_secs(autoscale_interval)
    );
    loop {
        interval.tick().await;
        if let Err(e) = autoscaler_ctx.tick().await {
            tracing::error!("Autoscaler tick failed: {e}");
        }
    }
});

By cloning the Arcs before passing them to the context, the autoscaler owned independent handles to the shared resources. The API server and the autoscaler could operate concurrently without coordination. The database handled its own concurrency through WAL mode. The Docker client was inherently stateless (each API call was an independent HTTP request over the Unix socket). The proxy manager used interior mutability for its route table.
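The pattern generalizes beyond Tokio. A minimal illustration with plain threads standing in for async tasks:

```rust
use std::sync::Arc;
use std::thread;

/// Each side clones its own Arc handle before the spawn, so the "server"
/// can keep using the shared resource while the "background task" runs.
fn share_across_tasks() -> (usize, usize) {
    let pool = Arc::new(vec![1, 2, 3]); // stand-in for the shared DbPool
    let for_task = Arc::clone(&pool);   // handle moved into the task
    let handle = thread::spawn(move || for_task.len());
    let server_view = pool.len();       // original handle still usable
    (server_view, handle.join().unwrap())
}
```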

The Result

At the end of the session, 374 tests passed. The scaling system supported manual scaling from 1 to 10 replicas, autoscaling with CPU and memory thresholds, three load balancing policies, configurable cooldowns, and a dashboard UI with real-time status display.

The autoscaler was conservative by design. It scaled up one replica at a time, only when two minutes of sustained high load confirmed the need. It scaled down only when load dropped to half the threshold. It refused to act during cooldown periods. And it required a minimum number of metric data points before making any decision.

In autoscaling, the goal is not to react as fast as possible. It is to react correctly, and never to make things worse.

---

Next in the series: Multi-Server BYOS: SSH Tunnels, Image Transfer, and Trust On First Use -- how we enabled users to bring their own servers with SSH tunnels to remote Docker daemons, disk-based image transfer, and TOFU host key verification.
