
Production Hardening Phase 2

Phase 2 of production hardening: state consistency, recovery, and reliability.

Thales & Claude | March 25, 2026 | 10 min read

Tags: flin, production-hardening, reliability, state-management

Phase 1 ensured that FLIN would not crash. Phase 2 ensured that when errors occur -- and they will, always, inevitably -- the system's state remains consistent. This is the difference between stability and reliability. A stable system keeps running. A reliable system keeps running correctly.

Session 245 tackled the harder problem. Crashes are dramatic but straightforward to fix: wrap the code in error handling, return an error response, move on. State corruption is insidious. It happens silently. The system keeps running, but the data is wrong. An entity is half-saved. A foreign key points to a deleted record. The WAL has entries that were written but never committed. These bugs do not announce themselves. They surface hours or days later, when a user notices that their data is missing or inconsistent.

The State Consistency Problem

FLIN's data layer is built on ZeroCore, a custom embedded database engine that stores entities in a B-tree structure with write-ahead logging. When a FLIN program saves an entity, several things happen in sequence:

1. The entity is validated against its schema (types, constraints, decorators).
2. The entity is serialized to the storage format.
3. A WAL entry is written to disk.
4. The entity is inserted into the in-memory B-tree.
5. Foreign key references are updated.
6. Search indexes are updated (BM25, vector embeddings).
7. Cache entries for the entity type are invalidated.

If any step after step 3 fails, the WAL contains an entry for an operation that was not fully applied. On restart, the WAL replay would attempt to re-apply the operation, potentially encountering the same failure. This is the classic write-ahead log recovery problem, and getting it wrong means data loss or corruption.
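The failure window can be illustrated with a toy model in plain Rust -- a `Vec` standing in for the WAL and a `HashMap` for the B-tree. All names here are illustrative, not ZeroCore's actual API:

```rust
use std::collections::HashMap;

// Toy model: a WAL of pending inserts and an "in-memory B-tree".
#[derive(Clone)]
struct WalEntry {
    key: String,
    value: i64,
}

struct ToyStore {
    wal: Vec<WalEntry>,
    tree: HashMap<String, i64>,
}

impl ToyStore {
    fn new() -> Self {
        ToyStore { wal: Vec::new(), tree: HashMap::new() }
    }

    // Save: WAL write first, then apply. `crash_after_wal` simulates a
    // process crash between step 3 (WAL) and step 4 (B-tree insert).
    fn save(&mut self, key: &str, value: i64, crash_after_wal: bool) {
        self.wal.push(WalEntry { key: key.to_string(), value });
        if crash_after_wal {
            return; // crash: the WAL has the entry, the tree does not
        }
        self.tree.insert(key.to_string(), value);
    }

    // Recovery: replay every pending WAL entry. Inserts are idempotent,
    // so replaying an already-applied entry is harmless.
    fn recover(&mut self) {
        for entry in self.wal.clone() {
            self.tree.insert(entry.key, entry.value);
        }
    }
}

fn main() {
    let mut store = ToyStore::new();
    store.save("a", 1, false); // normal save
    store.save("b", 2, true);  // "crash" after the WAL write

    assert_eq!(store.tree.get("b"), None); // tree is behind the WAL
    store.recover();                       // replay closes the gap
    assert_eq!(store.tree.get("b"), Some(&2));
}
```

The sketch shows why replay must tolerate re-application: entry `"a"` is replayed even though it was already applied, and the result is still correct.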

Transaction Boundaries

The first reliability fix was introducing explicit transaction boundaries around multi-step operations. Before Phase 2, each step ran independently -- a failure at step 5 left steps 1 through 4 applied and steps 5 through 7 incomplete.

```rust
pub fn save_entity(
    &mut self,
    entity_type: &str,
    values: &ValueMap,
) -> Result<EntityId, RuntimeError> {
    // Begin transaction
    let txn = self.storage.begin_transaction()?;

    // Step 1: Validate
    let validated = self
        .validate_entity(entity_type, values)
        .map_err(|e| {
            txn.rollback();
            e
        })?;

    // Step 2: Serialize
    let serialized = self.serialize_entity(&validated).map_err(|e| {
        txn.rollback();
        e
    })?;

    // Step 3: WAL write
    txn.write_wal_entry(WalEntry::Insert {
        entity_type: entity_type.to_string(),
        data: serialized.clone(),
    })?;

    // Step 4: B-tree insert
    let id = txn.insert_btree(entity_type, &serialized).map_err(|e| {
        txn.rollback();
        e
    })?;

    // Step 5: Foreign keys
    self.update_foreign_keys(&txn, entity_type, id, &validated)
        .map_err(|e| {
            txn.rollback();
            e
        })?;

    // Step 6: Search indexes (non-critical -- failure logged, not fatal)
    if let Err(e) = self.update_search_indexes(entity_type, id, &validated) {
        log::warn!("Search index update failed for {}#{}: {}", entity_type, id, e);
        // Do NOT rollback -- search indexes can be rebuilt
    }

    // Step 7: Cache invalidation (always succeeds)
    self.invalidate_cache(entity_type);

    // Commit transaction
    txn.commit()?;

    Ok(id)
}
```

The critical insight is the distinction between critical and non-critical steps. Steps 1 through 5 (validation, serialization, WAL, B-tree, foreign keys) are critical -- if any of them fails, the entire operation must be rolled back. Steps 6 and 7 (search indexes, cache) are non-critical -- they can be rebuilt from the primary data. A search index update failure is logged as a warning, but the save operation succeeds.

This distinction prevents a situation where a transient search index error causes data loss. The entity is saved. The search index might be temporarily stale. That is a degraded state, not a corrupt state.

WAL Recovery

The write-ahead log is the foundation of durability. Every mutation to the database is first written to the WAL before being applied to the in-memory data structures. On startup, the WAL is replayed to recover any operations that were written but not yet checkpointed.

Before Phase 2, WAL recovery was optimistic -- it assumed every WAL entry could be replayed successfully. In practice, some WAL entries referenced entity types that had been modified (schema changes between crash and restart), and replay would fail.

```rust
pub fn recover_from_wal(&mut self) -> RecoveryReport {
    let mut report = RecoveryReport::new();
    let entries = self.wal.read_pending_entries();

    for entry in entries {
        match self.replay_wal_entry(&entry) {
            Ok(()) => {
                report.entries_recovered += 1;
            }
            Err(e) => {
                log::error!(
                    "WAL recovery failed for entry {}: {}",
                    entry.sequence_number,
                    e
                );

                // Categorize the failure
                match e.kind() {
                    ErrorKind::SchemaChanged => {
                        // Entity schema changed -- attempt migration
                        match self.migrate_and_replay(&entry) {
                            Ok(()) => {
                                report.entries_migrated += 1;
                            }
                            Err(migrate_err) => {
                                report.entries_failed.push(FailedEntry {
                                    sequence: entry.sequence_number,
                                    error: migrate_err.to_string(),
                                    data: entry.to_json(),
                                });
                            }
                        }
                    }
                    ErrorKind::ConstraintViolation => {
                        // FK target deleted -- skip entry, log for manual review
                        report.entries_skipped += 1;
                    }
                    _ => {
                        report.entries_failed.push(FailedEntry {
                            sequence: entry.sequence_number,
                            error: e.to_string(),
                            data: entry.to_json(),
                        });
                    }
                }
            }
        }
    }

    // Write failed entries to recovery file for manual inspection
    if !report.entries_failed.is_empty() {
        self.write_recovery_file(&report.entries_failed);
        log::warn!(
            "WAL recovery: {} entries recovered, {} migrated, {} skipped, {} failed (see .flindb/recovery.json)",
            report.entries_recovered,
            report.entries_migrated,
            report.entries_skipped,
            report.entries_failed.len(),
        );
    }

    report
}
```

The recovery process now handles three failure categories:

1. Schema changes: If an entity type was modified between the crash and restart, the WAL entry references the old schema. The recovery process attempts to migrate the old data to the new schema using the same coercion rules as the migration system.

2. Constraint violations: If a WAL entry references a foreign key target that no longer exists (because the target was deleted in a later WAL entry), the entry is skipped. This is safe because the referencing entity is orphaned regardless.

3. Unknown failures: Any other failure is recorded in .flindb/recovery.json with the full WAL entry data, allowing manual inspection and recovery.

Checkpoint Safety

Checkpointing is the process of writing the in-memory B-tree state to disk and truncating the WAL. Before Phase 2, a crash during checkpointing could leave the database in an inconsistent state -- the WAL was partially truncated, but the B-tree file was not fully written.

We implemented atomic checkpointing using a write-rename strategy:

```rust
pub fn checkpoint(&mut self) -> Result<(), StorageError> {
    // Step 1: Write B-tree to a temporary file
    let temp_path = self.db_path.join(".flindb/btree.tmp");
    self.btree.write_to_file(&temp_path)?;

    // Step 2: Sync the temporary file to disk
    let file = File::open(&temp_path)?;
    file.sync_all()?;

    // Step 3: Atomically rename temp file to final path
    let final_path = self.db_path.join(".flindb/btree.db");
    std::fs::rename(&temp_path, &final_path)?;

    // Step 4: Sync the directory entry
    let dir = File::open(self.db_path.join(".flindb/"))?;
    dir.sync_all()?;

    // Step 5: Only NOW truncate the WAL
    self.wal.truncate()?;

    // Step 6: Save blob reference index
    if let Some(ref blob_index) = self.blob_ref_index {
        blob_index.save_to_disk()?;
    }

    Ok(())
}
```

The key property is that the WAL is only truncated after the B-tree file is fully written and synced. If the process crashes at any point during checkpointing:

  • Before step 3: At most a temporary file exists, possibly partially written; the old B-tree file is intact. On restart, the temp file is ignored, and the WAL replays against the old B-tree.
  • After step 3, before step 5: The new B-tree file is written, but the WAL still contains the entries. On restart, replaying the WAL against the new B-tree is idempotent -- the entries are already applied.
  • After step 5: Checkpoint is complete. No recovery needed.

This guarantees that no data is lost regardless of when a crash occurs.

Foreign Key Consistency

Foreign key relationships create dependencies between entities. Deleting a User that has associated Post records requires cascading the delete to the posts, or restricting the delete, or setting the foreign key to null. Before Phase 2, these cascading operations were not atomic -- a crash mid-cascade could leave orphaned posts with invalid user references.

```rust
fn cascade_delete(
    &mut self,
    txn: &Transaction,
    entity_type: &str,
    entity_id: EntityId,
) -> Result<Vec<EntityId>, RuntimeError> {
    let mut deleted = vec![entity_id];

    // Find all entity types with FK references to this entity
    let dependents = self.schema.find_dependents(entity_type);

    for dep in dependents {
        match dep.on_delete {
            OnDelete::Cascade => {
                // Find and delete all dependent entities
                let refs = txn.find_by_fk(&dep.entity_type, &dep.field, entity_id)?;

                for ref_id in refs {
                    // Recursive cascade (within same transaction)
                    let sub_deleted =
                        self.cascade_delete(txn, &dep.entity_type, ref_id)?;
                    deleted.extend(sub_deleted);
                }
            }
            OnDelete::Restrict => {
                let count = txn.count_by_fk(&dep.entity_type, &dep.field, entity_id)?;
                if count > 0 {
                    return Err(RuntimeError::new(
                        "ForeignKeyConstraint",
                        &format!(
                            "Cannot delete {}#{}: {} {} records reference it",
                            entity_type, entity_id, count, dep.entity_type
                        ),
                        Span::default(),
                    ));
                }
            }
            OnDelete::SetNull => {
                txn.nullify_fk(&dep.entity_type, &dep.field, entity_id)?;
            }
        }
    }

    txn.delete(entity_type, entity_id)?;
    Ok(deleted)
}
```

Because the entire cascade operation runs within a single transaction, a crash mid-cascade rolls back all changes. Either all dependent entities are deleted (or nullified), or none are.

Idempotent Operations

A reliable system must handle retries safely. If a client sends a request, the server processes it, but the response is lost (network failure), the client will retry. If the operation is not idempotent, the retry creates a duplicate.

We added idempotency support for entity creation:

```flin
route POST "/api/orders" {
    validate {
        product_id: int @required
        quantity: int @required
        idempotency_key: text?
    }

    // If an idempotency key is provided, check for an existing operation
    if body.idempotency_key != none {
        existing = IdempotencyLog.where(key == body.idempotency_key).first

        if existing != none {
            // Return the same response as the original operation
            return existing.response_data
        }
    }

    order = Order {
        product_id: body.product_id,
        quantity: body.quantity,
        status: "pending"
    }
    save order

    response_data = { id: order.id, status: order.status }

    // Log the idempotency key and response
    if body.idempotency_key != none {
        save IdempotencyLog {
            key: body.idempotency_key,
            response_data: response_data,
            expires_at: now() + 24h
        }
    }

    response_data
}
```

The idempotency log is a built-in entity type with automatic expiration. Keys older than 24 hours are purged by the garbage collection system. This ensures that retried requests produce the same result without duplicating data.
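The lookup-or-execute pattern behind this can be sketched in a few lines of Rust. This is a toy model, not FLIN's implementation -- the type and method names are illustrative:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Toy idempotency log: key -> (cached response, expiry time).
struct IdempotencyLog {
    entries: HashMap<String, (String, Instant)>,
    ttl: Duration,
}

impl IdempotencyLog {
    fn new(ttl: Duration) -> Self {
        IdempotencyLog { entries: HashMap::new(), ttl }
    }

    // Run `op` at most once per key: a retry with the same key returns
    // the cached response instead of re-executing the operation.
    fn run_once<F: FnOnce() -> String>(&mut self, key: &str, op: F) -> String {
        if let Some((response, expires_at)) = self.entries.get(key) {
            if Instant::now() < *expires_at {
                return response.clone();
            }
        }
        let response = op();
        self.entries
            .insert(key.to_string(), (response.clone(), Instant::now() + self.ttl));
        response
    }
}

fn main() {
    let mut log = IdempotencyLog::new(Duration::from_secs(24 * 3600));
    let mut creates = 0;

    // First request creates the order; the retry hits the cache.
    let r1 = log.run_once("key-1", || { creates += 1; format!("order-{}", creates) });
    let r2 = log.run_once("key-1", || { creates += 1; format!("order-{}", creates) });

    assert_eq!(r1, r2);
    assert_eq!(creates, 1); // the operation ran exactly once
}
```

The expiry check on read mirrors the 24-hour purge: an expired entry is simply treated as absent and overwritten on the next request.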

Health Check Endpoint

A reliable system must be able to report its own health. Load balancers, container orchestrators, and monitoring systems need to know whether the application is ready to receive traffic.

FLIN automatically exposes a /_flin/health endpoint that reports:

  • Status: "healthy", "degraded", or "unhealthy"
  • Database: whether the storage engine is accessible
  • WAL: whether the WAL is writable
  • Memory: current usage vs. budget
  • Uptime: seconds since server start

A "degraded" status means the system is operational but some non-critical subsystem has failed (search index, cache, cron scheduler). An "unhealthy" status means the system cannot serve requests reliably (database inaccessible, WAL full).

The Reliability Guarantee

After Phase 2, FLIN provided the following guarantees:

1. Atomicity: Entity save, update, and delete operations are atomic. They either fully succeed or fully roll back. No partial writes.

2. Durability: Every committed operation is written to the WAL before acknowledgment. A crash after acknowledgment will recover the operation on restart.

3. Consistency: Foreign key constraints are enforced within transactions. Cascade operations are atomic. Orphaned references cannot exist.

4. Recovery: WAL replay handles schema changes, constraint violations, and unknown failures with categorized recovery strategies. Failed entries are preserved for manual inspection.

These are not theoretical properties. Each one is backed by tests that simulate crashes at specific points in the operation sequence and verify that recovery produces the correct state. We introduced 84 new tests in Phase 2, all targeting state consistency under failure conditions.
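The crash-simulation tests follow a pattern like the one below -- a simplified sketch with a toy store and injected failures, not the real ZeroCore test harness:

```rust
use std::collections::HashMap;

// Toy transactional store with fault injection: `fail_at_step` aborts
// the save at a given step, and the transaction buffer is simply
// dropped -- nothing reaches the committed state.
struct ToyDb {
    committed: HashMap<String, i64>,
}

impl ToyDb {
    fn new() -> Self {
        ToyDb { committed: HashMap::new() }
    }

    // Save runs three "steps"; a failure at any step before commit
    // must leave `committed` untouched.
    fn save(&mut self, key: &str, value: i64, fail_at_step: Option<u8>) -> Result<(), String> {
        let mut txn: HashMap<String, i64> = HashMap::new();
        for step in 1..=3u8 {
            if fail_at_step == Some(step) {
                // Dropping `txn` here is the rollback.
                return Err(format!("injected failure at step {}", step));
            }
            if step == 2 {
                txn.insert(key.to_string(), value); // the actual write
            }
        }
        self.committed.extend(txn); // commit
        Ok(())
    }
}

fn main() {
    // Simulate a crash at every step and verify atomicity each time.
    for step in 1..=3u8 {
        let mut db = ToyDb::new();
        assert!(db.save("order", 42, Some(step)).is_err());
        assert!(db.committed.is_empty()); // rollback: no partial writes
    }

    // No injected failure: the write commits.
    let mut db = ToyDb::new();
    db.save("order", 42, None).unwrap();
    assert_eq!(db.committed.get("order"), Some(&42));
}
```

The real tests iterate the crash point over every step of the save, cascade, and checkpoint sequences in the same way.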

The system was now stable and reliable. Phase 3 would make it fast.

---

This is Part 182 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.

Series Navigation:
- [181] Production Hardening Phase 1: Stability
- [182] Production Hardening Phase 2: Reliability (you are here)
- [183] Production Hardening Phase 3: Performance
