
FlinDB Hardening for Production

How Session 308 hardened FlinDB for production with CRC-32 WAL checksums, auto-checkpointing, cross-platform file locking, per-entity-type data files, history deduplication, and schema persistence.

Thales & Claude | March 25, 2026 | 10 min read

There is a chasm between "the database works" and "the database is production-ready." On one side: correct CRUD, passing tests, features that work in development. On the other side: data integrity under power loss, concurrent access protection, storage efficiency at scale, and self-describing databases that can recover without the original source code.

Session 308 bridged that chasm. Six hardening features, each addressing a specific production failure mode. CRC-32 checksums to detect WAL corruption. Auto-checkpointing to prevent unbounded WAL growth. File locking to prevent concurrent process corruption. Per-entity-type data files to eliminate filesystem bottlenecks. WAL history deduplication to prevent quadratic storage growth. Schema persistence to make the database self-describing.

This is the session that turned FlinDB from "works on my laptop" into "deploy to production."

CRC-32 Checksums: Detecting Corruption

The Problem

Disk corruption, power loss, and incomplete writes can leave the WAL with garbled entries. Without integrity checks, replaying a corrupted WAL could silently insert wrong data into the database. The database would appear to work normally, but some records would contain garbage values -- the worst kind of bug, because it is invisible until someone reads the corrupted data.

The Solution

Every WAL entry is now prefixed with a CRC-32 checksum:

CRC:a1b2c3d4\t{"op":"Save","entity_type":"Todo","data":{...}}
CRC:e5f6a7b8\t{"op":"Delete","entity_type":"Todo","id":42}

The format is:

CRC:{hex_checksum}\t{json_payload}

The checksum is computed over the raw JSON bytes (everything after the tab character). During WAL replay, each entry is verified:

Read WAL line
    |
    +-- Starts with "CRC:" ?
    |   +-- Yes: Extract checksum + payload
    |   |        Compute CRC-32 of payload
    |   |        Match?    --> Apply entry
    |   |        Mismatch? --> Log WARNING, skip entry
    |   |
    |   +-- No: Parse as legacy plain JSON (backward-compatible)
    |
    +-- Empty/unparseable: Skip

The implementation uses the crc32fast Rust crate, which leverages hardware-accelerated SSE 4.2 instructions on supported CPUs. The performance overhead is negligible -- CRC-32 runs at memory bandwidth speeds, adding microseconds per WAL entry.
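The encode/verify round trip can be sketched in Python with the standard library's zlib.crc32 (the actual FlinDB code is Rust using crc32fast; function names here are illustrative):

```python
import json
import zlib

def encode_wal_entry(payload: dict) -> str:
    """Prefix the compact JSON payload with its CRC-32, hex-encoded."""
    raw = json.dumps(payload, separators=(",", ":"))
    checksum = zlib.crc32(raw.encode("utf-8"))
    return f"CRC:{checksum:08x}\t{raw}"

def decode_wal_entry(line: str):
    """Return the decoded payload, or None if the checksum does not match."""
    if line.startswith("CRC:"):
        header, _, payload = line.partition("\t")
        expected = int(header[len("CRC:"):], 16)
        if zlib.crc32(payload.encode("utf-8")) != expected:
            return None  # corrupted entry: the caller logs a warning and skips it
        return json.loads(payload)
    return json.loads(line)  # legacy entry without a CRC prefix
```

Note that the checksum is verified before the JSON is parsed, so a corrupted payload is rejected even if the corruption happens to leave it syntactically valid.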

Backward Compatibility

Entries written by older versions of FLIN (without the CRC: prefix) are parsed as legacy plain JSON. This means upgrading to the hardened version requires zero migration. Old entries are replayed normally. New entries are written with checksums. After the first checkpoint, the WAL is truncated and all future entries use checksums.

The Corruption Counter

ZeroCore tracks the total number of CRC mismatches encountered during recovery. If non-zero, the server logs a warning at startup:

WARNING: 3 WAL entries skipped due to CRC mismatch (possible disk corruption)

This is a fail-safe approach. Corrupted entries are skipped rather than causing a fatal error. The database recovers as much data as possible, reports how many entries were lost, and continues operating. In practice, a single corrupted entry in a WAL with thousands of entries means the database loses one mutation -- not the entire dataset.

Auto-Checkpointing: Bounding WAL Growth

The Problem

Without checkpointing, the WAL grows indefinitely. A busy application writing hundreds of records per hour accumulates a WAL file that takes increasingly longer to replay on recovery, consumes unbounded disk space, and makes backup files unnecessarily large.

The Solution

Auto-checkpointing with two configurable thresholds:

| Threshold   | Default | Environment Variable      |
|-------------|---------|---------------------------|
| Entry count | 1,000   | FLIN_DB_MAX_WAL_ENTRIES   |
| Byte size   | 10 MB   | FLIN_DB_MAX_WAL_BYTES     |

Whichever threshold is hit first triggers the checkpoint:

App writes --> WAL entry appended --> Check thresholds
                                          |
                               entries >= 1000 OR bytes >= 10MB?
                                          |
                                         Yes
                                          |
                                    CHECKPOINT:
                                    1. Write data/ files
                                    2. Write schema.flindb
                                    3. Truncate WAL

On clean server shutdown (Ctrl+C or SIGTERM), ZeroCore performs a final checkpoint regardless of thresholds. This ensures data files are always up-to-date when the server stops gracefully.
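The threshold check itself is simple. A Python sketch (the helper names are hypothetical; FlinDB's implementation is in Rust), reading the two environment variables with their documented defaults:

```python
import os

def wal_thresholds() -> tuple[int, int]:
    """Read checkpoint thresholds from the environment, falling back to defaults."""
    max_entries = int(os.environ.get("FLIN_DB_MAX_WAL_ENTRIES", "1000"))
    max_bytes = int(os.environ.get("FLIN_DB_MAX_WAL_BYTES", str(10 * 1024 * 1024)))
    return max_entries, max_bytes

def should_checkpoint(entry_count: int, wal_bytes: int) -> bool:
    """True when either threshold is reached -- whichever comes first wins."""
    max_entries, max_bytes = wal_thresholds()
    return entry_count >= max_entries or wal_bytes >= max_bytes
```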

Configuration

# Default: checkpoint every 1000 entries or 10MB
flin dev myapp

# High-traffic: checkpoint more frequently
FLIN_DB_MAX_WAL_ENTRIES=500 FLIN_DB_MAX_WAL_BYTES=5242880 flin dev myapp

# Low-traffic: checkpoint less frequently
FLIN_DB_MAX_WAL_ENTRIES=5000 FLIN_DB_MAX_WAL_BYTES=52428800 flin dev myapp

Industry Comparison

| Database       | Auto-Checkpoint Strategy                    |
|----------------|---------------------------------------------|
| FlinDB         | Every 1,000 entries or 10 MB                |
| SQLite         | Every 1,000 pages (WAL mode)                |
| PostgreSQL     | Every 16 MB WAL segment                     |
| MySQL (InnoDB) | Fuzzy checkpoints based on dirty page ratio |

FlinDB's defaults are conservative -- comparable to SQLite's. For most applications, the default thresholds provide a good balance between checkpoint frequency (affecting write latency) and WAL size (affecting recovery time).

File Locking: Preventing Concurrent Corruption

The Problem

Running two FLIN dev servers against the same .flindb/ directory simultaneously would cause both processes to write to the same WAL. Interleaved entries, data corruption, and unpredictable behavior would follow.

The Solution

A DbLock struct acquires an exclusive file lock on .flindb/lock at startup:

Server Start
    |
    +-- Create/open .flindb/lock
    +-- Acquire exclusive file lock (fs2 crate)
    |   +-- Success: Write PID to lock file, continue
    |   +-- Failure: "Database locked by another process" error, exit
    |
Server Running (lock held)
    |
Server Stop
    |
    +-- Lock released automatically (Rust Drop trait)

The lock uses the fs2 crate for cross-platform compatibility (Windows, macOS, Linux). The lock is exclusive -- no shared/read locks. The PID is written to the lock file for debugging ("which process holds the lock?").

Stale lock handling: If a server crashes without clean shutdown, the OS automatically releases the file lock. The next server start acquires the lock normally -- no manual intervention needed.

Per-Entity-Type Data Files

The Problem

The old storage format created one JSON file per record:

.flindb/data/
+-- Todo_1.json
+-- Todo_2.json
+-- Todo_3.json
+-- ...
+-- Todo_847.json

With a busy application, this means thousands of tiny files. Directory listing becomes slow. Filesystem inode usage increases. Backup tools struggle with many small files. Recovery requires reading and parsing hundreds or thousands of individual files.

The Solution

All records of an entity type are consolidated into a single .flindb file:

.flindb/data/
+-- Todo.flindb            # All 847 Todo records
+-- User.flindb            # All 203 User records
+-- ChatMessage.flindb     # All ChatMessage records

Each file contains a JSON array of all records, including full version history:

[
  {
    "id": 1,
    "title": "Buy groceries",
    "done": false,
    "version": 1,
    "history": []
  },
  {
    "id": 2,
    "title": "Write docs",
    "done": true,
    "version": 3,
    "history": [
      { "version": 1, "data": {"title": "Write docs", "done": false} },
      { "version": 2, "data": {"title": "Write docs", "done": false} }
    ]
  }
]

The Improvement

| Metric                | Old Format                  | New Format                 |
|-----------------------|-----------------------------|----------------------------|
| Files for 1,000 Todos | 1,000 files                 | 1 file                     |
| Directory listing     | Slow (thousands of entries) | Fast (one per entity type) |
| Backup efficiency     | Many small files            | Few larger files           |
| Filesystem overhead   | High (inode per record)     | Minimal                    |
| Recovery speed        | Read + parse 1,000 files    | Read + parse 1 file        |

Backward Compatibility

ZeroCore reads both formats during recovery. Files with .flindb extension are parsed as JSON arrays. Files matching the {Type}_{id}.json pattern are parsed as individual records. After the first checkpoint with the new code, all records are written in the consolidated format.

WAL History Deduplication

The Problem

Before Session 308, every Save WAL entry included the complete version history of the entity. For an entity updated 100 times, the 101st save entry included all 100 previous versions:

Save #1:   { data: {...}, history: [] }                          ~200 bytes
Save #2:   { data: {...}, history: [v1] }                        ~400 bytes
Save #3:   { data: {...}, history: [v1, v2] }                    ~600 bytes
...
Save #100: { data: {...}, history: [v1, v2, ..., v99] }          ~20,000 bytes
Save #101: { data: {...}, history: [v1, ..., v100] }             ~20,200 bytes

Total WAL size for 101 saves of one entity: approximately 1 MB. Mostly redundant history duplicated in every entry.

The Solution

Save WAL entries no longer include the history array. Only the current data is written to the WAL. The complete history is reconstructed during checkpoint from the sequence of WAL entries.

This changes WAL growth from quadratic to linear:

Save #1:   { data: {...} }    ~200 bytes
Save #2:   { data: {...} }    ~200 bytes
...
Save #100: { data: {...} }    ~200 bytes
Save #101: { data: {...} }    ~200 bytes

Total WAL size for 101 saves: approximately 20 KB. A 50x reduction for this example.

The reduction is more dramatic for entities with many fields or large text content. A blog post entity with a 10 KB body field that is edited 50 times would produce a WAL of approximately 25 MB with the old format (quadratic history accumulation) versus 500 KB with deduplication (linear growth).
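The arithmetic behind the 101-save example is easy to check. Assuming a flat ~200 bytes of current data per entry, the old format makes save #n roughly n entries wide (current data plus n-1 history versions), while the deduplicated format keeps every save the same size:

```python
def wal_bytes_old(saves: int, entry_bytes: int = 200) -> int:
    """Old format: save #n also carries the n-1 prior versions."""
    return sum(entry_bytes * n for n in range(1, saves + 1))

def wal_bytes_deduped(saves: int, entry_bytes: int = 200) -> int:
    """New format: every save is the same size -- linear growth."""
    return entry_bytes * saves
```

For 101 saves this gives 1,030,200 bytes (~1 MB) versus 20,200 bytes (~20 KB), matching the roughly 50x reduction quoted above.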

Schema Persistence

The Problem

Before Session 308, the database was not self-describing. If you lost the FLIN source files, you could not interpret the data in .flindb/ -- the field names, types, validators, and constraints were only in the source code.

The Solution

schema.flindb persists the complete entity schema alongside the data:

{
  "schemas": {
    "Todo": {
      "fields": [
        {"name": "title", "type": "String", "required": true},
        {"name": "done", "type": "Boolean", "required": true, "default": false}
      ],
      "constraints": [
        {"type": "Check", "field": "title", "condition": "title.length > 0"}
      ],
      "indexed_fields": ["id"]
    }
  },
  "version": 2,
  "updated_at": "2026-03-15T10:00:00Z"
}

This is analogous to SQLite's sqlite_master table, which stores the CREATE TABLE statements that define the schema. FlinDB's schema.flindb serves the same purpose -- making the database self-describing and recoverable without external information.

The schema is updated during every checkpoint. If the FLIN source code adds a new field or constraint, the next checkpoint captures it. If the source code is lost, the schema file provides enough information to read and interpret all data in the database.

Design Principles

Three principles guided the hardening work:

Zero configuration. Every hardening feature has sensible defaults. CRC checksums are always on. Auto-checkpointing uses reasonable thresholds. File locking happens automatically. The developer does not need to enable any of these features -- they are the default behavior.

Backward compatible. The hardened code reads legacy formats and auto-migrates. Old WAL entries without CRC prefixes are parsed normally. Old per-record JSON files are read during recovery. After the first checkpoint, everything is in the new format. No manual migration step.

Fail-safe. Corrupted WAL entries are skipped, not fatal. A corrupted entry logs a warning and continues. A stale lock file does not prevent restart. A missing schema file triggers schema derivation from entity declarations. The database recovers as much data as possible and keeps running.

These principles reflect the reality of production environments. Power will be lost. Disks will corrupt bits. Processes will crash without clean shutdown. Developers will upgrade without reading the changelog. A production database must handle all of these gracefully, without data loss, and without requiring manual intervention.


This is Part 13 of the "How We Built FlinDB" series, documenting how we built a complete embedded database engine for the FLIN programming language.

Series Navigation: - [066] Database Encryption and Configuration - [067] Tree Traversal and Integration Testing - [068] FlinDB Hardening for Production (you are here) - [069] FlinDB vs SQLite: Why We Built Our Own - [070] Persistence in the Browser
