
WAL Race Conditions and SQLite Lessons

The WAL mode race conditions that plagued 0fee.dev's SQLite database and ultimately pushed us to PostgreSQL. By Juste A. Gnimavo.

Thales & Claude | March 25, 2026 | 8 min read | 0fee

Tags: sqlite, wal, race-conditions, database, concurrency

SQLite's Write-Ahead Logging (WAL) mode is marketed as the solution to SQLite's concurrency limitations. Instead of locking the entire database for writes, WAL mode writes changes to a separate log file. Readers can continue reading the main database while a writer appends to the log. It sounds perfect. In practice, it introduced subtle race conditions that took multiple sessions to diagnose and were ultimately one of the primary reasons we migrated to PostgreSQL.

This article documents the WAL-related bugs we encountered, the patches we applied, and why those patches were never quite enough.

How WAL Mode Works

In default journal mode, SQLite locks the database file during writes. Any concurrent reader blocks until the write completes. WAL mode changes this:

```
Main Database File (.db)
    |
    +-- WAL File (.db-wal) -- sequential log of committed-but-unmerged changes
    |
    +-- SHM File (.db-shm) -- shared memory for reader/writer coordination

Write operation:
  1. Append changes to the WAL file
  2. Readers holding an open read transaction keep seeing their snapshot (pre-change)
  3. When the WAL reaches a threshold, a checkpoint merges it back into the .db file
```

The key property: a connection that opened its read transaction before a write was committed keeps seeing its original snapshot. Pooled or long-lived connections can therefore serve stale data until they start a fresh transaction, and the main database file itself is only updated during checkpoint operations.

The get_provider() vs. get_instance() Bug

Session 060 revealed one of the most confusing bugs we encountered. The provider registry had two methods: get_provider() (which returned a class) and get_instance() (which returned an initialized instance). Some code paths called get_provider() expecting an instance and got a class, leading to TypeError: 'type' object is not callable errors.

But the bug was intermittent: it worked sometimes and failed other times. WAL-mode stale reads explained why:

```python
# The provider registry loaded providers from the database at startup
class ProviderRegistry:
    def __init__(self):
        self._providers = {}
        self._instances = {}

    async def load_providers(self):
        """Load all provider configurations from database."""
        # This query might see stale data if WAL hasn't checkpointed
        providers = await db.execute(text("SELECT * FROM providers"))
        for provider in providers:
            self._providers[provider.name] = load_provider_class(provider.type)
            self._instances[provider.name] = self._providers[provider.name](
                config=provider.config
            )

    def get_provider(self, name: str):
        """Returns the provider CLASS."""
        return self._providers.get(name)

    def get_instance(self, name: str):
        """Returns an initialized provider INSTANCE."""
        return self._instances.get(name)
```

When a new provider was added to the database in one connection, and the registry reloaded in another connection, WAL mode could serve the old database state. The registry would load without the new provider. Subsequent calls to get_instance("new_provider") would return None, causing downstream failures.

The immediate fix was to use get_instance() consistently:

```python
# BEFORE: Mixed usage
provider_class = registry.get_provider(provider_name)  # Gets class
result = provider_class.process(...)  # Sometimes works, sometimes TypeError

# AFTER: Consistent usage
provider = registry.get_instance(provider_name)  # Gets instance
if not provider:
    raise HTTPException(400, f"Provider '{provider_name}' not found or not configured")
result = await provider.process(...)
```

PRAGMA wal_checkpoint(PASSIVE) After Critical Commits

Session 072 introduced explicit WAL checkpointing after operations where stale reads would be dangerous:

```python
# database.py
async def checkpoint_wal():
    """Force a WAL checkpoint to sync data to main database file."""
    async with get_db_connection() as conn:
        await conn.execute(text("PRAGMA wal_checkpoint(PASSIVE)"))

# Used after critical operations
async def create_transaction(data: dict) -> Transaction:
    transaction = Transaction(**data)
    db.add(transaction)
    await db.commit()

    # Force checkpoint so webhook handlers see the transaction
    await checkpoint_wal()

    return transaction
```

We identified six critical commit points that required checkpointing:

| Operation | Why Checkpoint Is Needed |
| --- | --- |
| Transaction creation | Webhook handlers must see the transaction immediately |
| Transaction status update | Dashboard queries must reflect the new status |
| Provider configuration | Provider registry reload must see new configs |
| API key creation | Authentication middleware must validate new keys |
| App creation | Routing engine must see new app configurations |
| User creation | OAuth callback must find the newly created user |
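To keep those call sites consistent, the checkpoint call can be centralized in a decorator rather than repeated by hand. This is a hypothetical sketch, not our production code; `checkpoint_wal` is stubbed with a counter here so the example is self-contained:

```python
import asyncio
import functools

calls = []

async def checkpoint_wal():
    """Stand-in for the real helper from database.py above."""
    calls.append("checkpoint")

def checkpointed(fn):
    """After the wrapped critical operation commits, force a WAL
    checkpoint so readers on other connections see the new data."""
    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        result = await fn(*args, **kwargs)
        await checkpoint_wal()
        return result
    return wrapper

@checkpointed
async def create_api_key(name: str) -> str:
    # ... database insert and commit would happen here ...
    return f"key-{name}"

print(asyncio.run(create_api_key("demo")), len(calls))  # → key-demo 1
```

The decorator keeps the checkpoint policy in one place, so adding a seventh critical operation later is a one-line change rather than a copy-pasted pragma.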

Why PASSIVE and Not FULL?

SQLite offers three checkpoint modes:

```sql
PRAGMA wal_checkpoint(PASSIVE);   -- Checkpoint as much as possible without blocking
PRAGMA wal_checkpoint(FULL);      -- Checkpoint everything, blocking writers
PRAGMA wal_checkpoint(TRUNCATE);  -- Checkpoint and truncate WAL file
```

PASSIVE was chosen because it does not block concurrent operations. It checkpoints all pages that are not currently in use by active readers or writers. In most cases, this syncs all recent writes. FULL would guarantee complete sync but would block all other database access during the checkpoint -- unacceptable for a web server handling concurrent requests.
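`PRAGMA wal_checkpoint` also reports back a row of three integers -- a busy flag, the number of frames in the WAL, and the number of frames actually checkpointed -- which makes an incomplete PASSIVE checkpoint detectable. A quick sketch with the standard sqlite3 module (file path is illustrative):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cp.db")
conn = sqlite3.connect(path, isolation_level=None)  # autocommit
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")

# Returns (busy, log_frames, checkpointed_frames)
busy, log_frames, ckpt_frames = conn.execute(
    "PRAGMA wal_checkpoint(PASSIVE)"
).fetchone()
print(busy, log_frames == ckpt_frames)  # → 0 True
```

With no concurrent readers, PASSIVE syncs everything; under load, comparing `log_frames` to `ckpt_frames` tells you how much of the WAL was left behind.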

The Invalid updated_at Column

During the WAL debugging, we discovered that the updated_at column on several models was not being set correctly. The SQLAlchemy onupdate hook (which runs at statement time, not as a true database trigger) behaved differently than expected with WAL mode:

```python
# The problem: updated_at stayed as the original creation time
class Transaction(Base):
    updated_at = Column(DateTime, onupdate=func.now())
    # In SQLite with WAL, the onupdate hook sometimes fired
    # against a stale version of the row, producing incorrect timestamps
```

The fix was to set updated_at explicitly in the application code rather than relying on database-level triggers:

```python
async def update_transaction_status(tx_id: str, new_status: str):
    transaction = await get_transaction(tx_id)
    transaction.status = new_status
    transaction.updated_at = datetime.utcnow()  # Explicit, not trigger-based
    await db.commit()
    await checkpoint_wal()
```
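One way to make the explicit timestamp impossible to forget is to stamp it on every mutation in a small mixin. This is a hypothetical sketch using a plain Python class, not our SQLAlchemy models:

```python
from datetime import datetime, timezone

class Stamped:
    """Stamp updated_at in application code on every attribute change,
    instead of relying on a database-level onupdate hook (sketch)."""
    def __setattr__(self, name, value):
        super().__setattr__(name, value)
        if name not in ("updated_at", "created_at"):
            super().__setattr__("updated_at", datetime.now(timezone.utc))

class Transaction(Stamped):
    def __init__(self, status: str):
        self.created_at = datetime.now(timezone.utc)
        self.updated_at = self.created_at
        self.status = status

tx = Transaction("pending")
first = tx.updated_at
tx.status = "completed"   # updated_at is re-stamped automatically
```

The stamp lives next to the mutation, so no code path can change a row without moving its timestamp forward.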

Non-Blocking Sync Fixes

The core challenge with WAL mode is that there is no reliable way to guarantee that a reader sees the latest writes without explicit coordination. We implemented several patterns to work around this:

Pattern 1: Read-After-Write on Same Connection

```python
# Use the SAME connection for write and subsequent read
async def create_and_return_transaction(data: dict) -> Transaction:
    async with get_db_connection() as conn:
        # Write
        await conn.execute(
            text("INSERT INTO transactions (...) VALUES (...)"),
            data
        )
        await conn.commit()

        # Read on same connection -- guaranteed to see the write
        result = await conn.execute(
            text("SELECT * FROM transactions WHERE id = :id"),
            {"id": data["id"]}
        )
        return result.fetchone()
```

Pattern 2: Retry With Exponential Backoff

```python
async def get_transaction_with_retry(tx_id: str, max_retries: int = 3) -> Transaction:
    """Retry if WAL hasn't synced yet."""
    for attempt in range(max_retries):
        transaction = await get_transaction(tx_id)
        if transaction:
            return transaction

        if attempt < max_retries - 1:
            await asyncio.sleep(0.1 * (2 ** attempt))  # 100ms, 200ms, 400ms

    raise HTTPException(404, f"Transaction {tx_id} not found")
```

Pattern 3: Write-Through Cache

```python
# For frequently accessed data (providers, currencies),
# maintain an in-memory cache updated synchronously on writes
from typing import Any

class WriteThrough:
    def __init__(self):
        self._cache = {}

    async def write(self, key: str, value: Any):
        # Write to database
        await db.execute(...)
        await db.commit()

        # Update cache immediately (bypass WAL delay)
        self._cache[key] = value

    def read(self, key: str) -> Any:
        # Read from cache (always up-to-date)
        return self._cache.get(key)
```

Why This Pushed Toward PostgreSQL

Every fix we applied was a patch over a fundamental limitation. SQLite's WAL mode is designed for single-application scenarios (mobile apps, desktop software) where one process owns the database. A web server with multiple concurrent connections pushes WAL mode beyond its design envelope.

The problems we encountered:

| Problem | SQLite Workaround | PostgreSQL Solution |
| --- | --- | --- |
| Stale reads after writes | Explicit PRAGMA checkpoint | MVCC -- readers always see committed state |
| Single writer at a time | Serialized writes with timeouts | Multiple concurrent writers |
| Connection sees stale data | Same-connection read-after-write | Any connection sees latest committed data |
| updated_at trigger issues | Explicit timestamp setting | Reliable trigger execution |
| Checkpoint blocking | PASSIVE mode (incomplete sync) | No checkpointing needed |

PostgreSQL's Multi-Version Concurrency Control (MVCC) eliminates every WAL-related issue we encountered. Each transaction sees a consistent snapshot of the database. Writes are immediately visible to new transactions (after commit). There is no separate log file that needs periodic syncing.

What We Learned

WAL mode does not make SQLite a concurrent database. It improves read concurrency, but the fundamental model is still single-writer. For any application that needs concurrent writes -- which includes every web server -- WAL mode is a mitigation, not a solution.

Intermittent bugs are the worst bugs. The WAL-related issues appeared randomly. A test would pass 9 out of 10 times and fail on the 10th. This made them extremely difficult to reproduce and diagnose. When you see intermittent failures with SQLite, suspect WAL stale reads.

Explicit checkpointing is a code smell. When your application code needs PRAGMA wal_checkpoint() calls after critical operations, you are fighting the database engine. The database should guarantee consistency without application-level intervention.

SQLite is phenomenal for what it is designed for. It is the best embedded database in the world. But a payment platform with concurrent API requests, webhook handlers, and background workers is not an embedded database use case. Choosing the right tool for the right job matters more than any amount of creative workarounds.

The WAL race conditions cost us approximately 15 hours across sessions 060, 072, and several debugging sessions in between. The PostgreSQL migration (Session 081) eliminated all of these issues permanently.


This article is part of the "How We Built 0fee.dev" series. 0fee.dev is a payment orchestrator covering 53+ providers across 200+ countries, built by Juste A. GNIMAVO and Claude from Abidjan with zero human engineers. Follow the series for the complete build story.
