SQLite's Write-Ahead Logging (WAL) mode is marketed as the solution to SQLite's concurrency limitations. Instead of locking the entire database for writes, WAL mode writes changes to a separate log file. Readers can continue reading the main database while a writer appends to the log. It sounds perfect. In practice, it introduced subtle race conditions that took multiple sessions to diagnose and were ultimately one of the primary reasons we migrated to PostgreSQL.
This article documents the WAL-related bugs we encountered, the patches we applied, and why those patches were never quite enough.
How WAL Mode Works
In default journal mode, SQLite locks the database file during writes. Any concurrent reader blocks until the write completes. WAL mode changes this:
Main Database File (.db)
|
+-- WAL File (.db-wal) -- sequential log of uncommitted changes
|
+-- SHM File (.db-shm) -- shared memory for coordination
Write operation:
1. Append changes to WAL file
2. Readers using old connections see the main .db file (pre-change)
3. When WAL reaches a threshold, checkpoint merges WAL back into .dbThe key property: new connections opened after a write may still see stale data if they connect before the WAL is checkpointed. The main database file is only updated during checkpoint operations.
The get_provider() vs. get_instance() Bug
Session 060 revealed one of the most confusing bugs we encountered. The provider registry had two methods: get_provider() (which returned a class) and get_instance() (which returned an initialized instance). Some code paths called get_provider() expecting an instance and got a class, leading to TypeError: 'type' object is not callable errors.
But the bug was intermittent. It worked sometimes and failed other times. The WAL connection explained why:
python# The provider registry loaded providers from the database at startup
class ProviderRegistry:
def __init__(self):
self._providers = {}
self._instances = {}
async def load_providers(self):
"""Load all provider configurations from database."""
# This query might see stale data if WAL hasn't checkpointed
providers = await db.execute(text("SELECT * FROM providers"))
for provider in providers:
self._providers[provider.name] = load_provider_class(provider.type)
self._instances[provider.name] = self._providers[provider.name](
config=provider.config
)
def get_provider(self, name: str):
"""Returns the provider CLASS."""
return self._providers.get(name)
def get_instance(self, name: str):
"""Returns an initialized provider INSTANCE."""
return self._instances.get(name)When a new provider was added to the database in one connection, and the registry reloaded in another connection, WAL mode could serve the old database state. The registry would load without the new provider. Subsequent calls to get_instance("new_provider") would return None, causing downstream failures.
The immediate fix was to use get_instance() consistently:
python# BEFORE: Mixed usage
provider_class = registry.get_provider(provider_name) # Gets class
result = provider_class.process(...) # Sometimes works, sometimes TypeError
# AFTER: Consistent usage
provider = registry.get_instance(provider_name) # Gets instance
if not provider:
raise HTTPException(400, f"Provider '{provider_name}' not found or not configured")
result = await provider.process(...)PRAGMA wal_checkpoint(PASSIVE) After Critical Commits
Session 072 introduced explicit WAL checkpointing after operations where stale reads would be dangerous:
python# database.py
async def checkpoint_wal():
"""Force a WAL checkpoint to sync data to main database file."""
async with get_db_connection() as conn:
await conn.execute(text("PRAGMA wal_checkpoint(PASSIVE)"))
# Used after critical operations async def create_transaction(data: dict) -> Transaction: transaction = Transaction(**data) db.add(transaction) await db.commit() BLANK # Force checkpoint so webhook handlers see the transaction await checkpoint_wal() BLANK return transaction ```
We identified six critical commit points that required checkpointing:
| Operation | Why Checkpoint Is Needed |
|---|---|
| Transaction creation | Webhook handlers must see the transaction immediately |
| Transaction status update | Dashboard queries must reflect the new status |
| Provider configuration | Provider registry reload must see new configs |
| API key creation | Authentication middleware must validate new keys |
| App creation | Routing engine must see new app configurations |
| User creation | OAuth callback must find the newly created user |
Why PASSIVE and Not FULL?
SQLite offers three checkpoint modes:
sqlPRAGMA wal_checkpoint(PASSIVE); -- Checkpoint as much as possible without blocking
PRAGMA wal_checkpoint(FULL); -- Checkpoint everything, blocking writers
PRAGMA wal_checkpoint(TRUNCATE); -- Checkpoint and truncate WAL filePASSIVE was chosen because it does not block concurrent operations. It checkpoints all pages that are not currently in use by active readers or writers. In most cases, this syncs all recent writes. FULL would guarantee complete sync but would block all other database access during the checkpoint -- unacceptable for a web server handling concurrent requests.
The Invalid updated_at Column
During the WAL debugging, we discovered that the updated_at column on several models was not being set correctly. SQLite's ON UPDATE trigger behaved differently than expected with WAL mode:
python# The problem: updated_at stayed as the original creation time
class Transaction(Base):
updated_at = Column(DateTime, onupdate=func.now())
# In SQLite with WAL, the onupdate trigger sometimes fired
# against a stale version of the row, producing incorrect timestampsThe fix was to set updated_at explicitly in the application code rather than relying on database-level triggers:
pythonasync def update_transaction_status(tx_id: str, new_status: str):
transaction = await get_transaction(tx_id)
transaction.status = new_status
transaction.updated_at = datetime.utcnow() # Explicit, not trigger-based
await db.commit()
await checkpoint_wal()Non-Blocking Sync Fixes
The core challenge with WAL mode is that there is no reliable way to guarantee that a reader sees the latest writes without explicit coordination. We implemented several patterns to work around this:
Pattern 1: Read-After-Write on Same Connection
python# Use the SAME connection for write and subsequent read
async def create_and_return_transaction(data: dict) -> Transaction:
async with get_db_connection() as conn:
# Write
await conn.execute(
text("INSERT INTO transactions (...) VALUES (...)"),
data
)
await conn.commit()
# Read on same connection -- guaranteed to see the write
result = await conn.execute(
text("SELECT * FROM transactions WHERE id = :id"),
{"id": data["id"]}
)
return result.fetchone()Pattern 2: Retry With Exponential Backoff
pythonasync def get_transaction_with_retry(tx_id: str, max_retries: int = 3) -> Transaction:
"""Retry if WAL hasn't synced yet."""
for attempt in range(max_retries):
transaction = await get_transaction(tx_id)
if transaction:
return transaction
if attempt < max_retries - 1:
await asyncio.sleep(0.1 * (2 ** attempt)) # 100ms, 200ms, 400ms
raise HTTPException(404, f"Transaction {tx_id} not found")Pattern 3: Write-Through Cache
python# For frequently accessed data (providers, currencies),
# maintain an in-memory cache updated synchronously on writes
class WriteThrough:
def __init__(self):
self._cache = {}
async def write(self, key: str, value: any):
# Write to database
await db.execute(...)
await db.commit()
# Update cache immediately (bypass WAL delay)
self._cache[key] = value
def read(self, key: str) -> any:
# Read from cache (always up-to-date)
return self._cache.get(key)Why This Pushed Toward PostgreSQL
Every fix we applied was a patch over a fundamental limitation. SQLite's WAL mode is designed for single-application scenarios (mobile apps, desktop software) where one process owns the database. A web server with multiple concurrent connections pushes WAL mode beyond its design envelope.
The problems we encountered:
| Problem | SQLite Workaround | PostgreSQL Solution |
|---|---|---|
| Stale reads after writes | Explicit PRAGMA checkpoint | MVCC -- readers always see committed state |
| Single writer at a time | Serialized writes with timeouts | Multiple concurrent writers |
| Connection sees stale data | Same-connection read-after-write | Any connection sees latest committed data |
| Updated_at trigger issues | Explicit timestamp setting | Reliable trigger execution |
| Checkpoint blocking | PASSIVE mode (incomplete sync) | No checkpointing needed |
PostgreSQL's Multi-Version Concurrency Control (MVCC) eliminates every WAL-related issue we encountered. Each transaction sees a consistent snapshot of the database. Writes are immediately visible to new transactions (after commit). There is no separate log file that needs periodic syncing.
What We Learned
WAL mode does not make SQLite a concurrent database. It improves read concurrency, but the fundamental model is still single-writer. For any application that needs concurrent writes -- which includes every web server -- WAL mode is a mitigation, not a solution.
Intermittent bugs are the worst bugs. The WAL-related issues appeared randomly. A test would pass 9 out of 10 times and fail on the 10th. This made them extremely difficult to reproduce and diagnose. When you see intermittent failures with SQLite, suspect WAL stale reads.
Explicit checkpointing is a code smell. When your application code needs PRAGMA wal_checkpoint() calls after critical operations, you are fighting the database engine. The database should guarantee consistency without application-level intervention.
SQLite is phenomenal for what it is designed for. It is the best embedded database in the world. But a payment platform with concurrent API requests, webhook handlers, and background workers is not an embedded database use case. Choosing the right tool for the right job matters more than any amount of creative workarounds.
The WAL race conditions cost us approximately 15 hours across sessions 060, 072, and several debugging sessions in between. The PostgreSQL migration (Session 081) eliminated all of these issues permanently.
This article is part of the "How We Built 0fee.dev" series. 0fee.dev is a payment orchestrator covering 53+ providers across 200+ countries, built by Juste A. GNIMAVO and Claude from Abidjan with zero human engineers. Follow the series for the complete build story.