Every software project has its war stories. The bugs that take hours to diagnose. The errors that appear in production but never in development. The infinite loops that peg the CPU at 100% with no obvious cause. 0fee.dev had all of these, spread across multiple sessions.
This article documents the worst of them: SQLAdmin column_filters crashing with 500 errors, Redis connections hanging indefinitely, webhook function signature mismatches, provider registry confusion, invoice creation parameter errors, infinite recursion in a currency getter, and building the wrong JavaScript file format.
## Session 038: SQLAdmin column_filters 500 Errors
After migrating to SQLAdmin, the admin panel worked perfectly -- until you clicked a filter button. Every filter action produced a 500 Internal Server Error with no useful traceback in the logs.
The problem was how column_filters was defined:
```python
# BEFORE: Caused 500 errors on filter click
class TransactionAdmin(ModelView, model=Transaction):
    column_filters = [
        Transaction.status,      # Model attribute reference
        Transaction.provider,
        Transaction.created_at,
    ]
```

SQLAdmin 0.16.x expected string column names for filters, not model attribute references. The attribute references worked for column_list and column_searchable_list, but the filter generation code path handled them differently:
```python
# AFTER: String-based column references work correctly
class TransactionAdmin(ModelView, model=Transaction):
    column_filters = [
        "status",
        "provider",
        "created_at",
    ]
```

The frustrating part was that no exception was raised during initialization. SQLAdmin accepted the attribute references silently. The error only manifested when the filter UI attempted to generate the filter form, deep inside SQLAdmin's internal rendering code. The traceback pointed to SQLAdmin internals, not to our configuration.
Lesson: When a library offers multiple ways to reference columns (attributes vs. strings), test each usage context independently. What works for listing may not work for filtering.
## Session 040: Redis Hanging
Redis was used for caching, rate limiting, and as Celery's message broker. In Session 040, the entire backend became unresponsive. Every request timed out. The CPU was idle. Memory was fine. The server was alive but not responding.
The culprit: Redis connection hanging without timeout.
```python
# BEFORE: No timeout, connection hangs forever
import redis

redis_client = redis.Redis(host="localhost", port=6379, db=0)

async def get_cached_provider(name: str):
    # If Redis is down, this blocks forever
    cached = redis_client.get(f"provider:{name}")
    ...
```

When the Redis server became temporarily unreachable (a common occurrence in development and occasionally in production), every request that touched Redis would hang indefinitely. Since Redis was used in the rate limiting middleware, which ran on every request, the entire API became unresponsive.
The fix had three parts:
### Part 1: Connection Timeouts
```python
# AFTER: 5-second timeout on all Redis operations
redis_client = redis.Redis(
    host=REDIS_HOST,
    port=REDIS_PORT,
    db=0,
    socket_timeout=5,          # 5s timeout for individual operations
    socket_connect_timeout=5,  # 5s timeout for initial connection
    retry_on_timeout=True,     # Retry once on timeout
    health_check_interval=30,  # Check connection health every 30s
)
```

### Part 2: Graceful Fallbacks
```python
# AFTER: Graceful fallback when Redis is unavailable
async def get_cached_provider(name: str):
    try:
        cached = redis_client.get(f"provider:{name}")
        if cached:
            return json.loads(cached)
    except (redis.ConnectionError, redis.TimeoutError) as e:
        logger.warning(f"Redis unavailable, falling back to database: {e}")
    # Fallback: query database directly
    provider = await db.get(Provider, name)
    return provider

async def check_rate_limit(client_ip: str) -> bool:
    try:
        key = f"rate_limit:{client_ip}"
        count = redis_client.incr(key)
        if count == 1:
            redis_client.expire(key, 60)
        return count <= 100  # 100 requests per minute
    except (redis.ConnectionError, redis.TimeoutError):
        # If Redis is down, allow the request (fail open for rate limiting)
        logger.warning("Redis unavailable, rate limiting disabled")
        return True
```
The rate limiter uses a "fail open" strategy: when Redis is down, requests are allowed rather than denied. This is a deliberate choice. For a payment platform, denying all requests because the rate limiter is down is worse than temporarily having no rate limiting.
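The fail-open behavior is easy to pin down with a unit test. The sketch below is illustrative, not the production code: it swaps real Redis for two tiny stubs (`FakeRedis`, `FakeDownRedis` are invented names) and drops the `expire()` bookkeeping to keep the example short.

```python
class ConnectionErr(Exception):
    """Stand-in for redis.ConnectionError / redis.TimeoutError."""

class FakeRedis:
    """In-memory stand-in for a healthy Redis server."""
    def __init__(self):
        self.counts = {}

    def incr(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

class FakeDownRedis:
    """Stand-in for an unreachable Redis server."""
    def incr(self, key):
        raise ConnectionErr("connection refused")

def check_rate_limit(client, client_ip: str, limit: int = 100) -> bool:
    """Fail-open rate limiter: a broken backend never blocks traffic."""
    try:
        return client.incr(f"rate_limit:{client_ip}") <= limit
    except ConnectionErr:
        # Fail open: briefly running without rate limiting beats
        # rejecting every payment while Redis is down.
        return True

print(check_rate_limit(FakeRedis(), "203.0.113.7"))      # True: under the limit
print(check_rate_limit(FakeDownRedis(), "203.0.113.7"))  # True: fail open
```

A fail-closed limiter would `return False` in the except branch instead; the one-line difference is exactly why the policy deserves an explicit comment.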
### Part 3: Fix Hardcoded localhost
```python
# BEFORE: Hardcoded localhost (breaks in Docker/production)
redis_client = redis.Redis(host="localhost", port=6379)

# AFTER: Configurable from environment
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = int(os.getenv("REDIS_PORT", "6379"))
redis_client = redis.Redis(host=REDIS_HOST, port=REDIS_PORT)
```

In Docker, Redis runs in a separate container. localhost inside the backend container is the backend container itself, not the Redis container. The Redis host needs to be configurable.
## Session 040: Webhook Function Signature Mismatch
In the same session, webhook delivery was silently failing. No errors in logs, no retries, just webhooks that never arrived at their destination.
The issue was a function signature mismatch between the webhook sender and the Celery task:
```python
# BEFORE: Signature mismatch
# The function expected:
async def send_webhook(url: str, payload: dict, secret: str, attempt: int = 1):
    ...

# But the Celery task called it as:
@celery_app.task
def deliver_webhook(webhook_id: str, transaction_id: str):
    webhook = get_webhook(webhook_id)
    transaction = get_transaction(transaction_id)
    # Missing 'secret' parameter, passing wrong arguments
    send_webhook(webhook.url, transaction.to_dict())  # TypeError swallowed by Celery
```

Celery runs tasks in worker processes, so exceptions raised inside a task never propagate to the calling code; unless you inspect task results (or run tasks eagerly with task_always_eager during testing), a TypeError like this one just disappears into the worker logs. From the caller's perspective the dispatch succeeded, with no actual webhook delivery.
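Signature drift like this can be caught in a plain unit test, before Celery ever runs the task. A sketch using the standard library's inspect module (the send_webhook stub mirrors the signature above; call_matches_signature is an illustrative helper name):

```python
import inspect

# Stub with the same signature as the real sender; never actually called.
async def send_webhook(url: str, payload: dict, secret: str, attempt: int = 1):
    ...

def call_matches_signature(func, *args, **kwargs) -> bool:
    """Return True if the arguments would bind to func's signature."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
        return True
    except TypeError:
        return False

# The buggy call from the Celery task: required 'secret' argument missing.
assert not call_matches_signature(send_webhook, "https://example.com/hook", {})
# The corrected call binds cleanly.
assert call_matches_signature(send_webhook, "https://example.com/hook", {}, "whsec_123")
```

Because bind() only checks the signature and never invokes the function, this works for async functions and for functions with side effects alike.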
```python
# AFTER: Correct function call with proper error handling
@celery_app.task(bind=True, max_retries=5, default_retry_delay=60)
def deliver_webhook(self, webhook_id: str, transaction_id: str):
    try:
        webhook = get_webhook(webhook_id)
        transaction = get_transaction(transaction_id)
        payload = build_webhook_payload(transaction)
        signature = compute_webhook_signature(payload, webhook.secret)
        response = httpx.post(
            webhook.url,
            json=payload,
            headers={
                "X-0fee-Signature": signature,
                "X-0fee-Event": transaction.status,
            },
            timeout=10.0,
        )
        response.raise_for_status()
        # Record successful delivery
        record_delivery(webhook_id, transaction_id, "success", response.status_code)
    except httpx.HTTPError as e:
        # Transport errors carry no response, so guard both attribute lookups
        status = getattr(getattr(e, "response", None), "status_code", None)
        record_delivery(webhook_id, transaction_id, "failed", status)
        self.retry(exc=e)
    except Exception as e:
        record_delivery(webhook_id, transaction_id, "error", None)
        logger.error(f"Webhook delivery error: {e}", exc_info=True)
        self.retry(exc=e)
```

## Session 060: Provider Registry Confusion
The provider registry had two methods with similar names and different return types (covered in detail in the WAL article). The confusion caused TypeError exceptions that appeared randomly:
```python
# get_provider() returns a CLASS
# get_instance() returns an INSTANCE

# Some code called get_provider() expecting an instance
provider = registry.get_provider("stripe")
result = await provider.process_payment(...)  # TypeError: class is not awaitable

# Other code called get_instance() correctly
provider = registry.get_instance("stripe")
result = await provider.process_payment(...)  # Works
```

The fix was twofold: make get_instance() the public API and deprecate get_provider():
```python
class ProviderRegistry:
    def get_instance(self, name: str) -> BaseProvider | None:
        """Get an initialized provider instance. This is the primary API."""
        return self._instances.get(name)

    def get_provider(self, name: str) -> type[BaseProvider] | None:
        """DEPRECATED: Use get_instance(). Returns the provider class, not an instance."""
        import warnings
        warnings.warn("get_provider() is deprecated, use get_instance()", DeprecationWarning)
        return self._providers.get(name)
```

## Session 060: Invoice Creation Parameter Mismatch
Invoice generation was failing with a confusing error about unexpected keyword arguments:
```python
# The invoice creation function had been refactored to accept a dict
async def create_invoice(data: dict) -> Invoice:
    ...

# But callers were still passing individual parameters
invoice = await create_invoice(
    user_id=user.id,
    app_id=app.id,
    amount=fees_total,
    period_start=cycle_start,
    period_end=cycle_end,
)
# TypeError: create_invoice() got unexpected keyword arguments
```

This is a classic refactoring hazard. The function signature changed but not all call sites were updated. The fix was to update all callers and add type hints to prevent future mismatches:
```python
class InvoiceCreate(BaseModel):
    user_id: str
    app_id: str
    amount: float
    currency: str
    period_start: datetime
    period_end: datetime
    items: list[InvoiceItemCreate]

async def create_invoice(data: InvoiceCreate) -> Invoice:
    ...
```
...Using a Pydantic model for the input makes the function signature explicit and provides validation at the call site.
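To see what that validation buys you, here is a trimmed, runnable sketch (app_id, currency, and items omitted from the model above): a bad call site now fails loudly when the model is constructed, instead of raising a bare TypeError deep inside create_invoice().

```python
from datetime import datetime
from pydantic import BaseModel, ValidationError

# Trimmed version of the InvoiceCreate model shown above.
class InvoiceCreate(BaseModel):
    user_id: str
    amount: float
    period_start: datetime
    period_end: datetime

try:
    InvoiceCreate(user_id="u_1", amount="not-a-number",
                  period_start=datetime(2024, 1, 1))  # period_end missing
except ValidationError as e:
    # Both problems are reported at once: unparseable amount, missing period_end
    print(len(e.errors()), "validation errors")
```

Compare that with the dict version, which would happily accept any keys and fail later, one field at a time.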
## Session 077: Infinite Recursion in Currency Getter
This was the most dramatic bug. The backend process would consume 100% CPU and eventually crash with a RecursionError:
```python
# BEFORE: Infinite recursion
class App(Base):
    @property
    def settlement_currency(self) -> str:
        """Get the app's settlement currency."""
        if self.settings and self.settings.get("settlement_currency"):
            return self.settings["settlement_currency"]
        # Fallback to user's default currency
        return self.user.default_currency  # This triggers a lazy load...
        # which loads the User object...
        # which accesses user.apps...
        # which loads this App...
        # which accesses app.settlement_currency...
        # INFINITE RECURSION
```

The SQLAlchemy relationship between User and App created a circular lazy-load chain. Accessing self.user triggered a lazy load of the User object. The User's __repr__ method (or a property on User) accessed user.apps, which loaded all App objects. Each App object's initialization accessed settlement_currency, which accessed self.user, completing the loop.
```python
# AFTER: Break the recursion with explicit query
class App(Base):
    @property
    def settlement_currency(self) -> str:
        if self.settings and self.settings.get("settlement_currency"):
            return self.settings["settlement_currency"]
        return "USD"  # Safe default, no lazy-load chain

    async def get_settlement_currency(self, db: AsyncSession) -> str:
        """Get settlement currency with explicit user lookup if needed."""
        if self.settings and self.settings.get("settlement_currency"):
            return self.settings["settlement_currency"]
        user = await db.get(User, self.user_id)
        return user.default_currency if user else "USD"
```

The property version uses a safe default. The async method does an explicit, non-recursive database lookup.
## Session 077: Wrong Build File (ES vs. IIFE)
The checkout widget was built as an ES module (widget.es.js) but needed to be an IIFE (widget.iife.js) for embedding on merchant websites. ES modules require <script type="module"> and do not work with the simple <script src="..."> tag that merchants were given:
```javascript
// Vite build configuration
// BEFORE: Produced ES module
export default defineConfig({
  build: {
    lib: {
      entry: 'src/widget.ts',
      formats: ['es'],  // Only ES module
      fileName: 'widget',
    }
  }
});

// AFTER: Produce IIFE for embedding
export default defineConfig({
  build: {
    lib: {
      entry: 'src/widget.ts',
      formats: ['iife'],  // IIFE for <script> tag embedding
      name: 'ZeroFeeWidget',
      fileName: 'widget',
    }
  }
});
```

The ES module build worked in the development environment (where Vite serves everything as modules) but failed in production when merchants added <script src="https://0fee.dev/widget.js"> to their pages. The IIFE format wraps the entire widget in a self-executing function, making it compatible with any <script> tag.
## Patterns We Now Follow
After fixing these bugs, we established several defensive patterns:
- Every Redis operation has a timeout and a fallback. No Redis call blocks indefinitely.
- Every Celery task has explicit error handling and logging. Swallowed exceptions are not acceptable.
- Properties never trigger lazy loads that could recurse. Use explicit async methods for database lookups.
- Function signatures use Pydantic models. No ambiguity about expected parameters.
- Build formats are tested with the actual embedding method. If merchants use `<script>` tags, test with `<script>` tags.
- SQLAdmin configuration is tested feature by feature. List, filter, search, create, edit -- each is tested independently.
This article is part of the "How We Built 0fee.dev" series. 0fee.dev is a payment orchestrator covering 53+ providers across 200+ countries, built by Juste A. GNIMAVO and Claude from Abidjan with zero human engineers. Follow the series for the complete build story.