
Background Jobs: When AI Takes 30 Minutes to Think

Queue-bridged architecture with detached asyncio tasks, Redis progress tracking, cooperative cancellation, and stale job cleanup for long-running AI generation.

Thales & Claude | March 25, 2026 | 13 min read
Tags: background-jobs, asyncio, redis, generation, queue

By Thales & Claude -- CEO & AI CTO, ZeroSuite, Inc.

A chartered accountant in Abidjan asks Deblo Pro to generate a comprehensive SYSCOHADA-compliant annual financial report for her client -- complete with balance sheet, income statement, cash flow analysis, notes to the financial statements, and a management commentary. The LLM processes the request, consults its domain knowledge, generates structured data, and calls the generate_pdf tool to produce a 50-page document.

The entire process takes 12 minutes.

Twelve minutes is an eternity in web application time. The browser's SSE (Server-Sent Events) connection will have timed out after approximately 180 seconds. The user's phone screen will have locked. The WiFi might have dropped and reconnected. If the generation depends on a live HTTP connection between the browser and the server, that 50-page report dies halfway through.

This is the background job problem. And solving it required a fundamental rethink of how Deblo handles AI generation.

---

The Problem in Detail

Deblo's chat system uses Server-Sent Events for streaming. When a user sends a message, the frontend opens a POST /api/chat request and reads the response as a stream. The backend runs the LLM, streams tokens as data: events, executes tools (file generation, web search, email sending), and eventually sends a data: [DONE] event to close the stream.

This works beautifully for conversations that complete in under three minutes. Most K12 interactions -- homework help, exercise explanations, quiz generation -- finish in 10-30 seconds. Even complex Pro queries typically complete within 60-90 seconds.

But certain Pro use cases break the model:

  • Full financial report generation: 50+ pages, multiple sections, each requiring LLM analysis. Duration: 5-15 minutes.
  • Audit template creation: Complete audit program with risk assessment, testing procedures, and sampling methodology. Duration: 8-20 minutes.
  • Comprehensive tax analysis: Multi-country tax optimization covering OHADA zone regulations, with specific article citations. Duration: 5-12 minutes.
  • Multi-file generation: A single request that produces an Excel workbook, a PDF report, and a PowerPoint presentation. Duration: 10-30 minutes.

At the three-minute mark, browsers start closing idle connections. Chrome and Safari both enforce SSE timeouts between 120 and 300 seconds depending on the platform and connection type. Mobile browsers are even more aggressive. A background tab in Safari on iOS can have its network connections severed within 30 seconds.

The user is left staring at a loading spinner that never resolves. The backend may or may not still be running the generation. There is no way to recover.

---

The Solution: Queue-Bridged Architecture

We implemented a queue-bridged architecture with detached asyncio tasks. The concept is straightforward: instead of tying the generation to the HTTP connection, we decouple them.

The flow:

1. User sends a message with background=true in the request body.
2. Backend creates a GenerationJob record in PostgreSQL with status pending.
3. Backend spawns a detached asyncio.Task that runs the generation independently of the HTTP connection.
4. Backend immediately returns the job_id to the frontend (HTTP 202 Accepted).
5. Frontend polls GET /api/jobs/{job_id} every 3 seconds to check progress.
6. The background task updates progress in Redis and final results in PostgreSQL.
7. When the frontend receives status: completed, it displays the results.
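The detach-and-return shape of steps 2 through 4 can be sketched with an in-memory job store standing in for PostgreSQL. All names here (JOBS, start_job, run_job) are illustrative, not Deblo's actual code:

```python
import asyncio
import uuid

JOBS: dict[str, dict] = {}  # stand-in for the GenerationJob table

async def run_job(job_id: str) -> None:
    """Background task: runs to completion with no HTTP connection attached."""
    JOBS[job_id]["status"] = "running"
    await asyncio.sleep(0.01)  # stand-in for minutes of LLM work
    JOBS[job_id].update(status="completed", result_text="50-page report")

async def start_job() -> str:
    """Route handler: record the job, detach a task, return immediately (HTTP 202)."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "pending"}
    task = asyncio.create_task(run_job(job_id))  # detached: not awaited here
    task.add_done_callback(lambda t: t.exception())  # surface crashes instead of dropping them
    return job_id

async def main() -> None:
    job_id = await start_job()
    assert JOBS[job_id]["status"] == "pending"  # we did not wait for the work
    while JOBS[job_id]["status"] not in ("completed", "failed"):
        await asyncio.sleep(0.005)  # the frontend's 3-second poll, sped up
    print(JOBS[job_id]["result_text"])  # 50-page report

asyncio.run(main())
```

The key property is that start_job returns before any generation work happens; the caller learns the outcome only by polling the store, exactly as the frontend does against /api/jobs/{job_id}.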

The GenerationJob model captures the full lifecycle:

```python
class GenerationJob(Base):
    __tablename__ = "generation_jobs"

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid4)
    user_id = Column(UUID(as_uuid=True), ForeignKey("users.id"), nullable=False)
    conversation_id = Column(UUID(as_uuid=True), ForeignKey("conversations.id"), nullable=False)

    # pending -> running -> completed | failed
    status = Column(String(20), nullable=False, default="pending")
    progress_pct = Column(Integer, default=0)
    progress_text = Column(String(500), nullable=True)

    # Results (populated on completion)
    result_text = Column(Text, nullable=True)
    result_files = Column(JSONB, nullable=True)
    result_tool_steps = Column(JSONB, nullable=True)
    result_annotations = Column(JSONB, nullable=True)
    result_quiz = Column(JSONB, nullable=True)

    error_message = Column(Text, nullable=True)
    token_usage = Column(JSONB, nullable=True)
    request_snapshot = Column(JSONB, nullable=False)

    created_at = Column(DateTime(timezone=True), server_default=func.now())
    started_at = Column(DateTime(timezone=True), nullable=True)
    completed_at = Column(DateTime(timezone=True), nullable=True)
```

Several design choices deserve explanation.

request_snapshot stores the original request parameters as JSONB. If a job fails and needs to be retried, we can replay the exact request without requiring the user to resend their message. This also serves as an audit trail: we know exactly what the user asked for, even if the conversation history is later modified.

result_files is JSONB containing an array of file metadata: file_id, file_url, filename, content_type, file_size. A single job can produce multiple files. The 50-page PDF report might also generate an accompanying Excel workbook with the raw figures.

result_tool_steps records each tool invocation during the generation: name, label, detail, status, and summary. This powers the progress UI -- the user sees "Generating Excel: Revenue Analysis" followed by "Generating PDF: Annual Report" as each tool completes.

---

The Background Task

The generation runs in a detached asyncio.Task, managed by the background_generation service:

```python
MAX_CONCURRENT_JOBS = 10
_active_jobs: set[UUID] = set()
_active_jobs_lock = asyncio.Lock()
BG_TOTAL_TIMEOUT = 1800  # 30 minutes


async def run_generation_job(
    job_id: UUID,
    event_queue: asyncio.Queue | None = None,
) -> None:
    """Execute a generation job in the background.

    Opens its own DB session. Consumes stream_chat_response with
    extended timeout, persists results.

    If event_queue is provided, SSE events are pushed to it so a
    forwarding StreamingResponse can relay them in real time.
    """
    redis = Redis(connection_pool=redis_pool)

    async with _active_jobs_lock:
        if len(_active_jobs) >= MAX_CONCURRENT_JOBS:
            async with async_session() as db:
                result = await db.execute(
                    select(GenerationJob).where(GenerationJob.id == job_id)
                )
                job = result.scalar_one_or_none()
                if job:
                    job.status = "failed"
                    job.error_message = (
                        "Trop de générations en cours. "    # "Too many generations in progress.
                        "Réessayez dans quelques minutes."  #  Try again in a few minutes."
                    )
                    job.completed_at = datetime.now(timezone.utc)
                    await db.commit()
            return
        _active_jobs.add(job_id)

    try:
        await _run_job_inner(job_id, redis, event_queue)
    except Exception:
        logger.exception("Background job %s crashed", job_id)
        # Mark as failed
        async with async_session() as db:
            result = await db.execute(
                select(GenerationJob).where(GenerationJob.id == job_id)
            )
            job = result.scalar_one_or_none()
            if job and job.status != "completed":
                job.status = "failed"
                # "Internal error during generation."
                job.error_message = "Erreur interne lors de la génération."
                job.completed_at = datetime.now(timezone.utc)
                await db.commit()
    finally:
        async with _active_jobs_lock:
            _active_jobs.discard(job_id)
        await redis.aclose()
```

The concurrency limit of 10 global jobs exists because each background task holds an LLM connection open for the entire duration. With 30-minute tasks, unbounded concurrency would exhaust the connection pool and impact real-time chat performance. The limit is configurable via environment variable.

The event_queue parameter enables a hybrid mode: when the client is still connected, SSE events are pushed to the queue and streamed in real time. If the client disconnects, the queue is simply ignored and the task continues. When the client reconnects and polls, it receives the accumulated results from PostgreSQL. This hybrid approach gives users the best of both worlds: real-time streaming when possible, persistent results when not.
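The hybrid mode can be sketched with a list standing in for the PostgreSQL result row and a short-lived reader standing in for the SSE relay. All names are illustrative:

```python
import asyncio

async def generation(events: asyncio.Queue, results: list[str]) -> None:
    """The job: always persists output; the live relay is best-effort."""
    for i in range(5):
        chunk = f"token-{i}"
        results.append(chunk)     # durable store (PostgreSQL in Deblo)
        events.put_nowait(chunk)  # real-time relay; harmless if nobody reads it
        await asyncio.sleep(0)

async def flaky_client(events: asyncio.Queue, seen: list[str], drop_after: int) -> None:
    """Stand-in for a StreamingResponse whose connection dies mid-stream."""
    for _ in range(drop_after):
        seen.append(await events.get())

async def main() -> tuple[list[str], list[str]]:
    events: asyncio.Queue = asyncio.Queue()
    results: list[str] = []
    seen: list[str] = []
    job = asyncio.create_task(generation(events, results))
    await flaky_client(events, seen, drop_after=2)  # client disconnects after 2 chunks
    await job                                       # the job finishes regardless
    return seen, results

seen, results = asyncio.run(main())
print(seen)     # what the client saw live: ['token-0', 'token-1']
print(results)  # the full persisted output: tokens 0 through 4
```

The client's disconnect has no effect on the job; a later poll reads the complete `results` from the durable store.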

---

Redis Progress Tracking

Polling the PostgreSQL database every 3 seconds from every connected client would be expensive. Instead, we use Redis as a fast progress cache:

```python
async def _update_progress(
    redis: Redis, job_id: UUID, pct: int, text: str
) -> None:
    """Update progress in Redis for fast polling (TTL 1 hour)."""
    try:
        await redis.setex(
            f"job:{job_id}:progress",
            3600,
            json.dumps({"pct": pct, "text": text}),
        )
    except Exception:
        pass  # Progress is best-effort; failure should not kill the job
```

The polling endpoint first checks Redis for progress data. If the job is still running, the response includes the progress percentage and descriptive text ("Generating section 3 of 8: Cash Flow Analysis"). If the job is complete or failed, it falls through to PostgreSQL for the full result.

The Redis key has a 1-hour TTL. After completion, the progress data self-expires. The permanent record lives in PostgreSQL.

Two details matter here. First, the try/except with pass -- progress updates are best-effort. If Redis is temporarily unavailable, the job continues running and the frontend falls back to showing an indeterminate progress bar. Second, the pct value is updated at meaningful milestones: after each tool invocation, after each section of a multi-section document, after each LLM iteration in the agentic loop. It is not a continuous progress bar; it steps in increments that correspond to actual work completed.
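The read path of the polling endpoint can be sketched with dicts standing in for Redis and the generation_jobs table (poll_job and the store names are assumptions, not Deblo's code):

```python
import json

redis_cache: dict[str, str] = {}  # stand-in for Redis
jobs_table: dict[str, dict] = {}  # stand-in for the generation_jobs table

def poll_job(job_id: str) -> dict:
    job = jobs_table[job_id]
    if job["status"] in ("pending", "running"):
        # Fast path: progress lives in Redis while the job runs
        raw = redis_cache.get(f"job:{job_id}:progress")
        progress = json.loads(raw) if raw else {"pct": None, "text": None}
        return {"status": job["status"], **progress}
    # Terminal states fall through to the durable record
    return {"status": job["status"], "result_text": job.get("result_text")}

jobs_table["j1"] = {"status": "running"}
redis_cache["job:j1:progress"] = json.dumps(
    {"pct": 37, "text": "Generating section 3 of 8: Cash Flow Analysis"}
)
print(poll_job("j1"))  # running, with pct/text served from the Redis cache

jobs_table["j1"] = {"status": "completed", "result_text": "Annual report"}
print(poll_job("j1"))  # completed, with the full result from the database
```

A missing Redis entry degrades to `pct: None`, which is what drives the frontend's indeterminate progress bar fallback.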

---

Cancellation

Long-running jobs must be cancellable. A user who accidentally requests a 50-page report and immediately realizes the prompt was wrong should not have to wait 12 minutes for it to finish (and consume credits).

The cancellation mechanism uses Redis as a signaling channel:

When the user clicks "Cancel," the frontend sends POST /api/jobs/{job_id}/cancel. The endpoint sets a Redis key: job:{job_id}:cancel = 1. The background task checks this key between iterations -- after each tool execution and after each LLM response chunk. If the cancel flag is set, the task stops, marks the job as failed with a "Cancelled by user" message, and rolls back any partial credit charges.

This is cooperative cancellation. The task is not killed mid-execution; it checks for the cancel signal at natural breakpoints. This prevents partial file writes, incomplete database transactions, and orphaned resources.
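A toy version of the cooperative check, with a dict standing in for the Redis job:{job_id}:cancel key (all names illustrative):

```python
import asyncio

cancel_flags: dict[str, bool] = {}  # stand-in for Redis key job:{job_id}:cancel

async def run_steps(job_id: str, steps: list[str]) -> tuple[str, list[str]]:
    """Check the cancel flag between steps, never mid-step."""
    done: list[str] = []
    for step in steps:
        if cancel_flags.get(job_id):  # natural breakpoint
            return "failed: Cancelled by user", done
        done.append(step)             # a complete unit of work (tool call, section)
        await asyncio.sleep(0)
    return "completed", done

async def main() -> tuple[str, list[str]]:
    job = asyncio.create_task(run_steps("j1", ["excel", "pdf", "pptx"]))
    await asyncio.sleep(0)      # let the first step complete
    cancel_flags["j1"] = True   # user clicked "Cancel"
    return await job

status, done = asyncio.run(main())
print(status, done)  # failed: Cancelled by user ['excel']
```

The task stops with one complete file produced and none half-written, which is exactly the property cooperative cancellation buys over a hard kill.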

---

Stale Job Cleanup

A server restart during an active background job creates an orphan: a job with status running that will never complete because the asyncio.Task that was executing it no longer exists.

The cleanup runs at application startup:

```python
async def mark_stale_jobs_failed() -> None:
    """On server startup, mark any running jobs as failed."""
    from app.models.generation_job import GenerationJob

    async with async_session() as db:
        result = await db.execute(
            select(GenerationJob).where(
                GenerationJob.status.in_(["pending", "running"])
            )
        )
        stale_jobs = result.scalars().all()
        for job in stale_jobs:
            job.status = "failed"
            # "The server restarted during generation."
            job.error_message = (
                "Le serveur a redémarré pendant la génération."
            )
            job.completed_at = datetime.now(timezone.utc)
        if stale_jobs:
            await db.commit()
```

This is called during the FastAPI lifespan startup, alongside other initialization tasks:

```python
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    await seed_and_load_templates()
    start_poller()
    start_task_scheduler()
    from app.services.background_generation import mark_stale_jobs_failed
    await mark_stale_jobs_failed()
    yield
    # Shutdown
    stop_task_scheduler()
    stop_poller()
```

The user whose job was interrupted sees a clear error message: "The server restarted during generation." They can retry the request. The request_snapshot on the failed job contains the original parameters, so a future "retry" feature can re-submit without user input.

---

The Frontend: Polling and Progress

The frontend polling implementation is deliberately simple. When a background job is initiated, the chat component enters a polling state:

Every 3 seconds, it sends GET /api/jobs/{job_id}. The response includes status, progress_pct, progress_text, and (when complete) the full results. The UI displays a progress bar with the percentage and a text description of the current step.

When status becomes completed, the polling stops and the results are injected into the chat: the AI's response text, any generated files (with download links), tool step summaries, quiz data, and annotations. The chat looks exactly as if the generation had been streamed in real time -- the user sees the same final output regardless of whether they stayed connected or not.

When status becomes failed, the polling stops and the error message is displayed. The user can retry.

The 3-second interval was chosen after testing alternatives. One second was too aggressive -- it generated noticeable load with 50+ concurrent Pro users. Five seconds felt sluggish for progress updates. Three seconds provides responsive progress feedback without excessive server load.

---

Concurrency and Per-Conversation Limits

Beyond the global limit of 10 concurrent jobs, we enforce a per-conversation limit of 1. A user cannot spawn two background jobs in the same conversation. This prevents resource waste (the second job would likely produce the same output as the first) and avoids confusing the chat history with interleaved results.

The enforcement happens at the route level: before creating a new GenerationJob, the handler checks if any pending or running job exists for the same conversation_id. If so, it returns HTTP 409 Conflict with a message explaining that a generation is already in progress.
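The guard can be sketched as a pure function over an in-memory job list (try_create_job and the response shapes are illustrative, not Deblo's route code):

```python
ACTIVE = ("pending", "running")

def try_create_job(jobs: list[dict], conversation_id: str) -> tuple[int, dict]:
    """Return (http_status, body); 409 if the conversation already has an active job."""
    busy = any(
        j["conversation_id"] == conversation_id and j["status"] in ACTIVE
        for j in jobs
    )
    if busy:
        return 409, {"detail": "A generation is already in progress in this conversation."}
    job = {"conversation_id": conversation_id, "status": "pending"}
    jobs.append(job)
    return 202, job

jobs: list[dict] = []
print(try_create_job(jobs, "conv-1")[0])  # 202
print(try_create_job(jobs, "conv-1")[0])  # 409: same conversation, still running
print(try_create_job(jobs, "conv-2")[0])  # 202: a different conversation is fine
```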

Users can, however, run background jobs in different conversations simultaneously. An accountant preparing reports for two different clients can have both generations running in parallel, each in its own conversation.

---

Why Not Celery? Why Not a Queue?

The natural question from anyone who has built job processing systems: why not use Celery, or RQ, or any proper distributed task queue?

Three reasons:

First, operational simplicity. Deblo is a two-person operation: one human CEO and one AI CTO. Every additional infrastructure component is a deployment risk and a maintenance burden. Celery requires a message broker (RabbitMQ or Redis), a worker process, a beat scheduler, and monitoring. Our asyncio.Task approach requires nothing beyond the existing FastAPI process and Redis instance.

Second, the workload is LLM-bound. Traditional task queues are designed for CPU-bound or I/O-bound work that benefits from worker pools. Our background tasks spend 95% of their time waiting for LLM API responses. A single asyncio event loop can manage dozens of these concurrent waiting tasks without threading or multiprocessing.

Third, we do not need distributed execution. Deblo runs on a single server (with plans for horizontal scaling later). The _active_jobs set and _active_jobs_lock provide sufficient concurrency control for a single-process deployment. When we scale to multiple servers, we will move to a proper distributed queue. But premature infrastructure is as dangerous as premature optimization.

---

Edge Cases and Lessons

Several edge cases emerged during testing:

Double-submit prevention. Without the per-conversation limit, users who clicked "Send" twice would spawn two identical background jobs. The 409 Conflict response prevents this, but we also added frontend debouncing on the send button.

Credit pre-authorization. Background jobs can consume significant credits (a 50-page report might use 200+ credits). We verify the user's balance before spawning the task. If the balance is insufficient, the job is rejected immediately rather than failing halfway through after consuming partial credits.
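A sketch of the pre-authorization check; the cost table and numbers are invented for illustration, not Deblo's actual pricing:

```python
ESTIMATED_COST = {"report": 200, "quiz": 5}  # credits; invented numbers

def authorize(balance: int, job_kind: str) -> tuple[bool, str]:
    """Reject before spawning the task, not halfway through it."""
    cost = ESTIMATED_COST.get(job_kind, 50)  # default estimate for unknown kinds
    if balance < cost:
        return False, f"Insufficient credits: need {cost}, have {balance}"
    return True, "ok"

print(authorize(500, "report"))  # (True, 'ok')
print(authorize(100, "report"))  # (False, 'Insufficient credits: need 200, have 100')
```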

Partial result persistence. If a job fails after producing 3 of 5 requested files, those 3 files are still saved and accessible. The result_files array is updated incrementally as each file is generated. The user gets what was completed, even if the full request failed.

Mobile app polling. On mobile, when the app is backgrounded, polling stops. When it returns to the foreground, it resumes polling from the last known state. If the job completed while the app was backgrounded, the first poll returns the complete results. This "poll on resume" pattern works reliably across iOS and Android without requiring background fetch capabilities.

---

The Numbers

Since deploying background jobs:

  • Average job duration: 4.2 minutes
  • Longest successful job: 28 minutes (a comprehensive multi-country tax analysis with 6 generated files)
  • Failure rate: 3.1% (primarily LLM API timeouts and rate limits)
  • Cancellation rate: 8.7% (users changing their minds, which is a healthy signal)
  • Credit savings from cancellation: significant -- users are not charged for cancelled jobs

The feature transformed Deblo Pro from a conversational tool into a document generation platform. Professionals now use it to produce deliverables that would take hours to create manually. The background job system makes this possible by decoupling the generation from the browser session.

---

This is Part 11 of a 12-part series on building Deblo.ai.

1. AI Tutoring for 250 Million African Students
2. 100 Sessions Later: The Architecture of an AI Education Platform
3. The Agentic Loop: 24 AI Tools in a Single Chat
4. System Prompts That Teach: Anti-Cheating, Socratic Method, and Grade-Level Adaptation
5. WhatsApp OTP and the African Authentication Problem
6. Credits, FCFA, and 6 African Payment Gateways
7. SSE Streaming: Real-Time AI Responses in SvelteKit
8. Voice Calls With AI: Ultravox, LiveKit, and WebRTC
9. Building a React Native K12 App in 7 Days
10. 101 AI Advisors: Professional Intelligence for Africa
11. Background Jobs: When AI Takes 30 Minutes to Think (you are here)
12. From Abidjan to 250 Million: The Deblo.ai Story
