By Thales & Claude -- CEO & AI CTO, ZeroSuite, Inc.
The difference between a good AI product and a great one is often measured in milliseconds. Not the total response time -- users will wait 10 or 20 seconds for a thoughtful answer to a complex question -- but the time from pressing "Send" to seeing the first character appear. That interval is the dead zone where users wonder if the app is broken, if their network dropped, if they should press the button again.
Deblo eliminates that dead zone through Server-Sent Events (SSE) streaming. The AI's response starts appearing within 500 milliseconds of the request, character by character, while the backend is still generating it. But Deblo's streaming goes far beyond simple text: we stream 20+ event types including inline quizzes, downloadable files, credit updates, tool execution progress, citation annotations, and payment links -- all through a single SSE connection.
This article explains how we built it.
---
Why SSE Over WebSocket
The first question anyone asks about real-time communication is: why not WebSocket?
We chose SSE (Server-Sent Events) for three specific reasons:
1. One-way streaming is all we need. The user sends a message (a regular HTTP POST), and the AI streams back the response. There is no bidirectional communication during the response phase. SSE is purpose-built for this pattern -- it is a one-way channel from server to client, which is exactly what we want.
2. Better firewall and proxy support. SSE runs over standard HTTP/1.1 or HTTP/2. It does not require a protocol upgrade handshake like WebSocket does. This matters in Africa, where many users are behind carrier proxies, corporate firewalls, or shared network infrastructure that may not properly support WebSocket upgrades. We have had zero reports of SSE connections being blocked.
3. Simpler deployment and debugging. SSE connections are standard HTTP responses. They show up in browser DevTools as a regular request with a streaming response body. They can be load-balanced by any HTTP reverse proxy. They do not require sticky sessions. They reconnect automatically on network interruption (the browser's EventSource API handles this natively, though we use fetch for more control).
The tradeoff is that SSE does not support binary data (everything is UTF-8 text) and cannot send data from client to server. Neither of these limitations affects our use case.
---
The Protocol: POST to SSE
The entry point is POST /api/chat, which accepts a JSON body with the message, conversation ID, attachments, and various configuration flags. It returns an SSE stream:
```python
# backend/app/routes/chat.py (simplified)
@router.post("/chat")
async def chat(
    request: ChatRequest,
    user: User | None = Depends(get_current_user_optional),
    db: AsyncSession = Depends(get_db),
):
    # ... validate credits, build system prompt, prepare messages ...

    return StreamingResponse(
        _stream_response(
            user=user,
            messages=full_messages,
            system_prompt=system_prompt,
            model=model,
            conversation=conversation,
            db=db,
        ),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",
            "X-Conversation-Id": str(conversation.id),
            "X-Job-Id": str(job_id) if job_id else "",
        },
    )
```
The X-Conversation-Id and X-Job-Id response headers are sent immediately, before any streaming begins. The frontend captures these to track the conversation and any background generation jobs.
The response headers also include X-Accel-Buffering: no, which tells Nginx (and similar reverse proxies) not to buffer the response. Without this header, the proxy accumulates the entire response before sending it to the client, which defeats the purpose of streaming entirely.
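For reference, a minimal Nginx location block that keeps streaming intact might look like the following. This is an illustrative sketch, not Deblo's actual config -- the upstream name and timeout values are assumptions:

```nginx
# Illustrative only -- upstream name and timeouts are assumptions
location /api/chat {
    proxy_pass http://backend;
    proxy_http_version 1.1;
    proxy_buffering off;       # same effect as the X-Accel-Buffering: no header
    proxy_cache off;
    proxy_read_timeout 120s;   # outlast long tool-execution chains
}
```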
---
20+ Event Types
Each SSE event consists of an event: line naming the event type and a data: line carrying a JSON payload. Here are the event types the backend can emit:
| Event | Purpose |
|---|---|
| content | Streaming text delta (the main response) |
| content_replace | Replace entire response (for regeneration) |
| bonus_credits | AI awarded bonus credits to the student |
| credit_update | Credits deducted, new balance |
| complexity_warning | Query detected as complex, may cost more |
| quiz | Inline multiple-choice question |
| file | Downloadable generated file (PDF, XLSX, etc.) |
| payment_link | Inline payment link for credit recharge |
| email_draft | Generated email draft for review |
| tool_start | Tool execution began (shows in ProcessingSteps) |
| tool_end | Tool execution completed |
| tool_done | Summary of tool result |
| tool_progress | Streaming delta from file generation tool |
| suggestions | Quick reply chips for follow-up |
| annotations | URL citations from web search |
| reasoning | Model's chain-of-thought (when enabled) |
| placeholder | Placeholder text while model thinks |
| task_created | Task was created from the conversation |
| email_sent | Email/SMS/WhatsApp was sent |
| heartbeat | Keep-alive ping (every 15 seconds) |
| done | Stream complete |
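On the wire, each of these is just an event: line, a data: line, and a blank-line terminator. A minimal formatter -- a sketch in Python, not Deblo's actual helper -- makes the framing concrete:

```python
import json

def format_sse(event: str, data: dict) -> str:
    """Frame a payload as one SSE event: an `event:` line, a `data:`
    line holding JSON, and a blank line marking the event boundary."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

# Example: a credit_update frame (field values illustrative)
frame = format_sse("credit_update", {"credits_used": 1, "new_balance": 41})
```

The double newline at the end is what lets the client detect where one event stops and the next begins.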
---
The Backend Streaming Pattern
The backend yields SSE events from an async generator. The core pattern for tool-augmented streaming looks like this:
```python
# backend/app/services/llm.py (simplified pattern)
async def stream_chat_response(
    messages: list[dict],
    system_prompt: str,
    model: str,
    tools: list[dict] | None = None,
    tool_executor: ToolExecutor | None = None,
) -> AsyncGenerator[str, None]:
    """Stream LLM response with tool calling support.

    Yields SSE-formatted strings.
    """
    request_json = {
        "model": model,
        "messages": messages,
        "stream": True,
        "temperature": settings.DEBLO_K12_LLM_TEMPERATURE,
        "max_tokens": settings.DEBLO_K12_LLM_MAX_TOKENS,
    }
    if system_prompt:
        request_json["messages"] = [
            {"role": "system", "content": system_prompt},
            *messages,
        ]
    if tools:
        request_json["tools"] = tools

    accumulated_text = ""

    async for chunk in _raw_stream(request_json):
        delta = chunk.get("choices", [{}])[0].get("delta", {})

        # Text content
        if delta.get("content"):
            text = delta["content"]
            accumulated_text += text
            yield f"event: content\ndata: {json.dumps({'text': text})}\n\n"

        # Tool calls
        if delta.get("tool_calls"):
            for tc in delta["tool_calls"]:
                func_name = tc["function"]["name"]
                func_args = json.loads(tc["function"]["arguments"])

                # Signal tool start to frontend
                yield (
                    f"event: tool_start\n"
                    f"data: {json.dumps({'name': func_name})}\n\n"
                )

                # Execute tool
                if tool_executor:
                    result = await tool_executor(
                        func_name, func_args, tc["id"]
                    )

                    # Signal tool completion
                    tool_end_data = json.dumps({
                        "name": func_name,
                        "success": result.get("success", True),
                    })
                    yield f"event: tool_end\ndata: {tool_end_data}\n\n"

    yield "event: done\ndata: {}\n\n"
```
The _raw_stream function handles the HTTP connection to OpenRouter, parsing the SSE format from the LLM provider and yielding individual chunks:
```python
# backend/app/services/llm.py
async def _raw_stream(request_json: dict) -> AsyncGenerator[dict, None]:
    """Low-level streaming from OpenRouter."""
    async with httpx.AsyncClient(timeout=120.0) as client:
        async with client.stream(
            "POST",
            "https://openrouter.ai/api/v1/chat/completions",
            headers={
                "Authorization": f"Bearer {settings.OPENROUTER_API_KEY}",
                "HTTP-Referer": "https://deblo.ai",
                "Content-Type": "application/json",
            },
            json=request_json,
        ) as response:
            response.raise_for_status()
            async for line in response.aiter_lines():
                if line.startswith("data: "):
                    data_str = line[6:]
                    if data_str.strip() == "[DONE]":
                        break
                    try:
                        yield json.loads(data_str)
                    except json.JSONDecodeError:
                        continue
```
This is SSE-inside-SSE: we receive an SSE stream from OpenRouter and re-emit it as an SSE stream to the frontend, transforming and enriching the events along the way. The backend adds credit tracking, tool execution, bonus credit awards, and all the other event types that the LLM provider knows nothing about.
---
The Frontend: streamChat() With 42+ Parameters
The frontend streamChat() function is the central hub for all SSE communication. It accepts the user's message and a callback for each event type:
```typescript
// frontend/src/lib/utils/api.ts (signature)
export async function streamChat(
  message: string,
  classId: string | null,
  subject: string | null,
  onChunk: (text: string) => void,
  attachments?: Attachment[],
  conversationId?: string | null,
  onBonusCredits?: (data: {
    credits_awarded: number;
    new_balance: number;
    reason: string;
  }) => void,
  onCreditUpdate?: (data: {
    credits_used: number;
    new_balance: number;
    tokens: number;
  }) => void,
  mode?: string | null,
  domain?: string | null,
  onComplexityWarning?: (data: {
    is_complex: boolean;
    score: number;
    matched_terms: string[];
  }) => void,
  // ... 30+ more parameters ...
  onQuiz?: (data: QuizData) => void,
  onFile?: (data: FileData) => void,
  onToolEvent?: (event: ToolEvent) => void,
  onSuggestions?: (chips: Array<{ label: string; message: string }>) => void,
  onPaymentLink?: (data: PaymentLinkData) => void,
  onContentReplace?: (text: string) => void,
  signal?: AbortSignal,
): Promise<{ conversationId: string | null; jobId: string | null }> {
  // ... implementation ...
}
```
Yes, 42+ parameters. This function grew organically as we added features, and every parameter represents a real feature that a user sees. We considered refactoring to an options object, but the callback-per-event pattern makes it explicit at the call site exactly which events a given component handles.
The SSE parsing on the frontend uses fetch with getReader() rather than the browser's EventSource API. The reason is that EventSource only supports GET requests, but our chat endpoint is a POST (it sends the message body). The manual parsing looks like this:
```typescript
// frontend/src/lib/utils/api.ts (SSE parsing, simplified)
const response = await fetch('/api/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: `Bearer ${token}`,
  },
  body: JSON.stringify(body),
  signal,
});

// Capture conversation ID from response headers
const conversationId = response.headers.get('X-Conversation-Id');
const jobId = response.headers.get('X-Job-Id');

const reader = response.body!.getReader();
const decoder = new TextDecoder();
let buffer = '';
let currentEvent = ''; // persists across reads: event: and data: may arrive in different chunks

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop() || '';

  for (const line of lines) {
    if (line.startsWith('event: ')) {
      currentEvent = line.slice(7).trim();
    } else if (line.startsWith('data: ') && currentEvent) {
      const data = JSON.parse(line.slice(6));

      switch (currentEvent) {
        case 'content':
          onChunk(data.text);
          break;
        case 'bonus_credits':
          onBonusCredits?.(data);
          break;
        case 'quiz':
          onQuiz?.(data);
          break;
        case 'tool_start':
          onToolEvent?.({ type: 'start', ...data });
          break;
        case 'suggestions':
          onSuggestions?.(data.chips);
          break;
        // ... handle all other event types ...
      }
      currentEvent = '';
    }
  }
}
```
The buffer management is critical. SSE events can be split across multiple read() calls -- a single event might arrive in two or three chunks. The buffer accumulates partial data and only processes complete lines.
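The same buffering logic, mirrored in Python for clarity (a standalone sketch, not Deblo's shipped code), shows why the trailing partial line must be carried over between chunks:

```python
import json
from typing import Callable

class SSEParser:
    """Incrementally parse an SSE text stream, tolerating events
    split across arbitrary chunk boundaries."""

    def __init__(self, on_event: Callable[[str, dict], None]) -> None:
        self.on_event = on_event
        self.buffer = ""
        self.current_event = ""

    def feed(self, chunk: str) -> None:
        self.buffer += chunk
        lines = self.buffer.split("\n")
        self.buffer = lines.pop()  # keep the trailing partial line for the next feed()
        for line in lines:
            if line.startswith("event: "):
                self.current_event = line[7:].strip()
            elif line.startswith("data: ") and self.current_event:
                self.on_event(self.current_event, json.loads(line[6:]))
                self.current_event = ""
```

Feeding an event in fragments that split mid-word and mid-JSON still produces exactly one parsed event, because nothing is processed until a full line (and its terminating newline) has arrived.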
---
A Quiz Arrives Mid-Stream
One of Deblo's distinctive features is inline quizzes. The AI tutor can, at any point during its response, insert a multiple-choice question that the student must answer before the conversation continues. This is implemented as a quiz SSE event:
```json
{
  "event": "quiz",
  "data": {
    "id": "quiz_abc123",
    "type": "mcq",
    "question": "Quel est le résultat de 3x + 7 = 22 ?",
    "options": [
      {"id": "a", "text": "x = 3"},
      {"id": "b", "text": "x = 5"},
      {"id": "c", "text": "x = 7"},
      {"id": "d", "text": "x = 15"}
    ],
    "correct": "b",
    "explanation": "On isole x : 3x = 22 - 7 = 15, donc x = 15/3 = 5.",
    "difficulty": "medium",
    "subject": "math",
    "bonus_credits": 2
  }
}
```

When the frontend receives this event, it renders a QuizWidget component inline within the assistant's message bubble. The student selects an answer, gets immediate feedback, and earns bonus credits for a correct response. All of this happens without interrupting the streaming text that may still be arriving.
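The answer check itself is simple once the payload is structured. A hedged sketch of what the QuizWidget flow implies -- grade_quiz is a hypothetical helper, not Deblo's actual function:

```python
def grade_quiz(quiz: dict, selected_option: str) -> dict:
    """Grade one quiz answer and compute the bonus-credit award.
    `quiz` is the `data` object of a quiz SSE event.
    Hypothetical helper illustrating the flow, not Deblo's code."""
    correct = selected_option == quiz["correct"]
    return {
        "correct": correct,
        "explanation": quiz["explanation"],
        "credits_awarded": quiz.get("bonus_credits", 0) if correct else 0,
    }
```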
---
Tool Progress Visualization: ProcessingSteps
When the AI uses tools -- web search, file generation, code execution, email sending -- the user sees a real-time timeline of what is happening. This is the ProcessingSteps component, which renders a vertical timeline with animated status indicators for each tool invocation.
The component uses Svelte 5 runes for reactive state management:
```svelte
<!-- frontend/src/lib/components/ProcessingSteps.svelte (pattern) -->
<script lang="ts">
  let { steps = [], mode = 'child' }: {
    steps: ProcessingStep[];
    mode?: string | null;
  } = $props();

  let isCollapsed = $state(false);

  const isPro = $derived(mode === 'pro');
  const accentColor = $derived(isPro ? '#6366f1' : '#22c55e');
  const completedCount = $derived(
    steps.filter((s) => s.status === 'completed').length
  );

  // ... TOOL_ICONS map and the rest of the component omitted ...
</script>
```
Each tool goes through three phases: loading (spinner animation), completed (static icon), or error (red X). The timeline animates in with staggered delays, giving the impression of a workflow unfolding in real time. When all steps complete, the component can be collapsed into a compact summary line to save space.
For file generation tools (PDF, XLSX, DOCX), the tool_progress event streams delta updates that show a live preview of the file content being generated, creating the feeling that the document is being written in real time.
---
Credit Updates in Real Time
Every interaction costs credits, and users need to see their balance update without refreshing the page. The credit_update event handles this:
When the backend deducts credits for a message, it emits a credit_update event with the number of credits used, the token count (for Pro users), and the new balance. The frontend's credit display component reactively updates to show the new balance. If the balance hits zero mid-conversation, the AI's response includes a gentle prompt to recharge with an inline payment link.
This real-time credit feedback serves two purposes: transparency (users always know what they are spending) and urgency (seeing the balance decrease creates a natural motivation to recharge before it reaches zero).
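Based on the shape of the onCreditUpdate callback, the event on the wire looks like this (the field values here are illustrative):

```json
{
  "event": "credit_update",
  "data": {
    "credits_used": 1,
    "new_balance": 41,
    "tokens": 512
  }
}
```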
---
Mobile Streaming: Custom useStream Hook
The mobile app (React Native / Expo) cannot use the browser's fetch streaming API directly in all environments. We built a custom useStream hook in the @deblo/streaming package that handles SSE parsing over fetch with ReadableStream.
The core challenge on mobile is that some React Native networking implementations buffer the entire response before making it available. We work around this by using the react-native-fetch-api polyfill which provides true streaming support, and by keeping the SSE parsing logic identical to the web version to avoid behavioral divergences between platforms.
---
Conversation Headers for Tracking
Every SSE response includes two custom headers:
- X-Conversation-Id: The UUID of the conversation. For new conversations, this is generated server-side and returned in the first response. The frontend stores it and includes it in subsequent messages to maintain conversation continuity.
- X-Job-Id: Present only for background generation tasks (Pro mode). The frontend can poll GET /api/chat/job/{job_id}/status to check on long-running file generation tasks.
These headers are sent before the streaming body begins, so the frontend has immediate access to the conversation ID without waiting for any streamed content.
---
What We Learned About Streaming
1. Buffer management is not optional. SSE events split across TCP packets are the norm, not the exception. Any streaming implementation that does not handle partial events will produce garbled output.
2. Heartbeats prevent proxy timeouts. Without periodic keep-alive events, reverse proxies and load balancers will close idle connections after 30-60 seconds. Our 15-second heartbeat interval keeps the connection alive through even the longest tool execution chains.
3. Event typing is worth the complexity. Having 20+ distinct event types sounds like over-engineering, but each type enables a specific UI feature. The alternative -- embedding everything in the text stream with parsing markers -- is fragile and creates coupling between the LLM's output format and the frontend's rendering logic.
4. SSE reconnection is free but insufficient. The browser's native EventSource API reconnects automatically, but since we use fetch, we handle reconnection manually. In practice, we do not retry mid-stream -- if the connection drops during a response, we show an error and let the user resend. The cost of a partial response (confused context) outweighs the benefit of automatic retry.
5. Disable buffering at every layer. Response buffering can hide at the application level (FastAPI), the ASGI server level (Uvicorn), the reverse proxy level (Nginx/Caddy), and even the CDN level (Cloudflare). You must disable it at every layer or streaming will not work. The X-Accel-Buffering: no header, Cache-Control: no-cache, and the FastAPI StreamingResponse class handle most of this, but you must verify end-to-end.
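The heartbeat pattern from point 2 can be implemented as a wrapper around any event generator. A sketch assuming asyncio -- with_heartbeat is a hypothetical helper, and Deblo's implementation may differ:

```python
import asyncio
from typing import AsyncGenerator

async def with_heartbeat(
    source: AsyncGenerator[str, None],
    interval: float = 15.0,
) -> AsyncGenerator[str, None]:
    """Re-yield events from `source`, emitting a heartbeat event
    whenever the source stays silent longer than `interval` seconds."""
    source_iter = source.__aiter__()
    next_event = asyncio.ensure_future(source_iter.__anext__())
    while True:
        try:
            # shield() keeps the pending __anext__ alive across timeouts
            event = await asyncio.wait_for(
                asyncio.shield(next_event), timeout=interval
            )
        except asyncio.TimeoutError:
            yield "event: heartbeat\ndata: {}\n\n"
            continue
        except StopAsyncIteration:
            break
        yield event
        next_event = asyncio.ensure_future(source_iter.__anext__())
```

Wrapping the chat generator this way keeps the connection warm through long tool executions without the tool code ever needing to know about heartbeats.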
Streaming is one of those features where the implementation complexity is invisible to the user. When it works, the AI just... talks. Characters appear fluidly, tools execute visually, quizzes pop up inline. The user never thinks about SSE or event parsing or buffer management. They just see a tutor that responds instantly. And that is exactly the point.
---
This is article 7 of 12 in the "How We Built Deblo.ai" series.
1. The Architecture of an African AI Tutor
2. Prompt Engineering for 15 School Subjects
3. Photo Analysis: From Homework to AI
4. Building Deblo Pro: 101 AI Advisors for African Professionals
5. WhatsApp OTP and the African Authentication Problem
6. Credits, FCFA, and 6 African Payment Gateways
7. SSE Streaming: Real-Time AI Responses in SvelteKit (you are here)
8. Voice Calls With AI: Ultravox, LiveKit, and WebRTC
9. The Curriculum Engine: CEPE, BEPC, and BAC Prep
10. Gamification: XP, Streaks, and Bonus Credits
11. Going Mobile: React Native and Expo
12. From Abidjan to Production: Deploying Deblo