By Thales & Claude -- CEO & AI CTO, ZeroSuite, Inc.
There is a moment in every educational product's life where text stops being enough. For Deblo, that moment came when we watched a 9-year-old student in Abidjan try to type a mathematics question. She knew what she wanted to ask -- she could articulate it perfectly in spoken French -- but translating that thought into typed text on a small phone screen was a barrier that text-based AI cannot solve.
Voice was the obvious answer. Not voice notes that get transcribed and answered asynchronously, but real-time voice conversation -- the student speaks, the AI listens, thinks, and speaks back. A phone call with an AI tutor.
Building this required integrating three technologies: Ultravox for the voice AI model, LiveKit for the real-time audio transport layer, and WebRTC for the browser and mobile client connections. This article covers the full stack, from the API call that creates a voice session to the credit calculation when the call ends.
---
The Architecture: Three Layers
The voice call system has three distinct layers:
1. Deblo Backend (FastAPI): Creates the session, manages credits, stores transcripts. This is the orchestration layer.
2. Ultravox: The voice AI platform. Hosts the language model that can listen and speak in real time. Exposes a REST API for session management and a WebSocket/WebRTC endpoint for the actual audio stream.
3. LiveKit: The real-time communication infrastructure. Provides the WebRTC rooms, handles audio encoding/decoding, and manages connectivity. On mobile, we use LiveKit's React Native SDK; on web, Ultravox's own WebRTC client connects directly.
The flow is: Deblo backend creates an Ultravox call, receives a joinUrl, returns it to the client. The client connects to that URL via WebRTC. Audio flows between the user and Ultravox's voice model. When the call ends, the backend retrieves the transcript, calculates credits, and stores everything.
---
Creating a Voice Session
The voice call starts with a POST to /voice/call. The backend builds a voice-specific system prompt, creates an Ultravox call via their REST API, and sets up the database records:
```python
# backend/app/routes/voice.py
@router.post("/voice/call", response_model=StartCallResponse)
async def start_voice_call(
    user: User | None = Depends(get_current_user_optional),
    db: AsyncSession = Depends(get_db),
):
    """Create an Ultravox call (students and guests, not professionals)."""
    # Professionals use voice notes, not calls
    if user and user.user_type == "professional":
        raise HTTPException(
            status_code=403,
            detail="Les appels vocaux sont réservés aux élèves.",
        )

    # Verify minimum credits (at least 1 minute worth)
    if user:
        cost_per_min = await get_setting(
            "credit_cost_voice_per_minute",
            db,
            settings.CREDIT_COST_VOICE_PER_MINUTE,
        )
        if not await check_credits(user, cost_per_min, db):
            raise HTTPException(
                status_code=402,
                detail=f"Crédits insuffisants (minimum {cost_per_min} "
                f"pour 1 minute).",
            )

    # Build voice-specific prompt
    from app.prompts.voice import build_voice_prompt
    voice_prompt = build_voice_prompt(
        user_name=user.name if user else None,
        class_id=user.preferred_class if user else None,
    )

    # Create the Ultravox call
    call_data = await create_ultravox_call(
        system_prompt=voice_prompt,
        voice=voice,
        language_hint="fr",
        temperature=0.7,
        max_duration=max_duration,
        selected_tools=VOICE_TOOLS,
    )

    join_url = call_data.get("joinUrl", "")

    # Create conversation and voice session records
    conversation = Conversation(
        id=uuid4(),
        user_id=user.id if user else None,
        mode="child",
        category="voice",
        title="Appel vocal avec Déblo",
        messages=[],
    )
    session = VoiceSession(
        id=uuid4(),
        user_id=user.id if user else None,
        conversation_id=conversation.id,
        ultravox_call_id=call_data.get("callId", ""),
        join_url=join_url,
        status="created",
        started_at=datetime.now(timezone.utc),
    )
    db.add(conversation)
    db.add(session)
    await db.flush()

    return StartCallResponse(
        session_id=str(session.id),
        join_url=join_url,
        conversation_id=str(conversation.id),
        max_minutes=max_duration // 60,
    )
```
Several design decisions are embedded here:
Professionals do not get voice calls. This is a product decision, not a technical one. K12 students get full voice conversations because speaking is more natural for children than typing. Professional users get voice notes instead -- they record an audio message, it gets transcribed, and the AI responds in text. The reasoning is that professionals need precise, reviewable output (financial calculations, legal references) that is better delivered as text.
Guests can make calls too. Unauthenticated users (guests) get a capped voice session -- shorter duration, no credit tracking. This lets potential users experience the voice feature before signing up.
Credits are checked at the start. We verify the user has at least enough credits for one minute before creating the session. Further credit deduction happens at the end based on actual duration.
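The two halves of the credit policy fit together: the start-of-call guard only has to prove the user can afford one minute, while the session's 15-minute cap bounds the worst-case charge. A minimal sketch of that arithmetic, with an illustrative per-minute rate (the real guard goes through `get_setting` and `check_credits` against the database):

```python
from math import ceil

COST_PER_MINUTE = 5         # illustrative value of credit_cost_voice_per_minute
MAX_DURATION_SECONDS = 900  # the 15-minute cap sent to Ultravox

def can_start_call(balance: int, cost_per_minute: int = COST_PER_MINUTE) -> bool:
    """Start-of-call guard: the user must afford at least one full minute."""
    return balance >= cost_per_minute

def worst_case_cost(max_duration: int = MAX_DURATION_SECONDS,
                    cost_per_minute: int = COST_PER_MINUTE) -> int:
    """Upper bound on what a single capped call can cost."""
    return ceil(max_duration / 60) * cost_per_minute
```

With these numbers, the most any single session can cost is 75 credits, which is what makes the one-minute entry check a safe lower bar.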
---
The Ultravox Integration
Ultravox provides the voice AI model -- a model specifically designed for real-time spoken conversation, not just text generation. The API is straightforward:
```python
# backend/app/services/ultravox.py
ULTRAVOX_BASE_URL = "https://api.ultravox.ai/api"

async def create_ultravox_call(
    system_prompt: str,
    voice: str = "",
    language_hint: str = "fr",
    temperature: float = 0.7,
    max_duration: int = 900,
    selected_tools: list[dict] | None = None,
) -> dict:
    """POST /api/calls -- create an Ultravox call."""
    payload = {
        "systemPrompt": system_prompt,
        "model": "fixie-ai/ultravox-v0.7",
        "voice": voice or settings.ULTRAVOX_VOICE,
        "languageHint": language_hint,
        "temperature": temperature,
        "maxDuration": f"{max_duration}s",
        "firstSpeaker": "FIRST_SPEAKER_AGENT",
        "initialOutputMedium": "MESSAGE_MEDIUM_VOICE",
    }

    if selected_tools:
        payload["selectedTools"] = selected_tools

    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(
            f"{ULTRAVOX_BASE_URL}/calls",
            json=payload,
            headers={"X-API-Key": settings.ULTRAVOX_API_KEY},
        )
        resp.raise_for_status()
        return resp.json()
```
Key configuration choices:
- firstSpeaker: FIRST_SPEAKER_AGENT: The AI speaks first when the student joins. It greets them by name (if known) and asks how it can help. This is important for children -- a silent AI waiting for input is confusing; an AI that says "Bonjour Aminata, comment puis-je t'aider ?" ("Hello Aminata, how can I help you?") is welcoming.
- initialOutputMedium: MESSAGE_MEDIUM_VOICE: The AI starts in voice mode (as opposed to text mode). The student hears the greeting spoken aloud.
- maxDuration: 900s: A 15-minute maximum per session. This is a cost control measure -- at 5 credits per minute, a 15-minute call costs 75 credits. We do not want a student accidentally leaving a call connected for hours.
- languageHint: fr: French is the primary language. Ultravox uses this to optimize speech recognition.
---
Photo Analysis During Voice Calls
One of Deblo's distinctive voice features is the ability to photograph an exercise during a call. A student can be speaking with the AI tutor, say "attends, je vais te montrer l'exercice" (wait, I will show you the exercise), and then take a photo. The AI analyzes the photo and continues the conversation with full context of what is in the image.
This is implemented through Ultravox's client-side tool calling system. We register an upload_photo tool that the AI can invoke when the student mentions wanting to show something:
```python
# backend/app/routes/voice.py
VOICE_TOOLS = [
    {
        "temporaryTool": {
            "modelToolName": "upload_photo",
            "description": (
                "Demande à l'élève de prendre une photo de "
                "son exercice, devoir ou document. L'élève verra "
                "une interface caméra et pourra prendre une photo."
            ),
            "dynamicParameters": [
                {
                    "name": "context",
                    "location": "PARAMETER_LOCATION_BODY",
                    "schema": {
                        "type": "string",
                        "description": "Contexte de ce que l'élève "
                        "veut montrer",
                    },
                    "required": False,
                }
            ],
            "client": {},
        }
    }
]
```
The "client": {} key is significant. It tells Ultravox that this is a client-side tool -- the tool execution happens on the student's device (opening the camera, capturing the photo), not on the server. When the AI decides to call upload_photo, the Ultravox client SDK fires a callback on the frontend, which opens the camera UI.
Once the student takes a photo, the frontend sends it to POST /voice/analyze-photo, which uses a vision model to describe the image content in spoken-friendly language:
```python
# backend/app/routes/voice.py
VOICE_VISION_PROMPT = (
    "Tu es un assistant éducatif pour enfants africains. "
    "Décris le contenu de cette image de manière concise et orale "
    "(pas de Markdown, pas de LaTeX, pas de listes, pas de symboles). "
    "Le texte sera lu à voix haute à un enfant. "
    "Si c'est un exercice scolaire, décris clairement les questions "
    "ou problèmes visibles. "
    "Réponds en français simple et court (maximum 200 mots)."
)

async def _analyze_image_for_voice(
    image_base64: str,
    mime_type: str,
    context: str = "",
    model: str = "",
) -> str:
    """Non-streaming call to OpenRouter with a vision model."""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": VOICE_VISION_PROMPT},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{mime_type};base64,{image_base64}",
                    },
                },
            ],
        }
    ]

    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(
            "https://openrouter.ai/api/v1/chat/completions",
            json={
                "model": model,
                "messages": messages,
                "max_tokens": 800,
                "temperature": 0.3,
            },
            headers={
                "Authorization": f"Bearer {settings.OPENROUTER_API_KEY}",
            },
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```
The vision prompt is specifically designed for voice output: no Markdown formatting, no LaTeX, no bullet points -- just plain spoken French that sounds natural when read aloud by the AI voice. The analysis is returned to the Ultravox session as the tool result, and the AI incorporates it into the ongoing conversation seamlessly.
If the primary vision model (OpenRouter) fails, we fall back to Granite Vision via Replicate. This dual-provider approach ensures photo analysis works even during API outages.
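The fallback logic itself is a simple try-first, catch-everything chain. A minimal sketch, using hypothetical stub functions in place of the real OpenRouter and Replicate calls (the stub names and behavior here are illustrative, not the actual service code):

```python
import asyncio

# Hypothetical provider stubs standing in for the real OpenRouter and
# Replicate calls; both take the same (image_base64, mime_type) arguments.
async def _openrouter_vision(image_base64: str, mime_type: str) -> str:
    raise RuntimeError("simulated OpenRouter outage")

async def _replicate_granite_vision(image_base64: str, mime_type: str) -> str:
    return "Description from Granite Vision (fallback)"

async def analyze_with_fallback(image_base64: str, mime_type: str) -> str:
    """Try the primary vision provider; on any error, use the secondary."""
    try:
        return await _openrouter_vision(image_base64, mime_type)
    except Exception:
        return await _replicate_granite_vision(image_base64, mime_type)

result = asyncio.run(analyze_with_fallback("...", "image/jpeg"))
```

The broad `except` is deliberate: during a call, any answer from the secondary provider beats an error surfaced to a child mid-conversation.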
---
Voice States: The Client-Side State Machine
On the client side, the voice call goes through a well-defined set of states:
idle -> connecting -> active -> listening/thinking/speaking -> ended

The state machine governs the UI:
- idle: The call button is visible. No active session.
- connecting: The WebRTC connection is being established. A spinner and a "Connexion en cours..." ("Connecting...") message are shown.
- active: The connection is established. The AI is either listening, thinking, or speaking.
  - listening: The AI is receiving audio from the student. A subtle animation indicates active listening.
  - thinking: The AI is processing a response. A thinking indicator shows.
  - speaking: The AI is speaking. A waveform animation visualizes the audio output.
- ended: The call is over. The transcript is displayed and credits are deducted.
The state transitions are driven by events from the Ultravox client SDK. On web, these come through the WebRTC data channel. On mobile, they come through a combination of LiveKit room events and Ultravox WebSocket messages.
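The legal edges of this state machine can be captured in a small transition table. A sketch in Python for brevity (the real client is TypeScript, and the exact edge set is an assumption -- for example, we treat a connection failure as a direct jump to ended):

```python
# Allowed transitions of the voice-call state machine (illustrative edges).
TRANSITIONS: dict[str, set[str]] = {
    "idle": {"connecting"},
    "connecting": {"active", "ended"},  # connection failure ends the call
    "active": {"listening", "thinking", "speaking", "ended"},
    "listening": {"thinking", "speaking", "ended"},
    "thinking": {"listening", "speaking", "ended"},
    "speaking": {"listening", "thinking", "ended"},
    "ended": set(),
}

def transition(current: str, target: str) -> str:
    """Move to `target` if the edge exists; otherwise stay in `current`."""
    return target if target in TRANSITIONS.get(current, set()) else current
```

Guarding transitions this way means a late or duplicate SDK event (say, a stray "speaking" after hangup) cannot resurrect a dead call's UI.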
---
Mobile Implementation: LiveKit + React Native
The mobile implementation is the most complex part of the voice system. Expo Go does not support native WebRTC modules, so voice calls require a native build (Expo Dev Client or standalone build).
We use @livekit/react-native for the audio transport layer. LiveKit handles all the WebRTC complexity -- ICE candidate negotiation, DTLS handshakes, audio codec selection, network quality adaptation -- through a high-level React Native API.
The mobile voice service manages the connection lifecycle conceptually like this:
```typescript
// Mobile voiceService pattern (conceptual)
type VoiceState =
  | 'idle'
  | 'connecting'
  | 'active'
  | 'listening'
  | 'thinking'
  | 'speaking'
  | 'ended';

interface VoiceSession {
  sessionId: string;
  conversationId: string;
  joinUrl: string;
  maxMinutes: number;
}

class VoiceService {
  private state: VoiceState = 'idle';
  private room: Room | null = null;
  private transcript: TranscriptEntry[] = [];
  private startTime: number = 0;
  private sessionId: string = '';

  async startCall(): Promise<VoiceSession> {
    this.state = 'connecting';

    // 1. Create session via Deblo API
    const response = await api.post('/voice/call');
    const { session_id, join_url, conversation_id, max_minutes } = response.data;
    this.sessionId = session_id;

    // 2. Connect to LiveKit room
    this.room = new Room();
    await this.room.connect(join_url);

    // 3. Enable microphone
    await this.room.localParticipant.setMicrophoneEnabled(true);

    // 4. Listen for agent audio and state changes
    this.room.on(RoomEvent.TrackSubscribed, (track) => {
      if (track.kind === Track.Kind.Audio) {
        // Agent audio is playing
        this.state = 'speaking';
      }
    });

    this.state = 'active';
    this.startTime = Date.now();

    return {
      sessionId: session_id,
      joinUrl: join_url,
      conversationId: conversation_id,
      maxMinutes: max_minutes,
    };
  }

  async endCall(): Promise<EndCallResponse> {
    // 1. Disconnect from the LiveKit room
    await this.room?.disconnect();
    this.room = null;

    // 2. Notify backend to end session and calculate credits
    const response = await api.post(
      `/voice/call/${this.sessionId}/end`
    );

    this.state = 'ended';
    return response.data;
  }
}
```
The Ultravox WebSocket protocol runs alongside the LiveKit connection for transcript data and tool invocations. When the AI invokes the upload_photo tool, the Ultravox client fires a callback that the React Native layer catches to open the camera.
---
Ending the Call: Transcripts and Credits
When a call ends (either the user hangs up, the 15-minute limit is reached, or the connection drops), the backend processes the session:
```python
# backend/app/routes/voice.py
@router.post("/voice/call/{session_id}/end")
async def end_voice_call(
    session_id: UUID,
    user: User | None = Depends(get_current_user_optional),
    db: AsyncSession = Depends(get_db),
):
    """End a voice call and charge credits."""
    session = await _find_session(session_id, user, db)

    # Retrieve call details from Ultravox API
    duration_seconds = 0
    transcript_data = []

    if session.ultravox_call_id:
        call_info = await get_ultravox_call(session.ultravox_call_id)
        duration_seconds = call_info.get("duration", 0)
        transcript_data = await get_ultravox_transcript(
            session.ultravox_call_id
        )

    # Calculate and deduct credits (authenticated only)
    actual_cost = 0
    if user:
        cost_per_min = await get_setting(
            "credit_cost_voice_per_minute",
            db,
            settings.CREDIT_COST_VOICE_PER_MINUTE,
        )
        minutes = max(1, ceil(duration_seconds / 60))
        total_cost = minutes * cost_per_min

        balance = await get_balance(user, db)
        actual_cost = min(total_cost, balance["total"])

        if actual_cost > 0:
            await deduct_credits(
                user,
                actual_cost,
                "voice",
                session.conversation_id,
                db,
            )

    # Update session record
    session.status = "ended"
    session.ended_at = datetime.now(timezone.utc)
    session.duration_seconds = duration_seconds
    session.credits_charged = actual_cost
    session.transcript = transcript_data

    return EndCallResponse(
        duration_seconds=duration_seconds,
        credits_charged=actual_cost,
        transcript=transcript_data,
        new_balance=balance["total"] if user else 0,
    )
```
The credit calculation is deliberately lenient: we round up to the nearest minute (so a 30-second call costs 1 minute), but if the user's balance is less than the total cost, we deduct only what they have rather than denying the charge. The student already had the conversation -- punishing them retroactively would be a poor experience.
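Pulled out of the route handler, the whole lenient policy is three lines of arithmetic -- round up, price it, clamp to the balance:

```python
from math import ceil

def compute_charge(duration_seconds: int, cost_per_minute: int, balance: int) -> int:
    """Charge whole minutes, rounded up, but never more than the user has."""
    minutes = max(1, ceil(duration_seconds / 60))  # a 30s call bills as 1 minute
    return min(minutes * cost_per_minute, balance)  # clamp instead of rejecting
```

So at 5 credits per minute, a 30-second call charges 5 credits, a 10-minute-10-second call charges 55, and a user with only 40 credits left after a full 15-minute call is charged exactly 40.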
The transcript is stored as a JSONB array on both the VoiceSession and the associated Conversation. Each entry contains the speaker role, the text, and a timestamp. This allows the conversation to appear in the user's chat history just like any text conversation, with the transcript formatted as alternating user and assistant messages.
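Rendering the stored transcript in chat history is then a straight mapping from transcript entries to chat-style messages. A minimal sketch, assuming each entry carries `role`, `text`, and `ts` fields (the field names and the `agent` role label are illustrative):

```python
def transcript_to_messages(transcript: list[dict]) -> list[dict]:
    """Map voice-transcript entries to alternating chat messages.

    Assumes each entry looks like {"role": "agent"|"user", "text": ..., "ts": ...};
    the AI's "agent" role is renamed to the chat UI's "assistant".
    """
    role_map = {"agent": "assistant", "user": "user"}
    return [
        {"role": role_map.get(entry["role"], entry["role"]), "content": entry["text"]}
        for entry in transcript
    ]
```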
---
WhatsApp-Style Voice Notes for Pro Users
Professional users do not get full voice calls, but they do get voice notes. The UI presents a large indigo button that, when held, records audio with a real-time waveform visualization -- similar to WhatsApp's voice note recording interface.
The recorded audio is sent to the backend as a base64-encoded attachment alongside the text message. The backend uses an audio-capable model (via OpenRouter) to transcribe the audio and incorporates the transcription into the conversation context. The AI responds in text, not voice, because professional output needs to be reviewable and copyable.
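The shape of that request is a standard multimodal chat message with an audio content part alongside the optional text. A sketch of the payload construction, assuming an OpenAI-style `input_audio` part -- the exact schema the audio-capable model expects, and the audio format, are assumptions here:

```python
import base64

def build_voice_note_message(audio_bytes: bytes, text: str = "") -> dict:
    """Build a multimodal user message carrying a base64 voice note.

    The `input_audio` content-part shape follows the OpenAI-style chat
    format; the field names and "m4a" format are illustrative.
    """
    content: list[dict] = []
    if text:
        content.append({"type": "text", "text": text})
    content.append({
        "type": "input_audio",
        "input_audio": {
            "data": base64.b64encode(audio_bytes).decode("ascii"),
            "format": "m4a",  # depends on what the mobile recorder produces
        },
    })
    return {"role": "user", "content": content}
```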
This two-tier approach -- voice calls for students, voice notes for professionals -- reflects the different use cases. A child explaining a math problem benefits from real-time back-and-forth dialogue. A chartered accountant describing a financial situation benefits from dictation that produces a precise written response.
---
What We Learned About Voice AI
1. Voice-first changes everything about prompt engineering. Prompts for voice models must explicitly prohibit Markdown, LaTeX, bullet points, and any formatting that does not translate to speech. We learned this the hard way when the AI started dictating LaTeX formulas aloud: "backslash frac open brace x plus 3 close brace..."
2. The first speaker matters. Having the AI speak first when the student joins eliminates the awkward "hello? is this working?" moment. Children expect someone to greet them when they call.
3. Photo during call is a differentiator. The ability to photograph an exercise mid-conversation, without ending the call, transforms the voice feature from a novelty into a genuine learning tool. The student can say "I do not understand question 3" and then show question 3, and the AI sees it while still in the conversation.
4. 15 minutes is the right limit. Long calls are expensive (in credits and in API costs), and students' attention spans are finite. Fifteen minutes is long enough for a meaningful tutoring session on a single topic and short enough to prevent runaway costs.
5. Native builds are unavoidable for WebRTC on mobile. We spent two days trying to make voice calls work in Expo Go before accepting that WebRTC requires native modules. The @livekit/react-native SDK is excellent but demands a native build chain (Expo Dev Client or bare workflow). This added complexity to our mobile development pipeline but was non-negotiable.
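The formatting problem in lesson 1 can also be mitigated defensively, not just at the prompt level. A sketch of a post-hoc sanitizer that strips the worst offenders before text reaches the voice, offered as an illustration rather than our production code:

```python
import re

def strip_for_speech(text: str) -> str:
    """Best-effort cleanup of model output before it is spoken aloud.

    Removes inline LaTeX delimiters, Markdown emphasis/heading markers,
    and bullet prefixes -- a safety net behind the prompt prohibition.
    """
    text = re.sub(r"\$+([^$]*)\$+", r"\1", text)                  # $x+3$ -> x+3
    text = re.sub(r"[*_`#]+", "", text)                           # **bold**, # heading
    text = re.sub(r"^\s*[-•]\s+", "", text, flags=re.MULTILINE)   # bullet markers
    return text.strip()
```

A net like this catches stray formatting, but it cannot rescue a full LaTeX expression such as `\frac{x+3}{2}` -- which is why the prompt-level prohibition remains the primary defense.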
Voice calls are, more than any other feature, what makes Deblo feel like a real tutor rather than a chatbot. When a student speaks to the AI and the AI speaks back -- in fluent, natural French, with an encouraging tone and the patience to explain the same concept three different ways -- the technology disappears. What remains is a child learning.
---
This is article 8 of 12 in the "How We Built Deblo.ai" series.
1. The Architecture of an African AI Tutor
2. Prompt Engineering for 15 School Subjects
3. Photo Analysis: From Homework to AI
4. Building Deblo Pro: 101 AI Advisors for African Professionals
5. WhatsApp OTP and the African Authentication Problem
6. Credits, FCFA, and 6 African Payment Gateways
7. SSE Streaming: Real-Time AI Responses in SvelteKit
8. Voice Calls With AI: Ultravox, LiveKit, and WebRTC (you are here)
9. The Curriculum Engine: CEPE, BEPC, and BAC Prep
10. Gamification: XP, Streaks, and Bonus Credits
11. Going Mobile: React Native and Expo
12. From Abidjan to Production: Deploying Deblo