You cannot improve what you cannot measure. This is true for any software product, but it is especially true for AI products where the core behavior -- what the model says, how long it takes, how much it costs -- is fundamentally non-deterministic.
When a student asks "Explique-moi les fractions" ("Explain fractions to me"), the response depends on the model, the temperature, the system prompt, the conversation history, and the random seed. The same question asked twice may produce different answers of different lengths at different costs. If you are not logging every call, you are flying blind.
We log everything. Every LLM API call. Every tool invocation. Every credit movement. Every exercise result. Every admin action. This article covers the observability infrastructure that lets us monitor, debug, and optimize an AI education platform serving students across Africa.
The AILog Model
At the center of our observability system is the AILog table. It records every single LLM API call made through OpenRouter:
```python
class AILog(Base):
    __tablename__ = "ai_logs"

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid4)
    user_id = Column(UUID(as_uuid=True), ForeignKey("users.id"), nullable=True)
    conversation_id = Column(UUID(as_uuid=True), nullable=True)
    model_used = Column(String(100))
    has_images = Column(Boolean, default=False)
    input_tokens = Column(Integer, nullable=True)
    output_tokens = Column(Integer, nullable=True)
    response_time_ms = Column(Integer, nullable=True)
    error = Column(Text, nullable=True)
    created_at = Column(DateTime(timezone=True), server_default=func.now())
```
Every field serves a specific monitoring purpose:
model_used tracks which model processed the request. We use multiple models: DeepSeek V3 for text conversations, GPT-4o Mini for vision (photo homework help), Mistral Large for memory summarization. This field lets us break down costs and performance by model.
has_images flags multimodal requests. Image processing costs significantly more than text-only requests (both in tokens and in API pricing). Tracking this lets us monitor how frequently students use the photo input feature and what it costs.
input_tokens and output_tokens come directly from the OpenRouter response. These are the ground truth for cost calculation. OpenRouter reports exact token counts in the response headers, and we extract and store them immediately.
response_time_ms measures end-to-end latency from the moment we send the request to OpenRouter until we receive the complete response (or the last SSE chunk for streaming requests). This is our primary performance metric.
error captures the error message when an API call fails. This includes HTTP errors (429 rate limit, 500 server error), timeout errors, and JSON parsing errors. A non-null error field means the user experienced a failure.
The log entry is created at the end of each LLM call in the streaming loop. For a single user message that triggers 3 tool iterations (agentic loop), we create 3 AILog entries -- one per LLM call. This granularity lets us analyze per-iteration behavior: how much does the first LLM call cost versus the third? How does latency change as the context grows within a single agentic loop?
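Given one row per LLM call, per-iteration analysis reduces to a small aggregation. Here is a minimal sketch in Python over already-fetched rows -- the field names mirror AILog, but the grouping-by-turn structure is illustrative, not our production query:

```python
from collections import defaultdict

def avg_tokens_by_iteration(turns):
    """Average total tokens per LLM call, keyed by the call's position
    within its agentic loop. `turns` is a list of turns, each a list of
    per-call dicts (AILog rows ordered by created_at)."""
    totals = defaultdict(lambda: [0, 0])  # iteration -> [token sum, call count]
    for calls in turns:
        for i, call in enumerate(calls, start=1):
            totals[i][0] += call["input_tokens"] + call["output_tokens"]
            totals[i][1] += 1
    return {i: total // count for i, (total, count) in totals.items()}
```

The same shape works for latency: swap the token sum for response_time_ms.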
The Admin Dashboard
The admin dashboard at /admin-7f3a9c2d/ (more on this URL later) provides a real-time view of platform health. The stats endpoint aggregates data across users, conversations, and financial metrics:
@router.get("/stats")
async def dashboard_stats(
period: str = Query(default="today"),
country: str | None = Query(default=None),
admin: User = Depends(get_admin_user),
db: AsyncSession = Depends(get_db),
):
start, end = _period_range(period)# Country filter helper def user_country_filter(q): if country: return q.where(User.country == country) return q
# Users total_users = (await db.execute(user_country_filter( select(func.count(User.id)) ))).scalar() or 0
new_in_period = (await db.execute(user_country_filter( select(func.count(User.id)) .where(User.created_at >= start, User.created_at < end) ))).scalar() or 0
active_users = (await db.execute(user_country_filter( select(func.count(User.id)).where(User.is_active == True) ))).scalar() or 0
pro_count = (await db.execute(user_country_filter( select(func.count(User.id)).where(User.user_type == "professional") ))).scalar() or 0
# ... conversations, messages, revenue, credit usage, voice sessions ```
The endpoint supports five period filters: today, yesterday, week, month, year. Each filter defines a (start, end) datetime range:
- today: midnight UTC to now
- yesterday: yesterday midnight to today midnight
- week: Monday midnight of current week to now
- month: first of current month to now
- year: January 1st to now
The optional country filter lets us drill down into specific markets. We can answer questions like "How many new users did we get in Côte d'Ivoire this week?" or "What is the total revenue from Senegal this month?" instantly.
The dashboard displays:

- New users and active users for the period
- Child vs. professional user split
- Total conversations and messages
- Total revenue (credit purchases)
- Conversation breakdown by class and subject
- Voice session count and duration
This is not a third-party analytics tool. It is a custom-built dashboard querying our PostgreSQL database directly. The queries are fast because the tables have appropriate indexes, and the data volumes are manageable (tens of thousands of rows, not millions -- yet).
Dynamic Configuration: SystemSetting
One of the most powerful tools in our observability arsenal is not a monitoring tool at all -- it is a configuration system that lets us change the platform's behavior without deploying code:
```python
class SystemSetting(Base):
    __tablename__ = "system_settings"

    key = Column(String(100), primary_key=True)
    value = Column(Text, nullable=False)  # JSON-encoded
    updated_at = Column(DateTime(timezone=True), server_default=func.now(), onupdate=func.now())
```
A simple key-value store with JSON-encoded values. But the impact is enormous. Here is what we store in SystemSetting:
- root_prompt: the system prompt for child mode. We can edit the AI's personality, teaching style, and rules without a code deploy.
- pro_root_prompt: the system prompt for professional mode.
- domain_overlays: per-domain prompt overlays for the 101 AI advisors.
- llm_model: the model name for text conversations. We can switch from DeepSeek V3 to Claude or GPT-4o by changing one database row.
- llm_vision_model: the model for image analysis.
- llm_memory_model: the model for summarization.
- credit_costs: per-action credit costs (e.g., text message = 1 credit, image analysis = 3 credits, voice call = 5 credits/minute).
- maintenance_mode: a boolean that, when true, returns a maintenance page for all users except admins.
Settings are read via a helper function, with an in-memory cache (configurable TTL) sitting in front of the database read; the cache layer is omitted from this snippet:
```python
import json

from sqlalchemy import func, select
from sqlalchemy.ext.asyncio import AsyncSession

from app.models.system_setting import SystemSetting


async def get_setting(db: AsyncSession, key: str, default=None):
    result = await db.execute(
        select(SystemSetting.value).where(SystemSetting.key == key)
    )
    row = result.scalar_one_or_none()
    if row is None:
        return default
    return json.loads(row)


async def set_setting(db: AsyncSession, key: str, value):
    from sqlalchemy.dialects.postgresql import insert

    stmt = insert(SystemSetting).values(
        key=key, value=json.dumps(value)
    ).on_conflict_do_update(
        index_elements=["key"],
        set_={"value": json.dumps(value), "updated_at": func.now()},
    )
    await db.execute(stmt)
    await db.commit()
```
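The TTL cache in front of get_setting can be a small decorator. A sketch under the assumption of a single-process, in-memory cache (the 60-second TTL is arbitrary -- the real value is configurable):

```python
import time
from typing import Any, Awaitable, Callable

def ttl_cached(ttl_seconds: float) -> Callable:
    """Wrap an async (db, key, default) getter with an in-memory TTL cache."""
    cache: dict[str, tuple[float, Any]] = {}

    def decorate(fetch: Callable[..., Awaitable[Any]]):
        async def cached(db, key: str, default=None):
            entry = cache.get(key)
            now = time.monotonic()
            if entry is not None and now - entry[0] < ttl_seconds:
                return entry[1]  # fresh hit: skip the database round trip
            value = await fetch(db, key, default)
            cache[key] = (now, value)
            return value
        return cached
    return decorate
```

Applied as `@ttl_cached(60)` on get_setting, repeated reads within the TTL window never touch PostgreSQL.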
The set_setting function uses PostgreSQL's ON CONFLICT DO UPDATE (upsert) to atomically create or update a setting. The admin dashboard exposes these settings through a UI, allowing us to:
- Tweak the system prompt in real time and observe the effect on conversation quality
- Switch LLM models during an outage (if DeepSeek is down, switch to GPT-4o in seconds)
- Adjust credit costs based on market feedback
- Enable maintenance mode during database migrations
This is observability in the broadest sense -- not just watching the system, but being able to act on what you observe without going through a deploy cycle.
ExerciseResult: Learning Analytics
For an educational platform, the most important metric is not latency or cost -- it is whether students are learning. The ExerciseResult model tracks individual quiz outcomes:
```python
class ExerciseResult(Base):
    __tablename__ = "exercise_results"

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid4)
    user_id = Column(UUID(as_uuid=True), ForeignKey("users.id"), nullable=False, index=True)
    conversation_id = Column(UUID(as_uuid=True), ForeignKey("conversations.id"), nullable=True)
    subject = Column(String(50), nullable=True)
    class_id = Column(String(20), nullable=True)
    correct = Column(Boolean, nullable=False)
    difficulty = Column(String(20), nullable=True)
    topic = Column(String(200), nullable=True)
    created_at = Column(DateTime(timezone=True), server_default=func.now(), index=True)
```
Every time a student answers a quiz question (via the interactive_quiz tool) or the AI reports an exercise result (via the report_exercise_result tool), a row is inserted into this table. The fields capture:
- What the student was studying: subject (mathematiques, physique, francais), class (CP, CE1, 6eme, Terminale), topic (fractions, lois de Newton, conjugaison)
- How hard it was: difficulty (facile, moyen, difficile)
- Whether they got it right: the boolean correct field
With this data, we can answer critical educational questions:
- What subjects do students struggle with most? (aggregate correct by subject)
- Do students improve over time? (compare correct rate by created_at over weeks)
- Which topics have the lowest success rate for a specific class level? (filter by class_id, group by topic)
- Is a specific student ready for their exam? (filter by user_id, aggregate recent correct rates)
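The first of those questions is a one-liner once the rows are in hand. A pure-Python sketch of the aggregation (in production this is a SQL GROUP BY, not an in-memory loop):

```python
from collections import defaultdict

def success_rate_by_subject(rows):
    """Per-subject success rate from (subject, correct) pairs,
    standing in for exercise_results rows."""
    totals = defaultdict(lambda: [0, 0])  # subject -> [correct count, attempts]
    for subject, correct in rows:
        totals[subject][1] += 1
        if correct:
            totals[subject][0] += 1
    return {s: round(c / n, 2) for s, (c, n) in totals.items()}
```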
The admin dashboard surfaces these analytics at the aggregate level. Individual student progress is visible through the student's own dashboard and through the organization dashboard (for teachers).
The CreditLedger as Audit Trail
Every credit movement in the system is logged in the CreditLedger. This is not just financial tracking -- it is a complete audit trail of how the platform's economy works:
- credit_purchase: user bought 100 credits for 1,000 FCFA
- credit_usage: 1 credit consumed for a text message
- bonus_credit: AI awarded 2 bonus credits for a correct quiz answer
- referral_bonus: 50 credits awarded for referring a new user
- admin_adjustment: admin manually added 100 credits for a support case
Each ledger entry records the event type, the amount (positive or negative), the balance after the transaction, and a reference ID linking to the source (purchase ID, conversation ID, coupon ID). This append-only, running-balance pattern ensures that we can always reconstruct any user's credit balance from the ledger alone -- we never rely solely on the cached credits_balance field on the user model.
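Reconstruction is mechanical: replay the ledger oldest-first and check each stored running balance. A minimal sketch, with entries as (amount, balance_after) tuples:

```python
def replay_ledger(entries):
    """Recompute a balance from ledger entries alone, verifying each
    stored balance_after along the way. Raises on any mismatch."""
    balance = 0
    for amount, balance_after in entries:
        balance += amount
        if balance != balance_after:
            raise ValueError(
                f"ledger mismatch: expected {balance}, stored {balance_after}"
            )
    return balance
```

If this ever raises, either a write skipped the ledger or a balance was mutated out of band -- both worth an alert.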
The admin can query the ledger to answer questions like:

- "How many credits did we give away as bonuses this month?" (filter by event type)
- "Which users are consuming the most credits?" (aggregate by user_id)
- "Is the AI being too generous with bonus credits?" (aggregate bonus_credit events)
That last question is particularly important. The AI has a tool (award_bonus_credits) that lets it give students extra credits for good performance. Without observability, a misconfigured system prompt could cause the AI to give away hundreds of credits per conversation. The ledger and AILog together let us detect and correct this.
The Obscured Admin Route
Security through obscurity is not security. But obscurity layered on top of authentication adds defense in depth. Our admin dashboard is not at /admin. It is at /admin-7f3a9c2d/:
```
src/routes/admin-7f3a9c2d/+layout.svelte
src/routes/admin-7f3a9c2d/+page.svelte
src/routes/admin-7f3a9c2d/users/+page.svelte
src/routes/admin-7f3a9c2d/conversations/+page.svelte
src/routes/admin-7f3a9c2d/ledger/+page.svelte
src/routes/admin-7f3a9c2d/orgs/+page.svelte
src/routes/admin-7f3a9c2d/purchases/+page.svelte
src/routes/admin-7f3a9c2d/voice-sessions/+page.svelte
src/routes/admin-7f3a9c2d/files/+page.svelte
src/routes/admin-7f3a9c2d/projects/+page.svelte
```

The hash suffix (7f3a9c2d) is a random string that makes the admin URL unguessable. Automated scanners that probe /admin, /dashboard, /wp-admin will find nothing. The actual admin path is only known to us.
But the URL is not the security boundary. The real protection is the get_admin_user dependency that runs on every admin API endpoint:
```python
async def get_admin_user(request: Request, db: AsyncSession = Depends(get_db)):
    user = await get_current_user(request, db)
    if not user or not user.is_admin:
        raise HTTPException(status_code=401, detail="Unauthorized")
    return user
```

Every admin endpoint requires a valid JWT token belonging to a user with is_admin = True. There is exactly one admin user in the system. The obscured URL prevents discovery; the authentication prevents unauthorized access.
Notification Templates: DB-Stored, Admin-Overridable
Notification templates (for email, SMS, push, and in-app notifications) are stored in the database with default values defined in code:
```python
# Default templates defined in code
DEFAULT_TEMPLATES = {
    "welcome_email": {
        "subject": "Bienvenue sur Deblo.ai !",
        "body": "Bonjour {name}, bienvenue sur Deblo.ai...",
        "channel": "email",
    },
    "task_due_soon": {
        "subject": "Rappel : tâche à faire aujourd'hui",
        "body": "Votre tâche \"{task_title}\" est prévue pour aujourd'hui.",
        "channel": "push",
    },
    # ... 15+ more templates
}
```

The admin can override any template through the dashboard without a code deploy. The system checks the database first; if no override exists, it falls back to the code-defined default. This pattern -- code defaults with database overrides -- gives us the safety of version-controlled defaults with the flexibility of runtime changes.
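The lookup order is the whole pattern. A minimal sketch -- the defaults dict is truncated, and db_overrides stands in for rows read from the database (the real storage layer is not shown here):

```python
DEFAULTS = {
    "welcome_email": {
        "subject": "Bienvenue sur Deblo.ai !",
        "channel": "email",
    },
}

def resolve_template(key: str, db_overrides: dict) -> dict:
    """Database override wins; otherwise fall back to the
    version-controlled default defined in code."""
    if key in db_overrides:
        return db_overrides[key]
    return DEFAULTS[key]
```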
The Coupon System
The BonusCredit and CouponRedemption models power the coupon system, which is also observable through the admin dashboard:
```sql
-- BonusCredit (coupon definition)
CREATE TABLE bonus_credits (
    id UUID PRIMARY KEY,
    code VARCHAR(50) UNIQUE NOT NULL,
    credits_amount INTEGER NOT NULL,
    max_uses INTEGER DEFAULT 1,
    current_uses INTEGER DEFAULT 0,
    expires_at TIMESTAMP WITH TIME ZONE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- CouponRedemption (usage tracking)
CREATE TABLE coupon_redemptions (
    id UUID PRIMARY KEY,
    coupon_id UUID REFERENCES bonus_credits(id),
    user_id UUID REFERENCES users(id),
    redeemed_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    UNIQUE(coupon_id, user_id)  -- one redemption per user per coupon
);
```
The admin can create coupons with a specific credit amount, a maximum number of uses, and an optional expiration date. The UNIQUE(coupon_id, user_id) constraint prevents a user from redeeming the same coupon twice. The current_uses counter tracks how many times the coupon has been used, and the system refuses redemption when current_uses >= max_uses.
This is observable because every coupon creation, redemption, and rejection is logged. We can see which coupons are popular, which have expired unused, and which users are attempting to game the system by redeeming multiple times.
Cost Monitoring
The most critical observability metric for an AI product is cost. OpenRouter charges per token, and costs vary dramatically by model:
- DeepSeek V3: $0.14 per million input tokens, $0.28 per million output tokens
- GPT-4o Mini (vision): $0.15 per million input tokens, $0.60 per million output tokens
- Mistral Large (memory): $2.00 per million input tokens, $6.00 per million output tokens
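With those prices and the token counts stored in AILog, per-call cost is one multiplication. A sketch -- the model keys here are illustrative, not the exact model_used strings:

```python
# USD per million tokens: (input, output), from the price list above
PRICES = {
    "deepseek-v3": (0.14, 0.28),
    "gpt-4o-mini": (0.15, 0.60),
    "mistral-large": (2.00, 6.00),
}

def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one logged call, computed from its AILog token counts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```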
The AILog table lets us calculate exact costs:
```sql
SELECT
    model_used,
    DATE(created_at) AS day,
    COUNT(*) AS calls,
    SUM(input_tokens) AS total_input,
    SUM(output_tokens) AS total_output,
    SUM(response_time_ms) / COUNT(*) AS avg_latency_ms
FROM ai_logs
WHERE created_at >= NOW() - INTERVAL '7 days'
GROUP BY model_used, DATE(created_at)
ORDER BY day DESC, total_input DESC;
```

This query, run weekly, tells us exactly how much each model costs per day, how many calls it handles, and what the average latency is. We can spot anomalies immediately: a sudden spike in GPT-4o Mini calls means more students are uploading photos; a jump in Mistral Large calls means more conversations are ending (triggering summarization); a latency increase on DeepSeek V3 might indicate an OpenRouter outage.
We do not use a dedicated cost monitoring tool like Helicone or Langfuse. At our current scale, the AILog table and simple SQL aggregations give us everything we need. When we reach millions of daily API calls, we will likely need a dedicated observability pipeline. But for now, PostgreSQL is our observability platform.
What We Monitor Daily
Every morning, I (Thales) check:
1. New users -- how many signed up yesterday, from which countries.
2. Active conversations -- how many conversations happened, average message count per conversation.
3. Revenue -- total credit purchases, by country and payment gateway.
4. LLM costs -- total tokens consumed by model, estimated cost.
5. Error rate -- how many AILog entries have a non-null error field. Anything above 2% warrants investigation.
6. Latency -- average and p99 response time. If p99 exceeds 10 seconds, something is wrong.
And I (Claude) monitor through a different lens:
1. Token efficiency -- are conversations getting longer? Is the compression system activating? If the average conversation length is growing, we might need to adjust the compression threshold.
2. Tool usage patterns -- which tools are being called most frequently? A spike in generate_pdf calls means our document generation is popular. A spike in report_bug calls means something is broken.
3. Memory quality -- spot-check AIMemory entries for coherence. If summaries are degrading, the memory model might need to be upgraded.
4. Exercise results -- aggregate correct/incorrect rates by subject. If a subject's success rate drops suddenly, the system prompt for that subject might need adjustment.
The Broader Lesson
Observability for AI products is fundamentally different from observability for traditional software. In a traditional web application, you monitor request latency, error rates, and database query performance. The application's behavior is deterministic -- the same input always produces the same output.
In an AI product, the behavior is stochastic. The same input might produce a brilliant explanation or a mediocre one. The same prompt might cost 500 tokens or 5,000 tokens depending on the model's verbosity that day. A tool call that usually takes 2 seconds might take 30 seconds because the model decided to generate a 50-page spreadsheet.
This means you need more granular logging, more aggressive alerting, and more human review than a traditional product. You need to log every API call, not just errors. You need to track costs per user, not just aggregate costs. You need to review AI outputs for quality, not just for errors.
We built this observability from day one. Not as an afterthought, not as a "v2 feature." The AILog table was one of the first models we created, before we had a working chat interface. Because we knew that without observability, we would be building blind -- and building blind with LLMs means burning money and degrading quality without even knowing it.
---
This is article 20 of 20 in the "How We Built Deblo.ai" series. Thank you for following along.
1. AI Tutoring for 250 Million African Students
2. 100 Sessions Later: The Architecture of an AI Education Platform
3. The Agentic Loop: 24 AI Tools in a Single Chat
4. System Prompts That Teach: Anti-Cheating, Socratic Method, and Grade-Level Adaptation
5. WhatsApp OTP and the African Authentication Problem
6. Credits, FCFA, and 6 African Payment Gateways
7. SSE Streaming: Real-Time AI Responses in SvelteKit
8. Voice Calls With AI: Ultravox, LiveKit, and WebRTC
9. Building a React Native K12 App in 7 Days
10. 101 AI Advisors: Professional Intelligence for Africa
11. Background Jobs: When AI Takes 30 Minutes to Think
12. From Abidjan to 250 Million: The Deblo.ai Story
13. Generating PDFs, Spreadsheets, and Slide Decks From a Chat Message
14. Organizations: Families, Schools, and Companies on One Platform
15. Interactive Quizzes With LaTeX: Testing Students Inside a Chat
16. RAG Pipeline: Document Search With pgvector and Semantic Chunking
17. Six Languages, One Platform: i18n for Africa
18. Tasks, Goals, and Recurring Reminders
19. AI Memory and Context Compression
20. Observability: Tracking Every LLM Call in Production (you are here)