A student opens the app on Monday and tells the AI they are struggling with fractions. The AI walks them through the concept, gives examples, generates a quiz. The student answers 3 out of 5 questions correctly. They close the app.
On Wednesday, the student opens a new conversation: "Aide-moi avec mes maths" ("Help me with my math"). If the AI has no memory, it starts from zero. It does not know the student struggled with fractions. It does not know they scored 60% on the quiz. It does not know that the student's exam is on Friday. It is as if the previous conversation never happened.
This is the default behavior of every LLM API. Each API call is stateless. The model has no memory between requests. Whatever context you want it to have, you must send it explicitly in the system prompt or message history.
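A minimal sketch of what statelessness means in practice (the model name and request shape here are placeholders, not Deblo.ai's actual client code): every single request must carry the full history the model is supposed to "know."

```python
# Stateless chat API: the model only "remembers" what you re-send each call.
history = [
    {"role": "user", "content": "I'm struggling with fractions."},
    {"role": "assistant", "content": "Let's start with what a fraction means..."},
]


def build_request(new_message: str) -> dict:
    """Each request re-sends the entire history plus the new message."""
    return {
        "model": "example/model",  # placeholder model name
        "messages": history + [{"role": "user", "content": new_message}],
    }
```

Drop the history, and the model answers as if meeting the user for the first time.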
We solved this with two complementary systems: AI Memory (cross-conversation summaries) and Context Compression (within-conversation token management). Together, they give the AI the illusion of persistent memory while keeping token costs under control.
The AIMemory Model
Every completed conversation generates a memory entry:
```python
class AIMemory(Base):
    __tablename__ = "ai_memories"

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid4)
    user_id = Column(
        UUID(as_uuid=True), ForeignKey("users.id"), nullable=False, index=True
    )
    conversation_id = Column(
        UUID(as_uuid=True), ForeignKey("conversations.id"), nullable=True
    )
    title = Column(String(255), nullable=False)
    content = Column(Text, nullable=False)
    created_at = Column(DateTime(timezone=True), server_default=func.now())

    __table_args__ = (
        Index("ix_ai_memories_user_id", "user_id"),
    )
```
The model is intentionally minimal. A title (4-10 words) and a content field (2-4 sentences, maximum 200 words). That is all the AI needs to recall what happened in a previous conversation. More detail would waste tokens. Less detail would lose critical context.
The conversation_id links back to the source conversation for traceability but is nullable -- the AI can also save standalone memory entries through the save_memory tool (more on that below).
Auto-Summarization: Fire-and-Forget
When a conversation ends (the user navigates away or starts a new conversation), the backend fires an asynchronous summarization task. This is the core of the memory system:
```python
SUMMARY_SYSTEM_PROMPT = (
    "Tu es un assistant qui analyse des conversations. "
    "Retourne un objet JSON avec exactement deux champs :\n"
    '"title" : un titre de 4 a 10 mots en francais resumant le sujet principal\n'
    '"summary" : une chaine de texte de 2 a 4 phrases concises '
    "en francais mentionnant le sujet, les points cles et les conclusions. "
    "Maximum 200 mots.\n"
    "Retourne UNIQUEMENT le JSON valide, sans balises markdown ni texte autour."
)


async def generate_and_save_summary(
    user_id: UUID,
    conversation_id: UUID,
    messages: list[dict],
    title: str,
    db: AsyncSession,
    update_title: bool = False,
) -> None:
    """Generate a conversation summary via LLM_MEMORY_MODEL, save as AIMemory.

    Fire-and-forget: never blocks the user's response.
    """
    try:
        # Build a minimal transcript for the summarizer
        transcript_parts = []
        for msg in messages:
            role = msg.get("role", "unknown")
            content = msg.get("content", "")
            if isinstance(content, list):
                content = " ".join(
                    p.get("text", "")
                    for p in content
                    if isinstance(p, dict) and p.get("type") == "text"
                )
            if content:
                label = "Utilisateur" if role == "user" else "Assistant"
                # Truncate very long messages for the summarizer
                if len(content) > 500:
                    content = content[:500] + "..."
                transcript_parts.append(f"{label} : {content}")

        if not transcript_parts:
            return

        transcript = "\n".join(transcript_parts)

        async with httpx.AsyncClient(timeout=30.0) as client:
            response = await client.post(
                OPENROUTER_URL,
                headers={
                    "Authorization": f"Bearer {settings.OPENROUTER_API_KEY}",
                    "HTTP-Referer": "https://deblo.ai",
                    "Content-Type": "application/json",
                },
                json={
                    "model": settings.LLM_MEMORY_MODEL,
                    "messages": [
                        {"role": "system", "content": SUMMARY_SYSTEM_PROMPT},
                        {"role": "user", "content": transcript},
                    ],
                    "max_tokens": 600,
                    "temperature": 0.3,
                    "stream": False,
                },
            )
            response.raise_for_status()
            data = response.json()

        # Parse JSON, extract title + summary. Strip possible markdown
        # fences first (a simplified version of the full fallback chain
        # described below).
        raw = data["choices"][0]["message"]["content"] or ""
        cleaned_response = raw.strip().strip("`").removeprefix("json").strip()
        parsed = json.loads(cleaned_response)
        summary = parsed.get("summary", "")
        new_title = parsed.get("title", "").strip()
        if update_title and new_title:
            title = new_title  # adopt the model-generated title when requested

        # Save as AIMemory
        memory = AIMemory(
            user_id=user_id,
            conversation_id=conversation_id,
            title=title[:255],
            content=summary,
        )
        db.add(memory)
        await db.commit()

    except Exception:
        logger.exception("Failed to generate/save conversation summary")
```
Several design decisions deserve explanation.
Fire-and-forget. The summarization task is launched with asyncio.create_task() and never awaited by the main request handler. The user sees no delay. If the summarization fails -- network timeout, model error, JSON parsing failure -- the failure is logged silently. The user never knows. The next conversation will simply have one fewer memory entry, which is an acceptable degradation.
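One subtlety worth sketching (the helper name `fire_and_forget` and the task registry are illustrative, not Deblo.ai's actual code): a bare `asyncio.create_task()` can let the task be garbage-collected mid-flight or leave its exception unretrieved, so a small wrapper keeps a strong reference and logs failures.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)
_background_tasks: set[asyncio.Task] = set()


def fire_and_forget(coro) -> asyncio.Task:
    """Run a coroutine in the background; log (never raise) its failures."""
    task = asyncio.create_task(coro)
    _background_tasks.add(task)  # strong reference: prevents premature GC
    task.add_done_callback(_on_done)
    return task


def _on_done(task: asyncio.Task) -> None:
    _background_tasks.discard(task)
    if not task.cancelled() and task.exception() is not None:
        # Retrieve the exception so asyncio doesn't warn about it later
        logger.error("Background task failed", exc_info=task.exception())
```

The caller invokes `fire_and_forget(generate_and_save_summary(...))` and moves on; the done-callback is what makes failures visible in the logs without ever surfacing to the user.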
Message truncation. Each message in the transcript is capped at 500 characters. A long assistant response (which can be 2,000+ characters with code blocks and explanations) is truncated before being sent to the summarizer. This keeps the summarizer's input manageable and cheap. The truncation loses detail, but the summarizer only needs the gist -- not every code example.
The memory model. We use mistralai/mistral-large-2512 for summarization, configured via the LLM_MEMORY_MODEL setting. This model costs approximately $0.00005 per summary. At 1,000 conversations per day, memory summarization costs about $0.05 per day -- essentially free. We chose a capable model because the quality of the summary directly impacts the AI's ability to recall context. A bad summary is worse than no summary.
JSON output with fallback. We ask the summarizer to return JSON with title and summary fields. But models do not always follow instructions perfectly. Some wrap the JSON in markdown code blocks. Some return plain text. The parsing logic handles all cases: extract from code blocks, attempt JSON parse, and if that fails, treat the entire output as the summary and derive a title from the first sentence.
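That fallback chain can be sketched as a small parser (a simplified stand-in for the actual parsing code; the helper name `parse_summary` is ours):

```python
import json
import re


def parse_summary(raw: str) -> tuple[str, str]:
    """Best-effort extraction of (title, summary) from model output.

    Fallback chain: strip markdown fences -> JSON parse -> plain text.
    """
    text = raw.strip()
    # 1. Strip a fenced code block wrapper if present. The fence marker is
    #    built from parts so this listing itself stays a valid code block.
    fence = "`" * 3
    m = re.search(fence + r"(?:json)?\s*(.*?)\s*" + fence, text, re.DOTALL)
    if m:
        text = m.group(1)
    # 2. Try strict JSON
    try:
        parsed = json.loads(text)
        return parsed.get("title", "").strip(), parsed.get("summary", "").strip()
    except (json.JSONDecodeError, AttributeError):
        pass
    # 3. Plain-text fallback: derive a title from the first sentence
    first_sentence = text.split(". ")[0][:80]
    return first_sentence, text
```

Whatever the model returns, the function yields something usable; the worst case is a mediocre title, never a crash.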
The save_memory Tool
Beyond automatic summarization, the AI can explicitly save memory entries during conversation. The save_memory tool is exposed to the LLM as part of the agentic tool set:
```python
# In tool_executor.py
if func_name == "save_memory" and user:
    from app.models.ai_memory import AIMemory

    title = func_args.get("title", "Note")
    content = func_args.get("content", "")
    memory = AIMemory(
        user_id=user.id,
        conversation_id=conversation.id,
        title=title[:255],
        content=content,
    )
    db.add(memory)
    await db.flush()
    return {"success": True, "memory_id": str(memory.id)}
```
The AI uses this tool when it identifies information that should be remembered explicitly. Examples:
- "This student struggles with fractions but is strong in geometry."
- "This professional needs SYSCOHADA-compliant templates for a client in Cameroon."
- "The user prefers step-by-step explanations over direct answers."
The tool is lightweight -- it creates an AIMemory row with the given title and content, and returns the ID. The AI decides when to use it based on the system prompt instructions, which include guidance like: "Si l'utilisateur mentionne une information importante sur ses preferences, ses difficultes, ou ses objectifs, utilise save_memory pour la retenir." ("If the user mentions important information about their preferences, difficulties, or goals, use save_memory to remember it.")
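As a sketch, the tool's declaration to the LLM might look like the following (an OpenAI-style function-calling schema; the exact schema and descriptions in the Deblo.ai tool set may differ):

```python
# Hypothetical declaration of the save_memory tool as exposed to the model.
SAVE_MEMORY_TOOL = {
    "type": "function",
    "function": {
        "name": "save_memory",
        "description": (
            "Enregistre une information importante sur l'utilisateur "
            "(preferences, difficultes, objectifs) pour les conversations futures."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "title": {
                    "type": "string",
                    "description": "Titre court (4-10 mots)",
                },
                "content": {
                    "type": "string",
                    "description": "Contenu a retenir (2-4 phrases)",
                },
            },
            "required": ["title", "content"],
        },
    },
}
```

The parameter descriptions mirror the constraints of the AIMemory model itself, which nudges the model toward entries that fit the 255-character title and 200-word content budget.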
Memory Loading: Cross-Conversation Context
At the start of each new conversation, the system prompt assembly process loads the user's recent AIMemory entries and injects them into the context:
The loading process queries the most recent N memory entries (currently 10) for the user, ordered by created_at descending. Each entry is formatted as a brief paragraph with the title and content. The resulting block is inserted into the system prompt under a section header like:
```
## Memoire des conversations precedentes

- Fractions et geometrie (12 mars) : L'eleve a des difficultes avec les fractions,
  notamment la multiplication de fractions. Il maitrise bien la geometrie de base.
  Score au quiz : 3/5.
- Preparation BEPC physique (10 mars) : Discussion sur les lois de Newton.
  L'eleve comprend le concept de force mais confond masse et poids.
  Tache creee : reviser les lois de Newton avant vendredi.
```
This gives the AI cross-conversation continuity. When the student says "Aide-moi avec mes maths" on Wednesday, the AI can respond: "La derniere fois, tu avais des difficultes avec la multiplication de fractions. Tu veux qu'on continue avec ca, ou tu veux travailler sur un autre sujet ?" ("Last time, you were having trouble multiplying fractions. Want to keep going with that, or work on another topic?")
The memory block is kept small -- 10 entries, each under 200 words. The entire memory injection typically consumes 1,500-3,000 tokens, which is a small fraction of the 128K context window. The cost is negligible compared to the pedagogical value.
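The formatting half of that loading step can be sketched as a pure function (the entries would come from a `select(AIMemory).order_by(created_at.desc()).limit(10)` query like those shown elsewhere in this article; the function name `format_memory_block` is ours):

```python
def format_memory_block(memories: list[tuple[str, str]]) -> str:
    """Render (title, content) pairs, newest first, as a system-prompt section.

    A sketch; the real assembly step also formats the entry dates.
    """
    if not memories:
        return ""  # first-time users get no memory section at all
    lines = ["## Memoire des conversations precedentes", ""]
    for title, content in memories:
        lines.append(f"- {title} : {content}")
    return "\n".join(lines)
```

Returning an empty string for users with no memories means the system prompt simply omits the section rather than injecting an empty header.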
Context Compression: The 150K Token Threshold
Memory handles cross-conversation context. But what about within a single conversation that grows very long? A student working through a complex topic might exchange 50+ messages with the AI. A professional generating a detailed SYSCOHADA report might have a conversation with 10 tool calls, each producing substantial output. The message history grows, and every message is included in the next API call.
LLMs charge per token. A conversation with 100K tokens of history costs real money on every subsequent message. More importantly, performance degrades -- models lose coherence at very high context lengths, and latency increases.
We set a compression threshold at 150,000 estimated tokens. When the conversation's estimated token count exceeds this threshold, we compress:
```python
def estimate_tokens(messages: list[dict]) -> int:
    """Rough token estimate: ~1 token per 3.5 characters of JSON serialization."""
    import json as _json

    total = sum(len(_json.dumps(m)) for m in messages)
    return int(total / 3.5)


async def compress_history(
    messages: list[dict],
    conversation_id: UUID,
    user_id: UUID | None,
    db: AsyncSession,
) -> list[dict]:
    """Replace old messages with a summary when context is too large.

    Returns: [summary_msg, ack_msg, ...recent_14_messages]
    """
    from app.config import settings as _settings

    keep = _settings.CONTEXT_KEEP_RECENT_MESSAGES  # 14
    old_messages = messages[:-keep] if len(messages) > keep else []
    recent_messages = messages[-keep:] if len(messages) > keep else messages

    if not old_messages:
        return messages

    # Try to reuse an existing AIMemory summary for this conversation
    from app.models.ai_memory import AIMemory
    from sqlalchemy import select

    summary_text = ""
    if conversation_id:
        result = await db.execute(
            select(AIMemory)
            .where(AIMemory.conversation_id == conversation_id)
            .order_by(AIMemory.created_at.desc())
            .limit(1)
        )
        memory = result.scalar_one_or_none()
        if memory and memory.content:
            summary_text = memory.content

    if not summary_text:
        summary_text = await _generate_compression_summary(old_messages, user_id)

    if not summary_text:
        return messages  # Fallback: return uncompressed

    compression_msg = {
        "role": "user",
        "content": (
            f"[Resume des {len(old_messages)} messages precedents "
            f"de cette conversation : {summary_text}]"
        ),
    }
    ack_msg = {
        "role": "assistant",
        "content": "Compris, je prends en compte ce contexte.",
    }
    return [compression_msg, ack_msg] + list(recent_messages)
```
The compression algorithm works in four steps:
1. Estimate tokens. A rough estimate using the 1-token-per-3.5-characters heuristic. This is not precise -- actual tokenization depends on the model's vocabulary -- but it is close enough for threshold detection and avoids the overhead of running a real tokenizer.
2. Split messages. Keep the most recent 14 messages in full. Everything older becomes the "old" block to be summarized.
3. Find or generate a summary. First, check if an AIMemory entry already exists for this conversation (from a previous auto-summarization). If so, reuse it -- no additional LLM call needed. If not, generate an on-the-fly summary using the memory model.
4. Reconstruct the message history. Replace the old messages with a single summary message (formatted as a user message) followed by an assistant acknowledgment. Then append the 14 most recent messages. The result is a dramatically shorter history that preserves both the broad context (via the summary) and the immediate context (via the recent messages).
The acknowledgment message ("Compris, je prends en compte ce contexte.") is necessary because LLM APIs require alternating user/assistant turns. Without it, the summary (a user message) would be followed by another user message (the first of the recent messages), which violates the message format.
Why 14 messages? This is configured via CONTEXT_KEEP_RECENT_MESSAGES. Fourteen messages (7 user + 7 assistant turns) provides enough immediate context for the AI to maintain coherence in the current thread of discussion while significantly reducing token count.
Why 150,000 tokens? This threshold was chosen empirically. Below 150K estimated tokens, DeepSeek V3 maintains good coherence and the cost is acceptable. Above it, we observed increased latency (responses taking 8-12 seconds instead of 3-5) and occasional coherence issues (the model repeating itself or losing track of earlier context). Note that the threshold applies to the character-based estimate, which tends to overstate the real token count; triggering at 150K estimated tokens therefore keeps the actual context within the model's 128K native limit even after accounting for the system prompt, tools, and memory block.
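Putting the estimate and the threshold together (the function name `needs_compression` is illustrative; `estimate_tokens` is repeated here so the snippet stands alone):

```python
import json


def estimate_tokens(messages: list[dict]) -> int:
    """~1 token per 3.5 characters of the JSON-serialized message list."""
    return int(sum(len(json.dumps(m)) for m in messages) / 3.5)


def needs_compression(messages: list[dict], threshold: int = 150_000) -> bool:
    """True once the estimated token count crosses the compression threshold."""
    return estimate_tokens(messages) > threshold
```

The request pipeline calls this check before each completion; only when it returns `True` does `compress_history` run.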
The On-the-Fly Compression Summary
When no existing AIMemory entry is available for the conversation (e.g., the conversation has not ended yet and no auto-summarization has run), we generate a summary on demand:
```python
async def _generate_compression_summary(
    messages: list[dict], user_id: UUID | None
) -> str:
    """Generate an on-the-fly summary for context compression."""
    try:
        transcript_parts = []
        for msg in messages[-20:]:  # Only summarize the last 20 old messages
            role = msg.get("role", "unknown")
            content = msg.get("content", "")
            if isinstance(content, list):
                content = " ".join(
                    p.get("text", "")
                    for p in content
                    if isinstance(p, dict) and p.get("type") == "text"
                )
            if content:
                label = "Utilisateur" if role == "user" else "Assistant"
                if len(content) > 300:
                    content = content[:300] + "..."
                transcript_parts.append(f"{label} : {content}")

        transcript = "\n".join(transcript_parts)

        async with httpx.AsyncClient(timeout=30.0) as client:
            response = await client.post(
                OPENROUTER_URL,
                json={
                    "model": settings.LLM_MEMORY_MODEL,
                    "messages": [
                        {
                            "role": "system",
                            "content": (
                                "Resume cette conversation en 3 a 6 phrases "
                                "concises en francais. Mentionne le sujet, "
                                "les points cles et les conclusions. "
                                "Maximum 300 mots."
                            ),
                        },
                        {"role": "user", "content": transcript},
                    ],
                    "max_tokens": 800,
                    "temperature": 0.3,
                    "stream": False,
                },
            )
            response.raise_for_status()
            data = response.json()

        message = data.get("choices", [{}])[0].get("message", {})
        return (message.get("content") or "").strip()
    except Exception:
        logger.exception("Failed to generate compression summary")
        return ""
```
This function takes the last 20 old messages (not all of them -- to keep the summarizer's input bounded), truncates each to 300 characters, and asks the memory model for a 3-6 sentence summary. The result is injected into the compressed history.
The on-the-fly summary adds latency -- approximately 1-2 seconds for the LLM call. This is acceptable because compression only triggers on very long conversations (150K+ tokens), which are already slow due to the large context size. The 1-2 seconds of compression time is offset by the subsequent speed improvement from having a much smaller context.
The Cost Equation
Every token in the context window costs money. Here is the breakdown:
- Memory summarization: ~$0.00005 per conversation summary. At 1,000 conversations/day, that is $0.05/day or $1.50/month.
- Memory loading: ~1,500-3,000 tokens per conversation start. At $0.14/million input tokens (DeepSeek V3), that is $0.0002-0.0004 per conversation. Negligible.
- Context compression: saves 100K-200K tokens per subsequent message in long conversations. At $0.14/million input tokens, each compression saves $0.014-0.028 per message. Over a 20-message tail of a long conversation, that is $0.28-0.56 saved. The compression summary costs $0.00005 to generate. The ROI is 5,600x to 11,200x.
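The savings arithmetic in the list above can be checked in a few lines:

```python
# Figures from the cost breakdown above (low end of each range).
input_price = 0.14 / 1_000_000               # $ per input token (DeepSeek V3)
tokens_saved = 100_000                       # tokens removed per message
saved_per_message = tokens_saved * input_price   # ~ $0.014
tail_savings = 20 * saved_per_message            # ~ $0.28 over a 20-message tail
summary_cost = 0.00005                       # one compression summary
roi = tail_savings / summary_cost            # ~ 5,600x at the low end
```

Doubling `tokens_saved` to the high end of the range doubles the ROI to roughly 11,200x.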
The math is unambiguous. Memory and compression are not just features -- they are cost optimizations. Without compression, a single long conversation could cost $2-5 in tokens. With compression, the same conversation costs $0.50-1.00.
What the AI "Remembers"
Putting it all together, here is what the AI knows at the start of any conversation:
1. Previous conversation summaries (via AIMemory): what topics were discussed, what the student struggled with, what was concluded.
2. Explicitly saved notes (via the save_memory tool): learning preferences, weak areas, important dates.
3. Uploaded files (via the list_user_files and search_user_files tools): documents the student has shared in previous sessions.
4. Current conversation history (direct messages): everything said in the current session, possibly compressed if it exceeds the threshold.
5. Task status (via system prompt injection): upcoming and overdue tasks.
6. Exercise results (via the ExerciseResult model): historical quiz scores by subject and topic.
This is not true memory in the human sense. It is reconstructed context -- assembled fresh at the start of each conversation from database records. But from the user's perspective, the effect is the same. The AI remembers. The AI knows what happened last time. The AI picks up where it left off.
For an educational platform, this matters enormously. A tutor who forgets everything between sessions cannot track progress. A tutor who remembers can adapt, follow up, and personalize. That is the difference between a chat interface and a learning relationship.
Edge Cases and Failure Modes
The memory system is designed to degrade gracefully:
- If auto-summarization fails: the conversation has no AIMemory entry. The next conversation will not include that context. The AI still functions -- it just does not remember that specific session.
- If context compression fails: the fallback returns the uncompressed messages. The conversation will be slower and more expensive, but it will not break.
- If the memory model returns garbage: the JSON parsing has a robust fallback chain -- try JSON, strip markdown blocks, try again, fall back to plain text. Even if all parsing fails, the failure is logged and the conversation continues.
- If the user has no memories: first-time users simply start without cross-conversation context. The AI introduces itself and begins fresh. Memories accumulate over time.
There is no catastrophic failure mode. Every failure path results in either a missing memory (acceptable) or an uncompressed context (expensive but functional). The system never blocks, never crashes, and never shows an error to the user.
What We Learned
Building AI memory taught us three lessons:
Summaries are surprisingly hard to get right. The quality of a summary depends entirely on the summarizer model's ability to identify what matters. Early versions using smaller models produced summaries that were too generic ("The student discussed mathematics") or too detailed ("The student asked about 3/4 multiplied by 2/5 and the assistant explained..."). The current model (Mistral Large) hits the right balance for our use case.
Fire-and-forget is the right pattern for non-critical side effects. Memory summarization is important but not urgent. Making it synchronous would add 1-2 seconds to every conversation end. Making it fire-and-forget means the user never waits, and failures are invisible. This pattern -- launch the task, log failures, move on -- is the right default for any side effect that does not affect the immediate user experience.
Token economics drive architecture. Every architectural decision in the memory system is shaped by token costs. We summarize because storing full conversations in context is too expensive. We compress because long conversations would bankrupt us. We truncate messages before summarizing because sending full messages to the summarizer would cost more than necessary. In an LLM-powered product, your architecture is your cost structure.
---
This is article 19 of 20 in the "How We Built Deblo.ai" series.
1. AI Tutoring for 250 Million African Students
2. 100 Sessions Later: The Architecture of an AI Education Platform
3. The Agentic Loop: 24 AI Tools in a Single Chat
4. System Prompts That Teach: Anti-Cheating, Socratic Method, and Grade-Level Adaptation
5. WhatsApp OTP and the African Authentication Problem
6. Credits, FCFA, and 6 African Payment Gateways
7. SSE Streaming: Real-Time AI Responses in SvelteKit
8. Voice Calls With AI: Ultravox, LiveKit, and WebRTC
9. Building a React Native K12 App in 7 Days
10. 101 AI Advisors: Professional Intelligence for Africa
11. Background Jobs: When AI Takes 30 Minutes to Think
12. From Abidjan to 250 Million: The Deblo.ai Story
13. Generating PDFs, Spreadsheets, and Slide Decks From a Chat Message
14. Organizations: Families, Schools, and Companies on One Platform
15. Interactive Quizzes With LaTeX: Testing Students Inside a Chat
16. RAG Pipeline: Document Search With pgvector and Semantic Chunking
17. Six Languages, One Platform: i18n for Africa
18. Tasks, Goals, and Recurring Reminders
19. AI Memory and Context Compression (you are here)
20. Observability: Tracking Every LLM Call in Production