Back to deblo
deblo

Step Zero Wasn’t Enough: How Validating A Constructor But Not The Runtime Took Down Every Déblo Voice Session The Hour We Shipped Real-Time Camera Streaming

Phase 14 shipped Déblo Eyes — real-time camera streaming over LiveKit to Gemini Live native audio. The first deploy took down every voice session in production within ninety seconds because our Step 0 had validated the constructor without exercising the runtime path. The build log of how Déblo got eyes, what an incomplete pre-flight check cost us, and which polish items we shipped versus deferred.

Juste A. Gnimavo (Thales) & Claude | May 20, 2026 30 min deblo
EN/ FR/ ES
debloclaude-opus-4.7claude-codegemini-livevertex-ailivekitreal-time-videomultimodalsession-resumptionpost-mortemprod-downhotfixreact-nativeexposveltekitobservabilityeas-buildsystem-promptscost-capfeature-flageasypanelvalidationstep-zeroruntime-validationsparse-samplingframe-throttlingvision-language-modelshallucinationarchitectural-fix

By Thales (CEO, ZeroSuite) & Claude Opus 4.7 — Claude Code instance

At 10:00 UTC on May 20, 2026, Déblo had ears. Users could hold a voice call with Gemini Live native audio, speak in any of seven major languages, and the model would answer in the same tongue at human conversational latency. At 22:00 UTC the same day, Déblo had eyes too. A user could tap one button on the dock during an ongoing call, the back camera would publish a video track over LiveKit to a Python worker, the worker would consume frames at 0.5 frames per second, push them as RGBA images to the Gemini Live session, and the model would narrate what it saw — a school report card placed under the lens, a contract page held up, an electric meter, an itemized invoice — in the same voice channel the user had already opened.

We call the trio Voice + Eyes + Chat. Voice is audio-only. Eyes is voice plus real-time camera. Chat is text plus uploads. The architectural piece that landed today is the middle one. It also happens to be the piece that took us closest to a full production-down outage during the launch week.

This is the build log of Phase 14. It is not a marketing post. There were nine commits, two false starts, one ninety-second outage that killed every voice session including the ones not touching camera, and two polish items that came back FAIL on the smoke device test and got deferred to dedicated sessions rather than half-fixed in place. Some of the most useful lessons are about what we didn't ship.

We are going to walk through the architecture, the Step 0 validation that wasn't enough, the hotfix that unblocked it, the frame-sampling math that turned a 1 fps default into a 0.5 fps tuned constant, the camera-preview overlay we added after the CEO's second smoke and the X-close button we made context-aware after the third, the two bugs that defeated us tonight, and the system-prompt feedback that needed its own dedicated follow-up. At the end, there is a section on what each of us got right and what each of us could not see.


Part 1 — Why Camera Streaming, And Why Now

The reason to ship real-time camera streaming was not on the launch roadmap two weeks ago. It got onto the roadmap because of a different bug. In Phase 13 (May 19, the day before this session) we had compacted the three voice system prompts to ~6 kB each to stay below the Gemini Live native-audio degradation threshold. While doing that, we discovered a different problem: when a user announces a photo (« regarde », « j'ai envoyé un truc »), there is a 1-2 second window before the data-channel upload completes and the frame arrives in the model's input. During that window, the model has a strong architectural bias to fill the silence — voice realtime models produce audio at every user turn by construction, and that audio tends to hallucinate a plausible description of what the user said they were sending, before the model has seen anything.

We patched it with a prompt block — VISUAL DISCIPLINE — and a worker-side guard that intercepts visual-announcement transcripts, interrupts the model, and injects a 4-word "loading" filler. That works for the photo-upload path (Phase 5.B) and the 5-second video clip path (Phase 5.F). Both are discrete-event visual modes: the user explicitly chooses to send a thing, the worker uploads it, the model sees it once, the model describes it.

But the dominant use case for Déblo Eyes — mother holding up a child's report card while talking to the tutor, customer holding up a contract while talking to Déblo Pro, trader holding up an invoice — is continuous. The user does not want to tap "send photo" five times during a conversation about a multi-page document. They want to point the phone at the document, scroll through pages naturally, and have the AI follow along.

The architectural insight is that continuous streaming structurally eliminates the visual hallucination window. If a frame arrives every 500 to 1000 milliseconds, the model never has to fill a 2-second gap. The frame is always fresh. The "I announce, you hallucinate" path simply does not exist for this mode.

So Phase 14 was both a feature (a major product capability we wanted for launch) and a fix (the cleanest structural resolution of a class of bugs we had just patched at the prompt level). The dual motivation is what got it onto the path-critical list for May 20.


Part 2 — The Architecture We Wanted

The full picture, in one paragraph: a React Native app running Expo SDK 54 publishes a video track over LiveKit when the user taps the camera button on the voice dock. The track arrives at the Python worker running on Easypanel as a RemoteVideoTrack, picked up by the track_subscribed listener on the LiveKit room. A nested asyncio task per track sid consumes rtc.VideoStream(track) as an async iterator, throttles to one frame every two seconds (we will get to why), converts each frame to an RGBA PIL image, thumbnails it to 768 px on the long edge, and calls session._activity.push_video(frame) on the Gemini Live session. Every two frames it also calls session.generate_reply(instructions=...) with a short English directive that nudges the model to narrate only if the scene has changed meaningfully. A 5-minute hard cap and a 3-minute silence auto-off prevent runaway sessions. When the bridge ends for any reason — user toggle, max duration, silence, error — the worker publishes a camera_status event on the LiveKit data channel that the client maps to a localized toast banner.

The single architectural risk we identified going in was session_resumption(transparent=True). Gemini Live native audio sessions default to a 2-minute server-side cap. For a tutorial-style call where the mother is walking through a 4-page report card, 2 minutes is a hostile limit. Vertex AI exposes SessionResumptionConfig(transparent=True) to lift the cap silently — the SDK transparently re-handshakes under the hood when the server would otherwise close the connection.

We did not know for certain that the livekit-plugins-google 1.5.9 realtime client honored this config end-to-end. The plugin docs mentioned the parameter; the upstream Vertex API documented the behavior; nobody we could find had published a confirmation that a real Python session with the parameter set actually stayed up past 7 minutes in production. Phase 14 depended on it: without resumption, the entire bridge would tear down at 2 minutes regardless of how good our code was on top.

So we scheduled a Step 0. The plan was to validate that the architectural primitive worked at all, before writing any of the bridge code that depended on it.


Part 3 — The Step 0 That Wasn't Enough

The Step 0 we ran is documented in session-logs/gemini-session-logs/26-05-20-phase-14-step0-resumption-validation.md. The objective was to confirm three things, in order: that livekit-plugins-google exposes the session_resumption keyword argument on RealtimeModel.__init__, that the plugin accepts a google.genai.types.SessionResumptionConfig(transparent=True) value without raising, and that the configured model can be instantiated cleanly in a local Python session.

The script was forty lines. It imported livekit.plugins.google.beta.realtime as livekit_google_realtime, instantiated a RealtimeModel with session_resumption=genai_types.SessionResumptionConfig(transparent=True), printed the resulting model's _opts dict to confirm the config was stored, and exited zero. It ran clean. The constructor accepted the kwarg, the config was stored in _opts.session_resumption, and the RealtimeModel instance was valid.

We marked Step 0 as GO.

We were wrong about what GO meant.

Step 0 had validated the constructor. It had not validated the runtime code path. The constructor stores the option object; it never calls into the plugin's session-level state machine. The plugin's session-level state machine is what runs when a real LiveKit call arrives, and it is in that path that the plugin reads _opts.session_resumption.handle. If _opts.session_resumption is None — which is exactly what happens when you pass None explicitly versus omitting the kwarg entirely — the runtime hits NoneType.handle and crashes the entire session pipeline before any audio frame is processed.

We did not discover this from reading the plugin source. We discovered it from the production logs ninety seconds after Easypanel finished rebuilding the worker container.


Part 4 — The Production-Down Hotfix

Commit 785040d went out at 19:42 UTC. The commit added the worker bridge (~500 lines), the session_resumption configuration gated behind a new DEBLO_VIDEO_BRIDGE_ENABLED env var, the track-subscribed and track-unsubscribed listeners, the frame-conversion pipeline, and the session-end telemetry. The env var was unset in production, which we expected to mean the bridge feature is disabled and nothing changes.

That is not what it meant.

The relevant code path looked like this:

pythonrealtime_model_kwargs = {
    "model": settings.GEMINI_LIVE_MODEL,
    "instructions": system_prompt,
    "voice": settings.GEMINI_LIVE_VOICE,
    "language": user_lang,
    "session_resumption": (
        genai_types.SessionResumptionConfig(transparent=True)
        if VIDEO_BRIDGE_ENABLED
        else None
    ),
}
model = livekit_google_realtime.RealtimeModel(**realtime_model_kwargs)

When VIDEO_BRIDGE_ENABLED was False, the kwarg was passed as None. The constructor accepted None without complaint (it stores the option as-is). But the session state machine, when a real LiveKit room connected and tried to start streaming, executed something equivalent to handle = self._opts.session_resumption.handle — and there is no None-guard upstream. The traceback was:

AttributeError: 'NoneType' object has no attribute 'handle'
  File ".../livekit/plugins/google/realtime/realtime_api.py", line 493, in __init__
    handle = self._opts.session_resumption.handle

Every voice session attempted on the worker after the rebuild crashed at line 493. Audio-only sessions, which had been working flawlessly in production for four days, were now dead. The bridge feature was disabled, but the path to disable it was a landmine.

The CEO noticed in roughly ninety seconds. He tried to start an audio-only call, it failed silently from the client perspective, he opened the Easypanel logs, saw the stack trace, copied it to the session, and pinged me:

« le worker crash sur toutes les sessions, regarde le log ; je vois 'NoneType has no attribute handle' sur la 493 du plugin google ; ça n'a aucun rapport avec le bridge censé être OFF ? »

It had every rapport. The Step 0 we had passed had told us the configuration object was acceptable. It had not told us the runtime branch was acceptable. The runtime branch dereferenced an attribute on whatever we passed, and we had passed None, and None does not have attributes.

The fix took three minutes to write and four minutes to deploy.

pythonrealtime_model_kwargs = {
    "model": settings.GEMINI_LIVE_MODEL,
    "instructions": system_prompt,
    "voice": settings.GEMINI_LIVE_VOICE,
    "language": user_lang,
}
if VIDEO_BRIDGE_ENABLED:
    realtime_model_kwargs["session_resumption"] = (
        genai_types.SessionResumptionConfig(transparent=True)
    )
model = livekit_google_realtime.RealtimeModel(**realtime_model_kwargs)

Pass the kwarg conditionally, never as None. When the bridge is enabled, the SessionResumptionConfig is provided; when disabled, the kwarg is omitted entirely and the plugin uses its default-handle path that does not crash. Commit 315280e. Easypanel rebuild. The CEO retested an audio-only call: PASS. The bridge feature stayed off in production until the rest of Phase 14 was ready to ship. Total outage window: roughly four minutes from first crash to confirmed recovery.

We were lucky. Audio-only is the most common voice session by far; if the CEO had not been actively testing during the rebuild window, the outage might have extended to ten or twenty minutes before someone noticed. We were also lucky that the failure mode was a clean AttributeError with a useful stack trace pointing at the plugin's own source. A failure mode that fired silently — say, a session that connected but produced no audio — would have been substantially harder to diagnose.

The lesson is the obvious one with an important refinement: Step 0 must exercise the full runtime code path, not just the constructor. Instantiating an object and printing its _opts is not the same as starting a session against the real backend. For SDK validation steps going forward, our default is now: spin up a real session, send a real test frame, observe the real return. The constructor-level check is at best 20 percent of the work.

This is now saved in our agent memory as feedback_step_zero_runtime_validation.md. It was an expensive mistake but a cheap memory entry. The next time we add a new SDK plugin or upgrade a major version, the lesson fires automatically.


Part 5 — Why 0.5 fps Beats 1 fps

After the bridge was wired up and audio-only restored, we moved to tuning. The initial bridge configuration was 1 frame per second, 640 px maximum frame dimension. This is the obvious default — it matches the rate at which a human can visually parse a scene, and 640 px is the dimension at which most vision-language model demo apps run.

The CEO pushed back on both numbers within an hour. The reasoning, worked out at the kitchen-table whiteboard with napkin arithmetic:

Baseline 1.0 fps × 640 px × ~85 tokens per frame
  = 5,100 tokens per minute of camera input
  = 25,500 tokens at the 5-minute hard cap

Tuned  0.5 fps × 768 px × ~122 tokens per frame
  = 3,660 tokens per minute of camera input
  = 18,300 tokens at the 5-minute hard cap

Less cost, and crucially, sharper frames. The non-obvious part is that the 768 px frames are not just "incrementally better"; they cross a perceptual threshold for vision-language models on text-heavy documents. At 640 px, a column of a school report card is legible only for headers and large body text. At 768 px, individual grade marks and teacher initials become recoverable. The use case we are targeting — mother and report card, customer and contract, trader and invoice — is almost entirely text-on-paper. Frame sharpness on text matters more than frame frequency.

The deeper observation is about vision-language model behavior under sparse versus dense sampling. The intuition many engineers have is "more frames is better information". For motion-heavy scenes (a moving subject, a sports clip), this is true. For static scenes (a held-up document, a static product, a whiteboard), it is the opposite: dense sampling pushes the same near-identical image into the model's context window ten times in ten seconds, diluting attention without adding information. The model's effective context is wasted on redundancy. Sparse sampling at higher resolution gives the model one good look at a slowly-changing scene, then time to integrate before the next look.

Our trade-off accepted: the user-perceived latency between "I moved to the next page" and "the model sees the new page" doubled from one second to two. For a document walkthrough at conversational pace, this is invisible. For a sports clip it would be painful — but sports clip review is not Déblo Eyes' use case. Phase 5.F (the discrete 5-second video clip path) handles motion-heavy short videos with all 150 frames batched, and remains the right tool for that job.

Commit 5cf7a75 shipped 0.5 fps + 768 px. The bridge worker's code_version was bumped to phase-14-video-bridge-sparse-0.5fps-2026-05-20 so we could correlate Sentry events to the tuning generation if anything regressed.

The broader lesson, on choosing parameters for new ML-integration features, is match sampling characteristics to scene dynamics, not to default values. The default for "real-time camera" in most SDK examples is 1 fps because that is what fits the average. We are not running the average; we are running a specific use case with specific scene-dynamics properties, and the right number for us is half the default.


Part 6 — Two Pieces Of UX Polish, And Why They Mattered

Smoke #2 came back honest: the camera turned on, the worker received frames, the model described what it saw — and the user had no visual indication that any of this was happening. The phone screen showed the same orange sphere and waveform UI as audio-only mode. The CEO's first feedback was a single line: « il n'y a aucun viseur, on dirait que la caméra n'est pas allumée du tout ».

He was right, and the omission was telling. We had built the technical bridge but forgotten that the user's mental model of "camera on" is the camera viewfinder, visible, full-screen. Every consumer camera app since 2007 trains this expectation. Skipping it because "it's a voice call, not a camera app" is wrong reasoning — the user toggled camera, the user expects to see what the camera sees, full stop.

Commit 202511a added the camera preview overlay. The mobile implementation uses VideoView from @livekit/react-native rendered fullscreen behind the existing voice UI, with a 26% dark scrim overlay to keep the orange sphere and transcript readable. The web parity uses an HTML5 <video> element with track.attach(videoEl) and the same scrim. A flip-camera button floats top-right under the existing top bar. The CSS layering took an evening — position: absolute, inset: 0, careful z-index stacking so the preview is below the controls but above the gradient background.

The default camera facing is now environment (back camera). The original implementation defaulted to user (front camera) because that is what setCameraEnabled(true) returns on most devices without explicit constraints. But the dominant use case for Déblo Eyes is filming something external: a document, a meter, a product. Front camera as default would have meant the first thing users see is themselves, which both confuses the use case and is socially awkward for many users who do not want to look at themselves while talking to an AI.

Smoke #3 surfaced the second piece of UX feedback: the X button at the top-left of the voice screen. In the audio-only era, tapping X meant "hang up the call". With camera live, the CEO's intuition (drawn from using Google Gemini's app) was that tapping X should mean "close the camera, keep the call". This is the correct behavior. The X is, in the user's mind, closing whatever modal piece they last opened. If the camera is open, X closes the camera. If only audio is open, X closes the call.

Commit 15241f8 made the X context-aware. The mobile handleClose checks the camera state and routes to either toggleCamera(false) or the original hang-up handler. The web handleTopCloseClick mirrors. Same one-line conceptual change, three or four lines of code per platform.

These two pieces — the preview overlay and the context-aware X — are the kind of thing that does not show up on any task list before the smoke test. The technical implementation of the bridge was correct; the UX integration of the bridge into the existing voice surface was not. Smoke tests with real users on real devices are the only path to discovering this class of gap. Reading the requirements document one more time would not have found it. Pushing the build to a real phone and watching a real human use it did.


Part 7 — Two Bugs We Did Not Beat Tonight

The smoke test was also honest about two things we did not solve in this session: the flip-camera button, and the streaming transcript chips.

Flip camera (BUG 1). The button renders, the tap fires, a brief visual flash happens on screen, and the camera does not actually switch from back to front. The console shows a warning from event-target-shim:

WARN  An event listener wasn't added because it has been added already
  setMediaStreamTrack (livekit-client.umd.js:1:258098)

The implementation uses LocalVideoTrack.restartTrack({ facingMode: 'user' }), which is the documented path for re-acquiring getUserMedia with new constraints on an existing publication. On web Chrome this pattern works cleanly. On React Native (using react-native-webrtc under the LiveKit RN SDK), the underlying MediaStreamTrack does not appear to honor the new facingMode constraint when restarted on the same publication. The fallback we tried — disable the camera, re-enable with explicit { facingMode, deviceId: undefined } — has the same outcome on RN.

The probable root cause is that RN-WebRTC, when restarting a track, picks the same underlying device handle from its internal cache rather than re-running device enumeration with the new constraint. Fixing it properly requires enumerating cameras via mediaDevices.enumerateDevices(), finding the device whose label matches /back/i versus /front/i, and calling restartTrack({ deviceId: targetDevice.deviceId }) with the explicit ID rather than a facingMode constraint. We have not implemented this yet because it requires a small amount of platform-specific code and we want to validate the pattern on a fresh agent session rather than blob it onto the end of this one.

Streaming transcript chips (BUG 2). The intended UX is YouTube-Live-style: while the camera is active, the last five user-and-AI utterances scroll up as small role-coded chips at the bottom of the screen, giving the user a textual anchor for the conversation while the visual canvas is dominated by the camera preview. The code was added in commit 15241f8 — a streamTranscriptEntries derived store, a ScrollView with auto-scroll on new entries, role-based styling — but the chips do not render on screen during a live camera call.

The probable causes are three, in decreasing order of likelihood: the isFinal filter on the transcript entries may be filtering everything because the underlying transcript objects from Gemini Live arrive without an isFinal flag set in the way the code expects; the cameraPreviewLayer View may have a higher effective z-index than the streamTranscriptOverlay due to React Native's stacking-context rules being subtly different from web; or the flex layout of the parent screen container may collapse the transcript area to zero height when the sphere is hidden. Each is testable; none was testable cheaply in the time window we had tonight.

Both are real bugs and both are documented with specific debugging paths in session-logs/upcoming-prompts/28-phase-14-mobile-polish-and-homepage-3-buttons.md. The session that picks this up next does not need to start from zero. The hypothesis space has been narrowed to a tractable set of specific things to test.

The discipline here is defer cleanly, document precisely. Half-fixing a hard bug at the tail end of a launch sprint, in the same session that landed a major feature, is how you ship a flip-camera button that mostly works on Tuesday and breaks again on Wednesday. The two bugs are documented, the working components around them are stable, and the session ends in a state where the camera bridge can ship to production with the flip-and-transcript pieces explicitly marked as polish.


Part 8 — The System Prompt Was Too Conservative

The final piece of honest smoke feedback is one we did not patch tonight, on principle. The acceptance test for Phase 14 was a real-world use case: the CEO held up a school's brochure to the camera, asked Déblo Eyes to read the contact phone number, then three minutes later asked for it again to test the model's session memory. The model passed both: it read the number correctly the first time, and re-confirmed it correctly three minutes later. Session resumption worked. Memory worked. Vision worked.

But the CEO's qualitative feedback was that the model was too reserved. It answered the literal question, did not elaborate, did not proactively flag related details on the brochure (the school's address, the opening hours, the languages of instruction), did not ask any clarifying questions about what the user might want next. In conversational AI terms, it was passive. In product terms, it was leaving engagement on the table — users do not come back to AI products that answer one question and go silent.

The CEO's specific words: « il parle pas, retient trop d'info, ne détaille pas, faut poser bcp de questions, réponses trop courtes ; risque rétention utilisateur ».

This is a system-prompt problem, not a bridge problem. The Phase 13 ultra-compact prompt rewrite (May 19, the day before this session) had explicitly capped response length to "max 2 short clear sentences per turn (more only when the question demands)". That cap is the right cap for casual audio-only conversation — it prevents the model from filibustering on every "Hi". It is the wrong cap for camera mode, where the user is actively presenting a multi-detail visual artifact and benefits from the model going slightly beyond the literal question.

The wrong way to address this would have been to edit the prompt in-place during this session, half an hour before the agent handed back to the CEO, with no time to validate that the new prompt did not regress on the casual-conversation cases. Prompts at this length are entangled — one change can shift behavior across registers, languages, and user types in unpredictable ways. The right way is to delegate it to a dedicated session that can iterate carefully and validate across the matrix of voice+text, K12+Pro+Companion, and camera-on+off.

That session is queued at session-logs/upcoming-prompts/29-system-prompt-optimization-conservativeness.md. The brief includes the specific feedback, the constraints (preserve VISUAL DISCIPLINE block from Phase 13.B, preserve LIVE CAMERA MODE block from Phase 14), and the validation matrix.

The general principle: smoke-test feedback that is structural — model is too conservative, model is too verbose, model is wrong about a register — belongs in its own session. It is not a bug fix at the end of the feature session. The temptation to "just tweak the prompt while I'm here" is the prompt-engineering equivalent of "just refactor while I'm fixing the bug". Both produce regressions you find next week.


Part 9 — What Each Of Us Got Right

This is Claude Code writing.

Where I was useful in this session :

  • Hot-fixing the Step 0 regression in three minutes from production stack to deployed fix. Once the CEO copied the AttributeError: 'NoneType' has no attribute 'handle' trace into the session, the diagnosis was instant: the kwarg-is-None path was the only one I had introduced that touched the plugin's session-resumption code. The conditional-kwarg fix is the minimum-surface change. Pushing it without trying to "improve" the surrounding code under outage pressure was the right discipline.
  • Parallel commit-and-push throughout. Each of the eight feature commits (worker bridge, hotfix, mobile, web, system prompts, FPS tuning, toasts, camera preview, flip+X) was committed and pushed independently rather than batched. The CEO's six-terminal workflow depends on git pull always being a reliable way to get the current intended state. Batching the commits would have saved me maybe ten minutes of typing and cost him hours of stale-tree confusion at the next checkpoint.
  • Writing the two upcoming-prompt files for sessions 28 and 29 before closing this session. Both files are self-contained, name the failing scenarios precisely, suggest specific debugging paths, and constrain the scope so the next agent does not over-reach. The five minutes to write those files is the difference between deferred-and-recoverable and deferred-and-lost.

Where I needed Thales :

  • The 0.5 fps and 768 px tuning decision. My default would have been 1 fps and 640 px, which are the SDK example values. The CEO had the product-context to know that our use case is text-on-paper documents and that frame sharpness mattered more than frame frequency. The math at the whiteboard was straightforward, but the decision to do the math at all — to question the defaults rather than accept them — came from him.
  • The default camera facing (back instead of front). This is one of those decisions that looks obvious in retrospect but is not pre-test. My instinct as the implementing agent was to keep the SDK default. He overruled with a one-line product argument and was correct.
  • The discipline to defer the flip-camera bug and the streaming-transcript bug to a dedicated session. My instinct under launch pressure was to try one more debugging pass on each. He pulled the rip cord at the right moment, defining the boundary of what would ship tonight versus what would queue. The two upcoming-prompt files exist because he insisted on writing them before closing the session.
  • The decision not to touch the system prompts in this session despite the smoke feedback. My instinct was to draft a quick patch and ship. He recognized that prompts at this length are entangled and that "quick patch" is a category-error.

Where I almost shipped the wrong thing :

  • The first version of the worker generate_reply(instructions=...) directive for the bridge was written in French — because the user-facing audio is mostly French, my heuristic was "match the language of the user". The CEO caught it in code review and pointed at our prior convention (already saved in memory from Phase 13.B): directives are system instructions, not user-facing utterances; they should be in English regardless of user language. The model handles English instructions slightly more reliably across languages than French instructions, and the LLM-instructions-in-English convention exists for that reason. The fix was a single sentence rewrite, but I had to be told. (Commit ea4f358.)
  • The post-hotfix temptation to "validate the rest of Step 0 more thoroughly while I'm here". I almost ran a battery of additional plugin-introspection probes after the hotfix landed. The CEO redirected to ship the rest of the planned commits first, validate the full pipeline end-to-end on real device, then revisit Step 0 methodology in a clean session. The right call. Over-validation in the aftermath of an outage is its own kind of waste.

The pattern is consistent with prior sessions and with the em-dash post we wrote yesterday: I execute well at high throughput on a defined scope, recover quickly from clean failures, and parallelize across files. The strategic moves — what to defer, what to question, what to leave alone — still come from a CEO with product memory and the discipline to override the agent's default impulses. Phase 14 shipped well because both halves of that pair were doing their jobs. The pair is the unit, not the agent.


Part 10 — What Phase 14 Means For The Launch

Code-complete on main does not mean ready for users. The Phase 14 bridge is behind a feature flag (DEBLO_VIDEO_BRIDGE_ENABLED) that is currently enabled in production but inactive for end-users because the mobile build that exposes the camera toggle button is not yet on TestFlight. The next gates are:

  • EAS dev client rebuild to integrate commit 15241f8. Without this, the iOS device the CEO smoke-tests with does not have the camera-toggle button, the camera preview overlay, the flip button, the streaming transcript scaffolding, or the auto-off toast. Estimated 20-25 minutes.
  • Smoke completion of the two remaining scenarios that did not run tonight: iOS lock-screen behavior (does the camera bridge survive the screen locking and unlocking) and the 3-minute silence auto-off path (does the worker correctly tear down the bridge after 3 minutes of user silence with camera still on).
  • Sessions 28 and 29 delivered. Session 28 fixes flip-camera and streaming-transcript and adds the homepage three-button entry. Session 29 optimizes the system prompts for less-conservative behavior under camera mode while preserving the visual-discipline and live-camera blocks.
  • 48-hour Sentry monitor after the next stability checkpoint, watching for video.frame.convert_fail spikes, video.bridge.shutdown_timeout events, and any unexpected error categories on the audio-only path that could regress as a side effect.
  • J+14 cleanup of the Phase 5.F dead code (the discrete 5-second video clip path) if the camera bridge proves stable. The clip path can stay for now; carrying both for a transition window is cheap.

Once all of that is green, the camera-on path is end-user reachable. The launch master document was updated to reference Déblo Eyes as part of the Voice + Eyes + Chat trio (commit included in the Phase 14 sprint), and the App Store description copy is being updated in parallel.

What we have at the end of today is a feature that we are confident in architecturally — the bridge holds together, the cost is bounded, the model integrates correctly, the resumption works past the 2-minute cap — and that we know is unfinished at the UX edges. The honest position is to ship the architecture, ring-fence the unfinished edges in named follow-up sessions, and resist the temptation to declare done before done is achieved.


Conclusion

Déblo got eyes today. The architecture for real-time camera streaming from a React Native client to Gemini Live native audio through a Python LiveKit worker is now code-complete on main. The unique architectural risk (session_resumption(transparent=True) honored at runtime past the 2-minute server cap) is empirically validated by a 7-minute live test. The unique production-down moment (every voice session crashing on None.handle after a kwarg passed conditionally as None) was caught in 90 seconds and hotfixed in four minutes. The frame-sampling tuning (0.5 fps, 768 px, sparse high-quality over dense low-quality) is justified by both cost arithmetic and the perceptual characteristics of vision-language models on text-heavy static scenes. Two UX polish bugs and one system-prompt conservativeness problem are documented, deferred, and queued for dedicated sessions rather than half-patched in place.

The biggest lesson of the day is not about cameras. It is about what "validation" means. The Step 0 we ran told us the configuration object was acceptable to the SDK constructor. It did not tell us the runtime code path that consumed the configuration object would tolerate a None. The first told us almost nothing about the second. The discipline going forward, written into agent memory and into our internal validation guide, is: a Step 0 that does not exercise the runtime is not a Step 0. Instantiating a class and printing its options is the first 20 percent of the work. The other 80 percent is starting a real session, exercising the path the SDK actually uses in production, and watching what happens. If we had done that this morning, the four-minute production outage at 19:42 UTC tonight would not have happened, and the lesson we are now writing would have been written by someone else, somewhere else, against a different SDK.

We did not, and the outage did. The lesson is the cheaper of the two — written down, indexed, automatically retrieved by future agent sessions when an SDK-validation step comes up. The next time we add a major plugin or upgrade a major version, the runtime-validation step will be in the plan from the start, not in the postmortem.

Déblo Eyes is shipped. The trio Voice + Eyes + Chat is structurally complete. The launch window is open.

The eyes can see what you can see, in real time, and they remember what they saw three minutes ago.


This piece was written collaboratively by Thales (CEO of ZeroSuite, building Déblo and VeoStudio from Abidjan, Côte d'Ivoire) and Claude Opus 4.7 — Claude Code instance running on macOS, 1M context window. The Phase 14 sprint it describes was executed on May 20, 2026 by an independent Claude Code agent from a self-contained prompt (session-logs/gemini-session-logs/phase-14-impl-prompt-agent.md), validated by Thales on real iOS device, and recapped at the end of the day in session-logs/gemini-session-logs/26-05-20-phase-14-session-master-recap.md. The nine commits described are, in order: 785040d (worker bridge), ea4f358 (English instructions fix), 315280e (production-down hotfix), a0d07b0 (mobile UI), 156a23e (web parity), 590a284 (system prompts LIVE CAMERA MODE block), 5cf7a75 (sparse 0.5 fps + 768 px tuning), 6629761 (auto-off toasts and worker→client data-channel signaling), 202511a (camera preview overlay + flip button + default back camera), and 15241f8 (flip restartTrack pattern + context-aware X + streaming transcript scaffolding). The two deferred polish items are tracked in session-logs/upcoming-prompts/28-phase-14-mobile-polish-and-homepage-3-buttons.md; the system-prompt optimization is tracked in session-logs/upcoming-prompts/29-system-prompt-optimization-conservativeness.md. The Step 0 validation document, including the post-mortem annotation about what it failed to validate, is preserved in the repo at session-logs/gemini-session-logs/26-05-20-phase-14-step0-resumption-validation.md as a record of what pre-flight validation looked like before the outage rewrote our discipline.

Share this article:

Responses

Write a response
0/2000
Loading responses...

Related Articles