
Temporal Integration: From Bugs to 100% Test Coverage

The honest war story of how eight sessions of debugging, auditing, and fixing brought FLIN's temporal model from a broken stub to 100% test coverage.

Thales & Claude | March 25, 2026 | 11 min read

Tags: flin, temporal, testing, debugging, coverage

This is not a success story about elegant design. This is a war story about bugs, wrong assumptions, and the grinding work of making a complex feature actually function. Between Sessions 068 and 076, we spent eight sessions debugging FLIN's temporal model -- discovering that features we thought were missing were actually implemented, that features we thought were working were actually broken, and that our tracking of progress was wildly inaccurate.

By the end, all twenty-seven temporal integration tests passed. But getting there was a humbling reminder that building a compiler feature and shipping a compiler feature are two very different things.

The Starting Point: A Confident 3%

Before Session 068, our tracking document said the temporal model was three percent complete. Five tasks out of one hundred sixty. Only soft delete was implemented. Everything else -- the @ operator, time keywords, history queries, time arithmetic -- was listed as "not started" or "minimal."

This was wrong. Spectacularly wrong.

The Audit That Changed Everything

Session 068 began as a routine assessment: read the code, figure out what was missing, plan the implementation. Instead, it became an archaeological expedition.

The temporal model was not three percent complete. It was thirty-seven and a half percent complete. Sixty out of one hundred sixty tasks were already done. The lexer had all the tokens. The parser built the right AST nodes. The type checker validated temporal expressions. The code generator emitted the correct bytecodes. The VM had handlers for every temporal opcode. The database stored version history.

The code existed. It had just never been tested end-to-end.

Progress before audit:  5/160  (3%)
Progress after audit:  60/160  (37.5%)

Here is what was actually working, layer by layer:

| Layer | Status |
| --- | --- |
| Lexer | @ token and all time keywords present |
| Parser | Expr::Temporal AST node implemented |
| Type Checker | check_temporal() validates expressions |
| Code Generator | emit_temporal() generates bytecode |
| VM | All temporal opcodes implemented |
| Database | get_history(), soft delete, version tracking |

And here is what was broken:

  • OpCode::AtTime was a stub that returned the entity unchanged.
  • The type checker rejected date strings in @ expressions.
  • There was one integration test (lexer only).
  • No end-to-end validation existed.

The lesson was painful: without integration tests, you have no idea whether your features work.

Fixing AtTime: The Stub That Fooled Everyone

The most embarrassing bug was OpCode::AtTime. This opcode handled time keyword access -- user @ yesterday, user @ last_week. It had been "implemented" in an earlier session. It compiled. It ran without errors. It returned a value.

It returned the wrong value. The implementation was a stub:

// The original "implementation"
OpCode::AtTime => {
    let _time_code = self.read_u8(code);
    let entity_val = self.pop()?;
    // Just return the entity unchanged
    self.push(entity_val);
}

Read the time code byte. Pop the entity. Push it back. No timestamp calculation. No history lookup. If you wrote user @ yesterday, you got today's user. The feature "worked" in the sense that it did not crash, but it was functionally a no-op.

The fix required actual time arithmetic. Each keyword had to be converted to a millisecond timestamp, then used to search the version history for the matching state:

OpCode::AtTime => {
    let time_code = self.read_u8(code);
    let entity_val = self.pop()?;

    let target_timestamp = if let Some(tc) = TimeCode::from_byte(time_code) {
        let now = current_timestamp_ms();
        match tc {
            TimeCode::Now => now,
            TimeCode::Today => {
                let secs_today = (now / 1000) - ((now / 1000) % 86400);
                secs_today * 1000
            }
            TimeCode::Yesterday => {
                let secs_today = (now / 1000) - ((now / 1000) % 86400);
                (secs_today - 86400) * 1000
            }
            TimeCode::Tomorrow => {
                let secs_today = (now / 1000) - ((now / 1000) % 86400);
                (secs_today + 86400) * 1000
            }
            TimeCode::LastWeek => now - (7 * 24 * 60 * 60 * 1000),
            TimeCode::LastMonth => now - (30 * 24 * 60 * 60 * 1000),
            TimeCode::LastYear => now - (365 * 24 * 60 * 60 * 1000),
        }
    } else {
        // invalid time code: error handling elided
        current_timestamp_ms()
    };

    // Find version at target timestamp
    // (full history lookup implementation)
}

Ninety-one lines of new code. Seven keywords, all functional. One bug that had been hiding since the feature was first "implemented."

The History Duplication Bug

Session 075 tackled the .history property. As described in Article 047, the infrastructure was ninety percent complete -- but two bugs made it produce incorrect results.

The duplication bug was subtle. When an entity was first saved, ZeroCore added the initial version to the history array. Then, when .history was accessed, the VM's OpCode::History handler appended the current version to the result. For an entity that had been saved twice, the history was [v1, v2, v2] instead of [v1, v2].

The fix required establishing a clear semantic rule: history stores past versions only. The VM is responsible for appending the current version when constructing the result list. This eliminated the duplication.

The second bug was simpler: unsaved entities (with id == 0) returned [current] instead of []. An entity that has never been persisted has no history. Adding an is_saved check resolved it.

Five new tests passed after these fixes, bringing temporal coverage from eleven out of twenty-seven to sixteen out of twenty-seven.

Session 076: The Final Push to 100%

Session 076 was the culmination -- fixing the remaining eleven failures to reach one hundred percent temporal test coverage. Each failure had a different root cause, and understanding them required knowledge of FLIN's view rendering system, its reserved keywords, and its reactive HTML output.

Root Cause 1: Top-Level {if} Blocks (7 tests)

Seven tests failed because {if} blocks were written at the top level of the file. In FLIN, control flow tokens like {if} are only recognized by the lexer when inside view elements (Content mode). At the top level, { is parsed as the start of an expression, not a control flow directive.

// WRONG -- top-level {if} causes parse error
old = user @ -1
{if old}
    <div>Found</div>
{else}
    <div>Not found</div>
{/if}

// CORRECT -- wrap in a view element
old = user @ -1
<div>
    {if old}
        <div>Found</div>
    {else}
        <div>Not found</div>
    {/if}
</div>

This was not a temporal bug. It was a test authoring error that only manifested because temporal tests tend to use conditionals heavily (checking whether a past version exists). The fix was wrapping every top-level {if} in a view element.

Root Cause 2: Reserved Keyword Conflict (2 tests)

Two tests used log as a variable name. In FLIN, log is a built-in function (for logging to the console). The type checker reported a type error: log is (unknown) -> unit, not an entity.

// WRONG -- 'log' is reserved
entity Log { message: text }
log = Log { message: "test" }
save log

// CORRECT -- use a different variable name
entity Log { message: text }
entry = Log { message: "test" }
save entry

A naming collision that had nothing to do with temporal logic but only surfaced in temporal test scenarios.

Root Cause 3: Delete Creates a Version (1 test)

One test expected two versions after two saves, but delete creates a third version. The test asserted history.count == 2 after a create-save-save-delete sequence, but the correct count was three because soft delete increments the version number and records the deletion as a history entry.

// ZeroCore delete behavior
entity.version += 1;
let version = EntityVersion { /* ... */ };
collection.history.insert(/* ... */);

The timeline: save (version 1), save (version 2), delete (version 3 -- marks deleted_at and saves to history). The fix was updating the test expectation from two to three.

Root Cause 4: Reactive Rendering (4 tests)

FLIN's HTML renderer wraps each interpolated value in a reactive span element. Four tests asserted plain text output like "Max increased from 100 to 200", but the actual output wrapped the interpolated numbers in reactive spans.

// WRONG assertion
assert_output_contains(&output, "Max increased from 100 to 200");

// CORRECT -- check for the values within reactive spans
assert_output_contains(&output, "Max increased from");
assert_output_contains(&output, ">100<");
assert_output_contains(&output, ">200<");

Root Cause 5: Both Branches Rendered (1 test)

One test used assert_output_not_contains to verify that only one branch of an {if} block was rendered. But FLIN's reactive rendering emits both branches with display: none/block, so the client can toggle them without a server round-trip.

<!-- Both branches present in output -->
<span data-flin-if="name_changed" style="display: none">Name changed</span>
<span data-flin-if="(!name_changed)" style="display: block">Name unchanged</span>

The fix was removing the negative assertion and adding a comment explaining the reactive rendering behavior.

The Full List: All 27 Tests Passing

After Session 076, every temporal integration test passed:

Temporal Access (@ operator): Nine tests covering @ -1, @ -2, @ 0, out-of-range access, field access on temporal results, chained temporal access, and access without prior save.

Temporal Keywords: Four tests covering @ now, @ today, @ yesterday, and @ last_week.

History Queries: Six tests covering .history returning all versions, single-version history, history after updates, empty history before save, history in conditionals, and independent history across multiple entities.

Integration Scenarios: Eight tests covering soft delete with history preservation, soft delete with temporal access, change detection, change magnitude calculation, nested temporal access, audit trail use case, price history use case, and undo-to-previous use case.

The test count had grown from one (in Session 068) to twenty-seven (in Session 076). Zero regressions across one thousand and ten library tests.

What We Learned

1. Always Audit Before Assuming

Session 068 revealed that progress was twelve times higher than believed. Code had been written in earlier sessions and never tracked. If we had started implementing from scratch based on the tracking document, we would have wasted sessions rewriting existing code.

The counter-lesson is equally important: code that exists is not code that works. The audit found sixty completed tasks but also found that the most critical one -- AtTime -- was a stub.

2. Integration Tests Are Non-Negotiable

Unit tests at each layer passed. The lexer tokenized correctly. The parser built the right AST. The type checker accepted valid expressions. The code generator emitted proper bytecodes. The VM executed opcodes without crashing. But the end-to-end flow was broken because AtTime was a no-op.

Only integration tests -- tests that write FLIN code, compile it, execute it, and check the HTML output -- caught this. After Session 076, we made temporal integration tests a blocking requirement for any new temporal feature.

3. Test Failures Are Often Not About What You Think

Of the eleven failures fixed in Session 076, zero were caused by temporal logic bugs. Seven were view syntax issues (top-level {if}). Two were keyword collisions. One was a wrong expectation. One was a misunderstanding of reactive rendering. The temporal model itself was correct -- the tests were not.

This pattern repeats across software engineering: the first diagnosis is usually wrong. Fixing the actual root cause requires understanding the full system, not just the component under test.

4. Semantic Clarity Prevents Bugs

The history duplication bug existed because there was no clear rule about who owned the "current version" in the history list. Was it the database? The VM? Both? Once we established the rule -- "history stores past versions only; the VM appends the current version" -- the bug became obvious and the fix was trivial.

Every ambiguity in a system's semantics is a future bug waiting to be discovered.

The Debugging Marathon in Numbers

| Metric | Value |
| --- | --- |
| Sessions spent | 8 (068-076) |
| Temporal tasks discovered as already complete | 55 |
| Bugs found and fixed | 5 |
| Tests added | 26 |
| Root causes identified | 5 distinct categories |
| Lines of test code | ~700 |
| Library test regressions | 0 |

The temporal model went from an untested collection of code to a fully validated, one hundred percent covered feature. It was not glamorous work. There were no architectural breakthroughs. Just systematic debugging, one failure at a time, until every test turned green.

That is how real software gets shipped.

---

This is Part 3 of the "How We Built FLIN" temporal model series, documenting the debugging marathon that brought temporal tests to 100% coverage.

Series Navigation:
- [046] Every Entity Remembers Everything: The Temporal Model
- [047] Version History and Time Travel Queries
- [048] Temporal Integration: From Bugs to 100% Test Coverage (you are here)
- [049] Destroy and Restore: Soft Deletes Done Right
- [050] Temporal Filtering and Ordering
- [051] Temporal Comparison Helpers
- [052] Version Metadata Access
- [053] Time Arithmetic: Adding Days, Comparing Dates
- [054] Tracking Accuracy and Validation
- [055] The Temporal Model Complete: What No Other Language Has
