
Tracking Accuracy and Validation

How Session 079 revealed that FLIN's temporal model was further along than documented -- and the lessons learned about tracking accuracy, validation, and the danger of stale documentation.

Thales & Claude | March 25, 2026 | 11 min read

Session 079 was supposed to be a quick cleanup: complete two missing tests for TEMP-2 (Temporal Access) and move on. Instead, it became a lesson in the gap between what documentation says and what code does -- and a case study in why validation matters more than estimation.

The session began with the temporal model tracked at 51.9%. It ended at 57.5%. Not because we wrote a lot of new code, but because we discovered that nine tasks were already complete and had never been recorded.

The Audit Trigger

The tracking document listed TEMP-2 (Temporal Access) at 89% -- 16 of 18 tasks. Two tasks remained:

  • TEMP2-17: Test relative access (user @ -1, user @ -2)
  • TEMP2-18: Test absolute time access (user @ "2024-01-15")

The plan was straightforward: write two integration tests, mark the tasks complete, move on. Five minutes of work, maybe ten.

But when we opened tests/temporal_integration.rs to add the tests, we found them already there. Seven comprehensive integration tests covering exactly the functionality the tracking document claimed was untested:

```
test_temporal_relative_minus_1           -- @ -1 access
test_temporal_relative_minus_2           -- @ -2 access
test_temporal_relative_out_of_range      -- Out of bounds handling
test_temporal_relative_zero_is_current   -- @ 0 returns current
test_temporal_relative_access_field_directly  -- Field access on result
test_temporal_absolute_date_no_match     -- Date string with no match
test_temporal_absolute_datetime_no_match -- DateTime with no match
```

All seven passing. TEMP-2 was already 100% complete, and had been since Session 076. The tracking document just had not been updated.

The Cascade Discovery

Finding TEMP-2 already complete prompted a deeper investigation. If these tasks were mislabeled as incomplete, what else was wrong?

TEMP-3 (Temporal Keywords) was listed at 79% -- 11 of 14 tasks. Three tasks were supposedly missing. But the implementation was fully functional: all seven keywords lexed, parsed, type-checked, compiled, and executed correctly. The "missing" tasks were tests, and we wrote two test files to close the gap:

```flin
// temporal-keywords-test.flin -- Tests all 7 keywords
user = User { name: "Test" }
save user

// Verify all keywords produce valid timestamps
current = now
today_start = today
yesterday_start = yesterday

// Verify relationships
<div>
{if current > yesterday_start}
    <p>Now is after yesterday: correct</p>
{/if}
{if today_start > yesterday_start}
    <p>Today is after yesterday: correct</p>
{/if}
</div>
```
```flin
// temporal-keyword-comparisons-test.flin -- Keywords in conditions
entity Event {
    name: text
    scheduled_at: time
}

event = Event { name: "Meeting", scheduled_at: now + 7.days }
save event

// Keywords in comparisons
<div>
{if event.scheduled_at > now}
    <p>Event is in the future</p>
{/if}
{if event.scheduled_at > yesterday}
    <p>Event is after yesterday</p>
{/if}
</div>
```

Both test files passed syntax checking and execution. TEMP-3 moved to 100%.

TEMP-8 (Hard Delete/Restore) was listed at 8% -- 1 of 12 tasks. But Session 077 had implemented the full destroy/restore lifecycle:

  • destroy keyword: fully implemented (lexer, parser, type checker, codegen, VM).
  • restore() function: fully implemented (VM built-in, type checker registration).
  • Database methods: both destroy() and restore() working in ZeroCore.
  • Integration tests: nine destroy/restore tests passing.

Five core tasks were already complete. TEMP-8 moved from 8% to 42%.
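Pulling those pieces together, the lifecycle described in the session notes might look like the sketch below. The restore(user) call shape and the behavior described in the comments are assumptions for illustration, not confirmed FLIN syntax.

```flin
user = User { name: "Test" }
save user

delete user        // Soft delete: hidden from standard queries, history kept
restore(user)      // Assumed call shape: brings the soft-deleted entity back

destroy user       // Hard delete: the destroy keyword removes the entity permanently
```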

The Numbers

The tracking audit revealed:

| Category | Before | After | Change |
| --- | --- | --- | --- |
| TEMP-2: Temporal Access | 16/18 (89%) | 18/18 (100%) | +2 discovered |
| TEMP-3: Temporal Keywords | 11/14 (79%) | 14/14 (100%) | +3 (tests written) |
| TEMP-8: Hard Delete/Restore | 1/12 (8%) | 5/12 (42%) | +4 discovered |
| Total | 83/160 (51.9%) | 92/160 (57.5%) | +9 tasks (+5.6%) |

Six tasks were discovered as already complete. Three tasks were newly completed by writing test files. The net result was a 5.6% progress jump from a session that was supposed to be a quick cleanup.

Why Documentation Drifted

The root cause was a workflow gap. Implementation sessions focused on writing code and passing tests. Tracking updates happened at the end of sessions, if at all. When Session 076 achieved 100% temporal test coverage, the tests for TEMP-2 were among those passing -- but the tracking document was updated with the overall test count, not with task-level granularity.

Session 077 implemented destroy and restore with nine integration tests. The session log documented everything. But the global tracking file was only partially updated because the session ended late and the priority was committing working code, not administrative bookkeeping.

This is a universal pattern in software projects: implementation outpaces documentation. The code is the source of truth, but the tracking document is what people read. When they diverge, decisions get made based on stale information.

Lessons for Project Tracking

1. Always Verify Before Implementing

If we had started writing TEMP-2 tests without checking whether they existed, we would have created duplicate tests. Worse, we might have introduced subtle differences between the duplicates, creating maintenance confusion.

The rule now: before marking a task as "to do," search the codebase for existing implementations. Run existing tests. Verify that the gap is real.

2. Code Is the Source of Truth

The tracking document said 3%. The code said 37.5%. The tracking document said 89% for TEMP-2. The code said 100%. In every case, the code was right and the document was wrong.

For a two-person team (one human, one AI), this means the most reliable way to assess progress is to read the code and run the tests -- not to read the tracking file. The tracking file is a planning tool, not an audit tool.

3. Test Coverage Is the Real Metric

The tracking document tracked "tasks" -- a mix of implementation work, test coverage, and documentation. But the only tasks that mattered for "is this feature done?" were the tests. If a feature has comprehensive passing tests, it works. If it does not have tests, it might work -- but you cannot prove it.

Session 079 reinforced what Session 068 had already shown: the temporal model had far more working code than anyone realized, but without integration tests, "working" was an assumption, not a fact.

4. Small Sessions Have Outsized Impact

Session 079 was not a marathon implementation session. No new architecture was designed. No complex algorithms were implemented. The session audited existing work, wrote two test files, and updated tracking documentation.

The impact: nine tasks completed, two categories at 100%, and a 5.6% overall progress jump. The highest return on time investment of any temporal model session.

Validation Patterns for Temporal Data

Beyond project tracking, Session 079 surfaced several validation patterns for temporal data itself -- edge cases that the tests confirmed worked correctly.

Out-of-Range Temporal Access

What happens when you request a version that does not exist?

```flin
user @ -100    // Only 3 versions exist
```

The VM handles this gracefully: it returns None. No crash, no exception. The developer handles the absence case through FLIN's optional type system.

This was validated without writing new code -- the existing test test_temporal_relative_out_of_range covered it. The VM's history lookup simply fails to find a matching version and returns None instead of indexing out of bounds.
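The existing test covers the lookup itself; consuming the none result in FLIN code might look like the sketch below. Treating old as truthy in an {if} block is an assumption about FLIN's optional syntax, not something confirmed here.

```flin
user = User { name: "Test" }
save user

old = user @ -100    // Far more versions back than exist: returns none

<div>
{if old}
    <p>Found a historical version</p>
{/if}
</div>
```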

Keyword Ordering Consistency

Are temporal keywords ordered correctly? Is now always greater than yesterday? Is today always greater than last_week?

```flin
// These should always be true
now > yesterday         // true
today > yesterday       // true
now > last_week         // true
now > last_month        // true
last_week > last_month  // true
```

The keyword test file validated eight such relationships. All held. The timestamp calculations use UTC, which eliminates timezone-related ordering inconsistencies.

Keywords in Conditional Expressions

Temporal keywords must work not just as @ operands but as values in general expressions:

```flin
entity Event {
    name: text
    scheduled_at: time
}

event = Event { name: "Meeting", scheduled_at: now + 7.days }
save event

// Keywords in boolean expressions
is_future = event.scheduled_at > now
is_recent = event.scheduled_at > last_week
```

The comparison test file verified that keywords evaluate correctly in {if} blocks, in variable assignments, and in entity field comparisons. All passed.

Soft Delete Temporal Access Interaction

A soft-deleted entity should not appear in standard queries but should still be accessible through temporal access:

```flin
user = User { name: "Test" }
save user
delete user

// Standard query: not found
found = User.find(user.id)           // none

// Temporal access: history preserved
old = user @ -1                      // Previous version exists
```

This interaction was already tested but had not been explicitly tracked as a validation case. Session 079 confirmed it was working correctly.

The State After Session 079

Five categories at 100%:

| Category | Status |
| --- | --- |
| TEMP-1: Core Soft Delete (5/5) | Complete |
| TEMP-2: Temporal Access (18/18) | Complete |
| TEMP-3: Temporal Keywords (14/14) | Complete |
| TEMP-5: Time Arithmetic (12/12) | Complete |
| TEMP-11: Integration Tests (27/27) | Complete |

Overall: 92 of 160 tasks (57.5%).

All 1,046 tests passing (library tests plus integration tests). Zero regressions.

The Broader Pattern: Documentation Debt

Session 079 exposed a form of technical debt that gets less attention than code debt: documentation debt. When documentation falls behind implementation, the consequences cascade:

Planning errors. If the tracking document says a feature is not implemented, someone might schedule a session to implement it -- wasting time on work that is already done. This happened with TEMP-2: we allocated time to write tests that already existed.

Communication failures. When Thales asked "how far along is the temporal model?" the answer based on documentation was 52%. The answer based on code was 58%. For a CEO making product decisions, that 6% gap could change priorities.

Morale impact. Working on a feature that is "52% done" feels different from working on one that is "58% done." Progress perception affects motivation, and stale documentation systematically underreports progress, making the team feel like they are falling behind when they are actually ahead.

Duplicate work risk. The most expensive consequence. If a developer (human or AI) implements a feature without checking whether it exists, the result is duplicate code, conflicting implementations, and integration bugs.

The fix is not "better documentation practices" -- that is the same advice every project ignores. The fix is making verification a habit. Before every session that touches temporal features, we now run the integration test suite and check the test count. The test count does not lie. If 36 tests pass, 36 features work. No document needed.

Applying This to the CEO-AI CTO Workflow

The tracking accuracy problem is amplified in the CEO-AI CTO workflow because of how sessions work. Each session is relatively independent -- Claude receives context about what to do, executes, and produces results. Between sessions, the state is captured in session logs and tracking documents.

If those documents are inaccurate, the next session starts with wrong assumptions. Claude might re-implement something that works, skip something that is broken, or estimate effort incorrectly. The solution we adopted after Session 079 was threefold:

  1. Run tests first. Every session begins with cargo test to establish a baseline.
  2. Grep before implementing. Before writing new code, search for existing implementations of the same feature.
  3. Update tracking atomically. When a task is completed, the tracking file is updated in the same commit.

These practices reduced documentation drift in subsequent sessions and made progress estimation more reliable.

The session proved a principle that applies far beyond FLIN: sometimes the most productive work is not writing new code, but understanding the code you have already written. Verification is implementation. Validation is progress. And the most dangerous assumption in any project is that your tracking document is accurate.


This is Part 9 of the "How We Built FLIN" temporal model series, documenting the tracking accuracy audit that revealed hidden progress and validated temporal edge cases.

Series Navigation:

  • [046] Every Entity Remembers Everything: The Temporal Model
  • [047] Version History and Time Travel Queries
  • [048] Temporal Integration: From Bugs to 100% Test Coverage
  • [049] Destroy and Restore: Soft Deletes Done Right
  • [050] Temporal Filtering and Ordering
  • [051] Temporal Comparison Helpers
  • [052] Version Metadata Access
  • [053] Time Arithmetic: Adding Days, Comparing Dates
  • [054] Tracking Accuracy and Validation (you are here)
  • [055] The Temporal Model Complete: What No Other Language Has
