On the day the audit concluded, we ran cargo test one final time. The number that came back was 3,452. Every test passed. Zero failures. Zero ignored. Zero flaky results. In a codebase of 186,252 lines built across 301 sessions in 42 days, every automated verification still held.
That number did not happen by accident. It happened because every session that introduced a feature also introduced the tests to verify it. It happened because every audit fix session ran the full suite before and after changes. And it happened because Rust's type system catches at compile time the categories of bugs that would otherwise require thousands more tests in a dynamically typed language.
This is the story of FLIN's test suite -- how it grew, what it covers, and what 3,452 passing tests actually mean for a language runtime.
The Growth Curve
FLIN's test count grew with its feature count, but not linearly. The earliest sessions produced the most tests per feature because the foundational components -- lexer, parser, basic VM operations -- have the highest test density. A lexer that recognizes 80+ token types needs at least 80 tests. A parser that handles 40+ statement types needs at least 40 tests. An opcode dispatch table with 170+ opcodes needs at least 170 tests.
| Session Range | Focus | Tests Added | Cumulative |
|---------------|-------|-------------|------------|
| 001-050 | Lexer, Parser, VM | ~1,200 | 1,200 |
| 051-100 | Types, Entities | ~600 | 1,800 |
| 101-150 | Web Server, Routes | ~400 | 2,200 |
| 151-200 | Database, Security | ~350 | 2,550 |
| 201-250 | Audit, Fixes | ~500 | 3,050 |
| 251-301 | Polish, Gaps | ~400 | 3,452 |

The middle sessions (101-200) added fewer tests per session because each feature was larger and more integrated. A web server route test exercises the lexer, parser, typechecker, codegen, VM, renderer, and HTTP server all at once -- one test covers many components. The later sessions (201+) saw a resurgence in test count as audit fixes required targeted verification of specific behaviors.
Test Categories
The test suite is organized into three tiers that mirror Rust's testing conventions:
Unit tests (embedded in source files). These test individual functions and methods in isolation. The parser has 29 unit tests for AST creation, display formatting, view elements, component detection, and lifecycle hooks. The renderer has 119 unit tests for HTML generation, event handler serialization, component prop passing, and layout rendering.
```rust
// Example: parser unit test for view element creation
#[test]
fn test_view_element_is_component() {
    let element = ViewElement {
        tag: "Button".to_string(),
        attributes: vec![],
        children: vec![],
        span: Span::default(),
    };
    assert!(element.is_component_tag());

    let element = ViewElement {
        tag: "div".to_string(),
        attributes: vec![],
        children: vec![],
        span: Span::default(),
    };
    assert!(!element.is_component_tag());
}
```

Integration tests (in the tests/ directory). These compile and execute complete FLIN programs, verifying end-to-end behavior. The dev server flow tests from Session 203 are integration tests -- they create a VM, inject state, execute bytecode, and verify database persistence.
```rust
// Example: integration test for entity persistence across restarts
#[test]
fn test_recovery_between_vms() {
    let db_path = tempdir().unwrap();

    // VM1: create and save an entity
    {
        let mut vm = VM::new_with_storage(db_path.path());
        vm.register_entity("Todo", &["title", "done"]);
        vm.save_entity("Todo", &[
            ("title", Value::Text("Buy milk".into())),
            ("done", Value::Bool(false)),
        ]).unwrap();
    }
    // VM1 is dropped -- simulates server shutdown

    // VM2: verify entity survived
    {
        let mut vm = VM::new_with_storage(db_path.path());
        vm.register_entity("Todo", &["title", "done"]);
        let todos = vm.query_all("Todo").unwrap();
        assert_eq!(todos.len(), 1);
        assert_eq!(
            vm.get_field(&todos[0], "title").unwrap(),
            Value::Text("Buy milk".into())
        );
    }
}
```

Parser stress tests (line 8866+ in parser.rs). The parser file contains approximately 600 panic-based assertions in its test section. These are deliberately aggressive -- they test that specific FLIN syntax produces specific AST structures, and any deviation causes an immediate test failure. The panic calls in the test section (which the audit initially flagged) are standard Rust test assertions, not production code issues.
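To illustrate why these panics are assertions rather than defects, here is a minimal sketch of the pattern. The `Stmt` enum and `parse_stmt` function are toy stand-ins, not FLIN's actual AST or parser; only the shape of the panic-based check mirrors the real suite.

```rust
// Toy stand-in types -- FLIN's real AST is far larger.
#[derive(Debug, PartialEq)]
enum Stmt {
    Let { name: String },
    Other,
}

// Toy classifier standing in for the real parser.
fn parse_stmt(src: &str) -> Stmt {
    match src.split_whitespace().collect::<Vec<_>>().as_slice() {
        ["let", name, "=", ..] => Stmt::Let { name: name.to_string() },
        _ => Stmt::Other,
    }
}

// In the real suite this body lives under #[test]; the panic! is the
// assertion mechanism the audit initially flagged. It only ever fires
// inside the test harness, never in production code paths.
fn check_let_parses() {
    match parse_stmt("let x = 1") {
        Stmt::Let { name } => assert_eq!(name, "x"),
        other => panic!("expected a Let statement, got {:?}", other),
    }
}
```

A panic here fails exactly one test with a message naming the unexpected AST shape, which is precisely what a stress assertion should do.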
What the Tests Cover
The test suite's coverage spans every major subsystem:
Lexer coverage. Every token type in the 84-keyword vocabulary is tested. Coverage extends to template interpolation modes (string, attribute, view), raw content handling for style/script/pre/code tags, the three-mode state machine transitions, and edge cases like nested braces in attribute expressions.
Parser coverage. All 40+ statement types, all 35+ expression types, all 7 pattern types, all 16 type annotation variants, and all 50+ field validators. The parser tests verify both correct parsing and error recovery -- what happens when the input is malformed.
VM coverage. Opcode dispatch for all implemented opcodes, native function behavior for string/math/date/list operations, garbage collection under allocation pressure, and scope management for closures and upvalues.
Renderer coverage. HTML generation for every element type, event handler serialization for native and component elements, conditional and loop rendering, slot and layout composition, and translation lookup.
Database coverage. CRUD operations for all entity types, WAL persistence and recovery, transaction atomicity, time-travel versioning, and query filtering with various operators.
```rust
// Example: VM test for CreateMap with both string representations
#[test]
fn test_create_map_value_text_keys() {
    let mut vm = VM::new();
    let source = r#"
        translations = {
            en: { "hello": "Hello" },
            fr: { "hello": "Bonjour" }
        }
    "#;
    vm.execute(source).unwrap();
    // Verify both Value::Text and Value::Object keys work
    let map = vm.get_global("translations").unwrap();
    let en = vm.map_get(&map, "en").unwrap();
    let hello = vm.map_get(&en, "hello").unwrap();
    assert_eq!(hello, Value::Text("Hello".into()));
}
```

What the Tests Do Not Cover
Honesty about test coverage requires acknowledging the gaps. FLIN's test suite has three significant blind spots as of the audit completion:
No fuzz testing. The parser and lexer have not been subjected to randomized input generation. Fuzz testing would exercise error recovery paths and edge cases that human-written tests miss. For a language that processes untrusted developer input, this is a gap worth closing.
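To make the idea concrete, here is a minimal hand-rolled sketch of the fuzzing approach, using only the standard library. A real setup would use cargo-fuzz with a `fuzz_target!` wrapping the actual parser; the `parse` function below is a toy stand-in (it just checks brace balance), and the deterministic LCG keeps the random inputs reproducible.

```rust
// Toy stand-in for the parser: must return Ok or Err, never panic.
fn parse(input: &str) -> Result<usize, String> {
    let mut depth: i64 = 0;
    for c in input.chars() {
        match c {
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth < 0 {
                    return Err("unbalanced '}'".to_string());
                }
            }
            _ => {}
        }
    }
    if depth == 0 { Ok(input.len()) } else { Err("unclosed '{'".to_string()) }
}

// Tiny deterministic linear congruential generator, so failures reproduce.
fn lcg(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state
}

// The fuzzing property: no generated input may panic the parser.
// Errors are fine; crashes are not.
fn fuzz_smoke(iterations: usize) {
    let mut seed = 0x5eed_u64;
    for _ in 0..iterations {
        let len = (lcg(&mut seed) % 64) as usize;
        let input: String = (0..len)
            .map(|_| (b' ' + (lcg(&mut seed) % 95) as u8) as char)
            .collect();
        let _ = parse(&input); // must not panic
    }
}
```

The point of the sketch is the contract, not the generator: a fuzzer asserts "any input produces Ok or Err," which exercises error recovery paths that example-based tests rarely reach.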
Limited concurrency testing. The web server and WebSocket modules have functional tests but not load tests. Under concurrent request processing, shared state access patterns may reveal race conditions that sequential tests cannot detect.
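A minimal concurrent smoke test would look something like the sketch below: many threads hammer a shared piece of state and the test asserts no updates are lost. The `Mutex`-guarded counter is a toy stand-in for the server's shared state, not FLIN's actual request handling; a real load test would drive the HTTP layer itself.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Spawn `threads` workers, each performing `requests_per_thread`
// mutations of shared state, and return the final total.
fn hammer_shared_state(threads: usize, requests_per_thread: usize) -> u64 {
    let state = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let state = Arc::clone(&state);
            thread::spawn(move || {
                for _ in 0..requests_per_thread {
                    // Stand-in for one request mutating shared state.
                    *state.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let total = *state.lock().unwrap();
    total
}
```

If the shared state were updated without proper synchronization, a test like this (run under high iteration counts, or under tools like loom or ThreadSanitizer) is where lost updates would surface.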
No property-based testing. Operations that should satisfy algebraic properties (like parse(format(ast)) == ast or deserialize(serialize(value)) == value) are tested with specific examples rather than with property-based testing frameworks like proptest. Property-based testing would provide stronger guarantees about these invariants.
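For illustration, here is a hand-rolled sketch of the roundtrip property. A real suite would use proptest to generate the cases; the `Value`, `serialize`, and `deserialize` below are toy stand-ins, not FLIN's actual serialization code.

```rust
// Toy value type and roundtrip pair -- stand-ins for FLIN's real ones.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Num(i64),
    Text(String),
}

fn serialize(v: &Value) -> String {
    match v {
        Value::Num(n) => format!("n:{}", n),
        Value::Text(s) => format!("t:{}", s),
    }
}

fn deserialize(s: &str) -> Option<Value> {
    match s.split_once(':')? {
        ("n", rest) => rest.parse().ok().map(Value::Num),
        ("t", rest) => Some(Value::Text(rest.to_string())),
        _ => None,
    }
}

// The property under test: deserialize(serialize(v)) == v must hold
// for every value, not just the handful an example-based test picks.
fn roundtrip_holds(cases: &[Value]) -> bool {
    cases.iter().all(|v| deserialize(&serialize(v)).as_ref() == Some(v))
}
```

A framework like proptest would replace the explicit `cases` slice with hundreds of generated values and automatically shrink any failure to a minimal counterexample, which is the guarantee the example-based suite currently lacks.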
The Test Suite as Documentation
Beyond verification, FLIN's test suite serves as executable documentation. When the audit needed to understand how a feature was supposed to work, the tests provided the authoritative answer. The session logs described intent; the tests described behavior.
This dual role is particularly important for a language runtime because the specification and the implementation can diverge. When they do, the question "which is correct?" has only one reliable answer: the tests. If the tests pass and the behavior matches the test expectations, the implementation is correct regardless of what the specification says. If the tests need updating, the specification needs updating too.
```rust
// Tests as documentation: how does Entity.where() work?
#[test]
fn test_entity_where_filters_correctly() {
    let mut vm = setup_todo_vm();

    // Add test data
    vm.save_entity("Todo", &[
        ("title", Value::Text("Done task".into())),
        ("done", Value::Bool(true)),
    ]).unwrap();
    vm.save_entity("Todo", &[
        ("title", Value::Text("Open task".into())),
        ("done", Value::Bool(false)),
    ]).unwrap();

    // where(done == false) should return only open tasks
    vm.execute(r#"
        open = Todo.where(done == false)
    "#).unwrap();
    let open = vm.get_global("open").unwrap();
    assert_eq!(vm.list_len(&open), 1);
}
```

This test tells you everything you need to know about Entity.where(): it accepts a field comparison, it returns a filtered list, and it evaluates the predicate against each entity's fields. No prose documentation could be more precise or more trustworthy.
3,452 and Counting
The number 3,452 is not a final count. It is a snapshot of January 2026. Every future session that adds a feature will add tests. Every bug report will produce a regression test before the fix. The audit itself added dozens of tests for previously untested code paths.
But 3,452 tests passing simultaneously, after 301 sessions of development, is a statement about the integrity of the codebase. It says that the foundational decisions -- using Rust, maintaining test discipline, running the full suite after every change -- were correct. It says that a language runtime built in 42 days can be as well-tested as one built over years, if the testing culture is embedded from Session 1.
And it says that when the audit found 30 TODOs and 5 production panics, it found them against a backdrop of thousands of verified behaviors. The defects were the exception, not the rule. The codebase was sound.
The next article examines what the audit taught us about the broader practice of building a programming language -- the architectural lessons, the process insights, and the principles we would carry forward into FLIN's next phase.
This is Part 152 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.
Series Navigation:
- [151] Database Persistence Audit
- [152] 3,452 Tests, Zero Failures (you are here)
- [153] What the Audit Taught Us About Building a Language