For the first three sessions of FlinDB development, every query performed a full table scan. User.where(email == "[email protected]") iterated through every User entity, checking the email field one by one. For a hundred users, this was instantaneous. For ten thousand users, it was noticeable. For a million users, it would be unusable.
The @index annotation already existed in FlinDB's schema definition. Developers could write email: text @index and feel like they were doing the right thing. But the annotation was decorative. The index was declared but never built. The query engine ignored it entirely.
Session 163 fixed this. In a single session, we implemented index population, index maintenance across all mutation operations, query optimization that automatically uses indexes when available, and index-backed unique constraint checking. Nine tests. Every query on an indexed field went from O(n) to O(1).
The Problem: Decorative Indexes
Before Session 163, the @index annotation modified the schema but had no effect on query execution:
```flin
entity User {
    email: text @index  // Declared but never used
    name: text
}

// This query always scanned every User entity
users = User.where(email == "[email protected]")
```

The schema recorded that `email` should be indexed. The query engine did not check whether a field was indexed. Every `where_eq`, `where_lt`, and `where_gt` operation performed a linear scan through all entities of that type.
This was a known gap. Sessions 160-162 focused on correctness -- making CRUD, constraints, and aggregations work correctly. Session 163 was about making them work fast.
Index Storage Design
The index data structure is a nested HashMap stored inside each entity collection:
```rust
// EntityCollection fields
indexes: HashMap<String, HashMap<String, Vec<u64>>>,
```

The outer key is the field name (e.g., `"email"`). The inner key is the encoded value (e.g., `"$$TEXT$$:[email protected]"`). The value is a vector of entity IDs that have that value for that field.
The index key format uses a type prefix to avoid collisions between different value types:
```rust
fn value_to_index_key(value: &Value) -> String {
    match value {
        Value::Text(s) => format!("$$TEXT$$:{}", s),
        Value::Int(n) => format!("$$INT$$:{}", n),
        Value::Number(n) => format!("$$NUM$$:{}", n),
        Value::Bool(b) => format!("$$BOOL$$:{}", b),
        _ => format!("$$OTHER$$:{:?}", value),
    }
}
```

Why the type prefix? Because the string `"42"` and the integer `42` are different values that should map to different index entries. Without the prefix, `Value::Text("42")` and `Value::Int(42)` would collide in the index, causing incorrect query results.
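To see the collision concretely, here is a runnable sketch; the two-variant `Value` enum is a reduced stand-in for FlinDB's, covering just the case in question:

```rust
// Reduced stand-in for FlinDB's Value enum (illustration only)
enum Value {
    Text(String),
    Int(i64),
}

// Same encoding scheme, limited to the two variants that would collide
fn value_to_index_key(value: &Value) -> String {
    match value {
        Value::Text(s) => format!("$$TEXT$$:{}", s),
        Value::Int(n) => format!("$$INT$$:{}", n),
    }
}

fn main() {
    let text_key = value_to_index_key(&Value::Text("42".to_string()));
    let int_key = value_to_index_key(&Value::Int(42));

    // The prefix keeps the two values in separate index entries;
    // stripping it would leave both as the bare string "42"
    assert_eq!(text_key, "$$TEXT$$:42");
    assert_eq!(int_key, "$$INT$$:42");
    assert_ne!(text_key, int_key);
}
```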
Index Population on Save
When an entity is saved, its indexed fields are added to the appropriate indexes:
```rust
fn add_to_indexes(
    &mut self,
    schema: &EntitySchema,
    entity_id: u64,
    fields: &HashMap<String, Value>,
) {
    for field_name in &schema.indexed_fields {
        if let Some(value) = fields.get(field_name) {
            let key = value_to_index_key(value);
            self.indexes
                .entry(field_name.clone())
                .or_default()
                .entry(key)
                .or_default()
                .push(entity_id);
        }
    }
}
```

For updates, index maintenance is a two-step process: remove the old value from the index, then add the new value:
```rust
// In save() for existing entities:
// 1. Remove old values from indexes
self.remove_from_indexes(schema, entity_id, &old_fields);
// 2. Add new values to indexes
self.add_to_indexes(schema, entity_id, &new_fields);
```

This ensures that the index always reflects the current state of the data. An entity that changes its email from "[email protected]" to "[email protected]" is removed from the index entry for "[email protected]" and added to the entry for "[email protected]".
Index Maintenance on Delete and Restore
Soft delete, hard destroy, and restore all affect indexes:
Soft delete removes the entity from indexes. Since all queries automatically exclude soft-deleted entities, the index should not contain them:
```rust
// In delete():
self.remove_from_indexes(schema, entity_id, &entity.fields);
```

Hard destroy also removes from indexes, but before the entity is removed from storage:
```rust
// In destroy():
self.remove_from_indexes(schema, entity_id, &entity.fields);
// Then remove entity from data
```

Restore (un-deleting a soft-deleted entity) adds the entity back to indexes:
```rust
// In restore():
self.add_to_indexes(schema, entity_id, &entity.fields);
```

This lifecycle management is critical. Without it, indexes would become stale -- containing references to deleted entities or missing references to restored ones. Every mutation operation must maintain index consistency, or the index is worse than useless: it would return incorrect results, which is worse than being slow.
Query Optimization
The query execution engine was modified to check for index availability before executing a query:
```rust
fn execute_query(&self, query: &Query) -> DatabaseResult<Vec<EntityInstance>> {
    // Check if any equality condition is on an indexed field
    if let Some((field, value)) = self.find_indexed_eq_condition(query) {
        // O(1) index lookup
        let entity_ids = self.lookup_by_index(entity_type, &field, &value)?;
        // Apply remaining conditions as filters
        let results: Vec<_> = entity_ids.iter()
            .filter_map(|id| self.find_by_id_internal(entity_type, *id).ok())
            .filter(|entity| entity.deleted_at.is_none())
            .filter(|entity| query.remaining_conditions_match(entity))
            .collect();
        return Ok(results);
    }
    // Fallback: full table scan
    self.full_scan(query)
}
```

The optimizer looks for equality conditions (`where_eq`) on indexed fields. If it finds one, it uses the index for the initial lookup and then applies any remaining conditions as filters on the result set.
For example, consider this query:
```flin
User.where(email == "[email protected]" && active == true)
```

If `email` is indexed but `active` is not:

- Index lookup on `email` returns entity IDs `[42]` -- O(1)
- Load entity 42 from the data store -- O(1)
- Check `active == true` on entity 42 -- O(1)

Total: O(1) instead of O(n).
If neither field is indexed, the query falls back to a full scan. If both fields are indexed, the optimizer picks the first indexed condition it finds. A more sophisticated optimizer could estimate selectivity and choose the most selective index, but for FlinDB's use case (embedded databases with moderate data volumes), the simple approach works well.
Index-Backed Unique Constraints
Unique constraint checking was also optimized to use indexes. Before Session 163, checking uniqueness required scanning all entities to find duplicates. After Session 163, unique fields that are also indexed use the index for O(1) conflict detection:
```rust
fn check_unique_constraints(&self, ...) -> DatabaseResult<()> {
    for constraint in &schema.constraints {
        match constraint {
            Constraint::Unique(field) => {
                if let Some(value) = fields.get(field) {
                    // Try index lookup first
                    if schema.indexed_fields.contains(field) {
                        let ids = self.lookup_by_index(entity_type, field, value)?;
                        for existing_id in ids {
                            if Some(existing_id) != id {
                                return Err(DatabaseError::UniqueViolation { /* ... */ });
                            }
                        }
                    } else {
                        // Full scan fallback for non-indexed unique fields
                        self.check_unique_by_scan(entity_type, id, field, value)?;
                    }
                }
            }
            _ => {}
        }
    }
    Ok(())
}
```

In practice, unique fields should always be indexed (and FlinDB could auto-index them in a future version). But the implementation handles both cases -- indexed unique fields get O(1) checks, and non-indexed unique fields fall back to O(n) scans.
Performance Impact
The performance improvement is dramatic for indexed queries:
| Operation | Before (Complexity) | After (Complexity) |
|---|---|---|
| `where_eq` on indexed field | O(n) | O(1) |
| Unique constraint check on indexed field | O(n) | O(1) |
| `where_gt` on any field | O(n) | O(n) |
| Non-indexed `where_eq` | O(n) | O(n) |
The O(n) operations remain unchanged because range queries (>, <, >=, <=) require a different index structure (B-tree or sorted array) that was not implemented in this session. For FlinDB's target workload -- applications with hundreds to thousands of entities per type -- the hash index covering equality queries is the highest-impact optimization.
Automatic Indexing for References
One of Session 164's contributions (covered in the relationships article) was automatic indexing of entity reference fields. When a schema contains a reference like author: User, ZeroCore automatically adds author to the indexed fields list:
```rust
// During schema registration: auto-index entity reference fields
for field in &schema.fields {
    if matches!(field.field_type, FieldType::EntityRef(_)) {
        schema.indexed_fields.push(field.name.clone());
    }
}
```

This means that relationship queries -- `Post.where(author == user)` -- automatically benefit from index acceleration without the developer adding an explicit `@index` annotation. Since relationship queries are among the most common query patterns, this automatic indexing provides a significant performance improvement with zero configuration.
The Nine Tests
Session 163 added nine tests covering every aspect of index behavior:
- `test_index_populated_on_save` -- verifies index creation for new entities
- `test_index_updated_on_entity_update` -- verifies old value removal and new value addition
- `test_index_removed_on_delete` -- verifies soft delete removes from index
- `test_index_removed_on_destroy` -- verifies hard delete removes from index
- `test_index_restored_on_restore` -- verifies restore adds back to index
- `test_where_eq_uses_index` -- verifies query uses index for equality
- `test_unique_constraint_uses_index` -- verifies unique check uses index
- `test_index_multiple_entities_same_value` -- verifies non-unique indexes work correctly
- `test_index_with_multiple_conditions` -- verifies index plus additional filter conditions
Test 8 is particularly important. Non-unique indexes can have multiple entity IDs for the same value. For example, if ten users are in the "admin" role, the index entry for role: "admin" should contain all ten entity IDs. The test verifies that the lookup returns all matching entities, not just the first one.
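The invariant test 8 guards can be shown with a plain map -- a sketch of the scenario, not the test itself:

```rust
use std::collections::HashMap;

// Non-unique index: ten users share the "admin" role (illustration only)
fn build_role_index() -> HashMap<String, Vec<u64>> {
    let mut role_index: HashMap<String, Vec<u64>> = HashMap::new();
    for id in 1..=10 {
        role_index
            .entry("$$TEXT$$:admin".to_string())
            .or_default()
            .push(id);
    }
    role_index
}

fn main() {
    let index = build_role_index();
    // One bucket, ten IDs: a lookup must return all of them
    let admins = &index["$$TEXT$$:admin"];
    assert_eq!(admins.len(), 10);
    assert_eq!(admins, &(1..=10).collect::<Vec<u64>>());
}
```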
Test 9 verifies the combined path: index lookup for one condition, then filter for additional conditions. This is the most common real-world pattern -- you rarely query on a single field.
After Session 163: 2,120 tests passing (1,514 library + 606 integration).
What Indexes Cannot Do (Yet)
Session 163 implemented hash indexes optimized for equality lookups. Several advanced index features were deferred:
Range indexes -- B-tree or sorted indexes that would accelerate where_gt, where_lt, and range queries. These are important for time-based queries (created_at > yesterday) but require a more complex data structure.
Composite indexes -- indexes on multiple fields together, accelerating queries like where(category == "electronics" && brand == "Apple"). The current implementation can use an index on either field but not both simultaneously.
Partial indexes -- indexes that only include entities matching a condition, useful for large datasets where most entities do not match the common query pattern.
Index statistics -- tracking how often each index is used, how selective it is, and whether the query optimizer is making good choices.
These features would bring FlinDB's indexing closer to PostgreSQL's capabilities. But for the embedded database use case -- applications with moderate data volumes running on a single machine -- the hash index covers the highest-impact scenarios. The 80/20 rule applies: 80% of query performance comes from 20% of index features, and that 20% is equality lookups on indexed fields.
This is Part 6 of the "How We Built FlinDB" series, documenting how we built a complete embedded database engine for the FLIN programming language.
Series Navigation:
- [058] CRUD Without SQL
- [059] Constraints and Validation in FlinDB
- [060] Aggregations and Analytics
- [061] Index Utilization: Making Queries Fast (you are here)
- [062] Relationships and Eager/Lazy Loading