Storage is not free. A FLIN application that accepts file uploads will accumulate data indefinitely. Text documents compress well -- a 500 KB PDF becomes a 150 KB blob with Zstd. But already-compressed formats like JPEG and MP4 waste CPU time on compression attempts that produce larger output than the input. And when entities are deleted, their associated blobs become orphans: unreferenced files consuming disk space (or cloud storage budget) with no path to reclamation.
Sessions 233 and 234 addressed both problems. Transparent Zstd compression reduces storage costs for compressible formats while intelligently skipping already-compressed files. Blob garbage collection reclaims space from orphaned blobs using a hybrid approach: reference counting for immediate cleanup and mark-and-sweep for periodic consistency checks.
Transparent Zstd Compression
Zstd (Zstandard) is a compression algorithm developed by Facebook that offers an excellent balance of compression ratio and speed. At its default level (3), it compresses nearly as well as gzip at maximum compression but runs 5-10 times faster. For file storage, this means negligible latency impact on uploads and downloads.
The Compression Module
FLIN's compression is completely transparent. Callers of the storage backend do not know whether files are compressed. The backend compresses on write and decompresses on read:
```rust
// Magic bytes identify compressed blobs
pub const BLOB_MAGIC: &[u8; 8] = b"FLINBLB\0";

// File format for compressed blobs:
//   Bytes 0-7:  Magic "FLINBLB\0"
//   Bytes 8-15: Original size (u64 LE)
//   Bytes 16+:  Zstd-compressed data

pub struct CompressionConfig {
    pub enabled: bool,              // Default: true
    pub level: i32,                 // Default: 3 (range: 1-22)
    pub min_size: usize,            // Default: 1024 (1 KB)
    pub skip_extensions: Vec<String>,
}
```
The magic bytes serve two purposes. They identify compressed blobs so the decompression path knows to decompress. And they provide backward compatibility: blobs stored before compression was enabled do not start with FLINBLB\0, so they are returned as-is.
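The framing logic can be sketched independently of the compressor. This is a minimal std-only sketch of the header layout described above; the `wrap` and `parse` helpers are illustrative names, not FLIN's actual API:

```rust
const BLOB_MAGIC: &[u8; 8] = b"FLINBLB\0";

/// Wrap an already-compressed payload in the FLIN blob header.
fn wrap(original_size: u64, payload: &[u8]) -> Vec<u8> {
    let mut blob = Vec::with_capacity(16 + payload.len());
    blob.extend_from_slice(BLOB_MAGIC);                   // bytes 0-7
    blob.extend_from_slice(&original_size.to_le_bytes()); // bytes 8-15
    blob.extend_from_slice(payload);                      // bytes 16+
    blob
}

/// Parse the header back out; `None` means a legacy, uncompressed blob.
fn parse(blob: &[u8]) -> Option<(u64, &[u8])> {
    if blob.len() < 16 || &blob[0..8] != BLOB_MAGIC {
        return None; // stored before compression was enabled: caller returns it as-is
    }
    let size = u64::from_le_bytes(blob[8..16].try_into().unwrap());
    Some((size, &blob[16..]))
}

fn main() {
    let blob = wrap(500, b"compressed-bytes");
    assert_eq!(parse(&blob), Some((500, &b"compressed-bytes"[..])));
    assert_eq!(parse(b"plain old data"), None); // legacy blob passes through
    println!("ok");
}
```

The `None` branch is what makes the rollout backward compatible: old blobs fail the magic check and flow through the read path untouched.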
Smart Compression Decisions
Not every file benefits from compression. JPEG images, ZIP archives, and MP4 videos are already compressed. Attempting to compress them wastes CPU time and often produces output larger than the input. FLIN checks three conditions before compressing:
```rust
pub fn should_compress(
    extension: &str,
    size: usize,
    config: &CompressionConfig,
) -> bool {
    if !config.enabled {
        return false;
    }

    // Skip files smaller than threshold
    if size < config.min_size {
        return false;
    }

    // Skip already-compressed formats
    let ext = extension.to_lowercase();
    if config.skip_extensions.iter().any(|s| s == &ext) {
        return false;
    }

    true
}
```
The default skip list covers three categories:
| Category | Extensions |
|---|---|
| Images | .jpg, .jpeg, .png, .webp, .gif, .avif, .heic |
| Archives | .zip, .gz, .zst, .7z, .rar, .xz, .bz2, .lz4 |
| Media | .mp3, .mp4, .webm, .ogg, .m4a, .aac, .flac, .mkv, .avi, .mov |
The 1 KB minimum size threshold prevents compression of tiny files where the header overhead (16 bytes) would negate the savings. A 500-byte text file compressed to 300 bytes still takes 316 bytes with the header -- a 37% reduction instead of the 40% the raw compression achieved.
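The arithmetic behind that example can be checked with a couple of hypothetical helpers (not part of FLIN's API):

```rust
/// Effective stored size for a compressed blob: 16-byte header + payload.
fn stored_size(compressed_len: usize) -> usize {
    16 + compressed_len
}

/// Percentage saved relative to the original size.
fn savings_pct(original: usize, stored: usize) -> f64 {
    100.0 * (original - stored) as f64 / original as f64
}

fn main() {
    // The 500-byte example from the text: 300 bytes compressed + 16-byte header.
    let stored = stored_size(300);
    assert_eq!(stored, 316);
    // Header overhead turns a 40% raw reduction into a 37% effective one.
    println!("{:.0}% vs {:.0}%", savings_pct(500, stored), savings_pct(500, 300));
    // → 37% vs 40%
}
```

At 500 bytes the overhead merely dents the savings; below the 1 KB threshold on barely-compressible data it could erase them entirely, which is why tiny files are skipped.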
Compression and Decompression
The core functions handle the FLIN blob format:
```rust
pub fn compress_blob(data: &[u8], level: i32) -> io::Result<Vec<u8>> {
    let compressed = zstd::encode_all(data, level)?;

    // Skip if compression did not help
    if compressed.len() >= data.len() {
        return Ok(data.to_vec());
    }

    // Build FLIN blob: magic + original_size + compressed_data
    let mut blob = Vec::with_capacity(16 + compressed.len());
    blob.extend_from_slice(BLOB_MAGIC);
    blob.extend_from_slice(&(data.len() as u64).to_le_bytes());
    blob.extend_from_slice(&compressed);

    Ok(blob)
}

pub fn decompress_blob(data: &[u8]) -> io::Result<Vec<u8>> {
    // Not a FLIN blob (no magic bytes): return as-is
    if !is_compressed(data) {
        return Ok(data.to_vec());
    }

    let original_size = u64::from_le_bytes(data[8..16].try_into().unwrap()) as usize;
    let compressed_data = &data[16..];

    let mut decompressed = Vec::with_capacity(original_size);
    zstd::Decoder::new(compressed_data)?
        .read_to_end(&mut decompressed)?;

    Ok(decompressed)
}

pub fn is_compressed(data: &[u8]) -> bool {
    data.len() >= 16 && &data[0..8] == BLOB_MAGIC
}
```
The "skip if no benefit" check is subtle but important. Some files are incompressible -- random binary data, encrypted files, or files that are already compressed with an algorithm not in the skip list. For these, compress_blob returns the original data without the FLIN header, and the file is stored uncompressed.
Backend Integration
Compression is integrated into all four storage backends. The local backend demonstrates the pattern:
```rust
impl StorageBackend for LocalBackend {
    fn put(&self, hash: &str, data: &[u8], extension: &str) -> StorageResult<String> {
        validate_hash(hash)?;
        let path = self.build_path(hash, extension);

        if path.exists() {
            return Ok(self.format_path(hash, extension));
        }

        // Compress if appropriate
        let stored_data = if should_compress(extension, data.len(), &self.compression) {
            compress_blob(data, self.compression.level)?
        } else {
            data.to_vec()
        };

        std::fs::create_dir_all(path.parent().unwrap())?;
        std::fs::write(&path, &stored_data)?;

        Ok(self.format_path(hash, extension))
    }

    fn get(&self, path: &str) -> StorageResult<Vec<u8>> {
        // resolve_path maps the stored path back to a filesystem location
        let data = std::fs::read(self.resolve_path(path))?;

        // Decompress transparently
        Ok(decompress_blob(&data)?)
    }
}
```
The R2 and GCS backends follow the same pattern: compress before PUT, decompress after GET. The compression and decompression happen on the FLIN server, not on the cloud provider. This means the compressed data is what travels over the network, reducing upload and download bandwidth in addition to storage costs.
Blob Garbage Collection
When an entity with a file field is deleted, the file blob remains in storage. The entity record is gone, but the bytes on disk (or in the cloud) persist. Without garbage collection, these orphaned blobs accumulate forever.
The Hybrid Approach
FLIN uses two complementary GC strategies:
Reference counting handles the common case. When destroy is called on an entity, the runtime identifies file fields, decrements their reference counts, and deletes blobs that reach zero references. This is synchronous and immediate.
Mark-and-sweep handles edge cases. Periodic sweeps scan all blobs in storage, check which ones are referenced by live entities, and delete unreferenced blobs that have exceeded a grace period. This catches blobs orphaned by crashes, failed transactions, or schema migrations.
The Reference Index
```rust
pub struct BlobRefIndex {
    refs: HashMap<String, BlobRefEntry>, // blob_hash -> entry
    index_path: PathBuf,                 // .flindb/blob_refs.json
    dirty: bool,
}

pub struct BlobRefEntry {
    pub ref_count: u32,
    pub references: HashSet<(String, u64)>, // (entity_type, entity_id)
    pub created_at: i64,
    pub updated_at: i64,
    pub orphaned_at: Option<i64>,
}
```
The reference index tracks which entities reference each blob. When an entity with a file field is saved, the blob's reference count is incremented and the entity is added to the reference set. When the entity is destroyed, the reference count is decremented.
Content-addressable storage makes reference counting essential. If two entities reference the same file (same SHA-256 hash), deleting one entity must not delete the blob -- the other entity still needs it. The reference count ensures that blobs are deleted only when no entity references them.
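The deduplication interaction can be sketched with a plain `HashMap`. This is an illustrative stand-in, not FLIN's actual `BlobRefIndex`:

```rust
use std::collections::{HashMap, HashSet};

#[derive(Default)]
struct RefIndex {
    // blob_hash -> set of referencing (entity_type, entity_id) pairs
    refs: HashMap<String, HashSet<(String, u64)>>,
}

impl RefIndex {
    fn add_ref(&mut self, hash: &str, entity: (String, u64)) {
        self.refs.entry(hash.to_string()).or_default().insert(entity);
    }

    /// Returns true if the blob is now unreferenced (a GC candidate).
    fn remove_ref(&mut self, hash: &str, entity: &(String, u64)) -> bool {
        if let Some(set) = self.refs.get_mut(hash) {
            set.remove(entity);
            return set.is_empty();
        }
        false
    }
}

fn main() {
    let mut idx = RefIndex::default();
    // Two entities share one content-addressed blob (same SHA-256 hash).
    idx.add_ref("abc123", ("Document".into(), 1));
    idx.add_ref("abc123", ("Document".into(), 2));

    // Deleting one entity must not orphan the blob...
    assert!(!idx.remove_ref("abc123", &("Document".into(), 1)));
    // ...only removing the last reference does.
    assert!(idx.remove_ref("abc123", &("Document".into(), 2)));
    println!("ok");
}
```

Tracking the full reference set rather than a bare counter also makes counts self-healing: a double-destroy of the same entity removes the same pair twice and cannot drive the count below zero.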
Destroy With Cleanup
The destroy_with_cleanup method on FlinDB returns the file paths associated with a destroyed entity:
```rust
pub fn destroy_with_cleanup(
    &mut self,
    entity_type: &str,
    entity_id: u64,
) -> Result<Vec<String>, DatabaseError> {
    // Get file paths before destruction
    let file_paths = self.get_entity_file_paths(entity_type, entity_id)?;

    // Destroy the entity record
    self.destroy(entity_type, entity_id)?;

    Ok(file_paths)
}
```
The caller (the VM or the HTTP handler) then processes the returned paths through the reference index:
```rust
for path in blob_paths {
    if let Some(hash) = parse_blob_hash(&path) {
        blob_ref_index.remove_ref(&hash, entity_type, entity_id);
        if blob_ref_index.get_ref_count(&hash) == 0 {
            // Mark as orphaned, do not delete yet (grace period)
            blob_ref_index.mark_orphaned(&hash);
        }
    }
}
```

The Grace Period
Orphaned blobs are not deleted immediately. A configurable grace period (default: 1 hour) ensures that race conditions do not cause data loss. Consider this scenario:
1. User A starts uploading a file. The blob is stored.
2. User B uploads the same file. Deduplication detects the existing blob.
3. User A's transaction fails. The blob has one reference (User B).
4. User B deletes their entity. Reference count drops to zero.
5. Without a grace period, the blob is deleted.
6. But User A retries their upload and expects the blob to exist.
The grace period prevents step 5 from happening too quickly. The blob is marked as orphaned but not deleted until the grace period expires, giving concurrent operations time to complete.
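The eligibility rule reduces to a small pure function. A minimal sketch, assuming Unix-second timestamps; `eligible_for_deletion` is an illustrative helper, not FLIN's API:

```rust
/// A blob may be deleted only after the grace period has elapsed since it
/// was marked orphaned. `None` means it is still referenced.
fn eligible_for_deletion(orphaned_at: Option<i64>, now: i64, grace_secs: i64) -> bool {
    match orphaned_at {
        Some(t) => now - t > grace_secs,
        None => false, // referenced blobs are never deleted
    }
}

fn main() {
    let grace = 3600; // 1 hour, the default
    assert!(!eligible_for_deletion(Some(1_000), 1_000 + 600, grace));  // 10 min: keep
    assert!(eligible_for_deletion(Some(1_000), 1_000 + 7_200, grace)); // 2 h: delete
    assert!(!eligible_for_deletion(None, 10_000, grace));              // referenced: keep
    println!("ok");
}
```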
```rust
pub struct GcConfig {
    pub enabled: bool,
    pub orphan_grace_period: u64, // Default: 3600 seconds (1 hour)
    pub sync_cleanup: bool,       // Enable synchronous cleanup on destroy
}
```

Mark-and-Sweep
The sweep function is the safety net. It lists all blobs in storage, checks each one against the reference index, and deletes unreferenced blobs that have exceeded the grace period:
```rust
pub fn sweep(
    index: &mut BlobRefIndex,
    backend: &dyn StorageBackend,
    config: &GcConfig,
) -> Result<usize, StorageError> {
    let blobs = backend.list_blobs()?;
    let now = current_timestamp();
    let mut deleted = 0;

    for blob_hash in blobs {
        if let Some(entry) = index.get(&blob_hash) {
            if entry.ref_count == 0 {
                if let Some(orphaned_at) = entry.orphaned_at {
                    if now - orphaned_at > config.orphan_grace_period as i64 {
                        backend.delete(&blob_hash)?;
                        index.remove(&blob_hash);
                        deleted += 1;
                    }
                }
            }
        } else {
            // Blob not in index at all -- orphaned before GC was enabled
            backend.delete(&blob_hash)?;
            deleted += 1;
        }
    }

    Ok(deleted)
}
```
The list_blobs method was added to the StorageBackend trait specifically for GC. The local backend walks the directory tree. The R2 backend uses the S3 list objects API. The GCS backend uses the GCS objects list API. All three return a list of blob hashes present in storage.
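For the local backend, the directory walk can be sketched with `std::fs` alone. This is an illustrative sketch of what a local `list_blobs` might do (FLIN's on-disk layout and naming may differ):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively collect blob hashes (file stems) under a storage root.
fn list_blobs(root: &Path, out: &mut Vec<String>) -> io::Result<()> {
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        if path.is_dir() {
            list_blobs(&path, out)?; // descend into fan-out directories
        } else if let Some(stem) = path.file_stem().and_then(|s| s.to_str()) {
            out.push(stem.to_string()); // "ab/abc123.txt" -> "abc123"
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let root = std::env::temp_dir().join("flin_blob_demo");
    let _ = fs::remove_dir_all(&root); // clean slate for the demo
    fs::create_dir_all(root.join("ab"))?; // fan-out dir named by hash prefix
    fs::write(root.join("ab/abc123.txt"), b"data")?;

    let mut hashes = Vec::new();
    list_blobs(&root, &mut hashes)?;
    assert_eq!(hashes, vec!["abc123".to_string()]);

    fs::remove_dir_all(&root)?;
    println!("ok");
    Ok(())
}
```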
Index Persistence
The reference index is persisted to .flindb/blob_refs.json. This file is updated when the index is marked dirty (after any reference count change) and loaded on server startup. Persistence ensures that reference counts survive server restarts -- without it, a restart would lose all reference information, and the next sweep would consider every blob orphaned.
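The dirty-flag pattern itself is simple to sketch. FLIN's real index is JSON; to keep this example dependency-free it persists `hash<TAB>count` lines instead, and the `RefIndex` here is a simplified stand-in:

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::PathBuf;

struct RefIndex {
    refs: HashMap<String, u32>, // blob_hash -> ref_count
    index_path: PathBuf,
    dirty: bool,
}

impl RefIndex {
    fn add_ref(&mut self, hash: &str) {
        *self.refs.entry(hash.to_string()).or_insert(0) += 1;
        self.dirty = true; // any count change marks the index dirty
    }

    /// Persist only when something changed since the last save.
    fn save_if_dirty(&mut self) -> io::Result<()> {
        if !self.dirty {
            return Ok(());
        }
        let body: String = self.refs.iter().map(|(h, c)| format!("{h}\t{c}\n")).collect();
        fs::write(&self.index_path, body)?;
        self.dirty = false;
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("blob_refs_demo.tsv");
    let mut idx = RefIndex { refs: HashMap::new(), index_path: path.clone(), dirty: false };
    idx.add_ref("abc123");
    idx.save_if_dirty()?; // writes because dirty
    assert_eq!(fs::read_to_string(&path)?, "abc123\t1\n");
    idx.save_if_dirty()?; // no-op: nothing changed
    fs::remove_file(&path)?;
    println!("ok");
    Ok(())
}
```

Batching writes behind the dirty flag keeps the persistence cost proportional to change frequency rather than to read traffic.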
Combined Impact
Together, compression and garbage collection make FLIN's file storage sustainable at scale:
| Optimization | Impact |
|---|---|
| Zstd compression (text files) | 60-70% size reduction |
| Zstd compression (JSON/YAML) | 70-80% size reduction |
| Deduplication (same files) | 100% reduction for duplicates |
| GC (deleted entities) | Reclaims 100% of orphaned storage |
| Smart skip (images/video) | Zero CPU waste on incompressible files |
A FLIN application that stores 1,000 text-heavy documents might use 200 MB of raw storage. With compression and deduplication, that drops to 60-80 MB. When documents are deleted, GC reclaims the space within the grace period. The storage system is self-maintaining.
Session 233 added 25 compression tests. Session 234 added 17 GC tests, plus 3 list_blobs tests for the local backend. Total test count after both sessions: 3,494. The file storage system was now not only feature-complete but production-ready for long-running applications that need to manage storage costs.
In the final article of this arc, we add the last developer-facing feature: file preview generation, which automatically creates thumbnails when images are uploaded.
---
This is Part 134 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.
Series Navigation:
- [133] Semantic Auto-Conversion
- [134] Zstd Compression and Blob Garbage Collection (you are here)
- [135] File Preview Generation