#119 -- FastEmbed Integration for Embeddings

Cloud-based embedding APIs are convenient but come with three fundamental problems: latency (100-300 ms per call), cost (accumulates with volume), and privacy (your data is sent to a third party). For applications that generate thousands of embeddings daily, or that handle sensitive data, or that need sub-50ms search latency, cloud APIs are a bottleneck.

FastEmbed solves all three problems. It is an open-source library that runs embedding models locally, on the same machine as the FLIN runtime. No network call. No API key. No data leaving the server. A 384-dimension embedding is generated in 10-50 milliseconds, depending on text length and hardware.

FLIN integrates FastEmbed as the default local embedding provider, making it the recommended choice for production applications that need fast, private semantic search.

What FastEmbed Is

FastEmbed is an embedding inference library optimized for production use. It runs quantized ONNX models that produce high-quality embeddings at a fraction of the resource cost of full-precision models.

Key characteristics:
- Model size: 30-100 MB (vs 500 MB+ for full-precision)
- Inference time: 10-50 ms per embedding
- Memory usage: 100-300 MB at runtime
- Accuracy: >95% of full-precision model quality
- Dependencies: ONNX Runtime only

The models are downloaded once and cached locally. After the first run, there is no network dependency.

Configuration

Enabling FastEmbed in FLIN:

// flin.config
ai {
    embedding {
        provider = "fastembed"
        model = "BAAI/bge-small-en-v1.5"    // 384 dimensions, 33 MB
    }
}

Available models:

Model	Dimensions	Size	Quality	Speed
`BAAI/bge-small-en-v1.5`	384	33 MB	Good	Fast
`BAAI/bge-base-en-v1.5`	768	110 MB	Better	Medium
`BAAI/bge-large-en-v1.5`	1024	335 MB	Best	Slower
`sentence-transformers/all-MiniLM-L6-v2`	384	23 MB	Good	Fastest

For most applications, bge-small-en-v1.5 provides the best balance of quality and speed. The 384-dimension vectors are small enough to index efficiently while capturing enough semantic information for accurate search.
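To put a number on "small enough to index efficiently", here is a rough sketch of the raw storage cost of the vectors themselves. It assumes each component is stored as an `f32` (4 bytes), which matches the `Vec<f32>` the provider returns, and it ignores whatever the HNSW index adds on top. The function name is illustrative and not part of FLIN.

```rust
/// Raw storage, in bytes, for `count` embeddings of `dims` f32 components.
/// Index overhead (for example the HNSW graph) is not included.
fn raw_vector_bytes(dims: usize, count: usize) -> usize {
    dims * std::mem::size_of::<f32>() * count
}
```

For 10,000 products this gives 15,360,000 bytes (about 15.4 MB) at 384 dimensions and 40,960,000 bytes (about 41 MB) at 1024 dimensions.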

Integration with semantic text

When FastEmbed is configured, semantic text fields use it automatically:

entity Product {
    name: text
    description: semantic text    // Uses FastEmbed for embedding
}

product = Product {
    name: "Ergonomic Office Chair",
    description: "Adjustable lumbar support with breathable mesh back..."
}
save product  // Embedding generated locally via FastEmbed

The switch from cloud embeddings to FastEmbed is transparent. The save operation calls FastEmbed instead of an API. The search keyword uses the same HNSW index. The developer code does not change.
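One way this kind of transparency can be achieved is to put every provider behind a common trait, so the save path never knows which one it is calling. The sketch below is only an illustration of that idea: the trait, both structs, and `embed_on_save` are hypothetical names, and the providers return placeholder vectors rather than running a model or calling an API.

```rust
/// Hypothetical abstraction: anything that turns text into a vector.
trait EmbeddingProvider {
    fn embed(&self, text: &str) -> Result<Vec<f32>, String>;
}

/// Stand-in for a local provider such as FastEmbed (no real model here).
struct LocalProvider {
    dims: usize,
}

/// Stand-in for a cloud provider (no real network call here).
struct CloudProvider {
    dims: usize,
}

impl EmbeddingProvider for LocalProvider {
    fn embed(&self, text: &str) -> Result<Vec<f32>, String> {
        // Placeholder vector; a real provider would run the ONNX model.
        Ok(vec![text.len() as f32; self.dims])
    }
}

impl EmbeddingProvider for CloudProvider {
    fn embed(&self, text: &str) -> Result<Vec<f32>, String> {
        // Placeholder vector; a real provider would call a remote API.
        Ok(vec![text.len() as f32; self.dims])
    }
}

/// The save path only sees the trait, so swapping providers does not touch it.
fn embed_on_save(
    provider: &dyn EmbeddingProvider,
    description: &str,
) -> Result<Vec<f32>, String> {
    provider.embed(description)
}
```

Calling `embed_on_save` with either provider goes through the same code path; only the value passed as `provider` changes.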

Implementation

The FastEmbed integration in the FLIN runtime:

use std::path::PathBuf;

use fastembed::{TextEmbedding, InitOptions, EmbeddingModel};

pub struct FastEmbedProvider {
    model: TextEmbedding,
    model_name: String,
}

impl FastEmbedProvider {
    pub fn new(model_name: &str) -> Result<Self, EmbeddingError> {
        let model = TextEmbedding::try_new(InitOptions {
            model_name: parse_model(model_name),
            show_download_progress: true,
            cache_dir: Some(PathBuf::from(".flindb/models/")),
            ..Default::default()
        })?;

        Ok(Self {
            model,
            model_name: model_name.to_string(),
        })
    }

    pub fn embed(&self, text: &str) -> Result<Vec<f32>, EmbeddingError> {
        let documents = vec![text.to_string()];
        let embeddings = self.model.embed(documents, None)?;
        Ok(embeddings.into_iter().next().unwrap())
    }

    pub fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, EmbeddingError> {
        self.model.embed(texts.to_vec(), None)
            .map_err(EmbeddingError::FastEmbed)
    }
}

fn parse_model(name: &str) -> EmbeddingModel {
    match name {
        "BAAI/bge-small-en-v1.5" => EmbeddingModel::BGESmallENV15,
        "BAAI/bge-base-en-v1.5" => EmbeddingModel::BGEBaseENV15,
        "BAAI/bge-large-en-v1.5" => EmbeddingModel::BGELargeENV15,
        "sentence-transformers/all-MiniLM-L6-v2" => EmbeddingModel::AllMiniLML6V2,
        _ => EmbeddingModel::BGESmallENV15, // Default
    }
}
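Note that `parse_model` falls back to bge-small for any unrecognised name, so a typo in flin.config would go unnoticed. If a configuration error is preferable to a silent fallback, a stricter variant could look like the sketch below. The `Model` enum is a local stand-in for the four models in the table so the example compiles on its own, and `parse_model_strict` is a hypothetical name.

```rust
/// Local stand-in for the four models listed in the table;
/// the real enum lives in the fastembed crate.
#[derive(Debug, PartialEq)]
enum Model {
    BgeSmallEn,
    BgeBaseEn,
    BgeLargeEn,
    AllMiniLmL6,
}

/// Returns an error for unknown names instead of silently falling back.
fn parse_model_strict(name: &str) -> Result<Model, String> {
    match name {
        "BAAI/bge-small-en-v1.5" => Ok(Model::BgeSmallEn),
        "BAAI/bge-base-en-v1.5" => Ok(Model::BgeBaseEn),
        "BAAI/bge-large-en-v1.5" => Ok(Model::BgeLargeEn),
        "sentence-transformers/all-MiniLM-L6-v2" => Ok(Model::AllMiniLmL6),
        other => Err(format!("unknown embedding model: {other}")),
    }
}
```

The trade-off is strictness versus convenience: the fallback keeps a misconfigured app running, while the strict version surfaces the mistake at startup.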

Batch Embedding for Imports

When importing existing data, generating embeddings one at a time would be slow. FastEmbed supports batch processing:

// Import 10,000 products with embeddings
products = load_csv("products.csv")

for batch in products.chunks(100) {
    for product in batch {
        save Product {
            name: product.name,
            description: product.description  // Batched embedding
        }
    }
}

The FLIN runtime detects batch save operations and groups the embedding calls:

pub fn embed_batch_on_save(
    provider: &FastEmbedProvider,
    entities: &mut [Entity],
    semantic_fields: &[&str],
) -> Result<(), EmbeddingError> {
    for field_name in semantic_fields {
        let texts: Vec<String> = entities.iter()
            .map(|e| e.get_text(field_name).to_string())
            .collect();

        let embeddings = provider.embed_batch(&texts)?;

        for (entity, embedding) in entities.iter_mut().zip(embeddings) {
            entity.set_embedding(field_name, embedding);
        }
    }
    Ok(())
}

Batch embedding is approximately 5x faster than individual embedding calls due to reduced overhead per invocation.
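The saving comes from making fewer calls. The sketch below shows the chunking idea in isolation: it splits the input into fixed-size chunks and invokes a batch function once per chunk, counting the calls. `embed_in_chunks` is a hypothetical helper, and the `embed_batch` closure stands in for a real provider so the example needs no model.

```rust
/// Embeds `texts` in chunks of `batch_size`, calling `embed_batch` once per chunk.
/// Returns the embeddings in input order plus the number of calls made.
fn embed_in_chunks<F>(
    texts: &[String],
    batch_size: usize,
    mut embed_batch: F,
) -> Result<(Vec<Vec<f32>>, usize), String>
where
    F: FnMut(&[String]) -> Result<Vec<Vec<f32>>, String>,
{
    if batch_size == 0 {
        return Err("batch_size must be greater than zero".to_string());
    }
    let mut out = Vec::with_capacity(texts.len());
    let mut calls = 0;
    for chunk in texts.chunks(batch_size) {
        let vectors = embed_batch(chunk)?;
        // Guard against a provider that drops or adds results.
        if vectors.len() != chunk.len() {
            return Err("provider returned the wrong number of embeddings".to_string());
        }
        out.extend(vectors);
        calls += 1;
    }
    Ok((out, calls))
}
```

With 250 texts and a batch size of 100, the provider is invoked 3 times instead of 250.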

Model Download and Caching

The first time a FastEmbed model is used, it is downloaded from Hugging Face and cached in .flindb/models/:

.flindb/
  models/
    BAAI--bge-small-en-v1.5/
      model.onnx           (33 MB)
      tokenizer.json       (400 KB)
      config.json           (1 KB)

Subsequent uses load from cache. The download progress is displayed in the FLIN development server console:

[FastEmbed] Downloading BAAI/bge-small-en-v1.5... 33.2 MB
[FastEmbed] Model cached at .flindb/models/BAAI--bge-small-en-v1.5/
[FastEmbed] Ready. First embedding: 12ms

For deployment, the model files should be included in the application bundle or pre-downloaded in the deployment script. FLIN will not attempt to download models in production if the cache directory already contains them.
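A deployment script can verify the cache before the application starts. The sketch below assumes the directory layout and file names shown in the listing above (the `/` in the model name replaced by `--`, and the three files `model.onnx`, `tokenizer.json`, `config.json`). Both function names are hypothetical and this is not FLIN's own check.

```rust
use std::path::{Path, PathBuf};

/// Files expected in a model's cache directory, per the layout shown above.
const EXPECTED_FILES: [&str; 3] = ["model.onnx", "tokenizer.json", "config.json"];

/// Maps a model name to its cache directory,
/// e.g. "BAAI/bge-small-en-v1.5" -> "<root>/BAAI--bge-small-en-v1.5".
fn model_cache_dir(cache_root: &Path, model_name: &str) -> PathBuf {
    cache_root.join(model_name.replace('/', "--"))
}

/// Returns the expected files that are missing from the cache.
/// An empty result means the model can load without a download.
fn missing_model_files(cache_root: &Path, model_name: &str) -> Vec<&'static str> {
    let dir = model_cache_dir(cache_root, model_name);
    EXPECTED_FILES
        .iter()
        .copied()
        .filter(|file| !dir.join(*file).is_file())
        .collect()
}
```

A deployment step could call `missing_model_files(Path::new(".flindb/models"), "BAAI/bge-small-en-v1.5")` and fail the build if the result is not empty.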

Benchmarks: FastEmbed vs Cloud APIs

Metric	FastEmbed (local)	OpenAI API	Cohere API
Latency (single)	12 ms	150 ms	120 ms
Latency (batch 100)	180 ms	800 ms	600 ms
Cost per 1M embeddings	$0 (hardware only)	$0.02-$0.13	$0.10
Privacy	Full (no data sent)	Data sent to OpenAI	Data sent to Cohere
Offline capable	Yes	No	No
Accuracy (MTEB avg)	0.62 (small)	0.63 (ada-002)	0.64 (v3)

FastEmbed comes within 2-3% of cloud API quality on the MTEB average while being roughly 10x faster on single calls (about 3-4x faster on batches of 100) and completely private.
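The ratios behind that summary follow directly from the table. The two helpers below are a minimal sketch of the arithmetic; their names are illustrative.

```rust
/// How many times faster the local call is than the remote one.
fn speedup(remote_ms: f64, local_ms: f64) -> f64 {
    remote_ms / local_ms
}

/// Relative score gap of `local` below `reference`, as a percentage.
fn quality_gap_percent(reference: f64, local: f64) -> f64 {
    (reference - local) / reference * 100.0
}
```

With the table's figures, single calls come out at 12.5x (150 ms vs 12 ms) and 10x (120 ms vs 12 ms), batches of 100 at about 4.4x and 3.3x, and the MTEB gap at about 1.6% against ada-002 and 3.1% against Cohere v3.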

Hybrid Approach

FLIN supports using different embedding providers for different purposes:

ai {
    // FastEmbed for semantic text fields (fast, private)
    embedding {
        provider = "fastembed"
        model = "BAAI/bge-small-en-v1.5"
    }

    // Cloud API for Intent Engine (needs LLM, not just embeddings)
    provider = "openai"
    model = "gpt-4o-mini"
    api_key = env("OPENAI_API_KEY")
}

Semantic search uses FastEmbed (local, fast). The Intent Engine uses the cloud LLM (for natural language understanding). This hybrid approach gives the best of both worlds: fast search with private data, and powerful intent translation when needed.

Multilingual Embeddings

For applications serving multilingual content (common in Africa, where users switch between French, English, and local languages), support for multilingual embedding models is planned. The current configuration is English-only:

ai {
    embedding {
        provider = "fastembed"
        model = "BAAI/bge-small-en-v1.5"  // English
        // Future: BAAI/bge-m3 for multilingual
    }
}

The BGE-M3 model (when supported) handles over 100 languages in a single embedding space. A search for "chaise de bureau confortable" (French) would find products described in English as "comfortable office chair" because the meanings map to the same vector region.
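"The same vector region" can be made concrete with a similarity measure. The sketch below assumes cosine similarity, a common choice for comparing embeddings; the article does not state which metric FLIN's index uses. Real embeddings have hundreds of dimensions, so any vectors passed in a quick experiment would be toy values rather than model output.

```rust
/// Cosine similarity of two vectors.
/// Returns None if the lengths differ or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

Two texts with the same meaning in different languages should score close to 1 under a multilingual model, while unrelated texts should score much lower.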

Why Local Embeddings Matter for Africa

Two practical reasons make local embeddings essential for FLIN's target market:

Internet reliability. Many African developers work with intermittent connectivity. A cloud-dependent embedding pipeline means semantic search stops working when the internet drops. FastEmbed works offline.

Data sovereignty. Enterprise customers in regulated industries (banking, healthcare, government) require that data does not leave their infrastructure. Local embeddings satisfy this requirement without sacrificing functionality.

FastEmbed transforms semantic search from a cloud dependency into a local capability. Once cached, the embedding model is as much a part of the FLIN runtime as the HTTP server or the database engine -- always available, always fast, always private.

In the next article, we explore RAG (Retrieval-Augmented Generation) -- how FLIN combines semantic search with LLM generation to answer questions from your application's data.

This is Part 119 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.

Series Navigation:
- [118] AI Gateway: 8 Providers, One API
- [119] FastEmbed Integration for Embeddings (you are here)
- [120] RAG: Retrieval, Reranking, and Source Attribution
- [121] Document Parsing: PDF, DOCX, CSV, JSON, YAML
