
#119 -- FastEmbed Integration for Embeddings

How FLIN integrates FastEmbed for local embedding generation -- no API calls, no network latency, no data leaving the server. Privacy-first semantic search from about 10 ms per embedding.

Juste A. Gnimavo (Thales) & Claude | March 26, 2026 | 7 min

Cloud-based embedding APIs are convenient but come with three fundamental problems: latency (100-300 ms per call), cost (accumulates with volume), and privacy (your data is sent to a third party). For applications that generate thousands of embeddings daily, or that handle sensitive data, or that need sub-50ms search latency, cloud APIs are a bottleneck.

FastEmbed solves all three problems. It is an open-source library that runs embedding models locally, on the same machine as the FLIN runtime. No network call. No API key. No data leaving the server. A 384-dimension embedding is generated in 10-50 milliseconds, depending on text length and hardware.

FLIN integrates FastEmbed as the default local embedding provider, making it the recommended choice for production applications that need fast, private semantic search.

What FastEmbed Is

FastEmbed is an embedding inference library optimized for production use. It runs quantized ONNX models that produce high-quality embeddings at a fraction of the resource cost of full-precision models.

Key characteristics:

- Model size: 30-100 MB (vs 500 MB+ for full-precision)
- Inference time: 10-50 ms per embedding
- Memory usage: 100-300 MB at runtime
- Accuracy: >95% of full-precision model quality
- Dependencies: ONNX Runtime only

The models are downloaded once and cached locally. After the first run, there is no network dependency.

Configuration

Enabling FastEmbed in FLIN:

// flin.config
ai {
    embedding {
        provider = "fastembed"
        model = "BAAI/bge-small-en-v1.5"    // 384 dimensions, 33 MB
    }
}

Available models:

| Model | Dimensions | Size | Quality | Speed |
|---|---|---|---|---|
| BAAI/bge-small-en-v1.5 | 384 | 33 MB | Good | Fast |
| BAAI/bge-base-en-v1.5 | 768 | 110 MB | Better | Medium |
| BAAI/bge-large-en-v1.5 | 1024 | 335 MB | Best | Slower |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | 23 MB | Good | Fastest |

For most applications, bge-small-en-v1.5 provides the best balance of quality and speed. The 384-dimension vectors are small enough to index efficiently while capturing enough semantic information for accurate search.
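To make the size trade-off concrete, the raw storage cost of the vectors can be estimated from the dimension count. This is a rough sketch that assumes each component is stored as a 4-byte f32 (the provider code later in this article returns Vec<f32>) and ignores any HNSW graph overhead. The function names are illustrative and not part of FLIN.

```rust
/// Bytes needed to store one embedding, assuming 4-byte f32 components.
fn vector_bytes(dimensions: usize) -> usize {
    dimensions * std::mem::size_of::<f32>()
}

/// Raw vector storage for `count` embeddings, excluding index overhead.
fn raw_index_bytes(dimensions: usize, count: usize) -> usize {
    vector_bytes(dimensions) * count
}
```

Under these assumptions, 100,000 documents need about 154 MB of raw vector storage at 384 dimensions (raw_index_bytes(384, 100_000) is 153,600,000 bytes) and about 410 MB at 1024 dimensions.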

Integration with semantic text

When FastEmbed is configured, semantic text fields use it automatically:

entity Product {
    name: text
    description: semantic text    // Uses FastEmbed for embedding
}

product = Product {
    name: "Ergonomic Office Chair",
    description: "Adjustable lumbar support with breathable mesh back..."
}
save product  // Embedding generated locally via FastEmbed

The switch from cloud embeddings to FastEmbed is transparent. The save operation calls FastEmbed instead of an API. The search keyword uses the same HNSW index. The developer code does not change.
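Whichever provider produces the vectors, search ultimately ranks results by how close two embeddings are. The sketch below shows cosine similarity, a common metric for BGE-style embeddings. Which distance metric FLIN's HNSW index actually uses is not stated in this article, so treat the choice of cosine as an assumption.

```rust
/// Cosine similarity between two embeddings of equal length.
/// Returns None when the lengths differ, the vectors are empty,
/// or either vector has zero magnitude.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

Identical directions score close to 1.0 and orthogonal vectors score close to 0.0. Because the comparison only works within one embedding space, vectors produced by different models should not be mixed in the same index.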

Implementation

The FastEmbed integration in the FLIN runtime:

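use std::path::PathBuf;

use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};

// EmbeddingError is not shown here; its FastEmbed variant wraps errors
// returned by the fastembed crate.

/// Local embedding provider backed by a FastEmbed ONNX model.
pub struct FastEmbedProvider {
    model: TextEmbedding,
    model_name: String,
}

impl FastEmbedProvider {
    /// Loads the model, downloading it into the cache directory on first use.
    pub fn new(model_name: &str) -> Result<Self, EmbeddingError> {
        // Depending on the fastembed crate version, InitOptions may need to be
        // built with its builder methods instead of a struct literal.
        let model = TextEmbedding::try_new(InitOptions {
            model_name: parse_model(model_name),
            show_download_progress: true,
            cache_dir: Some(PathBuf::from(".flindb/models/")),
            ..Default::default()
        })
        .map_err(EmbeddingError::FastEmbed)?;

        Ok(Self {
            model,
            model_name: model_name.to_string(),
        })
    }

    /// Embeds a single text by sending a one-element batch.
    pub fn embed(&self, text: &str) -> Result<Vec<f32>, EmbeddingError> {
        let documents = vec![text.to_string()];
        let embeddings = self
            .model
            .embed(documents, None)
            .map_err(EmbeddingError::FastEmbed)?;
        // One input document is expected to yield exactly one embedding.
        Ok(embeddings
            .into_iter()
            .next()
            .expect("FastEmbed returned no embedding for a single input"))
    }

    /// Embeds many texts in one call; None selects the default batch size.
    pub fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, EmbeddingError> {
        self.model
            .embed(texts.to_vec(), None)
            .map_err(EmbeddingError::FastEmbed)
    }
}

/// Maps a configured model name to a FastEmbed model.
/// Unrecognized names fall back to the small BGE model.
fn parse_model(name: &str) -> EmbeddingModel {
    match name {
        "BAAI/bge-small-en-v1.5" => EmbeddingModel::BGESmallENV15,
        "BAAI/bge-base-en-v1.5" => EmbeddingModel::BGEBaseENV15,
        "BAAI/bge-large-en-v1.5" => EmbeddingModel::BGELargeENV15,
        "sentence-transformers/all-MiniLM-L6-v2" => EmbeddingModel::AllMiniLML6V2,
        _ => EmbeddingModel::BGESmallENV15, // Default
    }
}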
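Because parse_model falls back to the small model for any unrecognized name, a typo in flin.config would silently select a different model than intended. One possible alternative, shown here as a sketch rather than what FLIN does, is to reject unknown names. To stay compilable without the fastembed crate, it uses a stand-in enum; ModelChoice, UnknownModel, parse_model_strict, and dimensions are all hypothetical names.

```rust
use std::fmt;

/// Stand-in for fastembed's EmbeddingModel, limited to the four models listed earlier.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ModelChoice {
    BgeSmallEn,
    BgeBaseEn,
    BgeLargeEn,
    AllMiniLmL6,
}

/// Hypothetical error for a model name the runtime does not recognize.
#[derive(Debug, PartialEq, Eq)]
struct UnknownModel(String);

impl fmt::Display for UnknownModel {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "unknown embedding model: {}", self.0)
    }
}

impl std::error::Error for UnknownModel {}

/// Maps a configured model name to a model, rejecting anything unrecognized.
fn parse_model_strict(name: &str) -> Result<ModelChoice, UnknownModel> {
    match name {
        "BAAI/bge-small-en-v1.5" => Ok(ModelChoice::BgeSmallEn),
        "BAAI/bge-base-en-v1.5" => Ok(ModelChoice::BgeBaseEn),
        "BAAI/bge-large-en-v1.5" => Ok(ModelChoice::BgeLargeEn),
        "sentence-transformers/all-MiniLM-L6-v2" => Ok(ModelChoice::AllMiniLmL6),
        other => Err(UnknownModel(other.to_string())),
    }
}

/// Output dimensions for each model, as listed in the model table.
fn dimensions(model: ModelChoice) -> usize {
    match model {
        ModelChoice::BgeSmallEn | ModelChoice::AllMiniLmL6 => 384,
        ModelChoice::BgeBaseEn => 768,
        ModelChoice::BgeLargeEn => 1024,
    }
}
```

Failing at startup matters because vectors of different dimensions cannot share one index; knowing the dimension up front lets the runtime detect a mismatch before any embedding is written.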

Batch Embedding for Imports

When importing existing data, generating embeddings one at a time would be slow. FastEmbed supports batch processing:

// Import 10,000 products with embeddings
products = load_csv("products.csv")

for batch in products.chunks(100) {
    for product in batch {
        save Product {
            name: product.name,
            description: product.description  // Batched embedding
        }
    }
}

The FLIN runtime detects batch save operations and groups the embedding calls:

pub fn embed_batch_on_save(
    provider: &FastEmbedProvider,
    entities: &mut [Entity],
    semantic_fields: &[&str],
) -> Result<(), EmbeddingError> {
    for field_name in semantic_fields {
        let texts: Vec<String> = entities.iter()
            .map(|e| e.get_text(field_name).to_string())
            .collect();

        let embeddings = provider.embed_batch(&texts)?;

        for (entity, embedding) in entities.iter_mut().zip(embeddings) {
            entity.set_embedding(field_name, embedding);
        }
    }
    Ok(())
}

Batch embedding is approximately 5x faster than individual embedding calls due to reduced overhead per invocation.
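The grouping idea can be illustrated without FastEmbed itself. The sketch below uses a hypothetical BatchEmbedder trait and a counting stub in place of the real provider, so it shows only the control flow: one backend call per chunk, plus a length check that a plain zip would not perform. None of these names are FLIN or fastembed APIs.

```rust
/// Hypothetical abstraction over an embedding backend.
trait BatchEmbedder {
    fn embed_batch(&mut self, texts: &[String]) -> Result<Vec<Vec<f32>>, String>;
}

/// Stub backend that counts invocations and returns fixed-size zero vectors.
struct CountingStub {
    calls: usize,
    dimensions: usize,
}

impl BatchEmbedder for CountingStub {
    fn embed_batch(&mut self, texts: &[String]) -> Result<Vec<Vec<f32>>, String> {
        self.calls += 1;
        Ok(texts.iter().map(|_| vec![0.0; self.dimensions]).collect())
    }
}

/// Embeds all texts in chunks of `chunk_size`, one backend call per chunk.
fn embed_in_chunks<E: BatchEmbedder>(
    embedder: &mut E,
    texts: &[String],
    chunk_size: usize,
) -> Result<Vec<Vec<f32>>, String> {
    if chunk_size == 0 {
        return Err("chunk_size must be greater than zero".to_string());
    }
    let mut all = Vec::with_capacity(texts.len());
    for chunk in texts.chunks(chunk_size) {
        let embeddings = embedder.embed_batch(chunk)?;
        // Guard against a backend returning fewer vectors than inputs,
        // which zipping entities with embeddings would silently truncate.
        if embeddings.len() != chunk.len() {
            return Err(format!(
                "expected {} embeddings, got {}",
                chunk.len(),
                embeddings.len()
            ));
        }
        all.extend(embeddings);
    }
    Ok(all)
}
```

With a chunk size of 100, importing 10,000 texts results in 100 backend calls instead of 10,000, which is where the reduced per-invocation overhead comes from.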

Model Download and Caching

The first time a FastEmbed model is used, it is downloaded from Hugging Face and cached in .flindb/models/:

.flindb/
  models/
    BAAI--bge-small-en-v1.5/
      model.onnx           (33 MB)
      tokenizer.json       (400 KB)
      config.json           (1 KB)

Subsequent uses load from cache. The download progress is displayed in the FLIN development server console:

[FastEmbed] Downloading BAAI/bge-small-en-v1.5... 33.2 MB
[FastEmbed] Model cached at .flindb/models/BAAI--bge-small-en-v1.5/
[FastEmbed] Ready. First embedding: 12ms

For deployment, the model files should be included in the application bundle or pre-downloaded in the deployment script. FLIN will not attempt to download models in production if the cache directory already contains them.
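A deployment script or startup check could verify that the cache is complete before the server accepts traffic. The sketch below assumes the directory layout shown above ("/" in the model name replaced by "--", and the three listed files); the actual on-disk layout may differ depending on the fastembed version, and the function names are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Files expected in a cached model directory, per the layout shown above.
const REQUIRED_FILES: [&str; 3] = ["model.onnx", "tokenizer.json", "config.json"];

/// Builds the cache directory for a model, replacing "/" with "--"
/// (BAAI/bge-small-en-v1.5 becomes BAAI--bge-small-en-v1.5).
fn model_cache_dir(cache_root: &Path, model_name: &str) -> PathBuf {
    cache_root.join(model_name.replace('/', "--"))
}

/// Returns the required files that are missing from the model's cache directory.
fn missing_model_files(cache_root: &Path, model_name: &str) -> Vec<&'static str> {
    let dir = model_cache_dir(cache_root, model_name);
    REQUIRED_FILES
        .iter()
        .copied()
        .filter(|file| !dir.join(file).is_file())
        .collect()
}
```

An empty result means the model can be loaded offline; a non-empty one can be reported at deploy time rather than discovered on the first save.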

Benchmarks: FastEmbed vs Cloud APIs

| Metric | FastEmbed (local) | OpenAI API | Cohere API |
|---|---|---|---|
| Latency (single) | 12 ms | 150 ms | 120 ms |
| Latency (batch 100) | 180 ms | 800 ms | 600 ms |
| Cost per 1M embeddings | $0 (hardware only) | $0.02-$0.13 | $0.10 |
| Privacy | Full (no data sent) | Data sent to OpenAI | Data sent to Cohere |
| Offline capable | Yes | No | No |
| Accuracy (MTEB avg) | 0.62 (small) | 0.63 (ada-002) | 0.64 (v3) |

FastEmbed comes within about 2-3% of the cloud APIs on MTEB average while being at least 10x faster for single embeddings (roughly 3-4x faster for batches of 100) and completely private.
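The speed comparison follows directly from the latency rows of the benchmark table. As a small worked example (the helper names are illustrative only):

```rust
/// How many times faster `local_ms` is than `remote_ms`.
fn speedup(local_ms: f64, remote_ms: f64) -> f64 {
    remote_ms / local_ms
}

/// Average per-item latency when `batch_ms` covers `batch_size` items.
fn per_item_ms(batch_ms: f64, batch_size: u32) -> f64 {
    batch_ms / f64::from(batch_size)
}
```

For single embeddings, speedup(12.0, 150.0) gives 12.5 against OpenAI and speedup(12.0, 120.0) gives 10.0 against Cohere. For batches of 100 the gap narrows to about 4.4 and 3.3. Locally, per_item_ms(180.0, 100) is 1.8 ms per embedding, compared with 12 ms for a single call.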

Hybrid Approach

FLIN supports using different embedding providers for different purposes:

ai {
    // FastEmbed for semantic text fields (fast, private)
    embedding {
        provider = "fastembed"
        model = "BAAI/bge-small-en-v1.5"
    }

    // Cloud API for Intent Engine (needs LLM, not just embeddings)
    provider = "openai"
    model = "gpt-4o-mini"
    api_key = env("OPENAI_API_KEY")
}

Semantic search uses FastEmbed (local, fast). The Intent Engine uses the cloud LLM (for natural language understanding). This hybrid approach gives the best of both worlds: fast search with private data, and powerful intent translation when needed.

Multilingual Embeddings

For applications serving multilingual content (common in Africa, where users switch between French, English, and local languages), multilingual embedding models are planned. For now, the configuration targets an English model:

ai {
    embedding {
        provider = "fastembed"
        model = "BAAI/bge-small-en-v1.5"  // English
        // Future: BAAI/bge-m3 for multilingual
    }
}

The BGE-M3 model (when supported) handles over 100 languages in a single embedding space. A search for "chaise de bureau confortable" (French) would find products described in English as "comfortable office chair" because the meanings map to the same vector region.

Why Local Embeddings Matter for Africa

Two practical reasons make local embeddings essential for FLIN's target market:

Internet reliability. Many African developers work with intermittent connectivity. A cloud-dependent embedding pipeline means semantic search stops working when the internet drops. FastEmbed works offline.

Data sovereignty. Enterprise customers in regulated industries (banking, healthcare, government) require that data does not leave their infrastructure. Local embeddings satisfy this requirement without sacrificing functionality.

FastEmbed transforms semantic search from a cloud dependency into a local capability. Once cached, the embedding model is as much a part of the FLIN runtime as the HTTP server or the database engine -- always available, always fast, always private.

In the next article, we explore RAG (Retrieval-Augmented Generation) -- how FLIN combines semantic search with LLM generation to answer questions from your application's data.


This is Part 119 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.

Series Navigation:

- [118] AI Gateway: 8 Providers, One API
- [119] FastEmbed Integration for Embeddings (you are here)
- [120] RAG: Retrieval, Reranking, and Source Attribution
- [121] Document Parsing: PDF, DOCX, CSV, JSON, YAML
