Embedding models have a fixed context window. Most models accept 512 tokens (approximately 400 words). Some accept 8,192 tokens. None accept an entire 50-page document as input. To embed a long document, you must split it into chunks that fit the model's context window.
Naive chunking -- splitting every 500 characters regardless of content -- produces terrible embeddings. A chunk that starts in the middle of a sentence and ends in the middle of a code block has no coherent meaning. The embedding is a blurred average of two unrelated topics. Search queries that match either topic will score poorly because the embedding does not strongly represent either one.
FLIN's chunk_text() function is aware of document structure. It splits text at semantic boundaries: paragraph breaks, heading boundaries, code block delimiters, and sentence endings. The result is chunks that each represent a single coherent idea, producing focused embeddings that retrieve accurately.
The chunk_text() Function
```flin
chunks = chunk_text(text, {
  max_size: 500,       // Maximum characters per chunk
  overlap: 50,         // Characters of overlap between chunks
  strategy: "semantic" // "fixed", "paragraph", "semantic", "code"
})
```

Each chunk contains:
```flin
chunk.text     // The chunk content
chunk.position // Start position in the original document
chunk.page     // Page number (if available from parsing)
chunk.index    // Sequential chunk index
```
Chunking Strategies
Fixed Size
The simplest strategy. Splits at exact character boundaries:
```flin
chunks = chunk_text(text, { max_size: 500, strategy: "fixed" })
```

Use this only for unstructured text with no headings, code blocks, or clear paragraph boundaries. It is fast but produces the lowest-quality embeddings.
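For reference, a fixed-size splitter takes only a few lines. This is an illustrative Rust sketch in the style of the implementations below, not FLIN's internals; it steps back by `overlap` characters between windows:

```rust
/// Minimal fixed-size chunker sketch. It counts characters and ignores
/// structure entirely -- which is exactly why fixed chunking can cut a
/// sentence or even a word in half.
fn chunk_fixed(text: &str, max_size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < max_size, "overlap must be smaller than max_size");
    let chars: Vec<char> = text.chars().collect(); // avoid slicing mid-UTF-8
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + max_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start = end - overlap; // step back to create the overlap window
    }
    chunks
}
```

Run this over any normal sentence with a small window and the middle chunks typically begin or end mid-word, which is the quality problem the structured strategies below exist to avoid.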
Paragraph-Aware
Splits at paragraph boundaries (double newlines), keeping paragraphs intact when possible:
```flin
chunks = chunk_text(text, { max_size: 500, strategy: "paragraph" })
```

```rust
fn chunk_by_paragraph(text: &str, max_size: usize, overlap: usize) -> Vec<Chunk> {
    let paragraphs: Vec<&str> = text.split("\n\n").collect();
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut position = 0;

    for paragraph in paragraphs {
        // Flush the current chunk if adding this paragraph would overflow it
        if current.len() + paragraph.len() + 2 > max_size && !current.is_empty() {
            chunks.push(Chunk {
                text: current.trim().to_string(),
                position,
                index: chunks.len(),
            });
            // Overlap: keep the last `overlap` characters
            let overlap_start = current.len().saturating_sub(overlap);
            current = current[overlap_start..].to_string();
            position += overlap_start;
        }
        if !current.is_empty() {
            current.push_str("\n\n");
        }
        current.push_str(paragraph);
    }

    if !current.is_empty() {
        chunks.push(Chunk {
            text: current.trim().to_string(),
            position,
            index: chunks.len(),
        });
    }
    chunks
}
```

Semantic (Default)
The most sophisticated strategy. Respects headings, paragraphs, lists, and natural semantic boundaries:
```flin
chunks = chunk_text(text, { max_size: 500, strategy: "semantic" })
```

The semantic chunker follows a priority hierarchy:
1. Never split in the middle of a code block. A code snippet is atomic -- splitting it produces two meaningless fragments.
2. Prefer splitting at headings. A heading marks the start of a new topic. Chunks should align with topic boundaries.
3. Prefer splitting at paragraph boundaries. Paragraphs are the natural unit of a single idea.
4. Prefer splitting at sentence boundaries. If a paragraph is too long, split between sentences rather than mid-sentence.
5. As a last resort, split at word boundaries. Never split in the middle of a word.
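The sentence-level fallback can be sketched in Rust. This is a simplified stand-in for the `split_by_sentences` helper used by the implementation below, not FLIN's actual code; a real splitter would also handle abbreviations, quoted speech, and decimal points:

```rust
/// Simplified sentence-boundary splitter sketch. Sentences are packed into
/// pieces of at most `max_size` bytes; a single sentence longer than
/// `max_size` is kept whole here (a real implementation would fall back
/// to word boundaries, per the last rule in the hierarchy).
fn split_by_sentences(paragraph: &str, max_size: usize) -> Vec<String> {
    let mut pieces = Vec::new();
    let mut current = String::new();
    let mut sentence = String::new();
    for ch in paragraph.chars() {
        sentence.push(ch);
        if matches!(ch, '.' | '!' | '?') {
            // Flush the current piece if this sentence would overflow it
            if current.len() + sentence.len() > max_size && !current.is_empty() {
                pieces.push(current.trim().to_string());
                current.clear();
            }
            current.push_str(&sentence);
            sentence.clear();
        }
    }
    // Flush any trailing text without terminal punctuation
    current.push_str(&sentence);
    if !current.trim().is_empty() {
        pieces.push(current.trim().to_string());
    }
    pieces
}
```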
```rust
fn chunk_semantic(text: &str, max_size: usize, overlap: usize) -> Vec<Chunk> {
    let blocks = parse_into_blocks(text);
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut position = 0;

    for block in blocks {
        match block {
            Block::Heading(h) => {
                // Always start a new chunk at a heading
                if !current.is_empty() {
                    chunks.push(make_chunk(&current, position, chunks.len()));
                    position += current.len();
                    current.clear();
                }
                current.push_str(&h);
                current.push('\n');
            }
            Block::Code(code) => {
                // Keep code blocks intact
                if current.len() + code.len() > max_size && !current.is_empty() {
                    chunks.push(make_chunk(&current, position, chunks.len()));
                    position += current.len();
                    current.clear();
                }
                current.push_str(&code);
                current.push('\n');
            }
            Block::Paragraph(p) => {
                if current.len() + p.len() > max_size {
                    if !current.is_empty() {
                        chunks.push(make_chunk(&current, position, chunks.len()));
                        position += current.len();
                        current.clear();
                    }
                    // If the paragraph itself exceeds max_size, split by sentence
                    if p.len() > max_size {
                        let sentence_chunks = split_by_sentences(&p, max_size);
                        for sc in sentence_chunks {
                            chunks.push(make_chunk(&sc, position, chunks.len()));
                            position += sc.len();
                        }
                        continue;
                    }
                }
                current.push_str(&p);
                current.push_str("\n\n");
            }
        }
    }

    if !current.is_empty() {
        chunks.push(make_chunk(&current, position, chunks.len()));
    }
    add_overlaps(&mut chunks, overlap);
    chunks
}
```

Code-Aware
A specialized strategy for source code and technical documentation with heavy code content:
```flin
chunks = chunk_text(text, { max_size: 500, strategy: "code" })
```

The code-aware chunker recognizes:
- Function definitions -- a function and its body stay together.
- Class/struct definitions -- a type definition stays together.
- Import blocks -- grouped imports are a single chunk.
- Comments -- doc comments are attached to their associated code.
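Splitting an oversized code block at function boundaries (the `split_code_by_functions` step referenced in the implementation below) could be approximated with a brace-depth heuristic. This is an illustrative sketch, not FLIN's implementation; real code-aware chunking would use a language-aware parser:

```rust
/// Sketch: split code into pieces only at top level (brace depth zero),
/// so a function and its body always stay together. Returns plain strings
/// for simplicity; FLIN's helper returns full chunks.
fn split_code_by_functions(code: &str, max_size: usize) -> Vec<String> {
    let mut pieces = Vec::new();
    let mut current = String::new();
    let mut depth: i32 = 0;
    for line in code.lines() {
        // Close the current piece only between top-level items
        if depth == 0 && !current.is_empty() && current.len() + line.len() > max_size {
            pieces.push(current.trim_end().to_string());
            current.clear();
        }
        current.push_str(line);
        current.push('\n');
        depth += line.matches('{').count() as i32;
        depth -= line.matches('}').count() as i32;
    }
    if !current.trim().is_empty() {
        pieces.push(current.trim_end().to_string());
    }
    pieces
}
```

Because the flush check only fires at depth zero, a function body can push a piece past `max_size`, but it is never cut in half -- the same "code is atomic" rule the semantic chunker follows.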
````rust
fn chunk_code_aware(text: &str, max_size: usize) -> Vec<Chunk> {
    let mut chunks = Vec::new();
    let mut in_code_block = false;
    let mut code_buffer = String::new();
    let mut text_buffer = String::new();

    for line in text.lines() {
        if line.starts_with("```") {
            if in_code_block {
                // End of code block
                code_buffer.push_str(line);
                code_buffer.push('\n');
                // Flush text before code
                if !text_buffer.is_empty() {
                    flush_buffer(&mut text_buffer, max_size, &mut chunks);
                }
                // Code block as single chunk (or split if very large)
                if code_buffer.len() <= max_size {
                    chunks.push(make_chunk(&code_buffer, 0, chunks.len()));
                } else {
                    // Split large code blocks by function boundaries
                    let code_chunks = split_code_by_functions(&code_buffer, max_size);
                    chunks.extend(code_chunks);
                }
                code_buffer.clear();
                in_code_block = false;
            } else {
                in_code_block = true;
                code_buffer.push_str(line);
                code_buffer.push('\n');
            }
        } else if in_code_block {
            code_buffer.push_str(line);
            code_buffer.push('\n');
        } else {
            text_buffer.push_str(line);
            text_buffer.push('\n');
        }
    }

    // Flush remaining buffers
    if !text_buffer.is_empty() {
        flush_buffer(&mut text_buffer, max_size, &mut chunks);
    }
    if !code_buffer.is_empty() {
        chunks.push(make_chunk(&code_buffer, 0, chunks.len()));
    }
    chunks
}
````

Overlap Between Chunks
Chunks can overlap to ensure that information at chunk boundaries is not lost:
```
Chunk 1: "...the user authentication system uses JWT tokens for stateless..."
Chunk 2: "...uses JWT tokens for stateless verification. The token contains..."
```

The overlap (50 characters by default) means that a query about "JWT tokens for stateless verification" will match both chunks, even though the phrase spans the boundary.
Overlap increases the total number of chunks and storage requirements, but it significantly improves retrieval for queries whose relevant text happens to straddle a chunk boundary.
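Applying overlap as a post-processing pass might look like the sketch below. The `add_overlaps` helper is called by the semantic chunker but not shown in this article, so this is an assumption about its behavior: each chunk is prefixed with the tail of its predecessor. The `Chunk` struct mirrors the fields listed earlier:

```rust
#[derive(Debug, Clone)]
struct Chunk {
    text: String,
    position: usize,
    index: usize,
}

/// Sketch of an overlap pass (illustrative, not FLIN's implementation).
/// Iterates in reverse so each chunk borrows its predecessor's ORIGINAL
/// text, not a text that has already been extended.
fn add_overlaps(chunks: &mut [Chunk], overlap: usize) {
    for i in (1..chunks.len()).rev() {
        let prev = &chunks[i - 1].text;
        // Take the trailing `overlap` characters of the previous chunk...
        let skip = prev.chars().count().saturating_sub(overlap);
        let tail: String = prev.chars().skip(skip).collect();
        // ...and prepend them to the current chunk.
        chunks[i].text = format!("{}{}", tail, chunks[i].text);
    }
}
```

Note that `position` is left untouched: it still points at where the chunk's own content starts in the original document, which is what search results need.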
Chunk Quality Metrics
FLIN provides quality metrics for chunks to help developers tune their chunking parameters:
```flin
chunks = chunk_text(text, { max_size: 500, strategy: "semantic" })

for chunk in chunks {
  log_info("Chunk {chunk.index}: {chunk.text.len} chars")
}

// Summary statistics
avg_size = chunks.map(c => c.text.len).sum / chunks.len
log_info("Average chunk size: {avg_size} characters")
log_info("Total chunks: {chunks.len}")
```

Ideal chunk sizes for embedding:
- Too small (< 100 characters): Not enough context for a meaningful embedding.
- Optimal (200-600 characters): Single coherent idea, good embedding quality.
- Too large (> 1000 characters): Multiple topics blurred together, poor retrieval precision.
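These size bands can be turned into a quick report when tuning `max_size`. A small Rust sketch (the `size_report` name is illustrative; the thresholds follow the bands listed above):

```rust
/// Count chunks falling in the "too small", "optimal", and "too large"
/// bands from the guidance above (< 100, 200-600, > 1000 characters).
/// Sizes in the gaps between bands are simply not flagged.
fn size_report(sizes: &[usize]) -> (usize, usize, usize) {
    let too_small = sizes.iter().filter(|&&s| s < 100).count();
    let optimal = sizes.iter().filter(|&&s| (200..=600).contains(&s)).count();
    let too_large = sizes.iter().filter(|&&s| s > 1000).count();
    (too_small, optimal, too_large)
}
```

A large "too small" count usually means the splitter is flushing on every heading of a heading-dense document; a large "too large" count usually means `max_size` is set above what the strategy can respect for atomic blocks like code.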
Practical Example: Indexing a Documentation Site
```flin
// Batch index all documentation
docs_dir = ".flindb/documents/"
files = list_files(docs_dir)
total_chunks = 0

for file_path in files {
  parsed = parse_document(file_path)
  doc = Document {
    title: parsed.metadata.title || file_name(file_path),
    file_path: file_path,
    format: parsed.format,
    full_text: parsed.text
  }
  save doc

  chunks = chunk_text(parsed.text, {
    max_size: 500,
    overlap: 50,
    strategy: if file_path.ends_with(".md") { "code" } else { "semantic" }
  })

  for chunk in chunks {
    save DocumentChunk {
      document_id: doc.id,
      content: chunk.text, // semantic text -- auto-embedded
      position: chunk.position,
      chunk_index: chunk.index
    }
  }

  total_chunks = total_chunks + chunks.len
  log_info("Indexed {file_path}: {chunks.len} chunks")
}
log_info("Total: {files.len} documents, {total_chunks} chunks")
```

Chunking is the bridge between raw documents and searchable embeddings. Get it wrong, and your RAG system returns irrelevant results regardless of how good the embedding model or LLM is. Get it right, and every chunk represents a focused, retrievable piece of knowledge.
In the next article, we explore hybrid document search -- combining BM25 keyword search with semantic search for the best of both worlds.
This is Part 122 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.
Series Navigation:
- [121] Document Parsing: PDF, DOCX, CSV, JSON, YAML
- [122] Code-Aware Chunking for RAG (you are here)
- [123] Hybrid Document Search: BM25 + Semantic
- [124] AI-First Language Design