Fixed-size chunking
📖 4 min read · Updated 2026-04-18
Fixed-size chunking splits text into equal-sized pieces by character count or token count. It's the simplest strategy, the default in almost every RAG framework, and a perfectly reasonable baseline for many corpora. It's also the first thing to reconsider when your retrieval underperforms.
How it works
- Pick a chunk size (e.g., 512 tokens)
- Pick an overlap (e.g., 64 tokens)
- Iterate through the document, emitting chunks of the target size with overlap between consecutive chunks
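The loop above can be sketched in a few lines. This is a minimal illustration, not a production implementation: it assumes you have already tokenized the text (into token IDs or strings) with whatever tokenizer matches your embedding model.

```python
def chunk_fixed(tokens, chunk_size=512, overlap=64):
    """Split a token sequence into fixed-size chunks with overlap."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # final chunk already covers the tail; avoid a tiny leftover
    return chunks
```

With `chunk_size=512, overlap=64` the window advances 448 tokens per chunk, so each chunk shares its first 64 tokens with the previous one.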
Character-based vs token-based
Character-based is simpler and faster. Token-based is more accurate because embedding models have token context limits, not character limits.
- English prose: ~4 characters per token, so 2000 characters ≈ 500 tokens
- Code: ~3 characters per token
- Non-English or heavy punctuation: varies widely
Use token-based chunking in production. Character-based is fine for prototyping.
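When prototyping with character counts, the ratios above give a quick token estimate. A rough helper (the 4-chars-per-token default is the English-prose heuristic from the list, not an exact conversion):

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate: ~4 chars/token for English prose, ~3 for code."""
    return max(1, round(len(text) / chars_per_token))
```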
The sweet spots
- Short, factual content (FAQs, definitions): 100-200 tokens
- Typical prose: 400-600 tokens
- Reasoning-heavy technical content: 600-1000 tokens
- Narrative or contextual content: 800-1200 tokens
Overlap is typically 10-15% of chunk size.
When fixed-size works well
- Uniform content type (e.g., all blog posts, all similar-length docs)
- Prose-dominant content without strong structure
- Early prototyping before you know query patterns
- Corpora where structural metadata is unreliable
When fixed-size breaks
- Mixed content (prose + tables + code): size that works for one doesn't work for others
- Strong document structure (headings, sections) where boundaries carry meaning
- Content where semantic boundaries matter (legal clauses, code functions)
- Very short content (individual FAQs that shouldn't be combined)
The implementation details that matter
Don't split mid-word
Naive character splitting can break "chunk" into "chu" and "nk". Always snap to word boundaries.
Prefer sentence boundaries
Even with fixed-size targets, snap to the nearest sentence end (or paragraph end) within a tolerance window. Keeps each chunk a complete thought.
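One way to implement the snap, sketched with a deliberately simple sentence rule (punctuation followed by whitespace); a real pipeline would want a proper sentence segmenter for abbreviations, quotes, and so on:

```python
import re

def snap_to_sentence_end(text, target, tolerance=100):
    """Return a split index at or before `target`, preferring a sentence end
    within `tolerance` characters, then a word boundary, then `target` itself."""
    lo = max(0, target - tolerance)
    window = text[lo:target]
    # naive sentence-end rule: ., !, or ? followed by whitespace
    ends = list(re.finditer(r"[.!?]\s", window))
    if ends:
        return lo + ends[-1].end()
    # fall back to the nearest word boundary
    space = text.rfind(" ", lo, target)
    return space + 1 if space != -1 else target
```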
Handle short documents
If a document is shorter than your chunk size, don't pad it or drop it. Emit one chunk with the full content.
Deterministic IDs
Chunk IDs should be stable across re-ingestions. Use hash(document_id + chunk_position) so re-ingesting the same document yields the same IDs, letting you do proper upserts instead of accumulating duplicates.
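A minimal version with the standard library (SHA-256 truncated to 16 hex characters is one reasonable choice, not a requirement):

```python
import hashlib

def chunk_id(document_id, position):
    """Deterministic chunk ID: same document + position always hashes the same."""
    raw = f"{document_id}:{position}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]
```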
The experiment to run
For any non-trivial corpus, test three chunk sizes (e.g., 256, 512, 1024 tokens) against the same eval set. Often the best size is 30-50% off from your gut estimate. The only way to know is to measure.
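A sweep harness might look like the following. `build_index` and `evaluate` are hypothetical hooks standing in for your own ingestion and scoring code (e.g., recall@k over the eval set); the 12.5% overlap is just the mid-range heuristic from above.

```python
def sweep_chunk_sizes(corpus, eval_set, build_index, evaluate,
                      sizes=(256, 512, 1024)):
    """Try each chunk size against the same eval set; return (best_size, scores)."""
    scores = {}
    for size in sizes:
        index = build_index(corpus, chunk_size=size, overlap=size // 8)
        scores[size] = evaluate(index, eval_set)  # higher is better
    best = max(scores, key=scores.get)
    return best, scores
```

Keep the eval set fixed across runs; otherwise the sizes aren't comparable.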
Next: Semantic chunking.