Fixed-size chunking

Fixed-size chunking splits text into equal-sized pieces by character count or token count. It's the simplest strategy, the default in almost every RAG framework, and a perfectly reasonable baseline for many corpora. It's also the first thing to reconsider when your retrieval underperforms.

How it works

  1. Pick a chunk size (e.g., 512 tokens)
  2. Pick an overlap (e.g., 64 tokens)
  3. Iterate through the document, emitting chunks of the target size with overlap between consecutive chunks
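The steps above can be sketched in a few lines. This is a minimal token-level chunker, assuming the input has already been tokenized into a list; the step size is chunk size minus overlap, so each chunk shares its first `overlap` tokens with the tail of the previous one.

```python
def chunk_fixed(tokens, chunk_size=512, overlap=64):
    """Emit fixed-size chunks with overlap between consecutive chunks.

    tokens: a pre-tokenized document (list of token IDs or strings).
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk reached the end of the document
    return chunks
```

Note that a document shorter than `chunk_size` falls through the loop in a single iteration and yields one chunk containing the full content, with no padding needed.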

Character-based vs token-based

Character-based splitting is simpler and faster. Token-based splitting is more precise, because embedding models enforce token context limits, not character limits.

Use token-based chunking in production. Character-based is fine for prototyping.
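A token-based chunker only needs an `encode`/`decode` pair. The stand-in tokenizer below is a whitespace split, used here purely as an assumption so the sketch is self-contained; in production you would substitute the tokenizer that matches your embedding model (for example, tiktoken's `encode` and `decode`) so chunk sizes line up with the model's real token limits.

```python
def encode(text):
    # Stand-in tokenizer (assumption): whitespace split.
    # Swap in your embedding model's tokenizer for production use.
    return text.split()

def decode(tokens):
    # Stand-in detokenizer (assumption), inverse of encode above.
    return " ".join(tokens)

def chunk_by_tokens(text, chunk_size=512, overlap=64):
    """Split text into chunks of at most chunk_size tokens with overlap."""
    tokens = encode(text)
    step = chunk_size - overlap
    out = []
    for start in range(0, len(tokens), step):
        out.append(decode(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return out
```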

The sweet spots

Chunk size: most corpora land somewhere between 256 and 1024 tokens. Overlap: typically 10-15% of chunk size.

When fixed-size works well

Uniform prose where almost any window of text is roughly self-contained: articles, documentation, transcripts. Here fixed-size is a solid baseline that's cheap to compute and easy to tune.

When fixed-size breaks

Whenever boundaries carry meaning. Fixed-size windows ignore headings, tables, code blocks, and multi-paragraph arguments, so they routinely split a thought across two chunks and neither chunk retrieves well. That failure mode is the motivation for the structure-aware strategies that follow.

The implementation details that matter

Don't split mid-word

Naive character splitting can break "chunk" into "chu" and "nk". Always snap to word boundaries.
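One way to snap a character-level split point to a word boundary is to walk left from the target index until whitespace is found. This is a minimal sketch; the function name and tolerance behavior are illustrative choices, not a standard API.

```python
def snap_to_word_boundary(text, target_end):
    """Move a character split point left to the nearest whitespace
    so no word is cut in half."""
    if target_end >= len(text):
        return len(text)
    end = target_end
    while end > 0 and not text[end].isspace():
        end -= 1
    # If the scan hit the start of the text (one giant word),
    # fall back to the original split point.
    return end if end > 0 else target_end
```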

Prefer sentence boundaries

Even with fixed-size targets, snap to the nearest sentence end (or paragraph end) within a tolerance window. Keeps each chunk a complete thought.
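Sentence snapping can be sketched the same way: scan a window around the target split point for a sentence terminator and prefer it over the raw cut. The terminator set and window semantics below are simplifying assumptions; real text needs a proper sentence segmenter to handle abbreviations, decimals, and quotes.

```python
def snap_to_sentence_end(text, target_end, tolerance=50):
    """Find a sentence terminator within +/- tolerance characters of the
    target split point; fall back to the raw target if none exists.

    Assumption: '.', '!', '?' mark sentence ends (no abbreviation handling).
    """
    lo = max(0, target_end - tolerance)
    hi = min(len(text), target_end + tolerance)
    best = -1
    for i in range(lo, hi):
        if text[i] in ".!?":
            best = i  # keep the last terminator inside the window
    return best + 1 if best >= 0 else target_end
```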

Handle short documents

If a document is shorter than your chunk size, don't pad it or drop it. Emit one chunk with the full content.

Deterministic IDs

Chunk IDs should be stable across re-ingestions. Use hash(document_id + chunk_position) so the same content always gets the same chunk ID and you can do proper upserts.
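The `hash(document_id + chunk_position)` idea can be sketched with a cryptographic hash from the standard library; the separator and 16-hex-character truncation here are illustrative choices, not requirements.

```python
import hashlib

def chunk_id(document_id, chunk_position):
    """Deterministic chunk ID: the same document and position always
    hash to the same ID, so re-ingestion upserts instead of duplicating."""
    raw = f"{document_id}:{chunk_position}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]
```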

The experiment to run

For any non-trivial corpus, test three chunk sizes (e.g., 256, 512, 1024 tokens) against the same eval set. Often the best size is 30-50% off from your gut estimate. The only way to know is to measure.
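The sweep amounts to a short loop. `build_index` and `evaluate` below are hypothetical placeholders for your own ingestion and retrieval-eval pipeline (e.g., recall@k against the eval set); the overlap of one-eighth of chunk size matches the 10-15% guideline above.

```python
def build_index(corpus, chunk_size, overlap):
    # Placeholder (assumption): your real chunk-and-embed pipeline goes here.
    return {"chunk_size": chunk_size, "overlap": overlap}

def evaluate(index, eval_set):
    # Placeholder (assumption): your real retrieval metric goes here.
    return 0.0

def sweep_chunk_sizes(corpus, eval_set, sizes=(256, 512, 1024)):
    """Run the same eval set against several chunk sizes; return the winner
    and the full score table."""
    results = {}
    for size in sizes:
        index = build_index(corpus, chunk_size=size, overlap=size // 8)
        results[size] = evaluate(index, eval_set)
    best = max(results, key=results.get)
    return best, results
```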

Next: Semantic chunking.