Fixed-size chunking

Fixed-size chunking splits text into equal-sized pieces by character count or token count. It's the simplest strategy, the default in almost every RAG framework, and a perfectly reasonable baseline for many corpora. It's also the first thing to reconsider when your retrieval underperforms.

How it works

  1. Pick a chunk size (e.g., 512 tokens)
  2. Pick an overlap (e.g., 64 tokens)
  3. Iterate through the document, emitting chunks of the target size with overlap between consecutive chunks
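The steps above can be sketched in a few lines. This is a minimal token-level chunker, assuming the input has already been tokenized into a list; the step size is chunk size minus overlap, so each chunk shares its first `overlap` tokens with the tail of the previous one.

```python
def chunk_fixed(tokens, chunk_size=512, overlap=64):
    """Emit fixed-size chunks with overlap between consecutive chunks.

    tokens: a pre-tokenized document (list of token IDs or strings).
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk reached the end of the document
    return chunks
```

Note that a document shorter than `chunk_size` falls through the loop in a single iteration and yields one chunk containing the full content, with no padding needed.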

Character-based vs token-based

Character-based splitting is simpler and faster. Token-based splitting is more precise, because embedding models enforce token context limits, not character limits.

Use token-based chunking in production. Character-based is fine for prototyping.
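A token-based chunker only needs an `encode`/`decode` pair. The stand-in tokenizer below is a whitespace split, used here purely as an assumption so the sketch is self-contained; in production you would substitute the tokenizer that matches your embedding model (for example, tiktoken's `encode` and `decode`) so chunk sizes line up with the model's real token limits.

```python
def encode(text):
    # Stand-in tokenizer (assumption): whitespace split.
    # Swap in your embedding model's tokenizer for production use.
    return text.split()

def decode(tokens):
    # Stand-in detokenizer (assumption), inverse of encode above.
    return " ".join(tokens)

def chunk_by_tokens(text, chunk_size=512, overlap=64):
    """Split text into chunks of at most chunk_size tokens with overlap."""
    tokens = encode(text)
    step = chunk_size - overlap
    out = []
    for start in range(0, len(tokens), step):
        out.append(decode(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return out
```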

The sweet spots

Chunk size: most corpora land somewhere between 256 and 1024 tokens. Overlap: typically 10-15% of chunk size.

When fixed-size works well

Uniform prose where almost any window of text is roughly self-contained: articles, documentation, transcripts. Here fixed-size is a solid baseline that's cheap to compute and easy to tune.

When fixed-size breaks

Whenever boundaries carry meaning. Fixed-size windows ignore headings, tables, code blocks, and multi-paragraph arguments, so they routinely split a thought across two chunks and neither chunk retrieves well. That failure mode is the motivation for the structure-aware strategies that follow.

The implementation details that matter

Don't split mid-word

Naive character splitting can break "chunk" into "chu" and "nk". Always snap to word boundaries.
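One way to snap a character-level split point to a word boundary is to walk left from the target index until whitespace is found. This is a minimal sketch; the function name and tolerance behavior are illustrative choices, not a standard API.

```python
def snap_to_word_boundary(text, target_end):
    """Move a character split point left to the nearest whitespace
    so no word is cut in half."""
    if target_end >= len(text):
        return len(text)
    end = target_end
    while end > 0 and not text[end].isspace():
        end -= 1
    # If the scan hit the start of the text (one giant word),
    # fall back to the original split point.
    return end if end > 0 else target_end
```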

Prefer sentence boundaries

Even with fixed-size targets, snap to the nearest sentence end (or paragraph end) within a tolerance window. Keeps each chunk a complete thought.
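Sentence snapping can be sketched the same way: scan a window around the target split point for a sentence terminator and prefer it over the raw cut. The terminator set and window semantics below are simplifying assumptions; real text needs a proper sentence segmenter to handle abbreviations, decimals, and quotes.

```python
def snap_to_sentence_end(text, target_end, tolerance=50):
    """Find a sentence terminator within +/- tolerance characters of the
    target split point; fall back to the raw target if none exists.

    Assumption: '.', '!', '?' mark sentence ends (no abbreviation handling).
    """
    lo = max(0, target_end - tolerance)
    hi = min(len(text), target_end + tolerance)
    best = -1
    for i in range(lo, hi):
        if text[i] in ".!?":
            best = i  # keep the last terminator inside the window
    return best + 1 if best >= 0 else target_end
```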

Handle short documents

If a document is shorter than your chunk size, don't pad it or drop it. Emit one chunk with the full content.

Deterministic IDs

Chunk IDs should be stable across re-ingestions. Use hash(document_id + chunk_position) so the same content always gets the same chunk ID and you can do proper upserts.
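The `hash(document_id + chunk_position)` idea can be sketched with a cryptographic hash from the standard library; the separator and 16-hex-character truncation here are illustrative choices, not requirements.

```python
import hashlib

def chunk_id(document_id, chunk_position):
    """Deterministic chunk ID: the same document and position always
    hash to the same ID, so re-ingestion upserts instead of duplicating."""
    raw = f"{document_id}:{chunk_position}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]
```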

The experiment to run

For any non-trivial corpus, test three chunk sizes (e.g., 256, 512, 1024 tokens) against the same eval set. Often the best size is 30-50% off from your gut estimate. The only way to know is to measure.
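The sweep amounts to a short loop. `build_index` and `evaluate` below are hypothetical placeholders for your own ingestion and retrieval-eval pipeline (e.g., recall@k against the eval set); the overlap of one-eighth of chunk size matches the 10-15% guideline above.

```python
def build_index(corpus, chunk_size, overlap):
    # Placeholder (assumption): your real chunk-and-embed pipeline goes here.
    return {"chunk_size": chunk_size, "overlap": overlap}

def evaluate(index, eval_set):
    # Placeholder (assumption): your real retrieval metric goes here.
    return 0.0

def sweep_chunk_sizes(corpus, eval_set, sizes=(256, 512, 1024)):
    """Run the same eval set against several chunk sizes; return the winner
    and the full score table."""
    results = {}
    for size in sizes:
        index = build_index(corpus, chunk_size=size, overlap=size // 8)
        results[size] = evaluate(index, eval_set)
    best = max(results, key=results.get)
    return best, results
```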

Next: Semantic chunking.