Picking an embedding model
6 min read · Updated 2026-04-18
Embedding model choice has a larger impact on retrieval quality than almost any other decision in the RAG stack. The right model for a general-knowledge corpus may be wrong for legal documents, code, or Japanese-language content. Here's how I pick.
The MTEB leaderboard is the starting point
MTEB (the Massive Text Embedding Benchmark) ranks embedding models on standardized tasks; the current leaderboard lives on Hugging Face. It's not gospel, but it's a reasonable way to build a starting short-list.
Caveats:
- MTEB measures average performance across tasks. Your domain may not match any task.
- Top models on MTEB are often very large and expensive to run.
- Leaderboards don't capture latency, cost, or context length well.
The five decision dimensions
1. Quality
Proxy: MTEB score, or better, your own eval set. Test 2-3 candidate models against a held-out set of real queries.
2. Cost
- OpenAI text-embedding-3-small: $0.02 / 1M tokens
- OpenAI text-embedding-3-large: $0.13 / 1M tokens
- Cohere embed-v3: ~$0.10 / 1M tokens
- Voyage-3-large: ~$0.18 / 1M tokens
- Self-hosted open-source: GPU-hours only (often significantly cheaper at scale)
Embedding cost compounds: every reindex pays the cost again. For large corpora, this matters.
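To make the compounding concrete, here is the arithmetic as a tiny sketch, using the per-token prices quoted above (verify current pricing before relying on these numbers):

```python
# Rough reindex cost for a corpus at per-1M-token prices.
# Prices are the ones quoted in this article, in USD.
PRICE_PER_M_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def reindex_cost(corpus_tokens: int, model: str) -> float:
    """USD cost to embed the whole corpus once."""
    return corpus_tokens / 1_000_000 * PRICE_PER_M_TOKENS[model]

# A 500M-token corpus, reindexed monthly for a year:
per_reindex = reindex_cost(500_000_000, "text-embedding-3-large")  # → ~$65
annual = 12 * per_reindex                                          # → ~$780
```

The same corpus on text-embedding-3-small costs about $10 per reindex — a 6.5x spread that dominates once you reindex regularly.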
3. Latency
Embedding time per query affects user-facing latency. API calls: 50-200ms. Self-hosted small models: 5-20ms. Self-hosted large models: 30-100ms. For real-time systems, this is load-bearing.
4. Context length
Match to your chunk size. If your chunks are 1,000 tokens, a model with a 512-token context window silently truncates them. Check the spec.
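A pre-indexing check catches this silently-truncated case. This is a minimal sketch: `token_count` is pluggable, and in production you would pass the model's own tokenizer (e.g. tiktoken for OpenAI models); the whitespace split used as the default here is only a crude proxy for illustration.

```python
# Flag chunks that exceed a model's context window before indexing.
# The default token counter (whitespace split) is a rough stand-in for
# the model's real tokenizer -- swap in the actual one in production.
def oversized_chunks(chunks, max_tokens,
                     token_count=lambda text: len(text.split())):
    """Return indices of chunks longer than the model's context window."""
    return [i for i, chunk in enumerate(chunks)
            if token_count(chunk) > max_tokens]

chunks = ["a short chunk", "word " * 600]
oversized_chunks(chunks, max_tokens=512)  # → [1]
```

Run this once per (corpus, model) pair; a non-empty result means you need smaller chunks or a longer-context model.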
5. Domain fit
General-purpose models work for general content. For code, law, medicine, or finance, domain-specific or domain-aware models (or fine-tuned ones) beat general models significantly.
My current default picks
For general prose, production
- OpenAI text-embedding-3-small: cheap, fast, solid. Good baseline.
- OpenAI text-embedding-3-large: when quality matters and cost is secondary.
- Cohere embed-v3 (multilingual): best-in-class multilingual retrieval.
- Voyage-3-large: often tops benchmarks, especially for finance/legal.
For general prose, self-hosted
- BGE-M3: strong multilingual, unified dense/sparse/colbert, Apache-2.0 license
- E5-mistral-7b-instruct: top open-source quality, heavier inference
- nomic-embed-text-v1.5: efficient, strong, 8K context, Apache-2.0
- gte-large: efficient, decent quality, 512 context
For code
- Voyage-code-2 or Voyage-code-3: code-specific
- jina-embeddings-v2-base-code: open-source code embeddings
For legal/financial/medical
Evaluate domain-specific options first:
- Voyage-law-2 for legal
- Voyage-finance-2 for finance
- BioBERT / Clinical BERT for medical
If none fit, fine-tune a general model on domain data (see fine-tuning embeddings).
For multilingual
- Cohere embed-v3 multilingual (excellent quality)
- BGE-M3 (open-source, competitive)
- multilingual-e5-large
Closed vs open
See the dedicated page: closed vs open embedding models.
The testing protocol
1. Build a small eval set: 30-100 queries with known-good answer chunks
2. Index your corpus with candidate models
3. Measure hit-rate@k, MRR, NDCG for each
4. Compare alongside cost and latency
5. Pick the one on the best quality/cost frontier for your use case
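The first two metrics in the protocol take only a few lines. A sketch, assuming each query has a ranked result list and a set of known-good chunk ids (NDCG is omitted here for brevity):

```python
def hit_rate_at_k(results, relevant, k=5):
    """Fraction of queries whose top-k results contain a known-good chunk."""
    hits = sum(1 for ranked, gold in zip(results, relevant)
               if any(doc in gold for doc in ranked[:k]))
    return hits / len(results)

def mrr(results, relevant):
    """Mean reciprocal rank of the first relevant chunk per query."""
    total = 0.0
    for ranked, gold in zip(results, relevant):
        for rank, doc in enumerate(ranked, start=1):
            if doc in gold:
                total += 1.0 / rank
                break  # only the first hit counts
    return total / len(results)

# results[i]: ranked chunk ids retrieved for query i
# relevant[i]: set of known-good chunk ids for query i
results = [["c3", "c1", "c9"], ["c2", "c7", "c4"]]
relevant = [{"c1"}, {"c8"}]
hit_rate_at_k(results, relevant, k=3)  # → 0.5
mrr(results, relevant)                 # → (1/2 + 0) / 2 = 0.25
```

Run the same eval set through each candidate model's index and compare the numbers side by side with the cost and latency figures above.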
This is a one-afternoon exercise that saves months of subtle retrieval issues.
The lock-in question
Switching embedding models means reindexing your entire corpus. For large corpora, this is expensive and slow. Factors to consider:
- How often will you want to change models?
- Can you afford a migration window?
- Does your infra support running two embedding models in parallel during migration?
Build the infrastructure to switch embedding models without rebuilding everything else. Treat embedding_model_version as first-class metadata.
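One way to make that concrete is to tag every stored vector with the model that produced it, so two models can coexist during a migration. A minimal sketch — the field and function names here are illustrative, not from any particular vector store:

```python
# Sketch: embedding_model_version as first-class chunk metadata, so a
# migration can run two models in parallel. Names are illustrative.
from dataclasses import dataclass

@dataclass
class ChunkRecord:
    chunk_id: str
    text: str
    vector: list[float]
    embedding_model_version: str  # filterable, never implicit

def records_for_model(records, model_version):
    """Query-time filter: only compare against vectors from one model.
    Vectors from different embedding models are not comparable."""
    return [r for r in records
            if r.embedding_model_version == model_version]
```

During migration, new writes go to the new model's space while queries keep filtering on the old version; flip the query-time filter once the backfill completes.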
Next: Closed vs open embedding models.