Fine-tuning embeddings
📖 5 min read · Updated 2026-04-18
A general-purpose embedding model knows general English. It doesn't know that in your company, "Jaguar" means a specific product line, not the animal or the car. Fine-tuning embeddings teaches the model your domain's semantic space. For specialized corpora, this is one of the highest-leverage optimizations available.
When fine-tuning is worth it
- Highly specialized vocabulary (legal, medical, financial, scientific, pharma)
- Proprietary terminology that doesn't exist in general training data
- Internal product names or codenames that generic models don't recognize
- Non-English or code-mixed content where general models underperform
- Domains where generic embeddings demonstrably underperform on eval
When it isn't
- Small corpora (< 10K documents): not enough data to fine-tune effectively
- General content where commercial models already perform well
- Early-stage projects before you have meaningful eval data
- Teams without ML ops capacity to maintain the fine-tuned model
The data you need
Fine-tuning requires query-document pairs. Positive pairs match a query with a chunk that actually answers it; optionally, you add hard negatives: chunks that look relevant to the query but aren't.
Where this data comes from:
- Click logs: if you have an existing search system, click-through data is gold
- Q&A pairs: FAQs, support tickets, knowledge base questions
- Synthetic generation: use an LLM to generate queries for each chunk. Works surprisingly well.
- Human annotation: expensive but highest quality
For most domain fine-tuning, you want at least 1,000-10,000 query-document pairs; more is better.
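In practice these pairs are usually stored as JSONL, one example per line. A minimal sketch; the field names here are illustrative, not any particular library's required schema:

```python
import json

# Hypothetical training examples: a query, the chunk that answers it,
# and optional hard negatives that look relevant but don't answer it.
pairs = [
    {
        "query": "How do I reset my API key?",
        "positive": "To rotate an API key, open Settings > Keys and click Regenerate.",
        "negatives": ["API keys are limited to 100 requests per minute."],
    },
    {
        "query": "What is the request rate limit?",
        "positive": "API keys are limited to 100 requests per minute.",
        "negatives": [],
    },
]

def to_jsonl(examples, path):
    """Write training pairs as JSONL, one JSON object per line."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```

Most fine-tuning tooling can consume a format like this directly or after a small mapping step.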
The synthetic data pipeline
Common approach when you don't have natural query data:
- Sample 5,000-10,000 chunks from your corpus
- For each chunk, prompt an LLM: "Generate 3 questions a user might ask that this chunk answers."
- You now have 15,000-30,000 query-chunk pairs
- Fine-tune on these
The quality of your synthetic queries caps the quality of the fine-tune. Use a strong model (GPT-4, Claude) and iterate on the prompt until the generated queries match real user style.
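The pipeline above can be sketched as a small function. `generate_questions` is a hypothetical stand-in for the actual LLM call, which you would implement against your provider's API:

```python
import random

def synthetic_pairs(chunks, generate_questions, per_chunk=3, sample_size=5000):
    """Build (query, chunk) training pairs from a corpus sample.

    `generate_questions(chunk, n)` is a placeholder for an LLM call
    that returns n questions this chunk answers.
    """
    sampled = random.sample(chunks, min(sample_size, len(chunks)))
    pairs = []
    for chunk in sampled:
        for q in generate_questions(chunk, per_chunk):
            pairs.append({"query": q, "positive": chunk})
    return pairs
```

Deduplicating near-identical generated questions and spot-checking a random sample by hand are cheap quality gates worth adding before training.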
Fine-tuning approaches
Contrastive learning
The classic approach. Train with (query, positive_chunk, negative_chunks) triplets, rewarding the model when the positive sits closer to the query than the negatives in embedding space.
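A toy version of the triplet objective in plain Python. Real training backpropagates this loss through the model; this sketch only shows the quantity being minimized:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def triplet_loss(query, positive, negatives, margin=0.2):
    """The positive must beat every negative by at least `margin`
    in cosine similarity; otherwise we pay the remaining gap."""
    pos_sim = cosine(query, positive)
    return sum(
        max(0.0, margin - (pos_sim - cosine(query, neg)))
        for neg in negatives
    )
```

When the positive already clears every negative by the margin, the loss is zero and the example contributes no gradient, which is why hard negatives (near-misses) matter so much.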
Matryoshka fine-tuning
Fine-tune in a way that preserves the Matryoshka Representation Learning (MRL) property, so you can still truncate the fine-tuned vectors to shorter prefixes without retraining.
LoRA / PEFT
Parameter-efficient fine-tuning. Train a small adapter on top of the base model. Dramatically reduces GPU requirements for fine-tuning. Works for most embedding models.
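A minimal sketch of the LoRA idea itself, not any particular library's API: the pretrained weight stays frozen, and only a rank-r update B·A (far fewer parameters) is trained:

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A.

    Only A (r x d_in) and B (d_out x r) would receive gradients;
    W (d_out x d_in) is left untouched, which is where the memory
    and compute savings come from.
    """
    def __init__(self, W, A, B, alpha=1.0):
        self.W = W
        self.A = A
        self.B = B
        self.scale = alpha / len(A)  # the usual alpha / r scaling

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]
```

Initializing B to zeros means the adapted layer starts out identical to the frozen base model, so training begins from the pretrained behavior.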
Tools
- Sentence-Transformers: library with built-in fine-tuning loops for most open-source embedding models
- Hugging Face Trainer: lower-level but more flexible
- LlamaIndex finetuning: wrappers for common patterns
- Voyage / Cohere fine-tuning APIs: for closed models where supported
The OpenAI / commercial reality
As of 2026, OpenAI embedding models aren't fine-tunable. Cohere and Voyage offer fine-tuning APIs. For maximum flexibility, open-source models are the path.
Expected gains
Well-done domain fine-tuning typically produces 10-30% improvement on retrieval metrics over a general-purpose baseline. For highly specialized domains (medical coding, legal citations), gains can be larger.
The maintenance cost
Fine-tuned models need maintenance:
- Re-tune as your corpus evolves
- Version your fine-tuned models and track which indexes use which
- Re-evaluate quarterly against fresh eval data
- When the base model is upgraded, decide whether to re-fine-tune or skip the upgrade
The pragmatic path
For most teams, I recommend:
- Ship with a commercial embedding model
- Build an eval set that reflects real queries
- Measure baseline retrieval quality
- Only fine-tune if you can demonstrate the baseline is genuinely underperforming
- If you fine-tune, run it against the same eval set to prove the gain is real
Don't fine-tune because it's cool. Fine-tune because measurement says you need to.
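The "measure baseline" and "prove the gain" steps above boil down to running the same metric over the same eval set with two embedding functions. A self-contained recall@k sketch, where the `embed` callable is a stand-in for whichever model you are evaluating:

```python
import math

def recall_at_k(ranked_ids, relevant_id, k=5):
    """1.0 if the relevant chunk appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def evaluate(embed, queries, corpus, k=5):
    """Mean recall@k of an `embed` function over (query, relevant_id) pairs.

    `corpus` maps chunk id -> text; `embed` maps text -> vector.
    """
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return num / den if den else 0.0

    chunk_vecs = {cid: embed(text) for cid, text in corpus.items()}
    total = 0.0
    for query, relevant_id in queries:
        qv = embed(query)
        ranked = sorted(chunk_vecs, key=lambda cid: cos(qv, chunk_vecs[cid]), reverse=True)
        total += recall_at_k(ranked, relevant_id, k)
    return total / len(queries)
```

Run `evaluate` once with the baseline model and once with the fine-tuned one, on the same held-out queries; the delta is the number that justifies (or kills) the fine-tune.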
Next: Vector database overview.