Fine-tuning embeddings
📖 5 min read · Updated 2026-04-18
A general-purpose embedding model knows general English. It doesn't know that in your company, "Jaguar" means a specific product line, not the animal or the car. Fine-tuning embeddings teaches the model your domain's semantic space. For specialized corpora, this is one of the highest-leverage optimizations available.
When fine-tuning is worth it
- Highly specialized vocabulary (legal, medical, financial, scientific, pharma)
- Proprietary terminology that doesn't exist in general training data
- Internal product names or codenames that generic models don't recognize
- Non-English or code-mixed content where general models underperform
- Domains where generic embeddings demonstrably underperform on eval
When it isn't
- Small corpora (< 10K documents): not enough data to fine-tune effectively
- General content where commercial models already perform well
- Early-stage projects before you have meaningful eval data
- Teams without ML ops capacity to maintain the fine-tuned model
The data you need
Fine-tuning requires query-document pairs. Positive pairs match a query with a chunk that actually answers it; optionally, you add hard negatives: chunks that look relevant to the query but aren't.
Where this data comes from:
- Click logs: if you have an existing search system, click-through data is gold
- Q&A pairs: FAQs, support tickets, knowledge base questions
- Synthetic generation: use an LLM to generate queries for each chunk. Works surprisingly well.
- Human annotation: expensive but highest quality
For most domain fine-tuning, you want at least 1,000-10,000 query-document pairs; more is better.
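In practice these pairs are usually stored as JSONL, one example per line. A minimal sketch; the field names here are illustrative, not any particular library's required schema:

```python
import json

# Hypothetical training examples: a query, the chunk that answers it,
# and optional hard negatives that look relevant but don't answer it.
pairs = [
    {
        "query": "How do I reset my API key?",
        "positive": "To rotate an API key, open Settings > Keys and click Regenerate.",
        "negatives": ["API keys are limited to 100 requests per minute."],
    },
    {
        "query": "What is the request rate limit?",
        "positive": "API keys are limited to 100 requests per minute.",
        "negatives": [],
    },
]

def to_jsonl(examples, path):
    """Write training pairs as JSONL, one JSON object per line."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```

Most fine-tuning tooling can consume a format like this directly or after a small mapping step.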
The synthetic data pipeline
Common approach when you don't have natural query data:
- Sample 5,000-10,000 chunks from your corpus
- For each chunk, prompt an LLM: "Generate 3 questions a user might ask that this chunk answers."
- You now have 15,000-30,000 query-chunk pairs
- Fine-tune on these
The quality of your synthetic queries caps the quality of the fine-tune. Use a strong model (GPT-4, Claude) and iterate on the prompt until the generated queries match real user style.
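The pipeline above can be sketched as a small function. `generate_questions` is a hypothetical stand-in for the actual LLM call, which you would implement against your provider's API:

```python
import random

def synthetic_pairs(chunks, generate_questions, per_chunk=3, sample_size=5000):
    """Build (query, chunk) training pairs from a corpus sample.

    `generate_questions(chunk, n)` is a placeholder for an LLM call
    that returns n questions this chunk answers.
    """
    sampled = random.sample(chunks, min(sample_size, len(chunks)))
    pairs = []
    for chunk in sampled:
        for q in generate_questions(chunk, per_chunk):
            pairs.append({"query": q, "positive": chunk})
    return pairs
```

Deduplicating near-identical generated questions and spot-checking a random sample by hand are cheap quality gates worth adding before training.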
Fine-tuning approaches
Contrastive learning
The classic approach. Train with (query, positive_chunk, negative_chunks) triplets, rewarding the model when the positive sits closer to the query than the negatives in embedding space.
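A toy version of the triplet objective in plain Python. Real training backpropagates this loss through the model; this sketch only shows the quantity being minimized:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def triplet_loss(query, positive, negatives, margin=0.2):
    """The positive must beat every negative by at least `margin`
    in cosine similarity; otherwise we pay the remaining gap."""
    pos_sim = cosine(query, positive)
    return sum(
        max(0.0, margin - (pos_sim - cosine(query, neg)))
        for neg in negatives
    )
```

When the positive already clears every negative by the margin, the loss is zero and the example contributes no gradient, which is why hard negatives (near-misses) matter so much.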
Matryoshka fine-tuning
Fine-tune in a way that preserves the Matryoshka Representation Learning (MRL) property, so you can still truncate the fine-tuned vectors to shorter prefixes without retraining.
LoRA / PEFT
Parameter-efficient fine-tuning. Train a small adapter on top of the base model. Dramatically reduces GPU requirements for fine-tuning. Works for most embedding models.
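A minimal sketch of the LoRA idea itself, not any particular library's API: the pretrained weight stays frozen, and only a rank-r update B·A (far fewer parameters) is trained:

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A.

    Only A (r x d_in) and B (d_out x r) would receive gradients;
    W (d_out x d_in) is left untouched, which is where the memory
    and compute savings come from.
    """
    def __init__(self, W, A, B, alpha=1.0):
        self.W = W
        self.A = A
        self.B = B
        self.scale = alpha / len(A)  # the usual alpha / r scaling

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]
```

Initializing B to zeros means the adapted layer starts out identical to the frozen base model, so training begins from the pretrained behavior.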
Tools
- Sentence-Transformers: library with built-in fine-tuning loops for most open-source embedding models
- Hugging Face Trainer: lower-level but more flexible
- LlamaIndex finetuning: wrappers for common patterns
- Voyage / Cohere fine-tuning APIs: for closed models where supported
The OpenAI / commercial reality
As of 2026, OpenAI embedding models aren't fine-tunable. Cohere and Voyage offer fine-tuning APIs. For maximum flexibility, open-source models are the path.
Expected gains
Well-done domain fine-tuning typically produces 10-30% improvement on retrieval metrics over a general-purpose baseline. For highly specialized domains (medical coding, legal citations), gains can be larger.
The maintenance cost
Fine-tuned models need maintenance:
- Re-tune as your corpus evolves
- Version your fine-tuned models and track which indexes use which
- Re-evaluate quarterly against fresh eval data
- When the base model is upgraded, decide whether to re-fine-tune or skip the upgrade
The pragmatic path
For most teams, I recommend:
- Ship with a commercial embedding model
- Build an eval set that reflects real queries
- Measure baseline retrieval quality
- Only fine-tune if you can demonstrate the baseline is genuinely underperforming
- If you fine-tune, run it against the same eval set to prove the gain is real
Don't fine-tune because it's cool. Fine-tune because measurement says you need to.
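The "measure baseline" and "prove the gain" steps above boil down to running the same metric over the same eval set with two embedding functions. A self-contained recall@k sketch, where the `embed` callable is a stand-in for whichever model you are evaluating:

```python
import math

def recall_at_k(ranked_ids, relevant_id, k=5):
    """1.0 if the relevant chunk appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def evaluate(embed, queries, corpus, k=5):
    """Mean recall@k of an `embed` function over (query, relevant_id) pairs.

    `corpus` maps chunk id -> text; `embed` maps text -> vector.
    """
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return num / den if den else 0.0

    chunk_vecs = {cid: embed(text) for cid, text in corpus.items()}
    total = 0.0
    for query, relevant_id in queries:
        qv = embed(query)
        ranked = sorted(chunk_vecs, key=lambda cid: cos(qv, chunk_vecs[cid]), reverse=True)
        total += recall_at_k(ranked, relevant_id, k)
    return total / len(queries)
```

Run `evaluate` once with the baseline model and once with the fine-tuned one, on the same held-out queries; the delta is the number that justifies (or kills) the fine-tune.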
Next: Vector database overview.