Fine-tuning embeddings

A general-purpose embedding model knows general English. It doesn't know that in your company, "Jaguar" means a specific product line, not the animal or the car. Fine-tuning embeddings teaches the model your domain's semantic space. For specialized corpora, this is one of the highest-leverage optimizations available.

When fine-tuning is worth it

Your corpus is dense with domain jargon, internal product names, or terms of art that a general model maps to the wrong neighborhood of embedding space, and you have (or can synthesize) enough query-document pairs to train on.

When it isn't

Your queries and documents read like general English, you have no eval set to show the baseline is actually underperforming, or you can't absorb the ongoing maintenance cost of owning a model.

The data you need

Fine-tuning requires query-document pairs. Positive pairs: a query and a chunk that actually answers it. Optionally, hard negatives: chunks that look relevant to a query but don't answer it.

Where this data comes from:

  - Real user queries from search or support logs, paired with the chunks users clicked or that answered them
  - Human-labeled pairs, if you can afford annotation
  - Synthetic queries generated by an LLM over your corpus (covered below)

For most domain fine-tuning, you want at least 1,000-10,000 query-document pairs. More is better.
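As a sketch, one common way to store such pairs is one JSON object per line (JSONL). The field names and the record contents below are illustrative, not a standard schema; the example reuses the document's "Jaguar" product-line scenario:

```python
import json

# Hypothetical training records; "negatives" holds hard negatives that
# look relevant to the query but don't answer it.
pairs = [
    {
        "query": "How do I factory-reset a Jaguar unit?",
        "positive": "To factory-reset a Jaguar unit, hold the power button for ten seconds, then...",
        "negatives": [
            "The Jaguar product line launched as the successor to...",
        ],
    },
]

# One JSON object per line (JSONL) is a common on-disk format for this.
jsonl = "\n".join(json.dumps(p) for p in pairs)
print(len(jsonl.splitlines()))  # → 1
```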

The synthetic data pipeline

Common approach when you don't have natural query data:

  1. Sample 5,000-10,000 chunks from your corpus
  2. For each chunk, prompt an LLM: "Generate 3 questions a user might ask that this chunk answers."
  3. You now have 15,000-30,000 query-chunk pairs
  4. Fine-tune on these

The quality of your synthetic queries caps the quality of the fine-tune. Use a strong model (GPT-4, Claude) to generate them, and iterate on the prompt until the queries match the style of real user queries.
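The four steps above can be sketched in a few lines of Python. `generate_questions` stands in for a real LLM API call and is faked here so the sketch runs offline; the names and sizes are illustrative:

```python
import random

def generate_questions(chunk: str, n: int = 3) -> list[str]:
    """Stand-in for a real LLM call (e.g. a commercial chat API).
    Faked here so the sketch runs without network access."""
    return [f"Question {i + 1} about: {chunk[:30]}" for i in range(n)]

def build_synthetic_pairs(corpus, sample_size, per_chunk=3):
    # Step 1: sample chunks; steps 2-3: generate questions and pair them up.
    chunks = random.sample(corpus, min(sample_size, len(corpus)))
    return [
        {"query": q, "positive": chunk}
        for chunk in chunks
        for q in generate_questions(chunk, per_chunk)
    ]

corpus = [f"chunk {i}: some document text..." for i in range(100)]
pairs = build_synthetic_pairs(corpus, sample_size=50)
print(len(pairs))  # → 150 (50 chunks x 3 questions each)
```

Step 4, the fine-tune itself, consumes these pairs as training data.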

Fine-tuning approaches

Contrastive learning

The classic approach. Train on (query, positive_chunk, negative_chunks) triplets, with a loss that rewards the model when the positive sits closer to the query in embedding space than the negatives do.
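A toy NumPy sketch of one common variant, in-batch negatives with an InfoNCE-style loss: for query i, chunk i is the positive and every other chunk in the batch serves as a negative. This is a minimal illustration of the objective, not a training loop:

```python
import numpy as np

def info_nce_loss(q, p, temperature=0.05):
    """Contrastive loss with in-batch negatives.
    q, p: (batch, dim) raw query and chunk embeddings."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)   # normalize so the dot
    p = p / np.linalg.norm(p, axis=1, keepdims=True)   # product is cosine sim
    sims = (q @ p.T) / temperature                     # (batch, batch) scores
    sims = sims - sims.max(axis=1, keepdims=True)      # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                # diagonal = true pairs

rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 16))
loss_random = info_nce_loss(queries, rng.normal(size=(8, 16)))  # unrelated chunks
loss_aligned = info_nce_loss(queries, queries)  # positives identical to queries
print(f"random: {loss_random:.3f}  aligned: {loss_aligned:.4f}")
```

Training drives the model's embeddings from the "random" regime toward the "aligned" one, where the loss is near zero.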

Matryoshka fine-tuning

Fine-tune in a way that preserves the Matryoshka Representation Learning (MRL) property, so you can still truncate the fine-tuned vectors to smaller dimensions without retraining.
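The MRL training objective is its own construction; what it buys you at inference is simple truncation, sketched here in NumPy (sizes are illustrative):

```python
import numpy as np

def truncate(emb, dim):
    """Keep the first `dim` coordinates and re-normalize, so cosine
    similarity is still meaningful at the smaller size."""
    small = emb[..., :dim]
    return small / np.linalg.norm(small, axis=-1, keepdims=True)

rng = np.random.default_rng(1)
full = rng.normal(size=(4, 768))                             # full-size vectors
full = full / np.linalg.norm(full, axis=-1, keepdims=True)

short = truncate(full, 128)   # 6x smaller index, usable only if the
print(short.shape)            # model was trained to front-load information
```

Without MRL-aware training (or fine-tuning that preserves it), the leading coordinates carry no special weight and truncated vectors degrade badly.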

LoRA / PEFT

Parameter-efficient fine-tuning: freeze the base model and train a small low-rank adapter on top of it. Dramatically reduces GPU requirements, and works with most embedding models.
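The adapter arithmetic for a single linear layer is simple enough to sketch in NumPy. The dimensions, initialization, and `alpha` scaling below are illustrative; real training would backpropagate through A and B only:

```python
import numpy as np

rng = np.random.default_rng(2)
d_out, d_in, r = 64, 64, 8             # full layer size vs. adapter rank
alpha = 16                             # LoRA scaling hyperparameter

W = rng.normal(size=(d_out, d_in))     # frozen base weight (not trained)
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))               # trainable, zero init

def lora_forward(x):
    # Base path plus a low-rank update; only A and B receive gradients.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(2, d_in))
# Zero-initializing B makes the adapter an exact no-op before training starts.
print(np.allclose(lora_forward(x), x @ W.T))  # → True
# Trainable parameters: r * (d_in + d_out) = 1,024 vs. 4,096 in W itself.
```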

Tools

The open-source workhorse here is sentence-transformers, which ships contrastive losses (e.g. MultipleNegativesRankingLoss) and Matryoshka-aware training; Hugging Face's PEFT library covers LoRA adapters.

The OpenAI / commercial reality

As of 2026, OpenAI embedding models aren't fine-tunable. Cohere and Voyage offer fine-tuning APIs. For maximum flexibility, open-source models are the path.

Expected gains

Well-done domain fine-tuning typically produces a 10-30% relative improvement on retrieval metrics (recall@k, MRR, nDCG) over a general-purpose baseline. For highly specialized domains (medical coding, legal citations), gains can be larger.

The maintenance cost

Fine-tuned models need maintenance:

  - Every model update means re-embedding your entire corpus, since old and new vectors aren't comparable
  - The domain drifts: new products, new jargon, and new document types call for periodic re-fine-tuning
  - You now own hosting, versioning, and monitoring that a vendor would otherwise run for you
  - Your eval set has to stay current, or you can't tell whether a refresh actually helped

The pragmatic path

For most teams, I recommend:

  1. Ship with a commercial embedding model
  2. Build an eval set that reflects real queries
  3. Measure baseline retrieval quality
  4. Only fine-tune if you can demonstrate the baseline is genuinely underperforming
  5. If you fine-tune, run it against the same eval set to prove the gain is real
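Steps 3 and 5 above hinge on the same measurement. A minimal recall@k sketch in NumPy, with random vectors standing in for a real model's embeddings of your eval queries and corpus:

```python
import numpy as np

def recall_at_k(query_emb, chunk_emb, gold_idx, k=5):
    """Fraction of queries whose gold chunk lands in the top-k chunks
    by cosine similarity. gold_idx[i] is the index of the chunk that
    answers query i."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = chunk_emb / np.linalg.norm(chunk_emb, axis=1, keepdims=True)
    topk = np.argsort(-(q @ c.T), axis=1)[:, :k]
    hits = [gold_idx[i] in topk[i] for i in range(len(gold_idx))]
    return float(np.mean(hits))

rng = np.random.default_rng(3)
chunks = rng.normal(size=(100, 32))   # stand-in for embedded corpus chunks
gold = np.arange(10)                  # query i should retrieve chunk i
# Sanity check with a "perfect" model: queries identical to their gold chunks.
print(recall_at_k(chunks[gold], chunks, gold, k=5))  # → 1.0
```

Run the same function over baseline and fine-tuned embeddings of the same eval set; the difference is your demonstrated gain.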

Don't fine-tune because it's cool. Fine-tune because measurement says you need to.

Next: Vector database overview.