Closed vs open embedding models

The choice between closed-source API embeddings (OpenAI, Cohere, Voyage) and self-hosted open-source models (BGE, E5, Nomic) involves real tradeoffs that weigh differently from team to team. Here's how I approach it.

Closed / API advantages

Closed / API disadvantages

Open / self-hosted advantages

Open / self-hosted disadvantages

The cost break-even

Rough rule of thumb:
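
The arithmetic behind any break-even estimate can be sketched as follows. This is a minimal illustration, not a pricing reference: the price and monthly cost below are hypothetical placeholders, and the model assumes self-hosting is a flat monthly cost (one always-on machine with enough throughput for your volume), which will not hold once you need to scale out.

```python
from dataclasses import dataclass


@dataclass
class CostInputs:
    # Both figures are hypothetical; substitute your vendor's price and your own infra cost.
    api_price_per_million_tokens: float  # USD per 1M tokens embedded via the API
    self_host_fixed_monthly: float       # USD per month: GPU instance plus ops time


def monthly_api_cost(tokens_per_month: float, c: CostInputs) -> float:
    """API spend for a given monthly token volume."""
    return tokens_per_month / 1_000_000 * c.api_price_per_million_tokens


def break_even_tokens(c: CostInputs) -> float:
    """Monthly token volume at which API spend equals the flat self-hosting cost."""
    return c.self_host_fixed_monthly / c.api_price_per_million_tokens * 1_000_000


if __name__ == "__main__":
    inputs = CostInputs(api_price_per_million_tokens=0.10, self_host_fixed_monthly=500.0)
    volume = break_even_tokens(inputs)
    print(f"Break-even at roughly {volume:,.0f} tokens per month")
```

Below the break-even volume the API is cheaper under these assumptions; above it, self-hosting wins on cost alone, before accounting for engineering time.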

The compliance angle

For regulated industries (healthcare, finance, government), sending proprietary data to an external embedding API may be disqualifying. Some vendors offer dedicated deployments or in-VPC options (AWS Bedrock, Azure OpenAI, private Cohere deployments). Self-hosting an open-source model is the cleanest compliance story, because the data never leaves your own infrastructure.

The quality gap (2026)

Top open-source models (BGE-M3, E5-mistral, nomic-embed-v1.5) are within 2-5 points on MTEB of the best commercial models. For most RAG applications, this difference is smaller than the variance from chunking and retrieval choices.

For specialized domains (legal, finance), commercial domain-tuned models (Voyage-law, Voyage-finance) currently have a wider lead, unless you can fine-tune your own model.

The hybrid pattern

Many serious teams end up running both:

This requires maintaining two embedding pipelines but can optimize cost and compliance simultaneously.
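
One way to structure the two pipelines is a small router in front of both embedders. The sketch below is an assumption about how the split might work, not a prescribed design: it routes on a hypothetical per-document `sensitive` flag, and `fake_local` and `fake_api` are stand-ins for a real self-hosted model and a real vendor client. Vectors from different models live in different spaces (often with different dimensions), so each backend writes to its own index.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# An embedder takes a batch of texts and returns one vector per text.
Embedder = Callable[[List[str]], List[List[float]]]


@dataclass
class Doc:
    text: str
    sensitive: bool  # hypothetical flag, e.g. set by a data-classification step


def fake_local(texts: List[str]) -> List[List[float]]:
    """Stand-in for a self-hosted model (2-dimensional toy vectors)."""
    return [[float(len(t)), 0.0] for t in texts]


def fake_api(texts: List[str]) -> List[List[float]]:
    """Stand-in for a vendor API client (3-dimensional toy vectors)."""
    return [[0.0, float(len(t)), 1.0] for t in texts]


class HybridEmbedder:
    def __init__(self, local: Embedder, api: Embedder) -> None:
        self.backends: Dict[str, Embedder] = {"local": local, "api": api}

    def route(self, doc: Doc) -> str:
        # Sensitive documents never leave our infrastructure.
        return "local" if doc.sensitive else "api"

    def embed(self, docs: List[Doc]) -> Dict[str, List[Tuple[str, List[float]]]]:
        # Group texts per backend so each embedder is called once with a batch.
        batches: Dict[str, List[str]] = {name: [] for name in self.backends}
        for doc in docs:
            batches[self.route(doc)].append(doc.text)

        # Keep one index per backend; the two vector spaces are not comparable.
        indexes: Dict[str, List[Tuple[str, List[float]]]] = {}
        for name, texts in batches.items():
            vectors = self.backends[name](texts) if texts else []
            indexes[name] = list(zip(texts, vectors))
        return indexes


if __name__ == "__main__":
    hybrid = HybridEmbedder(local=fake_local, api=fake_api)
    result = hybrid.embed([Doc("patient record", True), Doc("public FAQ", False)])
    for index_name, rows in result.items():
        print(index_name, rows)
```

At query time the same routing decision applies: a query must be embedded with the model that built the index it searches.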

Self-hosting stack

What I actually recommend

Next: Dimensions, cost, and MRL.