Closed vs open embedding models
📖 5 min read · Updated 2026-04-18
The choice between closed-source API embeddings (OpenAI, Cohere, Voyage) and self-hosted open-source models (BGE, E5, Nomic) is a real decision with different tradeoffs for different teams. Here's how I approach it.
Closed / API advantages
- Zero ops overhead. Send text, get vectors back.
- Consistent latency and reliability (SLAs).
- Automatic upgrades as providers release new versions.
- Usually highest raw quality (providers invest heavily).
- No GPU infrastructure required.
Closed / API disadvantages
- Per-token pricing that compounds with every reindex.
- Your queries and documents leave your infrastructure.
- Rate limits can bottleneck large ingestion jobs.
- Vendor lock-in: switching means reindexing.
- Dependency on external uptime.
Open / self-hosted advantages
- No per-token cost (only compute).
- Data stays in your VPC (critical for regulated industries).
- Predictable cost at scale.
- Can fine-tune on your domain.
- Full control over latency (batch size tuning, hardware choice).
Open / self-hosted disadvantages
- GPU infrastructure needed (or careful CPU inference tuning).
- Ops burden: model updates, deployments, monitoring.
- Quality slightly below top commercial models (gap is closing).
- Engineering time to serve at scale.
The cost break-even
Rough rule of thumb:
- Below ~10M chunks total: API is almost always cheaper. You're spending $100-500/month, which doesn't justify the ops time of self-hosting.
- 10M-100M chunks, growing: depends. Model a 1-year TCO before deciding.
- Above 100M chunks with frequent reindexing: self-hosted typically wins on cost alone.
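The break-even above is easy to sanity-check with a back-of-the-envelope model. Every number here (tokens per chunk, API price, GPU rate, engineering hours) is an illustrative assumption; plug in your own quotes:

```python
def api_cost(total_chunks, tokens_per_chunk=500, reindexes_per_year=4,
             price_per_mtok=0.02):
    """Annual API cost: each reindex re-embeds the entire corpus."""
    tokens = total_chunks * tokens_per_chunk * reindexes_per_year
    return tokens / 1_000_000 * price_per_mtok

def self_hosted_cost(gpu_hourly=1.0, hours_per_year=8760,
                     eng_hours=100, eng_rate=100.0):
    """Annual self-hosted cost: one always-on GPU plus setup/ops time."""
    return gpu_hourly * hours_per_year + eng_hours * eng_rate

for chunks in (10_000_000, 100_000_000, 1_000_000_000):
    print(f"{chunks:>13,} chunks: "
          f"API ${api_cost(chunks):>9,.0f}/yr vs "
          f"self-hosted ${self_hosted_cost():>9,.0f}/yr")
```

Under these assumptions the crossover lands somewhere between 100M and 1B chunks, consistent with the rule of thumb. Note how sensitive the answer is to reindex frequency: halving reindexes halves the API bill but barely touches the self-hosted one.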
The compliance angle
For regulated industries (healthcare, finance, government), sending proprietary data to an external embedding API may be disqualifying. Some vendors offer dedicated deployments or in-VPC options (AWS Bedrock, Azure OpenAI, private Cohere deployments). Self-hosted open-source is the cleanest compliance story.
The quality gap (2026)
Top open-source models (BGE-M3, E5-mistral, nomic-embed-v1.5) are within 2-5 points on MTEB of the best commercial models. For most RAG applications, this difference is smaller than the variance from chunking and retrieval choices.
For specialized domains (legal, finance), commercial domain-tuned models (Voyage-law, Voyage-finance) currently hold a wider lead, unless you can fine-tune your own.
The hybrid pattern
Many serious teams end up running both:
- Commercial API for the main retrieval index (quality, simplicity)
- Self-hosted model for query-time embedding (latency, cost on high-volume queries)
- Self-hosted for sensitive data segments where external APIs aren't allowed
This requires maintaining two embedding pipelines but can optimize cost and compliance simultaneously.
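The routing side of the hybrid pattern can be sketched in a few lines. The two embedding functions here are hypothetical stubs standing in for a real API client and an in-VPC model server:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    sensitive: bool  # set by an upstream data-classification step

def embed_api(texts):
    # Stub for a commercial embedding API call (real code: an HTTP client).
    return [[0.0] * 1536 for _ in texts]

def embed_self_hosted(texts):
    # Stub for a call to an in-VPC open-source model server.
    return [[0.0] * 1024 for _ in texts]

def embed_routed(docs):
    """Sensitive docs stay in-VPC; everything else goes to the API."""
    in_vpc = [d.text for d in docs if d.sensitive]
    external = [d.text for d in docs if not d.sensitive]
    return embed_self_hosted(in_vpc), embed_api(external)
```

One caveat: the two routes produce vectors from different models with different dimensions, so they must live in separate indexes. Vectors from different embedding spaces are not comparable, and mixing them in one index silently breaks similarity search.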
Self-hosting stack
- Text Embeddings Inference (TEI): Hugging Face's production serving for embedding models. Fast, optimized.
- vLLM: general LLM serving, supports embedding models.
- Triton Inference Server: NVIDIA's production serving, heavy but capable.
- Sentence-Transformers + FastAPI: simple DIY for smaller scale.
What I actually recommend
- Starting out: OpenAI text-embedding-3-small. Cheap, fast, good enough.
- Scaling up (10M+ chunks) with spare engineering capacity: evaluate BGE-M3 or nomic-embed-v1.5 self-hosted.
- Quality-sensitive: test Voyage, text-embedding-3-large, or BGE-M3 against your eval set. Let numbers decide.
- Regulated: commercial with dedicated deployment, or self-hosted open-source.
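"Let numbers decide" can be as simple as recall@k over your own eval set. A minimal sketch: it assumes you've already embedded your queries and documents with each candidate model, and that `relevant[i]` holds the index of the gold document for query `i`:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def recall_at_k(query_vecs, doc_vecs, relevant, k=5):
    """Fraction of queries whose gold document ranks in the top k."""
    hits = 0
    for qi, qv in enumerate(query_vecs):
        ranked = sorted(range(len(doc_vecs)),
                        key=lambda di: cosine(qv, doc_vecs[di]),
                        reverse=True)
        if relevant[qi] in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)
```

Run this once per candidate model on the same queries and documents, and pick the winner. Even 50-100 hand-labeled query/document pairs from your own corpus will tell you more than a leaderboard delta of a few MTEB points.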
Next: Dimensions, cost, and MRL.