Metadata extraction

Metadata is what turns a flat blob of vectors into a queryable knowledge graph. Good metadata makes retrieval 3-10x better and unlocks features that pure vector search can't do. Most RAG systems ship without it, then spend months bolting it on retroactively.

What metadata is for

  1. Filtering. Only retrieve chunks where product = "Enterprise" and user has access.
  2. Boosting. Prefer recent docs, canonical sources, high-authority authors.
  3. Grouping. Return top-k diverse sources instead of five chunks from the same doc.
  4. Citation. Tell the user "I learned this from [document], section 3.2, published 2024-01-15."
  5. Access control. Enforce permissions at retrieval time.
  6. Debugging. Trace bad answers back to specific chunks and source documents.
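
As a rough illustration of the grouping use case, here is a minimal sketch that caps how many chunks a single document can contribute to the final results. The `doc_id` and `chunk_id` field names and the `diversify` helper are hypothetical, chosen only for this example.

```python
from collections import defaultdict

def diversify(ranked_chunks, k, max_per_doc=1):
    """Pick up to k chunks, allowing at most max_per_doc from any one document.

    ranked_chunks must already be sorted by descending relevance score.
    Each chunk is a dict carrying a hypothetical "doc_id" metadata field.
    """
    taken = defaultdict(int)  # doc_id -> number of chunks already selected
    selected = []
    for chunk in ranked_chunks:
        if taken[chunk["doc_id"]] >= max_per_doc:
            continue  # this document has used up its quota
        taken[chunk["doc_id"]] += 1
        selected.append(chunk)
        if len(selected) == k:
            break
    return selected

ranked = [
    {"chunk_id": "a1", "doc_id": "A", "score": 0.93},
    {"chunk_id": "a2", "doc_id": "A", "score": 0.91},
    {"chunk_id": "b1", "doc_id": "B", "score": 0.88},
    {"chunk_id": "a3", "doc_id": "A", "score": 0.87},
    {"chunk_id": "c1", "doc_id": "C", "score": 0.80},
]
top = diversify(ranked, k=3)
print([c["chunk_id"] for c in top])  # ['a1', 'b1', 'c1']
```

Without the `doc_id` field there is nothing to group on, which is the point: the feature depends entirely on metadata captured at ingestion time.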

The metadata I always include

Provenance

Content structure

Access + governance

Domain-specific

Where metadata comes from

From the source system

Confluence gives you author, labels, space, creation date. Drive gives you owner, sharing settings, MIME type. APIs usually return more metadata than you think. Capture it all and decide what to use later.

From the document itself

Title from the first heading. Author from a byline. Date from a publication header. Section hierarchy from structure-aware parsing. Take metadata from inside the document wherever possible; it's often more accurate than the source system's metadata.
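
A minimal sketch of document-derived extraction, assuming the document has already been converted to Markdown-style text with `#` headings, a "By ..." byline, and an ISO-formatted date. The regexes and the `extract_doc_metadata` helper are illustrative heuristics, not a robust parser.

```python
import re
from datetime import date

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
BYLINE_RE = re.compile(r"^By\s+(.+?)\s*$", re.IGNORECASE)  # crude: also matches "By default ..."
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def extract_doc_metadata(text):
    """Pull title, author, publication date, and section paths out of the text."""
    meta = {"title": None, "author": None, "published": None, "sections": []}
    path = []  # trail of headings from the top level down to the current one
    for line in text.splitlines():
        heading = HEADING_RE.match(line)
        if heading:
            level, name = len(heading.group(1)), heading.group(2)
            if meta["title"] is None:
                meta["title"] = name  # first heading doubles as the title
            path = path[: level - 1] + [name]
            meta["sections"].append(" > ".join(path))
            continue
        if meta["author"] is None:
            byline = BYLINE_RE.match(line.strip())
            if byline:
                meta["author"] = byline.group(1)
                continue
        if meta["published"] is None:
            found = DATE_RE.search(line)
            if found:
                try:
                    year, month, day = (int(g) for g in found.groups())
                    meta["published"] = date(year, month, day).isoformat()
                except ValueError:
                    pass  # looked like a date but was not a valid one
    return meta

doc = """# Deployment guide
By Dana Lee
Published 2024-01-15

## Setup
Install the agent.

### Linux
Use the package.

## Usage
Run the agent.
"""
meta = extract_doc_metadata(doc)
print(meta["title"], "|", meta["author"], "|", meta["published"])
# Deployment guide | Dana Lee | 2024-01-15
print(meta["sections"][2])  # Deployment guide > Setup > Linux
```

The section path is what later lets a citation point at a specific section rather than at the whole document.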

Inferred via LLM or NER

Run an LLM or a named-entity recognizer over each chunk to extract:

LLM enrichment costs money per chunk but often pays back in retrieval quality. A small model (Haiku, GPT-4o-mini) works fine for structured extraction.
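
One way this enrichment step might be wired up is sketched below. The model call is passed in as a plain callable so the example does not depend on any vendor SDK. The field list (`topics`, `entities`, `doc_type`), the prompt wording, and the `enrich_chunk` helper are all assumptions made for illustration. The part worth copying is the validation: treat model output as untrusted and keep only fields of the expected type.

```python
import json

# Hypothetical schema: field name -> expected Python type.
EXPECTED_FIELDS = {"topics": list, "entities": list, "doc_type": str}

PROMPT = (
    "Extract metadata from the text below. Reply with a single JSON object "
    "with keys: topics (list of strings), entities (list of strings), "
    "doc_type (string).\n\nText:\n{chunk}"
)

def enrich_chunk(chunk_text, complete):
    """Ask a model for structured metadata and keep only well-typed fields.

    complete is any function that takes a prompt string and returns the
    model's text reply (an assumption; adapt it to your client library).
    """
    raw = complete(PROMPT.format(chunk=chunk_text))
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # unparseable reply: store nothing rather than garbage
    if not isinstance(data, dict):
        return {}
    clean = {}
    for key, expected_type in EXPECTED_FIELDS.items():
        value = data.get(key)
        if isinstance(value, expected_type):
            clean[key] = value
    return clean

def fake_model(prompt):
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return '{"topics": ["billing"], "entities": ["Stripe"], "doc_type": "runbook", "extra": 1}'

enriched = enrich_chunk("Refunds are issued through Stripe.", fake_model)
print(enriched)
# {'topics': ['billing'], 'entities': ['Stripe'], 'doc_type': 'runbook'}
```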

Metadata filtering in retrieval

Most production vector DBs support filtered search: "find nearest neighbors where tenant_id = 'acme' AND visibility != 'restricted' AND updated_at > '2024-01-01'."

The pattern:

  1. User sends query + their user context (tenant, role, etc.)
  2. Build filter from user context + query-derived constraints
  3. Run vector search with filter applied
  4. Rerank or boost based on remaining metadata (recency, authority)
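
The four steps above can be sketched end to end with an in-memory list standing in for the vector DB. Everything here is an assumption for illustration: the field names mirror the filter example earlier in this section, the search is brute-force cosine similarity, and the 80/20 blend with a one-year recency half-life is an arbitrary choice, not a recommendation.

```python
import math
from datetime import date

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_filter(user, updated_after=None):
    """Step 2: turn user context plus query constraints into a predicate."""
    def allowed(meta):
        return (
            meta["tenant_id"] == user["tenant_id"]
            and meta["visibility"] != "restricted"
            and (updated_after is None or meta["updated_at"] > updated_after)
        )
    return allowed

def retrieve(query_vec, chunks, allowed, k, today, half_life_days=365):
    # Step 3: vector search restricted to chunks that pass the filter.
    hits = [(cosine(query_vec, c["vector"]), c) for c in chunks if allowed(c["meta"])]
    # Step 4: rerank, blending similarity with an exponential recency decay.
    def final_score(hit):
        similarity, chunk = hit
        age_days = (today - chunk["meta"]["updated_at"]).days
        recency = 0.5 ** (age_days / half_life_days)
        return similarity * (0.8 + 0.2 * recency)
    hits.sort(key=final_score, reverse=True)
    return [chunk["id"] for _, chunk in hits[:k]]

def make_chunk(cid, vector, tenant, visibility, updated):
    return {"id": cid, "vector": vector,
            "meta": {"tenant_id": tenant, "visibility": visibility, "updated_at": updated}}

chunks = [
    make_chunk("c1", [1.0, 0.0], "acme", "internal", date(2024, 6, 1)),
    make_chunk("c2", [1.0, 0.0], "acme", "internal", date(2023, 3, 1)),
    make_chunk("c3", [1.0, 0.0], "other", "internal", date(2024, 6, 1)),
    make_chunk("c4", [0.9, 0.1], "acme", "restricted", date(2024, 6, 1)),
    make_chunk("c5", [0.0, 1.0], "acme", "internal", date(2024, 5, 1)),
]

# Step 1: the query arrives together with the user's context.
user = {"tenant_id": "acme", "role": "support"}
allowed = build_filter(user, updated_after=date(2023, 1, 1))
result = retrieve([1.0, 0.0], chunks, allowed, k=2, today=date(2024, 7, 1))
print(result)  # ['c1', 'c2']
```

`c3` and `c4` never reach scoring because the filter removes them first, and `c1` outranks the equally similar `c2` only because it is more recent.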

Pre-filter vs post-filter

Vector DBs handle metadata filtering in one of two ways, or with a hybrid that switches between them:

Pre-filter

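Narrow the candidate set to matching metadata first, then search. Exact: returns only documents matching the filter, and never fewer than k if k matches exist. Slow when the filter matches a large share of the corpus, because the filtered set may have to be scanned without help from the ANN index.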

Post-filter

Do vector search first, then filter results. Fast but may return fewer-than-k results if matches are rare.

Hybrid (dynamic)

The DB decides based on filter selectivity. Pinecone, Qdrant, and Weaviate do this well.

For narrow filters (e.g., "tenant_id = X"), pre-filter is usually right. For broad filters (e.g., "language = en"), post-filter is fine.
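
The difference shows up clearly in a toy brute-force setting. The sketch below is not how any particular vector DB implements filtering; it only illustrates why post-filtering can come back short when matches are rare. The `fetch` over-fetch parameter and all helper names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query, items, k):
    """Exact nearest neighbours by dot product (stand-in for an ANN index)."""
    return sorted(items, key=lambda it: dot(query, it["vector"]), reverse=True)[:k]

def pre_filter_search(query, items, matches, k):
    # Filter first, then search only the matching items.
    return top_k(query, [it for it in items if matches(it)], k)

def post_filter_search(query, items, matches, k, fetch=None):
    # Search first (optionally over-fetching), then drop non-matching hits.
    candidates = top_k(query, items, fetch or k)
    return [it for it in candidates if matches(it)][:k]

# Ten items ordered from most to least similar; only two belong to "acme".
items = [
    {"id": i, "vector": [1.0 - 0.1 * i, 0.1 * i],
     "tenant_id": "acme" if i in (7, 9) else "other"}
    for i in range(10)
]
query = [1.0, 0.0]

def is_acme(item):
    return item["tenant_id"] == "acme"

pre = [it["id"] for it in pre_filter_search(query, items, is_acme, k=2)]
post = [it["id"] for it in post_filter_search(query, items, is_acme, k=2)]
post_wide = [it["id"] for it in post_filter_search(query, items, is_acme, k=2, fetch=8)]
print(pre)        # [7, 9]
print(post)       # []
print(post_wide)  # [7]
```

Over-fetching narrows the gap but does not close it: even with `fetch=8`, post-filtering finds only one of the two matching items.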

The access control pattern

Every chunk carries a list of roles or groups allowed to see it. At query time, compute the user's allowed roles, then filter retrieval to chunks whose permission list intersects the user's roles. This is how you run multi-tenant RAG without leaks.
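
A minimal sketch of the intersection check, assuming each chunk stores its permissions in a hypothetical `allowed_roles` field. In production the same condition would normally be expressed as a filter inside the vector DB query rather than applied in application code afterwards.

```python
def visible_chunks(chunks, user_roles):
    """Return chunks whose allowed_roles share at least one role with the user.

    A chunk with an empty allowed_roles list is visible to nobody, so a
    chunk ingested without permissions fails closed instead of leaking.
    """
    roles = set(user_roles)
    return [c for c in chunks if roles & set(c["allowed_roles"])]

chunks = [
    {"id": "pricing", "allowed_roles": ["sales", "finance"]},
    {"id": "payroll", "allowed_roles": ["finance"]},
    {"id": "handbook", "allowed_roles": ["sales", "finance", "engineering"]},
    {"id": "untagged", "allowed_roles": []},
]

sales_view = [c["id"] for c in visible_chunks(chunks, ["sales"])]
print(sales_view)  # ['pricing', 'handbook']
```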

See also multi-tenant RAG and security.

The common mistakes
