Metadata extraction

Updated 2026-04-18

Metadata is what turns a flat blob of vectors into a queryable knowledge graph. Good metadata makes retrieval 3-10x better and unlocks features that pure vector search can't do. Most RAG systems ship without it, then spend months bolting it on retroactively.

What metadata is for

Filtering. Only retrieve chunks where product = "Enterprise" and the user has access.
Boosting. Prefer recent docs, canonical sources, high-authority authors.
Grouping. Return top-k diverse sources instead of five chunks from the same doc.
Citation. Tell the user "I learned this from [document], section 3.2, published 2024-01-15."
Access control. Enforce permissions at retrieval time.
Debugging. Trace bad answers back to specific chunks and source documents.
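As one illustration of the grouping use case, here is a minimal sketch of capping how many chunks a single source document may contribute to the final result. The function name `diversify`, the `max_per_source` parameter, and the dict shape of each hit are assumptions made for the example; it only relies on each hit carrying a `source_id`.

```python
from collections import Counter


def diversify(hits: list[dict], k: int, max_per_source: int = 1) -> list[dict]:
    """Keep at most `max_per_source` chunks per source document.

    `hits` must already be sorted best-first; relative order is preserved.
    """
    taken = Counter()  # how many chunks we have kept per source_id
    picked = []
    for hit in hits:
        if len(picked) == k:
            break
        source = hit["source_id"]
        if taken[source] < max_per_source:
            picked.append(hit)
            taken[source] += 1
    return picked
```

In practice you would over-fetch from the vector store (say, several times k) and pass that list through a step like this, so that dropping duplicates from one document still leaves enough candidates.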

The metadata I always include

Provenance

source_system (e.g., "confluence", "google_drive", "s3")
source_id (ID in that system)
source_url (canonical URL)
title
author
created_at, updated_at
ingested_at, embedding_model_version

Content structure

document_type (e.g., "blog", "manual", "policy")
section_path (e.g., "Setup > Authentication > OAuth")
chunk_position (index within document)
chunk_total (so you can reconstruct)
element_type (e.g., "paragraph", "table", "heading", "list_item")

Access + governance

tenant_id
visibility (public, internal, restricted)
permissions (list of roles or groups with access)
data_classification (public, confidential, PII, etc.)

Domain-specific

product, version, language
topics, tags (extracted or human-assigned)
entities (named entities extracted from the text)
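The fields above could be written down as a typed schema before any ingestion code exists. The sketch below is one possible shape, not a required one: the class name `ChunkMetadata`, the `to_payload` helper, the choice of which fields are optional, and the default values are all assumptions for illustration.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class ChunkMetadata:
    # Provenance
    source_system: str
    source_id: str
    source_url: str
    title: str
    author: Optional[str]
    created_at: datetime
    updated_at: datetime
    ingested_at: datetime
    embedding_model_version: str
    # Content structure
    document_type: str
    section_path: str
    chunk_position: int
    chunk_total: int
    element_type: str
    # Access + governance
    tenant_id: str
    visibility: str
    permissions: list[str]  # a list, never a comma-joined string
    data_classification: str
    # Domain-specific (optional here; that choice is an assumption)
    product: Optional[str] = None
    version: Optional[str] = None
    language: Optional[str] = None
    topics: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    entities: list[str] = field(default_factory=list)

    def to_payload(self) -> dict:
        """Return a JSON-friendly dict; datetimes become ISO 8601 strings."""
        payload = asdict(self)
        for key in ("created_at", "updated_at", "ingested_at"):
            payload[key] = payload[key].isoformat()
        return payload
```

Keeping the governance fields required, with no defaults, means a chunk cannot be constructed without someone deciding who may see it.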

Where metadata comes from

From the source system

Confluence gives you author, labels, space, creation date. Drive gives you owner, sharing settings, MIME type. APIs usually return more metadata than you think. Capture it all and decide what to use later.

From the document itself

Title from the first heading. Author from a byline. Date from a publication header. Section hierarchy from structure-aware parsing. Take metadata from inside the document wherever possible. It's often more accurate than the source system's metadata.
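To make "title from the first heading" and "section hierarchy from structure-aware parsing" concrete, here is a minimal sketch for Markdown input. It assumes ATX-style `#` headings and ignores edge cases such as `#` lines inside code fences; the function name `extract_structure` is made up for the example.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def extract_structure(markdown: str):
    """Return (title, rows), where rows are (section_path, text) pairs.

    The title is the first heading seen. Each non-blank body line is
    paired with the heading trail it sits under.
    """
    stack = []  # heading texts, index = heading level - 1
    title = None
    rows = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level, text = len(match.group(1)), match.group(2)
            if title is None:
                title = text
            del stack[level - 1:]          # leave any deeper or sibling section
            while len(stack) < level - 1:  # pad when a level is skipped
                stack.append("")
            stack.append(text)
        elif line.strip():
            path = " > ".join(h for h in stack if h)
            rows.append((path, line.strip()))
    return title, rows
```

Each `section_path` produced this way can be stored on the chunk as-is, in the same "Setup > Authentication > OAuth" form used in the schema.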

Inferred via LLM or NER

Run an LLM or a named-entity recognizer over each chunk to extract:

Named entities (people, organizations, locations, products)
Topics / categories
Sentiment
Document type (if not provided)
Language
Summary (useful for reranking and LLM-based routing)

LLM enrichment costs money per chunk but often pays back in retrieval quality. A small model (Haiku, GPT-4o-mini) works fine for structured extraction.
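A rough sketch of the enrichment step follows. It deliberately does not call any real provider SDK: `complete` is a hypothetical callable you would supply (prompt in, model text out), and the prompt wording and field names are assumptions. The point is the shape of the step: ask for JSON, then validate it before anything reaches the index.

```python
import json
from typing import Callable

PROMPT = """Extract metadata from the text below. Reply with JSON only, using
exactly these keys: entities (list of strings), topics (list of strings),
language (ISO 639-1 code), summary (one sentence).

Text:
{chunk}"""

LIST_FIELDS = ("entities", "topics")
STRING_FIELDS = ("language", "summary")


def enrich_chunk(chunk: str, complete: Callable[[str], str]) -> dict:
    """Return validated metadata fields, or {} if the reply is unusable."""
    raw = complete(PROMPT.format(chunk=chunk))
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    out = {}
    for key in LIST_FIELDS:
        value = data.get(key)
        if isinstance(value, list):
            out[key] = [str(v) for v in value]
    for key in STRING_FIELDS:
        value = data.get(key)
        if isinstance(value, str) and value.strip():
            out[key] = value.strip()
    return out  # unexpected keys from the model are dropped
```

Returning an empty dict on bad output keeps ingestion moving; whether to retry or log instead is a design choice this sketch leaves open.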

Metadata filtering in retrieval

Most production vector DBs support filtered search: "find nearest neighbors where tenant_id = 'acme' AND visibility != 'restricted' AND updated_at > '2024-01-01'."

The pattern:

User sends query + their user context (tenant, role, etc.)
Build filter from user context + query-derived constraints
Run vector search with filter applied
Rerank or boost based on remaining metadata (recency, authority)
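The four steps above can be sketched end to end with an in-memory list standing in for the vector DB. Everything here is illustrative: the function names, the dict layout of a chunk, the equality-only filter, and the recency formula are assumptions, and a real system would push the filter down into the database rather than loop in Python.

```python
import math
from datetime import datetime, timezone


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_filter(user: dict, query_constraints: dict) -> dict:
    # User-context keys are written last so a query-derived constraint
    # can never override the caller's tenant.
    return {**query_constraints, "tenant_id": user["tenant_id"]}


def retrieve(query_vec, chunks, user, query_constraints,
             k=3, now=None, recency_weight=0.1):
    now = now or datetime.now(timezone.utc)
    flt = build_filter(user, query_constraints)

    # Filtered vector search (brute force stands in for the vector DB).
    hits = [
        (cosine(query_vec, c["vector"]), c)
        for c in chunks
        if all(c["meta"].get(key) == value for key, value in flt.items())
    ]

    # Boost on remaining metadata: newer chunks get a small bonus.
    def boosted(hit):
        score, chunk = hit
        age_days = max((now - chunk["meta"]["updated_at"]).days, 0)
        return score + recency_weight / (1 + age_days / 365)

    return [chunk for _, chunk in sorted(hits, key=boosted, reverse=True)[:k]]
```

Building the security part of the filter from the authenticated user context, and never from the query text, is the design choice that matters most here.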

Pre-filter vs post-filter

Vector DBs handle metadata filtering in one of two ways:

Pre-filter

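Narrow the candidate set to matching metadata first, then search. Exact: returns only documents matching the filter. Can be slow, because the filter has to be evaluated before the search starts and the restricted candidate set often can't use the ANN index as-is.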

Post-filter

Do vector search first, then filter results. Fast but may return fewer-than-k results if matches are rare.

Hybrid (dynamic)

The DB decides based on filter selectivity. Pinecone, Qdrant, and Weaviate do this well.

For narrow filters (e.g., "tenant_id = X"), pre-filter is usually right. For broad filters (e.g., "language = en"), post-filter is fine.
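The difference between the two strategies is easy to see with a toy brute-force search. This sketch is not how any particular vector DB implements filtering; the function names and the `overfetch` parameter are assumptions used to show why post-filtering can come back with fewer than k results.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def pre_filter_search(query, items, predicate, k):
    """Filter first, then rank only the matching items."""
    candidates = [it for it in items if predicate(it["meta"])]
    ranked = sorted(candidates, key=lambda it: dot(query, it["vector"]), reverse=True)
    return ranked[:k]


def post_filter_search(query, items, predicate, k, overfetch=1):
    """Rank everything, keep the top k * overfetch, then filter."""
    ranked = sorted(items, key=lambda it: dot(query, it["vector"]), reverse=True)
    nearest = ranked[: k * overfetch]
    return [it for it in nearest if predicate(it["meta"])][:k]
```

When the matching chunks are not among the nearest neighbors overall, `post_filter_search` returns a short list unless `overfetch` is raised, which is the usual workaround and the reason post-filtering suits broad filters better than narrow ones.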

The access control pattern

Every chunk carries a list of roles or groups allowed to see it. At query time, compute the user's allowed roles. Filter retrieval to chunks where permissions intersects with the user's roles. This is how you run multi-tenant RAG without leaks.
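A minimal sketch of that intersection check, written against plain dicts rather than a specific vector DB. The function name `allowed_chunks` and the decision to also require a matching `tenant_id` are assumptions for the example; in production the same condition would be expressed as a filter inside the database query.

```python
def allowed_chunks(chunks: list[dict], tenant_id: str, user_roles: list[str]) -> list[dict]:
    """Return chunks in the user's tenant that share at least one role with the user."""
    roles = set(user_roles)
    return [
        chunk
        for chunk in chunks
        if chunk["tenant_id"] == tenant_id
        # permissions must be a list of role names for this test to work
        and not roles.isdisjoint(chunk["permissions"])
    ]
```

A user with no roles matches nothing, so the check fails closed. If `permissions` were stored as a single string such as "support,admin", this test would compare role names against individual characters and silently hide the chunk.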

The common mistakes

Missing timestamps. Users get outdated answers and there's no way to boost recency.
No source_url. Citations are broken or fake.
Permissions as a string instead of a list. Makes multi-role access impossible to query.
Mutable metadata stored immutably. When a document's permissions change, the index is out of sync until you reindex.
No embedding_model_version. When you upgrade embedding models, you can't tell which chunks need reprocessing.
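Two of these mistakes have cheap mechanical remedies once the fields exist. The sketch below assumes the index is a dict of chunk ID to metadata; both function names are made up, and patching metadata in place is one possible approach to keeping permissions in sync, not the only one.

```python
def stale_chunk_ids(index: dict[str, dict], current_version: str) -> list[str]:
    """List chunks embedded with anything other than the current model version."""
    return sorted(
        chunk_id
        for chunk_id, meta in index.items()
        if meta.get("embedding_model_version") != current_version
    )


def update_document_metadata(index: dict[str, dict], source_id: str, changes: dict) -> int:
    """Patch metadata on every chunk of one document without re-embedding.

    Returns the number of chunks touched.
    """
    updated = 0
    for meta in index.values():
        if meta["source_id"] == source_id:
            meta.update(changes)
            updated += 1
    return updated
```

Chunks with no recorded version are treated as stale, which errs on the side of reprocessing rather than leaving mixed embeddings in the index.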

What to do with this

Write down the metadata schema before you ingest anything. It's harder to add later.
Capture permissions as a list from day one.
Include embedding_model_version so upgrades don't orphan chunks.