Metadata is what turns a flat blob of vectors into a queryable knowledge base. Good metadata can make retrieval 3-10x better and unlocks features that pure vector search can't provide. Most RAG systems ship without it, then spend months bolting it on retroactively.
- source_system (e.g., "confluence", "google_drive", "s3")
- source_id (ID in that system)
- source_url (canonical URL)
- title
- author
- created_at, updated_at
- ingested_at, embedding_model_version
- document_type (e.g., "blog", "manual", "policy")
- section_path (e.g., "Setup > Authentication > OAuth")
- chunk_position (index within document)
- chunk_total (so you can reconstruct)
- element_type (e.g., "paragraph", "table", "heading", "list_item")
- tenant_id
- visibility (public, internal, restricted)
- permissions (list of roles or groups with access)
- data_classification (public, confidential, PII, etc.)
- product, version, language
- topics, tags (extracted or human-assigned)
- entities (named entities extracted from the text)

Confluence gives you author, labels, space, and creation date. Drive gives you owner, sharing settings, and MIME type. APIs usually return more metadata than you think; capture it all and decide what to use later.
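As a rough sketch, a subset of the per-chunk fields above could be carried as one typed record that is flattened into a payload stored next to the vector. The class name `ChunkMetadata`, the `to_payload` helper, the choice of required versus optional fields, and the example URL are all assumptions made for illustration.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class ChunkMetadata:
    # Provenance: where the chunk came from
    source_system: str                 # e.g. "confluence"
    source_id: str
    source_url: str
    # Position within the parent document
    chunk_position: int
    chunk_total: int
    # Access control
    tenant_id: str
    visibility: str = "internal"       # "public", "internal", or "restricted"
    permissions: List[str] = field(default_factory=list)
    # Descriptive fields that are often missing at ingestion time
    title: Optional[str] = None
    author: Optional[str] = None
    updated_at: Optional[str] = None   # ISO 8601 string
    document_type: Optional[str] = None
    section_path: Optional[str] = None
    element_type: Optional[str] = None
    language: Optional[str] = None
    tags: List[str] = field(default_factory=list)

    def to_payload(self) -> dict:
        """Flatten to a plain dict, dropping unset fields and empty lists."""
        return {k: v for k, v in asdict(self).items() if v not in (None, [])}


meta = ChunkMetadata(
    source_system="confluence",
    source_id="12345",
    source_url="https://example.com/wiki/12345",  # placeholder URL
    chunk_position=0,
    chunk_total=4,
    tenant_id="acme",
    section_path="Setup > Authentication > OAuth",
    permissions=["engineering"],
)
payload = meta.to_payload()
```

Making the provenance, position, and tenant fields required means a chunk cannot be constructed without them, which is one way to avoid retrofitting them later.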
Title from the first heading. Author from a byline. Date from a publication header. Section hierarchy from structure-aware parsing. Take metadata from inside the document wherever possible; it's often more accurate than the source system's metadata.
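To make the idea concrete, here is a minimal sketch that derives a title and a section path from heading structure. It assumes the document has already been converted to Markdown with `#`-style headings; the function name `section_paths` is hypothetical, and a real parser would also need to skip fenced code blocks and handle other formats.

```python
import re
from typing import List, Optional, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def section_paths(markdown: str) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Return (title, [(section_path, paragraph_text), ...]) for a Markdown string."""
    title = None
    stack: List[str] = []            # current heading trail, shallowest first
    out: List[Tuple[str, str]] = []
    for line in markdown.splitlines():
        m = HEADING.match(line)
        if m:
            level, text = len(m.group(1)), m.group(2)
            if title is None:
                title = text         # first heading doubles as the title
            # Drop headings at this level or deeper, then push the new one.
            del stack[level - 1:]
            stack.append(text)
        elif line.strip():
            out.append((" > ".join(stack), line.strip()))
    return title, out


doc = (
    "# Setup\n\nIntro text.\n\n"
    "## Authentication\n\n### OAuth\n\nUse the client ID.\n\n"
    "## Logging\n\nSet the level.\n"
)
title, chunks = section_paths(doc)
```

Each non-empty body line is paired with the heading trail in effect at that point, so the path can be stored directly in `section_path`.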
Run an LLM or a named-entity recognizer over each chunk to extract fields such as topics, tags, and named entities.
LLM enrichment costs money per chunk but often pays back in retrieval quality. A small model (Haiku, GPT-4o-mini) works fine for structured extraction.
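A hedged sketch of the enrichment step follows. The model call is deliberately left as a caller-supplied function (`call_model`, hypothetical) so no particular vendor API is implied; the prompt wording and the `topics` and `entities` keys are also assumptions. The part worth showing is the validation: model output is treated as untrusted and reduced to clean lists of strings.

```python
import json
from typing import Callable, Dict, List

PROMPT = (
    "Return a JSON object with keys 'topics' and 'entities', "
    "each a list of short strings, for this text:\n\n{text}"
)


def clean_list(values: object) -> List[str]:
    """Keep only non-empty strings; anything else is discarded."""
    if not isinstance(values, list):
        return []
    return [v.strip() for v in values if isinstance(v, str) and v.strip()]


def enrich(chunk_text: str, call_model: Callable[[str], str]) -> Dict[str, List[str]]:
    """Ask a model for topics and entities; fall back to empty lists on bad output."""
    raw = call_model(PROMPT.format(text=chunk_text))
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {key: clean_list(data.get(key)) for key in ("topics", "entities")}


# Stand-ins for a real model, used only to exercise the validation path.
def fake_model(prompt: str) -> str:
    return '{"topics": ["oauth", ""], "entities": ["Okta", 42]}'


def broken_model(prompt: str) -> str:
    return "sorry, I cannot help with that"


good = enrich("Configure OAuth with Okta.", fake_model)
bad = enrich("Configure OAuth with Okta.", broken_model)
```

Falling back to empty lists keeps a single malformed reply from failing the whole ingestion run, at the cost of silently under-tagging that chunk.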
Most production vector DBs support filtered search: "find nearest neighbors where tenant_id = 'acme' AND visibility != 'restricted' AND updated_at > '2024-01-01'."
The pattern:
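One way to sketch a filtered search is the toy in-memory version below, which applies the example filter from above and ranks the survivors by cosine similarity. It is a brute-force illustration, not how any particular vector DB implements filtering; the record layout and the names `filtered_search` and `allowed` are assumptions.

```python
import math
from typing import Callable, Dict, List, Tuple

Record = Tuple[List[float], Dict[str, str]]   # (embedding, metadata)


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(
    query: List[float],
    records: List[Record],
    predicate: Callable[[Dict[str, str]], bool],
    k: int = 3,
) -> List[Tuple[float, Dict[str, str]]]:
    """Score only records whose metadata passes the predicate, then take the top k."""
    scored = [(cosine(query, vec), meta) for vec, meta in records if predicate(meta)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


records: List[Record] = [
    ([1.0, 0.0], {"id": "a", "tenant_id": "acme", "visibility": "internal", "updated_at": "2024-03-01"}),
    ([0.9, 0.1], {"id": "b", "tenant_id": "acme", "visibility": "restricted", "updated_at": "2024-05-01"}),
    ([0.8, 0.2], {"id": "c", "tenant_id": "other", "visibility": "public", "updated_at": "2024-06-01"}),
    ([0.0, 1.0], {"id": "d", "tenant_id": "acme", "visibility": "public", "updated_at": "2023-01-01"}),
    ([0.6, 0.4], {"id": "e", "tenant_id": "acme", "visibility": "public", "updated_at": "2024-02-01"}),
]


def allowed(meta: Dict[str, str]) -> bool:
    # ISO 8601 date strings compare correctly as plain strings.
    return (
        meta["tenant_id"] == "acme"
        and meta["visibility"] != "restricted"
        and meta["updated_at"] > "2024-01-01"
    )


hits = filtered_search([1.0, 0.0], records, allowed)
hit_ids = [meta["id"] for _, meta in hits]
```

Only chunks "a" and "e" pass all three conditions, so the result is shorter than `k` even though closer vectors exist in the corpus.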
Vector DBs handle metadata filtering in one of two ways, and some choose between them automatically:
Pre-filter: Narrow the candidate set to matching metadata first, then search. Exact, and returns only documents matching the filter. Can be slow when the filter is broad, because the candidate set that still has to be searched stays large.
Post-filter: Do vector search first, then filter the results. Fast, but may return fewer than k results if matches are rare.
Some DBs decide between the two based on filter selectivity. Pinecone, Qdrant, and Weaviate do this well.
For narrow filters (e.g., "tenant_id = X"), pre-filter is usually right. For broad filters (e.g., "language = en"), post-filter is fine.
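The trade-off between pre-filter and post-filter can be seen in a small toy comparison. This is a sketch with exact ranking over one-dimensional vectors standing in for an approximate index; the `overfetch` parameter is a common workaround added here as an assumption, not something a specific DB is claimed to expose.

```python
from typing import Callable, Dict, List, Tuple

Record = Tuple[List[float], Dict[str, str]]   # (embedding, metadata)


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query: List[float], records: List[Record], k: int) -> List[Record]:
    return sorted(records, key=lambda r: dot(query, r[0]), reverse=True)[:k]


def pre_filter(query, records, keep: Callable[[Dict[str, str]], bool], k: int) -> List[Record]:
    # Filter first, then rank the survivors: every result satisfies the filter.
    return top_k(query, [r for r in records if keep(r[1])], k)


def post_filter(query, records, keep: Callable[[Dict[str, str]], bool], k: int,
                overfetch: int = 1) -> List[Record]:
    # Rank first, then filter: may come back short if matches are rare.
    candidates = top_k(query, records, k * overfetch)
    return [r for r in candidates if keep(r[1])][:k]


# Ten records; only the two lowest-scoring ones belong to tenant "acme".
records: List[Record] = [
    ([float(10 - i)], {"tenant_id": "acme" if i >= 8 else "other"})
    for i in range(10)
]
query = [1.0]


def is_acme(meta: Dict[str, str]) -> bool:
    return meta["tenant_id"] == "acme"


exact = pre_filter(query, records, is_acme, k=2)
short = post_filter(query, records, is_acme, k=2)
wider = post_filter(query, records, is_acme, k=2, overfetch=5)
```

With a narrow filter, plain post-filtering returns nothing here because the two best-scoring candidates belong to another tenant; over-fetching recovers the matches at the cost of ranking more candidates.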
Every chunk carries a list of roles or groups allowed to see it. At query time, compute the user's allowed roles. Filter retrieval to chunks where permissions intersects with the user's roles. This is how you run multi-tenant RAG without leaks.
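The permission check itself reduces to a set intersection, sketched below. The `USER_GROUPS` table is a hypothetical stand-in for whatever your identity provider returns, and `visible_chunks` is an illustrative name; in production the same condition would normally be pushed into the vector DB query as a filter rather than applied to results afterward.

```python
from typing import Dict, List, Set

# Hypothetical user-to-groups mapping; normally resolved from an identity provider.
USER_GROUPS: Dict[str, Set[str]] = {
    "alice": {"engineering", "oncall"},
    "bob": {"sales"},
}


def visible_chunks(user: str, chunks: List[dict]) -> List[dict]:
    """Keep chunks whose permissions list shares at least one group with the user."""
    groups = USER_GROUPS.get(user, set())   # unknown users get no groups
    return [c for c in chunks if groups & set(c.get("permissions", []))]


chunks = [
    {"id": 1, "permissions": ["engineering"]},
    {"id": 2, "permissions": ["sales", "finance"]},
    {"id": 3, "permissions": []},           # no groups listed: visible to nobody
]

alice_ids = [c["id"] for c in visible_chunks("alice", chunks)]
bob_ids = [c["id"] for c in visible_chunks("bob", chunks)]
stranger_ids = [c["id"] for c in visible_chunks("mallory", chunks)]
```

The sketch fails closed: a chunk with an empty or missing permissions list, or a user with no known groups, yields no access rather than full access.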
See also multi-tenant RAG and security.