RAG (Retrieval-Augmented Generation)

An architecture that retrieves relevant documents and injects them into the model's context.

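RAG adds a fixed indexing baseline (embedding generation plus vector storage) and a per-query retrieval overhead on top of LLM inference. Total RAG cost is the sum of index refresh, embedding generation, vector storage and LLM inference, where the inference share grows with the amount of retrieved context added to each prompt.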
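
As a rough illustration of the cost sum described above, the sketch below adds up the four components for one billing period. The class name, field names and figures are hypothetical placeholders, not real pricing.

```python
from dataclasses import dataclass


@dataclass
class RagCost:
    """Hypothetical cost components for one billing period, in one currency."""
    indexing_refresh: float
    embedding_generation: float
    vector_storage: float
    llm_inference: float

    def total(self) -> float:
        # Total RAG cost is the plain sum of the four components.
        return (
            self.indexing_refresh
            + self.embedding_generation
            + self.vector_storage
            + self.llm_inference
        )


# Illustrative figures only; substitute measured values.
monthly = RagCost(
    indexing_refresh=40.0,
    embedding_generation=25.0,
    vector_storage=15.0,
    llm_inference=320.0,
)
print(monthly.total())  # 400.0
```

Keeping the components separate makes it easier to see which part of the bill is the fixed baseline and which part scales with query volume.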

Related terms

Embedding

A vector representation of text used for semantic search and RAG retrieval.
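
A minimal sketch of how embeddings support semantic search, assuming cosine similarity as the ranking measure (one common choice, not the only one). The three-dimensional vectors and document names are made up for illustration; real embeddings come from an embedding model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, docs, k=1):
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(
        docs,
        key=lambda name: cosine_similarity(query_vec, docs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (hypothetical values).
documents = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "api rate limits": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

print(top_k(query, documents))  # ['refund policy']
```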

Context window

The maximum number of tokens a model can process in a single request.
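
The check below sketches how a request might be validated against the context window. It assumes the window covers both the input and the tokens reserved for the response, which holds for many models but not all; the function name and the 8,192-token limit are illustrative.

```python
def fits_context_window(input_tokens: int, max_output_tokens: int,
                        context_window: int) -> bool:
    """True if the input plus the reserved output fits within the window.

    Assumption: the window is shared by input and output tokens.
    """
    return input_tokens + max_output_tokens <= context_window


print(fits_context_window(7_000, 1_000, 8_192))  # True
print(fits_context_window(7_500, 1_000, 8_192))  # False
```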

Input tokens

Tokens consumed by the prompt, system message and retrieved context fed into the model.
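
To make the definition concrete, the sketch below totals input tokens across the system message, the prompt and the retrieved context. The whitespace-based `count_tokens` is a crude stand-in introduced here for illustration; real tokenizers split text differently and usually yield higher counts.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def input_token_count(system_message, prompt, retrieved_chunks):
    """Sum the tokens of every part of the request fed into the model."""
    return (
        count_tokens(system_message)
        + count_tokens(prompt)
        + sum(count_tokens(chunk) for chunk in retrieved_chunks)
    )


system_message = "You are a helpful assistant."
prompt = "What is the refund window?"
retrieved_chunks = [
    "Refunds are accepted within 30 days.",
    "Items must be unused.",
]

print(input_token_count(system_message, prompt, retrieved_chunks))  # 20
```

In this toy request the retrieved chunks account for half of the input tokens, which is why retrieval settings such as chunk size and chunk count tend to dominate the input side of the bill.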