CutScore | RAG design considerations

WHAT IT IS

Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model so it references an authoritative knowledge base outside of its training data before generating a response. Rather than relying solely on what the model learned during training, RAG retrieves relevant content at inference time, adds it to the prompt, and lets the model reason over fresh, organization-specific information.

Mental model

Think of RAG as giving the model a research assistant. Before the model answers, the assistant runs to a library (your vector store), pulls the most relevant pages, and hands them to the model along with the original question. The model then answers using both its trained reasoning ability and the retrieved material — without ever needing to be retrained.
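
The retrieve, augment, and generate steps in this mental model can be sketched in a few lines. This is a minimal illustration, not a production design: `LIBRARY`, `toy_retrieve`, and `toy_generate` are hypothetical stand-ins for a vector store and a foundation model, and the word-overlap ranking is only a placeholder for vector similarity search.

```python
from typing import Callable, List

def build_augmented_prompt(question: str, passages: List[str]) -> str:
    """Combine the retrieved passages and the user question into one prompt."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

def rag_answer(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> str:
    passages = retrieve(question, top_k)                  # 1. retrieve
    prompt = build_augmented_prompt(question, passages)   # 2. augment the prompt
    return generate(prompt)                               # 3. generate; weights are untouched

# Hypothetical stand-ins so the sketch runs without any external service.
LIBRARY = [
    "Refunds are processed within 5 business days.",
    "Support is available on weekdays from 9 to 5.",
    "Shipping is free for orders over 50 dollars.",
]

def toy_retrieve(question: str, top_k: int) -> List[str]:
    """Rank passages by shared words (a placeholder for vector similarity)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        LIBRARY,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def toy_generate(prompt: str) -> str:
    """Placeholder for a model call; it only reports the prompt size."""
    return f"(model output for a prompt of {len(prompt)} characters)"

print(rag_answer("When are refunds processed?", toy_retrieve, toy_generate, top_k=1))
```

Nothing in this flow writes back to the model, which is why the knowledge base can change without retraining.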

When to use it

The exam frequently tests whether a candidate can choose between RAG and the adjacent alternatives of fine-tuning or full retraining. The table below summarizes the decision boundary, based on the AWS documentation.

Criterion	RAG	Fine-tuning / Retraining
Knowledge is external, domain-specific, or changes over time	Strong fit — retrieves current data at inference time	Weak fit — baked-in knowledge becomes stale
Goal is to give the model access to private or proprietary documents	Strong fit — documents live in the knowledge base, not the model weights	Possible but expensive and raises data-exposure concerns
Computational and financial cost is a constraint	More cost-effective — avoids expensive retraining	High computational and financial cost
Need source attribution or auditability	Strong fit — retrieved passages can be surfaced to the user	Difficult — knowledge is encoded in weights with no traceable source
Goal is to change the model's tone, style, or core behavior	Weak fit — RAG does not alter model behavior	Better fit

The RAG pipeline — key design components

Understanding the pipeline is essential because exam questions often probe individual stages.

Chunking — Before embedding, documents must be split into chunks. The chunking strategy directly affects retrieval quality:

Fixed-size chunking lets you set a maximum token count per chunk and an overlap percentage between consecutive chunks. It is predictable but may split ideas across chunk boundaries.
Default chunking splits content into approximately 300-token chunks while honoring sentence boundaries.
Hierarchical chunking organizes content into parent and child chunks. At retrieval time, the smaller child chunks are matched against the query, but the system returns the broader parent chunk to give the model more context. Smaller chunks produce more precise embeddings, while generation benefits from more comprehensive context; hierarchical chunking balances the two.
Semantic chunking divides text into chunks based on meaning rather than token count, aiming to improve retrieval accuracy by focusing on semantic content rather than syntactic structure. It uses a foundation model during ingestion, which incurs additional cost.
No chunking treats each document as a single chunk; pre-processing the documents before ingestion is recommended when using this option.
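
To make the fixed-size and hierarchical strategies concrete, here is a small sketch. It assumes whitespace-separated words as a stand-in for model tokens, and the function names `fixed_size_chunks` and `hierarchical_chunks` are illustrative only; a managed service applies its own tokenizer and boundary rules.

```python
from typing import List, Tuple

def fixed_size_chunks(
    tokens: List[str], max_tokens: int, overlap_pct: float
) -> List[List[str]]:
    """Split tokens into chunks of at most max_tokens, with consecutive
    chunks sharing roughly overlap_pct percent of their tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_pct < 100:
        raise ValueError("overlap_pct must be in [0, 100)")
    overlap = int(max_tokens * overlap_pct / 100)
    step = max_tokens - overlap  # always >= 1 because overlap < max_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks

def hierarchical_chunks(
    tokens: List[str], parent_size: int, child_size: int
) -> Tuple[List[List[str]], List[Tuple[List[str], int]]]:
    """Return parent chunks plus (child chunk, parent index) pairs.
    Children are what gets matched; the parent is what gets returned."""
    parents = fixed_size_chunks(tokens, parent_size, 0)
    children = []
    for parent_index, parent in enumerate(parents):
        for child in fixed_size_chunks(parent, child_size, 0):
            children.append((child, parent_index))
    return parents, children

tokens = list("abcdefghij")  # ten one-character "tokens"
print(fixed_size_chunks(tokens, max_tokens=4, overlap_pct=50))  # 4 chunks
parents, children = hierarchical_chunks(tokens, parent_size=4, child_size=2)
print(len(parents), len(children))  # 3 parents, 5 children
```

With 50% overlap, each idea near a chunk boundary appears in two chunks, which is the trade-off against the extra storage and embedding cost.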

Embedding model selection — The embedding model converts both source documents and user queries into vector representations. The same model must be used for both ingestion and retrieval, because vectors produced by different models do not share a vector space and cannot be meaningfully compared. Higher-dimensional vectors can capture more detail and often improve retrieval accuracy, but they increase storage cost and latency.
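
The consistency requirement can be illustrated with a toy embedding. The sketch below is an assumption-laden stand-in: `toy_embed` hashes words into a fixed number of buckets and is not a real embedding model, and the `model_name` values are hypothetical. It only shows why query and document vectors must come from the same model.

```python
import hashlib
import math
from typing import List

def toy_embed(text: str, dims: int, model_name: str) -> List[float]:
    """Hypothetical embedding: hash each word into one of `dims` buckets.
    Different model_name values map the same word to different buckets."""
    vector = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.sha256(f"{model_name}:{word}".encode("utf-8")).digest()
        vector[int.from_bytes(digest[:4], "big") % dims] += 1.0
    return vector

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; refuses vectors of different dimensions."""
    if len(a) != len(b):
        raise ValueError("vectors come from different embedding spaces")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

doc = toy_embed("reset my password", dims=64, model_name="model-a")
query_same = toy_embed("reset my password", dims=64, model_name="model-a")
query_other = toy_embed("reset my password", dims=64, model_name="model-b")

print(cosine(doc, query_same))   # identical text, same model: similarity of about 1.0
print(cosine(doc, query_other))  # same text, different model: score is not meaningful
```

A dimension mismatch fails loudly, but two models with the same dimension fail silently, which is the more dangerous case in practice.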

Embeddings type — Floating-point (float32) embeddings are more precise; binary vector embeddings are less precise but less costly. Not all vector stores support binary vectors.
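
The storage side of this trade-off is easy to see in a sketch. The code below assumes sign-based binarization (one bit per dimension, set when the value is positive), which is one simple way to derive binary vectors and not necessarily how a given embedding model produces them. The function names are illustrative.

```python
import struct
from typing import List

def to_float32_bytes(vector: List[float]) -> bytes:
    """Pack each dimension as a 4-byte float32."""
    return struct.pack(f"{len(vector)}f", *vector)

def to_binary_bytes(vector: List[float]) -> bytes:
    """Keep one bit per dimension: 1 if the value is positive, else 0."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits.to_bytes((len(vector) + 7) // 8, "big")

def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits; the usual distance for binary vectors."""
    if len(a) != len(b):
        raise ValueError("binary vectors must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

vector = [0.12, -0.5, 0.33, -0.01, 0.9, 0.4, -0.7, 0.05]
print(len(to_float32_bytes(vector)))  # 32 bytes for 8 dimensions
print(len(to_binary_bytes(vector)))   # 1 byte for 8 dimensions
```

The binary form is 32 times smaller, but it discards magnitude, so two vectors that differ only in how strongly a dimension is expressed become indistinguishable.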

Vector store — The vector database stores the numerical representations and supports similarity search. The choice of embedding model and vector dimensions can affect which vector stores are compatible.
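
As a rough illustration of what a vector store does, the following sketch keeps vectors in memory and ranks them by cosine similarity. `InMemoryVectorStore` is a hypothetical class for teaching purposes; real vector databases use approximate nearest-neighbor indexes instead of scanning every vector.

```python
import math
from typing import List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class InMemoryVectorStore:
    """Toy store: fixed dimension, brute-force similarity search."""

    def __init__(self, dims: int) -> None:
        self.dims = dims
        self._items: List[Tuple[List[float], str]] = []

    def add(self, vector: List[float], text: str) -> None:
        if len(vector) != self.dims:
            raise ValueError(f"expected {self.dims} dimensions, got {len(vector)}")
        self._items.append((vector, text))

    def search(self, query: List[float], top_k: int = 3) -> List[Tuple[float, str]]:
        if len(query) != self.dims:
            raise ValueError(f"expected {self.dims} dimensions, got {len(query)}")
        scored = [(cosine(query, vector), text) for vector, text in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]

store = InMemoryVectorStore(dims=3)
store.add([1.0, 0.0, 0.0], "pricing")
store.add([0.0, 1.0, 0.0], "returns")
store.add([0.9, 0.1, 0.0], "discounts")
print([text for _, text in store.search([1.0, 0.0, 0.0], top_k=2)])  # ['pricing', 'discounts']
```

The dimension check in `add` and `search` mirrors the compatibility point above: a store configured for one vector size cannot accept embeddings of another.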

Data freshness — Because RAG retrieves from an external knowledge base rather than relying on training data, keeping the knowledge base current requires updating the source documents and re-ingesting (re-embedding) them through asynchronous processes or periodic batch updates.
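
One common way to keep a periodic batch update cheap is to re-embed only what changed. The sketch below assumes the pipeline stores a content hash per document at ingestion time; `plan_sync` and the dictionary shapes are hypothetical, not part of any specific service.

```python
import hashlib
from typing import Dict, List, Tuple

def content_hash(text: str) -> str:
    """Fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_sync(
    source_docs: Dict[str, str], indexed_hashes: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Compare the source of truth with what was last ingested.
    Returns (ids to re-embed, ids to delete from the vector store)."""
    to_embed = [
        doc_id
        for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in source_docs]
    return sorted(to_embed), sorted(to_delete)

indexed = {
    "policy": content_hash("v1"),
    "pricing": content_hash("old prices"),
    "retired": content_hash("gone"),
}
source = {"policy": "v1", "pricing": "new prices", "faq": "fresh"}
print(plan_sync(source, indexed))  # (['faq', 'pricing'], ['retired'])
```

Deleting stale entries matters as much as adding new ones, since an outdated chunk that remains in the store can still be retrieved and presented as current.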

COMMON MISCONCEPTION

The trap: "RAG trains the model on your data."

RAG does not modify the model's weights in any way. The model is never retrained or fine-tuned. RAG extends the model's effective knowledge by providing retrieved context in the prompt at inference time. Candidates who conflate RAG with fine-tuning may incorrectly conclude that RAG is expensive (like retraining), that it permanently changes model behavior, or that the model "remembers" retrieved content across sessions. None of these are true. The cost advantage of RAG over retraining is explicitly stated in the AWS documentation: "The computational and financial costs of retraining FMs for organization or domain-specific information are high. RAG is a more cost-effective approach."

A related misconception is that RAG solves all hallucination problems. RAG reduces hallucination by grounding responses in retrieved content, but it does not eliminate the possibility of the model generating inaccurate text — particularly if the retrieved passages are themselves incomplete or if the retrieval step fails to surface the right content.

How it shows up on the exam

The cognitive target for RAG design considerations is application and analysis: candidates are expected to match a business requirement to the right architectural choice, and to identify what goes wrong when a design decision is poorly made.

Signal phrases to watch for:

"proprietary documents," "internal knowledge base," "organization-specific information" — these point toward RAG rather than retraining
"current," "up-to-date," "real-time data" — these highlight RAG's data-freshness advantage
"without retraining the model" — this is the defining characteristic of RAG
"hallucinations," "false information," "outdated responses" — these are the problems RAG is designed to mitigate
"chunk size," "overlap," "embedding model," "vector database" — these signal a question about RAG pipeline design choices

Candidates often confuse the chunking and document-embedding stages (both happen during ingestion) with the retrieval stage (which happens at inference time, when the query is embedded with the same model and matched against the stored vectors). The pipeline is sequential and each stage has distinct design levers. Understanding which lever affects which outcome is the core skill being tested.

RAG design considerations — AIF-C01