RAG design considerations — AIF-C01
RAG design considerations for AWS AIF-C01: pipeline stages, chunking strategies, embedding trade-offs, and when to choose RAG over fine-tuning or retraining.
WHAT IT IS
Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model so it references an authoritative knowledge base outside of its training data before generating a response. Rather than relying solely on what the model learned during training, RAG retrieves relevant content at inference time, adds it to the prompt, and lets the model reason over fresh, organization-specific information.
Mental model
Think of RAG as giving the model a research assistant. Before the model answers, the assistant runs to a library (your vector store), pulls the most relevant pages, and hands them to the model along with the original question. The model then answers using both its trained reasoning ability and the retrieved material — without ever needing to be retrained.
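The research-assistant flow can be sketched in a few lines. This is a minimal illustration, not a production design: the knowledge base, the word-overlap scoring (a stand-in for vector similarity), and the function names `retrieve` and `build_prompt` are all hypothetical, and the final call to a model is left out.

```python
import re

# Hypothetical toy knowledge base; a real system would query a vector store.
KNOWLEDGE_BASE = [
    "Refunds are processed within 5 business days.",
    "Support is available Monday through Friday.",
    "Premium plans include priority support.",
]

def words(text: str) -> set[str]:
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k passages sharing the most words with the question.

    Word overlap is an assumed stand-in for embedding similarity.
    """
    q = words(question)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda p: len(q & words(p)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Augment the prompt with retrieved context; model weights are untouched."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

question = "How long do refunds take?"
passages = retrieve(question, k=1)
prompt = build_prompt(question, passages)
# `prompt` would now be sent to the foundation model at inference time.
```

The only thing that changes between questions is the prompt; nothing is written back to the model.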
When to use it
The exam frequently tests whether a candidate can choose between RAG and the adjacent alternatives of fine-tuning or full retraining. The table below summarizes the decision criteria, based on the AWS documentation.
| Criterion | RAG | Fine-tuning / Retraining |
|---|---|---|
| Knowledge is external, domain-specific, or changes over time | Strong fit — retrieves current data at inference time | Weak fit — baked-in knowledge becomes stale |
| Goal is to give the model access to private or proprietary documents | Strong fit — documents live in the knowledge base, not the model weights | Possible but expensive and raises data-exposure concerns |
| Computational and financial cost is a constraint | More cost-effective — avoids expensive retraining | High computational and financial cost |
| Need source attribution or auditability | Strong fit — retrieved passages can be surfaced to the user | Difficult — knowledge is encoded in weights with no traceable source |
| Goal is to change the model's tone, style, or core behavior | Weak fit — RAG does not alter model behavior | Better fit |
The RAG pipeline — key design components
Understanding the pipeline is essential because exam questions often probe individual stages.
Chunking — Before embedding, documents must be split into chunks. The chunking strategy directly affects retrieval quality:
- Fixed-size chunking lets you set a maximum token count per chunk and an overlap percentage between consecutive chunks. It is predictable but may split ideas across chunk boundaries.
- Default chunking (the default behavior in Amazon Bedrock Knowledge Bases) splits content into chunks of approximately 300 tokens while honoring sentence boundaries.
- Hierarchical chunking organizes content into parent and child chunks. At retrieval time, the smaller child chunks are matched because their embeddings are more precise, but the system returns the broader parent chunk so the model receives more context. This balances precise matching against comprehensive context.
- Semantic chunking divides text into chunks based on meaning rather than token count, aiming to improve retrieval accuracy by focusing on semantic content rather than syntactic structure. It uses a foundation model during ingestion, which incurs additional cost.
- No chunking treats each document as a single chunk; pre-processing the documents before ingestion is recommended when using this option.
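To make the first and third strategies concrete, here is a rough sketch of fixed-size chunking with overlap and of a parent/child split. It assumes whitespace-separated words as "tokens" (real tokenizers differ), and the function names and parameters are illustrative, not any service's API.

```python
def fixed_size_chunks(tokens: list[str], max_tokens: int, overlap_pct: float) -> list[list[str]]:
    """Split tokens into chunks of at most max_tokens, overlapping by overlap_pct."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_pct < 1:
        raise ValueError("overlap_pct must be in [0, 1)")
    overlap = int(max_tokens * overlap_pct)
    step = max_tokens - overlap  # always >= 1 because overlap < max_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks

def hierarchical_chunks(tokens: list[str], parent_size: int, child_size: int):
    """Return (parents, children); each child is (parent_index, child_tokens)."""
    parents = fixed_size_chunks(tokens, parent_size, 0.0)
    children = [
        (i, child)
        for i, parent in enumerate(parents)
        for child in fixed_size_chunks(parent, child_size, 0.0)
    ]
    return parents, children

tokens = [f"t{i}" for i in range(10)]

# 4-token chunks with 25% overlap: consecutive chunks share one token.
flat = fixed_size_chunks(tokens, max_tokens=4, overlap_pct=0.25)

# Match on a small child, then hand the model the larger parent.
parents, children = hierarchical_chunks(tokens, parent_size=6, child_size=3)
parent_index, matched_child = children[2]
returned_context = parents[parent_index]
```

In the flat case the shared token at each boundary is what keeps an idea from being cut cleanly in two; in the hierarchical case the child is what gets matched and the parent is what gets returned.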
Embedding model selection — The embedding model converts both source documents and user queries into vector representations. The same model must be used for both ingestion and retrieval, because the vector space must be consistent. Higher vector dimensions improve accuracy but increase cost and latency.
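The consistency requirement can be illustrated with a toy example. The `toy_embed` function below is a hypothetical hashed bag-of-words stand-in for a real embedding model, with `dim` playing the role of the model's vector dimension; it only shows why query and document vectors must come from the same space.

```python
import math
import re
import zlib

def toy_embed(text: str, dim: int) -> list[float]:
    """Hypothetical embedding: count words into `dim` hashed buckets."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; refuses vectors of different dimensions."""
    if len(a) != len(b):
        raise ValueError("vectors come from different embedding spaces")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

doc_vec = toy_embed("refund policy for premium plans", dim=64)

# Same "model" (same function, same dim) for the query: comparison is meaningful.
same_space = cosine(toy_embed("refund policy for premium plans", dim=64), doc_vec)

# A different "model" for the query: the comparison cannot even be computed.
try:
    cosine(toy_embed("refund policy for premium plans", dim=32), doc_vec)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

A length check only catches models with different dimensions. Two different models with the same dimension would pass this check and still produce meaningless similarity scores, which is why the ingestion model should be recorded alongside the index.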
Embeddings type — Floating-point (float32) embeddings are more precise; binary vector embeddings are less precise but less costly. Not all vector stores support binary vectors.
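The cost side of this trade-off is simple arithmetic, and the precision side can be seen with a small sketch. The sign-threshold `binarize` function below is an assumed, simplified quantization for illustration; it is not a description of how any particular embedding model produces binary vectors.

```python
def storage_bytes(dims: int, bits_per_dim: int) -> int:
    """Raw storage for one vector, ignoring index overhead."""
    return dims * bits_per_dim // 8

def binarize(vec: list[float]) -> list[int]:
    """Assumed simple quantization: keep only the sign of each dimension."""
    return [1 if x > 0 else 0 for x in vec]

def hamming(a: list[int], b: list[int]) -> int:
    """Number of positions where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

float32_size = storage_bytes(1024, 32)  # 4096 bytes per vector
binary_size = storage_bytes(1024, 1)    # 128 bytes per vector

a = [0.9, -0.2, 0.1, -0.7]
b = [0.8, -0.1, -0.05, -0.6]
distance = hamming(binarize(a), binarize(b))  # the vectors differ in one bit
```

For a 1024-dimension vector, the binary form is 32 times smaller, but all magnitude information is gone: only the sign flip in the third dimension separates `a` from `b`.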
Vector store — The vector database stores the numerical representations and supports similarity search. The choice of embedding model and vector dimensions can affect which vector stores are compatible.
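As a rough picture of what a vector store does, the sketch below keeps vectors in memory and returns the top-k most similar items by brute force. `TinyVectorStore` is a hypothetical class for illustration only; real vector databases use approximate indexes rather than scanning every item.

```python
import math

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    """Hypothetical in-memory store with a fixed vector dimension."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self._items: list[tuple[str, list[float]]] = []

    def add(self, item_id: str, vector: list[float]) -> None:
        # The store is configured for one dimension, so the embedding
        # model's output size determines whether it is compatible.
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self._items.append((item_id, vector))

    def search(self, query: list[float], k: int = 3) -> list[tuple[str, float]]:
        """Return up to k (id, similarity) pairs, most similar first."""
        if len(query) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(query)}")
        scored = [(item_id, _cosine(query, vec)) for item_id, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]

store = TinyVectorStore(dim=2)
store.add("a", [1.0, 0.0])
store.add("b", [0.0, 1.0])
store.add("c", [0.9, 0.1])
top_ids = [item_id for item_id, _ in store.search([1.0, 0.0], k=2)]
```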
Data freshness — Because RAG retrieves from an external knowledge base rather than relying on training data, keeping the knowledge base current requires updating the source documents and re-ingesting (re-embedding) them through asynchronous processes or periodic batch updates.
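One way a periodic batch update might decide what to re-embed is to compare content hashes, so unchanged documents are skipped. This is a sketch under that assumption; `plan_sync` and the hash bookkeeping are hypothetical, not a feature of any particular service.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_sync(source_docs: dict[str, str], indexed_hashes: dict[str, str]):
    """Compare sources against what was last ingested.

    Returns (to_embed, to_delete): ids that are new or changed and need
    re-embedding, and ids that no longer exist in the source.
    """
    to_embed = sorted(
        doc_id
        for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in source_docs)
    return to_embed, to_delete

# State recorded at the previous ingestion run (hypothetical data).
indexed = {
    "a": content_hash("v1"),
    "b": content_hash("old text"),
    "c": content_hash("removed later"),
}
# Current state of the source documents.
source = {"a": "v1", "b": "new text", "d": "brand new"}

to_embed, to_delete = plan_sync(source, indexed)
```

Here only the changed document `b` and the new document `d` are re-embedded, and the stale entry `c` is removed from the index, which keeps the cost of each sync proportional to what actually changed.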
COMMON MISCONCEPTION
The trap: "RAG trains the model on your data."
RAG does not modify the model's weights in any way. The model is never retrained or fine-tuned. RAG extends the model's effective knowledge by providing retrieved context in the prompt at inference time. Candidates who conflate RAG with fine-tuning may incorrectly conclude that RAG is expensive (like retraining), that it permanently changes model behavior, or that the model "remembers" retrieved content across sessions. None of these are true. The cost advantage of RAG over retraining is explicitly stated in the AWS documentation: "The computational and financial costs of retraining FMs for organization or domain-specific information are high. RAG is a more cost-effective approach."
A related misconception is that RAG solves all hallucination problems. RAG reduces hallucination by grounding responses in retrieved content, but it does not eliminate the possibility of the model generating inaccurate text — particularly if the retrieved passages are themselves incomplete or if the retrieval step fails to surface the right content.
How it shows up on the exam
The cognitive target for RAG design considerations is application and analysis: candidates are expected to match a business requirement to the right architectural choice, and to identify what goes wrong when a design decision is poorly made.
Signal phrases to watch for:
- "proprietary documents," "internal knowledge base," "organization-specific information" — these point toward RAG rather than retraining
- "current," "up-to-date," "real-time data" — these highlight RAG's data-freshness advantage
- "without retraining the model" — this is the defining characteristic of RAG
- "hallucinations," "false information," "outdated responses" — these are the problems RAG is designed to mitigate
- "chunk size," "overlap," "embedding model," "vector database" — these signal a question about RAG pipeline design choices
Candidates often confuse the chunking and embedding stages (both happen during ingestion) with the retrieval stage (which happens at inference time). The pipeline is sequential and each stage has distinct design levers — understanding which lever affects which outcome is the core skill being tested.
Sources
Every claim on this page traces to the public exam blueprint and official documentation: