
Retrieval Augmented Generation (RAG)

6,420 words · about a 33-minute read

RAG (Retrieval Augmented Generation) is the pattern that lets a large language model answer questions using fresh, private, or domain-specific data it was never trained on. On the AWS Certified AI Practitioner (AIF-C01) exam, RAG is the single most-tested generative AI architecture because it solves the two biggest production risks of foundation models at once — the knowledge-frozen-at-training-time problem and the hallucination risk. Instead of baking new facts into the model's weights (expensive, slow, brittle), RAG retrieves relevant passages at query time and injects them into the prompt so the model grounds its answer on verifiable evidence. On AWS, RAG is delivered through Amazon Bedrock Knowledge Bases as the fully managed option, or through a custom pipeline built on Amazon OpenSearch Service, Amazon Aurora PostgreSQL with pgvector, Pinecone, or Redis, with Amazon Bedrock serving the foundation model. This study guide walks every AIF-C01 candidate through the RAG pattern end to end — ingest, chunk, embed, index, retrieve, augment, generate — and spells out every exam trap between RAG, fine-tuning, and the hybrid approach.

Expect three to six RAG questions on a typical AIF-C01 sitting. RAG is explicitly named in the AIF-C01 exam guide under Domain 3 (Applications of Foundation Models) and Domain 4 (Guidelines for Responsible AI), and it is the most common answer to scenarios phrased as "the company has internal documents the model has not seen" or "reduce hallucinations without retraining."

What is RAG (Retrieval Augmented Generation)?

RAG stands for Retrieval Augmented Generation. RAG is a generative AI design pattern that plugs a search step in front of a language model so the model can read authoritative source material before it answers. The RAG pattern has two phases. In the indexing phase, source documents are ingested from Amazon S3 (or Confluence, SharePoint, Salesforce, web crawl, and more), split into chunks, converted into vector embeddings by an embedding model such as Amazon Titan Text Embeddings or Cohere Embed, and stored in a vector database such as Amazon OpenSearch Serverless, Amazon Aurora pgvector, Pinecone, or Redis. In the query phase, the user question is embedded with the same embedding model, the top-k most similar chunks are retrieved by vector similarity search (optionally hybrid with keyword BM25), those chunks are injected into the prompt as context, and a foundation model on Amazon Bedrock generates the final answer with citations back to the source.

RAG is not a product name — it is an architectural pattern. AWS ships two flavours: Amazon Bedrock Knowledge Bases hides every step behind a single API, while a custom RAG pipeline gives you control over every knob (chunk size, retriever, reranker, hybrid weighting, prompt template, generation model).

Why RAG matters for AIF-C01

The AIF-C01 exam guide lists RAG under Task Statement 3.2 ("Select appropriate generative AI models and techniques for various use cases") and Task 4.2 ("Responsible AI practices" — hallucination mitigation). Generative AI questions grew +35 percent across AWS certifications in the past year. Every recent AIF-C01 form includes at least one scenario of the shape "the model must answer using internal docs the FM has never seen — what is the most cost-effective approach?" The expected answer is almost always RAG with Amazon Bedrock Knowledge Bases, not fine-tuning, not continued pre-training, and not training from scratch.

RAG (Retrieval Augmented Generation) in Plain Language

RAG (Retrieval Augmented Generation) sounds like a research term until you rewrite it in everyday language. Three analogies turn the whole pattern into common sense.

Analogy 1 — The open-book exam

A closed-book exam is like a vanilla foundation model. Whatever the student memorized during study (training) is all they can use. If a brand-new chapter was released yesterday, the student fails any question about it, because the knowledge is frozen at training time. Worse, a student who is afraid of a blank answer sheet will confidently make things up — that is exactly what a language model hallucination is.

An open-book exam is RAG. The student walks in with a well-organized binder (the vector index). When a question arrives, the student does not try to answer from memory alone. First they flip through the table of contents and locate the two or three pages that match the topic (top-k retrieval). Only then do they write the answer, quoting the relevant sentences (grounded generation with citations).

This is the single most important analogy on AIF-C01. RAG converts closed-book mode into open-book mode. The foundation model stays the same — what changes is the binder the model is allowed to read during the test.

Analogy 2 — The library reference desk

Picture a researcher at a library reference desk. The researcher has deep general knowledge but does not know your company's internal documents.

  • Ingest and chunk is the librarian who takes your binders and photocopies them one chapter per folder.
  • Embed is the librarian translating each folder's topic into a coordinate on a giant map, so similar folders cluster together.
  • Index is the map itself — the vector database, be it Amazon OpenSearch Serverless, Amazon Aurora pgvector, Pinecone, or Redis.
  • Retrieve is the researcher walking to the map, pinpointing the region that matches the question, and pulling the three closest folders (top-k).
  • Augment is the researcher laying those folders on the desk in front of the foundation model.
  • Generate is the foundation model reading the folders and composing a grounded answer with page references.

In this picture, fine-tuning is a very different activity — it would be sending the researcher back to school to memorize every folder word-for-word. Much slower, much more expensive, and if the folders change tomorrow you have to send them back to school again. RAG just swaps a folder on the map.

Analogy 3 — The kitchen pantry

A chef (the foundation model) knows every classic technique. Their pantry (the pretraining corpus) holds staple ingredients — salt, flour, sugar — but no truffles, no wagyu, and none of your company's secret sauce.

  • RAG is the sous chef running to the walk-in fridge (vector store) at the start of each order and placing only the exact special ingredients the dish needs on the pass (prompt context). The chef cooks with whatever is on the pass.
  • Fine-tuning is re-training the chef's muscle memory by making them practice a thousand identical recipes until the moves are automatic. Good when you want a permanent style change, overkill when you just need today's special ingredient.
  • A vanilla FM call with no grounding is the chef improvising — sometimes brilliant, sometimes inventing ingredients that do not exist (hallucination).

If the scenario on the exam says "the answer must cite the exact internal document" or "the company updates its pricing catalog weekly," the sous-chef-to-walk-in-fridge round trip (RAG) beats sending the chef to culinary school (fine-tuning) every time.

Core Operating Principles — The RAG Problem and the RAG Pattern

The two problems RAG solves

Foundation models have two structural limits the AIF-C01 exam loves to test.

  1. Knowledge is frozen at training time. A foundation model like Anthropic Claude, Meta Llama, or Amazon Titan was trained on data up to a training cutoff. Anything created after that date, plus anything private that was never on the public internet, is invisible to the model. No amount of prompt engineering teaches the model your Q3 2026 product roadmap if that roadmap was not in its training data.
  2. Hallucination risk. When asked a question it cannot answer, an ungrounded foundation model may produce a fluent, confident, and completely fabricated answer. This is the number-one responsible-AI concern on AIF-C01. RAG mitigates hallucination by forcing the model to base its answer on retrieved passages, and by letting the application cite those passages back to the user.

RAG solves both problems with the same mechanism — retrieve authoritative passages at query time, stuff them into the prompt, let the model read them before answering.

RAG is a generative AI technique that augments a foundation model's input prompt with passages retrieved from an external knowledge source (typically a vector database) so the model can generate answers grounded in up-to-date, private, or domain-specific content it was not trained on. RAG reduces hallucination, keeps knowledge current, and enables source citation, all without modifying the model's weights. Source ↗

The seven-step RAG pattern

Memorize this sequence — AIF-C01 questions can target any step.

  1. Ingest — pull source documents from Amazon S3, Confluence, Microsoft SharePoint, Salesforce, web crawlers, or databases. Amazon Bedrock Knowledge Bases supports all of these as first-class data sources.
  2. Parse — extract text (and optionally layout) from PDFs, DOCX, HTML, Markdown, CSV. Amazon Bedrock Knowledge Bases can parse complex documents using an FM-based parser for tables and figures.
  3. Chunk — split long documents into retrievable units (fixed-size, semantic, or hierarchical — see the next section).
  4. Embed — convert each chunk into a dense vector using an embedding model such as Amazon Titan Text Embeddings v2, Cohere Embed, or an open-source model on Amazon SageMaker.
  5. Index — store vectors (and source metadata) in a vector database: Amazon OpenSearch Serverless vector collection, Amazon Aurora PostgreSQL pgvector, Amazon Neptune Analytics, Pinecone, Redis Enterprise Cloud, or Amazon DocumentDB.
  6. Retrieve — at query time, embed the user question with the same embedding model, search the vector store for the top-k nearest neighbours (cosine or Euclidean), optionally combine with keyword search (hybrid retrieval), and optionally rerank.
  7. Augment + Generate — build a prompt template that injects the retrieved chunks as context, send to a foundation model on Amazon Bedrock (Claude, Llama, Titan, Nova, Mistral), and return the generated answer with citations.

This ingest → chunk → embed → index → retrieve → augment → generate pipeline is the canonical RAG pattern. Every RAG question on AIF-C01 maps to one of these seven steps.

RAG = Ingest, Parse, Chunk, Embed, Index (the offline indexing phase) then Retrieve, Augment, Generate (the online query phase). If the scenario emphasizes "up-to-date private data without retraining," RAG is almost always the answer. If it emphasizes "source citation to reduce hallucination," RAG is again the answer. Source ↗
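The online half of the pipeline, steps 6 and 7, can be sketched in plain Python. This is a toy illustration: toy_embed (a bag-of-words counter) stands in for a real embedding model such as Amazon Titan Text Embeddings, and the linear cosine scan stands in for the vector store's nearest-neighbour index.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    # (A real model would also match "refund" with "refunds"; this toy does not.)
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_top_k(query, chunks, k=2):
    # Step 6: embed the query with the SAME model used at ingestion,
    # then rank chunks by vector similarity.
    q = toy_embed(query)
    return sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)[:k]

chunks = [
    "Refunds are processed within 5 business days.",
    "Our headquarters is located in Seattle.",
    "To request a refund, open a support ticket.",
]
context = retrieve_top_k("How do I get a refund?", chunks)
# Step 7: augment, then send the prompt to a generation model.
prompt = (
    "Answer only from the context below.\n"
    + "\n".join(context)
    + "\nQuestion: How do I get a refund?"
)
```

A production pipeline swaps toy_embed for a Bedrock embedding call and the scan for an OpenSearch k-NN query, but the retrieve-then-augment control flow is the same.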

Chunking Strategies — The Most-Tuned Knob in RAG

Chunking — the step that splits a long document into retrievable units — is the single biggest lever for RAG quality. Chunks that are too small lose context; chunks that are too large waste tokens and dilute relevance. AIF-C01 does not ask you to pick a specific character count, but it does ask you to recognize strategy names. Amazon Bedrock Knowledge Bases lists four chunking strategies explicitly.

Fixed-size chunking

Split every document into chunks of N tokens (for example 300 tokens with a 20-token overlap). Simple, fast, deterministic, and the default for most RAG starter pipelines.

  • Pros: predictable embedding cost, predictable retrieval latency.
  • Cons: blindly cuts sentences and paragraphs in half, which hurts retrieval quality on narrative documents.
  • On AWS: Amazon Bedrock Knowledge Bases default chunking is fixed-size with configurable max tokens and overlap.
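As a concrete sketch, fixed-size chunking with overlap fits in a few lines. This toy version counts whitespace-separated words instead of real model tokens, and the 300/20 parameters simply mirror the defaults mentioned in this guide:

```python
def fixed_size_chunks(text, max_tokens=300, overlap=20):
    """Split text into chunks of at most max_tokens units, each chunk
    sharing `overlap` units with the previous one. Whitespace-separated
    words stand in for real tokenizer tokens here."""
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

# A 700-word document yields chunks starting at words 0, 280, and 560.
doc = " ".join(f"w{i}" for i in range(700))
parts = fixed_size_chunks(doc)
```

The overlap means the last 20 words of one chunk reappear at the start of the next, so a sentence cut at a boundary is still retrievable from at least one chunk.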

Semantic chunking

Split documents at semantic boundaries detected by an embedding model — two adjacent sentences are kept in the same chunk if their embeddings are similar and split when similarity drops. Produces variable-length chunks that respect topic transitions.

  • Pros: each chunk is a coherent idea, which boosts retrieval precision.
  • Cons: embedding cost during ingestion is higher; indexing is slower.
  • On AWS: Amazon Bedrock Knowledge Bases offers semantic chunking as a first-class option.

Hierarchical chunking

Split documents into two levels — parent chunks (long, high-context) and child chunks (short, high-precision). The retriever matches on child chunks (precise) but returns the parent chunks (more context) to the generator. Also called parent-child retrieval.

  • Pros: combines precision of small chunks with context of large chunks; strong choice for technical manuals and contracts.
  • Cons: more storage, more retrieval logic.
  • On AWS: Amazon Bedrock Knowledge Bases supports hierarchical chunking natively.
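Parent-child retrieval can be illustrated without any AWS dependency: match on the small child chunks, return the large parent. In this toy sketch a substring test stands in for vector search:

```python
def build_hierarchy(sections, child_size=2):
    """Parents are full sections; children are short sentence windows
    that record the index of their parent."""
    children = []  # list of (child_text, parent_index)
    for p_idx, section in enumerate(sections):
        sentences = [s.strip() for s in section.split(".") if s.strip()]
        for i in range(0, len(sentences), child_size):
            children.append((". ".join(sentences[i:i + child_size]), p_idx))
    return children

def parent_child_retrieve(query, sections, children):
    # Match precisely on the small child, return the high-context parent.
    for child, p_idx in children:
        if query.lower() in child.lower():
            return sections[p_idx]
    return None

sections = [
    "Termination clause. Either party may terminate with 30 days notice. Fees accrue until termination.",
    "Liability clause. Liability is capped at fees paid. Indirect damages are excluded.",
]
children = build_hierarchy(sections)
answer_context = parent_child_retrieve("indirect damages", sections, children)
```

The generator sees the full liability section, not just the one matching sentence, which is exactly the precision-plus-context trade this strategy buys.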

Custom / no chunking

Bring your own splitter, or skip chunking entirely for already-short documents like FAQ entries or SKU descriptions.

  • Pros: full control, perfect for data that already has natural boundaries.
  • Cons: you own the pipeline code and maintenance.

The right chunk size is the smallest passage that still answers the question on its own. For FAQ and SKU data, one entry per chunk is ideal. For policy manuals with nested sections, hierarchical chunking wins. For narrative docs like earnings transcripts, semantic chunking outperforms fixed-size. Start with Amazon Bedrock Knowledge Bases defaults (300 tokens, 20-token overlap), measure retrieval quality, then switch to semantic or hierarchical if precision is low. Source ↗

Retrieval Quality Evaluation

RAG quality is retrieval quality plus generation quality. Bad retrieval means even a good foundation model produces answers grounded in the wrong passages. AIF-C01 expects you to recognize the core metrics.

Retrieval metrics

  • Recall@k — of all the truly relevant passages, what fraction appeared in the top-k retrieved. High recall matters when you cannot afford to miss evidence (legal, medical, compliance).
  • Precision@k — of the k retrieved passages, what fraction are relevant. High precision matters when the generator's context window is small and every token counts.
  • Mean Reciprocal Rank (MRR) — average of 1/rank of the first relevant result. Rewards ranking the best passage near the top.
  • nDCG (normalized Discounted Cumulative Gain) — rewards relevant passages near the top with graded relevance.
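These metrics are simple enough to compute by hand, which is worth doing once. A minimal sketch for a single query, with documents identified by ID:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result (single query)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1 / rank
    return 0.0

retrieved = ["d7", "d2", "d9", "d4"]   # ranked retriever output
relevant = {"d2", "d4", "d5"}          # ground-truth relevant doc IDs
# Only d2 appears in the top 3, so Recall@3 = 1/3 and Precision@3 = 1/3;
# the first relevant hit sits at rank 2, so MRR = 0.5.
```

In a real evaluation these are averaged over a test set of labelled queries; tools like RAGAS and Amazon Bedrock Knowledge Base evaluation do that averaging for you.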

Generation metrics

  • Faithfulness / groundedness — does every claim in the answer trace back to a retrieved passage. Amazon Bedrock model evaluation supports groundedness scoring with an LLM-as-judge.
  • Answer relevance — does the answer actually address the user question, not the retrieved trivia.
  • Context relevance — are the retrieved passages the right ones for the question.

Evaluation tooling on AWS

  • Amazon Bedrock Model Evaluation (Knowledge Base evaluation) — runs automatic evaluation on a RAG pipeline with built-in groundedness, relevance, and correctness metrics.
  • Amazon SageMaker Clarify — can be extended for fairness and bias checks on RAG outputs.
  • RAGAS (open source) — popular library for groundedness, answer relevance, context precision.

If the final answer is wrong, figure out whether the retriever missed the passage (recall problem) or whether the passage was present but the model ignored it (generation problem). Tuning chunk size, embedding model, or hybrid weight fixes retrieval. Changing prompt template, few-shot examples, or foundation model fixes generation. AIF-C01 scenarios that say "the answer is fluent but contradicts the source" point to a generation prompt problem, not a retrieval problem. Source ↗

AWS RAG Stack Option 1 — Amazon Bedrock Knowledge Bases

Amazon Bedrock Knowledge Bases is the fully managed RAG service on AWS. It is the default exam answer whenever the scenario says "build a RAG app with minimal operational overhead."

What Amazon Bedrock Knowledge Bases handles for you

  • Data source connectors — Amazon S3, web crawler, Confluence, Microsoft SharePoint, Salesforce, and Amazon Kendra (reusable index).
  • Parsing — plain text plus FM-based parsing for complex PDFs with tables and figures.
  • Chunking — fixed-size, semantic, hierarchical, none, or a custom AWS Lambda transformer.
  • Embedding — Amazon Titan Text Embeddings v2, Cohere Embed multilingual, or other supported embedding models on Amazon Bedrock.
  • Vector store — choice of Amazon OpenSearch Serverless (default, fully managed), Amazon Aurora PostgreSQL with pgvector, Amazon Neptune Analytics, Pinecone, Redis Enterprise Cloud, or Amazon DocumentDB.
  • Retrieval API — Retrieve returns raw passages; RetrieveAndGenerate returns a finished answer with citations.
  • Agents integration — Amazon Bedrock Agents can invoke a Knowledge Base as one of its tools for multi-step reasoning.

Anatomy of a RetrieveAndGenerate call

  1. Caller passes a natural-language query.
  2. Query is embedded with the same embedding model used at ingestion.
  3. Top-k vectors are retrieved from the chosen vector store.
  4. Optional reranker reorders results.
  5. Retrieved passages are merged into a prompt template.
  6. A foundation model on Amazon Bedrock (Claude, Llama, Titan, Nova) generates the answer.
  7. Response includes both the text and an array of citations (source URI, page, snippet).
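In code, that whole sequence is one API call through boto3's bedrock-agent-runtime client. The sketch below only builds the request body, with placeholder IDs; the field names follow the RetrieveAndGenerate API, but verify them against the current boto3 reference before relying on them:

```python
def build_rag_request(query, knowledge_base_id, model_arn, top_k=5):
    """Build the request body for Bedrock's RetrieveAndGenerate API.
    IDs and ARN are placeholders; field names per bedrock-agent-runtime."""
    return {
        "input": {"text": query},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {"numberOfResults": top_k}
                },
            },
        },
    }

# With credentials in place this would be sent as:
#   client = boto3.client("bedrock-agent-runtime")
#   response = client.retrieve_and_generate(**req)
# and the answer arrives in response["output"]["text"],
# the sources in response["citations"].
req = build_rag_request(
    "What is our refund policy?", "KB_ID_PLACEHOLDER", "MODEL_ARN_PLACEHOLDER"
)
```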

When Amazon Bedrock Knowledge Bases is the right answer

  • Team has limited ML / infrastructure capacity.
  • Data lives in Amazon S3 or the supported connectors.
  • Team wants citations out of the box.
  • Latency budget is "seconds, not milliseconds."
  • Cost model "pay for storage and queries" is acceptable.

If the question says "fully managed RAG," "no-ops," "minimal operational overhead," "citations required," or "connect to S3 or SharePoint and answer internal questions," the answer is Amazon Bedrock Knowledge Bases, not a custom pipeline. Only pick a custom pipeline when the scenario emphasizes "fine-grained control over chunking, reranking, or sub-100 millisecond latency." Source ↗

AWS RAG Stack Option 2 — Custom RAG Pipeline

When you need control that managed Knowledge Bases does not expose — a custom reranker, a specialty chunker, sub-100-millisecond retrieval, or a private embedding model — you build a custom RAG pipeline. The building blocks:

Vector stores on AWS

  • Amazon OpenSearch Service (provisioned) with k-NN plugin — HNSW or IVF index for vector search, hybrid with BM25 in a single query, full control over replicas and shards.
  • Amazon OpenSearch Serverless vector search collection — auto-scaled, pay per OpenSearch Compute Unit, the managed store behind Amazon Bedrock Knowledge Bases by default.
  • Amazon Aurora PostgreSQL with pgvector — vector search inside a relational database; great when your metadata is already in SQL.
  • Amazon Neptune Analytics — vector plus graph queries; a good pick when entity relationships matter.
  • Amazon DocumentDB (with vector search) — vector search next to JSON documents.
  • Amazon MemoryDB for Redis / Amazon ElastiCache for Redis — sub-millisecond vector retrieval for latency-critical use cases.
  • Pinecone on AWS Marketplace — fully managed specialty vector database.
  • Redis Enterprise Cloud on AWS — for teams already on Redis.

Embedding models

  • Amazon Titan Text Embeddings v2 on Amazon Bedrock — 256 / 512 / 1024-dimensional embeddings, multilingual, pay per token.
  • Cohere Embed on Amazon Bedrock — strong multilingual embeddings.
  • Open-source embedding models (e.g., BGE, E5) deployed on Amazon SageMaker endpoints — useful for private deployments.

Generation models

Any foundation model on Amazon Bedrock — Anthropic Claude, Meta Llama, Amazon Titan, Amazon Nova, Mistral AI, AI21 Jurassic, Cohere Command. Custom pipelines also allow models on Amazon SageMaker endpoints.

Custom pipeline reference architecture

  1. Amazon S3 bucket stores raw documents.
  2. Amazon EventBridge or Amazon S3 event triggers an AWS Lambda ingester on upload.
  3. The ingester parses, chunks, and calls the embedding model on Amazon Bedrock.
  4. Vectors are written to Amazon OpenSearch Service (or Amazon Aurora pgvector, Pinecone, Redis).
  5. An Amazon API Gateway plus AWS Lambda serves the query path.
  6. The Lambda embeds the query, retrieves top-k from OpenSearch, optionally reranks with a Cohere Rerank or cross-encoder model, builds a prompt, and calls Amazon Bedrock to generate the answer.
  7. Answers and citations are returned to the client.
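The query path, steps 5 and 6, reduces to four calls. Here is a hedged sketch with the AWS clients injected as plain callables so the control flow runs without real endpoints; every name below is illustrative, not a real SDK:

```python
def handle_query(question, embed_fn, search_fn, generate_fn, k=3):
    """Custom RAG query path: embed the question, retrieve top-k
    passages, assemble a grounded prompt, generate. In production the
    three callables would wrap a Bedrock embedding model, an OpenSearch
    k-NN query, and a Bedrock generation model."""
    query_vector = embed_fn(question)
    passages = search_fn(query_vector, k)
    prompt = (
        "Answer only from the context below. If the answer is not there, say so.\n\n"
        + "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
        + f"\n\nQuestion: {question}"
    )
    return generate_fn(prompt)

# Stub wiring so the flow is visible without real endpoints.
# generate_fn is an identity stub, so `answer` is the assembled prompt.
answer = handle_query(
    "What is the SLA?",
    embed_fn=lambda q: [0.1, 0.2],                                  # fake embedding
    search_fn=lambda vec, k: ["The SLA is 99.9% monthly uptime."],  # fake retrieval
    generate_fn=lambda prompt: prompt,                              # identity stub
)
```

Keeping the three service calls behind plain functions like this also makes the Lambda trivially unit-testable, which matters once you own the pipeline.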

Amazon Kendra is managed enterprise search with semantic understanding, and it can power RAG — Amazon Bedrock Knowledge Bases even supports Amazon Kendra as a data source. But Amazon Kendra is NOT a general-purpose vector database you feed with arbitrary embeddings. If the scenario asks for a vector store that accepts 1024-dim embeddings from a custom model, pick Amazon OpenSearch Serverless, Amazon Aurora pgvector, Pinecone, or Redis — not Amazon Kendra. Amazon Kendra shines when you want out-of-the-box enterprise search with relevance tuning and many connectors. Source ↗

Side-by-Side — Amazon Bedrock Knowledge Bases vs Custom RAG Pipeline

Dimension | Amazon Bedrock Knowledge Bases | Custom RAG pipeline
Operational overhead | Fully managed | You own ingest, index, query
Data source connectors | S3, web, Confluence, SharePoint, Salesforce, Kendra | Whatever you write
Chunking options | Fixed, semantic, hierarchical, none, custom Lambda | Anything
Vector store | OpenSearch Serverless (default), Aurora pgvector, Neptune Analytics, Pinecone, Redis, DocumentDB | Any — full control
Reranker | Built-in Cohere Rerank option | Plug any reranker
Citations | Out of the box | You build them
Latency | Seconds | Can tune to sub-100 ms with Redis
Best for | Enterprise Q&A, no-ops, fast time to market | Latency-critical, specialty chunkers, private embedding models
Cost model | Pay for storage and queries | Pay for every component

RAG vs Fine-Tuning vs Hybrid — The Decision Framework

This is the second-most-tested topic on AIF-C01 after the RAG pattern itself. Memorize the decision framework.

When RAG beats fine-tuning

Pick RAG when any of these is true:

  • Knowledge changes frequently — weekly price lists, daily inventory, news articles. RAG picks up new documents by re-indexing; fine-tuning requires a new training run.
  • Answers must cite the source document. RAG returns citations natively; fine-tuned models blur source attribution.
  • The domain corpus is too large to memorize into weights affordably.
  • Explainability or compliance requires showing exactly which passage drove each answer.
  • Hallucination mitigation is the primary goal.
  • Team has limited ML expertise and cannot afford repeated training cycles.

When fine-tuning beats RAG

Pick fine-tuning (or continued pre-training) when any of these is true:

  • The task is a style, format, or behaviour change — always answer in JSON, always speak in the brand voice, always follow a chain-of-thought structure. Style is best learned in weights, not injected as context each call.
  • The model needs to learn a new vocabulary or jargon so deep that retrieval context does not disambiguate it.
  • Latency is ultra-critical and you cannot afford the retrieval round-trip plus larger prompts.
  • Privacy or compliance forbids sending source snippets in the prompt.
  • A small, stable, well-labelled training dataset exists and will not churn.

On AWS, fine-tuning options are Amazon Bedrock custom models (fine-tuning or continued pre-training on supported FMs such as Amazon Titan, Anthropic Claude Haiku, Meta Llama, Cohere Command) and Amazon SageMaker JumpStart fine-tuning.

When the hybrid approach wins

Hybrid = fine-tune the foundation model AND wrap it with RAG. Use this when:

  • You want the model to behave a certain way (fine-tuned style or format) AND answer from fresh private data (RAG).
  • Example: a customer-support agent fine-tuned to always answer in a specific tone and always escalate when unsure, combined with a Knowledge Base over the current product manual.
  • Another example: a medical summarization model fine-tuned on clinical note structure, augmented with RAG over the latest clinical guidelines.

Hybrid is the most capable pattern but also the most expensive and operationally complex. On AIF-C01, the exam usually points to pure RAG as the cheapest-best answer; hybrid shows up when the scenario explicitly combines "tone / format" with "fresh data."

RAG changes what the model knows. Fine-tuning changes how the model behaves. New facts? RAG. New style or format? Fine-tune. Both? Hybrid. If you only remember one rule for AIF-C01, remember this one. Source ↗

A very common AIF-C01 wrong answer is "fine-tune the model on company documents to stop hallucinations." This is almost never the right answer. Fine-tuning on a static corpus teaches style more reliably than facts; the model may still hallucinate when asked about details not well-represented in training. The correct hallucination mitigation is RAG plus source citation (and optionally Amazon Bedrock Guardrails for content filtering). If the exam says "reduce hallucinations and cite sources," pick RAG, not fine-tuning. Source ↗

Prompt Template Design for RAG

The prompt template is how the retrieved passages become model input. A production RAG prompt has four sections.

  1. System instruction — "You are an assistant that answers only from the provided context. If the answer is not in the context, say you do not know."
  2. Retrieved context — the k chunks, each tagged with an ID so citations can reference them.
  3. User question — the original query, unchanged.
  4. Output schema — "Return JSON with fields answer, citations[]."

Good RAG prompts explicitly forbid the model from using its own memory when the retrieved context is silent. This single instruction is the biggest hallucination mitigator after retrieval itself. Amazon Bedrock Knowledge Bases exposes the prompt template through the generationConfiguration.promptTemplate field so you can tune it without leaving the managed service.
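Assembling that four-section template is mechanical. A sketch, whose wording is illustrative rather than the Knowledge Bases default template:

```python
def build_rag_prompt(question, chunks):
    """Assemble the four-section RAG prompt: system instruction,
    ID-tagged context, user question, output schema."""
    context = "\n".join(f'<chunk id="{i + 1}">{c}</chunk>' for i, c in enumerate(chunks))
    return (
        "You are an assistant that answers ONLY from the provided context. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        'Return JSON with fields "answer" and "citations" (a list of chunk ids).'
    )

rag_prompt = build_rag_prompt(
    "What is the notice period?",
    ["Notice period is 30 days.", "Fees are billed monthly."],
)
```

Tagging each chunk with an ID is what lets the model emit machine-checkable citations instead of vague attributions.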

Key Numbers and Must-Memorize Facts

  • Amazon Titan Text Embeddings v2 dimensions — 256, 512, or 1024 (configurable at invocation).
  • Amazon Bedrock Knowledge Bases default chunking — fixed-size, roughly 300 tokens with 20-token overlap.
  • Top-k typical range — 3 to 10 for most Q&A RAG; higher for summarization.
  • Amazon OpenSearch Serverless minimum OCU — 2 OCUs for indexing, 2 for search when using a dedicated collection (costs accrue even at idle).
  • Amazon Bedrock Knowledge Bases vector store options — Amazon OpenSearch Serverless (default), Amazon Aurora pgvector, Amazon Neptune Analytics, Pinecone, Redis Enterprise Cloud, Amazon DocumentDB.
  • Amazon Bedrock Knowledge Bases data source connectors — Amazon S3, web crawler, Confluence, SharePoint, Salesforce, Amazon Kendra.
  • Amazon Bedrock retention — prompts and completions are not used to train base foundation models.
  • Citation support — Amazon Bedrock Knowledge Bases RetrieveAndGenerate returns citations by default.
  • Reranker option — Cohere Rerank is offered natively inside Amazon Bedrock Knowledge Bases.
  • RAG mitigates hallucination — but only when the prompt explicitly instructs the model to say "I do not know" if the answer is not in context.

Common Exam Traps for RAG on AIF-C01

  • RAG vs fine-tuning — knowledge vs behaviour. "Answer from private docs updated weekly" → RAG. "Always output JSON in a strict schema" → fine-tuning.
  • Knowledge Bases vs custom pipeline — ops overhead. "Minimal operational overhead, fully managed" → Amazon Bedrock Knowledge Bases. "Specialty reranker and sub-100 ms latency" → custom pipeline on Amazon OpenSearch Service or Redis.
  • Amazon Kendra vs Amazon OpenSearch Serverless — managed enterprise search vs generic vector database. "Many enterprise connectors, out-of-the-box relevance" → Amazon Kendra. "Arbitrary 1024-dim embeddings with hybrid search" → Amazon OpenSearch Service.
  • Chunking strategy — if the document has nested sections, hierarchical wins. If it is narrative, semantic wins. If it is uniform (logs, short SKUs), fixed-size wins.
  • Hallucination fix — RAG with citations and a strict prompt, not fine-tuning. Add Amazon Bedrock Guardrails for content and topic filtering.
  • Private data and compliance — Amazon Bedrock does not train on your prompts or completions; RAG keeps your data in your account's vector store.
  • Cost model — RAG scales with retrieval volume and token usage, fine-tuning is a large up-front training cost plus hosting.
  • Embedding consistency — the same embedding model must be used for ingestion and query; switching embedding models invalidates the index.
  • Agents vs Knowledge Bases — Amazon Bedrock Agents orchestrate multi-step actions (calling APIs, running code) and can consume a Knowledge Base as one tool. If the question is "look up a document and answer," pick Knowledge Bases. If it is "book a flight and update the CRM," pick Agents.

Every AIF-C01 study guide warns about this: if you ingest with Amazon Titan Text Embeddings v2 (1024-dim) and query with Cohere Embed (1024-dim but different vector space), retrieval degrades to near-random. Vectors live in a model-specific space; they are not interchangeable across embedding models. Pin one embedding model per Knowledge Base. If you need to switch, re-embed the entire corpus. Source ↗

RAG Boundary — What RAG Is Not

  • RAG is not a chatbot framework. Amazon Lex or Amazon Bedrock Agents handle dialog orchestration; RAG is a retrieval technique that can be called from inside either.
  • RAG is not a model. RAG is a pattern that wraps any foundation model with a retrieval step.
  • RAG is not fine-tuning. Fine-tuning updates model weights; RAG leaves weights untouched and changes the prompt.
  • RAG is not identical to semantic search. Semantic search retrieves passages and stops; RAG retrieves passages AND asks an FM to synthesize an answer.
  • RAG does not eliminate hallucination. It greatly reduces it when paired with strict prompts and citations, but a sloppy prompt template still lets the model wander.

Scenario 1: A bank wants an internal assistant that answers employee questions using the latest version of the HR handbook (updated monthly). The team has no ML engineers. Correct choice: Amazon Bedrock Knowledge Bases with an Amazon S3 data source and Amazon OpenSearch Serverless vector store.

Scenario 2: A law firm needs answers that cite the exact paragraph of a contract. Correct approach: RAG with citations returned by RetrieveAndGenerate; do not fine-tune.

Scenario 3: A company wants the model to always speak in its brand voice and always reply in a JSON schema. Correct approach: Fine-tune the foundation model using Amazon Bedrock custom models; knowledge does not change, behaviour does.

Scenario 4: A support team wants the brand voice AND answers grounded in the current product manual (which changes weekly). Correct approach: Hybrid — fine-tune for voice, wrap with RAG over the product manual.

Scenario 5: A retail chatbot needs sub-50-millisecond retrieval latency to keep P99 under 300 ms. Correct choice: Custom RAG pipeline with Amazon MemoryDB for Redis vector store and Amazon Bedrock for generation.

Scenario 6: A startup has 10 million product SKU descriptions in Amazon S3 and wants semantic search plus generated recommendations. Correct approach: RAG with Amazon Bedrock Knowledge Bases; chunking = none (one SKU per chunk is already natural).

Scenario 7: A medical research tool must answer questions using both the latest clinical guidelines (updated monthly) and a strict formatting template. Correct approach: Hybrid — RAG for the guidelines, fine-tune for the formatting template.

Scenario 8: An engineering team complains that RAG answers are fluent but sometimes contradict the retrieved passages. Correct diagnosis: Prompt template issue — add strict instruction "answer only from the provided context; say you do not know otherwise" and consider Amazon Bedrock Guardrails.

Scenario 9: The team reports that retrieval misses obvious relevant passages. Correct diagnosis: Chunking or embedding issue — try semantic or hierarchical chunking, increase k, or switch to hybrid (vector plus BM25) retrieval.

Scenario 10: A compliance officer asks whether prompts sent to Amazon Bedrock are used to train foundation models. Correct answer: No — Amazon Bedrock does not use customer prompts or completions to train base foundation models.
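The hybrid-retrieval fix in Scenario 9 comes down to score fusion: vector similarity and BM25 scores live on different scales, so each list is normalized before a weighted blend re-ranks the candidates. The sketch below is a minimal stdlib illustration of that idea, not the actual OpenSearch hybrid-query implementation; the `alpha` weight and the toy scores are illustrative assumptions.

```python
def normalize(scores):
    """Min-max normalize a score dict to [0, 1] so the two scales are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(vector_scores, bm25_scores, alpha=0.6):
    """Blend normalized vector-similarity and BM25 scores; alpha weights the vector side."""
    v, b = normalize(vector_scores), normalize(bm25_scores)
    docs = set(v) | set(b)
    fused = {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

# Toy scores: doc A is the best semantic match, doc B is only a keyword match.
vector_scores = {"A": 0.92, "B": 0.40, "C": 0.75}
bm25_scores = {"A": 1.1, "B": 9.8, "C": 4.0}
print(hybrid_rank(vector_scores, bm25_scores))  # → ['A', 'C', 'B']
```

Note how doc B, a pure keyword hit, still surfaces in the ranking instead of being dropped by vector search alone; that is exactly the miss Scenario 9 describes.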

FAQ — RAG (Retrieval Augmented Generation) on AIF-C01

1. What is RAG and why is it on the AIF-C01 exam?

RAG (Retrieval Augmented Generation) is the pattern of retrieving relevant passages from an external knowledge source and injecting them into the prompt of a foundation model so the model can answer with up-to-date, private, or domain-specific information it was never trained on. AIF-C01 tests RAG heavily because it solves the two structural limits of foundation models — knowledge frozen at training time and hallucination risk — and because AWS ships a flagship service around it (Amazon Bedrock Knowledge Bases). Expect three to six RAG questions per sitting.

2. When should I pick RAG over fine-tuning?

Pick RAG whenever knowledge changes frequently, citations are required, hallucination mitigation is the main goal, or the team lacks ML expertise for repeated training runs. Pick fine-tuning when the task is a style, format, or behaviour change on a stable dataset, when latency is ultra-critical, or when privacy rules forbid injecting source snippets into prompts. Pick the hybrid approach when you need both — fine-tune for voice or format, wrap with RAG for fresh data. The one-line rule: RAG changes what the model knows, fine-tuning changes how the model behaves.

3. What is the difference between Amazon Bedrock Knowledge Bases and a custom RAG pipeline?

Amazon Bedrock Knowledge Bases is the fully managed RAG service — it handles ingestion from Amazon S3 and other connectors, parsing, chunking (fixed, semantic, hierarchical), embedding with Amazon Titan or Cohere, storage in Amazon OpenSearch Serverless or Aurora pgvector or Pinecone or Redis, retrieval, and generation with citations through a single RetrieveAndGenerate API. A custom RAG pipeline replaces any of these steps with AWS Lambda, Amazon OpenSearch Service, Amazon SageMaker endpoints, or open-source components when you need a specialty chunker, a custom reranker, sub-100-millisecond latency, or a private embedding model. Pick Knowledge Bases for no-ops enterprise Q&A; pick custom for specialized requirements.
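The single-API flow described above can be sketched with boto3's `bedrock-agent-runtime` client and its `retrieve_and_generate` call. The knowledge base ID and model ARN below are placeholders you would replace with your own values; treat this as a minimal sketch of the managed path, not a production client.

```python
def build_rag_request(question, kb_id, model_arn):
    """Build a RetrieveAndGenerate request body in KNOWLEDGE_BASE mode."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    }

def ask_knowledge_base(question, kb_id, model_arn):
    """Call the managed RAG API; returns the grounded answer plus its citations."""
    import boto3  # local import so the request builder stays usable offline
    client = boto3.client("bedrock-agent-runtime")
    resp = client.retrieve_and_generate(**build_rag_request(question, kb_id, model_arn))
    return resp["output"]["text"], resp["citations"]
```

Everything else — chunking, embedding, vector search, prompt assembly — happens inside the service, which is why Knowledge Bases is the no-ops answer on the exam.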

4. Which chunking strategy should I use in RAG?

Start with fixed-size chunking (around 300 tokens with 20% overlap) as a baseline. Switch to semantic chunking when your documents are narrative (earnings calls, research articles) so chunks respect topic transitions. Switch to hierarchical chunking when documents have nested sections (policy manuals, contracts) so the retriever matches on precise children but returns richer parents to the generator. Use no chunking for naturally short records like FAQ entries or product SKUs. Amazon Bedrock Knowledge Bases supports all four strategies natively.
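The fixed-size baseline is simple enough to sketch directly. This toy version slides a window over a token list with a fixed overlap so no sentence is stranded at a chunk boundary; real chunkers count model tokens via a tokenizer, so the pseudo-tokens here are a stand-in.

```python
def fixed_size_chunks(tokens, size=300, overlap=60):
    """Split a token list into fixed-size windows; `overlap` tokens repeat across boundaries."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# 700 pseudo-tokens -> three chunks of up to 300 tokens with 60-token (20%) overlap.
tokens = [f"tok{i}" for i in range(700)]
chunks = fixed_size_chunks(tokens)
print(len(chunks), len(chunks[0]))  # → 3 300
```

The overlap is the safety margin: a fact split across chunk 1 and chunk 2 still appears whole in at least one of them, which is why the baseline works better than naive non-overlapping splits.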

5. Does RAG eliminate hallucination?

No — RAG greatly reduces hallucination but does not eliminate it. Two things must be in place. First, the retriever must actually return the relevant passage (measured by recall and precision at k). Second, the prompt template must explicitly instruct the model to answer only from the provided context and to say "I do not know" when the passage is silent. Without a strict prompt, the model may still fall back on its pretraining memory and produce an ungrounded answer. Pair RAG with Amazon Bedrock Guardrails for content filtering and topic restriction to close the remaining gap.
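A strict prompt template of the kind described above might look like the sketch below. The exact wording is an illustrative assumption, not an AWS-prescribed template; the key ingredients are the "answer only from context" instruction, an explicit fallback phrase, and numbered passages the model can cite.

```python
GROUNDED_PROMPT = """You are an assistant that answers strictly from the numbered context passages below.
If the passages do not contain the answer, reply exactly: I do not know.

Context:
{context}

Question: {question}
Answer (cite passage numbers for every claim):"""

def build_grounded_prompt(question, passages):
    """Number each retrieved passage so the model can cite it, then fill the template."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    return GROUNDED_PROMPT.format(context=context, question=question)

print(build_grounded_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Shipping is free over $50."],
))
```

The numbered `[1]`, `[2]` markers are what make citation checking possible downstream: every claim in the answer should trace back to one of those numbers.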

6. Which AWS vector stores can I use for RAG?

Amazon Bedrock Knowledge Bases supports Amazon OpenSearch Serverless (default), Amazon Aurora PostgreSQL with pgvector, Amazon Neptune Analytics, Pinecone, Redis Enterprise Cloud, and Amazon DocumentDB. For custom pipelines you can add Amazon OpenSearch Service (provisioned) with the k-NN plugin, Amazon MemoryDB for Redis and Amazon ElastiCache for Redis (sub-millisecond retrieval), or any compatible open-source vector database running on Amazon EKS or Amazon EC2. Pick Amazon OpenSearch Serverless when you want the managed default, Aurora pgvector when your metadata is already in PostgreSQL, and Redis when latency is critical.

7. Can I use fine-tuning AND RAG together?

Yes — this is the hybrid approach. Fine-tune the foundation model using Amazon Bedrock custom models or Amazon SageMaker JumpStart to bake in style, tone, format, or domain vocabulary. Then wrap the fine-tuned model with RAG through Amazon Bedrock Knowledge Bases or a custom pipeline so it can still pull fresh facts at query time. The hybrid approach is the most capable RAG pattern and also the most expensive — on AIF-C01, pure RAG is usually the cost-effective answer, but scenarios that combine "brand voice" or "strict output format" with "data updated frequently" typically point to hybrid.

8. How do I evaluate a RAG system?

Evaluate retrieval and generation separately. For retrieval, measure recall@k, precision@k, MRR, and nDCG against a labelled test set. For generation, measure groundedness (every claim traces to a passage), answer relevance (the answer addresses the question), and context relevance (retrieved passages match the question). Amazon Bedrock Model Evaluation supports automatic Knowledge Base evaluation with built-in metrics; open-source RAGAS is another common choice. If the final answer is wrong, diagnose whether retrieval missed the passage (fix chunking, embedding, or hybrid weighting) or the model ignored it (fix the prompt template or switch foundation models).
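The retrieval metrics above are easy to compute by hand on a labelled test set, which is worth doing once to internalize them. A minimal sketch, assuming each query has a ranked list of retrieved document IDs and a set of known-relevant IDs:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant document across all queries."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        total += next((1 / (i + 1) for i, d in enumerate(retrieved) if d in relevant), 0.0)
    return total / len(ranked_lists)

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, 4))  # → 0.5
print(recall_at_k(retrieved, relevant, 4))     # → 1.0
```

High recall with low precision points at a noisy retriever (tighten chunking or add reranking); low recall points at the embedding or chunking stage, exactly the diagnosis split described above.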

Summary

RAG (Retrieval Augmented Generation) is the AIF-C01 pattern that grounds a foundation model on fresh, private, or domain-specific data by retrieving relevant passages at query time and injecting them into the prompt. The canonical pipeline is ingest → parse → chunk → embed → index, then retrieve → augment → generate. Chunking choices (fixed-size, semantic, hierarchical, none) are the biggest retrieval-quality lever; retrieval quality is measured with recall@k, precision@k, MRR, and nDCG, and generation quality with groundedness and answer relevance. On AWS, Amazon Bedrock Knowledge Bases is the fully managed RAG service that handles every step, supports Amazon OpenSearch Serverless, Amazon Aurora pgvector, Amazon Neptune Analytics, Pinecone, Redis Enterprise Cloud, and Amazon DocumentDB as vector stores, and returns citations natively; a custom pipeline on Amazon OpenSearch Service, AWS Lambda, and Amazon Bedrock gives you fine-grained control for latency-critical or otherwise specialized cases. The RAG-vs-fine-tuning rule is simple: RAG changes what the model knows, fine-tuning changes how it behaves, and hybrid does both. Master these rules, recognize the end-to-end RAG pipeline, memorize the managed-vs-custom trade-off, and you are ready for every RAG question AIF-C01 can throw at you.

Official Sources