
RAG with LangChain

By Xavier Collantes

Created: 9/15/2024; Updated: 7/25/2025


RAG with LangChain is only one of many ways to implement a RAG-enabled LLM.
In this example, I will show you how to build a RAG pipeline with LangChain and walk through the technical decisions you might make for each component in different situations.
We will use my blog post about my time in the Bulldog Band as the reference data.

What Is RAG?

Think of RAG like this: you are using ChatGPT, but it can see your documents folder, so now you can ask what the documents are about.
Regular LLM: "Based on my training data from 2021, here's what I think..."
RAG-powered LLM: "Let me check the latest docs first... okay, here's what's actually happening..."
RAG is great for giving specific reference material to your LLM without having to spend thousands of dollars on re-training the base model.

My Tech Stack Choices

I will explain my technical choices for this example project. First, install the LangChain packages:
Bash
pip install langchain-text-splitters langchain-community langgraph

LLM Model

The core LLM component receives inputs and returns responses.
I chose Gemini for the LLM because I already have a baseline for its performance, and Gemini is fairly cheap. The embedding model that pairs with Gemini is also ranked among the highest of today's models: MTEB Leaderboard (as of July 2025).
You can find the list of available models at langchain.com.
Some models are accessed through APIs, such as OpenAI or hosted Llama.
Some models are locally hosted through Ollama. I point out Ollama because this one dependency opens up Ollama's local hosting capabilities, which is a whole huge list itself: Ollama models.
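As a sketch of the local route (not part of this demo; it assumes Ollama is installed and running, the model has been pulled, and langchain-ollama is available), the same init_chat_model call can point at an Ollama model:
🐍
Python3
from langchain.chat_models import init_chat_model

# Assumes `ollama pull llama3.1` has already been run and the Ollama server is up.
local_llm = init_chat_model("llama3.1", model_provider="ollama")
print(local_llm.invoke("Say hello in one word.").content)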
Bash
1pip install "langchain[google-genai]"
2
snippet hosted withby Xavier
🐍
Python3
from langchain.chat_models import init_chat_model
from langchain_core.language_models.chat_models import BaseChatModel

MODEL_NAME: str = "gemini-2.0-flash"
MODEL_PROVIDER: str = "google_genai"

# Full list: https://python.langchain.com/docs/integrations/chat/
llm: BaseChatModel = init_chat_model(MODEL_NAME, model_provider=MODEL_PROVIDER)

API Key

Depending on the model, you may need an API key.
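For the Gemini setup in this post, langchain-google-genai reads the GOOGLE_API_KEY environment variable. A minimal sketch for setting it (assuming you already have a key from Google AI Studio):
🐍
Python3
import getpass
import os

# Prompt for the key only if it is not already set in the environment.
if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter Google API key: ")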

Embedding Model

An encoding model which translates human-readable text into a numeric vector (an embedding) that can be scored against other vectors to indicate how closely two pieces of text are related.
(Embedding diagram: xomnia.com)
For example, "dog" and "cat" may be given a score of 0.7 because they are both animals, both are common pets, but are different species as per the training data.
"Cat" and "cow" may be given a score of 0.2 because though they are both animals, they are less seen together in the training data.
Some of these algorithms for measuring similarity include Co-sine Similarity and Euclidean Distance.
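As a quick illustration of cosine similarity (the vectors below are made up; real embeddings have hundreds or thousands of dimensions):
🐍
Python3
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" for illustration only.
dog = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
cow = [0.2, 0.3, 0.9]

print(cosine_similarity(dog, cat))  # ~0.99: close together in this toy space.
print(cosine_similarity(cat, cow))  # ~0.51: further apart.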
Technically, embedding models are interchangeable as long as their outputs are the size your vector store expects:
  • Embedding output sizes must match (384, 1024, etc.)
  • Do not change your embedding model mid-project. Embedding outputs are unique to each model, so switching models without re-embedding the vector database will not work.
Bash
pip install langchain-google-genai
🐍
Python3
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import os

# Full list: https://python.langchain.com/docs/integrations/text_embedding/
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001",
    # Make sure to set this in your environment or set this variable in your code.
    api_key=os.getenv("GOOGLE_API_KEY"),
)
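You can sanity-check the embedding size before wiring up the vector store; a minimal sketch using the embeddings object above (this setup produces the 3072-dimensional vectors that show up in the dimension-mismatch error later in this post):
🐍
Python3
# Embed a single string and inspect the vector length.
sample_vector: list[float] = embeddings.embed_query("test sentence")
print(f"Embedding dimension: {len(sample_vector)}")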

Vector Database

Embeddings turn regular text into coordinates in high-dimensional space where similar concepts end up close together.
There are many choices for vector databases:
Chroma
    • Simple setup
    • Local storage in the form of a SQLite file
Pinecone
    • Cloud store only, with a local emulator version
    • Free tier, then you pay for the storage
    • Pricing model is based on a monthly minimum usage
Qdrant
    • Simple setup
    • Local vector store
    • Cloud vector store, or locally hosted with Docker
    • Cloud managed service version has 1GB free
    • Cloud managed service has straightforward pricing by the hour
Generally speaking, all of these services offer about the same core features. The biggest differences are the adjacent features and how you can deploy them. For example, Qdrant can be run on Docker with a single command (docker run -p 6333:6333 qdrant/qdrant), while Pinecone has an emulator for local development.
I chose Chroma for this demo because it is local and simple. Chroma persists to a local SQLite file, so be aware when a new file pops up in your project.
For more complex use cases, you can use a cloud vector store like Pinecone, or if you have a Docker Compose or Kubernetes setup, you can use Qdrant.
Bash
pip install langchain-chroma
🐍
Python3
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",  # Keeps a local SQLite file.
)

Building the Actual RAG Pipeline

Feeding Data

We need reference data for the RAG pipeline to work with. In this demo, I will use a couple of my own blog posts featured on xaviercollantes.dev.
Bash
pip install langchain-community langchain-text-splitters bs4
🐍
Python3
import bs4
from langchain_core.documents import Document
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Some URLs will be blocked by my "Prove You're Human" bot-prevention.
loader: WebBaseLoader = WebBaseLoader(
    web_paths=(
        "https://xaviercollantes.dev/articles/bulldog-band",
        "https://xaviercollantes.dev/articles/faxion-ai",
        "https://xaviercollantes.dev/articles/measuring-tokens",
        "https://xaviercollantes.dev/articles/rpi-camera",
    ),
)
docs: list[Document] = loader.load()
Your output will look like this:
txt
[Document(metadata={'source': 'https://xaviercollantes.dev/articles/bulldog-band', ...'),
 Document(metadata={'source': 'https://xaviercollantes.dev/articles/faxion-ai', ...),
 Document(metadata={'source': 'https://xaviercollantes.dev/articles/measuring-tokens', ...),
 Document(metadata={'source': 'https://xaviercollantes.dev/articles/rpi-camera', ...)]

Chunking Words (optional)

This step is included because, about a year ago, LLMs did not have a big enough context window to work with. If we were building this back then, we would have had to split the documents into smaller chunks.

Chunking might still be needed if your input is too long.

See my other blog post on LLM tokens if your input is too long for a specific LLM: Measuring Tokens
🐍
Python3
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter: RecursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50,
)
docs: list[Document] = text_splitter.split_documents(docs)
print(f"Divided the documents into {len(docs)} chunks.")

Upload to Vector Store

Now we can add the documents to the vector store.
🐍
Python3
doc_ids: list[str] = vector_store.add_documents(documents=docs)
print(f"Document IDs: {len(doc_ids)}: {doc_ids}")
You can test the vector store by searching for a document.
🐍
Python3
vector_store.similarity_search("What is Bulldog Band?")
Returns documents in order from most relevant to least relevant.
txt
[Document(id='4d2d84a1-d93b-4342-90d6-812047d56882', metadata={'language': 'en-US', 'source': 'https://xaviercollantes.dev/articles/bulldog-band', 'title': 'Bulldog Band -
 Document(id='d9bc138c-7330-47aa-8e64-42cdfda26799', metadata={'description': 'Tokens mean $$$ and how to measure them.', 'title': 'Measuring Tokens in LLMs - Xavier Collant
 Document(id='7f72ab32-c2bb-424e-9aac-0f821ae222aa', metadata={'description': 'Architecting and leading the development of a groundbreaking AI fashion platform that reduce
 Document(id='69d83587-f15a-4e27-a9fe-88d4ab0ca553', metadata={'title': 'FastAPI: Build your own APIs - Xavier Collantes', 'source': 'https://xaviercol
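If you also want the raw relevance scores, the Chroma wrapper exposes similarity_search_with_score (for Chroma the score is a distance, so lower means more similar); a minimal sketch:
🐍
Python3
# Each result is a (Document, score) tuple.
results = vector_store.similarity_search_with_score("What is Bulldog Band?", k=2)
for doc, score in results:
    print(f"{score:.3f}  {doc.metadata.get('source')}")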
Pitfall: If your embedding model produces vectors of a different size than the ones already stored in the collection, you will get an error:
txt
InvalidArgumentError: Collection expecting embedding with dimension of 1024, got 3072
Potential solutions:
  • Clear out the vector store (or start a new collection); once you add documents, you cannot change the embedding dimension of that collection.
  • Make sure you keep using the same embedding model (and output size) that originally populated the collection.
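One way to clear out the local Chroma store from this demo is to drop the collection and re-create it with the current embedding model; a minimal sketch, assuming the vector_store, embeddings, and docs objects from above:
🐍
Python3
# Drop the existing collection (this removes the stored embeddings).
vector_store.delete_collection()

# Re-create the collection; new documents are embedded at the current model's size.
vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",
)
doc_ids = vector_store.add_documents(documents=docs)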

Asking the LLM

Build the Prompt

LangChain has many "prompt management" features, such as pulling prompts from a hub, much like Git with GitHub (see LangChain Hub). LangChain also has a built-in prompt template for RAG.
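For example, the community RAG prompt used in the official tutorials can be pulled from the hub (a sketch; assumes network access and the hub client installed):
🐍
Python3
from langchain import hub

# Pull a community-maintained RAG prompt template from LangChain Hub.
hub_prompt = hub.pull("rlm/rag-prompt")
print(hub_prompt.messages)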
🐍
Python3
from langchain_core.messages import BaseMessage
from langchain_core.prompts import ChatPromptTemplate

# Create your own RAG prompt template.
custom_prompt: ChatPromptTemplate = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that can answer questions about Xavier's blogs.\n\nContext:\n{context}",
        ),
        (
            "human",
            "{question}",  # This is not Python string interpolation.
        ),
    ]
)

Retrieve Context

Now we write some helper functions to retrieve the context and generate an answer.
This will use Pydantic to define data types.
🐍
Python3
from pydantic import BaseModel, Field


class State(BaseModel):
    """State for the application."""

    question: str = Field(default="", description="The user's input text.")
    context: list[Document] = Field(
        default_factory=list,
        description="The documents retrieved from the vector store.",
    )
    answer: str = Field(default="", description="The LLM's answer to the question.")


def retrieve_context(state: State) -> dict:
    """Retrieves the most relevant documents from the vector store."""

    retrieved_docs: list[Document] = vector_store.similarity_search(state.question)
    # List of documents which are the most relevant to the question.
    # "context" is the key for the value being returned and matches the key in
    # the State object.
    # print(f"Retrieved {len(retrieved_docs)} documents: {retrieved_docs}")
    return {"context": retrieved_docs}


def generate(state: State, prompt: ChatPromptTemplate, llm: BaseChatModel) -> dict:
    """Performs the actual query to the LLM."""

    docs_content: str = "\n\n".join(doc.page_content for doc in state.context)
    # prompt.invoke() returns a ChatPromptValue, which the LLM accepts directly.
    messages = prompt.invoke({"question": state.question, "context": docs_content})
    response = llm.invoke(messages)
    # "answer" is the key for the value being returned and matches the key in
    # the State object.
    # print(f"Generate: {response.content}")
    return {"answer": response.content}

FINALLY: Asking the LLM

🐍
Python3
### PLACE YOUR QUESTION HERE ###
input_chat: str = "Where did Bulldog Band travel to?"
Run the helper functions to retrieve the context and generate an answer.
🐍
Python3
state: State = State(question=input_chat)

# Get relevant context using the helper function.
context_result: dict = retrieve_context(state)
state.context = context_result["context"]

# Generate the answer using the helper function.
answer_result: dict = generate(state, custom_prompt, llm)
state.answer = answer_result["answer"]

# Wrap the answer at 10 words per line for readability.
answer_words: list[str] = state.answer.split(" ")
output_lines: str = ""
line_len: int = 10
curr_words: int = line_len
for word in answer_words:
    output_lines += word + " "
    curr_words -= 1
    if curr_words == 0:
        output_lines += "\n"
        curr_words = line_len

# This is the final answer.
print(output_lines)
Result should look like this:
txt
The Bulldog Band traveled to a handful of cities across the United States, including:
*   Las Vegas
*   San Jose
*   Chicago
*   Phoenix
Which is true, by the way. You can confirm it in the Bulldog Band article.

LangGraph

Instead of calling the helper functions by hand, you can chain them together with LangGraph. One note: generate() takes the prompt and LLM as extra arguments, so wrap it first so that each graph node only receives the State.
🐍
Python3
from langgraph.graph import START, StateGraph


def generate_node(state: State) -> dict:
    """Wraps generate() so the graph node only receives the State."""
    return generate(state, custom_prompt, llm)


# Chain everything together.
graph_builder = StateGraph(State).add_sequence([retrieve_context, generate_node])
graph_builder.add_edge(START, "retrieve_context")
graph = graph_builder.compile()

# One-liner execution.
question = "what are the components?"
result = graph.invoke({"question": question})
print(result["answer"])
LangGraph is pretty slick. It handles the state management and gives you a nice visual representation of what's happening.
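For the visual part, the compiled graph can describe itself as a Mermaid diagram (a minimal sketch; draw_mermaid_png() also exists if you want an image file):
🐍
Python3
# Print a Mermaid diagram of the graph's nodes and edges.
print(graph.get_graph().draw_mermaid())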

What I Learned

Chunk Size Is Everything

Small chunks (200 chars) gave me super precise context but required better ranking. It is like the difference between having a detailed index vs. chapter summaries in a book. Both have their place.

Vector Stores Same But Different

Different vector stores have different features and use cases.
  • Local stores (Chroma, FAISS): Great for development, terrible for production scale
  • Cloud stores (Pinecone, Qdrant): Expensive but probably necessary for real apps
  • In-memory stores: Perfect for experimenting, useless for persistence

Prompt Engineering Is Still Crucial

How you structure the prompt for RAG makes a huge difference. You need to be explicit about using the retrieved context and handling cases where the context doesn't contain the answer.
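As an illustration of that (not the prompt used above, just a sketch of how explicit you might want to be):
🐍
Python3
from langchain_core.prompts import ChatPromptTemplate

# A stricter RAG system prompt: stick to the context and admit when it is missing.
strict_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer using ONLY the context below. "
            "If the context does not contain the answer, say you do not know.\n\n"
            "Context:\n{context}",
        ),
        ("human", "{question}"),
    ]
)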

Next Steps

  • Multi-step reasoning - Let the AI ask follow-up questions if needed
  • Smarter context filtering - Pick the best chunks, not just the first few
  • Build UI or connect to webapp - For user-facing apps in a chat interface

Further Reading

Qdrant vs AWS S3 Vector Store: Comparing the new AWS S3 Vector Store to Qdrant (8/15/2025).