# How to Fine‑Tune Your Data and Build Smarter RAG Apps with LangGraph
Ever felt like your Retrieval‑Augmented Generation (RAG) app was talking to a clueless chatbot?
You’re not alone. In this post I’ll walk you through turning raw docs into a laser‑focused knowledge base and wiring it up with LangGraph—the new kid on the block that makes RAG pipelines feel like building with LEGO bricks.
## Intro: The “Why” Behind the Fine‑Tuning Fuss
Picture this: you’re building a customer‑support bot for a SaaS startup. You feed it the entire product manual (30 k+ lines) and the model starts spitting out “I don’t know” or, worse, hallucinating answers about pricing that never existed.
The culprit? Noisy, unstructured data + a generic retrieval layer.
Fine‑tuning your data (cleaning, chunking, embedding) is the secret sauce that lets LangGraph’s graph‑based orchestration do its magic without the noise. By the end of this article you’ll have:
- A tidy, chunked dataset ready for embeddings.
- A simple LangGraph graph that routes queries intelligently.
- A handful of pro‑tips to keep your RAG app fast, cheap, and actually helpful.
Grab a coffee, and let’s get our hands dirty.
## Step‑by‑Step: From Raw Docs to a Smarter RAG App
### 1⃣ Gather & Clean Your Source Material
```bash
# Example: pull markdown docs from a Git repo
git clone https://github.com/yourorg/product-docs.git
cd product-docs
```
| What to do | Why it matters |
|---|---|
| Strip HTML/Markdown (`BeautifulSoup`, `mistune`) | Reduces token clutter |
| Normalize whitespace (collapse repeated spaces and line breaks) | Improves chunk consistency |
| Remove boilerplate (headers, footers, navigation) | Cuts down irrelevant embeddings |
```python
import re
from pathlib import Path

def clean_md(text: str) -> str:
    # Remove markdown code fences, headings, and extra whitespace
    text = re.sub(r'```.*?```', '', text, flags=re.S)  # code blocks
    text = re.sub(r'#+\s*', '', text)                  # headings
    text = re.sub(r'\s+', ' ', text).strip()           # collapse spaces
    return text

docs = [clean_md(Path(p).read_text()) for p in Path('docs').rglob('*.md')]
```
### 2⃣ Chunk Like a Pro
LangGraph works best when each node receives reasonably sized chunks (≈200‑400 tokens). Too big → bloated prompts and diluted relevance; too small → lost context.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,  # keep a bit of context between chunks
    separators=["\n\n", "\n", " "]
)

chunks = []
for doc in docs:
    chunks.extend(splitter.split_text(doc))

print(f"🔢 Got {len(chunks)} chunks")
```
Pro tip: If your docs contain tables or code snippets, add a custom separator (e.g., the triple‑backtick fence) so those blocks stay together.
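A minimal sketch of that tip, assuming fenced code blocks are what you want to protect (the separator order here is an assumption, not a LangChain default):

```python
# Sketch: list the fence first so the splitter breaks at block boundaries,
# keeping fenced code together instead of cutting through the middle of it
code_aware_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    separators=["\n```\n", "\n\n", "\n", " "],
)
```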
### 3⃣ Embed & Index
LangGraph plays nicely with any vector store (FAISS, Pinecone, Qdrant). Below we’ll use FAISS for simplicity.
```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

emb = OpenAIEmbeddings(model="text-embedding-3-large")  # strong; use "text-embedding-3-small" to cut cost
vectorstore = FAISS.from_texts(chunks, embedding=emb)

# Persist for later runs
vectorstore.save_local("faiss_index")
```
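On later runs you can reload the saved index instead of re‑embedding everything. A minimal sketch (depending on your langchain version, `load_local` may also require `allow_dangerous_deserialization=True`):

```python
# Reload the persisted index; use the same embedding model it was built with
vectorstore = FAISS.load_local("faiss_index", emb)
```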
### 4⃣ Build the LangGraph Pipeline
LangGraph lets you define nodes (functions) and edges (routing logic) as a directed graph.
Our simple graph:
```
User Query --> Retriever --> (optional) Reranker --> LLM --> Response
```
```python
from typing import List, TypedDict

from langchain.chat_models import ChatOpenAI
from langchain.schema import Document
from langgraph.graph import StateGraph, END

# Shared state that flows between nodes
class RAGState(TypedDict, total=False):
    query: str
    retrieved_docs: List[Document]
    reranked_docs: List[Document]
    answer: str

# 1⃣ Retriever node
def retrieve(state: RAGState) -> dict:
    docs = vectorstore.similarity_search(state["query"], k=5)
    return {"retrieved_docs": docs}

# 2⃣ (Optional) Reranker node – we’ll use a tiny cross‑encoder
def rerank(state: RAGState) -> dict:
    from sentence_transformers import CrossEncoder
    cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
    query, docs = state["query"], state["retrieved_docs"]
    scores = cross.predict([(query, d.page_content) for d in docs])
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return {"reranked_docs": [doc for _, doc in ranked[:3]]}

# 3⃣ LLM node – simple "stuff" strategy: concat the docs, then answer
def answer(state: RAGState) -> dict:
    llm = ChatOpenAI(model="gpt-4o-mini")
    context = "\n\n".join(d.page_content for d in state["reranked_docs"])
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {state['query']}"
    )
    return {"answer": llm.invoke(prompt).content}

# Wire it together
graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("rerank", rerank)
graph.add_node("answer", answer)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "rerank")
graph.add_edge("rerank", "answer")
graph.add_edge("answer", END)
app = graph.compile()
```
Now you can call the compiled graph like a function:

```python
def ask(query: str) -> str:
    result = app.invoke({"query": query})
    return result["answer"]

print(ask("How do I reset my API key?"))
```
### 5⃣ Test, Iterate, & Deploy

| Stage | What to watch | Quick fix |
|---|---|---|
| Latency | Retrieval + rerank time > 300 ms? | Reduce `k` or switch to a faster cross‑encoder |
| Hallucinations | LLM answers with unsupported features? | Add a guardrail node that checks for “unknown” patterns |
| Cost | Embedding calls blowing the budget? | Switch to `text-embedding-3-small` for non‑critical docs |
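A minimal sketch of that guardrail idea (the pattern list and fallback message are illustrative assumptions, not a production filter):

```python
# Sketch of a guardrail node: replace answers that look unsupported
UNSUPPORTED_PATTERNS = ("i don't know", "i'm not sure", "no information")  # hypothetical list

def guardrail(state: RAGState) -> dict:
    if any(p in state["answer"].lower() for p in UNSUPPORTED_PATTERNS):
        return {"answer": "I couldn't find that in the docs. Please contact support."}
    return {}

# Wire it in between the answer node and END:
#   graph.add_node("guardrail", guardrail)
#   graph.add_edge("answer", "guardrail"); graph.add_edge("guardrail", END)
```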
## Pro Tips & Tricks

- Hybrid Retrieval: Combine sparse (BM25) and dense (embeddings) scores for better coverage.
- Metadata‑driven Routing: Tag chunks with `section` or `product_version` and let LangGraph branch based on the query’s intent.
- Batch Embedding: When you have >10k docs, embed in batches of 500 to stay under OpenAI rate limits (see the sketch after this list).
- Cache Answers: Store `(query, answer)` pairs in Redis; many support tickets are repeats!
- Observability: Use LangGraph’s built‑in logging to trace which node took how long—perfect for debugging bottlenecks.
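A minimal sketch of the batch‑embedding tip (the helper name is mine; the batch size of 500 comes from the tip above):

```python
# Embed texts in fixed-size batches to stay under API rate limits
def embed_in_batches(texts, emb, batch_size=500):
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(emb.embed_documents(texts[i:i + batch_size]))
    return vectors

vectors = embed_in_batches(chunks, emb)
```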
## Conclusion & Call‑to‑Action
You’ve just turned a chaotic pile of docs into a smart, graph‑orchestrated RAG app that:
- Cleans & chunks data for optimal embeddings.
- Indexes with a cheap vector store.
- Routes queries through a LangGraph pipeline (retriever → reranker → LLM).
The real magic isn’t the code itself—it’s the habit of fine‑tuning your data before you hand it to the model. When the foundation is solid, LangGraph does the heavy lifting, and you get a bot that actually helps.
Your turn:
- Fork this repo, swap in your own docs, and watch the difference.
- Got a cool LangGraph pattern (e.g., multi‑LLM voting)? Drop a comment below!
Let’s keep the conversation going—happy hacking!
## References
- LangGraph docs: https://langchain.com/langgraph
- OpenAI Embeddings guide: https://platform.openai.com/docs/guides/embeddings
- Sentence‑Transformers cross‑encoders: https://www.sbert.net/docs/pretrained_models.html