🚀 How to Fine‑Tune Your Data and Build Smarter RAG Apps with LangGraph

Ever felt like your Retrieval‑Augmented Generation (RAG) app was talking to a clueless chatbot?

You’re not alone. In this post I’ll walk you through turning raw docs into a laser‑focused knowledge base and wiring it up with LangGraph, the new kid on the block that makes RAG pipelines feel like building with LEGO bricks. 🎉

👋 Intro: The “Why” Behind the Fine‑Tuning Fuss

Picture this: you’re building a customer‑support bot for a SaaS startup. You feed it the entire product manual (30k+ lines) and the model starts spitting out “I don’t know” or, worse, hallucinating answers about pricing that never existed.

The culprit? Noisy, unstructured data + a generic retrieval layer.

Fine‑tuning your data (cleaning, chunking, embedding) is the secret sauce that lets LangGraph’s graph‑based orchestration do its magic without the noise. By the end of this article you’ll have:

  1. A tidy, chunked dataset ready for embeddings.
  2. A simple LangGraph graph that routes queries intelligently.
  3. A handful of pro‑tips to keep your RAG app fast, cheap, and actually helpful.

Grab a coffee, and let’s get our hands dirty. ☕

📚 Step‑by‑Step: From Raw Docs to a Smarter RAG App

1⃣ Gather & Clean Your Source Material

```bash
# Example: pull markdown docs from a Git repo
git clone https://github.com/yourorg/product-docs.git
cd product-docs
```
| What to do | Why it matters |
| --- | --- |
| Strip HTML/Markdown (BeautifulSoup, mistune) | Reduces token clutter |
| Normalize whitespace (collapse repeated spaces and line breaks) | Improves chunk consistency |
| Remove boilerplate (headers, footers, navigation) | Cuts down irrelevant embeddings |
```python
import re
from pathlib import Path

def clean_md(text: str) -> str:
    # Remove markdown code fences, headings, and extra whitespace
    text = re.sub(r'```.*?```', '', text, flags=re.S)   # fenced code blocks
    text = re.sub(r'#+\s*', '', text)                    # headings
    text = re.sub(r'\s+', ' ', text).strip()             # collapse whitespace
    return text

docs = [clean_md(Path(p).read_text()) for p in Path('docs').rglob('*.md')]
```

2⃣ Chunk Like a Pro

LangGraph works best when each node receives reasonable‑sized chunks (≈200‑400 tokens). Too big → expensive embeddings; too small → loss of context.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,   # keep a bit of context between chunks
    separators=["\n\n", "\n", " "]
)

chunks = []
for doc in docs:
    chunks.extend(splitter.split_text(doc))

print(f"🔢 Got {len(chunks)} chunks")
```

Pro tip: If your docs contain tables or code snippets, add a custom separator (e.g., the "```" code-fence marker) so those blocks stay together; a quick sketch follows below.
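A minimal sketch of that idea, assuming Markdown‑style docs where code is fenced with triple backticks (and that you skip the code‑stripping regex from step 1 for those files):

```python
code_aware_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    # Try to break on fences first so code blocks stay intact,
    # then fall back to paragraphs, lines, and spaces
    separators=["```", "\n\n", "\n", " "],
)

code_chunks = []
for doc in docs:
    code_chunks.extend(code_aware_splitter.split_text(doc))
```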

3⃣ Embed & Index

LangGraph plays nicely with any vector store (FAISS, Pinecone, Qdrant). Below we’ll use FAISS for simplicity.

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

emb = OpenAIEmbeddings(model="text-embedding-3-large")  # strong; swap to -3-small to save cost
vectorstore = FAISS.from_texts(chunks, embedding=emb)

# Persist for later runs
vectorstore.save_local("faiss_index")
```
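On later runs you can load the saved index instead of re‑embedding everything. A quick sketch (the embedding model must match the one used to build the index, and newer LangChain versions may also ask for allow_dangerous_deserialization=True):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

emb = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = FAISS.load_local("faiss_index", emb)  # reuse the persisted index
```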

4⃣ Build the LangGraph Pipeline

LangGraph lets you define nodes (functions) and edges (routing logic) as a directed graph.

Our simple graph:

```
User Query --> Retriever --> (optional) Reranker --> LLM --> Response
```

```python
from langgraph.graph import Graph
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain
from sentence_transformers import CrossEncoder

# 1⃣ Retriever node
def retrieve(state):
    query = state["query"]
    docs = vectorstore.similarity_search(query, k=5)
    return {"query": query, "retrieved_docs": docs}

# 2⃣ (Optional) Reranker node – we’ll use a tiny cross‑encoder
def rerank(state):
    # For production, construct the cross-encoder once instead of per call
    cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
    query = state["query"]
    docs = state["retrieved_docs"]
    scores = cross.predict([(query, d.page_content) for d in docs])
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return {"query": query, "reranked_docs": [doc for _, doc in ranked[:3]]}

# 3⃣ LLM node
def answer(state):
    llm = ChatOpenAI(model="gpt-4o-mini")
    qa = load_qa_chain(llm, chain_type="stuff")  # simple concat-then-answer
    result = qa.run(input_documents=state["reranked_docs"], question=state["query"])
    return {"answer": result}

# Wire it together
graph = Graph()
graph.add_node("retrieve", retrieve)
graph.add_node("rerank", rerank)
graph.add_node("answer", answer)

graph.add_edge("retrieve", "rerank")
graph.add_edge("rerank", "answer")
graph.set_entry_point("retrieve")
graph.set_finish_point("answer")

app = graph.compile()
```

Now you can call the compiled graph like a function:

```python
def ask(query: str):
    result = app.invoke({"query": query})
    return result["answer"]

print(ask("How do I reset my API key?"))
```

5⃣ Test, Iterate, & Deploy

| Stage | What to watch | Quick fix |
| --- | --- | --- |
| Latency | Retrieval + rerank time > 300 ms? | Reduce k or switch to a faster cross‑encoder |
| Hallucinations | LLM answers with unsupported features? | Add a guardrail node that checks for “unknown” patterns (sketched below) |
| Cost | Embedding calls blowing the budget? | Switch to text-embedding-3-small for non‑critical docs |
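
A minimal sketch of such a guardrail node, added before graph.compile() in the pipeline above (the phrases to flag are just placeholders to adapt to your own failure modes):

```python
def guardrail(state):
    # Catch answers that sound unsupported and swap in a safe fallback
    answer = state["answer"]
    suspicious = ["i don't know", "as an ai", "not mentioned in the context"]
    if any(phrase in answer.lower() for phrase in suspicious):
        return {"answer": "I couldn't find that in the docs. Please contact support."}
    return {"answer": answer}

graph.add_node("guardrail", guardrail)
graph.add_edge("answer", "guardrail")   # run the check after the LLM node
graph.set_finish_point("guardrail")     # finish here instead of at "answer"
```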

🛠 Pro Tips & Tricks

  • Hybrid Retrieval: Combine sparse (BM25) and dense (embedding) scores for better coverage (see the sketch after this list).
  • Metadata‑driven Routing: Tag chunks with section or product_version and let LangGraph branch based on the query’s intent.
  • Batch Embedding: When you have >10k docs, embed in batches of 500 to stay under OpenAI rate limits.
  • Cache Answers: Store (query, answer) pairs in Redis; many support tickets are repeats!
  • Observability: Use LangGraph’s built‑in logging to trace which node took how long—perfect for debugging bottlenecks.
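
For the hybrid‑retrieval tip, one option is LangChain’s EnsembleRetriever, which blends BM25 and vector scores. A rough sketch, assuming the chunks and vectorstore from earlier (BM25Retriever needs the rank_bm25 package, and the 0.4/0.6 weights are just a starting point to tune):

```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Sparse retriever: plain keyword matching over the raw chunks
bm25 = BM25Retriever.from_texts(chunks)
bm25.k = 5

# Dense retriever: the FAISS index built earlier
dense = vectorstore.as_retriever(search_kwargs={"k": 5})

# Blend both result lists; tune the weights on your own eval queries
hybrid = EnsembleRetriever(retrievers=[bm25, dense], weights=[0.4, 0.6])

def retrieve(state):
    query = state["query"]
    return {"query": query, "retrieved_docs": hybrid.get_relevant_documents(query)}
```

Everything downstream (rerank, answer) stays the same; only the retrieve node changes.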

🎯 Conclusion & Call‑to‑Action

You’ve just turned a chaotic pile of docs into a smart, graph‑orchestrated RAG app that:

  1. Cleans & chunks data for optimal embeddings.
  2. Indexes with a cheap vector store.
  3. Routes queries through a LangGraph pipeline (retriever → reranker → LLM).

The real magic isn’t the code itself—it’s the habit of fine‑tuning your data before you hand it to the model. When the foundation is solid, LangGraph does the heavy lifting, and you get a bot that actually helps.

🚀 Your turn:

  • Fork this repo, swap in your own docs, and watch the difference.
  • Got a cool LangGraph pattern (e.g., multi‑LLM voting)? Drop a comment below!

Let’s keep the conversation going—happy hacking! 🎉

This content originally appeared on DEV Community and was authored by Youvandra Febrial