Introduction
This post walks through how I built a Retrieval Augmented Generation (RAG) chatbot using TypeScript and Next.js, along with a companion Node.js tool to fetch and index documents. The chatbot UI is based on Vercel’s Chat SDK, which is built on their excellent AI SDK.
Retrieval Augmented Generation (RAG) is a technique for providing additional context to an LLM to improve the accuracy of its response. This may be needed if the information is held in a private knowledge base for example, or if you want the LLM to always use the most up-to-date documentation.
By the end of the post, you’ll know how to create a RAG chatbot that uses your own knowledge base to answer questions, and links to the source documents.
All the code is available in this GitHub repo: https://github.com/emertechie/rag-ai-chatbot. The cal.com developer documentation is used as an example knowledge base in this post and the code.
Who is this post for?
This post is aimed at JavaScript developers who want to explore how to build a RAG chatbot, but are not familiar with Python or Python-based libraries like LangChain. You may have heard of RAG terms like embeddings and vector databases, but are not sure how they fit together.
Setup
If you want to get the project running locally, follow the steps below:
- Fork the rag-ai-chatbot repository on GitHub
- Clone your forked repository locally
- Create a PostgreSQL database and set its connection string in the DATABASE_URL environment variable in .env.local
- Create a Redis instance and set its connection string in the REDIS_URL environment variable in .env.local
- Sign up for an OpenAI platform account (different to a ChatGPT account), add some credit (you'll likely need much less than $5), create an API key, and set it in the OPENAI_API_KEY environment variable in .env.local
- Install dependencies using pnpm install
- Start the development server using pnpm dev
These steps are based on the Chat SDK local setup guide, but with OpenAI used instead of Grok as the AI provider. I also didn’t configure a blob store, as I won’t be using file storage in this example.
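For reference, a minimal .env.local could look something like the following. The placeholder values are illustrative only, and the Chat SDK template may require additional variables (check the example env file in the repo for the definitive list):

```
# .env.local -- placeholder values for illustration only
DATABASE_URL=postgres://user:password@localhost:5432/rag_chatbot
REDIS_URL=redis://localhost:6379
OPENAI_API_KEY=sk-your-api-key
```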
RAG overview
Before describing the system components, it’s important to have a basic understanding of RAG systems. Feel free to skip this section if you’re already familiar.
RAG systems provide additional context to an LLM so it can accurately answer a user’s query. A RAG system is based around embeddings, also known as vectors. An embedding is a numerical representation of text or images that captures its semantic meaning. From the AI SDK documentation:
Embeddings are a way to represent words, phrases, or images as vectors in a high-dimensional space. In this space, similar words are close to each other, and the distance between words can be used to measure their similarity.
An ingestion process fetches source documents, breaks them up into smaller chunks, and creates an embedding (vector) for each. Embeddings are stored in a vector database along with the original chunk text, and details of the source document.
A vector database efficiently stores and indexes vectors (surprise) and provides a way to query the distance between vectors in the high-dimensional space, which is the key to finding closely related information.
When a user queries the chatbot, their query is turned into an embedding and used to find the closest vectors in the database. The chunks associated with those vectors contain information highly related to the query and are passed to the chatbot LLM so it can generate its response.
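In code, that query-time flow boils down to just a few steps. Here's a conceptual sketch using the AI SDK, where the model names and the searchSimilarChunks helper are placeholders; the project's actual implementations appear later in the post:

```typescript
import { embed, generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

// Stand-in for a vector database query; the project's real implementation is shown later.
declare function searchSimilarChunks(embedding: number[], limit: number): Promise<{ content: string }[]>;

async function answerWithRag(query: string): Promise<string> {
  // 1. Turn the user's query into an embedding (using the same model as at indexing time).
  const { embedding } = await embed({
    model: openai.textEmbeddingModel('text-embedding-3-small'),
    value: query,
  });

  // 2. Find the stored chunks whose embeddings are closest to the query embedding.
  const chunks = await searchSimilarChunks(embedding, 5);

  // 3. Give the LLM those chunks as context so it can generate a grounded answer.
  const { text } = await generateText({
    model: openai('gpt-4o-mini'),
    system: `Answer using only this context:\n${chunks.map((c) => c.content).join('\n---\n')}`,
    prompt: query,
  });

  return text;
}
```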
System components
This system integrates the following components to build a RAG chatbot:
- AI library: AI SDK
- AI provider: OpenAI
- Vector database: Postgres pgvector extension
- Document indexer: Custom implementation
- Chatbot UI: Chat SDK
- Document source UI component: Custom implementation
- Chatbot state databases: Postgres and Redis
The following sections describe some of these in more detail.
AI library
Using an AI library lets you work with a unified API over the different LLM providers, models, and capabilities available in AI systems today.
I initially considered the Python-based LangChain library, given that it was one of the original LLM frameworks and is the foundation of many AI systems. But I wanted something in the JavaScript space that I could get up and running with more quickly.
Fortunately, there’s now an excellent TypeScript option in the AI SDK from Vercel. The SDK has two main libraries:
- AI SDK Core: “A unified API for generating text, structured objects, tool calls, and building agents with LLMs.”
- AI SDK UI: “A set of framework-agnostic hooks for quickly building chat and generative user interface.”
The SDK also has excellent documentation, something that can be lacking in other libraries.
Document indexer
To get my source documents into a vector database format, I needed a document indexer process to handle:
- Document ingestion: loading documents from various sources
- Chunking: breaking a document into smaller pieces called chunks
- Embedding: creating an embedding for each chunk, representing its semantic meaning
- Indexing: storing the embeddings and chunks in a vector database to enable querying
I built my indexer in Node.js, and the following sections describe the different steps in the process in more detail. The indexer source code is available in the /indexer folder of the repo.
Document ingestion
I wanted to index Markdown documents from my local machine, so I could quickly experiment with different chunking and indexing options. I also wanted to support other document sources in future, so the indexer has a DataSource base class with an overridable discoverDocuments function:
/**
* Discover indexable documents from this data source
* @param options Configuration options for discovery
* @returns AsyncGenerator that yields indexable documents one by one
*/
abstract discoverDocuments(options: DataSourceOptions): AsyncGenerator<IndexableDocument, void, unknown>;
Note that this uses an async generator to return documents one by one as they are loaded, avoiding the need to load all documents into memory before processing can start.
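As a rough sketch of the pattern (the repo's actual DataSource and IndexableDocument types may differ slightly), an implementation just yields documents as it finds them, and the indexer consumes them with for await:

```typescript
// Sketch of the data source pattern; the repo's actual types may differ slightly.
interface IndexableDocument {
  sourceUri: string;
  content: string;
}

class InMemoryDataSource {
  constructor(private docs: IndexableDocument[]) {}

  // Yield documents one at a time so the indexer can start processing immediately,
  // without loading everything into memory first.
  async *discoverDocuments(): AsyncGenerator<IndexableDocument, void, unknown> {
    for (const doc of this.docs) {
      yield doc;
    }
  }
}

// The indexer consumes the stream with `for await`:
async function indexAll(source: InMemoryDataSource) {
  for await (const doc of source.discoverDocuments()) {
    console.log(`Indexing ${doc.sourceUri} (${doc.content.length} chars)`);
  }
}
```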
The indexer comes with a couple of DataSource implementations, described below.
FileSystemDataSource
The FileSystemDataSource type was the first implementation and allowed me to use local Markdown files to quickly get an end-to-end test working.
However, I wanted to link to source documents from the chatbot UI—to show where the LLM got its information—so local file paths were not much use to anyone else! I needed a data source to discover and fetch documents from a URL, described next.
URLDataSource
I was considering building a web scraper, but then I thought of the new llms.txt proposal, which attempts to standardize how sites present information for LLMs. In the proposal, now adopted by many documentation sites such as Stripe and Cloudflare, the llms.txt file contains a list of documentation links, but each has a .md extension to return the content as Markdown.
So using llms.txt effectively gives you a sitemap of a site's content, but all links point to Markdown files, which are much easier for an LLM to consume than HTML. And by simply removing the .md extension at the end of each link, you have the link to the human-readable version, which you can use in the chatbot UI to link to document sources.
This was a perfect solution for my purposes since the cal.com docs I was using to test against also supported llms.txt.
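A URL-based data source built on this idea only has to fetch llms.txt, pull out the Markdown links, and download each one. Here's a simplified sketch; the repo's URLDataSource handles more edge cases than this:

```typescript
// Simplified sketch of discovering documents via llms.txt;
// the repo's URLDataSource handles more edge cases.
interface IndexableDocument {
  sourceUri: string; // human-readable URL, i.e. the link without its .md extension
  content: string;   // raw Markdown content
}

async function* discoverFromLlmsTxt(llmsTxtUrl: string): AsyncGenerator<IndexableDocument> {
  const listing = await fetch(llmsTxtUrl).then((res) => res.text());

  // llms.txt is itself Markdown, so extract every link that ends in .md
  const mdLinks = [...listing.matchAll(/\((https?:\/\/[^\s)]+\.md)\)/g)].map((match) => match[1]);

  for (const mdUrl of mdLinks) {
    const content = await fetch(mdUrl).then((res) => res.text());
    yield {
      sourceUri: mdUrl.replace(/\.md$/, ''), // strip .md to link to the human-readable page
      content,
    };
  }
}
```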
Chunking
The next step in the indexing process is chunking, which breaks up larger documents into smaller chunks. An embedding is created for each chunk to capture its semantic meaning.
The size and content of each chunk affect how well its embedding matches with a query. If a chunk is too big, its embedding may represent too many details, and won't be "close" in the high-dimensional space to a user's query. If it's too small, it may not capture enough context to make a good match possible.
Unfortunately, there isn’t a one-size-fits-all solution to chunk sizing. It’s highly dependent on your content and its structure. For example, a novel with long narrative passages may perform well with larger chunks, but using larger chunks with a dense, technical Markdown document may include too much distinct information in each chunk.
Text splitters
There are different strategies to split up text into chunks, including by length, by sentence or paragraph breaks, and by using additional knowledge of the document structure such as Markdown headers.
There are tools to help you visualize how the different strategies and parameter values affect chunks, such as this one: https://huggingface.co/spaces/m-ric/chunk_visualizer.
In this project, I was indexing Markdown documents so I wanted a Markdown-specific splitter. Unfortunately, Vercel's AI SDK doesn't include any text splitters, but I was able to use the MarkdownTextSplitter from LangChain's TypeScript library and fall back to the RecursiveCharacterTextSplitter for any other file type.
import { MarkdownTextSplitter, RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
...
const textSplitter = isMarkdownDoc
? new MarkdownTextSplitter({ chunkSize, chunkOverlap })
: new RecursiveCharacterTextSplitter({ chunkSize, chunkOverlap, separators });
const chunks = await textSplitter.createDocuments([document.content]);
Embedding
Once you have your documents split into chunks, you can create embeddings for each using a text embedding model. For this project I used OpenAI's text-embedding-3-small model, and used the AI SDK embedMany function to process multiple chunks at a time:
const { embeddings } = await embedMany({
model: myProvider.textEmbeddingModel('embedding-model'),
values: chunks.map(chunk => chunk.content),
});
The embedding model is also used to create the embedding of a user’s query, so the same model must be used for comparisons to make sense.
Note: the embedding model is not the same as the LLM model used by the chatbot. A text embedding model specializes in converting text into dense vectors that capture semantic meaning. Its focus is on retrieval tasks, not generation.
Indexing
The last step in the indexer process is to store each embedding in the vector database, along with the chunk of text it represents, and a link to the source document. If a query embedding matches an embedding in the database, the corresponding original chunk of text can be passed to the LLM as context for its response, and a link to the source document shown in the chatbot UI.
Special-purpose vector databases such as Pinecone can be used if performance and scalability are a concern. Since I just wanted to run this project locally though, and already had a Postgres database for general application state, I chose to use the pgvector Postgres extension.
Vercel's Chat SDK repo already uses the Drizzle ORM for database query and migration support. I added TypeScript schema definitions for the Resource and ResourceChunk tables (Document was already in use), and used pnpm db:generate to create the migrations shown below:
CREATE TABLE IF NOT EXISTS "Resource" (
"id" uuid PRIMARY KEY DEFAULT gen_random_uuid() NOT NULL,
"source_type" varchar(50) NOT NULL,
"source_uri" text NOT NULL,
"content_hash" text NOT NULL,
"createdAt" timestamp DEFAULT now() NOT NULL,
"updatedAt" timestamp DEFAULT now() NOT NULL,
CONSTRAINT "Resource_source_uri_unique" UNIQUE("source_uri")
);
CREATE TABLE IF NOT EXISTS "ResourceChunk" (
"id" uuid PRIMARY KEY DEFAULT gen_random_uuid() NOT NULL,
"resource_id" uuid NOT NULL,
"content" text NOT NULL,
"embedding" vector(1536)
);
ALTER TABLE "ResourceChunk" ADD CONSTRAINT "ResourceChunk_resource_id_Resource_id_fk" FOREIGN KEY ("resource_id") REFERENCES "public"."Resource"("id") ON DELETE cascade ON UPDATE no action;
CREATE INDEX IF NOT EXISTS "embedding_index" ON "ResourceChunk" USING hnsw ("embedding" vector_cosine_ops);
This migration creates a straightforward parent-child relationship between Resource and ResourceChunk, with a couple of notable columns:
- Resource.content_hash: a hash of the entire source document, used to skip processing unchanged ones
- ResourceChunk.embedding: the embedding for the text chunk stored in the content column
A special hnsw index is automatically created by Drizzle to support efficient querying on the embedding vector column.
Note: different embedding models use different numbers of dimensions, so ensure the embedding column size matches your selected model. The AI SDK documentation has a list of supported embedding models and their dimension sizes.
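For reference, the Drizzle schema behind that migration looks roughly like the sketch below, using Drizzle's pgvector column and index helpers (the schema file in the repo is the source of truth):

```typescript
import { index, pgTable, text, timestamp, uuid, varchar, vector } from 'drizzle-orm/pg-core';

// Rough sketch of the Drizzle schema behind the migration above;
// the repo's schema file is the source of truth.
export const resource = pgTable('Resource', {
  id: uuid('id').primaryKey().notNull().defaultRandom(),
  sourceType: varchar('source_type', { length: 50 }).notNull(),
  sourceUri: text('source_uri').notNull().unique(),
  contentHash: text('content_hash').notNull(),
  createdAt: timestamp('createdAt').notNull().defaultNow(),
  updatedAt: timestamp('updatedAt').notNull().defaultNow(),
});

export const resourceChunk = pgTable(
  'ResourceChunk',
  {
    id: uuid('id').primaryKey().notNull().defaultRandom(),
    resourceId: uuid('resource_id')
      .notNull()
      .references(() => resource.id, { onDelete: 'cascade' }),
    content: text('content').notNull(),
    // Dimension count must match the embedding model (1536 for text-embedding-3-small)
    embedding: vector('embedding', { dimensions: 1536 }),
  },
  (table) => [
    index('embedding_index').using('hnsw', table.embedding.op('vector_cosine_ops')),
  ],
);
```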
In the TypeScript database code, since an embedding is just an array of numbers, you don't need special handling when inserting one into the database:
const chunkValues = chunksWithEmbeddings.map(([chunk, embedding]) => ({
resourceId,
content: chunk.content,
embedding, // Type: number[]
}));
return await db.insert(resourceChunk).values(chunkValues).returning();
Document deletion
The document indexer inserts or updates content each time it's run. But what about previously indexed documents that no longer exist? You don't want your LLM providing old information, so the indexer also needs to remove deleted documents from the database.
By tracking the set of discovered documents on each run, and comparing it with the list of documents currently stored in the database, the indexer can determine which documents used to exist but no longer do, and delete them.
Deletion tracking is implemented in the indexer/index.ts script, where the core logic looks as follows:
const discoveredUris = new Set<string>();
for await (const document of dataSource.discoverDocuments({})) {
discoveredUris.add(document.sourceUri);
await processDocument(document);
}
await handleDocumentDeletion(dataSource.getSourceType(), discoveredUris);
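The deletion helper can then be little more than a delete query for anything of that source type that wasn't rediscovered. Here's a sketch of the idea using Drizzle (the import paths are assumptions; see indexer/index.ts in the repo for the real implementation):

```typescript
import { and, eq, notInArray } from 'drizzle-orm';
import { db } from '../lib/db';               // assumed import path
import { resource } from '../lib/db/schema';  // assumed import path

// Sketch: delete previously indexed resources of this source type that were not
// rediscovered on this run. Their chunks are removed too, via ON DELETE CASCADE.
async function handleDocumentDeletion(sourceType: string, discoveredUris: Set<string>) {
  const uris = [...discoveredUris];
  await db.delete(resource).where(
    uris.length > 0
      ? and(eq(resource.sourceType, sourceType), notInArray(resource.sourceUri, uris))
      : eq(resource.sourceType, sourceType),
  );
}
```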
Chatbot integration
Search tool
The LLM uses a tool interface to query the information in the vector database. The AI SDK has a really nice interface for defining tools, including a type-safe parameters collection. The repo includes example tools from the Chat SDK in the lib/ai/tools folder, such as one to get the weather.
I added a searchKnowledge tool with the following definition:
export const searchKnowledge = tool({
description: 'Search for relevant information about cal.com developer documentation in the knowledge base using a natural language query',
parameters: z.object({
query: z.string().describe('The search query to find relevant information'),
}),
execute: async ({ query }) => {
...
}
})
The LLM uses a tool's description to know when to call it, and uses each parameter description to construct the parameters passed to the execute function.
For my search tool, the execute function receives the user's query and calls the AI SDK's embed function to create an embedding for it:
const { embedding } = await embed({
model: myProvider.textEmbeddingModel("embedding-model"),
value: query, // `query` passed to the `execute` function
});
This embedding can then be compared with the embeddings in the database to find related chunks of text. The most common way to compare embeddings is to use cosine similarity, which measures the cosine of the angle between two vectors. This gives a floating point value (usually between 0 and 1) indicating how similar their direction is.
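To make that concrete, here's cosine similarity written out in plain TypeScript. In this project the comparison actually happens inside Postgres via pgvector, as shown next, but the maths is the same:

```typescript
// Cosine similarity of two vectors: dot(a, b) / (|a| * |b|).
// A value close to 1 means the vectors point in nearly the same direction.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```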
Embedding comparison is implemented in the searchSimilarChunks function, which uses cosine similarity to sort and filter the results:
export async function searchSimilarChunks({
embedding,
limit = 5,
threshold = 0.6,
}: {
embedding: number[];
limit?: number;
threshold?: number;
}) {
try {
const similarity = sql<number>`1 - (${cosineDistance(resourceChunk.embedding, embedding)})`;
const results = await db
.select({
chunkId: resourceChunk.id,
chunkContent: resourceChunk.content,
resourceId: resource.id,
resourceType: resource.sourceType,
resourceUri: resource.sourceUri,
resourceCreatedAt: resource.createdAt,
resourceUpdatedAt: resource.updatedAt,
similarity,
})
.from(resourceChunk)
.innerJoin(resource, eq(resourceChunk.resourceId, resource.id))
.where(gt(similarity, threshold)) // filter out chunks below the similarity threshold
.orderBy((t) => desc(t.similarity)) // most similar first
.limit(limit); // take top N results
return results;
} ...
}
You may need to tweak the threshold parameter value to ensure relevant chunks are returned from the database. I had to change it from 0.75 to 0.6 after changing the text embedding model, for example.
System prompt
The system prompt is how you control the chatbot LLM’s overall behaviour. For example, asking it to assume a particular role or giving it constraints about what it can and cannot answer.
Here’s the original system prompt included with the Chat SDK:
You are a friendly assistant! Keep your responses concise and helpful
I wanted to test how accurately this assistant responded to queries. Since I was using the public cal.com documentation to test with, though, there was a chance the LLM's training data already included a previous copy of that site. So I needed to ensure my chatbot was actually using my indexed documents, and not its own (possibly stale) knowledge.
I started with a blank vector database and used the FileSystemDataSource introduced earlier to index a local copy of the cal.com documentation, which included a freshly merged PR with some updated documentation that I knew wouldn't be in the LLM's training data.
I then asked the chatbot about that new information, and my friendly assistant helpfully hallucinated a response that sounded very plausible, but unfortunately wasn’t accurate. So I needed to tweak the system prompt.
After a bit of trial and error, I ended up with this system prompt:
You are a friendly assistant that can help with developer questions about using cal.com.
You can ONLY answer using knowledge you get from the tools you have access to.
DO NOT RELY ON YOUR OWN KNOWLEDGE TO ANSWER THE QUESTION.
If you cannot answer the question, say "I'm sorry, I don't know the answer to that". No exceptions.
You have to explicitly tell the ever-eager LLM to say "I don't know" if it can't answer your question. I also had to really emphasise that it shouldn't use its own knowledge, since earlier, gentler prompts still led to hallucinations.
After tweaking the prompt and seeing the chatbot correctly use the information from my database, I temporarily disconnected my searchKnowledge tool from the LLM to confirm the chatbot would no longer be able to answer my original query. After starting a new chat (so previous messages with the correct answer weren't provided as context to the LLM), I re-ran my original query and the LLM correctly told me it didn't know the answer. Much better!
As this shows, it's important to test not only positive cases but also negative ones, where the LLM shouldn't answer.
While testing, you'll probably come up with several different queries that exercise different response behavior, so it's also important to ensure these aren't forgotten about if, for example, someone else adjusts the system prompt in future. To prevent these kinds of regressions, you can use evals. These are automated tests which get the LLM to answer a set of pre-defined questions, and often use another LLM to judge if the response was correct (since the responses are non-deterministic).
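A minimal eval can be as simple as a list of question/rubric pairs and a second model acting as judge. Here's a rough sketch using the AI SDK's generateText; the questions, rubrics, and askChatbot helper are hypothetical, and dedicated eval frameworks exist if you need more than this:

```typescript
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

// Stand-in for whatever sends a query through your RAG chatbot and returns its answer.
declare function askChatbot(question: string): Promise<string>;

// Hypothetical cases; in practice these come from the queries you used while testing.
const cases = [
  { question: 'How do I set up a webhook in cal.com?', rubric: 'Answers from the documentation and links to a source document' },
  { question: 'What is the capital of France?', rubric: 'Declines to answer, since this is not in the knowledge base' },
];

async function runEvals() {
  for (const { question, rubric } of cases) {
    const answer = await askChatbot(question);
    // Use a second model as judge, since chatbot responses are non-deterministic.
    const { text: verdict } = await generateText({
      model: openai('gpt-4o-mini'),
      prompt: `Question: ${question}\nAnswer: ${answer}\nDoes the answer satisfy this rubric: "${rubric}"? Reply PASS or FAIL.`,
    });
    console.log(`${verdict.trim()} - ${question}`);
  }
}
```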
One final adjustment I made was to ensure that, if the chatbot response links to a Markdown document, it strips off the trailing .md extension from the URL so it points to the HTML version:
If linking to a document returned from the \`searchKnowledge\` tool, remove any '.md' extension from the link (That will link to a human-readable version).
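For context, here's roughly how the system prompt and search tool come together in the chat API route. This is a sketch of the shape only, with assumed import paths; the actual route in the Chat SDK also handles auth, message persistence, and resumable streams:

```typescript
import { convertToCoreMessages, streamText } from 'ai';
// Assumed import paths for illustration; the Chat SDK organises these slightly differently.
import { myProvider } from '@/lib/ai/providers';
import { systemPrompt } from '@/lib/ai/prompts';
import { searchKnowledge } from '@/lib/ai/tools/search-knowledge';

export async function POST(request: Request) {
  const { messages } = await request.json();

  const result = streamText({
    model: myProvider.languageModel('chat-model'),
    system: systemPrompt,                      // the prompt shown above
    messages: convertToCoreMessages(messages),
    tools: { searchKnowledge },                // the LLM decides when to call the tool
    maxSteps: 5,                               // allow a tool call followed by a text answer
  });

  return result.toDataStreamResponse();
}
```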
Showing document sources
Similar to how online LLM tools like Perplexity link to their sources, I wanted to show which documents were used to answer the user's query, to give users more confidence in the response accuracy.
The AI SDK's useChat hook returns all the chat messages so far, including results from tool calls. So by filtering for my searchKnowledge tool, I can pass the result from the searchSimilarChunks query function, discussed above, to my SearchKnowledge React component. The result includes any chunks relevant to the query, and the document they belong to.
The component shows a simple “Sources” header and link. The UI and UX can certainly do with some improvement, but that wasn’t my focus for this project.
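Pulling the sources out of a message looks roughly like the sketch below, which assumes the AI SDK 4.x message parts shape (the repo's message rendering component is the real reference). The resourceUri and chunkContent fields come from the searchSimilarChunks result shown earlier:

```tsx
import type { UIMessage } from 'ai';

// Sketch: extract searchKnowledge tool results from a chat message so a
// <SearchKnowledge /> component can render "Sources" links.
function getKnowledgeSources(message: UIMessage) {
  return message.parts.flatMap((part) => {
    if (part.type !== 'tool-invocation') return [];
    const invocation = part.toolInvocation;
    if (invocation.toolName !== 'searchKnowledge' || invocation.state !== 'result') return [];
    // Each row pairs a matched chunk with the URI of the document it came from.
    return invocation.result as { resourceUri: string; chunkContent: string }[];
  });
}
```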
Putting it all together
With all those components in place, I can index the cal.com docs using the following command line:
npx tsx --env-file=.env.local indexer/index.ts --url=https://cal.com/docs/llms.txt
And run the chatbot with:
pnpm dev
For additional indexer flags, see the README.
Summary
In this post I showed how to use the AI and Chat SDKs from Vercel, along with a custom document indexer, to roll your own RAG chatbot. A reminder that all the source code is available here: https://github.com/emertechie/rag-ai-chatbot.
I encourage you to play around with the system and try to apply it to a real use case. Having a practical application is a great way to really learn something in depth.
Also check out the docs for the AI SDK and Chat SDK, which are great resources for further learning.