This content originally appeared on DEV Community and was authored by Sri Hari Karthick
Introduction
Welcome to Part 2 of this multi-part series. If you haven’t read Part 1 yet, I highly recommend starting there; it sets the stage for everything we’re about to dive into.
But hey, if you’d rather live dangerously and skip ahead, I won’t stop you. Be bold. Be different.
The Minutiae
This section dives into the nitty-gritty: as much technical detail as I can reasonably pack in. Hopefully, it’s still accessible enough to keep things interesting without flying over your head.
Project Organisation
Let’s start with the structure. I tried my best to ensure high cohesion and low coupling: a single method or class does only related and relevant things. This is to maintain a good separation of concerns (I’m still learning too, so please don’t judge if you’re an expert), which leads to easier extension, modification, and even testing. The following are the main folders of interest, in alphabetical order:
- The `dto/` folder consists of the data transfer objects for requests, responses, and return signatures of methods. This ensures that responses are properly structured and also allows for better autocompletion support.
- The `models/` folder contains the system models defined for the Retrieval system (following our hybrid strategy) and the Generator system (a variety of LMs pulled from Hugging Face).
- The `prompt_templates/` folder holds different prompt text files, configured in the `config.yml`, for the system to read. This allows for quick comparisons between various prompts to evaluate performance.
- The `rag/` folder contains the actual engine that orchestrates the end-to-end process, from receiving a query via the server to dispatching it to the retrieval and generator handlers, which in turn use the models to get the job done.
- The `scripts/` folder includes scripts to start the server, handy if the application needs to be containerised.
- The `ui/` folder contains the HTML, CSS, and JS files required to display the UI and let users interact with the system.
- The `utils/` folder includes utility functions for a variety of tasks: loading config, logging, and even consuming the public ArXiv API.
- `app.py` defines the actual FastAPI server and is the main entry point where the application lifecycle starts.
- `config.yml` includes various configuration options (each one commented for clarity) that control how the system behaves.
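To give a rough feel for how `config.yml` and `prompt_templates/` fit together, here is a minimal sketch of what a config loader in `utils/` could look like. The key names (such as `generator.prompt_template`) and helper names are my own illustration, not the exact ones from the repo.

```python
# Hypothetical sketch of a config loader in utils/ (keys and names are illustrative).
from pathlib import Path

import yaml


def load_config(path: str = "config.yml") -> dict:
    """Read the YAML configuration that drives how the system behaves."""
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)


def load_prompt_template(config: dict) -> str:
    """Load the prompt file named in the config from prompt_templates/."""
    template_file = config["generator"]["prompt_template"]  # assumed key
    return Path("prompt_templates", template_file).read_text(encoding="utf-8")
```

Keeping the templates as plain text files means swapping prompts is just a one-line config change.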
The Retriever
As mentioned earlier, retrieval here is hybrid: rather than relying solely on a local vector store or solely on a remote API, the system combines the two. Based on the configuration in the YAML file, it first tries to fetch documents from the local vector store; if the number of results is insufficient, it fetches the remainder from the ArXiv API. The abstracts serve as the primary “documents”, since we care most about their content, while the remaining metadata is also preserved.
The abstract usually captures most of the paper’s overview and is concise enough that indexing remains fast. The system also provides direct links to the full papers for those interested in reading more.
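To make the flow concrete, here is a minimal sketch of the hybrid strategy, assuming a vector store with a search method and a thin ArXiv client. The interfaces are placeholders for illustration, not the project’s actual classes.

```python
# Hypothetical sketch of the hybrid retrieval flow (interfaces are illustrative).
def retrieve(query: str, top_k: int, vector_store, arxiv_client) -> list[dict]:
    # 1. Try the local vector store first.
    docs = vector_store.search(query, top_k=top_k)

    # 2. If we came up short, top up the remainder from the ArXiv API.
    missing = top_k - len(docs)
    if missing > 0:
        docs.extend(arxiv_client.fetch_abstracts(query, max_results=missing))

    # Each "document" is the paper abstract plus its metadata (title, authors, link).
    return docs
```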
There’s an optional config setting to enable caching. When enabled, documents returned by the ArXiv API can be stored in the vector store for future retrieval. This caching happens in a background thread, so the language model can begin processing the current response immediately, avoiding unnecessary latency for the user.
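The caching itself can be as simple as a fire-and-forget daemon thread, along these lines (again a sketch under assumed method names, not the exact implementation):

```python
import threading


def cache_in_background(vector_store, remote_docs: list[dict]) -> None:
    """Index freshly fetched ArXiv documents without blocking the response."""
    def _worker() -> None:
        vector_store.add_documents(remote_docs)  # assumed method name

    # Daemon thread: generation can start immediately while indexing happens.
    threading.Thread(target=_worker, daemon=True).start()
```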
The Generator
The Generator can use any of several small LLMs (Mistral-7B, Phi-2, GPT-2) to ingest the retrieved documents and generate a (hopefully) cited summary using the specified prompt template. This is built end-to-end with Hugging Face libraries, which pull in the appropriate model and tokenizer.
This also opens up the possibility of experimenting with fine-tuning for better results (similar to our summarisation fine-tuning using LoRA here), but that’s well beyond the current scope of this portfolio project.
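In spirit, the generation step boils down to a Hugging Face text-generation pipeline fed with a filled-in prompt template, roughly like the sketch below. The model name, prompt placeholders, and generation settings are illustrative; the real values come from `config.yml`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "gpt2"  # placeholder; the config can point to Mistral-7B, Phi-2, etc.

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)


def generate_answer(query: str, docs: list[dict], template: str) -> str:
    # Number the retrieved abstracts so the model can cite them as [1], [2], ...
    context = "\n\n".join(f"[{i + 1}] {d['abstract']}" for i, d in enumerate(docs))
    prompt = template.format(query=query, context=context)  # assumed placeholders
    output = generator(prompt, max_new_tokens=256, do_sample=False)
    return output[0]["generated_text"]
```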
Mock mode
Running the full system can (and probably does) exceed the resources of a typical computer, so I’ve added a mock mode for both the retriever and the generator. As you might expect, this mode returns mock responses for either component, or both, making it a great way to quickly try out the system locally. It also made bug-squashing much faster during development!
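Conceptually, mock mode is just a flag check before any heavy lifting, something along these lines (the config key and canned payload are illustrative):

```python
def retrieve_documents(query: str, config: dict, retriever) -> list[dict]:
    # When mock mode is on, skip retrieval entirely and return canned results.
    if config["retriever"].get("mock", False):  # assumed config key
        return [{"title": "Mock paper", "abstract": "A mock abstract.", "link": "#"}]
    return retriever.retrieve(query)
```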
Pit Stop
We’ll take another short pause here before diving into the next part. Up next: running the system live in a cloud environment, along with a look back at what worked, what didn’t, and the lessons learned along the way. Hope you’re still with me!