RAG Made Simple: Demonstration and Analysis of Simplicity (Part 3)



This content originally appeared on DEV Community and was authored by Sri Hari Karthick


Live Demonstration

This demonstration was run on a cloud instance with an RTX A5000 GPU using Microsoft’s Phi model as the generator. If you plan to use Mistral as the language model, note that it requires a Hugging Face API key since it is not publicly accessible. Phi and GPT models can be used without a key by configuring them in config.yml.
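For reference, here is a minimal sketch of what such a config.yml could look like. The key names, model identifier, and values are illustrative assumptions, not the project's actual schema:

```yaml
# Illustrative config.yml sketch; the project's real schema may differ.
model:
  name: microsoft/phi-2     # assumed Hugging Face model id
  device: cuda
  max_new_tokens: 256
hf_api_key: null            # only required for gated models such as Mistral
```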

The first run will take longer, since the model weights and the embedding function for ChromaDB are downloaded and cached on first use.
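To show where that first-run cost comes from, here is a minimal sketch of initialising ChromaDB with a sentence-transformers embedding function. The storage path, collection name, and embedding model are assumptions, not the project's actual choices:

```python
import chromadb
from chromadb.utils import embedding_functions

# The first call downloads the sentence-transformers weights and caches
# them locally; subsequent runs reuse the cache and start much faster.
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"  # assumed embedding model
)

# A persistent client writes vectors to disk, so the store survives restarts.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="papers", embedding_function=ef)
```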

Below is a video of the system in action. (The response time is noticeably slow due to GPU limitations.)

[Video]

What Went Well…

  • Surprisingly good output from Phi: After multiple rounds of prompt tuning (and no fine-tuning at all), Phi generated coherent summaries with proper inline citations for the given query.
  • Vector store growth: Documents retrieved from the API were successfully vectorised and stored in ChromaDB for future use (see the caching sketch after this list).
  • Responsive UI: The interface remained snappy and interactive throughout the demonstration.
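To illustrate the vector-store growth noted above, here is a minimal sketch of caching API results in ChromaDB. The cache_documents helper and the paper fields are hypothetical stand-ins for the project's actual code:

```python
def cache_documents(collection, papers):
    """Vectorise retrieved papers and persist them for future queries.

    Assumes `papers` is a list of dicts with id/title/abstract/url keys.
    """
    # upsert (rather than add) avoids duplicate-id errors when the same
    # paper comes back from the API across multiple queries.
    collection.upsert(
        ids=[p["id"] for p in papers],
        documents=[p["abstract"] for p in papers],  # text that gets embedded
        metadatas=[{"title": p["title"], "url": p["url"]} for p in papers],
    )
```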

…And What Didn’t

  • Slow inference: The time to generate a summary was painfully long. This could be mitigated with more powerful or distributed hardware, but that’s beyond the scope of this portfolio project.
  • Inconsistent summarisation: Citation formatting varied. While some responses were crisp and relevant, others veered off and generated excess tokens — a clear sign that better results would require fine-tuning on a domain-specific dataset.
  • Naive document ranking: A simple distance threshold was used to filter local results before falling back to the ArXiv API. While functional, more advanced re-ranking techniques could improve precision (a sketch of the threshold logic follows below).
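As flagged in the last point, here is a minimal sketch of that threshold-then-fallback flow. The cutoff value and the fetch_from_arxiv helper are assumptions, not the project's actual implementation:

```python
DISTANCE_THRESHOLD = 0.4  # assumed cutoff; smaller distance = more similar

def retrieve(collection, query, k=5):
    results = collection.query(query_texts=[query], n_results=k)
    # Keep only local hits close enough to the query embedding.
    hits = [
        doc
        for doc, dist in zip(results["documents"][0], results["distances"][0])
        if dist <= DISTANCE_THRESHOLD
    ]
    # Fall back to the ArXiv API when the local store has nothing relevant.
    return hits if hits else fetch_from_arxiv(query)  # hypothetical helper
```

A learned re-ranker (for example, a cross-encoder scoring query-document pairs) could replace the raw cutoff, but that is beyond this project's scope.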

Conclusion

Despite its limitations, the system delivered a complete, working end-to-end RAG pipeline running in the cloud. Seeing it in action underscored both the promise and the computational demands of even small-scale LLMs. Document retrieval typically took ~11 seconds, while summarisation took ~60 seconds, both of which could be improved with stronger infrastructure. Still, as a self-contained, vendor-free solution built from the ground up, this was a very satisfying personal milestone.

I hope this was a productive and insightful read. If you have any feedback, ideas for improvement, or suggestions for pushing this further, I'd love to hear from you. Feel free to reach out or drop a comment!

