RAG Reality Check: My Unfiltered Journey from Zero to Implementation (and What I Learned)



This content originally appeared on DEV Community and was authored by AngelloFD

I’ve always wondered how SmallPDF processed the files and stored them so the AI knew what I was asking or talking about. How does DeepSeek know about the files I send it so it can understand what’s in them and reply accordingly? How do implementations like chatbots search through millions of documents and find a possible answer for the user? Well, I won’t have to die without knowing the answer to these questions, since I had the misfortune of developing and implementing such a system myself. It didn’t come without its challenges and problems, however, and since I’d never done this before, those problems felt 10x worse. That’s why I’m here to contribute to the already extensive list of RAG implementation experiences, bringing my own perspective and hoping it differentiates itself from the others in some way.

The Technical Landscape: Project Kick-off and Its Quirk

When I dove into this RAG implementation, the technical stack was immediately apparent: a Laravel 9 (PHP) backend running on a multi-tenancy architecture, ironically, for a single-tenant application. Its dependency installation alone was a red flag, routinely eating up 20 to 30 minutes. On the frontend, React handled the UI, though my focus remained squarely on the backend RAG integration.

The real “quirk”, however, lay in the source data. The vast collection of PDFs wasn’t just files on a disk; their metadata and relationships were mapped in a deeply nested, hierarchical database structure. Think rows upon rows of parent-child relationships, where finding a specific document’s lineage felt like a recursive nightmare. This convoluted digital “folder” system was easily the most cumbersome aspect to untangle before we could even begin processing documents for RAG.
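To give you an idea of the shape of the problem, here is a minimal sketch of resolving a document’s lineage by walking parent links. The layout and column names (id, parent_id, name) are hypothetical stand-ins for our actual schema, and the real traversal ran against the database rather than an in-memory dict.

```python
# Hypothetical in-memory view of the parent-child rows; in reality these
# lived in a relational table with id / parent_id style columns.
rows = {
    1: {"id": 1, "parent_id": None, "name": "Brand A"},
    2: {"id": 2, "parent_id": 1, "name": "Series X"},
    3: {"id": 3, "parent_id": 2, "name": "Model 42 Manual.pdf"},
}

def lineage(doc_id: int) -> list[str]:
    """Walk parent_id links upward until we hit a root node."""
    chain = []
    current = rows.get(doc_id)
    while current is not None:
        chain.append(current["name"])
        parent_id = current["parent_id"]
        current = rows.get(parent_id) if parent_id is not None else None
    return list(reversed(chain))  # root -> ... -> document

print(" / ".join(lineage(3)))  # Brand A / Series X / Model 42 Manual.pdf
```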

Nevertheless, we eventually processed and understood this madness. With a clearer picture, we laid out our initial technology choices – Supabase and OpenAI – and mapped the key process sequences for both document conversion to embeddings and the subsequent chatbot flow.

[Diagram: sequence for converting documents to embeddings and uploading them to Supabase]
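To make that first flow concrete, here’s a rough sketch of the conversion step, assuming a naive fixed-size chunker and the OpenAI embeddings API; the chunk size, overlap, and model name are illustrative assumptions, not necessarily what we shipped. The upload half of the flow is covered a bit further down.

```python
from openai import OpenAI  # official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Naive fixed-size chunking with overlap; real pipelines usually split smarter."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    # Model name is an assumption; the post doesn't say which OpenAI model we used.
    resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    return [item.embedding for item in resp.data]

pdf_text = "...text extracted from one PDF..."
vectors = embed_chunks(chunk_text(pdf_text))  # one vector per chunk, ready to upload
```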

[Diagram: the chatbot flow]
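And the chatbot side, again as a hedged sketch: embed the question, pull the closest chunks from the vector store, and let the chat model answer from that context. The retrieval call is a placeholder, and the model names are assumptions.

```python
from openai import OpenAI

client = OpenAI()

def semantic_search(query_vector: list[float], k: int = 5) -> list[str]:
    """Placeholder: in our case this hit the vector store and returned
    the k most similar chunks of PDF text."""
    raise NotImplementedError

def answer(question: str) -> str:
    # 1. Embed the user's question (model name is an assumption).
    q_vec = client.embeddings.create(
        model="text-embedding-3-small", input=[question]
    ).data[0].embedding

    # 2. Retrieve the most relevant chunks.
    context = "\n\n".join(semantic_search(q_vec))

    # 3. Ask the chat model to answer using only that context.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption; the post doesn't name the chat model
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```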

The Supabase Siren Song: Our First Foray into Embeddings

Did you know that Supabase can store vectors? Well, it can! And it provides neat documentation for implementing semantic search, so we saw an opportunity in this service.
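Storing a chunk boils down to a row insert over Supabase’s REST layer. Here’s a hedged sketch, shown in Python with requests for brevity (as you’ll see in a moment, on our PHP side these HTTP calls had to be hand-rolled); the documents table and its columns follow the pgvector example in the Supabase docs and are assumptions about our actual schema.

```python
import requests

SUPABASE_URL = "https://your-project.supabase.co"  # placeholder
SUPABASE_KEY = "service-role-or-anon-key"          # placeholder

HEADERS = {
    "apikey": SUPABASE_KEY,
    "Authorization": f"Bearer {SUPABASE_KEY}",
    "Content-Type": "application/json",
}

def insert_chunk(content: str, embedding: list[float], pdf_id: int) -> None:
    """Insert one chunk row into a 'documents' table exposed via Supabase's REST API."""
    resp = requests.post(
        f"{SUPABASE_URL}/rest/v1/documents",
        headers=HEADERS,
        # Depending on your PostgREST/pgvector setup, the embedding may need to be
        # serialized as a pgvector text literal like "[0.1,0.2,...]" instead.
        json={"content": content, "embedding": embedding, "pdf_id": pdf_id},
        timeout=30,
    )
    resp.raise_for_status()
```

Retrieval goes through the same REST surface, calling a match_documents-style Postgres function (the one from Supabase’s semantic search guide) via /rest/v1/rpc/.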

After discovering there was no official PHP SDK for Supabase – meaning we’d have to raw-dog HTTP requests directly into our documents table – we pressed on. We started populating it, chunk by chunk, with our document vectors, eager to test its retrieval capabilities. Our initial tests with a subset of our data were a success! The responses seemed perfectly legible and contextually relevant based on the information provided. We were ready to declare victory and-

“Yeah we want it to be able to ask every single PDF in the entire knowledge base.” – The Client™

…Right. So, I guess that meant differentiating between a general chatroom and a PDF-specific one, then adapting our semantic search to query the entire dataset. But as we attempted to scale, Supabase began throwing cryptic errors. The system choked. To be fair, with over 109,000 document chunks (vectors), perhaps we should have anticipated the inherent scalability limitations of our chosen approach within Supabase. The “siren song” had led us to a performance cliff.

The Scramble: A Desperate Pivot

Hitting a brick wall with Supabase’s scalability and blindsided by the new, expanded requirements, we were effectively set back to square one. Facing imminent deadlines and having already requested multiple extensions, a desperate, seemingly ingenious idea emerged: deploying open-source embedding models on a rented, powerful GPU server. The logic, in our panic, was simple: greater control, potentially lower long-term cost, and the raw compute power to process our ever-growing dataset.

(Narrator: It wasn’t ingenious. It was a mess.)

This meant not only the ongoing expense of a dedicated GPU instance but a complete shift in our data processing pipeline. Our first hurdle was engineering a robust Python script capable of initializing the models and rapidly batch-embedding our millions of PDF chunks. We wrestled with parallelism strategies and relentlessly pushed to utilize every ounce of GPU resource. The process itself, surprisingly, worked: embeddings were generated, and Qdrant eagerly ingested them, making the entire knowledge base queryable.
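For context, here is a stripped-down sketch of what that pipeline looked like, assuming sentence-transformers for the open-source embedding model and the qdrant-client package; the model choice, batch size, and collection name are assumptions, and the real script pushed parallelism much harder than this.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

# Model choice is an assumption; the post doesn't name the open-source model we used.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")
client = QdrantClient(url="http://localhost:6333")  # Qdrant running on the GPU box

COLLECTION = "pdf_chunks"
client.recreate_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(
        size=model.get_sentence_embedding_dimension(), distance=Distance.COSINE
    ),
)

def ingest(chunks: list[dict], batch_size: int = 256) -> None:
    """chunks: [{'id': int, 'text': str, 'pdf_id': int}, ...]"""
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        vectors = model.encode([c["text"] for c in batch], batch_size=batch_size)
        client.upsert(
            collection_name=COLLECTION,
            points=[
                PointStruct(
                    id=c["id"],
                    vector=vec.tolist(),
                    payload={"pdf_id": c["pdf_id"], "text": c["text"]},
                )
                for c, vec in zip(batch, vectors)
            ],
        )
```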

However, the real showstopper wasn’t performance, but relevance. Despite successfully retrieving information, the answers generated by our RAG system were consistently too general for the client’s needs. The quality of the open-source embeddings, or perhaps their interaction with the chosen open-source LLM, simply wasn’t capturing the nuanced, specific context required for technical manuals.

Even after striking a deal to narrow the scope – focusing user queries on a selected subset of PDFs rather than the entire knowledge base – the core problem persisted. We simply couldn’t get the system to yield the precise, accurate answers the client demanded. The high-cost, high-effort open-source pivot was failing on its most critical metric: output quality.

The Eleventh Hour: A Forced Retreat and a Breakthrough

The hammer finally dropped. The client issued a non-negotiable final deadline before… well, before something disastrous would happen. We sat there, staring at my Bruno client, watching the still-too-general responses trickle in, each passing minute of the rented GPU server devouring our hopes, dreams, and budget. The problem wasn’t merely in our RAG pipeline’s execution; it was a fundamental misunderstanding of the retrieval scope. We had been trying to make an omniscient RAG, capable of answering from millions of documents at once, when the real context was much narrower.

Then, a lightbulb moment. My boss, now firmly all hands on deck in the panic room, had the final, surprisingly simple breakthrough, triggered by a desperate re-evaluation of our initial data structure:

“Why are we even querying for the brand or series or entire library? If the user starts a conversation within a specific container—that folder-like hierarchy in the DB—we already know the exact PDFs relevant to that context! We can get the unique identifier of that container, pull those specific PDFs, and query only them!”

It felt like we’d come full circle, almost a regression. We had previously attempted to limit the search by “level” (a broader category), but this was different. This breakthrough meant reducing the search space not just by a general category, but by the precise set of documents already implicitly defined by the user’s current interaction context. This was not merely a filtering mechanism; it was a paradigm shift in our retrieval strategy. My broken, tired spirit, against all odds, latched onto this. After a quick deliberation, we settled on this radical simplification and began immediate implementation. Mercifully, integrating this targeted retrieval mechanism was infinitely less complicated than our previous, sprawling attempts to force relevance from an entire knowledge base.
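A hedged sketch of that targeted retrieval, assuming the Qdrant collection from the previous section, a pdf_id payload field on every chunk, and a placeholder for the container-to-PDFs lookup (the names are illustrative, not our actual schema):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchAny

client = QdrantClient(url="http://localhost:6333")

def pdf_ids_for_container(container_id: int) -> list[int]:
    """Placeholder for the DB lookup: the container the user is chatting in
    already tells us exactly which PDFs are in scope."""
    raise NotImplementedError

def search_container(container_id: int, query_vector: list[float], k: int = 5):
    scoped_ids = pdf_ids_for_container(container_id)
    return client.search(
        collection_name="pdf_chunks",
        query_vector=query_vector,
        # Only consider chunks whose pdf_id belongs to this container.
        query_filter=Filter(
            must=[FieldCondition(key="pdf_id", match=MatchAny(any=scoped_ids))]
        ),
        limit=k,
    )
```

The whole trick is that the filter comes for free from where the user started the conversation.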

We sent through the set of questions our client had given us for testing and checked the responses one by one, reading them cautiously. They were specific, professional in tone, and directly answered the questions. Finally… it was all done.

The client was happy, and our minds were finally at peace. Thanks to this entire ordeal, we learned invaluable lessons for implementing similar systems in the future.

Takeaways for you and me

  • Context is King in RAG: The most profound insight was realizing that effective RAG isn’t about querying everything. It’s about intelligently narrowing the retrieval scope to the precise context of the user’s intent. Leveraging existing data structures for contextual filtering can be a game-changer for relevance and performance.
  • Pragmatism Over Purity (Open Source vs. Managed): While open-source solutions offer control, they demand significant engineering effort for setup, optimization, and achieving quality. Under pressure, a reliable, managed service (like OpenAI’s embeddings) can provide the necessary stability and performance, even if it’s a “retreat” from a more ambitious technical goal. Understand the true cost of “free.”
  • Understand Your Data’s True Nature: The “quirky” PDF hierarchy in our database was initially a burden but ultimately provided the key to our final solution. A deep understanding of your data’s inherent structure can unlock unexpected optimizations.
  • Embrace the Unfiltered Journey: Every challenge, every misstep, and every desperate pivot was a learning opportunity. This “painful” implementation was, in retrospect, an accelerated masterclass in building resilient RAG systems.

Cover credits:
Refactoring, Meeting buffers, naming files, and RAG systems by Luca Rossi

