Sushify – A New Free & Essential Tool For AI App Developers



This content originally appeared on DEV Community and was authored by Isaac Hagoel

I recently (well, today 😊) released Sushify, an open-source dev tool that helps test apps with complex LLM integrations by surfacing prompt* issues early. I write prompt* because, as you already know if you’ve worked on such apps, the prompt itself is just a small part of the broader context management required to make production-grade AI apps work well. Sushify uncovers issues in everything that gets passed into the LLM (including tools, output schemas, history, etc.).

In other words, it helps you turn your prompt salad into precision-cut sushi 👌🏻.

It’s still early days, and I’m looking for feedback and contributions from early adopters.

This post is a walkthrough of how I got from initial frustration to publishing a tool others can now use.

The Problem

Production apps that use LLMs often express a lot of their logic in free text, because that’s what LLMs understand.

Prompts are usually composed of static snippets and/or templates that get stitched together at runtime (sometimes using loops or conditional logic) and shared across different agents or workflows. On top of that, we pass in tools: each with a top-level description, parameters with descriptions, and usually prompt fragments that refer to the tool’s input/output format. The same goes for output schemas (a.k.a. structured outputs).

If one thing changes, say a tool’s output format, but there’s still a lingering reference to the old version anywhere in the prompt, the LLM can get confused and start misbehaving.
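To make that concrete, here’s a hypothetical sketch (the search_docs tool, the prompt fragments, and build_system_prompt are all made up for illustration): the tool’s schema has moved to structured JSON, but a stale fragment still describes the old plain-text output.

```python
# Hypothetical example: the tool now returns structured JSON, but a stale
# prompt fragment still tells the model to expect a plain-text summary.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search internal docs. Returns a JSON object with "
                       "'title', 'url' and 'snippet' fields.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"],
        },
    },
}

BASE_PROMPT = "You are a support assistant."
# Lingering reference to the tool's *old* output format:
TOOL_HINT = "search_docs returns a plain-text summary; quote it verbatim."

def build_system_prompt(user_tier: str) -> str:
    # Fragments stitched together at runtime, with conditional logic.
    parts = [BASE_PROMPT, TOOL_HINT]
    parts.append("Keep answers brief." if user_tier == "free" else "Be thorough.")
    return "\n".join(parts)
```

Nothing here fails loudly; the contradiction only surfaces as the model occasionally behaving as if the tool still returned plain text.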

To make things worse, instructions to the LLM can easily end up too vague, too restrictive, or even contradictory (sometimes contradicting your past self).

As a result, the LLM starts ignoring instructions or behaving unpredictably. The typical response is to make things worse by piling on even more free-text instructions as a patch.

I ran into this struggle repeatedly, both in production-grade AI apps and even in small side projects. One day I stepped back, realized how absurd the whole situation was, and decided to do something about it.

Initial Instinct – Static Analysis

My first thought was: for code we have linters and compilers, so why not do the same for the inputs that go into an LLM?

I wanted to support at least Python and TypeScript and provide source maps that would pinpoint every issue back to its exact origin.

I spent a few weeks trying different approaches. The idea was to locate the LLM call in the code and build a DAG (Directed Acyclic Graph) tracing all dependencies through the codebase, ultimately reconstructing the full prompt* for analysis.

I tested this with real side projects I had built beforehand, but it wasn’t reliable enough. Some data injected into prompts is only known at runtime and would require mocking (e.g., RAG-retrieved docs, tool outputs, API call results). Prompt composition could also be deceptively tricky – even simple ternary expressions were hard to untangle.
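For the curious, here’s a toy sketch of the kind of analysis this involved (nothing like the real implementation, and the `.create` heuristic for spotting LLM calls is invented for the example): find a provider call with Python’s ast module and collect the names feeding its messages argument, which would then have to be traced back through assignments and ternaries like the one below.

```python
import ast

# Toy illustration of the static-analysis idea: find an LLM call and collect
# the names feeding its `messages` argument as a starting point for a
# dependency graph. The `.create` heuristic is purely illustrative.
SOURCE = '''
SYSTEM = "You are a helpful assistant."
extra = VERBOSE_HINT if verbose else TERSE_HINT
messages = [{"role": "system", "content": SYSTEM + extra}]
client.chat.completions.create(model="gpt-4o", messages=messages)
'''

class LLMCallFinder(ast.NodeVisitor):
    def __init__(self):
        self.prompt_deps = set()

    def visit_Call(self, node):
        if isinstance(node.func, ast.Attribute) and node.func.attr == "create":
            for kw in node.keywords:
                if kw.arg == "messages":
                    # Collect bare names at the call site; a real tool would
                    # have to recurse through their assignments (SYSTEM,
                    # extra, ...) and somehow resolve that ternary.
                    for sub in ast.walk(kw.value):
                        if isinstance(sub, ast.Name):
                            self.prompt_deps.add(sub.id)
        self.generic_visit(node)

finder = LLMCallFinder()
finder.visit(ast.parse(SOURCE))
print(finder.prompt_deps)  # {'messages'}
```

Even in this tiny example, getting from messages back to SYSTEM, extra, and the branch that picked between the two hints is where the real work (and the unreliability) lived.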

No matter how sophisticated I made it, it never felt “good enough.” On top of that, it was expensive. I burned through over $100 just trying to generate dependency graphs, not even analyzing prompts yet. Eventually, I had to step back and rethink the approach.

Second Attempt – Runtime Tracking

If static analysis wasn’t cutting it, the alternative was to track LLM calls at runtime.

This had some clear advantages: no guesswork or mocking – the tool would see exactly what the LLM sees. Sure, to analyze every possible permutation of the prompt*, we’d need to actually execute those code paths, but is it really that hard for a dev to make the relevant calls?

Another big plus: we’d see the LLM responses too. That meant cross-referencing potential input issues with actual model behavior, capturing follow-up messages, history, tool responses, and even context-compaction bugs.

It made a lot of sense, but there was still plenty to figure out.

Iterations

I won’t bore you with every failed experiment, but here’s the gist.

First, I built a POC using an SDK: the monitored app had to call this SDK and pass in the same payload it sent to the LLM (or wrap every LLM provider SDK with it).

This quickly felt wrong: too much friction, too error-prone (e.g., Zod schemas not being transformed into JSON schemas), and too restrictive. I wanted plug-and-play simplicity: something I could drop into any app with minimal effort.
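For a sense of the friction, the POC looked roughly like this from the app’s side (AnalyzerSDK and its methods are invented names for this sketch, not the real API):

```python
# Illustrative only: what the SDK-based POC asked of the monitored app.
# Every call site had to hand the analyzer the exact payload it sent to the
# provider, keeping the two in sync by hand, with schemas already converted
# to plain JSON Schema.
class AnalyzerSDK:  # stand-in for the hypothetical SDK
    def capture_request(self, payload: dict) -> None: ...
    def capture_response(self, response: object) -> None: ...

analyzer = AnalyzerSDK()

payload = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hi"}],
    # "tools": [...]  # Zod/Pydantic schemas had to be pre-converted by the app
}
analyzer.capture_request(payload)   # extra call the app must remember...
# response = client.chat.completions.create(**payload)
# analyzer.capture_response(response)  # ...and another one after the real call
```

Forget one of those calls, or let the two copies of the payload drift apart, and the analysis silently sees something different from what the model saw.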

That’s when I landed on using a proxy. Instead of requiring the app to wire anything manually, the proxy would wrap the app and intercept calls to the LLM – capturing the exact same request and response, reliably and transparently.
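As a rough illustration of why that removes the friction (a generic pattern, not Sushify’s actual mechanism or configuration): many provider SDKs already honour a base-URL override, so traffic can be rerouted through an intercepting proxy without touching application code.

```python
# Generic sketch of the proxy pattern, not Sushify's actual setup.
# If OPENAI_BASE_URL points at a local intercepting proxy, every request and
# response passes through it unchanged and can be recorded for analysis.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "sk-placeholder"),
    base_url=os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
)
# The application code below this line is identical with or without the proxy.
```

The application itself stays untouched, which was the whole point of the plug-and-play requirement.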

And of course, it had to support Docker, since nearly every production app I work on is containerized.

Sushify

After all that exploration, I ended up with Sushify.

It’s still barebones, but already super useful. It helped me uncover issues in projects I thought were fine. It makes debugging prompt-related problems ridiculously easy.

Even though there’s plenty of room for growth, I’m confident many developers can get serious value out of it today.

Check out the GitHub repo – it has everything you need to get started in a few minutes, plus a quick demo and screenshots. Oh, and feel free to leave a ⭐ while you’re there 😉.

Would love to hear your thoughts or questions in the comments!

