This content originally appeared on HackerNoon and was authored by Rohit Jacob
“The first time I built an agentic workflow, it was like watching magic, that is, until it took 38 seconds to answer a simple customer query and cost me $1.12 per request.”
\ When you start building agentic workflows where autonomous agents plan and act on multi-step processes, it’s easy to get carried away. The flexibility is incredible! But so is the overhead that comes with it. Some of these pain points include slow execution, high compute usage, and a mess of moving parts.
\ Agentic workflows sit in the middle ground between rigid workflows and fully autonomous agents, and that middle ground is where the performance problems and the best optimization opportunities usually show up.
\ Over the last year, I’ve learned how to make these systems dramatically faster and more cost-efficient without sacrificing their flexibility, and decided to create this playbook.
:::tip Before I talk about optimization, I want to make sure we agree on what I mean by the following terms:
- Workflows: Predetermined sequences of steps that may or may not use an LLM at all.
- Agents: Self-directing systems that decide which steps to take and in what order to execute them.
- Agentic Workflows: A hybrid where you set the general path but give the agents in your workflow freedom to decide how to move within certain steps.
:::
\
Trim the Step Count
Something everyone needs to keep in mind while designing agentic workflows is that every model call adds latency. Every extra hop is another chance for a timeout. And every extra call also increases the chance of hallucination, leading to decisions that stray from the main objective.
\ The guidelines here are simple:
- Merge related steps into a single prompt
- Avoid unnecessary micro-decisions that a single model could handle in one go
- Design to minimize round-trips
\ There’s always a fine balance to strike in this phase of design, and the process should always start with the smallest number of steps. When I design a workflow, I always start with a single agent (because maybe we don’t need a workflow at all) and then evaluate it against the metrics and checks I have in place.
\ Based on where it fails, I decompose the parts whose evaluation scores don’t meet the minimum criteria and iterate from there. Soon I reach the point of diminishing returns, much like the elbow method in clustering, and choose my step count accordingly.
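\ To make the merging idea concrete, here’s a minimal sketch of collapsing two micro-decisions (intent classification and order-ID extraction) into a single call. The `call_llm` helper and the prompt schema are hypothetical placeholders for whatever client and output format you actually use:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your actual LLM client call."""
    raise NotImplementedError

# One combined prompt replaces separate "classify intent" and
# "extract order id" round trips.
COMBINED_PROMPT = """You are a support triage assistant.
Given the customer message below, return JSON with exactly two keys:
  "intent": one of "order_status", "refund", "other"
  "order_id": the order id if present, otherwise null

Customer message:
{message}
"""

def triage(message: str) -> dict:
    raw = call_llm(COMBINED_PROMPT.format(message=message))
    return json.loads(raw)  # one hop instead of two
```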
Parallelize Anything That Doesn’t Have Dependencies
Building on the point above, sequential chains are latency traps, too. If two tasks don’t need each other’s output, run them together!
\ As an example, I wanted to mimic a customer support agentic workflow where I can help a customer get their order status, analyze the sentiment of the request, and generate a response. I started off with a sequential approach, but then realized that getting the order status and analyzing the sentiment of the request do not depend on each other. Sure, they might be correlated, but that doesn’t mean anything for the action I’m trying to take.
\ Once I had these two responses, I would then feed the order status and sentiment detected to the response generator, and that easily shaved the total time taken from 12 seconds to 5.
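\ Here’s a minimal sketch of that pattern using `asyncio.gather`. The three functions are hypothetical stand-ins for the real order-status tool, sentiment call, and response generator, with sleeps simulating their latency:

```python
import asyncio

async def fetch_order_status(order_id: str) -> str:
    # Stand-in for an async call to your order service.
    await asyncio.sleep(1.0)
    return "shipped"

async def analyze_sentiment(message: str) -> str:
    # Stand-in for a small-model sentiment call.
    await asyncio.sleep(1.0)
    return "frustrated"

async def generate_response(status: str, sentiment: str, message: str) -> str:
    # The final LLM call, which genuinely depends on both earlier results.
    await asyncio.sleep(1.0)
    return f"Your order is {status}. Sorry for the trouble!"

async def handle_request(order_id: str, message: str) -> str:
    # The two independent steps run together instead of back to back.
    status, sentiment = await asyncio.gather(
        fetch_order_status(order_id),
        analyze_sentiment(message),
    )
    return await generate_response(status, sentiment, message)

print(asyncio.run(handle_request("ORD-123456", "Where is my order?")))
```

\ Run sequentially, the three one-second stubs would take about three seconds; with the two independent calls gathered, the whole request finishes in roughly two.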
Cut Unnecessary Model Calls
We’ve all seen the posts online that talk about how ChatGPT can get a little iffy when it comes to math. Well, that’s a really good reminder that these models were not built for that. Yes, they might get it 99% of the time, but why leave that to fate?
\ Also, if we know the kind of calculation that needs to take place, why not just code it into a function that can be used, instead of having an LLM figure that out on its own? If a rule, regex, or small function can do it, skip the LLM call. This shift will eliminate needless latency, reduce token costs, and increase reliability all in one go.
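\ As a small illustration (the regex pattern and refund rule below are made up, not from any real system), here’s deterministic code doing the kind of work you might otherwise be tempted to hand to a model:

```python
import re

# A fixed order-id format is a regex job, not an LLM job.
ORDER_ID_PATTERN = re.compile(r"\bORD-\d{6}\b")

def extract_order_id(message: str) -> str | None:
    match = ORDER_ID_PATTERN.search(message)
    return match.group(0) if match else None

def refund_amount(item_price: float, restocking_fee: float = 0.10) -> float:
    # Deterministic math: no reason to ask an LLM to compute this.
    return round(item_price * (1 - restocking_fee), 2)

print(extract_order_id("Where is my order ORD-123456?"))  # ORD-123456
print(refund_amount(49.99))                               # 44.99
```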
Match The Model To The Task
“Not every task is built the same” is a fundamental principle of task management: tasks vary in their nature, demands, and importance. In the same way, we need to assign the right tasks to the right models. Models now come in different flavors and sizes, and we don’t need a Llama 405B model for a simple classification or entity extraction task; an 8B model should be more than enough.
\ It is common these days to see people design their agentic workflows around the biggest, baddest model that just came out, but that comes at the cost of latency. The bigger the model, the more compute it needs, and hence the higher the latency. Sure, you could host it on a larger instance and get away with it, but that comes at a cost, literally.
\ Instead, the way I go about designing a workflow again would be to start with the smallest. My go-to model is the Llama 3.1 8B, which has proven to be a faithful warrior for decomposed tasks. I start by having all my agents use the 8B model and then decide whether I need to find a bigger model, or if it’s simple enough, maybe even go down to a smaller model.
\ Sizes aside, there’s a lot of tribal knowledge about which flavors of LLMs do better at which tasks, and that’s another consideration to weigh depending on what you’re trying to accomplish.
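\ One lightweight way to encode this right-sizing is a simple task-to-model routing table. The task names and model identifiers below are illustrative, so swap in whatever your stack actually serves:

```python
# Default to the smallest model and only escalate when evals demand it.
MODEL_BY_TASK = {
    "classification":    "llama-3.1-8b",   # a small model is plenty
    "entity_extraction": "llama-3.1-8b",
    "summarization":     "llama-3.1-70b",  # escalated after evals flagged the 8B
    "final_response":    "llama-3.1-70b",
}

def pick_model(task: str) -> str:
    return MODEL_BY_TASK.get(task, "llama-3.1-8b")
```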
Rethinking Your Prompt
It’s common knowledge by now: as we go through our evaluations, we tend to add more guardrails to the LLM’s prompt. This inflates the prompt and, in turn, the latency. There are various methods for building effective prompts that I won’t get into in this article, but the ones I ended up using to reduce my round-trip response time were prompt caching for static instructions and schemas, adding dynamic context at the end of the prompt for better cache reuse, and setting clear response length limits so the model doesn’t eat up time giving me unnecessary information.
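\ Here’s a rough sketch of what that layout can look like: the static instructions and schema form an identical, cacheable prefix, the dynamic context goes last, and a token cap bounds the response. The `client.complete` call in the comment is a hypothetical stand-in for your SDK’s chat method:

```python
# Static instructions and schema stay byte-identical across requests,
# which is what lets a provider's prompt cache reuse them.
STATIC_SYSTEM_PROMPT = """You are a support assistant.
Always answer as JSON with the keys:
  "answer": string, "confidence": "low" | "medium" | "high"
Keep "answer" under 80 words."""

def build_messages(customer_message: str, order_status: str) -> list[dict]:
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # cacheable prefix
        {
            "role": "user",
            "content": (
                f"Order status: {order_status}\n"       # dynamic context goes last
                f"Customer message: {customer_message}"
            ),
        },
    ]

# response = client.complete(
#     model="llama-3.1-8b",
#     messages=build_messages(msg, status),
#     max_tokens=200,  # clear response length limit
# )
```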
Cache Everything
In a previous section, I talked about prompt caching, but that shouldn’t be where your caching efforts stop. Caching isn’t just for final answers; apply it wherever it fits. While optimizing certain expensive tool calls, I cached both intermediate and final results.
\ You can even implement KV caches for partial attention states and, of course, cache any session-specific data like customer details or sensor states. By implementing these caching strategies, I was able to slash repeated-work latency by 40-70%.
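\ As a minimal illustration, an in-process cache around an expensive tool call can be as simple as the sketch below. `lookup_order` is a hypothetical stand-in for a slow downstream service; in production you’d likely reach for Redis or similar with a TTL rather than `lru_cache`:

```python
from functools import lru_cache
import time

@lru_cache(maxsize=4096)
def lookup_order(order_id: str) -> str:
    # Stand-in for a slow downstream call.
    time.sleep(2.0)
    return "shipped"

lookup_order("ORD-123456")  # pays the full two seconds
lookup_order("ORD-123456")  # near-instant, served from the in-process cache
```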
Speculative Decoding
Here’s one for the advanced crowd: use a small “draft” model to guess the next few tokens quickly and then have a larger model validate or correct them in parallel. A lot of the bigger infrastructure companies that promise faster inference do this behind the scenes, so you might as well use it to push your latency down further.
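\ If you want a feel for the mechanics, here’s a heavily simplified sketch of a single greedy speculative step. The draft and target callables are hypothetical, and real implementations also handle acceptance sampling over token probabilities rather than the plain equality check shown here:

```python
from typing import Callable, List

def speculative_step(
    prompt_tokens: List[int],
    draft_generate: Callable[[List[int], int], List[int]],       # small model proposes k tokens
    target_verify: Callable[[List[int], List[int]], List[int]],  # big model's token at each draft position, in one pass
    k: int = 4,
) -> List[int]:
    draft = draft_generate(prompt_tokens, k)      # cheap autoregressive guesses
    checks = target_verify(prompt_tokens, draft)  # single large-model forward pass

    accepted = []
    for guess, check in zip(draft, checks):
        if guess != check:
            accepted.append(check)  # keep the big model's correction and stop
            break
        accepted.append(guess)      # agreement: the cheap guess stands
    return prompt_tokens + accepted
```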
Save Fine-Tuning For Last – and Do It Strategically
Fine-tuning is something a lot of people talked about in the early days, but now some of the newer adopters of LLMs don’t even seem to know why or when to use it. When you look it up, you’ll see that it’s a way to have your LLM understand your domain and/or your task in more detail, but how does this help latency?
\ Well, not many people talk about this, and there’s a reason I saved this optimization for last. When you fine-tune an LLM for a task, the prompt required at inference is considerably smaller than it would be otherwise, because much of what you would normally put in the prompt gets baked into the weights through the fine-tuning process.
\ This, in turn, feeds into the earlier point about reducing prompt length and delivers the same latency gains.
Monitor Relentlessly
This is the most important step I took when trying to reduce latency. This sets the groundwork for any of the optimizations listed above and gives you clarity on what works and what doesn’t. Here are some of the metrics I used:
- Time to First Token (TTFT)
- Tokens Per Second (TPS)
- Routing Accuracy
- Cache Hit Rate
- Multi-agent Coordination Time
\ These metrics tell you where and when to optimize; without them, you’re flying blind.
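\ TTFT and TPS in particular are easy to capture by wrapping whatever streaming iterator your provider returns; here’s a minimal, provider-agnostic sketch:

```python
import time
from typing import Iterable

def measure_stream(token_stream: Iterable[str]) -> dict:
    # Wrap any streaming LLM response (an iterable of tokens/chunks)
    # and record Time to First Token and Tokens Per Second.
    start = time.perf_counter()
    first_token_at = None
    token_count = 0

    for _ in token_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        token_count += 1

    end = time.perf_counter()
    ttft = (first_token_at - start) if first_token_at else None
    tps = (token_count - 1) / (end - first_token_at) if token_count > 1 else None
    return {"ttft_s": ttft, "tps": tps, "total_s": end - start, "tokens": token_count}
```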
Bottom Line
The fastest, most reliable agentic workflows don’t just happen. They are the result of ruthless step-cutting, smart parallelization, deterministic code, model right-sizing, and caching everywhere it makes sense. Do this, evaluate your results, and 3-5x speed improvements (and probably major cost savings) are absolutely within reach.