This content originally appeared on DEV Community and was authored by Suzuki Yuto
At Kaizen Agent, we're building something meta: an AI agent that automatically tests and improves other AI agents.
Today I want to share the architecture behind Kaizen Agent and open it up for feedback from the community. If you're building LLM apps, agents, or dev tools, your input would mean a lot.
Why We Built Kaizen Agent
One of the biggest challenges in developing AI agents and LLM applications is non-determinism.
Even when an agent "works," it might:
- Fail silently with different inputs
- Succeed one run but fail the next
- Produce inconsistent behavior depending on state, memory, or context
This makes testing, debugging, and improving agents very time-consuming, especially when you need to test changes again and again.
So we built Kaizen Agent to automate this loop: generate tests, run them, analyze the results, fix problems, and repeat until your agent improves.
Architecture Diagram
Here's the system diagram that ties it all together, showing how config, agent logic, and the improvement loop interact:
Core Workflow: The Kaizen Agent Loop
Here are the five core steps our system runs automatically:
[1] Auto-Generate Test Data
Kaizen Agent creates a broad range of test cases based on your config, including edge cases, failure triggers, and boundary conditions.
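To make this concrete, here is a minimal sketch of what config-driven test generation could look like. The config keys, the `generate_test_cases` function, and the `call_llm` helper are illustrative placeholders for this post, not Kaizen Agent's actual schema or API:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in your provider's client here."""
    raise NotImplementedError

# Illustrative config shape only; not Kaizen Agent's actual schema.
config = {
    "agent": {"entry_point": "my_agent.run"},
    "success_criteria": [
        "response is valid JSON",
        "no hallucinated fields",
    ],
}

def generate_test_cases(config: dict, n: int = 20) -> list:
    """Ask an LLM for diverse inputs: happy paths, edge cases, failure triggers."""
    prompt = (
        f"Generate {n} test inputs for an agent with these success criteria:\n"
        + "\n".join(config["success_criteria"])
        + "\nInclude edge cases and boundary conditions. Return a JSON list of objects."
    )
    return json.loads(call_llm(prompt))
```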
[2] Run All Test Cases
It executes every test on your current agent implementation and collects detailed outcomes.
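Conceptually, this step is a plain test runner: call the agent on each generated input and record the raw result, including crashes. A rough sketch, where `run_agent` is a stand-in for your agent's entry point:

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class TestOutcome:
    case: dict
    output: Any = None
    error: Optional[str] = None

def run_all(cases: list, run_agent: Callable[[dict], Any]) -> list:
    """Execute every test case against the current agent, capturing outputs and crashes."""
    outcomes = []
    for case in cases:
        try:
            outcomes.append(TestOutcome(case=case, output=run_agent(case)))
        except Exception as exc:  # a crash is still a useful data point
            outcomes.append(TestOutcome(case=case, error=repr(exc)))
    return outcomes
```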
[3] Analyze Test Results
We use an LLM-based evaluator to interpret outputs against your YAML-defined success criteria (a rough sketch follows the list below).
- It identifies why specific tests failed.
- The failed test analysis is stored in long-term memory, helping the system learn from past failures and avoid repeating the same mistakes.
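Here is a hedged sketch of how an LLM judge plus a failure memory might fit together. The prompt format, the `call_llm` helper, and the `failure_memory` list are assumptions for illustration, not the actual implementation:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in your provider's client."""
    raise NotImplementedError

failure_memory: list = []  # stand-in for a persistent long-term memory store

def evaluate(case: dict, output: str, criteria: list) -> dict:
    """Ask an LLM judge whether one output meets the YAML-defined success criteria."""
    prompt = (
        "Success criteria:\n" + "\n".join(criteria)
        + f"\n\nAgent output:\n{output}\n\n"
        + 'Reply with JSON only: {"passed": true/false, "reason": "..."}'
    )
    verdict = json.loads(call_llm(prompt))
    if not verdict["passed"]:
        # Remember why this case failed so later iterations can avoid the same mistake.
        failure_memory.append({"case": case, "reason": verdict["reason"]})
    return verdict
```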
[4] Fix Code and Prompts
Kaizen Agent suggests and applies improvements not just to your prompts but also to your code (see the sketch after this list):
- It may add guardrails or new LLM calls.
- It aims to eventually test different agent architectures and automatically compare them to select the best-performing one.
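As a rough illustration of how a repair step could be driven by an LLM, given the failing cases and the current source: the `propose_fix` function and its prompt are assumed for this sketch and are not Kaizen Agent's real repair logic.

```python
from pathlib import Path

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in your provider's client."""
    raise NotImplementedError

def propose_fix(source_file: Path, failures: list) -> str:
    """Ask an LLM for a revised version of the agent's code and prompts."""
    prompt = (
        f"These test cases failed:\n{failures}\n\n"
        f"Current source of {source_file.name}:\n{source_file.read_text()}\n\n"
        "Return the full corrected file. You may add guardrails or extra LLM calls."
    )
    return call_llm(prompt)

# A real system would apply the candidate on a branch, rerun the whole suite,
# and keep the change only if metrics improve with no regressions.
```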
[5] Make a Pull Request
Once improvements are confirmed (no regressions, better metrics), the system generates a PR with all proposed changes.
This loop continues until your agent is reliably performing as intended.
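Putting the five steps together, the control flow is roughly: generate tests, measure a baseline, apply candidate fixes, re-measure, and open a PR only if the result beats the baseline with no regressions. The sketch below is illustrative pseudocode of that loop; the helper callables mirror the earlier sketches, and the final step assumes the GitHub CLI (`gh`) is available.

```python
import subprocess

def kaizen_loop(generate, run_and_score, apply_fix, max_iters: int = 5) -> None:
    """Illustrative control flow: generate tests, score, fix, repeat, then open a PR."""
    cases = generate()
    baseline = run_and_score(cases)       # pass rate before any changes
    best = baseline
    for _ in range(max_iters):
        if best == 1.0:
            break
        apply_fix(cases)                  # propose and apply one candidate change
        score = run_and_score(cases)      # rerun the full suite
        best = max(best, score)           # a real system would also revert regressions
    if best > baseline:
        # Open a PR with the accumulated changes (assumes the GitHub CLI `gh` is installed).
        subprocess.run(
            ["gh", "pr", "create",
             "--title", "Kaizen Agent: automated improvements",
             "--body", f"Pass rate improved from {baseline:.0%} to {best:.0%}."],
            check=False,
        )
```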
What We'd Love Feedback On
We're still early and experimenting. Your input would help shape this.
We’d love to hear:
- What kind of AI agents would you want to test with Kaizen Agent?
- What extra features would make this more useful for you?
- Are there specific debugging pain points we could solve better?
If you've got thoughts, ideas, or feature requests, drop a comment, open an issue, or DM me.
Big Picture
We believe that as AI agents become more complex, testing and iteration tools will become essential.
Kaizen Agent is our attempt to automate the test → analyze → improve loop.
Links
- GitHub: https://github.com/Kaizen-agent/kaizen-agent
- Twitter/X: https://x.com/yuto_ai_agent