This content originally appeared on DEV Community and was authored by Teemu Piirainen
Who’s this for: Builders and skeptics who want honest numbers. Did an AI coding agent really save time, money, and sanity, or just make a mess faster?
TL;DR
~60 h build time (↓~66 % from 180 h)
$283 token spend
374 commits, 174 files, 16,215 lines of code
1 new teammate – writes code 10× faster but only listens if you give it rules
Series progress:
Control ▇▇▇▇▇ Build ▇▇▇▇▇ Release ▇▇▇▇▇ Retrospect ▇▇▢▢▢
This is Part 4, the final piece: my honest verdict.
Now the question: Was it worth it?
- Did the numbers add up?
- Where did the agent pay off?
- Where did it backfire?
- How would I push it further next time?
Series Roadmap – How This Blueprint Works
One last time, here’s the big picture:
- Control – Control Stack & Rules → trust your AI agent won’t drift off course (Control – Part 1)
- Build – AI agent starts coding → boundaries begin to show (Build – Part 2)
- Release – CI/CD, secrets, real device tests → safe production deploy (Release – Part 3)
- Retrospect – The honest verdict → what paid off, what blew up, what’s next
Why care?
The AI agent is a code machine that never sleeps, knows every library, and wants to push commits 24/7. But without your control, it has no clue what the end product should be.
Let’s break it down – this is Part 4.
1. Did the Numbers Add Up?
Back in Part 1, I posed the core challenge:
Can we build a fully autonomous AI agent (one that an organisation can own and audit end-to-end) and make it deliver real, production-grade code, with just a fraction of human input?
That meant no black-box SaaS tools. No prompt-hacking toys. Just a scoped AI teammate working inside a real, observable control loop: Planner → Executor → Validator, backed by rules I could evolve and CI/CD pipelines I already trust.
Short answer: YES.
Here’s the breakdown:
- Effort – ~60 h of my time with the AI agent delivered the same output I’d expect from 180 h solo
- Money – $283 in Gemini 2.5 Pro tokens
- Speed – Flutter work flew by 5–7× faster, native Swift/Kotlin dragged to <2×, landing a real-world ~3× boost
- Delivery – 374 commits, 174 files, 16,215 lines of code
- Trust – Every task passed through my control loop, tested and clean. Full control. Full traceability.
So was it cheap? Absolutely – but only because I stayed in the loop.
The $283 didn’t magically buy 180 hours of code. It bought an extra pair of hands that turned my 60 hours into a full 180-hour deliverable.
Bottom line: The 3× boost didn’t come from magic, it came from structure.
The agent didn’t invent new skills; it scaled the ones I already had.
Sometimes it even surprised me, but only because the groundwork made it possible.
The stack I share here worked and was battle-tested in June 2025. Treat this write-up as a snapshot, not a rulebook.
2. Lessons Learned – What I’d Keep Next Time
Some parts of the setup worked better than expected, and these are the ones I’d repeat from day one. Most of them are invisible from the outside, but they made the difference between chaos and clarity.
2.1 Trust Isn’t Given – It’s Built
One of the biggest takeaways from integrating an AI agent into my workflow is that trust doesn’t happen by default. You don’t get it just because the agent can write “good” code. You earn it by proving, over and over, that the agent can operate reliably inside the same guardrails as the human team.
When trust is missing, adoption stalls. Every bug becomes a reason to sideline the agent rather than improve it. Pull requests sit unreviewed because no one wants to take responsibility. Eventually, the “AI teammate” becomes just another unused tool.
The turning point was treating the agent like a real developer:
- Make its actions visible so everyone can see what it’s doing and why
- Start small and collect wins before scaling up
- Learn from mistakes and feed those lessons back into its instructions and rules
- Require approval for every plan before coding starts
- Apply the same rules as for human devs (no shortcuts because it’s an agent)
Built through visibility, shared rules, guardrails, and real accountability, that trust made me comfortable approving the agent’s work. Without it, none of the technical improvements would have mattered.
2.2 Control Stack First, Prompts Second
Giving the agent a state-machine-style loop (Planner → Executor → Validator), similar to what Anthropic’s best-practice write-up recommends, paid off immediately. It forced the AI agent to think before splatting out code and gave me natural checkpoints to cancel nonsense.
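To make those checkpoints concrete, here’s a minimal sketch of how such a loop can be wired, assuming a single task and a human approval prompt before execution. The `plan`, `execute`, and `validate` functions are placeholders I invented for illustration, not the actual agent implementation.

```python
# Minimal sketch of a Planner -> Executor -> Validator loop with a human
# approval checkpoint. All functions here are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class Task:
    title: str
    acceptance_notes: list[str]


def plan(task: Task) -> str:
    # Planner: ask the model for an implementation plan (placeholder).
    return f"Plan for: {task.title}"


def execute(plan_text: str) -> str:
    # Executor: generate a patch from the approved plan (placeholder).
    return f"diff implementing: {plan_text}"


def validate(patch: str, task: Task) -> bool:
    # Validator: run tests, linters, and rule checks (placeholder).
    return bool(patch) and all(task.acceptance_notes)


def run(task: Task) -> None:
    plan_text = plan(task)
    if input(f"Approve plan?\n{plan_text}\n[y/N] ").lower() != "y":
        print("Plan rejected - nothing executed.")  # natural checkpoint
        return
    patch = execute(plan_text)
    if validate(patch, task):
        print("Validated - ready for commit/PR.")
    else:
        print("Validation failed - back to the Planner.")


if __name__ == "__main__":
    run(Task("Add login screen", ["UI matches design", "unit tests pass"]))
```

The point isn’t the code itself; it’s that every transition (plan approved, patch validated) is an explicit, inspectable step rather than a free-running chat.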
2.3 Rules as Live Documentation
`/rules/airules.md` began as nothing, ballooned into a 400-line beast, and finally slimmed down to a tight 70-liner that covers only what matters. By week’s end the agent spoke my dialect (thinking process, code architecture, commit style) with minimal reminders.
JetBrains’ Junie guideline files show the same pattern: write rules once, enforce forever. But “forever” takes discipline.
2.4 Ruthless Task Scoping
- Start with a living PRD: Draft a concise Product Requirements Document that maps the entire service: user flows, non-functional needs, “nice-to-have” ideas, everything.
- Slice every feature into bite-size tasks: Break big rocks into shovel-ready tickets. Subdivide the work yourself if you have to; just be sure each task fits comfortably within a single AI agent task context.
- Let the agent explode tasks into execution units: When implementation starts, the agent generates its own subtasks, acceptance notes, and edge cases, and keeps that checklist current as it commits code (a rough sketch of this structure follows the list).
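For illustration, here’s one hypothetical way to model a scoped task and the execution units the agent expands it into. The field names and the size check are my own assumptions, not a schema the tooling requires.

```python
# Hypothetical shape of a scoped task and its agent-generated execution units.
# Field names and the "fits one context" heuristic are illustrative only.
from dataclasses import dataclass, field


@dataclass
class ExecutionUnit:
    description: str
    acceptance_note: str
    done: bool = False


@dataclass
class ScopedTask:
    title: str
    context_hint: str                      # which files/modules are in scope
    units: list[ExecutionUnit] = field(default_factory=list)

    def fits_one_context(self, max_units: int = 8) -> bool:
        # Rough proxy for "fits comfortably in a single agent task context".
        return len(self.units) <= max_units


task = ScopedTask(
    title="Offline caching for the feed screen",
    context_hint="lib/feed/*, no changes outside the feed module",
    units=[
        ExecutionUnit("Add local cache store", "cache survives app restart"),
        ExecutionUnit("Fall back to cache when offline", "feed renders without network"),
        ExecutionUnit("Invalidate cache on refresh", "pull-to-refresh fetches fresh data"),
    ],
)
print(task.fits_one_context())  # True -> small enough to hand to the agent
```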
2.5 Secrets Stay Secret
Fine‑grained PAT plus GitHub Secrets meant the agent never held a signing key. The 2025 Wiz secrets‑leak report is proof that anything less is asking for page‑one headlines.
2.6 Real-Device CI/CD – The Only Trustworthy Loop
CI/CD isn’t optional when working with AI agents; it’s what turns speed into reliability. No pipeline, no autonomy – just faster mistakes.
Every pull request goes through the same pipeline: build, sign, ship to TestFlight and Play Console. That means the code lands on real hardware, gets tested by real eyes, and reveals real bugs the agent never saw coming.
The first sprint showed what works. These are the bets I’ll double down on next time to turn speed into consistency – not just more commits.
3. Lessons Learned – Where I’ll Push Further
3.1 Smarter Context, Longer Autonomy
Big LLMs forget fast. The fix isn’t just more tokens. It’s structured, real-time access to the whole repo, open tasks, recent merges – everything that defines “what’s really going on.”
The longer a single chat grows, the worse the output gets. So I’d like to push for smarter retrieval next time: live task lists, commit-aware reasoning, and context that updates as the codebase evolves.
Both Devin and Anthropic hit the same wall, even though one favors single-agent and the other multi-agent setups: without structured, evolving context, long autonomous runs just fall apart.
In my own sprint, I tackled the same challenge by keeping tasks small and starting each with a clean context. A simple but surprisingly effective workaround.
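A hedged sketch of that workaround: assemble a small, fresh bundle (task text, recent commits, relevant files) for each run instead of letting one chat grow. The git call is standard; the paths, limits, and trimming strategy are assumptions for illustration only.

```python
# Sketch: build a fresh, bounded context bundle for each agent task instead
# of reusing one ever-growing chat. Limits and file choices are assumptions.
import subprocess
from pathlib import Path


def recent_commits(n: int = 10) -> str:
    # Standard git call; returns the last n one-line commit messages.
    return subprocess.run(
        ["git", "log", f"-{n}", "--oneline"],
        capture_output=True, text=True, check=True,
    ).stdout


def build_context(task_text: str, relevant_files: list[str],
                  budget_chars: int = 20_000) -> str:
    parts = [f"## Task\n{task_text}", f"## Recent commits\n{recent_commits()}"]
    for path in relevant_files:
        parts.append(f"## {path}\n{Path(path).read_text()}")
    bundle = "\n\n".join(parts)
    return bundle[:budget_chars]  # crude cap; smarter retrieval would rank and trim


# Usage: feed build_context(...) to the agent as the *entire* task context,
# then discard it when the task is done so the next task starts clean.
```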
3.2 Scaling the Team
One dev plus one agent is simple. But the moment you add more people (or more agents) things can get messy fast.
Who owns what? What changed while the agent was thinking? What if two agents fix the same bug? Without shared state and safe commit boundaries, you don’t get more speed. Just more conflict.
One thing I’d like to try: task leasing. The Planner picks up a task from a shared hub (like `task.md`), checks whether any parallel tasks run by other agents or humans might affect the work, and has the Validator confirm the state before pushing code (sketched below). Paired with clean CI/CD and commit guards, that might keep the swarm aligned.
These ideas would need careful testing in real-world coding workflows, as current multi-agent systems often fail due to shared-state complexity.
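Here’s a rough sketch of the leasing idea under those caveats. The in-memory hub, field names, and conflict rule are assumptions; a real setup would back the hub with `task.md` or a small service plus commit guards.

```python
# Sketch of task leasing: a Planner claims a task from a shared hub, checks
# for conflicting in-flight work, and only then hands it to execution.
import time
import uuid

hub = {
    "t1": {"title": "Fix crash on login", "paths": ["lib/auth/"], "lease": None},
    "t2": {"title": "Refactor auth module", "paths": ["lib/auth/"], "lease": None},
}


def conflicts(task: dict) -> bool:
    # Conflict = another *leased* task touching an overlapping path.
    return any(
        other["lease"] and other is not task
        and set(other["paths"]) & set(task["paths"])
        for other in hub.values()
    )


def lease(task_id: str, agent: str, ttl_s: int = 900) -> bool:
    task = hub[task_id]
    if task["lease"] or conflicts(task):
        return False  # someone else is (or might be) in the way
    task["lease"] = {"agent": agent, "token": str(uuid.uuid4()),
                     "expires": time.time() + ttl_s}
    return True


print(lease("t1", "planner-A"))  # True  -> safe to start
print(lease("t2", "planner-B"))  # False -> overlaps with the leased t1
```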
3.3 True TDD Loop
Tests after code worked fine, but next run the agent will flip it: failing test first, fix second. This tightens the feedback loop and cuts down surprises at the Validator step.
Anthropic recommends the same test-first mindset in their Claude Code Best Practices: write the tests first, confirm they fail, then guide your agent to turn them green. The goal is the same: catch bugs early, not after they hit production.
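A tiny illustration of that flow in pytest terms, with a hypothetical `pricing.parse_price` function: the test is committed first, it fails because the implementation doesn’t exist yet, and only then is the agent asked to turn it green.

```python
# tests/test_pricing.py - written (and failing) BEFORE any implementation.
# The agent's only job is to make this pass; the test defines "done".
import pytest

from pricing import parse_price  # hypothetical module; fails until implemented


def test_parses_plain_euro_amount():
    assert parse_price("12,90 €") == 1290  # price in cents


def test_rejects_garbage():
    with pytest.raises(ValueError):
        parse_price("free!!")
```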
3.4 Deeper Static Analysis
Syntax checks aren’t enough. Unlike experienced developers, AI agents don’t intuitively spot complex or fragile code structures. Adding tools like SonarQube or Qodana to the CI pipeline gives early feedback on code quality, helping catch issues the AI agent might otherwise repeat without realizing it.
3.5 UAT Feedback Automation
One practical issue I didn’t solve: how do human testers get their feedback into the agent’s task list? On a team, I would create a separate hook integrated with Jira or Slack (or whatever tool the team uses), so that testers could report issues and the AI agent would pick them up and automatically create a linked task. But in this case I didn’t have a team, so I just added the issues manually to `task.md` and let the agent handle them.
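With a team, the hook could start as small as this sketch: a function (which a Jira or Slack webhook handler would call) that appends a linked entry to `task.md`. The entry format and field names are assumptions, not an actual integration.

```python
# Sketch: turn tester feedback into a task the agent can pick up.
# A real version would sit behind a Jira/Slack webhook; here it's a plain
# function that appends a linked entry to task.md.
from datetime import date
from pathlib import Path

TASK_FILE = Path("task.md")


def report_issue(reporter: str, summary: str, steps: str) -> None:
    entry = (
        f"\n- [ ] UAT: {summary} "
        f"(reported by {reporter}, {date.today().isoformat()})\n"
        f"  - Steps to reproduce: {steps}\n"
        f"  - Source: manual tester feedback\n"
    )
    with TASK_FILE.open("a", encoding="utf-8") as f:
        f.write(entry)


# Usage: a webhook handler would call this with the payload fields.
report_issue("tester-1", "Login button unresponsive on Pixel 7",
             "Open app offline, tap Login twice")
```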
Mobile dev: In general, I think the biggest improvement would come from giving the AI agent access to the Android emulator UI during development. It could then have run the tests and fixed the majority of issues automatically as it worked. But as I mentioned earlier, real-time emulator access wasn’t available in this project due to technical limitations in Firebase Studio.
4. MCP (Model Context Protocol) – The Next Frontier
Everything so far (coding speed-up, the Planner–Executor–Validator loop, and the CI/CD pipeline) was built and tested during the initial 30-day sprint. But while writing this recap, one thing became clear:
There’s still room to grow. And it starts with context.
The core limitation I ran into was this: the AI agent didn’t truly “know” what was happening outside the project. Each task was handled in context isolation, by a single AI agent.
That missing context (the lack of shared memory or real-time awareness) is what I’ll explore next.
Why MCP matters
Model Context Protocol (MCP) lets an agent bolt on extra tools in real time. Need repo search? A test runner? Up-to-date docs? Hook it up as a structured API call instead of fragile prompt glue.
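For example, exposing a repo-search helper as an MCP tool might look roughly like the sketch below. It follows my reading of the official MCP Python SDK’s FastMCP helper at the time of writing; treat the exact import path and decorator names as assumptions and check the current SDK docs.

```python
# Sketch: expose a repo-search helper as an MCP tool so the agent can call it
# as a structured API instead of prompt glue. Based on the MCP Python SDK's
# FastMCP helper; verify names against the current SDK before relying on this.
import subprocess

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("repo-tools")


@mcp.tool()
def search_repo(pattern: str) -> str:
    """Grep the working tree for a pattern and return matching lines."""
    result = subprocess.run(
        ["git", "grep", "-n", pattern],
        capture_output=True, text=True,
    )
    return result.stdout or "no matches"


if __name__ == "__main__":
    mcp.run()  # serve over stdio so an MCP-capable agent can attach
```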
Below are some candidate features I’m excited to try:
Shared brain, real‑time
Every agent writes to (and reads from) a live `task.md` hub. Spin up ten runners, one planner, one validator; they all share the same queue and state, acting like one coherent engineer instead of a Slack channel on fire.
Always‑fresh docs (Context 7 FTW)
Because MCP streams docs in via Context 7, the agent’s knowledge is never stale. Update the README, push to main, and the next call sees it automatically.
Enterprise-ready glue
Forget brittle webhooks – MCP exposes Jira, GitHub, Slack as typed plugin calls. The agent plugs straight into your existing workflows without extra scaffolding.
Bottom line: MCP turns one clever AI agent into a fully-armed AI agent squad, all speaking the same language and pulling from the same live playbook.
5. Final Word
AI agents won’t replace you, but they will scale what you can deliver.
All this worked because I’ve got 20+ years of real dev work behind me: the blueprint, the rules, the guardrails all come from doing the work first.
If we hand every line of code to the AI, that craft fades and soon there’s nothing left to steer.
So protect your craft. Build the hard parts yourself. Keep your edge, so the agent stays a partner – not the boss.
Wire it tight, scope it clear, and let your AI agent prove it can keep up!
6. If You Want The 3× Boost – Do This
- Control Stack First. Planner → Executor → Validator – no shortcuts.
- Keep `/rules` alive. Update instructions as your agent learns.
- Scope tasks tight. Small tasks, clear acceptance notes, tracked commits.
- Secrets stay secret. Repo-scoped PAT + CI/CD secrets only.
- Real tests. Always run on real hardware, no emulator-only trust.
- Watch, learn, tweak. Your agent only stays smart if you guide it.
Ready for what’s next?
I’m planning to plug MCP into the All‑Hands AI framework next, linking multiple agents with a shared brain and tighter feedback loops. I’ll share how that turns out once I’ve pushed it far enough to see what breaks.
Seen anything I missed? Or got your own battle story testing AI agents on real projects? Drop it in the comments. I read them all.