How I designed a multi-agent AI support copilot



This content originally appeared on DEV Community and was authored by Eelco Los

Every support ticket starts the same way at our SaaS company: open the ticket, scan the description, then spend the next 15 minutes manually gathering context across five different systems. Check application telemetry for exceptions. Look up the customer in the CRM. Grep provisioning logs for failed sync events. Search work item systems and source control for related bugs. Cross-reference identity policy config if authentication is involved.

That context is all out there. It’s just scattered. By the time you’ve assembled it, you’ve already spent most of your time budget on information retrieval, not on reasoning.

That’s what I set out to fix. This is the first post in a series about building a multi-agent AI support copilot.

The idea: a side-by-side AI copilot, not a replacement

The typical AI support story starts with a ticket routing chatbot or an automatic responder. That’s not this. We’re not changing the helpdesk agent’s job or asking the model to speak to the customer. Instead, we start a background process alongside the human support agent. The moment a ticket arrives, the copilot gathers context from the surrounding systems and checks whether that evidence corroborates, weakens, or contradicts what the customer is reporting.

The first milestone is deliberately modest from the product side: show the ticket and the supporting evidence side by side so the support agent can reason faster with better context. Under the hood, the implementation already goes further and can synthesize ranked hypotheses with confidence scores, but I still want the first user-visible win to be evidence the human can inspect. For now, the human stays in charge.

Everything that touches the customer still requires a human to approve it.

The design had three hard requirements:

  1. Parallel evidence gathering: all domains queried at the same time, not sequentially
  2. Structured outputs: every agent returns validated JSON, not prose that needs re-parsing
  3. Human-in-the-loop for every action: the copilot informs judgment, humans approve, deterministic code acts
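The first two requirements can be sketched together. This is a minimal, hypothetical shape, not the actual implementation: each worker is simulated, but the structure shows all five domains queried concurrently, with each agent returning JSON that is validated into a typed record instead of free prose.

```python
import asyncio
import json
from dataclasses import dataclass

@dataclass
class EvidenceClaim:
    domain: str        # e.g. "apm", "crm", "scim", "b2c", "alm"
    claim: str         # what the evidence suggests
    confidence: float  # the claim's local confidence, 0..1

async def run_agent(domain: str) -> EvidenceClaim:
    # Placeholder for a real query (telemetry, CRM lookup, log grep, ...)
    await asyncio.sleep(0)  # stands in for I/O latency
    raw = json.dumps(
        {"domain": domain, "claim": f"no anomaly in {domain}", "confidence": 0.5}
    )
    data = json.loads(raw)  # requirement 2: validated JSON, not prose
    return EvidenceClaim(**data)

async def gather_evidence(ticket_id: str) -> list[EvidenceClaim]:
    domains = ["apm", "crm", "scim", "b2c", "alm"]
    # Requirement 1: all domains queried at the same time, not sequentially
    return list(await asyncio.gather(*(run_agent(d) for d in domains)))

claims = asyncio.run(gather_evidence("TICKET-1"))
assert len(claims) == 5
```

If any agent returns malformed JSON, `json.loads` or the dataclass constructor raises immediately, which is the point: a schema violation fails loudly at the boundary instead of leaking prose downstream.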

Where the design came from

The initial architecture sketch came from a long ChatGPT conversation, the kind where you brain-dump a problem and the model helps you think through the components. That session produced the five-agent skeleton: identity resolution, observability telemetry, provisioning logs, work items, and a synthesis layer. It also produced the confidence arbitration formula:

FinalConf = sigmoid(Support - Conflict) × AgreementMultiplier

Where Support and Conflict are weighted sums of evidence claims, each claim weighted by the originating agent’s reliability (R_agent) and the claim’s local confidence.
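In code, the arbitration step is small. The claim tuples and the agreement multiplier value below are illustrative assumptions; only the formula itself comes from the design.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def final_confidence(claims, agreement_multiplier: float) -> float:
    # Each claim: (supports: bool, r_agent: float, local_conf: float).
    # Support and Conflict are weighted sums: R_agent * local confidence.
    support = sum(r * c for s, r, c in claims if s)
    conflict = sum(r * c for s, r, c in claims if not s)
    return sigmoid(support - conflict) * agreement_multiplier

claims = [
    (True, 0.9, 0.8),   # reliable agent, strong supporting claim
    (True, 0.7, 0.6),   # second corroborating claim
    (False, 0.5, 0.4),  # weaker contradicting claim
]
print(round(final_confidence(claims, agreement_multiplier=1.0), 3))  # → 0.719
```

Squashing `Support - Conflict` through a sigmoid keeps the final score in (0, 1) even when many agents pile on evidence, and the multiplier can then reward cross-domain agreement.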

The second source was internal: a reusable agentic scaffolding template we use for experiments. It had a graded architecture philosophy that mapped almost exactly to what we needed:

| Grade | Template capability | What we needed |
| --- | --- | --- |
| 1 | CLAUDE.md session memory | SupportAgent persona + IncidentContext schema |
| 2 | Domain expertise YAML files | Mental models per support domain (APM, CRM, SCIM, B2C, ALM) |
| 3 | Skills (SKILL.md with frontmatter) | Evidence agent skill implementations |
| 4 | Closed-loop validation | Re-evidence loop + Reviewer gate |
| 5 | Orchestration (parallel worker dispatch) | Management Agent with parallel evidence dispatch |

The template’s confidence_score field in expertise YAML files turned out to be exactly our R_agent reliability weight. Agents that update their own expertise files after each incident self-improve over time.
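The self-improvement step can be sketched as a simple update rule. The function name and the learning-rate update below are my assumptions for illustration, not the template's actual logic: after an incident, an agent nudges its own `confidence_score` (our R_agent) toward 1.0 when its claims held up and toward 0.0 when they didn't.

```python
# Hypothetical update rule for an agent's reliability weight; the real
# value lives in the agent's expertise YAML file as confidence_score.
def updated_score(current: float, claim_was_correct: bool, lr: float = 0.1) -> float:
    """Move confidence_score toward 1.0 on a hit, toward 0.0 on a miss."""
    target = 1.0 if claim_was_correct else 0.0
    return current + lr * (target - current)

score = 0.70
score = updated_score(score, claim_was_correct=True)   # → 0.73
score = updated_score(score, claim_was_correct=False)  # → 0.657
print(round(score, 3))
```

Because the score stays in [0, 1] and moves proportionally, one bad incident dents a reliable agent's weight only slightly, while a consistently wrong agent decays toward irrelevance in the arbitration sum.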

The two-layer repo structure

The template introduced a two-layer structure we adopted verbatim:

support-agent-research/
├── .agentic/              ← domain knowledge layer
│   ├── CLAUDE.md          - SupportAgent persona, session bootstrap
│   ├── memory/            - architecture decisions, ADR log
│   ├── plans/             - living docs for major system design changes
│   ├── expertise/         - per-domain YAML mental models
│   └── specs/             - per-incident IncidentContext files
│
└── .claude/               ← execution layer
    ├── agents/            - agent definitions (.md with frontmatter)
    ├── skills/            - evidence skill implementations (SKILL.md)
    └── settings.json      - hook wiring (Reviewer gate, Policy Gate)

The important part is that .agentic/ is the knowledge layer. It keeps the system’s memory between incidents: persona, architecture notes, domain expertise, plans, and incident files. In this article, plans/ just shows that the system’s design and build history live in the same place. .claude/ is the execution layer. It turns that knowledge into agents, skills, and hooks. Keeping them separate means you can update domain knowledge without touching execution logic, and vice versa.

How this matches broader agentic patterns

I didn’t invent this split in a vacuum. The Techorama 2025 sessions I captured in my notebook pointed at the same shape: a central orchestrator, specialist workers, shared context, and a clean split between knowledge and execution. I then turned that shape into a reusable agentic template, basically a starter kit that bootstraps those layers into a repo and keeps the supporting plans, memory, and delivery notes in one place.

That also matches the plan modes now showing up in current CLIs: they separate thinking from doing. The template goes further by preserving state, specialist workers, and feedback loops, so the agent can plan, execute, and improve without starting from zero each time.

These public repos corroborate the same building blocks. microsoft/skills covers skills, custom agents, AGENTS.md templates, and MCP configs. dotagents and source-agents focus on keeping one canonical instruction set synchronized across tools. claude-reflect turns corrections into durable memory and reusable skills. agnix adds validation gates by linting agent configs before they break workflows. For the orchestrator-worker shape itself, I point to the architecture docs from Anthropic and OpenAI below rather than forcing a weak repo comparison.

Anthropic draws a line between workflows and agents and recommends starting simple. The composable patterns they call out, like routing, parallelization, orchestrator-workers, and evaluator-optimizer loops, are the same kinds of building blocks I ended up using here. OpenAI’s Agents SDK makes a similar point by keeping the primitive set small: instructions, tools, handoffs, guardrails, sessions, and tracing. It also separates orchestration done by the LLM from orchestration done in code. That distinction matters to me because I want the model to reason, but I want deterministic code to route work and enforce boundaries. See Anthropic’s “Building effective agents”, OpenAI Agents SDK docs, and OpenAI Agents orchestration docs.

AGENTS.md fits this pattern too. It’s basically a repo-local README for agents, a human-readable place for durable instructions that complements the normal README.md. That’s exactly what .agentic/CLAUDE.md is doing for me here. See AGENTS.md.

| Repo piece | What it stores | Broader pattern |
| --- | --- | --- |
| .agentic/CLAUDE.md | Persona, operating rules, bootstrap guidance | Repo-local agent instructions, AGENTS.md |
| .agentic/memory/ | Architecture decisions and shared context | Durable instructions and session memory |
| .agentic/plans/ | Living record of the system’s design and build history | Bootstrap paths, planning, and rollout history |
| .agentic/expertise/ | Domain-specific mental models | Specialist workers and routing inputs |
| .agentic/specs/ | Incident-scoped state and evidence | Persistent session state, blackboard |
| .claude/agents/ | Worker definitions | Handoffs, agent-as-tool |
| .claude/skills/ | Narrow deterministic actions | Tools and guardrails |
| .claude/settings.json | Hook wiring and policy gates | Deterministic routing and execution control |

That’s why I ended up with a central Management Agent, specialized workers, a shared IncidentContext, and deterministic gates. The manager handles routing, the workers stay narrow, and the shared context gives every worker the same incident memory without letting them talk to each other directly. Then the reviewer and policy hooks decide what can actually happen. I want the model to gather evidence and propose answers, but I want code to decide when work is allowed to move forward. That same split shows up in Anthropic’s guidance, OpenAI’s Agents SDK docs, and AGENTS.md.

+------------------------+      +------------------------+
| .agentic               | ---> | .claude                |
| knowledge layer        |      | execution layer        |
|                        |      |                        |
| - CLAUDE.md            |      | - agents/              |
| - memory/              |      | - skills/              |
| - plans/               |      | - settings.json        |
| - expertise/           |      |                        |
| - specs/               |      |                        |
+------------------------+      +------------------------+

Core design decisions

Before writing a line, we laid out the key architectural bets in decisions.md:

  • Orchestrator-Worker over Decentralised: agents never pass control to each other. This prevents circular reasoning and cascading hallucinations. The Orchestrator is the only router.
  • IncidentContext as Blackboard: the shared knowledge base all agents write to and the Synthesis Agent reads from. Enables cross-domain correlation without agent-to-agent communication. “Identity policy changed last week” + “provisioning returned null” + “telemetry shows NullReferenceException” all point to the same root cause.
  • CLI-first auth: every skill uses the vendor’s own CLI (az, gh, acli) for authentication. Skills are credential-free; the bootstrap section validates auth before any query runs.
  • No LLM both concludes and acts: reasoning and execution are separate. The Policy Gate is deterministic code, not a model.
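The last bet is easy to make concrete. This is a minimal sketch, assuming the gate checks a proposed action against a fixed allow-list plus a human-approval flag; the action names are hypothetical, but the key property is real: no model runs at this layer.

```python
# Deterministic Policy Gate: plain code, no LLM, decides what may proceed.
ALLOWED_ACTIONS = {"post_internal_note", "attach_evidence"}   # internal-only
CUSTOMER_FACING = {"send_customer_reply", "close_ticket"}     # need a human

def policy_gate(action: str, human_approved: bool) -> bool:
    """Return True only if the action is allowed to execute."""
    if action in ALLOWED_ACTIONS:
        return True               # internal actions never touch the customer
    if action in CUSTOMER_FACING:
        return human_approved     # human-in-the-loop for anything customer-facing
    return False                  # unknown actions are rejected by default

assert policy_gate("attach_evidence", human_approved=False)
assert not policy_gate("send_customer_reply", human_approved=False)
assert policy_gate("send_customer_reply", human_approved=True)
```

Because the gate is a pure function over an explicit allow-list, it can be unit-tested exhaustively, which is exactly what you cannot do when a model both concludes and acts.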

Orchestrator-Worker at a glance

      Support ticket
            |
            v
     Management Agent
   /   |    |    |   \
  /    |    |    |    \
  v    v    v    v     v
 APM  CRM  SCIM  B2C  ALM
  \    |    |    |    /
   \   |    |    |   /
    v  v    v    v  v
     IncidentContext
            |
            v
     Synthesis Agent

IncidentContext at a glance

IncidentContext
  - intake
  - resolved_identity
  - evidence
  - hypotheses
  - audit_log
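The five sections above map naturally onto a typed record. A minimal sketch, assuming dict/list field types; the real schema lives in `.agentic/specs/` and is richer than this.

```python
from dataclasses import dataclass, field

@dataclass
class IncidentContext:
    """Shared blackboard: workers write evidence, Synthesis reads it all."""
    intake: dict = field(default_factory=dict)             # raw ticket details
    resolved_identity: dict = field(default_factory=dict)  # who the customer is
    evidence: list = field(default_factory=list)           # claims from workers
    hypotheses: list = field(default_factory=list)         # ranked root causes
    audit_log: list = field(default_factory=list)          # every read/write

ctx = IncidentContext(intake={"ticket_id": "TICKET-1"})
ctx.evidence.append({"domain": "apm", "claim": "NullReferenceException spike"})
ctx.audit_log.append("apm agent wrote 1 claim")
```

Keeping all cross-domain state in one structure is what lets "identity policy changed" and "provisioning returned null" land next to each other without the agents ever talking directly.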

That’s the design story. In part 2, I’ll show what happened when the first live tickets hit the wiring.
