Run Multi-Agent AI in the Cloud Without a Local GPU Using Docker Offload and Compose



This content originally appeared on DEV Community and was authored by Atsushi Suzuki

With the release of Docker Desktop 4.43 (July 3, 2025), you can now declare AI models, agents, and MCP tools in a single compose.yaml file and launch them all with a single docker compose up.

On top of that, the beta release of Docker Offload allows you to run Compose projects directly in the cloud with NVIDIA L4 GPUs. This opens the door to running large-scale models from even a modest laptop.

In this post, I’ll walk through running the official A2A Multi‑Agent Fact Checker sample from docker/compose-for-agents entirely with Compose, and then demonstrate how to offload the workload to the cloud using Docker Offload.

Some images in this post are sourced from the official Docker Offload Content Kit provided to Docker Captains.

Sample Overview

The A2A Multi‑Agent Fact Checker is a multi-agent system built with Google’s ADK (Agent Development Kit) and the A2A protocol. It features three agents—Auditor, Critic, and Reviser—that work together to research, verify, and revise a given claim, then return a final conclusion.

  • Auditor: Breaks down the user’s claim into subtasks and delegates them to Critic and Reviser. Collects the final answer and returns it via the UI.
  • Critic: Performs external web searches using the DuckDuckGo MCP tool to gather supporting evidence.
  • Reviser: Refines and verifies the output using the evidence gathered by Critic and the initial draft from Auditor.

The Critic communicates with the outside world via the MCP Gateway, and the inference model (Gemma 3 4B‑Q4) is hosted via Docker Model Runner.
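
Under the A2A convention, each agent advertises its capabilities through an agent card served at /.well-known/agent.json. As a minimal sketch of what agent-to-agent discovery looks like here (assuming the sample's agents follow that convention, and reusing the internal address the Auditor receives via CRITIC_AGENT_URL), a client inside the Compose network could inspect the Critic's card like this:

import json
import urllib.request

# Internal Compose DNS name, same value the Auditor gets via CRITIC_AGENT_URL.
CRITIC_URL = "http://critic-agent-a2a:8001"

# A2A agents conventionally publish their agent card at this well-known path.
with urllib.request.urlopen(f"{CRITIC_URL}/.well-known/agent.json") as resp:
    card = json.load(resp)

print(card.get("name"), "-", card.get("description"))
for skill in card.get("skills", []):  # advertised skills, per the A2A card schema
    print("  *", skill.get("name"))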


Key Highlights of compose.yaml

Here’s the full compose.yaml used to define the multi-agent system:

services:
  # Auditor Agent coordinates the entire fact-checking workflow
  auditor-agent-a2a:
    build:
      target: auditor-agent
    ports:
      - "8080:8080"
    environment:
      - CRITIC_AGENT_URL=http://critic-agent-a2a:8001
      - REVISER_AGENT_URL=http://reviser-agent-a2a:8001
    depends_on:
      - critic-agent-a2a
      - reviser-agent-a2a
    models:
       gemma3:
         endpoint_var: MODEL_RUNNER_URL
         model_var: MODEL_RUNNER_MODEL

  critic-agent-a2a:
    build:
      target: critic-agent
    environment:
      - MCPGATEWAY_ENDPOINT=http://mcp-gateway:8811/sse
    depends_on:
      - mcp-gateway
    models:
       gemma3:
         # specify which environment variables to inject into the container
         endpoint_var: MODEL_RUNNER_URL
         model_var: MODEL_RUNNER_MODEL

  reviser-agent-a2a:
    build:
      target: reviser-agent
    environment:
      - MCPGATEWAY_ENDPOINT=http://mcp-gateway:8811/sse
    depends_on:
      - mcp-gateway
    models:
       gemma3:
         endpoint_var: MODEL_RUNNER_URL
         model_var: MODEL_RUNNER_MODEL

  mcp-gateway:
    # mcp-gateway secures your MCP servers
    image: docker/mcp-gateway:latest
    use_api_socket: true
    command:
      - --transport=sse
      - --servers=duckduckgo
      # add an MCP interceptor to log the responses
      - --interceptor
      - after:exec:echo RESPONSE=$(cat) >&2

models:
  # declare LLM models to pull and use
  gemma3:
    model: ai/gemma3:4B-Q4_0
    context_size: 10000 # 3.5 GB VRAM
    #context_size: 131000 # 7.6 GB VRAM

Top-Level models

As of Compose v2.38, you can declare LLMs packaged as OCI Artifacts under the top-level models field. Docker Model Runner automatically pulls the artifact and exposes the model as an API endpoint.

Per-Service models

Each service declares which model it uses, and endpoint_var / model_var name the environment variables Compose injects with the model's endpoint URL and model name, conceptually equivalent to:

export MODEL_RUNNER_URL=http://model-runner:12434
export MODEL_RUNNER_MODEL=gemma3

This means your app can simply read environment variables without hardcoding the model path.
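
For example, an agent process could build an OpenAI-compatible client from those two variables alone. Here's a minimal sketch, assuming MODEL_RUNNER_URL points at an OpenAI-compatible base URL (which is what Docker Model Runner serves) and that the openai Python package is installed; it is not meant to mirror the sample's actual ADK wiring:

import os

from openai import OpenAI

# Both values are injected by Compose via endpoint_var / model_var,
# so nothing about the model is hardcoded here.
client = OpenAI(
    base_url=os.environ["MODEL_RUNNER_URL"],
    api_key="not-needed",  # Model Runner doesn't require an API key
)

response = client.chat.completions.create(
    model=os.environ["MODEL_RUNNER_MODEL"],
    messages=[{"role": "user", "content": "How far is the Moon from the Earth?"}],
)
print(response.choices[0].message.content)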

MCP Gateway

The docker/mcp-gateway image acts as a secure relay for MCP servers like DuckDuckGo. It communicates with the Critic agent using Server-Sent Events (SSE). The --interceptor flag logs the raw responses directly to stderr.
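
To see what the Critic is talking to, you can point an MCP client at the same SSE endpoint and list the tools the gateway exposes. A rough sketch using the MCP Python SDK (the mcp package; its client API may differ between versions, and the URL reuses MCPGATEWAY_ENDPOINT from the compose file):

import asyncio

from mcp import ClientSession
from mcp.client.sse import sse_client

GATEWAY_URL = "http://mcp-gateway:8811/sse"  # same value as MCPGATEWAY_ENDPOINT

async def main() -> None:
    # Open the SSE transport, then run the standard MCP handshake.
    async with sse_client(GATEWAY_URL) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            for tool in tools.tools:  # e.g. the DuckDuckGo search tool
                print(tool.name, "-", tool.description)

asyncio.run(main())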

Dependency Management

As in any Compose project, depends_on manages startup order: MCP Gateway → Critic/Reviser → Auditor. That spares the agents from needing custom retry logic just to find their dependencies at startup (keep in mind that plain depends_on only waits for containers to start, not for them to be healthy).

Running Locally

You can launch the stack locally with:

docker compose up --build
[+] Running 8/9
 ✔ reviser-agent-a2a                  Built       0.0s
 ✔ critic-agent-a2a                   Built       0.0s
 ✔ auditor-agent-a2a                  Built       0.0s
 ⠴ gemma3 Configuring                            76.5s
 ...

Since Gemma 3 4B‑Q4 is quantized, it even runs on my MacBook Air M2.


Open http://localhost:8080 in your browser and type in a claim such as:

How far is the Moon from the Earth?

The Critic performs a DuckDuckGo search, the Reviser polishes the output, and the Auditor returns the final answer.


Using Docker Offload

If you want to use a larger model like Gemma 27B Q4, local GPUs might not cut it. That’s where Docker Offload comes in—just enable the feature and override your model config to run in the cloud.

Enabling Docker Offload

First, sign up for beta access on Docker’s official site. (As a Docker Captain, I received early access.)

Then, go to Settings > Beta Features and enable both:

  • “Enable Docker Offload”
  • “Enable Docker Offload GPU Support”


Switch the Docker Desktop toggle to the cloud icon to activate Offload (or run docker offload start).


compose.offload.yaml

Prepare a separate file to override the model definition:

models:
  gemma3:
    model: ai/gemma3-qat:27B-Q4_K_M 
    context_size: 10000 # 18.6 GB VRAM
    # context_size: 80000 # 28.37 GB VRAM
    # context_size: 131000 # 35.5 GB VRAM

Running with Offload

To launch in the cloud, combine the two Compose files:

docker compose -f compose.yaml -f compose.offload.yaml up --build

This overrides the top-level models field with the Offload-specific config.
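
You can preview the merged configuration with docker compose -f compose.yaml -f compose.offload.yaml config. To confirm which model is actually being served after the override, one option is to ask the Model Runner endpoint itself; a small sketch, again assuming MODEL_RUNNER_URL is an OpenAI-compatible base URL:

import json
import os
import urllib.request

# List the models the runner currently serves (OpenAI-compatible GET /models),
# which should now include the 27B variant from compose.offload.yaml.
base_url = os.environ["MODEL_RUNNER_URL"].rstrip("/")
with urllib.request.urlopen(f"{base_url}/models") as resp:
    for model in json.load(resp).get("data", []):
        print(model["id"])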


Docker Offload gives you 300 GPU credits for free, and any additional usage is billed at $0.015 per GPU second (roughly $54 per GPU hour). Don't forget to stop the service afterward:

docker offload stop


Final Thoughts

Trying out Compose and Offload together really shows the power of unified agent, model, and tool orchestration. It’s incredibly convenient to use the same docker compose up command for both local and cloud environments.

The agent space is evolving rapidly, so if you have a better workflow or tips, I’d love to hear about them.

If you’re curious about Docker Offload, start experimenting within the free credit limit—you might be surprised how far you can go.
