Building a Zoom Meeting Bot in 2025 on AWS



This content originally appeared on DEV Community and was authored by Alex Tsimbalistov

Zoom “meeting bots” are everywhere—note takers, assistants, and automated recorders are now so common that entire threads debate whether people even need to attend calls anymore [1].

If you’re a developer trying to build one, you’ll quickly find there’s no single “Zoom bot API.” Instead, you’ll need to combine Zoom’s SDKs/APIs with a streaming pipeline, speech-to-text, and outbound communication.

This article outlines a practical architecture for a Zoom meeting bot—based on our real ChatterBox implementation—that streams meeting audio over WebSockets, transcribes it in real time, and provides both near real-time and post-call transcripts. We’ll cover Zoom platform options, media capture choices, audio/transcription details, and operational trade-offs.

What “Zoom bot” actually means on Zoom’s platform

On Zoom, “bot” is not a single product. You typically combine one or more of:

  • Meeting SDK: Embed the Zoom Meeting experience in your app. You can programmatically join meetings, interact with the meeting participants, and integrate with your product. Subject to Zoom’s feature review and distribution requirements [4].
  • Video SDK: Build fully custom video/audio apps on Zoom’s low‑level infrastructure. This does not join standard Zoom Meetings; it powers your own experience [6].
  • Zoom REST APIs: Manage users, meetings, recordings, webhooks, and chat. Useful for provisioning and post‑meeting workflows [5].

Developers frequently ask how to “auto‑join” a meeting headlessly. The short answer: Zoom is not designed for headless automation that screenscrapes a UI or bypasses the SDKs. Those approaches are brittle and often violate terms or break when Zoom updates the client. If you need an automated participant that joins a regular Zoom Meeting, you’ll generally build a Meeting SDK app that programmatically joins with appropriate review/approval [4]. If you need full media control without a Zoom Meeting, build with Video SDK instead [6].
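
If you take the Meeting SDK path, the join flow is driven by a signed SDK signature. The sketch below generates one with jsonwebtoken; the payload field names (appKey, sdkKey, mn, role, tokenExp) follow Zoom’s published auth-endpoint sample at the time of writing and should be verified against the current docs, and the environment variable names are our own.

// src/meetingSdkSignature.ts (illustrative sketch, not Zoom's official code)
import jwt from 'jsonwebtoken';

export function buildMeetingSdkSignature(meetingNumber: string, role: 0 | 1): string {
  const iat = Math.floor(Date.now() / 1000) - 30; // small clock-skew allowance
  const exp = iat + 60 * 60 * 2;                  // 2-hour validity

  const payload = {
    appKey: process.env.ZOOM_MEETING_SDK_KEY,
    sdkKey: process.env.ZOOM_MEETING_SDK_KEY,
    mn: meetingNumber,
    role, // 0 = participant, 1 = host
    iat,
    exp,
    tokenExp: exp,
  };

  // HS256-signed JWT that the Meeting SDK client presents when joining
  return jwt.sign(payload, process.env.ZOOM_MEETING_SDK_SECRET!, { algorithm: 'HS256' });
}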

Getting media from Zoom (key decision)

There’s no direct “Zoom → Kinesis Video Streams” switch. You must capture audio via one of Zoom’s supported paths, then bridge it to your processing stack:

  • Meeting SDK Raw Data – Real-time mixed/per-participant audio. Pros: Low latency, per-speaker streams. Cons: Feature review required, platform-dependent support.
  • Zoom Live Streaming (RTMP/S) – Mixed program feed. Pros: Simple to enable. Cons: Adds latency, needs RTMP ingestion/demux.
  • SIP Connector – Telephony-grade integration. Pros: Reliable. Cons: Separate license, mono audio, needs RTP→PCM conversion.
  • Cloud Recording + Webhooks – Post-meeting only. Pros: Easy start. Cons: No real-time.
  • Video SDK – Full raw media control (non-meeting). Pros: Complete flexibility. Cons: Different UX/compliance profile.

Our reference architecture

In ChatterBox, we use AWS Kinesis Video Streams (KVS) for durability and fan-out once audio leaves Zoom. This is just one approach—equivalents exist in other clouds (e.g., Azure Media Services, GCP Live Stream).

[Diagram: Zoom bot architecture]

A robust bot pipeline needs to:

  • Ingest live meeting audio/events
  • Transcribe audio in real time
  • Deliver updates to UIs and AI agents with minimal latency
  • Recover gracefully from errors

Our implementation:

  • Single HTTP server – Hosts REST API and Socket.IO WebSockets
  • Media bridge – Captures audio from Zoom and normalizes it
  • KVS – Fans out audio to consumers (ASR, recording, analytics)
  • SQS consumers – Handle session lifecycle, speaker changes, transcript segments
  • Pluggable transcription – Swap ASR providers without core changes
  • WebSocket streaming – Pushes updates to clients and agents

Canonical audio format for streaming ASR

For broad ASR compatibility, normalize to PCM s16le, 16 kHz, mono:

  • Widely supported by AWS Transcribe, Deepgram, AssemblyAI
  • Predictable bandwidth (~32 KB/s)
  • Small chunks (0.25–0.5 s) balance latency and reliability

Example conversion function:

// src/audioParams.ts
export const SAMPLE_RATE_HZ = 16000;
export const CHANNELS = 1;
export const BYTES_PER_SAMPLE = 2;

export function float32ToS16le(float32: Float32Array): Buffer {
  const buffer = Buffer.alloc(float32.length * BYTES_PER_SAMPLE);
  for (let i = 0; i < float32.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range
    let s = Math.max(-1, Math.min(1, float32[i]));
    s = s < 0 ? s * 0x8000 : s * 0x7fff;
    buffer.writeInt16LE(Math.round(s), i * BYTES_PER_SAMPLE);
  }
  return buffer;
}
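
To hit the 0.25–0.5 s chunk target mentioned above, it helps to re-buffer the decoded PCM into fixed-duration frames before handing it to the ASR provider. A minimal sketch using the constants above (the PcmChunker name is ours):

// src/pcmChunker.ts (sketch)
import { Transform, TransformCallback } from 'stream';
import { SAMPLE_RATE_HZ, CHANNELS, BYTES_PER_SAMPLE } from './audioParams';

// Re-buffers arbitrary PCM writes into fixed-duration chunks (default 250 ms).
export class PcmChunker extends Transform {
  private pending = Buffer.alloc(0);
  private readonly chunkBytes: number;

  constructor(chunkMs = 250) {
    super();
    // bytes per chunk = sample rate * bytes per sample * channels * seconds
    this.chunkBytes = Math.floor((SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * chunkMs) / 1000);
  }

  _transform(data: Buffer, _enc: BufferEncoding, cb: TransformCallback) {
    this.pending = Buffer.concat([this.pending, data]);
    while (this.pending.length >= this.chunkBytes) {
      this.push(this.pending.subarray(0, this.chunkBytes));
      this.pending = this.pending.subarray(this.chunkBytes);
    }
    cb();
  }

  _flush(cb: TransformCallback) {
    if (this.pending.length) this.push(this.pending); // emit any trailing partial chunk
    cb();
  }
}

In the ingest pipeline described later, you would pipe the FFmpeg PCM output through a PcmChunker instance before the transcription fan-out.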

REST + WebSockets in one process

A lean gateway can unify REST API and real‑time WebSockets on a single port. This simplifies deployment and coordination.

// src/server.ts
import http from 'http';
import express from 'express';
import { Server as SocketIOServer, Socket } from 'socket.io';

type Message = { type: string; payload: unknown };

const app = express();
app.use(express.json());

// Simple health endpoint
app.get('/health', (_req, res) => {
  res.json({ status: 'ok', timestamp: new Date().toISOString(), uptime: process.uptime() });
});

// Minimal REST API surface example
app.post('/users', (req, res) => {
  const { email, name } = req.body || {};
  if (!email || !name) return res.status(400).json({ error: 'email and name required' });
  // Insert into DB here...
  res.status(201).json({ id: 'user_123', email, name });
});

const server = http.createServer(app);
const io = new SocketIOServer(server, {
  cors: { origin: '*', credentials: true },
  allowEIO3: true, // network compatibility
});

type SessionEntry = {
  socket: Socket;
  queue: Message[];
};

const connected: Record<string, Record<string, SessionEntry>> = {};

io.use((socket, next) => {
  // Example handshake keys: userId and sessionId
  const { userId, sessionId } = socket.handshake.auth as { userId?: string; sessionId?: string };
  if (!userId || !sessionId) return next(new Error('Missing auth/session'));
  (socket as any).userId = userId;
  (socket as any).sessionId = sessionId;
  next();
});

io.on('connection', (socket) => {
  const userId = (socket as any).userId as string;
  const sessionId = (socket as any).sessionId as string;

  connected[userId] ??= {};
  const entry = connected[userId][sessionId] ?? { socket, queue: [] };
  entry.socket = socket;
  connected[userId][sessionId] = entry;

  while (entry.queue.length) {
    socket.emit('message', entry.queue.shift());
  }

  socket.on('disconnect', () => {
    delete connected[userId][sessionId];
    if (Object.keys(connected[userId] || {}).length === 0) {
      delete connected[userId];
    }
  });
});

export function sendToClient(userId: string, sessionId: string, msg: Message) {
  const entry = connected[userId]?.[sessionId];
  if (entry?.socket?.connected) {
    entry.socket.emit('message', msg);
  } else {
    if (!connected[userId]) connected[userId] = {};
    if (!connected[userId][sessionId]) connected[userId][sessionId] = { socket: undefined as any, queue: [] };
    connected[userId][sessionId].queue.push(msg);
  }
}

const PORT = process.env.PORT ? Number(process.env.PORT) : 3099;
server.listen(PORT, () => console.log(`HTTP+WS listening on :${PORT}`));

This setup gives you:

  • REST endpoints (e.g., health, user management)
  • WebSockets for real‑time transcript chunks, agent actions, and UI updates
  • In‑memory queuing by user/session to survive brief reconnects

Limitations: this in‑memory state is per process; it’s lost on restart and not horizontally scalable. For production, externalize queues/state with Redis, Kafka, or a message broker.
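
One way to externalize the per-session queue is a Redis list keyed by user and session. The sketch below uses ioredis; the key naming, TTL, and helper names are our own choices rather than a prescribed scheme.

// src/outbox.ts (sketch of a Redis-backed per-session outbox)
import Redis from 'ioredis';
import type { Socket } from 'socket.io';

type Message = { type: string; payload: unknown }; // same shape as in server.ts

const redis = new Redis(process.env.REDIS_URL!);

const outboxKey = (userId: string, sessionId: string) => `outbox:${userId}:${sessionId}`;

// Queue a message for a client that is currently disconnected.
export async function enqueueForClient(userId: string, sessionId: string, msg: Message) {
  const key = outboxKey(userId, sessionId);
  await redis.rpush(key, JSON.stringify(msg));
  await redis.expire(key, 60 * 60); // drop stale outboxes after an hour
}

// Replay queued messages when the client reconnects.
export async function drainToSocket(userId: string, sessionId: string, socket: Socket) {
  const key = outboxKey(userId, sessionId);
  // Pop one message at a time so a crash mid-drain loses at most one message.
  for (let raw = await redis.lpop(key); raw; raw = await redis.lpop(key)) {
    socket.emit('message', JSON.parse(raw) as Message);
  }
}

For multi-node WebSocket delivery, you would also pair this with a shared Socket.IO adapter (for example, the Redis adapter) so that emits reach sockets connected to other instances.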

Ingestion and fan‑out with Kinesis Video Streams

KVS GetMedia returns a Matroska (MKV) container, typically with Opus/AAC audio—not raw PCM. You must demux and decode before feeding ASR.

A resilient ingest pipeline separates concerns:

  • A single KVS reader produces MKV fragments
  • An FFmpeg demux/decode step converts MKV→PCM s16le 16 kHz mono
  • PassThrough streams fan out PCM to:
    • Recording (optional)
    • Transcription
  • Lifecycle control handles reconnects, cleanup, and session finalization

// src/processKvs.ts (practical excerpt)
import { PassThrough } from 'stream';
import { spawn } from 'child_process';
import { KinesisVideoClient, GetDataEndpointCommand } from '@aws-sdk/client-kinesis-video';
import { KinesisVideoMediaClient, GetMediaCommand } from '@aws-sdk/client-kinesis-video-media';
import { SAMPLE_RATE_HZ } from './audioParams';

async function openKvsReadable(streamName: string) {
  const kv = new KinesisVideoClient({});
  const { DataEndpoint } = await kv.send(new GetDataEndpointCommand({
    StreamName: streamName,
    APIName: 'GET_MEDIA',
  }));
  const kvm = new KinesisVideoMediaClient({ endpoint: DataEndpoint });
  const media = await kvm.send(new GetMediaCommand({
    StreamName: streamName,
    StartSelector: { StartSelectorType: 'NOW' },
  }));
  return media.Payload as unknown as NodeJS.ReadableStream; // MKV container
}

export async function startIngest(sessionId: string, opts: { record: boolean; transcribe: boolean }) {
  const kvsReadable = await openKvsReadable(`meeting-${sessionId}`);

  // FFmpeg demux + decode MKV → PCM s16le 16k mono
  const ffmpeg = spawn('ffmpeg', [
    '-hide_banner',
    '-loglevel', 'error',
    '-f', 'matroska',
    '-i', 'pipe:0',
    '-ac', '1',
    '-ar', String(SAMPLE_RATE_HZ),
    '-f', 's16le',
    'pipe:1',
  ], { stdio: ['pipe', 'pipe', 'inherit'] });

  kvsReadable.pipe(ffmpeg.stdin!);

  const pcm = new PassThrough({ highWaterMark: 1 * 1024 * 1024 });
  ffmpeg.stdout!.pipe(pcm);

  // Fan out to recorder (optional)
  if (opts.record) {
    const recorder = new PassThrough();
    pcm.pipe(recorder, { end: false });
    recorder.on('data', (chunk) => {
      // persist PCM frames or re-encode as needed
    });
  }

  // Fan out to transcription
  if (opts.transcribe) {
    const asr = new PassThrough();
    pcm.pipe(asr, { end: false });
    asr.on('data', (chunk: Buffer) => {
      // send PCM frames to your ASR provider
      // make sure to chunk on frame boundaries (~0.25–0.5 s)
    });
  }

  const cleanup = (why: string, err?: unknown) => {
    try { ffmpeg.stdin?.destroy(); } catch {}
    try { ffmpeg.stdout?.destroy(); } catch {}
    try { ffmpeg.kill('SIGKILL'); } catch {}
    // finalize session; persist state
  };

  ffmpeg.on('close', (code) => {
    if (code !== 0) {
      // bounded reconnects if appropriate
    }
    cleanup('ffmpeg-exit');
  });

  pcm.on('error', (e) => cleanup('pcm-error', e));
}

Use bounded reconnects (e.g., up to 3 attempts) when a stream ends unexpectedly, and persist final session state on the last cleanup. Keep listeners tight and remove them to avoid leaks.
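
As a concrete example, a minimal retry wrapper around startIngest might look like the sketch below. It assumes startIngest rejects on a terminal failure; the retry budget and backoff numbers are illustrative.

// src/ingestRetry.ts (sketch)
import { startIngest } from './processKvs';

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Restart ingest with exponential backoff plus jitter, bounded to a few attempts.
export async function startIngestWithRetries(
  sessionId: string,
  opts: { record: boolean; transcribe: boolean },
  maxAttempts = 3,
) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await startIngest(sessionId, opts);
      return; // ingest started cleanly
    } catch (err) {
      if (attempt === maxAttempts) throw err; // out of budget: surface and finalize the session
      const backoffMs = 1000 * 2 ** (attempt - 1) + Math.random() * 500; // jitter
      console.warn(`ingest attempt ${attempt} failed, retrying in ${Math.round(backoffMs)}ms`, err);
      await sleep(backoffMs);
    }
  }
}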

Decoupled event consumers with SQS

To keep your streaming loop lean, offload orchestration to message queues. A simple setup runs three consumers in parallel:

  • session‑start events
  • speaker‑change events
  • transcript segment events

// src/consumers.ts (simplified, AWS SDK v3)
import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } from '@aws-sdk/client-sqs';

async function runConsumer(queueUrl: string, handler: (m: any) => Promise<void>, signal: AbortSignal) {
  const sqs = new SQSClient({});
  while (!signal.aborted) {
    const resp = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: queueUrl,
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 20, // long poll
      VisibilityTimeout: 60,
    }));

    const messages = resp.Messages || [];
    for (const msg of messages) {
      try {
        const body = JSON.parse(msg.Body || '{}');
        await handler(body);
        await sqs.send(new DeleteMessageCommand({ QueueUrl: queueUrl, ReceiptHandle: msg.ReceiptHandle! }));
      } catch (err) {
        // log and optionally leave message for retry/DLQ
      }
    }
  }
}

export function startConsumers(signal: AbortSignal) {
  runConsumer(process.env.SESSION_START_Q!, handleSessionStart, signal);
  runConsumer(process.env.SPEAKER_CHANGE_Q!, handleSpeakerChange, signal);
  runConsumer(process.env.TRANSCRIPT_SEGMENT_Q!, handleTranscriptSegment, signal);
}

async function handleSessionStart(e: any) {
  // create session, start ingest
}

async function handleSpeakerChange(e: any) {
  // update diarization context
}

async function handleTranscriptSegment(e: any) {
  // forward to WebSocket clients & agent pipeline
}

Tie this into graceful shutdown. On SIGINT/SIGTERM, abort the consumers, drain streams, and close the HTTP server. This reduces in‑flight loss.
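
A minimal sketch of that wiring, assuming the Socket.IO server (io) and startConsumers are exported from the modules above:

// src/shutdown.ts (sketch)
import { io } from './server';              // assumes io is exported from the server module
import { startConsumers } from './consumers';

const controller = new AbortController();
startConsumers(controller.signal);

function shutdown(signal: string) {
  console.log(`${signal} received, shutting down`);
  controller.abort();                        // consumer loops exit after the current poll
  io.close(() => process.exit(0));           // disconnects clients and closes the attached HTTP server
  setTimeout(() => process.exit(1), 10_000).unref(); // hard deadline if draining hangs
}

process.on('SIGINT', () => shutdown('SIGINT'));
process.on('SIGTERM', () => shutdown('SIGTERM'));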

Transcription provider abstraction

Swapping between AWS Transcribe, Deepgram, and AssemblyAI is easier with a small adapter layer that accepts the same PCM frames and emits normalized transcript events. Include per‑speaker metadata if you split meeting audio by participant. Mixed and per‑speaker modes enable different downstream use cases:

  • Mixed: capture a single meeting mix. Good for basic notes and summaries.
  • Per speaker: richer diarization and attribution for action items and CRM updates.

Pass hinting (language, domain, custom vocab) into the provider to increase accuracy.
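
A thin interface keeps provider differences out of the streaming loop. The shape below is our own and the names are illustrative, not a standard API:

// src/transcriptionProvider.ts (sketch of a provider-agnostic contract)
export interface TranscriptSegment {
  sessionId: string;
  speakerId?: string;    // present when audio is split per participant
  text: string;
  isFinal: boolean;      // partial vs. finalized hypothesis
  startMs: number;
  endMs: number;
}

export interface TranscriptionHints {
  languageCode?: string;       // e.g. 'en-US'
  customVocabulary?: string[]; // domain terms, product names, etc.
}

// Each adapter (Transcribe, Deepgram, AssemblyAI, ...) maps its native events
// onto TranscriptSegment so downstream consumers never change.
export interface TranscriptionProvider {
  start(sessionId: string, hints?: TranscriptionHints): Promise<void>;
  // Accepts PCM s16le 16 kHz mono frames, chunked on frame boundaries.
  sendAudio(sessionId: string, pcm: Buffer): void;
  stop(sessionId: string): Promise<void>;
  onSegment(handler: (segment: TranscriptSegment) => void): void;
}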

Compliance, distribution, and review on Zoom

A critical part of shipping a meeting bot is compliance with Zoom’s developer policies:

  • Choose the right SDK. Meeting SDK apps embed the Zoom UI and can join meetings as an application participant, subject to review and distribution rules.
  • Plan for review. Zoom requires a feature review for apps that access certain capabilities or are distributed broadly [4].
  • Use official APIs for provisioning and post‑meeting workflows. The Zoom REST API covers users, meetings, recordings, chat, and webhooks [5].
  • Track SDK changes. Keep an eye on SDK changelogs for breaking changes and new capabilities [7].

Community anecdotes are helpful but not authoritative. Avoid screen automation that tries to “auto‑open Zoom and join a meeting” as a human would; these approaches are unreliable and often against policy. Build on sanctioned SDKs and APIs instead.

Production hardening checklist

The reference setup above works well for a single Node.js process. Before production:

  • Replace in‑memory session registries with a shared store (Redis) and enforce a state machine for lifecycle transitions.
  • Externalize queues and delivery. Use Kafka/NATS or a managed service to handle backpressure and persistence.
  • Add authentication and tenancy. Secure your WebSocket handshake and REST endpoints; map users to sessions/meetings.
  • Implement backoff and jitter on reconnect loops.
  • Add metrics and tracing for queue depth, lag, transcription latency, and end‑to‑end time to summary (see the sketch after this list).
  • Budget for bandwidth. Raw 16 kHz mono s16le is ~32 KB/s per stream; mixed + per‑speaker streams add up quickly.
  • Handle PII and consent. Make sure participants know an assistant is present. Align with Zoom’s and your org’s privacy policies.
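
For the metrics item above, a minimal sketch with prom-client; the metric names and buckets are our own:

// src/metrics.ts (sketch)
import { Histogram, Gauge, collectDefaultMetrics, register } from 'prom-client';

collectDefaultMetrics();

// Time from audio capture to transcript segment delivery.
export const transcriptionLatency = new Histogram({
  name: 'transcription_segment_latency_seconds',
  help: 'Time from audio capture to transcript segment delivery',
  buckets: [0.25, 0.5, 1, 2, 5, 10],
});

// Approximate number of messages waiting per SQS queue.
export const queueDepth = new Gauge({
  name: 'sqs_queue_depth',
  help: 'Approximate number of messages waiting per queue',
  labelNames: ['queue'],
});

// Exposed on the existing Express app, e.g.:
// app.get('/metrics', async (_req, res) => {
//   res.set('Content-Type', register.contentType);
//   res.send(await register.metrics());
// });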

Quick end‑to‑end flow

[Diagram: end-to-end sequence]

Putting it all together:

  1. User creates a meeting session via REST; the backend provisions a session record and any required Zoom SDK tokens.
  2. The media bridge captures meeting audio (Meeting SDK raw data, RTMP, or SIP) and publishes to KVS for durability and fan‑out.
  3. The ingest pipeline reads KVS, demuxes/decodes with FFmpeg to PCM s16le 16 kHz mono.
  4. PCM frames are pushed to the transcription provider; segments are emitted with speaker metadata.
  5. Segments are pushed to WebSocket clients (e.g., a dashboard) and to an agent runtime (via MCP or your own tool interface).
  6. SQS consumers finalize the session, persist artifacts, and clean up KVS resources with bounded retries.

This separation—capture, ingestion, delivery, and automation—keeps each piece focused and debuggable.

Final thoughts

“Meeting bot” is a broad label. On Zoom, success hinges on picking the right SDK path, then designing a streaming backbone that’s resilient, low‑latency, and friendly to AI agents. A single‑port HTTP+WebSocket server, a sanctioned media capture bridge, KVS fan‑out (with proper demux/decoding), and decoupled SQS consumers form a compact but powerful core.

You can start local with a lean API and grow into production with shared state, persistent queues, and a formal lifecycle. Keep Zoom’s review requirements in mind, respect participant consent, and watch your bandwidth budget. With that foundation, you can deliver fast transcripts, helpful summaries, and action items that actually get done.

Sources

