This content originally appeared on DEV Community and was authored by Angad Singh
The journey of building production-ready voice agents with Pipecat + Attendee, and the infrastructure changes I had to implement.
The Problem I Was Trying to Solve
Building a voice agent that could actually join Zoom/Google Meet meetings and have conversations.
The challenge?
I needed to integrate Pipecat (a voice AI framework) with Attendee (a meeting bot infrastructure), but there was a fundamental mismatch in how they handled audio data.
The Original Problem: Mixed Audio vs. Speaker Identification
When I first started, Attendee was only sending mixed audio packets – basically one big audio stream with everyone’s voices combined. This was a nightmare for voice AI because:
- No speaker identification – I couldn’t tell who was talking
- Poor transcription quality – Mixed audio is harder to transcribe accurately
- Context loss – The AI couldn’t respond appropriately without knowing the speaker
I had two options:
- Per-participant audio packets – Individual audio streams per person
- Transcription frames – Pre-processed transcription with speaker info
I chose transcription frames because they’re more efficient and give me exactly what I need: clean text with speaker identification.
The Infrastructure Changes I Made
1. Added Transcription Frame Support to Attendee
I had to modify the Attendee infrastructure locally to support transcription frames. Here’s what I added:
# In bot_controller.py - New method to send transcription frames
def send_transcription_to_pipecat(self, speaker_id: str, speaker_name: str, text: str, is_final: bool, timestamp_ms: int, duration_ms: int):
    """Send transcription frame with speaker information to Pipecat via WebSocket"""
    if not self.websocket_audio_client:
        return

    if not self.websocket_audio_client.started():
        logger.info("Starting websocket audio client for transcription...")
        self.websocket_audio_client.start()

    payload = transcription_websocket_payload(
        speaker_id=speaker_id,
        speaker_name=speaker_name,
        text=text,
        is_final=is_final,
        timestamp_ms=timestamp_ms,
        duration_ms=duration_ms,
        bot_object_id=self.bot_in_db.object_id,
    )
    self.websocket_audio_client.send_async(payload)
    logger.info(f"Sent transcription to Pipecat: [{speaker_name}]: {text}")

# Added logic to disable mixed audio when transcription frames are enabled
def disable_mixed_audio_packets(self):
    """Check if mixed audio packets should be disabled in favor of transcription frames"""
    audio_settings = self.bot_in_db.settings.get("audio_settings", {})
    return audio_settings.get("disable_mixed_audio_packets", True)

def add_mixed_audio_chunk_callback(self, chunk: bytes):
    # Skip WebSocket transmission if mixed audio packets are disabled
    if self.disable_mixed_audio_packets():
        logger.debug("Mixed audio packets disabled, skipping WebSocket transmission")
        return
    # ... rest of the method
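For context, here is roughly how that method gets invoked. The callback name and the shape of the `utterance` dict below are hypothetical stand-ins for whatever hook delivers closed-caption text inside Attendee; the real wiring depends on Attendee's internals.

# Hypothetical wiring sketch - the callback name and utterance fields are illustrative,
# not Attendee's actual API. Whatever delivers caption/transcript updates calls the
# new method with the speaker metadata attached.
def on_transcript_utterance(self, utterance: dict):
    self.send_transcription_to_pipecat(
        speaker_id=utterance.get("speaker_uuid", "unknown"),
        speaker_name=utterance.get("speaker_name", "Unknown"),
        text=utterance.get("text", ""),
        is_final=utterance.get("is_final", True),
        timestamp_ms=utterance.get("timestamp_ms", 0),
        duration_ms=utterance.get("duration_ms", 0),
    )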
2. Created New WebSocket Payload Format
I had to create a new payload format for transcription frames:
# In websocket_payloads.py - New transcription payload
def transcription_websocket_payload(
    speaker_id: str,
    speaker_name: str,
    text: str,
    is_final: bool,
    timestamp_ms: int,
    duration_ms: int,
    bot_object_id: str,
) -> dict:
    """Package transcription data with speaker information for websocket transmission."""
    return {
        "trigger": RealtimeTriggerTypes.type_to_api_code(RealtimeTriggerTypes.TRANSCRIPTION_FRAME),
        "bot_id": bot_object_id,
        "data": {
            "speaker_id": speaker_id,
            "speaker_name": speaker_name,
            "text": text,
            "is_final": is_final,
            "timestamp_ms": timestamp_ms,
            "duration_ms": duration_ms,
        },
    }
3. Updated Bot Configuration
I added a new setting to control this behavior:
# In serializers.py - New audio setting
"disable_mixed_audio_packets": {
"type": "boolean",
"description": "Whether to disable mixed audio packets in favor of transcription frames. When True, the bot will use per-participant transcription frames instead of mixed audio for better speaker identification.",
"default": True
}
Why I Chose Transcription Frames Over Per-Participant Audio
This was a key decision. Here’s why I went with transcription frames:
Per-Participant Audio Problems:
- Higher bandwidth – Multiple audio streams instead of one
- More complex processing – Need to handle multiple audio inputs
- Latency issues – More data to process and transmit
- Still need transcription – Would have to do STT on each stream
Transcription Frames Benefits:
- Lower bandwidth – Just text data instead of audio (rough numbers in the sketch below)
- Speaker identification built-in – Attendee already knows who’s talking
- Better accuracy – Attendee’s transcription is optimized for meetings
- Simpler processing – Just handle text, not audio.
The transcription frames approach was cleaner and more efficient for my use case.
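To put the bandwidth point in rough numbers, here is a back-of-the-envelope sketch. It assumes 16 kHz, 16-bit mono PCM per participant and roughly 140 words per minute of speech; the exact formats and rates in a real meeting will differ.

# Back-of-the-envelope comparison: per-participant PCM audio vs. transcription text.
# Assumes 16 kHz, 16-bit mono PCM per stream and ~140 wpm of speech - illustrative only.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2
participants = 5

audio_bytes_per_sec = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * participants
words_per_sec = 140 / 60
text_bytes_per_sec = words_per_sec * 6  # ~6 bytes per word incl. spaces, one active speaker

print(f"Per-participant audio: ~{audio_bytes_per_sec / 1024:.0f} KiB/s across {participants} streams")
print(f"Transcription text:    ~{text_bytes_per_sec:.0f} B/s plus a little JSON overhead per frame")

Even with generous JSON overhead on every frame, the text path is orders of magnitude lighter than shipping raw audio per participant.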
Building the Pipecat Integration
1. Custom Frame Serializer
I had to create a custom serializer to handle the new transcription format:
class AttendeeFrameSerializer(FrameSerializer):
    async def deserialize(self, data: str | bytes) -> Frame | None:
        json_data = json.loads(data)

        # Handle the new transcription frames from my modified Attendee
        if json_data.get("trigger") == "realtime_audio.transcription":
            transcription_data = json_data["data"]
            speaker_id = transcription_data.get("speaker_id", "unknown")
            speaker_name = transcription_data.get("speaker_name", "Unknown")
            text = transcription_data.get("text", "")
            timestamp_ms = transcription_data.get("timestamp_ms", 0)

            # Create transcription frame with speaker info
            transcription_frame = TranscriptionFrame(
                text=text,
                user_id=speaker_id,
                timestamp=timestamp_ms,
            )
            transcription_frame.speaker_name = speaker_name
            return transcription_frame

        # Fallback to per-participant audio if needed
        elif json_data.get("trigger") == "realtime_audio.per_participant":
            ...  # handle per-participant audio

        return None
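To show where this serializer plugs in: a minimal sketch using Pipecat's WebSocket server transport, which accepts a custom serializer in its params. Import paths and parameter names shift between Pipecat releases, so treat this as a sketch to check against your installed version rather than a drop-in.

# Sketch: hand the custom serializer to a Pipecat WebSocket server transport.
# Import paths and parameter names vary across Pipecat versions - verify locally.
from pipecat.transports.network.websocket_server import (
    WebsocketServerParams,
    WebsocketServerTransport,
)

transport = WebsocketServerTransport(
    host="0.0.0.0",
    port=8765,  # example port; Attendee's websocket audio client connects here
    params=WebsocketServerParams(
        serializer=AttendeeFrameSerializer(),  # the class defined above
        audio_out_enabled=True,  # the bot still sends synthesized speech back out
    ),
)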
2. Custom Transcription Processor
I built a processor to add speaker context to the LLM:
class TranscriptionFrameProcessor(FrameProcessor):
    async def process_frame(self, frame, direction):
        await super().process_frame(frame, direction)

        if isinstance(frame, TranscriptionFrame):
            speaker_name = getattr(frame, 'speaker_name', 'Unknown Speaker')

            # Add speaker context for the LLM
            if frame.text and frame.text.strip():
                original_text = frame.text
                frame.text = f"[{speaker_name}]: {original_text}"
                logger.info(f"🎤 Processing transcription from {speaker_name}: {frame.text}")

        # Pass every frame downstream, not just transcriptions
        await self.push_frame(frame, direction)
3. Optimized Pipeline
I fine-tuned the pipeline for natural conversation:
# Optimized VAD parameters for responsive detection
vad_params = VADParams(
    confidence=0.4,   # Lower confidence for more responsive detection
    start_secs=0.15,  # Faster response time
    stop_secs=0.25,   # Shorter stop time for natural flow
    min_volume=0.25,  # Lower volume threshold
)
# Use transcription processor instead of STT (transcription comes from Attendee)
transcription_processor = TranscriptionFrameProcessor()
logger.info("Using Attendee transcription with speaker diarization instead of local STT")
The Result: A Working Voice Agent
After all this work, I ended up with a voice agent that:
- Joins real meetings via Attendee’s infrastructure
- Knows who’s speaking thanks to my transcription frame modifications
- Responds naturally with optimized conversation flow
- Handles real-time audio with low latency
- Has a web interface for easy configuration
The Complete Flow:
1. User configures the agent via web interface
2. Bot joins meeting via Attendee (with my modifications)
3. Attendee sends transcription frames (from closed captions) with speaker info
4. Pipecat processes the transcription with speaker context
5. LLM generates appropriate responses
6. TTS converts to natural speech
7. Agent responds in the meeting
What I Learned
1. Infrastructure Changes Are Sometimes Necessary
The original Attendee infrastructure wasn’t designed for voice AI use cases. I had to modify it to support transcription frames, which was the right architectural decision.
2. Speaker Context Is Everything
Without knowing who’s speaking, a voice agent is just a chatbot. The transcription frame approach gave me perfect speaker identification.
3. Real-Time Performance Matters
Every millisecond of latency affects conversation quality. I spent a lot of time optimizing VAD parameters and buffer sizes.
4. Testing Is Critical
I built comprehensive tests to validate the integration:
def test_complete_flow_simulation(self):
    """Test the complete flow from transcription message to processed text frame."""
    # Create transcription message as sent by my modified Attendee
    transcription_message = {
        "trigger": "realtime_audio.transcription",
        "data": {
            "speaker_id": "participant_123",
            "speaker_name": "John Doe",
            "text": "Hello, this is a test transcription.",
            "is_final": True,
            "timestamp_ms": 1703123456789,
        },
    }

    # Test the complete pipeline
    json_message = json.dumps(transcription_message)
    transcription_frame = asyncio.run(self.serializer.deserialize(json_message))

    # Verify it works end-to-end
    self.assertIsInstance(transcription_frame, TranscriptionFrame)
The Bottom Line
This is an active development project where I’m iterating, experimenting, optimizing, and adding new features. Building this voice agent isn’t just about learning Pipecat – it’s about understanding how to integrate complex systems and make architectural decisions that actually work.
The transcription frame approach was cleaner, more efficient, and gave me exactly what I needed for natural conversation.
This is a work in progress, and I’m always open to questions and feedback!