Building InboxHiiv: Event-Driven Podcast Processing with Firebase Functions and Gemini AI



This content originally appeared on DEV Community and was authored by Akhil

InboxHiiv was built to solve a straightforward problem: people want to consume podcast content but don’t always have time to listen. The solution automatically converts podcast episodes into comprehensive newsletters using AI. Here’s how the system was engineered.

What was built

  • Event-driven processing pipeline using Firebase Functions and Firestore triggers
  • Parallel AI content generation with Google’s Gemini AI (via Vertex AI)
  • Performance optimization: Reduced processing time from 8-10 minutes to 3-4 minutes
  • Modular component system for transcript, summary, newsletter, and blog generation
  • Production-grade error handling with retry logic and resource management

Architecture Overview

InboxHiiv runs on a four-stage event-driven pipeline built on Firebase Functions and Google’s Gemini AI (accessed via Vertex AI). Each stage operates independently, communicating through Firestore documents.

┌──────────────────────────────────────────────────────┐
│                   Next.js Web App                    │
└──────────────────────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────┐
│              PROCESSING PIPELINE                     │
├──────────────────────────────────────────────────────┤
│  Triggers → Upload → AI Processing → Distribution    │
└──────────────────────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────┐
│                 INFRASTRUCTURE                       │
├─────────────┬────────────┬────────────┬──────────────┤
│  Firestore  │  Functions │   Storage  │  Gemini AI   │
└─────────────┴────────────┴────────────┴──────────────┘

The Processing Pipeline

Stage 1: Triggers

The system initiates processing through two mechanisms. A scheduled function polls RSS feeds every few hours to detect new episodes. Users can also manually trigger processing through an authenticated endpoint.

Both paths create a notifier document in Firestore. This document serves as a work queue entry, containing the podcast ID, episode GUID, and audio URL. Firestore documents were chosen over Pub/Sub because they provide persistent state for debugging and allow tracking of processing history.
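A minimal sketch of what one of those notifier documents might look like. The `podcastId`, `episodeGuid`, and `audioUrl` fields come from the description above; the other fields and the `buildNotifierDoc` helper name are illustrative assumptions:

```javascript
// Hypothetical shape of a notifier document; podcastId, episodeGuid, and
// audioUrl are from the article, the remaining fields are assumptions.
function buildNotifierDoc(podcastId, episodeGuid, audioUrl) {
  return {
    podcastId,
    episodeGuid,
    audioUrl,
    processing: false, // lock flag consumed by the upload stage
    createdAt: new Date().toISOString(),
  };
}

// The scheduled poller (or the authenticated endpoint) would then write it:
// await db.collection("file_upload_notifier").add(buildNotifierDoc(id, guid, url));
```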

Stage 2: Audio Upload

When a notifier document appears, the upload function springs into action. The function implements document-level locking to prevent duplicate processing – a simple boolean flag that’s proven reliable at this scale.

Notifier Created → Lock Document → Download Audio → Upload to GCS → Create Episode Doc
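The locking step can be sketched as a pure check plus a Firestore transaction (the `processing` field name and `claimLock` helper are assumptions, not the production code):

```javascript
// Sketch of the document-level lock. claimLock returns the updated data only
// if this invocation won the race; otherwise it returns null.
function claimLock(docData) {
  if (docData.processing) return null; // another invocation got here first
  return { ...docData, processing: true };
}

// Inside the trigger, the check would run in a Firestore transaction so the
// read and write are atomic, e.g.:
// await db.runTransaction(async (tx) => {
//   const snap = await tx.get(notifierRef);
//   const claimed = claimLock(snap.data());
//   if (!claimed) return;                 // duplicate trigger; bail out
//   tx.update(notifierRef, { processing: true });
// });
```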

The system streams audio downloads directly to Cloud Storage, avoiding memory constraints for large files (podcasts regularly exceed 100MB). The episode document contains all metadata and serves as the trigger for the next stage.

Stage 3: AI Processing

This is where the sophisticated processing happens. The component processor reads the episode document and orchestrates multiple AI tasks in parallel based on configuration.

┌─────────────────────────────────────────────────────┐
│              Episode Document Created               │
└─────────────────────────────────────────────────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
    [Transcript]  [Combined Summary]  [Newsletter]
          │               │               │
          └───────────────┼───────────────┘
                          ▼
              Gemini AI (Parallel Processing)

The system uses a modular component architecture where each content type is independently configurable. The processorConfig.js defines four distinct processors:

  • Transcript: Full conversation with speaker labels and timestamps
  • Combined Summary: Chapters, executive summary, key takeaways, quotes, and social media posts
  • Newsletter: Email-ready content with subject lines, preheaders, and formatted body
  • SEO Blog Post: Optimized content with meta descriptions and target keywords

Each component has its own prompt engineering and JSON schema validation. All AI calls run concurrently using Promise.all(), cutting end-to-end processing time well below sequential execution. The system validates responses against predefined schemas, ensuring consistent output structure across all content types.


Stage 4: Distribution

Email distribution integrates with Firebase’s email extension but maintains a custom templating layer. When a user requests an email recap, the system generates an HTML template using server-side rendered components and drops it into the mail collection.

The system respects user timezone preferences and implements rate limiting based on subscription tier.
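Dropping a document into the mail collection can be sketched as a small builder. The `to` / `message.subject` / `message.html` field shape follows the Firebase Trigger Email extension's documented format; the `buildMailDoc` helper name is hypothetical:

```javascript
// Build a document in the shape the Firebase "Trigger Email" extension
// expects; newsletter fields match the schema shown later in the article.
function buildMailDoc(recipient, newsletter) {
  return {
    to: recipient,
    message: {
      subject: newsletter.subject,
      html: newsletter.content,
    },
  };
}

// In the function, this would be written to the mail collection, e.g.:
// await db.collection("mail").add(buildMailDoc(user.email, newsletter));
```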

Frontend Architecture

The web application uses Next.js with the App Router. Routes are organized into logical groups:

app/
├── (main)/          # Authenticated routes
│   ├── dashboard/   # Subscription management
│   └── episodes/    # Content viewing
└── (public)/        # Landing, pricing, etc.

For state management, the application uses React Query for server state with aggressive caching strategies. Authentication flows through Firebase Auth with custom hooks wrapping the SDK.

Firestore-Triggered Architecture

One of the most elegant design decisions was leveraging Firestore’s built-in triggers for the entire event-driven pipeline. Instead of managing a complex messaging system or orchestration layer, data is simply organized into specific collections, and Firebase Functions automatically respond to document changes:

/file_upload_notifier/{doc} → triggers audioUploader function
/podcasts/{podcastId}/episodes/{doc} → triggers processor function
/mail/{doc} → triggers email extension

This pattern provides a truly event-driven architecture with minimal overhead. No message queues to manage, no orchestration services to configure – just documents appearing in collections and functions responding. The beauty is that each function only knows about its input and output, making the system highly modular and testable.

For example, when a notifier document is created, the audioUploader function automatically triggers, processes the audio, then creates an episode document. That episode document creation immediately triggers the processor function, which generates all the AI content. It’s event-driven by default, with Firestore handling all the plumbing.
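The trigger wiring can be sketched with the 2nd-gen Firestore triggers from `firebase-functions`. The collection paths follow the mapping above; the export names and handler bodies are placeholders, not the production code:

```javascript
const { onDocumentCreated } = require("firebase-functions/v2/firestore");

exports.audioUploader = onDocumentCreated(
  "file_upload_notifier/{docId}",
  async (event) => {
    const notifier = event.data.data(); // { podcastId, episodeGuid, audioUrl }
    // lock the doc, stream the audio to Cloud Storage, create the episode doc
  }
);

exports.componentProcessor = onDocumentCreated(
  "podcasts/{podcastId}/episodes/{docId}",
  async (event) => {
    const episode = event.data.data();
    // fan out transcript / summary / newsletter / blog generation in parallel
  }
);
```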

Why Firebase?

Firebase provides an integrated platform that accelerated our development. Authentication, database, storage, and functions all work together out of the box. Firestore’s real-time listeners enable instant UI updates when processing completes.

The serverless model aligns perfectly with podcast publishing patterns – sporadic bursts of activity when new episodes drop, followed by quiet periods. As load rises and falls, the platform scales function instances up and down automatically.

Why Gemini AI via Vertex AI?

After evaluating OpenAI, Anthropic, and Google’s Gemini AI (via Vertex AI), the team chose Gemini for five key reasons:

  1. Unified API for both transcription and text generation
  2. Native GCP integration with the Firebase storage layer
  3. Large context windows – handles long podcast episodes without truncation
  4. Intelligent model selection – auto-selects between Gemini Flash and Pro based on token count and complexity
  5. Built-in JSON schema validation – responses are validated server-side before returning

Gemini’s transcription quality matches dedicated services, and the LLM performs exceptionally well for structured content generation. Most importantly, Vertex AI’s schema validation eliminates the need for client-side response parsing and validation, reducing error handling complexity.

Security Implementation

Authentication flows through Firebase Auth with email verification required. We implement authorization at the Firestore rules level, ensuring users can only access their own data.

Audio files in Cloud Storage use signed URLs with 1-hour expiration. This prevents hotlinking while allowing temporary access for processing.
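The one-hour signed URL can be produced with `@google-cloud/storage`'s `getSignedUrl()`. A small options helper keeps the expiry logic testable (the `signedUrlOptions` helper name is an assumption):

```javascript
// Options object for @google-cloud/storage's file.getSignedUrl(); the
// one-hour expiry matches the article.
function signedUrlOptions(now = Date.now()) {
  return {
    version: "v4",
    action: "read",
    expires: now + 60 * 60 * 1000, // 1 hour
  };
}

// Usage inside a function handler:
// const [url] = await bucket.file(audioPath).getSignedUrl(signedUrlOptions());
```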

Rate limiting happens at multiple layers – Cloud Armor for DDoS protection, Firebase Functions for API endpoints, and application logic for feature-specific limits.

Error Handling and Reliability

The system implements comprehensive error handling across the pipeline:

AI Processing Resilience: While Gemini’s schema validation reduces malformed responses, the system still implements retry logic with exponential backoff for rate limiting and temporary failures. The component processor gracefully handles partial failures – if one content type fails, others can still complete.
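A retry-with-exponential-backoff helper along these lines would cover the rate-limit and transient-failure cases (a sketch; the delay values and retry count are assumptions, not InboxHiiv's exact settings):

```javascript
// Exponential backoff: 1s, 2s, 4s, ... for attempts 0, 1, 2, ...
function backoffDelay(attempt, baseMs = 1000) {
  return baseMs * 2 ** attempt;
}

// Retry an async operation, waiting backoffDelay() between attempts and
// rethrowing the last error once retries are exhausted.
async function withRetry(fn, { retries = 3, baseMs = 1000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt, baseMs)));
    }
  }
}
```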

Stream Processing: Large audio files (3+ hour podcasts) are handled through streaming downloads with configurable timeouts. The system avoids loading entire files into memory, instead streaming directly from source URLs to Cloud Storage.

Resource Management: Firebase Functions use an in-memory filesystem where files consume available memory. The system implements strict cleanup procedures to prevent memory leaks, especially important for concurrent episode processing.

Document-Level Locking: The uploader implements document-level locking to prevent duplicate processing when multiple triggers fire simultaneously – a simple but effective solution at this scale.

Technical Implementation Highlights

Prompt Engineering and Schema Design

Each component processor uses carefully crafted prompts with strict JSON schema validation. For example, our newsletter component generates structured content with six distinct sections, while our transcript processor creates timestamped segments with speaker identification.

// Example from processorConfig.js
newsletter: {
  prompt: `Using the podcast's sub-components (such as chapters, executive summary, 
           key takeaways, and key quotes), generate newsletter content.
  Requirements:
  - Provide a catchy subject line.
  - Create a preheader text that teases the newsletter content.
  - Write an engaging introduction.
  - Compose the main content by integrating the provided sub-components.
  - Include a compelling call-to-action.
  - Generate a footer with contact details and social media links.
  Response must be valid JSON matching the specified schema.`,
  schema: {
    type: "object",
    properties: {
      subject: { type: "string" },
      preheader: { type: "string" },
      introduction: { type: "string" },
      content: { type: "string" },
      callToAction: { type: "string" },
      footer: { type: "string" }
    },
    required: ["subject", "preheader", "introduction", "content", "callToAction", "footer"]
  }
}

Concurrent Processing Architecture

The component processor leverages JavaScript’s async/await patterns to process multiple content types simultaneously:

const results = await Promise.all([
  processComponent('transcript', audioUrl),
  processComponent('combinedSummary', audioUrl),
  processComponent('newsletter', audioUrl),
  processComponent('seoBlogPost', audioUrl)
]);

This parallel approach reduces processing time from 8-10 minutes (sequential) to 3-4 minutes (parallel) for typical podcast episodes.
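One caveat worth noting: Promise.all() rejects as soon as any single call fails. For the graceful partial-failure behaviour described in the error-handling section, a Promise.allSettled() variant keeps whichever components succeed (a hypothetical sketch; `processAll` is an illustrative name, `processComponent` is as in the snippet above):

```javascript
// Run every component concurrently, but collect successes and failures
// separately instead of failing the whole batch on the first error.
async function processAll(components, audioUrl, processComponent) {
  const settled = await Promise.allSettled(
    components.map((name) => processComponent(name, audioUrl))
  );
  const results = {};
  const failures = {};
  settled.forEach((outcome, i) => {
    if (outcome.status === "fulfilled") results[components[i]] = outcome.value;
    else failures[components[i]] = outcome.reason;
  });
  return { results, failures };
}
```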

Configuration-Driven Processing

The modular design allows for easy adjustment of content generation by modifying configuration files rather than code. Each podcast can have different processing requirements – some need only summaries, while enterprise customers get the full content suite.
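Per-podcast configuration might look something like this (the podcast IDs, field names, and `componentsFor` helper are all hypothetical; only the component names come from the article):

```javascript
// Hypothetical per-podcast configuration: which components run is data,
// not code, so tiers can be adjusted without redeploying functions.
const podcastConfigs = {
  "indie-show": { components: ["combinedSummary"] },
  "enterprise-show": {
    components: ["transcript", "combinedSummary", "newsletter", "seoBlogPost"],
  },
};

// Look up the components to run, falling back to summaries only.
function componentsFor(podcastId, configs = podcastConfigs) {
  return (configs[podcastId] || { components: ["combinedSummary"] }).components;
}
```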

Contact: akhilcjacob.public@gmail.com | akhil@modrnmagic.app | Project: InboxHiiv

