Building an Edge AI Sound Classifier (Part 1): Collecting and Preparing the Dataset



This content originally appeared on DEV Community and was authored by Ertugrul

In this three-part series, I’ll walk you through how I built a tiny edge AI system that runs on a Raspberry Pi Pico (RP2040) and can recognize four everyday sounds:

  • 👶 Baby cry
  • 🔔 Doorbell
  • 🚨 Smoke alarm
  • 🌫 Other / background noise

In Part 1, we’ll cover how the dataset was collected, cleaned, and transformed into snippets ready for feature extraction and training.

🎯 Why Sound Classification on the Edge?

Imagine a baby monitor that can alert you if your child is crying, a smart home device that recognizes the doorbell even when you’re wearing headphones, or an embedded smoke alarm detector. Running directly on the Pico (with only 264 KB RAM and no OS) forces us to keep models lightweight and efficient.

To make this possible, we need a clean and balanced dataset.

📥 Data Collection

I gathered raw audio from three sources:

  • YouTube recordings (e.g., baby cries, alarms)
  • Freesound.org clips
  • Personal recordings with a phone mic

Each class had around 5–8 minutes of audio.

✂ Cutting into Snippets

We don’t feed the raw multi-minute recordings to the pipeline directly. Instead, we cut them into manageable snippets:

  • Baby cry: 1.5s windows
  • Doorbell / smoke alarm / other: 2.0s windows
  • Hop (stride): 0.25s
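To get a feel for the numbers: with these window and hop settings, even a few minutes of audio yields a large pool of candidate snippets. A quick sketch of the upper bound before silence filtering (max_snippets is an illustrative helper name, not from the script):

```python
def max_snippets(duration_s, snip_s, hop_s):
    # Number of full windows that fit when sliding by `hop_s` seconds.
    if duration_s < snip_s:
        return 0
    return int((duration_s - snip_s) / hop_s) + 1

# A 5-minute recording with 2.0 s windows and a 0.25 s hop:
print(max_snippets(300, 2.0, 0.25))  # 1193 candidate windows
```

That surplus is exactly why the per-class targets below cap the final counts.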

We also removed silence using an RMS energy threshold (e.g., -55 dB).
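The rms_db() helper isn’t shown in the post; a minimal version consistent with thresholds like -55 dB might look like this (an assumption, not the author’s exact code):

```python
import numpy as np

def rms_db(seg):
    # Root-mean-square energy of a snippet, expressed in dB relative
    # to full scale; a small floor avoids log10(0) on pure silence.
    rms = np.sqrt(np.mean(np.square(seg)))
    return 20.0 * np.log10(max(rms, 1e-10))

# A window is kept only if rms_db(seg) >= threshold (e.g. -55.0).
```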

This was automated with the script bulk_cut_data.py:

CLASSES = [
    ("./dataset/raw/door_bel",    "./dataset/prep/doorbel",    2.0, 0.25, -55.0, 240),
    ("./dataset/raw/fire_alarm", "./dataset/prep/fire_alarm", 2.0, 0.25, -55.0, 240),
    ("./dataset/raw/baby_cry",   "./dataset/prep/baby_cry",   1.5, 0.25, -50.0, 240),
    ("./dataset/raw/Negativ",    "./dataset/prep/Negativ",    2.0, 0.25, -55.0, 320),
]

Each tuple describes how to process a class:

  • Path to raw audio
  • Output folder for snippets
  • Snippet length (seconds)
  • Hop length (seconds)
  • RMS threshold (dB)
  • Target maximum snippet count

🗂 Dataset Structure

After preprocessing, the dataset looked like this:

dataset/
  raw/           # original long recordings
  prep/
    baby_cry/
    doorbel/
    fire_alarm/
    Negativ/

Each class had a target count:

  • Baby: ~240 snippets
  • Doorbell: ~240 snippets
  • Smoke alarm: ~240 snippets
  • Other: ~320 snippets

Bar chart of snippet counts per class

⚖ Balancing and Deduplication

One common issue: if you cut with overlap, you may end up with almost identical snippets. To avoid this, the script:

  • Filters out silent segments (via rms_db())
  • Uniformly downsamples when there are too many candidates
  • Ensures roughly equal snippet counts per class
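The uniform downsampling step can be sketched like this (downsample_uniform is an illustrative name, not necessarily the script’s):

```python
def downsample_uniform(items, target):
    # Keep at most `target` items, spaced evenly across the list, so
    # near-duplicates from adjacent overlapping windows are thinned
    # uniformly instead of all surviving from one part of a recording.
    if len(items) <= target:
        return list(items)
    step = len(items) / target
    return [items[int(i * step)] for i in range(target)]

print(downsample_uniform(list(range(10)), 5))  # [0, 2, 4, 6, 8]
```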

Spectrogram collage of snippets from different classes

✅ Output of Part 1

At the end of this stage, we had:

  • Balanced, labeled snippets (~1000 total)
  • Silence removed
  • Ready for feature extraction

🔍 Code Deep Dive: cut_file()

A closer look at how a long recording is split:

import os
import uuid

import librosa
import numpy as np
import soundfile as sf

def cut_file(path, snip_s, hop_s, rms_thr_db, out_dir):
    # Load at the project sample rate (SR = 16000) and normalize peaks.
    y, _ = librosa.load(path, sr=SR, mono=True)
    y = norm_peak(y)
    win = int(snip_s * SR)
    hop = int(hop_s * SR)
    saved = 0
    src = sanitize(path)
    # Slide a fixed-length window across the recording.
    for i in range(0, max(0, len(y) - win + 1), hop):
        seg = y[i:i + win]
        if len(seg) < win:  # defensive: zero-pad a short tail
            seg = np.pad(seg, (0, win - len(seg)))
        if rms_db(seg) < rms_thr_db:  # skip silent / very quiet windows
            continue
        # Unique name: sanitized source stem + short random suffix.
        name = f"{src}__{uuid.uuid4().hex[:8]}.wav"
        sf.write(os.path.join(out_dir, name), seg, SR, subtype="PCM_16")
        saved += 1
    return saved

  • librosa.load loads the raw audio at 16 kHz.
  • norm_peak ensures audio has consistent peak amplitude.
  • Window / hop define the sliding snippet extraction.
  • RMS check filters out silent/very quiet segments.
  • Each saved snippet is named with a unique UUID.

This guarantees consistent, labeled training snippets.
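Two helpers referenced above, norm_peak and sanitize, aren’t shown in the post; minimal plausible versions (my assumptions, not the author’s exact code):

```python
import os
import re

import numpy as np

def norm_peak(y, peak=0.95):
    # Scale so the loudest sample hits `peak`, keeping loudness
    # comparable across recordings from different sources.
    m = float(np.max(np.abs(y))) if len(y) else 0.0
    if m < 1e-9:  # all-silent input: leave unchanged
        return y
    return y * (peak / m)

def sanitize(path):
    # Turn a source path into a filesystem-safe filename stem.
    stem = os.path.splitext(os.path.basename(path))[0]
    return re.sub(r"[^A-Za-z0-9_-]", "_", stem)
```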

🔜 What’s Next

In Part 2, we’ll dive into feature extraction and training:

  • Extracting MFCC-like features (Goertzel bands, RMS, centroid, etc.)
  • Training a logistic regression classifier
  • Evaluating with precision/recall and confusion matrices

Stay tuned 🚀

