This content originally appeared on DEV Community and was authored by Ertugrul
In this three-part series, I’ll walk you through how I built a tiny edge AI system that runs on a Raspberry Pi Pico (RP2040) and can recognize four everyday sounds:
Baby cry
Doorbell
Smoke alarm
Other / background noise
In Part 1, we’ll cover how the dataset was collected, cleaned, and transformed into snippets ready for feature extraction and training.
Why Sound Classification on the Edge?
Imagine a baby monitor that can alert you if your child is crying, a smart home device that recognizes the doorbell even when you’re wearing headphones, or an embedded smoke alarm detector. Running directly on the Pico (with only 264 KB RAM and no OS) forces us to keep models lightweight and efficient.
To make this possible, we need a clean and balanced dataset.
Data Collection
I gathered raw audio from three sources:
- YouTube recordings (e.g., baby cries, alarms)
- Freesound.org clips
- Personal recordings with a phone mic
Each class had around 5–8 minutes of audio.
Cutting into Snippets
We don’t feed the raw 5-minute audio directly. Instead, we cut it into manageable snippets:
- Baby cry: 1.5s windows
- Doorbell / smoke alarm / other: 2.0s windows
- Hop (stride): 0.25s
We also removed silence using an RMS energy threshold (e.g., -55 dB).
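The silence filter boils down to measuring each window's RMS energy in dB relative to full scale and skipping windows below the threshold. A minimal sketch of such a check (the helper name `rms_db` matches the one used in the script, but this exact implementation is my assumption):

```python
import numpy as np

def rms_db(seg: np.ndarray, eps: float = 1e-12) -> float:
    """RMS energy of a snippet, in dB relative to full scale (0 dB = peak 1.0)."""
    rms = np.sqrt(np.mean(seg.astype(np.float64) ** 2))
    return 20.0 * np.log10(rms + eps)  # eps avoids log(0) on pure silence

# A near-silent window falls well below the -55 dB threshold and is skipped:
quiet = np.full(16000, 1e-4)  # 1 s at 16 kHz, tiny constant amplitude
loud = np.full(16000, 0.5)
print(rms_db(quiet) < -55.0)  # True -> dropped
print(rms_db(loud) > -55.0)   # True -> kept
```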
This was automated with the script `bulk_cut_data.py`:
CLASSES = [
("./dataset/raw/door_bel", "./dataset/prep/doorbel", 2.0, 0.25, -55.0, 240),
("./dataset/raw/fire_alarm", "./dataset/prep/fire_alarm", 2.0, 0.25, -55.0, 240),
("./dataset/raw/baby_cry", "./dataset/prep/baby_cry", 1.5, 0.25, -50.0, 240),
("./dataset/raw/Negativ", "./dataset/prep/Negativ", 2.0, 0.25, -55.0, 320),
]
Each tuple describes how to process a class:
- Path to raw audio
- Output folder for snippets
- Snippet length (seconds)
- Hop length (seconds)
- RMS threshold (dB)
- Target max snippet count
Dataset Structure
After preprocessing, the dataset looked like this:
dataset/
raw/ # original long recordings
prep/
baby_cry/
doorbel/
fire_alarm/
Negativ/
Each class had a target count:
- Baby: ~240 snippets
- Doorbell: ~240 snippets
- Smoke alarm: ~240 snippets
- Other: ~320 snippets
Balancing and Deduplication
One common issue: if you cut with overlap, you may end up with almost identical snippets. To avoid this, the script:

- Filters out silent segments (via `rms_db()`)
- Uniformly downsamples when there are too many candidates
- Ensures roughly equal snippet counts per class
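Uniform downsampling can be done by picking evenly spaced indices from the candidate list, so the kept snippets still cover the whole recording instead of a contiguous (and highly overlapping) run from the start. A sketch, assuming NumPy (the script's actual selection logic may differ):

```python
import numpy as np

def downsample_uniform(candidates, target_max):
    """Keep at most target_max items, spread evenly across the list."""
    if len(candidates) <= target_max:
        return list(candidates)
    idx = np.linspace(0, len(candidates) - 1, target_max).round().astype(int)
    return [candidates[i] for i in idx]

snips = list(range(1000))          # e.g. 1000 overlapping candidate windows
kept = downsample_uniform(snips, 240)
print(len(kept))                   # 240
print(kept[:3])                    # [0, 4, 8] -- roughly every 4th candidate
```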
Output of Part 1
At the end of this stage, we had:
- Balanced, labeled snippets (~1000 total)
- Silence removed
- Ready for feature extraction
Code Deep Dive: cut_file()
A closer look at how a long recording is split:
import os
import uuid

import librosa
import numpy as np
import soundfile as sf

# SR, norm_peak(), sanitize(), and rms_db() are defined elsewhere in the script

def cut_file(path, snip_s, hop_s, rms_thr_db, out_dir):
    y, _ = librosa.load(path, sr=SR, mono=True)  # resample to 16 kHz mono
y = norm_peak(y)
win = int(snip_s*SR); hop = int(hop_s*SR)
saved = 0
src = sanitize(path)
for i in range(0, max(0, len(y)-win+1), hop):
seg = y[i:i+win]
if len(seg) < win:
seg = np.pad(seg, (0, win-len(seg)))
if rms_db(seg) < rms_thr_db:
continue
name = f"{src}__{uuid.uuid4().hex[:8]}.wav"
sf.write(os.path.join(out_dir, name), seg, SR, subtype="PCM_16")
saved += 1
return saved
- `librosa.load` loads the raw audio at 16 kHz.
- `norm_peak` ensures audio has a consistent peak amplitude.
- Window / hop define the sliding snippet extraction.
- The RMS check filters out silent / very quiet segments.
- Each saved snippet is named with a unique UUID.
This guarantees consistent, labeled training snippets.
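For completeness, the small helpers that `cut_file()` relies on could look like this. These are sketches under the assumption of 16 kHz audio; the actual definitions in `bulk_cut_data.py` may differ:

```python
import os
import re

import numpy as np

SR = 16000  # sample rate used throughout the pipeline

def norm_peak(y: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale audio so its maximum absolute sample reaches `peak`."""
    m = np.max(np.abs(y))
    return y if m == 0 else y * (peak / m)

def sanitize(path: str) -> str:
    """Turn a source filename into a safe snippet-name prefix."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return re.sub(r"[^A-Za-z0-9_-]+", "_", stem)

print(sanitize("./dataset/raw/baby_cry/cry #01.wav"))  # cry_01
print(round(float(np.max(np.abs(norm_peak(np.array([0.1, -0.5]))))), 2))  # 0.95
```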
What’s Next
In Part 2, we’ll dive into feature extraction and training:
- Extracting MFCC-like features (Goertzel bands, RMS, centroid, etc.)
- Training a logistic regression classifier
- Evaluating with precision/recall and confusion matrices
Stay tuned