This content originally appeared on DEV Community and was authored by Ertugrul
In this three-part series, I’ll walk you through how I built a tiny edge AI system that runs on a Raspberry Pi Pico (RP2040) and can recognize four everyday sounds:
Baby cry
Doorbell
Smoke alarm
Other / background noise
In Part 1, we’ll cover how the dataset was collected, cleaned, and transformed into snippets ready for feature extraction and training.
Why Sound Classification on the Edge?
Imagine a baby monitor that can alert you if your child is crying, a smart home device that recognizes the doorbell even when you’re wearing headphones, or an embedded smoke alarm detector. Running directly on the Pico (with only 264 KB RAM and no OS) forces us to keep models lightweight and efficient.
To make this possible, we need a clean and balanced dataset.
Data Collection
I gathered raw audio from three sources:
- YouTube recordings (e.g., baby cries, alarms)
- Freesound.org clips
- Personal recordings with a phone mic
Each class had around 5–8 minutes of audio.
Cutting into Snippets
We don’t feed the raw 5-minute audio directly. Instead, we cut it into manageable snippets:
- Baby cry: 1.5s windows
- Doorbell / smoke alarm / other: 2.0s windows
- Hop (stride): 0.25s
We also removed silence using an RMS energy threshold (e.g., -55 dB).
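The silence filter boils down to measuring each window's RMS energy in dB relative to full scale and skipping windows below the threshold. A minimal sketch of such a check (the helper name `rms_db` matches the one used in the script, but this exact implementation is my assumption):

```python
import numpy as np

def rms_db(seg: np.ndarray, eps: float = 1e-12) -> float:
    """RMS energy of a snippet, in dB relative to full scale (0 dB = peak 1.0)."""
    rms = np.sqrt(np.mean(seg.astype(np.float64) ** 2))
    return 20.0 * np.log10(rms + eps)  # eps avoids log(0) on pure silence

# A near-silent window falls well below the -55 dB threshold and is skipped:
quiet = np.full(16000, 1e-4)  # 1 s at 16 kHz, tiny constant amplitude
loud = np.full(16000, 0.5)
print(rms_db(quiet) < -55.0)  # True -> dropped
print(rms_db(loud) > -55.0)   # True -> kept
```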
This was automated with the script `bulk_cut_data.py`:
CLASSES = [
("./dataset/raw/door_bel", "./dataset/prep/doorbel", 2.0, 0.25, -55.0, 240),
("./dataset/raw/fire_alarm", "./dataset/prep/fire_alarm", 2.0, 0.25, -55.0, 240),
("./dataset/raw/baby_cry", "./dataset/prep/baby_cry", 1.5, 0.25, -50.0, 240),
("./dataset/raw/Negativ", "./dataset/prep/Negativ", 2.0, 0.25, -55.0, 320),
]
Each tuple describes how to process a class:
- Path to raw audio
- Output folder for snippets
- Snippet length (seconds)
- Hop length (seconds)
- RMS threshold (dB)
- Target max snippet count
Dataset Structure
After preprocessing, the dataset looked like this:
dataset/
raw/ # original long recordings
prep/
baby_cry/
doorbel/
fire_alarm/
Negativ/
Each class had a target count:
- Baby: ~240 snippets
- Doorbell: ~240 snippets
- Smoke alarm: ~240 snippets
- Other: ~320 snippets
Balancing and Deduplication
One common issue: if you cut with overlap, you may end up with almost identical snippets. To avoid this, the script:

- Filters out silent segments (via `rms_db()`)
- Uniformly downsamples when there are too many candidates
- Ensures roughly equal snippet counts per class
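Uniform downsampling can be done by picking evenly spaced indices from the candidate list, so the kept snippets still cover the whole recording instead of a contiguous (and highly overlapping) run from the start. A sketch, assuming NumPy (the script's actual selection logic may differ):

```python
import numpy as np

def downsample_uniform(candidates, target_max):
    """Keep at most target_max items, spread evenly across the list."""
    if len(candidates) <= target_max:
        return list(candidates)
    idx = np.linspace(0, len(candidates) - 1, target_max).round().astype(int)
    return [candidates[i] for i in idx]

snips = list(range(1000))          # e.g. 1000 overlapping candidate windows
kept = downsample_uniform(snips, 240)
print(len(kept))                   # 240
print(kept[:3])                    # [0, 4, 8] -- roughly every 4th candidate
```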
Output of Part 1
At the end of this stage, we had:
- Balanced, labeled snippets (~1000 total)
- Silence removed
- Ready for feature extraction
Code Deep Dive: cut_file()
A closer look at how a long recording is split:
import os
import uuid

import librosa
import numpy as np
import soundfile as sf

# SR, norm_peak(), sanitize(), and rms_db() are defined elsewhere in the script

def cut_file(path, snip_s, hop_s, rms_thr_db, out_dir):
    y, _ = librosa.load(path, sr=SR, mono=True)  # resample to 16 kHz mono
y = norm_peak(y)
win = int(snip_s*SR); hop = int(hop_s*SR)
saved = 0
src = sanitize(path)
for i in range(0, max(0, len(y)-win+1), hop):
seg = y[i:i+win]
if len(seg) < win:
seg = np.pad(seg, (0, win-len(seg)))
if rms_db(seg) < rms_thr_db:
continue
name = f"{src}__{uuid.uuid4().hex[:8]}.wav"
sf.write(os.path.join(out_dir, name), seg, SR, subtype="PCM_16")
saved += 1
return saved
- `librosa.load` loads the raw audio at 16 kHz.
- `norm_peak` ensures audio has a consistent peak amplitude.
- Window / hop define the sliding snippet extraction.
- The RMS check filters out silent / very quiet segments.
- Each saved snippet is named with a unique UUID.
This guarantees consistent, labeled training snippets.
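For completeness, the small helpers that `cut_file()` relies on could look like this. These are sketches under the assumption of 16 kHz audio; the actual definitions in `bulk_cut_data.py` may differ:

```python
import os
import re

import numpy as np

SR = 16000  # sample rate used throughout the pipeline

def norm_peak(y: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale audio so its maximum absolute sample reaches `peak`."""
    m = np.max(np.abs(y))
    return y if m == 0 else y * (peak / m)

def sanitize(path: str) -> str:
    """Turn a source filename into a safe snippet-name prefix."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return re.sub(r"[^A-Za-z0-9_-]+", "_", stem)

print(sanitize("./dataset/raw/baby_cry/cry #01.wav"))  # cry_01
print(round(float(np.max(np.abs(norm_peak(np.array([0.1, -0.5]))))), 2))  # 0.95
```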
What’s Next
In Part 2, we’ll dive into feature extraction and training:
- Extracting MFCC-like features (Goertzel bands, RMS, centroid, etc.)
- Training a logistic regression classifier
- Evaluating with precision/recall and confusion matrices
Stay tuned