Transientica: AudioLab
You beatbox into a mic. The game turns it into drums. 45 milliseconds, no MIDI controller, just your mouth.
The Concept
Dead simple idea: what if your voice was the controller? Beatbox a kick, snare, or hi-hat into your laptop mic, and a rhythm game responds like you're hitting real pads.
The first version was terrible. 180ms of lag. Try clapping on a beat that responds almost a fifth of a second late - it's unplayable. So I spent the rest of the project trying to get that number down.
Answer: 45ms. Fast enough that it feels instant. The system trains on your voice specifically - give it 30 examples of each sound and it learns how you beatbox. Different person, different mic - just retrain it and it adjusts. My kick sounds nothing like yours, and that's fine - the model adapts.
Live demo - beatboxing into the mic, game responding in real time
How It Works
Hear the Sound, Classify It, Respond
Mic audio hits a 512-sample FFT window (11ms of sound). Spectral centroid, zero-crossing rate, and envelope shape get extracted. An SVM classifies the hit in 2ms. The result gets sent to Unity as an OSC message over UDP. Total time from mouth to game: under 45ms. A neural net would have taken 15ms just on inference - that's why I picked SVM.
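A minimal sketch of the feature step, assuming numpy and a 44.1kHz sample rate (the post doesn't state the rate, and the exact envelope feature is my guess - here it's reduced to attack position):

```python
import numpy as np

RATE = 44100   # assumed sample rate, not stated in the post
WIN = 512      # 512 samples ~= 11.6 ms at 44.1 kHz

def features(window: np.ndarray) -> np.ndarray:
    """Extract the three features named in the post from one window."""
    # Spectral centroid: magnitude-weighted mean frequency of the window.
    mags = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / RATE)
    centroid = float(np.sum(freqs * mags) / (np.sum(mags) + 1e-12))
    # Zero-crossing rate: fraction of adjacent samples that change sign.
    zcr = float(np.mean(np.abs(np.diff(np.sign(window))) > 0))
    # Envelope shape, reduced here to attack sharpness: where the peak sits.
    attack = float(np.argmax(np.abs(window)) / len(window))
    return np.array([centroid, zcr, attack])

# With a trained classifier (e.g. an sklearn.svm.SVC), one hit becomes:
#   label = svm.predict(features(window).reshape(1, -1))[0]
```

Three scalars per window is also part of why the SVM is so fast: inference is a handful of dot products, not a forward pass.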
Why Kicks and Snares Are Hard
Both are sharp transients with broadband noise. To a basic classifier, they look almost identical. The difference is where the energy lives: kicks sit in the low end, snares are mostly noise spread across the spectrum. Spectral centroid is what finally let the system tell them apart. That single feature took accuracy from 78% to the low 90s.
Two Processes, One Game
Python does the audio math. Unity runs the game. They talk over OSC on localhost. Seems like overkill, but it was worth it. I could tweak the classifier, retrain it, and test it without restarting Unity. And when the game crashed (often), the audio pipeline just kept running.
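OSC messages are simple enough to encode by hand: a null-terminated address padded to a 4-byte boundary, a type-tag string, then the arguments. A stdlib-only sketch of what Python fires at Unity - the `/hit` address and port 9000 are placeholders, not values from the post:

```python
import socket

def osc_pad(b: bytes) -> bytes:
    """Null-terminate and pad to a 4-byte boundary, as OSC requires."""
    b += b"\x00"
    return b + b"\x00" * (-len(b) % 4)

def osc_message(address: str, label: str) -> bytes:
    """Encode a single-string OSC message (",s" = one string argument)."""
    return osc_pad(address.encode()) + osc_pad(b",s") + osc_pad(label.encode())

# Fire-and-forget over UDP to Unity on localhost. UDP means a crashed or
# restarting game never blocks the audio process - packets just vanish.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(osc_message("/hit", "kick"), ("127.0.0.1", 9000))
```

In practice a library like python-osc does this for you; the point is that the wire format is cheap - one small datagram per hit.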
What Went Wrong & How I Fixed It
180ms of Lag (Four Times Too Slow)
What Was Hard
The first working prototype recognized sounds - it just did it 180ms late. Try tapping a rhythm when every response arrives almost a fifth of a second after the hit. It feels completely off.
What I Did
I cut latency at every step. Smaller FFT window. Circular buffer to kill I/O blocking. SVM instead of a neural net (2ms inference vs 15ms). UDP instead of TCP. Each change saved 20-40ms. Final result: 45ms at 92% accuracy.
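The window size alone sets a hard floor on latency: the code can't see a sound until the driver hands over a full block. A quick sanity check of that arithmetic, assuming 44.1kHz (the 2048-sample figure is illustrative, not from the post):

```python
RATE = 44100  # assumed sample rate

def buffer_latency_ms(n_samples: int, rate: int = RATE) -> float:
    """Time the audio driver holds a block before the code can see it."""
    return 1000.0 * n_samples / rate

# A 2048-sample window (a common library default) costs ~46 ms before any
# processing even starts; 512 samples costs ~11.6 ms. Shrinking the window
# is the single biggest win in the budget above.
print(round(buffer_latency_ms(2048), 1))  # 46.4
print(round(buffer_latency_ms(512), 1))   # 11.6
```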
The Classifier Thought Every Sound Was a Kick
What Was Hard
Kicks and snares both hit hard and fast. To a naive classifier looking at amplitude and timing, they're basically the same sound. Accuracy was stuck at 78%.
What I Did
Stopped looking at volume. Started looking at where the energy lives: spectral centroid (low = kick, high = snare), zero-crossing rate, envelope shape. Also built separate models for solo hits vs rapid patterns. 78% became 92%.
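The solo-vs-rapid split can be routed on inter-onset interval: if this hit came hot on the heels of the last one, use the rapid-pattern model. A sketch under assumed names - the 150ms threshold and the model interface are hypothetical, not from the post:

```python
import time

RAPID_IOI_S = 0.15  # hypothetical: hits closer than 150 ms count as a rapid pattern

class HitRouter:
    """Send each onset to the solo-hit model or the rapid-pattern model,
    chosen by the time since the previous onset (inter-onset interval)."""

    def __init__(self, solo_model, rapid_model):
        self.solo_model = solo_model
        self.rapid_model = rapid_model
        self.last_onset = None

    def classify(self, feats, now=None):
        now = time.monotonic() if now is None else now
        ioi = None if self.last_onset is None else now - self.last_onset
        self.last_onset = now
        rapid = ioi is not None and ioi < RAPID_IOI_S
        model = self.rapid_model if rapid else self.solo_model
        return model.predict(feats)
```

The payoff is that each model sees a narrower distribution: rapid hits are shorter and bleed into each other, so training one model on both regimes blurs the decision boundary.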
Audio Would Just... Stop
What Was Hard
On slower machines, the ML inference would hog the CPU. The audio buffer would underrun. No error, no warning - just silence in the middle of playing.
What I Did
Lock-free ring buffer between threads. Audio capture runs on its own thread and never waits for ML. If classification is slow, audio keeps flowing and the next hit still gets captured. Zero dropouts after the fix.
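The core idea in a single-producer/single-consumer sketch - each index is touched by only one thread, so neither side ever waits. This is a Python illustration of the structure, not the actual implementation; a production version would use a preallocated buffer in C or rely on an audio library's callback:

```python
import numpy as np

class RingBuffer:
    """SPSC ring buffer: the audio thread writes and never blocks, the ML
    thread reads whatever has arrived. Overruns drop old samples instead
    of stalling capture."""

    def __init__(self, capacity: int):
        self.buf = np.zeros(capacity, dtype=np.float32)
        self.capacity = capacity
        self.write_idx = 0  # only the audio thread advances this
        self.read_idx = 0   # only the ML thread advances this

    def push(self, block: np.ndarray) -> None:
        """Audio thread: write samples, overwriting the oldest if full."""
        for x in block:
            self.buf[self.write_idx % self.capacity] = x
            self.write_idx += 1

    def pop(self, n: int) -> np.ndarray:
        """ML thread: read up to n samples, skipping ahead if the writer
        lapped us (data was dropped, but capture never stalled)."""
        if self.write_idx - self.read_idx > self.capacity:
            self.read_idx = self.write_idx - self.capacity
        n = min(n, self.write_idx - self.read_idx)
        idx = (self.read_idx + np.arange(n)) % self.capacity
        self.read_idx += n
        return self.buf[idx]
```

The key property: `push` has no branch that can wait on the reader, so a slow classifier costs you at most some dropped samples, never a silent, underrun audio stream.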