Skip to main content

AI Audio Research

Can AI generate game sounds in real time? I built a prototype and tested three approaches. They were all too slow - but caching made it work.

Role Solo Researcher
Type Bachelor Thesis
Year 2025 - 2026
Stack Python, UE5 C++, WebSocket
Python ElevenLabs API WebSocket Unreal Engine 5 MMAudio AudioGen

What I Tested and Why

Games ship with thousands of WAV files. Gunshots, footsteps, explosions - all pre-recorded, all taking up space. What if AI generated them on the fly? Every sound unique, no giant audio banks. The problem: latency. A gunshot that arrives 100ms late feels broken. I wanted to measure exactly how far off we are.

Unreal Engine 5 sends a text prompt to a Python server over WebSocket. The server calls an AI model and streams audio back. I tested three: ElevenLabs (cloud API - sounds amazing, latency is a lottery), MMAudio (local GPU - I hacked a video-to-audio model to accept text prompts), and AudioGen (Meta's text-to-audio - most consistent, still way too slow).

What I Tested

ElevenLabs

Cloud API. Best sounding by far - sometimes you can't tell it's generated. But the latency is a dice roll: 1 second on a good request, 12 on a bad one. Same day, same prompt.

MMAudio

Runs locally on GPU. 2-5 seconds per sound. It was built to add soundtracks to video, not games - I had to hack the pipeline to accept text prompts instead of video frames.

AudioGen

Meta's text-to-audio. Most consistent of the three at about 3 seconds. Reliable, predictable - and still 3 seconds too slow for a gunshot.

Caching Sounds Ahead of Time

Every model failed the speed test. The fastest was still over a second - roughly 60x too slow for a gunshot. I'd added a simple cache early on just to avoid regenerating the same sound twice. Then I looked at the delivery times: 35-90ms. Hold on. That's fast enough.

So I stopped trying to make AI faster and asked a different question: when should you generate? Load screens. Menu screens. Quiet moments. Pre-generate the sounds, cache them. Player pulls the trigger, sound plays from memory. Queue a fresh variation in the background. Nobody notices.

Documentation

Read the Full Paper

The full thesis - benchmarks, failures, and the caching approach that ended up solving the latency problem.

Read Paper (PDF)