Audio, music, and sound effects

Key Points

  • [ ] How is triggering audio samples like and unlike triggering animations?

Audio

Next let's talk about audio. Game audio design can be the topic of an entire course on its own, but I want to draw your attention to a couple of concepts.

First, how does a computer make sound? Speakers work by driving an oscillating current through an electromagnet to cause a paper cone or other membrane to vibrate rapidly (from there, the air or other medium vibrates, and then your face and ear bones). A larger swing in the signal yields a more intense vibration and thus a louder sound. Peaks that are closer together in time are perceived as higher-pitched sounds, peaks that are farther apart are perceived as lower-pitched sounds, and complex waveforms carry multiple pitches superimposed on each other.

So far, so analog. You may have seen audio quality settings in video editing applications or your OS with numbers like 44.1 kHz or 22 kHz; these figures refer to how many samples per second are used to drive the audio stream, and finer temporal resolution lets us reproduce superpositions of waveforms more precisely. But audio output from a computer is different from the audio from a musical instrument or voice. On a computer we could in theory represent these waveforms analytically as equations, but at output time we nearly always produce them as streams of digital samples: 44,100 numbers per second of audio, each of which describes how far to displace the speaker at a given instant in time. At some point, digital-to-analog converter circuitry transforms our 8-bit bytes or 32-bit IEEE floating point bit patterns into voltages that deflect the speaker's magnet one way or the other, producing sound.
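To make this concrete, here is a minimal sketch (plain Rust, no audio library) that fills a buffer with one second's worth of samples of a 440 Hz sine tone at a 44.1 kHz sample rate. The constants are illustrative choices; a real project would take the sample rate from the audio device's configuration.

```rust
// One second of a 440 Hz sine tone as 32-bit float samples in [-1, 1].
const SAMPLE_RATE: u32 = 44_100; // samples per second (illustrative)
const FREQUENCY: f32 = 440.0; // pitch in Hz (illustrative)

fn sine_second() -> Vec<f32> {
    (0..SAMPLE_RATE)
        .map(|i| {
            let t = i as f32 / SAMPLE_RATE as f32; // time of this sample, in seconds
            (t * FREQUENCY * 2.0 * std::f32::consts::PI).sin()
        })
        .collect()
}
```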

Advantages of samples include that mixing of distinct audio streams can happen in software just by adding together (and possibly normalizing) numbers; we can modulate audio by multiplication, introduce delays by memory copying, and do all kinds of fun tricks like that. Algorithms like the Fourier transform allow us to go between the time domain (where each number describes the instantaneous amplitude) and the frequency domain (where numbers describe the underlying pitches). But the iron law of computer audio is feeding the stream of samples without taking too much time: if the buffer is filled too slowly, the hardware runs out of samples and the audio pops and stutters, while if we fill far ahead of playback we pay in memory and in latency before newly triggered sounds are heard. We rarely want to produce 44,100 samples in one go, since a second can be a long time; we'd rather fill up the buffer and submit it to the "renderer" (the audio hardware via the OS) a little bit at a time, trading off between low latency (with small, frequent CPU costs) and higher memory usage (with possibly delayed sound effects).
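As a small illustration of sample-level mixing (a sketch, not tied to any particular library), two streams can be combined by summing corresponding samples and clamping the result back into range. Clamping is the crudest form of normalization; real mixers usually scale or soft-clip instead to avoid harsh distortion.

```rust
/// Mix two equal-length sample buffers by summing and clamping to [-1, 1].
fn mix(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter()
        .zip(b.iter())
        .map(|(&x, &y)| (x + y).clamp(-1.0, 1.0))
        .collect()
}
```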

In our Rust projects, I suggest using the kira crate since it seems reasonably cross-platform and expressive enough for our needs. It has both simple APIs for triggering sounds and means to arrange and synchronize multiple sounds in concert, as well as a way to define sound "events" triggered automatically by timers.
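In broad strokes, kira's high-level API looks like the sketch below. This is based on the 0.8-era API; names and module paths have shifted between versions, so check the documentation for the version you depend on, and note that "assets/jump.ogg" is a hypothetical file path.

```rust
use kira::{
    manager::{backend::DefaultBackend, AudioManager, AudioManagerSettings},
    sound::static_sound::{StaticSoundData, StaticSoundSettings},
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One manager owns the connection to the OS audio device.
    let mut manager =
        AudioManager::<DefaultBackend>::new(AudioManagerSettings::default())?;
    // Load the sound file into memory ahead of time...
    let sound =
        StaticSoundData::from_file("assets/jump.ogg", StaticSoundSettings::default())?;
    // ...then trigger it; play() returns a handle for later control.
    let _handle = manager.play(sound)?;
    std::thread::sleep(std::time::Duration::from_secs(1)); // let it finish
    Ok(())
}
```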

The most basic form of in-game audio features a looping background track (possibly with a lead-in or lead-out section when it starts and stops) and overlaid sound effects. On very old game hardware, the same sound channels had to be shared among music and sound effects; but modern computers can mix huge numbers of audio streams seamlessly. Do note, however, that just because you can play a hundred overlapping sounds does not mean you should! Common tricks used to avoid muddy-audio situations include limiting the number of sound effects that can be triggered within a single frame, putting a cooldown on any one sound effect before it can be played again, forcing sound effects to play in sync with the background music in some way (say, queuing them up to play only on the song's down beats), and so on.
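One of those tricks, a per-sound cooldown, can be as simple as remembering when each effect last played. The sketch below is one possible design, with a hypothetical SoundId type standing in for however your engine identifies loaded sounds.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical identifier for a loaded sound effect.
type SoundId = u32;

struct SfxGate {
    cooldown: Duration,
    last_played: HashMap<SoundId, Instant>,
}

impl SfxGate {
    /// Returns true if the sound may play now, recording the trigger time.
    fn try_trigger(&mut self, id: SoundId, now: Instant) -> bool {
        match self.last_played.get(&id) {
            Some(&t) if now.duration_since(t) < self.cooldown => false,
            _ => {
                self.last_played.insert(id, now);
                true
            }
        }
    }
}
```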

In games, we have to deal with interactive (and often even spatial) audio streams. Whether it's the song changing as we switch between rooms, additional music tracks or tempo changes in dangerous situations, or a cacophony of weapon sound effects and enemy reaction sounds, we don't always know in advance how many simultaneous sound sources we have or how they're mixed. Using kira, you might keep around the InstanceHandle for a spatial sound (so it doesn't get dropped!) and adjust its volume based on the player's and source's relative distance and orientation.
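The volume computation itself can be a simple falloff function. This sketch assumes 2D positions and a hypothetical max_dist beyond which the source is inaudible; each frame you would compute this and feed the result to the stored handle's volume setter.

```rust
/// Linear distance attenuation: full volume at the source, silent at or
/// beyond `max_dist`. Real games often use inverse-square falloff or
/// hand-tuned curves instead.
fn spatial_volume(listener: [f32; 2], source: [f32; 2], max_dist: f32) -> f32 {
    let dx = source[0] - listener[0];
    let dy = source[1] - listener[1];
    let dist = (dx * dx + dy * dy).sqrt();
    (1.0 - dist / max_dist).clamp(0.0, 1.0)
}
```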

To sum up, it's convenient to build yourself a small API for queueing up or triggering sounds and loops, with parameters indicating what to do if the same sound is already playing (retrigger, cancel the old sound with or without a crossfade, or do nothing). Sound files and sample buffers should be treated like image resources: load them in advance of their use, and package related ones together so they can be unloaded when necessary.
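Concretely, such an API might center on a small policy enum. Everything here, names included, is a sketch of one possible design rather than a fixed interface.

```rust
/// Hypothetical identifier for a loaded sound, as in the earlier sketch.
type SoundId = u32;

/// What to do when a sound is triggered while already playing.
enum Retrigger {
    /// Restart the sound from the beginning.
    Restart,
    /// Stop the old instance first, optionally crossfading over
    /// the given number of seconds.
    CancelOld { crossfade_secs: Option<f32> },
    /// Leave the old instance alone and ignore the new trigger.
    Ignore,
}

/// A minimal audio facade the game code could call into; the engine
/// implements it on top of whatever audio library it uses.
trait AudioApi {
    fn play_sfx(&mut self, id: SoundId, policy: Retrigger);
    fn play_loop(&mut self, id: SoundId);
    fn stop_loop(&mut self, id: SoundId);
}
```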

Besides kira, another excellent choice specifically for games is the C library FMOD, for which rust-fmod bindings exist. FMOD is a substantial, professional-quality library with lots of features, and exploring it is beyond the scope of these notes. If you want to use it, please be sure to talk us through how it goes!

Quick aside: If you're looking for 8-bit style sound effects, make your own with sfxr! Of course there's also a Rust implementation.

Activity: Good vibes only. Thinking about your game projects, find four places to put music changes or sound effects. Think about what systems your audio playback must be integrated into in order to get sounds playing at the right time (and not playing over each other too much)—what code would you need to modify? Should audio triggering live in the engine or in your game?