8 Psychoacoustic Principles For Music Producers And Sound Designers

The principles of sound and the way human beings hear have always been fascinating. For those new to the term, psychoacoustics is the study of sound from a psychological and physiological point of view.

There are certain characteristics about how our brains perceive sound. 

1. Hearing is non-linear:

Our ears do not pick up sound linearly. The ear responds to increasing sound intensity in a logarithmic way. The ability to hear a pin drop (about 10 dB SPL) is a sign of sensitive hearing. 0 dB SPL (0.00002 Pa) is the threshold of hearing and 120 dB SPL (20 Pa) is the threshold of pain. That means the sounds the human ear is capable of hearing span a very large range of amplitudes: a pressure ratio of a million to one.
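To make the logarithmic scale concrete, here is a minimal Python sketch of the standard pressure-to-dB-SPL conversion, using the 0.00002 Pa reference quoted above. The function names are made up for illustration.

```python
import math

# Reference pressure for 0 dB SPL: 0.00002 Pa (20 micropascals)
P_REF = 0.00002


def pressure_to_db_spl(pressure_pa: float) -> float:
    """Convert an RMS sound pressure in pascals to dB SPL."""
    if pressure_pa <= 0:
        raise ValueError("pressure must be positive")
    return 20.0 * math.log10(pressure_pa / P_REF)


def db_spl_to_pressure(db_spl: float) -> float:
    """Convert a level in dB SPL back to an RMS pressure in pascals."""
    return P_REF * 10.0 ** (db_spl / 20.0)


# Threshold of hearing, an in-between value, and the threshold of pain
for pressure in (0.00002, 0.02, 20.0):
    print(f"{pressure} Pa -> {pressure_to_db_spl(pressure):.1f} dB SPL")
```

Every tenfold increase in pressure adds only 20 dB, which is how a million-to-one pressure range fits into 120 dB.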


2. Equal loudness contours and Fletcher-Munson curves:

Splitting the frequency spectrum from 20 Hz to 20 kHz into three bands (lows, mids and highs), we find that at low listening levels of about 55 dB SPL the lows and highs are perceived as less loud than the mids. At around 85 dB SPL all three bands seem almost equally loud. At 95 dB SPL our perception of bass and treble is enhanced.

Fletcher and Munson carried out extensive experiments on this concept and came up with a set of graphs that represent how frequencies are perceived at different loudness levels. In order to hear bass, mids and treble at equal levels, the bass has to be boosted by up to 64 times, and the treble has to be boosted by 16 times.

In real-world terms, a piece of music can be perceived quite differently at low playback levels than at high ones. At low levels, the lead instruments or vocals would probably be clearly audible, but the rest of the instruments would sound too quiet. Since the lows, mids, and highs even out at loud listening levels, a common trick to make music “seem” louder is to reduce the mid frequencies and/or boost the high frequencies while maintaining a solid bottom end.

By doing this, we trick our ears into believing that the music is quite loud, even at low listening levels. This scooping out of mid frequencies can sometimes be too obvious, and must be done without drastically changing the tone of the instruments. Also, be mindful of the fact that low frequencies take up more space in a mix than high frequencies (refer to the graph above).

In many forms of music that are inherently loud (like EDM, metal, etc.), the sounds for every instrument are designed so that they do not lose their tonality even after a lot of the mid frequencies are scooped out. Bright lead synths, crisp guitars and fat bass textures make the track sound louder than it actually is.

3. Inter-aural Time Difference:

The inter-aural time difference, or ITD, is the difference between the time a sound takes to reach one ear and the time the same sound takes to reach the other ear. Although the time difference is very small, the brain uses this information to estimate the direction and angle of a sound source.
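To get a feel for how small these time differences are, here is a rough Python sketch. It assumes a simple path-difference model (ear spacing times the sine of the angle, divided by the speed of sound), an ear spacing of 0.18 m and a speed of sound of 343 m/s. None of these figures come from the text above, and real heads do not behave this simply.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at room temperature (assumed)
EAR_SPACING = 0.18      # metres between the ears (assumed, varies per listener)


def itd_seconds(azimuth_deg: float) -> float:
    """Approximate ITD for a distant source.

    azimuth_deg: 0 = straight ahead, 90 = directly to one side.
    Uses the simple path-difference model d * sin(angle) / c.
    """
    return EAR_SPACING * math.sin(math.radians(azimuth_deg)) / SPEED_OF_SOUND


for angle in (0, 30, 60, 90):
    print(f"{angle:>2} degrees -> {itd_seconds(angle) * 1000:.3f} ms")
```

Under these assumptions, even a source directly to one side produces a difference of only about half a millisecond.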

This principle can be helpful for designing mixes that sound three-dimensional by taking advantage of our binaural hearing.

4. Inter-aural Intensity Difference:

 IID is the difference in sound pressure level (perceived as loudness) and frequency distribution between the ears. Thus the audio signal reaching each ear is slightly different. The brain uses this information to estimate the distance from the sound source. 

The brain combines IID and ITD to make out the direction of a sound. With the help of modern technology, several plugins can emulate binaural panning, which can place sounds above, below, left, right, in front and behind, all in a stereo output.

Surround sound formats such as 5.1, 7.1 and Auro-3D are all derived from IID and ITD. For example, a signal can be made to sound like it's coming from behind the ears by applying a special technique known as pinna filtering. The shape of our pinna filters out certain frequencies from an audio signal. Replicating this effect makes any sound seem like its source is behind us.

5. The Haas effect:

The Haas effect describes how the brain interprets direct source sound and reflected sound in a closed space. Helmut Haas discovered that if the time difference between two copies of the same sound is between 10 and 35 ms, the brain perceives them as one sound. The delay between the two sounds creates an effect which makes the combined sound seem larger.

In a song, the Haas effect can be applied in several different ways. To add stereo width and space, simply duplicate any layer (voice, guitars, etc.), pan the two copies 100% in opposing directions and delay one of them by 10-35 ms, either by using a delay plugin or by manually shifting the waveform. Doing this makes that layer sound bigger and wider and adds depth to the song.

This can also be used to trick the mind about the direction of the sound. After duplicating, delaying, and panning the signal, making one copy louder than the other by at least 10 dB makes the mind believe that the primary source of the sound is on that side, even though the other copy is playing from the opposite direction. This way, elements in a song can be panned without leaving any direction completely empty or unbalanced.
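The duplicate, pan, delay and level-offset steps above can be sketched in a few lines of Python. This is only an illustration on plain lists of samples, not how any particular DAW or plugin does it. The function name `haas_widen` and its parameters are hypothetical.

```python
def haas_widen(mono, sample_rate, delay_ms=20.0, right_gain_db=0.0):
    """Turn a mono signal into a hard-panned stereo pair.

    The left channel is the untouched copy. The right channel is the
    same signal delayed by delay_ms and scaled by right_gain_db.
    Returns (left, right) as two lists of equal length.
    """
    if not 10.0 <= delay_ms <= 35.0:
        raise ValueError("delay_ms should stay within the 10-35 ms window")

    delay_samples = round(sample_rate * delay_ms / 1000.0)
    gain = 10.0 ** (right_gain_db / 20.0)  # dB to linear amplitude

    # Pad both channels with silence so they have the same length
    left = list(mono) + [0.0] * delay_samples
    right = [0.0] * delay_samples + [sample * gain for sample in mono]
    return left, right


# A 20 ms delay at 44.1 kHz, with the delayed copy 10 dB quieter
left, right = haas_widen([1.0, 0.5, 0.25], 44100, delay_ms=20.0, right_gain_db=-10.0)
print(len(left), len(right))
```

A 10 dB level difference works out to an amplitude factor of roughly 3.16, which is why the gain is computed as 10 raised to the power of dB / 20.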

Sometimes, detuning the copies by a few cents each way also helps greatly in making the sound bigger. Some software instruments and plugins have the option of creating this unison effect by themselves. The best way to do this on live instruments would be to use two different takes, as they would almost always contain enough differences in intensity, time, and pitch.
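For reference, a cent is one hundredth of an equal-tempered semitone, so 1200 cents make an octave. The short Python sketch below shows what "a few cents each way" means in hertz. The 440 Hz note and the 10-cent offset are arbitrary example values.

```python
def cents_to_ratio(cents: float) -> float:
    """Frequency ratio for a pitch offset in cents (1200 cents = 1 octave)."""
    return 2.0 ** (cents / 1200.0)


def detune(freq_hz: float, cents: float) -> float:
    """Frequency after shifting freq_hz by the given number of cents."""
    return freq_hz * cents_to_ratio(cents)


base = 440.0
for offset in (-10, 0, 10):
    print(f"{offset:+d} cents -> {detune(base, offset):.2f} Hz")
```

At 440 Hz, a 10-cent shift moves the pitch by only about 2.5 Hz in either direction, which is enough to thicken a sound without it reading as out of tune.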

The timing is important when implementing this principle. If the delay time is less than 10 ms, phase cancellation between the two copies can create audible artifacts. If the delay time is more than 35 ms, the brain may start to perceive them as two different sounds instead of one.
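One way to picture the phasing problem: if the original and the delayed copy end up summed together at equal level (for instance when a stereo mix is folded down to mono), they cancel at frequencies where the delay equals an odd number of half cycles. The Python sketch below lists those notch frequencies. The mono fold-down scenario and the function name are assumptions for illustration, not something stated above.

```python
def comb_notches(delay_ms: float, max_hz: float = 20000.0):
    """Notch frequencies (Hz) when a signal is summed with an equal-level
    copy of itself delayed by delay_ms.

    Cancellation happens at odd multiples of 1 / (2 * delay).
    """
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")

    tau = delay_ms / 1000.0  # delay in seconds
    notches = []
    k = 0
    while True:
        freq = (2 * k + 1) / (2.0 * tau)
        if freq > max_hz:
            break
        notches.append(freq)
        k += 1
    return notches


for delay in (1.0, 5.0, 20.0):
    found = comb_notches(delay)
    print(f"{delay} ms: {len(found)} notches, first at {found[0]:.0f} Hz")
```

With a 1 ms delay the notches sit roughly 1000 Hz apart (near 500, 1500, 2500 Hz and so on), so they carve obvious holes in the tone. With longer delays the notches are packed much closer together and are generally harder to pick out as a tonal change.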

6. The cocktail party effect:

The cocktail party effect is primarily a binaural effect which relies on localization of sound to help us focus on a certain source. When multiple conversations are going on at a party and you're not paying attention to any of them, they blend into noise. However, if you decide to pay attention to a particular conversation, you can follow it perfectly well. You can also switch to another conversation at the same party and follow that one instead. Whichever conversation we decide to tune in to becomes the one we hear and understand.

 In music, this principle can be used while layering different elements together. New elements can be introduced in a way that they attract attention towards them and take away attention from existing elements. On the other hand, new elements can be introduced to emphasize the effect of existing elements. 

 In movies, this principle is used while establishing a soundscape with dialog, sound effects and music. The soundscape is designed so as to make the audience focus on what is desired.  

7. Auditory masking:

Auditory masking is the process that happens when one sound overshadows another, i.e., the perception of one sound is affected by the presence of another sound. This can happen if the two sounds in question have a huge difference in level or frequency or timbre or complexity or any combination of these. 

There are primarily two types of masking:

  • Simultaneous masking

  • Temporal masking

Simultaneous masking is primarily frequency-based. It occurs when two sounds are playing in the same frequency band and one sound is very loud compared to the other. When this happens, the quieter sound blends into the louder one, provided that the loudness difference between the two sounds is great enough.

Temporal masking is time-based. It occurs mostly in the presence of transient sounds. If two transients occur one after the other, the first one being the louder one and the second one occurring within 100 ms of the first, the first transient masks the second one. This is known as forward masking or post-masking. Masking can also occur when the softer, secondary transient occurs within 20 ms before the louder, primary transient. This is known as backward masking or pre-masking.
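The two time windows above can be written down as a simple rule. The Python sketch below only encodes the 100 ms and 20 ms figures quoted here. Real masking also depends on the levels and frequency content of both sounds, which this toy function ignores. The function name is hypothetical.

```python
FORWARD_WINDOW_MS = 100.0   # softer transient arrives after the louder one
BACKWARD_WINDOW_MS = 20.0   # softer transient arrives before the louder one


def temporal_masking(offset_ms: float) -> str:
    """Classify the masking of a softer transient by a louder one.

    offset_ms: arrival time of the softer transient relative to the
    louder transient (negative means the softer one comes first).
    """
    if offset_ms == 0:
        return "simultaneous"
    if 0 < offset_ms <= FORWARD_WINDOW_MS:
        return "forward"
    if -BACKWARD_WINDOW_MS <= offset_ms < 0:
        return "backward"
    return "none"


for offset in (-50, -10, 0, 40, 150):
    print(f"{offset:+d} ms -> {temporal_masking(offset)}")
```

Note how lopsided the windows are: the louder hit can hide what follows it for far longer than it can hide what came just before it.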

Auditory masking is an important factor in audio compression used in formats such as MP3. The file size difference between WAV and MP3 is substantial, and it is achieved mainly by setting thresholds on the masked elements. To cut down on file size, MP3 converters take away a lot of bits (or pieces of information) from the masked elements, which the human ear would have had difficulty hearing anyway.

Taking away these bits creates a distortion in the sound, but the better the quality of the MP3 file, the more difficult it becomes to hear this distortion. A high-bitrate MP3 file (256 kbps or higher) would not take as much information away from the file as a low-bitrate MP3 file (64 kbps, 128 kbps) and hence, differentiating between a high quality MP3 file and a WAV file can be very difficult.
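To put rough numbers on the size difference, the Python sketch below compares uncompressed PCM audio with the MP3 bitrates mentioned above. It assumes a CD-quality WAV file (44.1 kHz, 16-bit, stereo) and a three-minute track. Both are example figures, and file headers and metadata are ignored.

```python
def pcm_bitrate_kbps(sample_rate=44100, bit_depth=16, channels=2):
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate * bit_depth * channels / 1000.0


def file_size_mb(bitrate_kbps: float, seconds: float) -> float:
    """Approximate audio payload size in megabytes (1 MB = 1,000,000 bytes)."""
    return bitrate_kbps * 1000.0 * seconds / 8.0 / 1_000_000.0


track_seconds = 180  # a three-minute song (example value)
wav_kbps = pcm_bitrate_kbps()

print(f"WAV: {wav_kbps:.1f} kbps, {file_size_mb(wav_kbps, track_seconds):.2f} MB")
for mp3_kbps in (64, 128, 256):
    size = file_size_mb(mp3_kbps, track_seconds)
    print(f"MP3 {mp3_kbps} kbps: {size:.2f} MB, about {wav_kbps / mp3_kbps:.1f}x smaller")
```

Under these assumptions the WAV file runs at about 1411 kbps, so even a 256 kbps MP3 is keeping less than a fifth of the original data rate.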

8. Layering:

Human ears find it difficult to distinguish one sound from another when listening to a huge layer of sounds. Take, for example, a kick drum layered with a snare, a tom and a cymbal: when all of them are played together as one hit, our brain treats it as one big impact.

This effect is used extensively in loud music with aggressive sounds, such as heavy metal and dubstep. A combination of several guitars playing big chords and riffs is treated as one big guitar layer. Similarly, a dubstep wobble bass is usually perceived as one huge bass layer, though it almost always contains several layers from various frequency bands. This is applied even in orchestral music in the form of big chords, spread across various instruments and various octaves. It all adds up to make it seem like one huge sound.

To make layering more effective, one must choose layers in such a way that they blend well with each other. Chord voicing is also important when layering. EQ and other effects can be used to remove overlapping frequencies between the different layers. These sounds can be glued together by processing them with bus effects, such as bus compression, distortion/saturation effects, reverb, delay and so on. Doing all this makes it more difficult for the ear to identify the individual elements but makes the combination of sounds feel cohesive.