TUTORIAL for the HANDBOOK FOR ACOUSTIC ECOLOGY


SOUND-MEDIUM INTERFACE


This topic is not how most texts on the subject of sound are likely to begin – basic concepts in oscillation and magnitude are more common starting points, and we will get to those later – but given the interdisciplinary approach that we have established, it seems more appropriate to begin with the most basic concepts that ground that knowledge.

Most models to describe sonic phenomena involve the transfer of energy, signals and information via processes which we are referring to here, in the broadest sense, as the sound-medium interface. The model is a very simple linear one, as shown below, where vibration in a sound source is transferred to a medium through which it reaches a receiver. Note that these generic terms, typical of a scientific approach, are meant to generalize the situation – they leave open, for instance, whether the medium is air and whether the receiver is a human!

SOURCE ––––––> MEDIUM ––––––> RECEIVER

Specifically, we are interested in the terms that describe the process of energy and information transfer from the source to the medium: what determines its characteristics, what establishes its limits, and what its implications are. We have chosen five major types of energy and information transfer for comparison.

(click on the link under the capital letter to go to that section)

A) Acoustic: the physical transfer of energy from a vibrating object to the surrounding medium (called "acoustic radiation") and the physical transfer of acoustic energy in a medium (called "sound propagation")

B) Psychoacoustic: the processing of incoming sound waves by the auditory system to extract usable information for the brain, the process most commonly called "hearing"

C) Electroacoustic: the transfer of energy from an acoustic to an electrical form (a process called "transduction") including various possible intermediate stages of electrical manipulation, storage and retrieval, analog and digital formats, as well as the re-conversion of the signal to an acoustic form

D) Electromagnetic: the transfer of audio signals from transmitter to receiver via an electromagnetic wave, a process called "radio transmission"

E) Soundscape: the processing of information from the sonic environment to extract usable information for the brain which can influence human behaviour, the process most commonly called "listening"

In the original Handbook chart, a sixth term is given for the sound-medium interface in music, namely "performance", which in this context can be thought of as the transfer of musical information from the composer to the listener usually via a human performer and a score, but now also possible via loudspeaker systems. This topic is not illustrated here, but it is good to keep in mind that this linear concept of music is pervasive, and alternative models should be considered where appropriate, as in the soundwalk where the composer, performer and listener are one and the same.

The main drawback of this approach is that once we get started on any discipline, at some point we have to break it off and go on to the next one; however, with links available to subsequent material, this shouldn't be too frustrating. It is also inevitable that new terms have to be used here without being given a full definition. If you find yourself in that situation, use the Handbook link that is provided to get at least a quick definition or some additional context, but don't get caught up in the details, and use the back button to return quickly to the tutorial.

P) A downloadable summary pdf of Digital Representation of Numbers and Audio Samples (right click or control click)

Q) A review quiz

home

A) Acoustic: an important part of acoustics deals with the characteristics of energy transfer (called radiation) from a vibrating body to the surrounding medium, or from one type of medium to another, such as from the air to the ear canal (i.e. from open to enclosed air space). The emphasis is usually on the efficiency of energy transfer. In addition, we discuss some basic parameters of sound propagation through a medium; the influence of the environment on such propagation, however, will be discussed under Sound-Environment Interaction.

After a century of electrical amplification, and even longer for large machinery, it may seem odd to realize that purely acoustic sound energy is actually rather fragile and does not easily reach our ears with a great deal of loudness unless the energy transfer process from the sound source to the air, called acoustic radiation, is made more efficient. Moreover, it takes a great deal of physical energy to produce low frequency sounds, compared with high frequency ones.

Let's consider the transfer of acoustic energy from a solid to the air. First the object must be set in motion by some force, such as tapping or hitting, bowing, plucking, scraping, shaking, or, in the case of a tube, blowing into or across it; the result is called a forced vibration.


In this video example we hit a tuning fork, similar to the one shown and tuned to the reference pitch known as A 440 (Hz), on a table (or other object) and hold it in the air. Unless we hold it close to our ears, it is barely audible at any distance. So how do you explain what happens in the video when the fork is placed on the table? Here's an answer.

This example introduces the very important acoustic phenomenon called resonance and although its implications are many, at this point we will simply emphasize one of them:

attaching a vibrating object to a resonator will amplify the sound by making the energy transfer more efficient

Since we are so used to electrical amplification to produce any degree of loudness, it is important to realize how it differs from this form of "natural" acoustic amplification: no energy is added to the tuning fork+table system (unlike an electronic amplifier); instead, the energy transfer is made more efficient. From the point of view of acoustic ecology, this distinction is fundamental, because it shows that acoustic energy, as it leads to the perception of loudness, is constrained in the acoustic world, but not so in the electroacoustic world. This is not to say that electrical amplification is necessarily "bad", but it must be constrained by human intervention.

1. Personal listening experiment: aka the famous suspended coat hanger demo.
Take an ordinary metal coat hanger and suspend it on a piece of string that is 2-3 feet long. Bang it against some objects and note the rather thin sound it makes. Then wrap the ends of the string around both of your index fingers and plug both ears with them, while balancing the coat hanger in front of you. Now bang the same objects with it (and don't worry about looking silly!).

Why is the sound you hear so entirely different? Answer here.


In the next module on Vibration, we will learn that the modes of vibration in a resonator are at specific frequencies. The tuning fork vibrates at predominantly one frequency, so the fact that the table acted as a resonator shows that it must be able to vibrate at many frequencies, including that of the tuning fork. A very small piece of wood would not work as a resonator for the tuning fork, because its surface area is small, and its resonant frequency is likely to be much higher than that of the fork.

However, if we want to design a resonating amplifier for a specific frequency (and pitch), then we need to know how to calculate its dimensions, as sketched below. The common solution, for instance in percussive keyboard instruments like the xylophone, marimba or vibraphone, is to place a tube of the correct length under each key. The tube for each key will differ in length, with longer tubes for lower pitches. The modes of vibration of a tube will be covered in the Vibration module.
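
If you are curious about the arithmetic, here is a minimal sketch in Python, assuming a tube closed at the bottom (as in marimba resonators) whose fundamental resonance occurs at a quarter of a wavelength, and ignoring the small "end correction" at the open end; the speed of sound in air is discussed later in this module.

    # Sketch: estimate the length of a closed-tube resonator for a given pitch.
    # Assumes a quarter-wavelength resonance (tube closed at one end) and
    # neglects the end correction at the open end.

    SPEED_OF_SOUND = 344.0   # metres/sec in air at 20 degrees C

    def resonator_length(frequency_hz):
        """Approximate length in metres of a closed tube resonating at frequency_hz."""
        wavelength = SPEED_OF_SOUND / frequency_hz
        return wavelength / 4.0

    print(resonator_length(440.0))   # A 440: roughly 0.195 m, i.e. about 20 cm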

Conversely, if we want to reduce the efficiency of acoustic radiation, in the sense of noise reduction, we can use the following tactics:

  • reduce physical contact between the sound source and anything nearby; this is called decoupling the vibration (physical distancing?)
  • surround the sound source with materials that will absorb the vibration; this is called damping the vibration
  • support bulky objects on heavy duty springs which impede the transmission of the vibration
  • enclose the sound source, although technically this is mainly to prevent the sound propagating through the air; it will still travel through any solid attachment that remains
Acoustic Impedance. The technical concept behind all of these examples is acoustic impedance (a concept which also operates in electrical theory as resistance). It describes the properties of materials in terms of how efficiently acoustic or electrical energy is transferred through them. Not surprisingly, there are equations for this which are beyond our scope, but the basic concept is still a valuable one, particularly in the context of resonators, coupling and de-coupling, and so on. Impedance matching means that we modify a system of acoustic energy transfer either to improve the transfer, as in the tuning fork+table example, or to reduce it, as in the de-coupling, damping and absorption instances described above.

A special case of impedance matching is the transfer of acoustic energy in or out of an enclosed air space to the open air. Visually, we do not see a barrier, such as at the opening of a tube, or in the case of hearing, at the opening of the ear to the auditory canal, but acoustically there is an impedance mismatch and the efficiency of transfer will be low.

I'm sure you've never thought of your outer ears as an impedance matching device, but they are! The outer ear, properly called the pinna, reflects and guides acoustic energy into the ear canal, also called the auditory canal. We have also learned to "cup" the ear with our hand to improve the loudness of what we're hearing, or the opposite, plugging our ears to reduce it.

2. Personal listening experiment: Cup your ear with one hand and listen to various sounds. Does this amplify the high or low frequencies better? Here's the answer.



For centuries, people with hearing loss have used an ear trumpet or ear horn to amplify sound coming to their ears, as in the above diagrams, and if you Google the term, you will find amazing images of ear trumpets in all kinds of shapes, sizes and materials, including some "stereo" examples, not to mention often humorous images and cartoons of those using them. They were the forerunners of hearing aids and stethoscopes, and have become a common meme for hearing in general.

Many such horns are conical as above, but it's generally accepted that an exponential shape, such as that used in all brass instruments, where it is called the bell, is the most efficient at transferring sound out to the open air, and minimizing it being reflected back into the tube, or in the case of the ear trumpet, maximizing sound entering the ear canal. Loudspeaker design and other forms of speaking tubes follow the same principle.

In this photo, we see a typical recording session from around 1920, prior to electrical amplification, where the sound of a musical ensemble had to be funneled into a cone-shaped tube and fed to a recording stylus. Since levels could not be controlled, louder instruments had to be placed farther away. It all looks rather crowded!



In the following diagrams we see Emile Berliner and the first phonograph, with its typical cone-shaped speaker, and in the ad for an early phonograph (1906) we see both the exponential horn, and a conical speaking tube to record sound. Prior to electrical amplification, these were the only methods to collect sounds or to radiate them by purely acoustic means. The frequency response was severely limited to the upper mid-range (250 Hz-3 kHz) which was sufficient for speech, but less adequate for music.




Sympathetic Vibration. One of the subtlest examples of acoustic energy transfer via resonance is the rather poetically named phenomenon called sympathetic vibration. When two resonating objects share the same resonant frequencies and are physically connected, even by a string or by the air, setting the first one into vibration and then damping it will cause the second object to continue vibrating on its own – admittedly with a weaker strength than the original, but at the same pitch or pitches.


A note is sung near piano strings that are free to vibrate (i.e. no damping)
Can you distinguish the two vowels from the sympathetic response of the strings?


A hand clap near piano strings that are free to vibrate (i.e. no damping)
Since the sound is unpitched, why do the strings still sound? Answer here.

Given how many visual references there are in everyday language, it is interesting that this purely acoustic phenomenon is often used as a metaphor for human relations. Think of "being on the same wavelength", or "in tune" with someone, or even more literally, "resonating" with them and their ideas or personality. These acoustic metaphors for the exchange of energy based on a shared "frequency" are rather exact!
Intrinsic and Imposed Sound Morphology.

A comparison of the morphology produced by singular through to continuous energy input (source: Wishart)

Trevor Wishart, in his book On Sonic Art, provides a useful model shown here about the form of acoustic energy input, from a singular impact through repetitive impacts, to continuous energy input. In his model, a singular impact produces an intrinsic morphology (or sound shape) where the acoustic result is mainly about the internal properties of the struck object (although as we will demonstrate later in "dual processing" we can simultaneously identify the nature of the brief energy source). At the upper part of the diagram, continuous energy input has an aural result that is mainly about the gesture being imposed on the source. This distinction can also be described as follows:

MORPHOLOGY
 INTRINSIC <–––––––––––––––––––––––––––––––––––––––––––>  IMPOSED
ENERGY INPUT
 Impact/Resonance <–––––––––––  Iterative  ––––––––––––> Continuous
INFORMATION
 Sound Object <–––––––––––––––  Texture –––––––––––––> Gesture

In terms of the information provided by these types of sound events, intrinsic morphology produces a primary image of the sound object (such as a bell, as in the lower right of the diagram), and secondarily of the energy source; imposed morphology produces a primary image of the communicative gesture, and secondarily of the sound source (as with the voice, in the upper left). Iterative energy input tends to produce textures that seem to balance the two extremes. As an example, compare striking a bottle and blowing across its lip.

Electroacoustic processes, being freed from the physics of acoustic energy transfer, can be designed to reflect real-world sources and gestures (i.e., they are abstracted from the real world) or to "defy" them with purely abstract sounds and shapes.

Geometric Spreading. Sound radiates in all directions from a sound source via a process called geometric spreading. Energy is lost as the sound travels because it is spread out over increasingly large areas. In two specific cases with simple geometry – assuming the source to be a point source in space, or a line source – the equations are simple enough to predict the loss of energy per doubling of distance: 6 dB per doubling of distance from a point source, and 3 dB per doubling of distance from a line source. We will return to this under Sound-Environment Interaction, along with all of the other factors that modify the sound wave as it propagates.
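
Here is a minimal sketch in Python of those two cases, considering geometric spreading alone; the 20 log and 10 log relations below correspond to the 6 dB and 3 dB figures just quoted.

    import math

    # Sketch: level drop due to geometric spreading alone (no absorption,
    # reflection or other environmental effects, which come later).

    def spreading_loss_db(d1, d2, source="point"):
        """Drop in level (dB) when moving from distance d1 to d2 from the source."""
        if source == "point":        # spherical spreading: 6 dB per doubling
            return 20 * math.log10(d2 / d1)
        if source == "line":         # cylindrical spreading: 3 dB per doubling
            return 10 * math.log10(d2 / d1)
        raise ValueError("source must be 'point' or 'line'")

    print(spreading_loss_db(1, 2, "point"))   # about 6.0 dB
    print(spreading_loss_db(1, 2, "line"))    # about 3.0 dB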

The speed of sound as it radiates and propagates is also well understood: about 344 metres/sec (1130 ft/sec) in air at 20°C (68°F). It is faster at warmer temperatures and slower at cooler ones, mainly because the air molecules have greater kinetic energy in warmer air. However, it should also be kept in mind that:

all frequencies travel at the same speed, and therefore arrive in synch at their destination

What is less intuitive is that sound travels faster in water and steel than in air: almost 1500 m/sec (about 4900 ft/sec) in water, and 5000 m/sec (16,400 ft/sec) in steel. You might be tempted to think that this is because these media are denser, but in fact that is not the case. The correlation is with a parameter that physicists call the elasticity of the medium: as elasticity increases, acoustic energy is transmitted through the medium much more efficiently. So, sound underwater travels 4-5 times faster than in air, depending on temperature, which affects the delays detected in sonar.
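
As a quick illustration of what these speeds mean in practice, here is a small Python sketch using the approximate values quoted above; exact figures vary with temperature and the precise medium.

    # Sketch: time for sound to travel 100 metres in different media,
    # using the approximate speeds quoted above.

    speeds_m_per_s = {"air (20 C)": 344.0, "water": 1500.0, "steel": 5000.0}

    distance_m = 100.0
    for medium, speed in speeds_m_per_s.items():
        delay_ms = distance_m / speed * 1000
        print(f"{medium}: {delay_ms:.0f} ms")   # air ~291 ms, water ~67 ms, steel ~20 ms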


Index

B) Psychoacoustic: The discipline generally known as psychophysics emerged in the 19th century as the study of the relationship between objective physical stimuli and the resulting perceptual sensations they created. The goal was to quantify the latter (the response) such that they could be correlated with the standard measurements of the former (the stimulus) in a kind of input/output model. Today we think of this as the Stimulus-Response paradigm (or S/R), and it is the logical extension of the energy transfer model in terms of what happens to the transmitted energy at the receiver of the energy, in this case the human ear.

Psychophysics includes all manner of sensory stimulation, and psychoacoustics therefore began as a subset of those studies dealing with sound. Today we might be more sceptical about the value of measuring sensations, but to 19th century scientists such as Gustav Fechner (1801-1877) it seemed very exciting to be able to make these kinds of measurements. His work was not specifically focussed on hearing, but largely on vision and some other sensory modes. His main theory, known as the Weber-Fechner law, was that sensation increases as the logarithm of the stimulus, or in simpler terms, stimuli need to increase by a multiplicative factor in order to increase the resulting sensation linearly. Although the law doesn't hold in all contexts, the general idea of a logarithmic relation in S/R perception is a common observation.
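
Here is a minimal sketch of the Weber-Fechner relation in Python; the constant k and the reference intensity are arbitrary placeholders, chosen only to show the shape of the relation.

    import math

    # Sketch of the Weber-Fechner law: sensation grows as the logarithm of the
    # stimulus, so multiplying the stimulus by a constant factor adds a
    # constant amount of sensation.

    def sensation(intensity, k=1.0, reference=1.0):
        """Weber-Fechner: S = k * log(I / I0), in arbitrary units."""
        return k * math.log(intensity / reference)

    for intensity in (10, 100, 1000, 10000):
        print(intensity, round(sensation(intensity), 2))
    # Each tenfold increase in the stimulus adds the same increment (~2.3) to S.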

This basis for classical psychoacoustics is typically expressed in five different types of measurements that reflect the S/R model:
a) response characteristics, that is, how the magnitude of the sensation caused by the stimulus relates to the physical magnitude of the stimulus

b) the threshold of sensation, that is, the minimum stimulus that produces sensation

c) the just noticeable difference (jnd) in a certain parameter of the stimulus

d) the resolution or resolving power of the system to separate simultaneous stimuli, or the way simultaneous stimuli cause a composite sensation

e) how stimulus sensation changes over time, the most common change being habituation to a repeated stimulus

We will briefly consider each of these typical types of laboratory measurements for sound. In order to relate them to frequency, the stimulus is typically a sine wave with a single frequency. In the 1920s, electrical amplification made it possible to produce such controllable sounds, just as digitally produced stimuli are commonly used in psychoacoustic experiments today. At the same time, amplitude came to be measurable in decibels, so the experimental context was set for frequency and amplitude.

The goal of this approach is to have independent control over individual variables, whereas when we progress to more complex real-world sounds, we will find that parameters are much more interrelated. However, it is still useful to understand the classical S/R approach to psychoacoustics because it continues to be influential.

In this example you will hear 7 different frequencies in descending loudness ramps of 10 steps (each step is 5 dB). Count how many tones you can hear. Do not raise the level of your playback. You may want to keep track of the frequencies you are listening to, keeping in mind they are heard twice in a row:
they are 125, 250, 500 Hz, followed by 1, 2, 4, and 8 kHz
In which of the 7 sequences did you hear the most tones? Answer
Source: IPO6
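
If you are curious how such a test signal might be constructed, here is a rough Python sketch; the frequencies and the 10 steps of 5 dB follow the description above, while the tone duration and sampling rate are assumptions, and writing the samples to an audio file is left out.

    import math

    # Sketch: build one descending loudness ramp - a tone repeated 10 times,
    # each repetition 5 dB quieter than the last.

    SR = 44100            # samples per second (assumed)
    TONE_SEC = 0.5        # duration of each step (assumed)

    def descending_ramp(freq_hz, steps=10, step_db=5.0):
        samples = []
        for step in range(steps):
            amplitude = 10 ** (-(step * step_db) / 20.0)   # 5 dB quieter each step
            for n in range(int(SR * TONE_SEC)):
                samples.append(amplitude * math.sin(2 * math.pi * freq_hz * n / SR))
        return samples

    ramp = descending_ramp(1000.0)    # the 1 kHz sequence from the example
    print(len(ramp), max(ramp))       # 220500 samples, peak close to 1.0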


a) Response characteristics. The classic example of this for psychoacoustics was originally called the Fletcher-Munson curves, which were published in 1933 by Harvey Fletcher and Wilden Munson, working at Bell Labs. They were later revised and updated as an ISO (International Standards Organization) standard.

However, it is interesting to start with the original curves for what they called equal loudness. Tests with sine tones were done over the entire range of frequencies and at different levels of intensity (measured by the then new unit of the decibel). They found that the auditory system responds very differently to frequency at low and medium sound levels, and much more equally at high levels. Here's how this can be experienced.



Audio demonstrations of equal loudness: low levels and high levels


When we look at the Fletcher-Munson curves of equal loudness upside down (no, it's not a mistake), we get a chart of how sensitive our ears are at different frequencies, keeping in mind that the low level quieter sounds are now at the top, and the high level louder sounds are at the bottom. Frequency still goes from low to high on the horizontal axis, left to right. The 10 loudness steps you heard in the above audio example are probably correlated with 5 of the upper horizontal grid lines (which are 10 dB apart), now going in ascending order from louder to softer.

The examples started with the 125 Hz tone towards the left side of the diagram. The highest dark line shows the threshold level for that frequency, so in the first few descending sequences the tones will fall below this threshold as they get quieter, whereas at 1, 2 and 4 kHz you should be able to hear all 10 tones (unless you have some hearing loss at those frequencies, which, as we will discover in the Audiology module, is quite common for noise-exposed subjects). The last sequence, at 8 kHz, probably sounded a bit less loud.

So, even if this isn't clear at this point (and we will return to the right-side-up version of the diagram in the Magnitude module), what you are looking at here is a description of the sensitivity of the ear to different frequencies at different loudness levels. The dark lines show that:

the auditory system's sensitivity to frequencies below 500 Hz falls off at mid to low intensity levels;
we are most sensitive to frequencies in the 1-4 kHz range at all loudness levels;
loudness sensitivity falls off again above 4 kHz

3. Personal listening experiment: Listen to a recording of music that is richly orchestrated, that is, with lots of bass and treble frequencies. The first time, play it at your normal loudness listening level, then repeat it at one or two much lower levels. How does the balance between treble and bass frequencies change? Answer here.




Two examples of the application of stimulus-response methodology in other related contexts. Left: the equal sensation response to vibration frequencies on the skin at different intensities; note that the general shape of the curves is similar to the Equal Loudness Contours, but shifted down to about 10% of the frequencies, with the greatest sensation in the 100-400 Hz range. Right: public response to aircraft noise in terms of the measured sound level in the Noise and Number Index system.

b) Threshold of hearing. The top line in the upside-down diagram above showed the minimum sound pressure level that will stimulate the sense of hearing. It is strikingly curved, which means it takes a lot more sound pressure for a low-frequency sound to become audible; keep in mind, though, that low-frequency sounds are usually made by high-energy sources.

Also, keep in mind that this diagram is measured under ideal conditions: a totally soundless lab environment, with normal-hearing subjects (i.e. no hearing loss) listening on headphones. In everyday situations, our threshold of hearing is constantly adjusting to the average, ambient level of the soundscape. In a noisy environment, we have a higher hearing threshold than in a quiet one, as you may have observed at night, or in a particularly quiet environment, where even small sounds are quite prominent.

Conversely, after being exposed to a loud environment, you may notice that when you leave and go to somewhere quieter, you may feel you can't hear sounds of normal loudness as well. You are experiencing a temporary threshold shift (TTS) that will take some time to recover, and if you don't get that recovery time, the threshold shift may become permanent hearing loss.


c) Just noticeable difference (jnd). The third typical psychoacoustic measurement in a Stimulus-Response type of experiment tries to find the smallest difference in the stimulus that will result in a change in sensation, i.e. a just noticeable change. This difference, commonly referred to as the jnd, is sometimes called the Differential Threshold, that is, the minimum detectable difference in a stimulus. The lower the jnd, the more sensitive we are. In general, we are very sensitive to pitch changes caused by very small frequency differences. This kind of pitch sensitivity exhibits itself most strongly in the pitch changes of the voice called inflections, and of course in musical tuning, pitch bending and other types of glissandi.

If you'd like to spend about 5 minutes taking a jnd test - which will give you some appreciation for what test subjects in psychoacoustic experiments often have to endure – try this experiment. It will show you that with a bit of experience, you can detect extremely small changes in frequency, particularly at medium loudness levels with a 1 kHz tone (where our sensitivity is at a maximum). This is because the temporal firing rate of the hair cells in the inner ear is very fine-tuned as we will emphasize in subsequent modules. If the test gives you a threshold shift, take a break for a few minutes before continuing. Good luck!


d) Resolving power of the ear. There are two standard psychoacoustic measures for frequency discrimination involving the perception of two separate tones, and the multiple frequencies found in a spectrum. We will deal with this topic in some detail under Sound-Sound Interaction. The second of these, the critical bandwidth is more important because of its role in the analysis of the spectrum of a sound, that is, its frequency content (as discussed in the second module on Vibration).

The general idea is that the critical bandwidth describes how far apart along the basilar membrane in the inner ear hair cells need to be in order to fire independently. It is therefore a spatially based analysis, compared with the temporal firing of the hair cells that accounts for the small jnd just discussed (less than 1%). By contrast, the critical bandwidth over the middle range of frequencies is a bit less than a quarter of an octave, as sketched below. This crucial distinction goes a long way to understanding how we perceive pitch and timbre, and will recur several times in the course of this Tutorial.
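
As a rough sketch of what "a bit less than a quarter of an octave" amounts to in hertz, here is a small Python calculation that takes exactly a quarter octave, centred on the frequency, as an upper bound.

    # Sketch: width of a quarter-octave band centred on a frequency, as a rough
    # upper bound for the critical bandwidth in the middle of the hearing range.

    def quarter_octave_width(centre_hz):
        upper = centre_hz * 2 ** (1 / 8)    # an eighth of an octave above
        lower = centre_hz * 2 ** (-1 / 8)   # an eighth of an octave below
        return upper - lower

    for f in (500, 1000, 2000, 4000):
        print(f, round(quarter_octave_width(f)))
    # roughly 87, 174, 347 and 694 Hz - compare that with a jnd of well under 1%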


e) Response over time. Exposure to a constant stimulus does not result in an equally constant response to it. In fact, it is quite the opposite – the response falls off as we habituate to the stimulus, also known as adaptation.

These simple diagrams illustrate the basic patterns involved in terms of perceived loudness to a constant stimulus (the hatched area).
(a) the response falls off exponentially to a constant tone.
(b) after two minutes the stimulus is doubled in loudness; there is a momentary "bump" in the response, but it never reaches the same level of sensation as at the beginning, and the decline continues.
(c) the stimulus is halved after 2 minutes, and there's a momentary drop in sensation, but the general pattern of decline resumes. This reflects the hair cells being "fatigued", that is, they are over-stimulated and cease to fire, producing what was referred to above as a threshold shift. They require some period of rest to recover and resume firing.
From a communicational point of view, this process of adaptation correlates with the pattern of "listening-in-readiness", that is, repetitive information with little salience is treated as background sound, and as long as the soundscape allows it, the hearing system is on alert for new information. Given the frequent presence of mechanized sounds in the industrialized soundscape, there can be a delicate balance between adapting to continuous sounds without them covering up (masking) informative sounds, a topic that will be picked up again in Sound-Sound Interaction.

4. Personal listening experiment: When you have gone to a party or bar with loud music playing, how loud did you think it was when you arrived, compared to an hour later? Did you notice if anyone raised the level over time?

Index

C) Electroacoustic:  There are three characteristic processes involved in electroacoustics, perhaps the most basic of which is the process of transduction, that is, the conversion of acoustic energy to an equivalent electrical signal, the audio signal, and vice versa. Secondly, there are electrical and digital devices which modify and manipulate such signals, and thirdly, there are devices which store and/or retrieve these signals. In addition, digital recording and playback converts the audio signal to digital form and vice versa, all of this creating a digital box within the "black box" in this diagram of the basic electroacoustic process.


Black box model of the transduction of a sound wave to an audio signal and back

A fascinating aspect of this electroacoustic process, driven more by the audio industry than by audio engineering, is the concept of fidelity between the input and the output signals. The term "fidelity" is derived from the Latin fidelis or faithfulness, so what does it mean for a reproduced sound to be "faithful" to the original? The engineering approach is to measure all of the response characteristics that are described below (e.g. frequency and amplitude, but they can be extended to other issues such as phase response or transient response). On the other hand, an audio designer will be more likely to enhance sounds to give them more presence or aural impact in order to create a larger-than-life impression, and in a sense an artificial reality.

An important clue to the issue of fidelity as an aural construct comes with the history of the audio industry and advertising associated with it. Given that listeners a century or more ago had never heard reproduced sound detached from a visible source, they needed to be educated about how to listen to these new experiences. Moreover, in order to justify purchasing decisions regarding reproduction equipment and recorded content, they needed to be able to make distinctions about sound quality.

In the acoustic world, all sounds are originals, and never exact repetitions, so the idea of fidelity would make no sense. Of course, a voice, music or sound signal might be harder to hear under certain acoustic circumstances, but the listening purpose would mainly be to identify the sound source and to recognize its information.

With this in mind, we might better be able to understand the phenomenon of the so-called Edison tone tests that were carried out (and advertised) between 1915 and 1926, often with large audiences. A singer (contracted with Edison) would appear and sing live, and then with the lights typically dimmed, would be "replaced" by the recorded version of his or her voice. Keep in mind this was the era of acoustic recording and playback, which clearly lacked the ability to reproduce anything but the middle range of frequencies, including those of the voice.

Audiences routinely claimed to "not be able to tell the difference" between the live and recorded versions. It would appear that once the listener could identify the voice and hear it as musical, it would seem to be the "same" – perfect fidelity! A new, more analytical form of listening was required to detach the content of the recording from its quality of reproduction, and in fact there is some evidence that fairly quickly, connoisseurs of music started making choices about better and less good sounding recordings. From the industry's perspective, the next question was how to make consumers believe that frequent increases in fidelity every few years would justify new purchases.
You might like to take a little detour into this collection of advertising imagery from over the last century that promoted the concept of fidelity through visual means. In addition, here is a comparison set of examples of the different historical audio formats from the original cylinders in the late 19th century, through to the modern stereo Hi-Fidelity records and digital CD's in the 20th century. It is complemented by the associated advertising from the acoustic disc era, the transition to Hi-Fidelity stereo records in the 1950s and 60s, and the introduction of digital CDs in the 1980s, all key moments in the history of audio, as shown here.

The transduction process is far from neutral, as implied by the term fidelity. In changing the form of the energy from acoustic to electrical, sound now takes on the characteristics of the new medium and is no longer constrained in the same way as before. For one thing, audio now travels without noticeable delays, essentially at the speed of light. Audio connections along a wire are immediate, but interestingly enough, when it gets converted again into a digital format and is transmitted, delays can occur that are referred to as latency.

In the case of musicians playing together over the internet, this lack of synchronization can be a problem. On the other hand, musicians are used to the small acoustic delays that depend on how far apart they are physically, keeping in mind that sound travels about 1 foot (.3 m) per millisecond; so if they are positioned 10-12 feet (3-4 metres) apart, they are actually functioning with a 10-12 ms delay, which sets a goal for latency reduction.
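
Here is a minimal Python sketch of that rule of thumb, using the speed of sound in air quoted earlier in this module.

    # Sketch: acoustic delay between two musicians a given distance apart,
    # as a benchmark for acceptable network latency.

    SPEED_OF_SOUND = 344.0    # metres/sec in air at 20 degrees C

    def acoustic_delay_ms(distance_m):
        return distance_m / SPEED_OF_SOUND * 1000

    for d in (1, 3, 4, 10):
        print(f"{d} m apart: {acoustic_delay_ms(d):.1f} ms")
    # 3-4 m apart corresponds to roughly 9-12 ms of delay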

And of course an audio signal can be repeated and modified at will, stored and replayed at any future date. But it also works with new constraints, such as economic factors, and provides new affordances for sound design. This is a very large topic, so for the moment we will return to the basic level of how audio signals are represented, namely in analog and digital formats, and discuss their implications.

Analog and Digital representations of sound. The term "electroacoustic" was originally hyphenated as "electro-acoustic" to emphasize the process of converting acoustic or physical energy to an electrical form, and not just sound vibrations (other variables such as movement, speed and acceleration could also be transduced). With sound, the most common form of transducer is a microphone. Conversely, the reverse process of converting electrical energy back to an acoustic form is most commonly done with the loudspeaker. The term "transduction" derives from the Latin trans or across, and duco, to lead, which emphasizes the change in type or form of energy. Note that an amplifier is not a transducer because both the input and output are electrical signals.

The transduction process results in the production of an audio signal and the term "audio" should be reserved for that form of sound. The common term A/V "audio-visual" is quite a misnomer as it confuses the electrical form with the perceptual implications of the two words used. Likewise, the term "aural" should be reserved for subjective listening experience, just as "acoustic" ideally should be reserved for the objective aspects of sound as a vibration or wave –  even though in practice, we assume it's also being heard.

The conversion of acoustic energy to an audio signal creates a two-dimensional analog version of the sound wave, most commonly by changing sound pressure to voltage. Both the eardrum and the microphone respond to changes in sound pressure, which are in fact very small. The microphone changes increases and decreases in sound pressure (around the normal atmospheric pressure) to positive and negative voltages, usually in the range of +/- 1 volt.

The eardrum, on the other hand, converts sound pressure variations in the air to an equivalent vibration in the bones of the middle ear, and then in the fluid-filled cochlea, where it ultimately triggers electrical impulses that are sent to the auditory cortex via the auditory nerve. Technically, it is this final stage of the inner ear producing electrical impulses that qualifies the auditory system as a transducer (with the energy first being transferred through three media – air, solid and liquid). However, transduction is not the most significant part of the process – it is the analysis given to the sound in the inner ear that makes the process significant.


A pink noise waveform as an audio signal

This diagram of an audio signal (pink noise in fact) is taken from a typical editing program, which displays sound in the time domain as voltage versus time. In this case, however, the time scale has been greatly enlarged to show the micro level of the signal (approximately 10 ms of sound). A good editor will allow the waveform to be edited at this level, but it is very tricky, as any discontinuities that are introduced will likely be heard as clicks; therefore, editing usually happens at a much larger time scale. Nonetheless, the diagram clearly shows the zero voltage line in the middle with positive and negative voltages around it. The vertical scale will likely be shown as a percentage or in decibels.

The further process of converting an audio signal to a digital form is not a matter of transduction, but a change in the form of representation of the signal, namely from analog voltages to binary digital numbers that are called samples. At this point, if you are unfamiliar with the representation of digital numbers in binary form, it would be good to review this material here, as several aspects of the digital representation of numbers affect the basic parameters of amplitude and frequency bandwidth.

Here we will compare aspects of the two forms of representation, analog and digital. On the surface of it, given that audible sound is an analog phenomenon (that is, a continuous variation of sound pressure), it may seem counterintuitive that a digital representation, as a series of discrete samples, might be considered as somehow optimal, and indeed there has been a lot of (often heated) discussion on this very point, so let's approach the subject more objectively with the help of some tables. As you study these tables, keep in mind that

the digital domain can only be entered (and exited) via the analog domain


                     ANALOG                           DIGITAL

REPRESENTATION       Voltage                          Binary numbers
                     (continuous)                     (discrete)

TRANSFER             Transduction:                    Conversion:
                     acoustic --> audio               analog-to-digital
                     sound pressure --> voltage       converter (ADC)

"The Enemy"          Noise                            Errors

In the last row, the somewhat facetious category of "the enemy" draws attention to the inevitable flaws in each representation. In the analog domain there is always background noise, which leads to the constant "fight"  by the audio engineer to reduce the level of the background noise as much as possible, described as improving the signal-to-noise ratio (SNR). However, it cannot be entirely eliminated, so the goal is to minimize its audibility.

Somewhat similarly, in case you're under the illusion that the digital representation of sound is "perfect", there remain sources of errors either in the transmission of the digital stream in the form of dropouts or distortions of the signal, or those errors that arise in the representation itself, as discussed next. Some listeners have become accustomed to background noise in audio (e.g. hiss) which they claim they can easily ignore because it is perceived as separate (i.e. uncorrelated) from the foreground sound and similar to other ambient background sounds in a soundscape, whereas digital distortions are correlated with the signal itself and cannot be as easily ignored. What's your experience?


                     ANALOG                           DIGITAL

RESPONSE             Frequency Response:              Sampling Rate (SR):
CHARACTERISTICS      bandwidth (flat)                 bandwidth = 1/2 SR

RESPONSE             Dynamic Range:                   Number of Binary Bits:
CHARACTERISTICS      linear dynamic range from        6 dB per binary bit
                     background noise to saturation   (20 log 2 = 6)

The quality of an analog or digital representation of sound is assessed by comparing the input and output of the transfer and measuring the difference, expressed as its response characteristics. Two of the most important of these are the response to frequency (i.e. how well the range of frequencies in the sound is represented) and the dynamic range (i.e. how accurately the range of amplitudes in the sound is represented, from the lowest level to the highest). The digital equivalents will be discussed below.

In analog recording and reproduction, it has traditionally been thought that a flat frequency response is desirable over a large range of frequencies (the bandwidth). Flat frequency response implies that all frequencies are reproduced equally, despite the fact that in the acoustic world, nothing resonates equally at all frequencies (which becomes one of the issues in loudspeaker design).

For instance, in this diagram, the complex frequency response of two violin strings (at left), due to the resonances of the violin body, is highly desirable because it adds richness to all the pitches being played; compare the diagram at the right, showing the approximately equal or "flat" response of a loudspeaker to a fairly wide range of frequencies. Critics of the "flat frequency response" paradigm point out that this criterion was a marketing tool used by the early audio industry to convince consumers of the quality of, for example, an amplifier, where it is fairly easy to achieve this measurement, and yet it is not one that correlates well with perceived sound quality in reproduction.


Frequency response characteristics of an acoustic violin sound (left) and an electroacoustic loudspeaker (right)

In terms of the response characteristics to amplitude (and therefore loudness) levels, linearity is desirable, meaning that the loudness levels of sound, from quiet to loud, are reproduced proportionately. Whereas departures from full range frequency response are immediately noticeable (as being "smaller than real life"), departures from linearity in dynamic range are seldom regarded as significant, and in fact, we will deal with designed departures from linearity called compression in the Dynamics module.

In the digital domain, the bandwidth for frequency representation is a function of the number of samples per second used during the sampling process, the theory being that this bandwidth is limited to one-half the sampling rate, the so-called Nyquist frequency, named for the engineer who formulated it. This means that to reproduce a 20 kHz bandwidth, a theoretical minimum of 40 thousand samples/second is needed. This might make some sense to you if you understand that to represent a frequency, you need a minimum of two samples (positive and negative) to describe a cycle.

The standard consumer sampling rate, as used in CD's etc., is 44.1 kHz (and the professional audio rate is 48 kHz). The reason these rates are higher than 40 kHz is that the signal needs to be smoothed from the discrete steps of a digital representation (as produced by a digital-to-analog converter or DAC) to a continuous analog form by a low-pass filter set at the half sampling rate, and that filter's frequency response above the half sampling rate (the roll-off) needs to be taken into consideration. Higher sampling rates, usually multiples of 48 kHz, are also sometimes used for other reasons.


Digital sampling of a waveform

The reason for the difference between the consumer CD sampling rate of 44.1 kHz and the professional rate of 48 kHz is another historical example of industrial interests that have little to do with the engineering involved. With the advent of digital audio CDs and Digital Audio Tape (DAT), just as when cassettes were introduced earlier, there was a great fear in the industry about mass taping of audio material, in this case as “exact” digital copies.

Therefore, the CD standard was deliberately made incompatible with the DAT machines that were also introduced during that period – at least in theory. Insiders knew that the hardware inside the DAT could in fact record and playback either sampling rate, and there was a protection mechanism that could be disabled if you knew how. And, of course, like all other taping phobias, it didn’t happen to any economically significant degree.

The dynamic range of a digital audio signal is determined by the number of binary bits used in the representation, as described in the previously cited pdf. Each additional binary digit doubles the range of numbers that can be represented (e.g. 16, 32, 64, etc.). Later, under Magnitude, we will show that a doubling of amplitude results in a 6 dB increase. Therefore a 16 bit representation should theoretically allow for a 6 x 16 = 96 dB dynamic range, which is adequate for most sounds we're likely to record. However, there will be distortions at the low, quiet end of that range, as discussed next.
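
Here is a minimal Python sketch of that arithmetic; the exact figure per bit is 20 log 2, or about 6.02 dB.

    import math

    # Sketch: theoretical dynamic range of a linear digital representation,
    # at roughly 6 dB (exactly 20 * log10(2)) per binary bit.

    def dynamic_range_db(bits):
        return 20 * math.log10(2 ** bits)

    for bits in (8, 16, 24):
        print(bits, round(dynamic_range_db(bits), 1))
    # 8 bits ~48.2 dB, 16 bits ~96.3 dB, 24 bits ~144.5 dB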


                     ANALOG                           DIGITAL

DISTORTION           Uneven or limited                Foldover
(Frequency)          frequency response               (aliasing)

DISTORTION           Non-linearity;                   Quantization
(Dynamic Range)      peak clipping                    (1/2 bit maximum error)


Many departures from the ideal of flat frequency response are experienced every day, such as the narrow, mid-range bandwidth of the telephone, or the lack of low frequency reproduction in small computer speakers, which are simply incapable of producing those frequencies (as are many headphones, despite being designed for the typical frequency ranges of popular music).

In the digital domain, a frequency that is higher than the half sampling rate will be reproduced below that rate by the same amount as it exceeds it, a phenomenon described by the term foldover or aliasing – the frequency is "folded over" back into the reproduced range. It can also be understood as a kind of under-sampling – too few samples have been taken, as shown below with dots. Instead of representing the required frequency accurately, the samples actually produce a lower frequency, as shown by the dotted line (in this case, 1/3 the correct frequency).

Something similar happens in films of something in motion, such as a wheel. At 24 frames/sec, only certain speeds of an image can be accurately represented (up to 12 revolutions/second). Beyond that, the forward motion of the wheel is replaced by a backwards motion (a familiar sight to film buffs), but with even more speed, it will go back to forward motion, but at a much slower rate.

Foldover in undersampling a wave (solid line) with too few samples (dotted line)
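
Here is a minimal Python sketch of the foldover arithmetic, valid for input frequencies between the half sampling rate and the full sampling rate.

    # Sketch: the frequency actually produced when a component above the
    # Nyquist frequency (half the sampling rate) is sampled.

    def alias_frequency(freq_hz, sample_rate_hz):
        nyquist = sample_rate_hz / 2
        if freq_hz <= nyquist:
            return freq_hz                  # represented correctly
        return sample_rate_hz - freq_hz     # folded back below the Nyquist frequency

    print(alias_frequency(12000, 19600))    # 7600 Hz, at the 19.6 kHz rate mentioned below
    print(alias_frequency(25000, 44100))    # 19100 Hz, above the Nyquist frequency at the CD rate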

We normally don't hear this distortion very clearly, but the following spectrogram shows the ghostly pattern of foldover in Barry Truax's granular synthesis piece Riverrun (1987) where the best sampling rate available at that time with the real-time synthesis hardware being used was 19.6 kHz, and therefore frequencies above about 10 kHz could not be realized, but were folded over instead.



The brightly coloured frequencies at the top are in the 5-9 kHz range, but their higher frequency components cannot be reproduced and are folded over in a kind of mirror pattern and seen here as light blue lines.

In terms of dynamic range, the digital domain suffers from the effects of quantization, the inherent inaccuracy of representing a continuous variable by a limited set of discrete values. In the standard case of representing integers by binary numbers, the error is a maximum of 1/2. For instance, if the accurate value of something is 100.5, and it can only be represented as either 100 or 101, then the error is .5.

With a 16 bit representation of a waveform, amplitude values up to +32767 can be accurately represented, so a .5 error is hardly noticeable. But when a low level sound is being reproduced, say with just 4 bits of information, not 16, the maximum positive amplitude will be 7, in which case a .5 error is very large and essentially produces a lot of quantization noise. This is often heard during the long fade-out (or decay) of an otherwise harmonic sound. A more severe case of quantization is presented here along with its waveform.
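
As a rough sketch of why fewer bits mean more quantization noise, the following Python fragment quantizes one cycle of a sine wave at different bit depths and compares the worst-case rounding error with the largest value that can be represented.

    import math

    # Sketch: quantize one cycle of a sine wave at a given bit depth and
    # report the largest rounding error relative to full scale.

    def worst_case_quantization_error(bits, points=1000):
        full_scale = 2 ** (bits - 1) - 1      # e.g. 32767 for 16 bits, 7 for 4 bits
        worst = 0.0
        for n in range(points):
            ideal = full_scale * math.sin(2 * math.pi * n / points)
            worst = max(worst, abs(ideal - round(ideal)))
        return worst, full_scale

    for bits in (16, 8, 4):
        error, peak = worst_case_quantization_error(bits)
        print(f"{bits} bits: maximum error {error:.2f} out of {peak}")
    # The error stays near 0.5 in every case, but at 4 bits that is 0.5 out of 7,
    # which is heard as quantization noise.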


A stereo waveform correctly represented by 24 bits


A stereo waveform incorrectly represented by 16 bits


A series of 5 unfiltered sine waves represented by 16 bits as shown above (note the very small steps in the waveform by clicking on the diagram) going to 12, 8, 4, and ending with a 1 bit quantization, as shown below, essentially a square wave, with each sound heard at maximum amplitude

Click to enlarge diagrams

Source: Chris Rolfe

The other type of dynamic range distortion, in both analog and digital forms, is peak clipping, that is, where the maximum amplitude of the sound cannot be accurately represented by the dynamic range of the medium. In the analog case, the signal overloads the system and its peaks are cut off, essentially adding distortion. In the digital case (at right) the situation is even worse: once the 16 bit range is exceeded (the amplitude attempts to go from 32,767 to 32,768), the value wraps around to the maximum negative, -32,768, which is a much worse distortion. See the binary representation of these numbers in the pdf for a fuller explanation.



Digital clipping (max + becomes maximum -)
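
Here is a minimal Python sketch of that wraparound behaviour, using 16-bit two's complement arithmetic; real converters and audio software normally clamp (clip) the value instead, precisely to avoid this more severe form of distortion.

    # Sketch: what happens when a 16-bit two's complement sample overflows.

    def wrap_int16(value):
        """Wrap an integer into the signed 16-bit range -32768..32767."""
        return (value + 32768) % 65536 - 32768

    print(wrap_int16(32767))       #  32767 (maximum positive, still fine)
    print(wrap_int16(32768))       # -32768 (one step too far: maximum negative)
    print(wrap_int16(40000))       # -25536 (a badly distorted peak)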

Microphone directivity. A different kind of response measurement is needed for microphones, one that not only involves their frequency response, but also their sensitivity according to direction called the directional characteristic or directivity.


Common microphone directivity patterns

In these diagrams, direction is shown in degrees, with front represented as 0°, back at 180°, and sides as 90° and 270°. The radius of the circle indicates the strength and sensitivity of its response. These are known as polar representations, defined by angle and radius. A perfect circle indicates equal sensitivity in all directions, called omnidirectional.

In that diagram, two versions of the curves are shown, one for high frequency and the other for low. When we look at the diffraction characteristics of sound in the Sound-Environment Interaction module, we will find that low frequencies bend around small and medium sized objects such as the microphone itself, whereas high frequencies do not, and therefore high frequency sounds coming from the rear of the microphone will not be recorded properly (and this will also be true of our own binaural hearing).

In practice, the omnidirectional response pattern, which might seem to be the ideal in the sense of its neutrality, does not reflect the way we actually hear, nor is it desirable from a pragmatic perspective: in field recording, the norm is to avoid picking up the recordist's own movements and to favour the sounds coming from the facing direction. So, in this and many other types of considerations, there is no such thing as a neutral or objective recording, because at every stage choices are being made about how sounds are represented in a recording.

The bi-directional, or figure-eight, directional characteristic (or field pattern) shows maximum sensitivity at both the front and back, a pattern typically used for interviews. The third example, called uni-directional, is more commonly known as the cardioid pattern because it resembles the shape of a heart. It has maximum sensitivity at the front (shown on the left side here) and reduced sensitivity at the sides, falling off to a zero response at the rear, again to minimize picking up the recordist's own sounds. All three patterns are sketched numerically below.
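
The three patterns can be sketched with a simple family of polar equations, a common textbook idealization; real microphones only approximate these curves, and, as noted above, their response also varies with frequency.

    import math

    # Sketch: idealized first-order microphone polar patterns.
    # sensitivity(theta) = a + (1 - a) * cos(theta), with theta = 0 at the front.

    PATTERNS = {"omnidirectional": 1.0, "cardioid": 0.5, "figure-eight": 0.0}

    def sensitivity(pattern, angle_deg):
        a = PATTERNS[pattern]
        return a + (1 - a) * math.cos(math.radians(angle_deg))

    for name in PATTERNS:
        front, side, rear = (sensitivity(name, d) for d in (0, 90, 180))
        print(f"{name}: front {front:+.2f}, side {side:+.2f}, rear {rear:+.2f}")
    # omni: equal everywhere; cardioid: zero at the rear; figure-eight: -1 at the
    # rear, i.e. full sensitivity but with inverted polarity.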

A stereo version of this pattern, preferred for wide-angled recordings of music ensembles and soundscapes, is called a crossed cardioid (or X-Y) pattern. Two cardioid microphones placed at an angle to each other provide a wide stereo field for the listener; this is the configuration used by members of the World Soundscape Project for their soundscape recordings, for instance.

For more information on microphones, see the Field Recording Module.




Peter Huse of the WSP with an early version of angled cardioid mics, ca 1971.


Bruce Davis of the WSP with crossed-cardioid mics touching each other, 1974.


Index

D) Electromagnetic: The transfer of an audio signal from a transmitter to a receiver, including its coding and decoding, via an electromagnetic wave has had a profound impact on communication over the past century. The process is commonly called radio transmission, and its details and variations are numerous. Here we present simply the basis of the process, common to all electromagnetic transmission. Keep in mind that the transmission is effectively instantaneous (it travels at the speed of light), and the distances involved can be enormous and do not depend on a medium of transfer such as the air, hence the term wireless.

The basic process of radio transmission involves encoding an audio signal onto an electromagnetic radio wave by the process of modulation, that is, making a specific parameter of a carrier wave – most commonly its amplitude or frequency – vary according to the modulating wave (the modulator or program signal), which in the case of radio is the audio signal. In colloquial language, the modulator is “piggy-backed” onto the carrier.

In the case of radio transmission, the carrier wave is an electromagnetic wave in a very high frequency range (kHz, MHz or GHz) which can travel large distances, and the modulator is the audio signal. The vast difference between the two frequency ranges of carrier and modulator makes it difficult to diagram, but if we arbitrarily put the carrier wave into the audio range and the modulator into the subaudio range (less than 20 Hz), and choose the amplitude form of modulation, it might look like this.


Modulating signal applied to a carrier signal to produce an amplitude-modulated signal

In the bottom part of the diagram, the modulated carrier, we can see that the instantaneous amplitude of the carrier has been made to vary according to the waveform of the modulator, and since the two signals are shown in different frequency ranges, the repetitive pattern is quite clear. At the receiver’s end, the signal is de-modulated, and the audio signal is recovered. An analogous process involving frequency modulation can be seen here. Both of these forms of modulation are used in electronic and digital sound synthesis.
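
Here is a rough Python sketch of the amplitude modulation shown in the diagram, with the carrier placed arbitrarily in the audio range and the modulator in the subaudio range, as suggested above; the modulation depth is an added parameter, not part of the diagram.

    import math

    # Sketch: amplitude modulation as diagrammed above - a subaudio modulator
    # shaping the instantaneous amplitude of an audio-range carrier.

    SR = 44100                 # samples per second
    CARRIER_HZ = 440.0         # carrier placed in the audio range (arbitrary)
    MODULATOR_HZ = 5.0         # modulator in the subaudio range (arbitrary)
    DEPTH = 0.8                # modulation depth, between 0 and 1

    def am_sample(n):
        t = n / SR
        modulator = math.sin(2 * math.pi * MODULATOR_HZ * t)
        envelope = 1.0 + DEPTH * modulator          # instantaneous amplitude
        return envelope * math.sin(2 * math.pi * CARRIER_HZ * t)

    signal = [am_sample(n) for n in range(SR)]      # one second of modulated carrier
    print(max(signal), min(signal))                 # peaks approach +/- (1 + DEPTH)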

Another analogy that can be drawn from radio transmission is the distribution and regulation of carrier frequencies, as established internationally, in the radio spectrum. How these broadcast frequencies are determined and allocated for various purposes is essentially a political and economic issue, but the point is that they must not overlap in order for simultaneous signals to be sent and received.

In the acoustic world this strategy is called the Acoustic Niche Hypothesis (ANH) that was initially formulated by a bioacoustician, Bernie Krause, to describe the soundscape of various acoustic habitats where individual species’ soundmaking falls into non-overlapping frequency ranges. We will revisit this concept in the Sound-Sound Interaction module.

A modulation theory of communication. The general concept of carrier and modulator can be carried over into the acoustic world of sound production, which has been described in this module as a process of energy transfer. For instance, vocal sound is basically produced by processing a stream of air (the carrier): passing it through the vocal folds to produce (optionally) periodic vibrations, and through the throat and mouth, which act as a resonator.

The gestures produced by this processing (the modulation) are much more important for the listener than the air stream itself, even though the latter may indicate the health of the speaker’s respiratory system. Decoding speech, then, can be thought of as a de-modulation process. In fact, all of the sound production types shown above in the Wishart diagram that result from continuous energy input may be regarded similarly.


E) Soundscape. The basis of information transfer considered in soundscape studies is the process of listening. Listening is assumed here to include all aspects of how usable information is extracted from a complex array of environmental acoustic information, how it is classified, and how the information affects subsequent behaviour, including the listening process itself.

We have shown above that the history of psychoacoustics began as a largely quantitative approach to aural perception, designed to measure its basic variables as response patterns. The soundscape approach differs from psychoacoustics in at least four significant ways. First, it departs from a linear concept of energy transfer (or the post-World War II definition of communication as a series of message exchanges) by emphasizing two-way interactions between the listener/soundmaker and the perceived acoustic environment, namely the soundscape. It is also listener-centred, as opposed to the rather passive position of being at the end of a linear series of energy transfers.

Secondly, from a communicational perspective, sound is regarded as playing a mediating role between listeners and the acoustic environment, creating relationships, both at a momentary level, and habitually in long-term patterns. It may even come to symbolize such relationships.

Thirdly, listening is understood as related to the cognitive processes of separating sound and meaning through progressive and lifelong learning. There are two main sources of meaning extraction diagrammed here: knowledge of context at the social and cultural level, and the recognition of structural patterns at the various levels of the sound itself (here identified as occurring on at least three time scales - the micro level of the waveform, the event level of the complete sound, and larger scale temporal variations). These two sources of knowledge are sometimes referred to as the outer and inner levels of complexity, respectively.



Finally, the methodology of soundscape studies is largely qualitative in terms of perceived attributes and attitudes, evaluation, psychological importance, and so on. In fact, new qualitative strategies are currently being formulated to deal with many of these complex areas. However, three of the simplest classes of perceptually based soundscape concepts, as originally formulated by R. Murray Schafer in the 1970s, are still useful places to start:
  • keynote sound: sounds perceived as background, either because of their low level constancy, such as ambience, or their frequency of occurrence, particularly if non-salient

  • sound signal: sounds perceived as foreground, usually with particular information typically encoded into them, or else associated with them

  • soundmark: sounds that are culturally and socially regarded as having significant, even symbolic, importance for a community
Other typical subjective descriptions of sounds and their environmental functions can be found under Soundscape.

5. Personal listening experiments:

(1) During a typical day when you're at home, make notes about typical sounds you hear on a regular basis, both in the background and foreground, also noting what they mean, whether expected or unexpected. Which would be classified as keynotes, signals or soundmarks and why?

(2) Find a relatively quiet public space where you can sit down. Spend 5 minutes listening to get a sense of the soundscape, then spend the next 10-15 minutes making notes about everything you hear. Afterwards, sketch a map and indicate where the sounds occurred as well as which were experienced everywhere. How would you describe this soundscape in terms of clarity, balance, functionality and personal affect?

(3) Take a half hour soundwalk in an interesting sonic environment, without recording or note-taking. Try to incorporate different pathways along the route so you can pay attention to changes in acoustic space. Also pay attention to what the soundscape makes you think about in terms of a community and other social relationships. What did the walk make you more aware of?

Index


Q.
Try this review quiz to test your comprehension of the above material, and perhaps to clarify some distinctions you may have missed. Note that a few of the questions may have "best" and "second best" answers to explain these distinctions.

home