Sources and lack of resources
Environmental sounds have received attention from various fields in auditory research. They can be treated as an instance of the problem of source recognition. Let us outline this process in simple terms. When a human being, in his everyday environment, perceives a sound coming from an object, several listening strategies are used simultaneously. Reflections produced by the surrounding space are combined with waves coming directly from the object. The auditory system parses and abstracts time-varying frequency patterns as belonging to a unique resonant body. Finally, the characteristics of the excitation process are identified by their effect on the resonant body.
The mechanisms employed in auditory source recognition have been studied from different perspectives. Grantham (1995) reviews the literature related to spatial cues in the context of sound multisource determination. Bregman (1990) focuses on the problem of grouping and parsing sound events - i.e., ‘streaming’ - within the approach of auditory scene analysis. Handel (1995) uses the term "auditory object identification" and addresses the issues involved in identifying everyday sounds.
From an ecological perspective, the concept of ‘source’ should be brought into question. Even if we overlook background sounds, moving sound sources, multiple excitations, short-term and long-term memory, emotional state, cultural and musical context, and so forth, we still have to ask, what is a "sound source?" Far from getting a definite answer, we are bound by several practical limitations. Ellis (1996) deals with the whole auditory scene instead of treating source and background as unrelated acoustic phenomena. In other words, given that the signal characteristics are modified by the space where they occur, and that in an ecological context they interact with other sources, we need to consider all processes simultaneously, not just a single isolated source. Bregman (1990, 488) puts it this way, "timbre [of a sound source] is not a result of a certain acoustic input. Timbre is to some degree created by our processes of auditory scene analysis."
These observations suggest that real-world sounds are still outside the reach of current paradigms of theoretical and empirical research. For the time being, we have to settle for simplified sound models and strive not to lose the relevant characteristics of complex real-world interactions. Of course, these limitations must be acknowledged: our synthetic sounds are just toy examples of environmental sound complexes!
Invariants, where are they?
"The ecological approach combines a physical analysis of the source event, the identification of higher order acoustic properties specific to that event, and empirical tests of the listener’s ability to detect such information, in an attempt to avoid the introduction of ad hoc processing principles to account for perception. (. . .) The information that specifies the kind of object and its properties under change is known as the structural invariant of an event; reciprocally, the information that specifies the style of change is the transformation invariant." (Warren & Verbrugge, 1984, 705-706). A single exciting source acting on different resonant bodies, and a single resonant source excited with varying strength, have been described as acoustic invariants (Warren et al., 1987).
"Invariant patterns over time refer to constant patterns of change - that is, manners or style of change." Michaels and Carello (1981) go on to give the example of a melody as a typical time-pattern which keeps invariant relationships among its elements. The structural properties are the same, even if the pitch material is transposed. By now, we know that this is quite a naive example. Is a melody played without dynamics, agogics, and timbre inflections still a melody? There are several examples of the perception of pitch being modified by interactions with other parameters (Tróccoli & Keller, 1996). So we would question the idea that a single parameter could be perceived as constant across different transformations in varying contexts. More likely, as we will discuss later, the relationship among several parameters, or collective variables as Scott Kelso (1995) likes to call them, is a better candidate for structural invariance, if such a thing exists at all.
Going back to the Gibsonian approach, "if we define events as changes in objects or collections of objects, structural invariants are those properties that specify the object or collection participating in the event." (Michaels & Carello, 1981, 26). What remains to be accounted for are the changes that occur to the objects themselves. Michaels and Carello use the term ‘transformational invariance’: "A transformational invariant is the style of change in the proximal stimulus that specifies the change occurring in or to the object." From this perspective we recognize that an object is breaking apart through some type of invariant ‘breaking’ characteristic. We argue that we will not find anything invariant in the process itself. The place to look for ‘regularities’ (Bregman, 1993) is one level higher than the physical variables themselves: in the dynamical coordination among the parameters that define the sound event. And although these processes are tightly constrained, they are random or highly complex in their micro-level details.
Rate of change
Bregman (1990, 71) discusses amplitude modulation (AM) of a tone as an instance of auditory grouping formation. He observes that when "AM is slow, the tone is simply heard as a single tone whose loudness alternatively rises and falls. As the AM speeds up, a series of separate tone bursts is perceived, the bursts corresponding to the loud phase of the AM cycle and the inter-unit gaps to the attenuated phase. (. . .) Presumably this happens because the rate of amplitude [change] exceeds a critical value and a perceptual boundary is formed. (. . .) It is likely that it is really the rate of rise in intensity that controls the formation of units." From a multifunctional perspective, we would be reluctant to agree that a single variable can produce any form of meaningful percept. Correlations with other variables should be considered. Alternatively, higher order variables, such as the range of change in AM rate, can greatly affect the resulting percept. This is clearly exemplified by sounds processed by asynchronous granular algorithms (Roads, 1996; Truax, 1994).
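The AM continuum that Bregman describes can be sketched numerically. In the fragment below, the carrier frequency, modulation rates, and raised-cosine modulator shape are our illustrative assumptions, not a reconstruction of Bregman's stimuli:

```python
import numpy as np

def am_tone(carrier_hz, am_rate_hz, duration_s, sr=44100):
    """Sinusoidal carrier with raised-cosine amplitude modulation."""
    t = np.arange(int(duration_s * sr)) / sr
    carrier = np.sin(2 * np.pi * carrier_hz * t)
    # Modulator swings between 0 and 1. At slow rates it is heard as a
    # loudness fluctuation; at fast rates separate bursts emerge.
    modulator = 0.5 * (1 - np.cos(2 * np.pi * am_rate_hz * t))
    return carrier * modulator

slow = am_tone(440, 2, 1.0)    # ~2 Hz: one tone, rising and falling
fast = am_tone(440, 20, 1.0)   # ~20 Hz: a series of tone bursts
```

Listening to the two signals at different rates between these extremes locates the perceptual boundary Bregman discusses.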
Bregman’s onset hypothesis predicts that a sudden rise in amplitude should serve as a clue to indicate the beginning of a sound. We claim that this depends on the context. If this onset is embedded in a granular texture with an average amplitude distribution that approximates the isolated sound, it will be integrated as part of a high-order percept.
Bregman (1990, 72, 708) criticizes the view that temporal units are tied to periodicities in the signal, as was suggested by Mari Jones (1976). He defines a unit as having uniform properties representing distinct events in the environment; whenever a fast change of properties occurs, a perceptual boundary is formed. "Units can occur at different time scales and smaller units can be embedded in larger ones. When a sequence is speeded up, the changes that signal the smaller units may be missed by the auditory system, and other changes, too gradual to form units at the slower speed, may now be sudden enough to control the formation of units." (Bregman, 1990, 644). This hints at clear range limitations in the production of ecologically feasible sound patterns. Physical parameters in the acoustic environment do not change independently. Thus, models of sound production need to correlate changes in amplitude and spectral modifications within narrow time constraints. Failure to do so will produce sounds that fall outside identifiable environmental sound classes.
Although rate of change is a major factor in the organization of the auditory scene, correlation among variables also seems to play an important role. The use of fast changing, widely varying parameters in granular sounds hints at organizational strategies that can distinguish different elements even within the ‘mess’ of several simultaneous granular sources. Phase coherence among streams, similar rate and type of change in different variables, correlated changes in delayed signals, all point to a common sound source. Moreover, it is not unlikely that the auditory system applies similar comparison strategies to completely different functional tasks, such as the separation of source and reflected sound, the extraction of a vibrating body’s resonant characteristics, and the estimation of the number of similar sources.
Units, who are they?
Digital signal processing techniques provide us with a reliable method to represent sound signals at a sample level (Moore, 1990; Orfanidis, 1996). Although these techniques are well-suited for time-sensitive models such as sound localization cues or spectral filtering, it is difficult to find percepts that could be directly mapped onto a single variable at the sample level. Interactions among several acoustic mechanisms, such as those discussed in physical modeling (Smith, 1992), provide a useful prediction of higher level properties from locally defined characteristics. Nevertheless, computationally efficient implementations are generally done by lumping, i.e., simplifying, descriptions of the sound behavior to provide an output that approximates a generic acoustic model (cf. Smith, 1997). Although some of these models are perceptually convincing, this approach does not start from perceptual processes but from the physical sources that produce the sounds. Although there are some exceptions (Chafe, 1989; Cook, 1997), research in this area has mainly concentrated on modeling the spectral behavior of resonant bodies, leaving aside descriptions of time-related excitation patterns. This lack of research in time processes has been pointed out by Dannenberg (1996) among others.
The next higher level of signal description falls approximately in the range of grain durations. A grain, i.e., a very short sound, is simply a windowed group of samples. Its duration ranges from a few samples (one to ten milliseconds) up to a few hundred milliseconds. It has been popularized as the sound unit in granular synthesis (Truax, 1994), though from a broader perspective it can be defined as the window of observation (Lynn & Fuerst, 1994) in several analysis and synthesis methods (short-time Fourier transform, Wavelet transform, FOF, pitch-synchronous granular synthesis, etc.) (Cavaliere & Piccialli, 1997). The granular description of sound shares some properties with sample-based techniques, such as the possibility of shaping the spectrum from the time domain, or controlling the micro-temporal structure of sound. But it also permits the use of ecologically meaningful sound events and time-patterns that are hard to tackle within a sample-based approach (Keller & Truax, 1998).
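A grain in this sense is straightforward to state in code. The following minimal sketch uses a Hann window and a sinusoidal waveform; both choices, and the parameter values, are illustrative assumptions rather than a prescription:

```python
import numpy as np

def make_grain(freq_hz, dur_ms, sr=44100):
    """A grain: a short sinusoid shaped by a smooth window."""
    n = int(sr * dur_ms / 1000.0)
    t = np.arange(n) / sr
    window = np.hanning(n)   # smooth onset and offset avoid clicks
    return window * np.sin(2 * np.pi * freq_hz * t)

grain = make_grain(880, 25)   # a 25 ms grain at 880 Hz
```

Any waveform, including a windowed excerpt of a sampled sound, can take the place of the sinusoid.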
Granular sounds require high densities of short events to produce aurally convincing sound textures. Therefore, computer music composers have adopted statistically-controlled distributions of grains limited by tendency masks, averages, deviations, probability densities, and similar methods (Xenakis, 1971; Truax, 1988). Besides the use of quasi-synchronous (periodic) grain streams in FOF (Rodet, 1984) and pitch-synchronous granular synthesis (De Poli & Piccialli, 1991), some composers have recently proposed deterministic control methods. Roads (1997) suggests a traditional note-based approach for long grain durations which can be extended to fast grain rates in order to produce micro-temporal and spectral effects. He calls this traditional compositional technique "pulsar synthesis." Di Scipio (1994) and Truax (1990) have explored the possibilities of controlling granular streams from the output of nonlinear functions. This technique offers good possibilities for the generation of macro-temporal patterns, though up to now only arbitrary mappings of isolated acoustic parameters have been used, e.g., grain frequency and grain duration. The common trend in all these approaches is to take a time line, isomorphic to absolute time, as the underlying space where the events are placed. In other words, it is in the hands of the composer to make all decisions regarding the duration, density, distribution and organization of the grains.
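The statistical control methods mentioned above can be sketched as follows. The mask shape, density, and frequency bounds are illustrative assumptions of ours, not a reconstruction of Xenakis's or Truax's implementations:

```python
import random

def schedule_grains(total_s, density_hz, freq_mask):
    """Scatter grain onsets stochastically and draw each grain's frequency
    from a tendency mask, i.e., time-varying (low, high) bounds.

    freq_mask: function mapping normalized time [0, 1) to (low_hz, high_hz).
    """
    events, t = [], 0.0
    while t < total_s:
        lo, hi = freq_mask(t / total_s)
        events.append((t, random.uniform(lo, hi)))
        # Exponential inter-onset intervals yield an asynchronous stream
        # with the requested average density.
        t += random.expovariate(density_hz)
    return events

# A mask converging from a wide band (200-2000 Hz) toward 600 Hz.
mask = lambda x: (200 + 400 * x, 2000 - 1400 * x)
events = schedule_grains(2.0, 50, mask)   # on average ~100 grains in 2 s
```

Each (onset, frequency) pair would then be rendered as a grain and mixed into the output stream.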
To constrain the space of possible organizations would mean to throw away the idea of "limitless possibilities," "the vast soundscape that synthesizers produce" (Paradiso, 1997, 18), "any sound that could ever come to a loudspeaker" (Smith, 1991, 1), and the usual rhetoric that we have heard in computer music for thirty years. As Smith (1991, 9) says, "most sounds are simply uninteresting." The ecological approach suggests that time be parsed into informationally relevant events. The perceptual system is constantly searching for new patterns of information. Thus attention-based processes are triggered by organized transformation, not by redundancy or randomness. As we have discussed, this transformation can be tracked by co-dependent variables, at several levels, changing at strictly constrained rates. Therefore, to define ecologically meaningful sound events, the grain distributions and sample-based processes have to be controlled from parameters defined by a higher level transformation. This transformation needs to be constrained to a finite event which is feasible, at least in theory, within our day-to-day environment. In other words, we are not working on an abstract time line, but from a representation which parses time into ecologically-constrained events.
Events
From a Gibsonian perspective, information is structure that specifies an environment to an animal. Thus, it is carried by a high-order organization that occurs over time (Michaels & Carello, 1981, 9). As the animal gathers information from the environment, it exerts changes on its surroundings. This activity is clearly goal-oriented.
Traditional theories of information processing consider the stimulus to be a discrete time-slice (cf. Massaro & Cowan, 1993). Michaels and Carello (1981) argue that time should be directly related to the informational structure of the stimulus. "Time is not chopped into an arbitrary succession of nows, but organized into naturally occurring events of varying duration (. . .). If time is viewed as an abstraction from change we might as well question the value of that abstraction. After all, change itself (events in space-time) is of interest to a behaving animal, not absolute time. (. . .) The notion of absolute time is given up in favor of space-time on the belief that perceivers do not perceive space and time, but events in space-time." (Michaels & Carello, 1981, 13).
Instead of working from the assumption of an absolute time which is detached from actual occurring events, we propose a model where time is parsed in event-dependent chunks. This implies that the perceptual system gets reconfigured whenever it finds new information. Change acquires a new meaning. It is not simply the variation of observed variables, but defines how these variables should be observed. That is, the significant unit of observation is the event defined by ecologically meaningful boundaries. These boundaries can be tracked by monitoring incoming information compatible with the behavior of sources existing in the environment. Sources that are not compatible with the current environment-individual state are less likely to be processed by the perceptual system. Nevertheless, when new information is found it modifies the state of all the subsystems, triggering the perception of a new event.
Patterns of change
Sound events can be described by the interaction between two systems: excitation and resonance. The excitation establishes a temporal pattern of energy input. The resonance produces a pattern of energy dissipation. When a resonant system is excited, its losses are unevenly distributed; thus some frequencies are less damped than others. Generally, objects react linearly to excitations. Their response lasts a finite amount of time after the energy source stops.
Resonant systems reach a final stable state because energy is not generated within the system but it is received from an external source - an exciting system. Exciting systems may exhibit unstable states, e.g., fire, rain, dripping water. If the exciting system behaves heterogeneously, we can safely infer that there is more than one source of energy. A resonant system is heterogeneous when it comprises various subsystems excited by a single energy source.
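The excitation-resonance interaction can be sketched with a standard two-pole resonant filter: energy enters through the excitation and is dissipated at a rate set by the decay time. The filter form and all parameter values below are our illustrative assumptions:

```python
import math, random

def resonator(excitation, freq_hz, decay_s, sr=44100):
    """Two-pole resonator: a damped mode driven by an excitation signal."""
    r = math.exp(-1.0 / (decay_s * sr))   # pole radius < 1 models losses
    a1 = -2 * r * math.cos(2 * math.pi * freq_hz / sr)
    a2 = r * r
    y1 = y2 = 0.0
    out = []
    for x in excitation:
        y = x - a1 * y1 - a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

# A 5 ms noise burst exciting a 440 Hz resonance: the response keeps
# ringing for a finite time after the energy input stops.
burst = [random.uniform(-1, 1) for _ in range(220)] + [0.0] * 4190
ringing = resonator(burst, 440.0, 0.3)
```

Because the pole radius is strictly less than one, the system always reaches a final stable state once the external energy source is removed, as described above.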
An excitation pattern is perceived as continuous because it forms a perceptual unit at a higher level. By observing the system at several time levels we see why events that are discrete at a low level form a fused percept at a higher level. Organization at one level influences the others. Therefore, this is neither a top-down nor a bottom-up process, but a pattern-formation one. For example, the micro level characteristics of a sound grain influence the meso and macro properties of the sound event.
Ecological hypotheses:
1. Patterns of change (PCs) should be recognizable.
2. To be ecologically valid, PCs should form higher-order percepts. In other words, the interaction of low-level elements shows emergent properties at higher levels. This interaction occurs among all levels.
3. PCs should be perceptually distinguishable from other PCs to be classified as different.
4. There is no unique PC for an ecologically valid percept. There is no class of PCs that has only one item.
5. Previous stimuli modify the current state of the perceptual system.
6. The perceptual system is biased toward percepts with which it has previously interacted. These percepts are more stable than unfamiliar percepts.
7. Sound models form a continuum from stable to unstable. Stable models are perceived as ecologically valid.
8. Basins of attraction are defined by a memory trace of interactions between the perceptual system and the sound environment. Short-term memory shows a fast decaying trace, long-term memory a slowly decaying trace.
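Hypothesis 8 can be given a minimal quantitative sketch as a two-timescale memory trace. The two-exponential form and the time constants below are illustrative assumptions, not empirical values:

```python
import math

def memory_trace(t_s, w_short=0.7, tau_short=2.0, tau_long=600.0):
    """Fast-decaying short-term component plus a slowly decaying
    long-term component (weights and time constants are assumed)."""
    w_long = 1.0 - w_short
    return (w_short * math.exp(-t_s / tau_short)
            + w_long * math.exp(-t_s / tau_long))

trace_now = memory_trace(0.0)     # full trace right after exposure
trace_later = memory_trace(60.0)  # short-term component has vanished
```

In this reading, the long-term component shapes the basins of attraction that bias perception toward familiar percepts (hypothesis 6).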
These simple hypotheses provide a development methodology and a validity test for ecological models. By generalizing the ecological event concept to multi-level perceptual and physical patterns of change, we can establish a model of timbre perception consistent with micro, meso, and macro sound phenomena. This model is documented at length in (Keller, 1998b).
Acoustic environs
Everyday sounds occur in various surroundings which modify their temporal and spectral characteristics. "The environment destroys any simple invariance between an acoustic event and the waveform that arrives at the ear. (. . .) The perceptual system thus faces a general problem of separating properties of the source from the properties of the [environment]." (Darwin, 1990, 220-221).
Sounds reflected from surfaces are either perceived as part of the sources or are heard as separate from them. Darwin (1990) offers a speech-oriented account of the interactions between source and environment. He mentions reverberation, echo, static and dynamic spectral effects. The temporal effects that he discusses can be grouped as synchronous or asynchronous change, such as onset and offset differences, or varying modulation rates. Grantham (1995) provides a thorough survey of recent advances in spatial hearing, covering several issues which are further discussed in (Keller, 1998b): reflections, reverberation, localization, and lateralization.
In his (1993) paper, Bregman discusses the cues used by the auditory system to deal with the complexity of natural environments. He calls them regularities. These are his guidelines for understanding the perceptual processes involved in sequential and simultaneous integration of environmental sounds.
Regularities:
1. Unrelated sounds seldom start or stop at the same time.
2. Gradualness of change: (a) a single sound tends to change its properties smoothly and slowly, (b) a sequence of sounds from the same source tends to change its properties slowly.
3. When a body vibrates with a repetitive period, its vibrations give rise to an acoustic pattern in which the frequency components are multiples of a common fundamental.
4. Many changes that take place in an acoustic event will affect all the components of the resulting sound in the same way and at the same time.
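Regularities 3 and 4 can be illustrated together in a short sketch: a harmonic complex whose frequency components are multiples of a common fundamental, all sharing one amplitude envelope (a common fate). The fundamental, number of partials, and envelope are illustrative assumptions:

```python
import numpy as np

def harmonic_event(f0, n_partials, dur_s, sr=44100):
    """Partials at integer multiples of f0 (regularity 3), all scaled
    by the same decaying envelope (regularity 4)."""
    t = np.arange(int(dur_s * sr)) / sr
    envelope = np.exp(-3.0 * t)   # one change affects every component
    tone = sum(np.sin(2 * np.pi * f0 * k * t) / k
               for k in range(1, n_partials + 1))
    return envelope * tone

event = harmonic_event(220.0, 8, 0.5)
```

Violating either regularity, e.g., mistuning one partial or giving it an independent envelope, tends to make that component segregate as a separate source.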
A deeper understanding of the source separation process, what psychoacoustic researchers call ‘the cocktail party effect,’ allows us to apply ecologically-constrained sound processes to design a compositionally effective sound space. Some tools used for this purpose, such as convolution and phase-controlled granulation (Keller & Rolfe, 1998), are discussed in the next section.
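Convolution, as an environment-imposing tool, can be sketched directly. The source and impulse response below are synthetic stand-ins assumed for illustration, not the materials of Keller & Rolfe (1998):

```python
import numpy as np

def apply_environment(dry, impulse_response):
    """Impose an environment on a dry source by direct convolution."""
    return np.convolve(dry, impulse_response)

sr = 44100
# Source: a train of clicks, one every 100 ms, over half a second.
dry = np.zeros(sr // 2)
dry[::sr // 10] = 1.0
# Environment: exponentially decaying noise as a crude stand-in for a
# measured reverberant impulse response.
rng = np.random.default_rng(0)
ir = (rng.standard_normal(sr // 4)
      * np.exp(-6.0 * np.arange(sr // 4) / (sr // 4)))
wet = apply_environment(dry, ir)
```

The output illustrates Darwin's point quoted above: each click is smeared by the environment, so the waveform at the ear no longer bears a simple invariant relation to the source event.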
Summary
We have outlined the basic ideas of the ecological approach to auditory source recognition. This approach is defined by the interaction of environmental constraints with the individual’s goal-oriented activity. This activity not only takes place at the level of auditory processing, but it is also related to interactions among other sensory modalities. We address these issues in (Keller, 1998b). Listening takes place within a specific cultural context. The individual, through his sound-producing and sound-listening activities, modifies and is modified by his environment. This process was described as structural coupling by Varela et al. (1989). Its relevance to music composition is discussed in (Keller, 1998a).
The ecological approach can be characterized by a few theoretical assumptions. A system cannot be studied by isolating its parts. All models should be constrained to ecologically feasible events. Ecological validity is defined by the observation of complex interactions actually occurring in the environment. The action of the individual on the environment and the influence of the environment on the individual determine a process of pattern formation. This process can be approximately modeled by algorithmic tools. Both spectral and temporal characteristics of sound events need to be accounted for in the modeling process.