PSYCHOACOUSTICS
RELATED TO MULTI-CHANNEL AUDIO
|
The basic psycho- and physioacoustic phenomena behind multi-channel
reproduction are studied in this presentation. The dynamic range of
human hearing is the basis of the specifications for audio baseband
frequency and dynamic range. Besides directional hearing, critical
bands and masking are discussed as a basis for understanding the new
bit-compressed digital audio transmission and recording methods. The
very basic theory of directional hearing in 4π space is studied in the
chapter Pinna Cues of Sound Direction. These findings are also used in
virtual sound source technology. This theory also offers a very good
electrical solution for correcting the acoustical irregularities of a
speaker system and so improving the stability of the stereo image. The
distance of a sound source is difficult to perceive in sound
reproduction. Some reasons for this difficulty are explained in the
chapter Perception of Distance.
The most important hearing characteristic for multi-channel audio is
localization of the sound source. The other important phenomena are
the hearing area, critical bands and masking, especially for new
digital multi-channel systems, where these features are used to reduce
the bit count of the bitstream.
DYNAMIC RANGE OF
HEARING
The dynamic range of hearing, i.e. the area between the quiet
threshold and the threshold of pain, is a plane in which audible
sounds can be displayed. In its normal form, the dynamic range of
hearing is plotted with frequency on a logarithmic scale as the
abscissa, and sound pressure level in dB on a linear scale as the
ordinate. This means that two logarithmic scales are in effect used,
because the level is related to the logarithm of sound pressure. The
critical-band rate may also be used as the abscissa. This scale is
closer to the characteristics of our hearing system than frequency.
The usual display of the dynamic range of hearing is shown in figure
1. On the right, the ordinate scales are sound intensity in watts per
square meter and sound pressure in pascals. Sound pressure level is
given for a free-field condition relative to 2×10⁻⁵ Pa. Sound
intensity level is plotted relative to 10⁻¹² W/m².
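The relation between the level scale and the pressure and intensity scales can be sketched in a few lines of Python. This is only an illustration, not part of the original presentation: the function and constant names are made up here, and the only values taken from the text are the two reference quantities.

```python
import math

P_REF_PA = 2e-5      # reference sound pressure from the text, in Pa
I_REF_W_M2 = 1e-12   # reference sound intensity from the text, in W/m^2


def sound_pressure_level_db(pressure_pa: float) -> float:
    """Sound pressure level in dB relative to 2e-5 Pa."""
    if pressure_pa <= 0:
        raise ValueError("pressure must be positive")
    # Pressure is a field quantity, hence the factor 20.
    return 20.0 * math.log10(pressure_pa / P_REF_PA)


def sound_intensity_level_db(intensity_w_m2: float) -> float:
    """Sound intensity level in dB relative to 1e-12 W/m^2."""
    if intensity_w_m2 <= 0:
        raise ValueError("intensity must be positive")
    # Intensity is a power quantity, hence the factor 10.
    return 10.0 * math.log10(intensity_w_m2 / I_REF_W_M2)


if __name__ == "__main__":
    for p in (2e-5, 2e-2, 20.0):
        print(f"{p:g} Pa -> {sound_pressure_level_db(p):.1f} dB")
```

With these reference values a pressure of 20 Pa and an intensity of 1 W/m² both come out at about 120 dB, which is why the two right-hand scales of such a figure can share one level axis.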
|
Figure
1 Dynamic range of hearing. The ordinate scale is not only expressed
in sound pressure level but also in sound intensity and sound
pressure.
|
CRITICAL BAND
It is assumed that our hearing system processes sounds in relatively
narrow frequency bands. It has been discovered that the part of a
noise that is most effective in masking a test tone is the part of
its spectrum lying near the tone. Masking is achieved when the power
of the test tone equals the power of that part of the noise spectrum
which lies near the tone and produces the masking effect. The parts
of the noise spectrum farther from the test tone do not contribute to
masking. Characteristic frequency bands defined in this way have a
bandwidth that gives the same acoustic power in the tone and in the
noise spectrum within that band when the tone is just masked.
Data from many subjects have been collected to produce a reasonable
estimate of the width of the critical band. Although the lowest
critical bandwidth in the audible frequency region may be very close
to 80 Hz, it is attractive to add the inaudible range from 0 Hz to 20
Hz to that critical band, and to assume that the lowest critical band
ranges from 0 Hz to 100 Hz. Using this approximation, figure 2 shows
the critical bandwidth averaged over levels between the quiet
threshold and 90 dB. There is a small tendency for the critical
bandwidth to increase somewhat for levels above 70 dB.
|
Figure 2 Critical bandwidth as a function of
frequency. Approximations for low and high frequency ranges are
indicated by broken lines [ZWFA90].
|
MASKING
Masking
plays a very important role in everyday life. For a conversation on
the sidewalk of a quiet street, little power is necessary for the
speakers to understand each other. However, if a loud truck passes
by, the conversation is disturbed. The speakers can no longer hear
each other if speech power is kept constant.
|
Figure 3 Schematic drawing to illustrate and
characterize the regions within which premasking, simultaneous
masking and postmasking occur. Note that postmasking uses a
different time origin than premasking and simultaneous masking
[ZWFA90].
|
Figure 4 Level of test tone just masked by
critical-band wide noise with center frequency of 1 kHz and
different levels as a function of the frequency of the test tone
[ZWFA90].
|
Masking effects can be measured not only when masker and test tone
are present simultaneously, but also when they are not. In the latter
case, the test sound has to be a short burst or sound impulse. It can
be presented before the masker stimulus is switched on. The masking
effect produced under these conditions is called pre-stimulus
masking, or premasking (figure 3). This effect is not very strong,
but if the test sound is presented after the masker is switched off,
a quite pronounced effect occurs. Because the test sound is present
after the termination of the masker, the effect is called
post-stimulus masking, or postmasking. Figure 4 shows the dependence
of the masked threshold on the level of a narrow-band noise centered
at 1 kHz. Narrow-band noise means a noise with a bandwidth equal to
or smaller than the critical bandwidth (about 100 Hz below 500 Hz
and 0.2 f above 500 Hz).
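The two-part rule of thumb for the critical bandwidth quoted above can be written out as a small Python sketch. The function names are hypothetical, and the rule is only the coarse approximation given in the text, not the measured curve of figure 2.

```python
def critical_bandwidth_hz(center_hz: float) -> float:
    """Approximate critical bandwidth: about 100 Hz below 500 Hz, 0.2 f above."""
    if center_hz < 0:
        raise ValueError("frequency must be non-negative")
    if center_hz < 500.0:
        return 100.0
    return 0.2 * center_hz


def is_narrow_band(bandwidth_hz: float, center_hz: float) -> bool:
    """True if a noise band is no wider than the critical band at its center."""
    return bandwidth_hz <= critical_bandwidth_hz(center_hz)


if __name__ == "__main__":
    for f in (100.0, 500.0, 1000.0, 4000.0):
        print(f"{f:6.0f} Hz: about {critical_bandwidth_hz(f):.0f} Hz wide")
```

The two branches meet at 500 Hz, where 0.2 f is also 100 Hz, so the approximation is continuous. At the 1 kHz center frequency of figure 4 it gives a critical bandwidth of about 200 Hz.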
|
LOCALIZATION OF A SOUND SOURCE
Human hearing is quite attuned to detecting the direction of sounds
in the horizontal plane [JUTO84]. It is most sensitive below 1 kHz.
For frequencies over 2 kHz the sensitivity drops only slightly. In
the frequency range from 1 kHz to 2 kHz the ability to detect
directions is lower. The main mechanism for separating sound
directions below 1.5 kHz is the time difference between the ears, and
above 1.5 kHz it is the level difference between the ears.
If the same sound reaches the two ears at different times or with
different levels, the sound source is localized in a direction that
depends on the loudness or on the order of arrival. Level and time
differences are interchangeable within limits. The level and time
differences needed to compensate each other are frequency dependent.
At low frequencies the time needed to compensate for level is about
2 µs/dB, and at high frequencies up to 100 µs/dB is needed. When the
time difference is over 2 ms, the mechanism does not work and the
first incoming sound is totally dominant.
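The trading between level and time can be illustrated with a short Python sketch. It assumes a simple linear trade, which the text does not state. The function name is hypothetical; the 2 µs/dB and 100 µs/dB ratios and the 2 ms limit are the figures quoted above.

```python
from typing import Optional

MAX_TRADING_DELAY_US = 2000.0  # 2 ms limit quoted in the text


def compensating_delay_us(level_diff_db: float,
                          trading_ratio_us_per_db: float) -> Optional[float]:
    """Time difference that would offset a given interaural level difference.

    Assumes a linear trade (an assumption of this sketch). Returns None when
    the result exceeds 2 ms, where the text says the first-arriving sound
    dominates and trading no longer works.
    """
    delay_us = abs(level_diff_db) * trading_ratio_us_per_db
    if delay_us > MAX_TRADING_DELAY_US:
        return None
    return delay_us


if __name__ == "__main__":
    print(compensating_delay_us(6.0, 2.0))     # low-frequency ratio
    print(compensating_delay_us(6.0, 100.0))   # high-frequency ratio
    print(compensating_delay_us(30.0, 100.0))  # beyond the 2 ms limit
```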
Pinna cues of sound direction
Interaural differences in time and level are considered to be the
major factors in directional hearing. They cannot, however, account
for localization in three-dimensional space. Sound sources that cause
identical interaural time differences lie on a hyperbolic cone around
the interaural axis. The mechanism by which humans detect the
elevation of a sound source is not well known, nor is front-back
discrimination. One way to approach the problem is to study the
function of the pinnae with sound sources in three-dimensional space.
|
Figure 5 Descriptive diagram of pinna
|
Above 1 kHz, it appears that directional information is processed
basically in the frequency domain as spectral variations, while only
secondary cues are handled in the time domain. Below 1 kHz the
dimensions of the head and ears are such that time delay is the only
primary cue to the direction of sound. The first 500 µs to 1 ms is
dominant for detecting the source. Repeated directional information
arriving after 1 ms has a smaller effect, until zero impact is
reached at 10 ms (Haas effect?) [HBR88]. For early man this was all
that was necessary for survival. There was no time for a complex
process to detect the direction of an arriving sound. The consequence
is that directional cues have to be as simple and as clear as
possible.
|
Figure 6 Frequencies of judgment front (f), behind
(b) and above (a) for one-third octave bands of noise [JEBL70].
Directional bands: Bordered- at 90 % level of significance and
shaded- most likely
|
Figure 7 Spectral features that vary monotonically
with decreasing sound source elevation in vertical plane (above) and
increasing source azimuth in horizontal plane [HOHA94].
|
The processing of cues has to be effective and well suited to the
neural system. Some experiments conclude that in some cases, for the
sake of simplicity, directional information is processed monaurally
[HBR88].
In the head-related transfer function (HRTF) there appears to be a
notch in the high-frequency range that is a function of the elevation
of the sound source. The notch itself may not be the primary cue, but
its left slope, which varies systematically and monotonically from 6
to 13 kHz as the elevation increases, certainly is. This is the only
cue below ear level. Although a spectral slope is difficult for
electronic equipment to measure, detecting it is simple for the
neural system, while the detection of a spectral minimum is a more
complex task. Experimentally, the sensation of an elevated sound
source can be achieved by [HOHA94]
|
** 3.9...8.0 kHz low-pass cutoffs. Increasing the cutoff within this frequency range increases the elevation angle from 0 to 60 degrees.
** 4.0...7.2 kHz band-pass filtering is perceived in front with an elevation of 60 degrees.
** 7.4...10.8 kHz notches cause the elevation to increase from 0 to 60 degrees as the notch frequency increases.
** 10.3 kHz low-pass filtering causes an elevation of 90 degrees.
** 8.1...9.1 kHz band-pass signal causes the sensation of 90 to 60 degrees elevation in the rear section.
** 12.0...17.8 kHz notch causes an elevation of 90 degrees.
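As a rough illustration of the notch cue in the list above, the following Python sketch maps a notch frequency between 7.4 and 10.8 kHz to an elevation between 0 and 60 degrees. The linear interpolation is purely an assumption of this sketch: the text only says that the elevation increases with the notch frequency. The function name is hypothetical.

```python
NOTCH_RANGE_KHZ = (7.4, 10.8)      # notch frequency range quoted in the list
ELEVATION_RANGE_DEG = (0.0, 60.0)  # corresponding elevation range


def elevation_from_notch_deg(notch_khz: float) -> float:
    """Perceived elevation for a notch frequency, assuming a linear mapping."""
    f_lo, f_hi = NOTCH_RANGE_KHZ
    e_lo, e_hi = ELEVATION_RANGE_DEG
    if not f_lo <= notch_khz <= f_hi:
        raise ValueError("notch frequency outside the 7.4...10.8 kHz range")
    # Linear interpolation between the end points (assumed, not measured).
    fraction = (notch_khz - f_lo) / (f_hi - f_lo)
    return e_lo + fraction * (e_hi - e_lo)


if __name__ == "__main__":
    for f in (7.4, 9.1, 10.8):
        print(f"notch at {f} kHz -> about {elevation_from_notch_deg(f):.0f} degrees")
```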
|
Besides the level, the right-hand edge of a spectral notch is shifted
toward higher frequencies in the right ear when the sound source
moves clockwise. The working area of this cue is from 40 to 180
degrees. In the frontal section a double notch would work as a
directional cue from -40 to +30 degrees. It is the same double notch
that indicates the elevation at zero azimuth. In horizontal-plane
localization the concha plays the major role in the HRTF.
The sensation of a sound source behind the listener can be achieved
with 13.2...15.5 kHz high-pass or 14.5 kHz band-pass filtered noise.
In the first case there is no elevation and in the latter the source
is elevated 30 degrees. The common denominator is a spectral slope
ascending toward the high frequencies. In the HRTF there is also, at
16 kHz, a deep notch that doesn't appear in the frontal section.
Front-back discrimination appears to be based on a level difference
in a band at these frequencies. In the lower frequency band from
3.75...7.5 kHz, depending on the azimuth angle, there is a deep notch
with rear-section sound sources but none with frontal sources. The
third mechanism for achieving a rear sensation might be a boosted
frequency band between 7 and 12 kHz.
The frequency bands affecting the sensation of front, back and
overhead sound sources, with one-third octave band noise as the
stimulus, are presented in figure 6. Figure 7 shows the main pinna
cues that are probably active in detecting the elevation and azimuth
of the sound source. In these measurements a torso in an anechoic
chamber has been used. The microphone is mounted in the right ear of
the torso.
|
Perception of distance
Human hearing is not very accurate in detecting the distance of a
sound source. Loudness is perhaps the most widely accepted cue for
distance. Loudness decreases inversely with the distance in anechoic
conditions, and about 20 phon [MBGA69] is needed to halve the
perceived distance. In reverberant conditions the situation is a
little different. The decrease in loudness is not as large as in
anechoic conditions, and the loudness difference needed to double the
perceived distance varies from 22 to 41 phon [SØNI93]. For accurate
perception of distance the absolute loudness must be known. The
distance of the source is overestimated at low sound levels, but the
estimate improves considerably as the sound level increases.
Variations between individuals are large. In anechoic conditions it
is almost impossible to estimate the distance with no cue other than
loudness.
The shape of the frequency spectrum is another factor that may give a
cue for ascertaining the distance of a sound source. Air damping is
frequency dependent. The loss of high frequencies is greater than the
loss of low frequencies. Besides direct air damping, in reverberant
rooms sound is also absorbed at each reflection from the walls. The
loss of loudness depends on the materials of the walls. Estimating
the distance of a sound source requires knowledge of the
characteristics of the room and the frequency spectrum of the source.
The ratio between direct sound, early reflections and reverberation
also offers a cue to the distance of a sound source. In this case a
known sound and known room characteristics are required for good
perception. With a known source like the human voice the
reverberation ratio appears to be the most important factor in
estimating the distance [SØNI93]. Binaural differences may also
sometimes be a valuable cue. The significance of this factor is not,
however, very strong. Experiments have also shown that the distance
of a source is overestimated when the source is at an angle to the
receiver as opposed to directly in front.
SOUND
FIELD SIMULATION
It has long been known that there is a high correlation between
auditory spaciousness and preference in concert hall types of sound
fields. This is confirmed by psychoacoustic experiments [JBWL86]. The
correlation of preference versus spaciousness is presented in figure
8. Normally, however, spaciousness information is minimized in
recorded or broadcast program material. A very common technique for
recording music is to place the microphones close to the instruments.
With this method the signal level is highest above the noise. External
disturbances like air conditioning, airplanes and traffic are also
minimized. Pop and rock music is normally recorded in sound studios,
and it is common that the final recording is assembled from short
takes. It is not unusual that the entire band is not recording at the
same time. These sound studios are heavily damped, producing dry
acoustics, while close-range microphones and multi-channel recorders
are used. In reproduction the directional information has to come
from the right direction for a satisfactory impression of
spaciousness. The only way to do this is to add it in afterward.
|
Figure 8 Plot of spaciousness versus preference
[JBWL86]. As test material a motif of Mozart's Jupiter symphony has
been used.
|
Figure 9 A classic presentation of reflections in a
concert hall (Capitol Theater in Yakima, Washington, USA) [HECH79].
The sound source is marked as A and the receiver as B. Direct sound
is c and the reflected ones are d and e.
|
Besides direct sound, reflected sounds can also be heard in a closed
space (figure 9). Sound is reflected from the walls, ceiling and
floor. The intensity of reflected sound decreases over time for three
reasons. First, the sound pressure decreases inversely with the
distance. Secondly, at every reflection the sound wave loses energy;
how much depends on the construction and material of the reflective
surface. The energy loss depends on frequency, so the spectrum of the
sound is reshaped during the reflections. The last factor is that the
air attenuates sound; again this is frequency dependent.
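The three loss mechanisms can be put together in a small Python sketch that estimates the delay and level of one reflection relative to the direct sound. It is a simplification under stated assumptions: a speed of sound of 343 m/s, one broadband absorption coefficient per surface and one broadband air loss figure, although the text points out that both losses are frequency dependent. All names and parameters are hypothetical.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C


def reflection_delay_ms(direct_path_m: float, reflected_path_m: float) -> float:
    """Arrival delay of a reflection after the direct sound, in milliseconds."""
    return (reflected_path_m - direct_path_m) / SPEED_OF_SOUND_M_S * 1000.0


def reflection_level_db(direct_path_m: float,
                        reflected_path_m: float,
                        absorption_coefficients=(),
                        air_loss_db_per_m: float = 0.0) -> float:
    """Level of a reflection relative to the direct sound, in dB."""
    if not 0.0 < direct_path_m <= reflected_path_m:
        raise ValueError("need 0 < direct path <= reflected path")
    # 1) Pressure falls inversely with distance: 20*log10 of the path ratio.
    level_db = -20.0 * math.log10(reflected_path_m / direct_path_m)
    # 2) Each surface keeps the fraction (1 - alpha) of the incident energy.
    for alpha in absorption_coefficients:
        if not 0.0 <= alpha < 1.0:
            raise ValueError("absorption coefficient must be in [0, 1)")
        level_db += 10.0 * math.log10(1.0 - alpha)
    # 3) Extra air attenuation over the additional path length.
    level_db -= air_loss_db_per_m * (reflected_path_m - direct_path_m)
    return level_db


if __name__ == "__main__":
    delay = reflection_delay_ms(10.0, 20.0)
    level = reflection_level_db(10.0, 20.0, absorption_coefficients=[0.5])
    print(f"delay {delay:.1f} ms, level {level:.1f} dB re direct sound")
```

Doubling the path length alone costs about 6 dB, and a surface that absorbs half of the incident energy costs a further 3 dB or so.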
The reflected sound field is divided into early reflections and
reverberation. The response to a sound impulse is presented in
figure 10.
|
Figure 10 Schematic presentation of the impulse response
of a room. The time scale is only an example, not an actual one.
|
Table 1 Effect of a single reflection [MBAM81]
|
Delay (ms) | Frontal reflection | Lateral reflection
0 | Loudness | Apparent image size or image shift
5 | Tone coloration | -
10-20 | - | Tone coloration
3-60 | - | Spatial impression
>80 | Echo disturbance | Echo disturbance
|
The border between early reflections and reverberation is not clear,
but it would be somewhere between 50 and 100 ms. As early reflections
cannot be separated from the direct sound, they increase the total
loudness. Reflections that create the sensation of two different
sounds belong to the reverberation field. If the direct and reflected
sounds have the same intensity they are perceived as two sounds
[ALHA76]. At 100 ms and with a level difference of 6 dB they are
perceived as one source. The subjective effect of a single reflection
is described in table 1.
Spaciousness is a multi-dimensional perceptual attribute. The most
important cues for spaciousness are early reflections. They are not
the only cues, but the other components of the post sound and
reverberation field offer only complementary information. In cases
where the level and number of early lateral reflections are low, the
reverberation can be used as a substitute. Sometimes the
reverberation is perceived as a disturbance.
To be valuable for spaciousness the early reflections have to arrive
from directions other than that of the direct sound. All spectral
components of those reflections carry information about the room.
Reflectors have to work over a large frequency band to be effective.
The spectral components of reflections work differently. Frequencies
below 3 kHz in lateral reflections mainly expand the sensation of
depth. If the reflections contain components above 3 kHz, the
perception of a broadening of the room will be prominent. In figure
11 the reflection patterns of two different shapes of halls have been
studied [JEBO85].
|
Figure 11 Directions of early reflections in two
different shapes of halls. The reflection pattern of a
rectangular room is on the left, and on the right is that of
a fan-shaped hall with a straight rear wall [JEBO85]
|
Figure 12 Spatial impression against lateral delayed
reflection pair at a 40 degree angle with Mozart's motif. The
95 % confidence limits are presented as vertical lines [MBAM81].
|
During the first 100 ms the sound will reflect over a hundred times,
creating the same number of image sound sources. At this time the
signal processors for consumer use are not powerful enough to handle
this amount of audio data. Data reduction is needed. Only the most
important reflections can be taken into account. Reflections from the
front section are ignored, as are those arriving from the same
direction within 1 ms of each other. Elevated image sources are also
discarded. This should leave only five to ten reflections to process,
few enough for consumer processors. The effect of one reflection pair
at an angle of 40 degrees is presented in figure 12.
|
Figure 13 Quality of listening experience according
to count of reproduction channels [NMKOS71] when music
is used as program source in frontal section.
|
PHANTOM
SOUND SOURCE CREATION
The psychoacoustic targets of multi-channel reproduction systems can
be split into two main groups. The goal of the first group is to
create super directionality so that the sound source can be located
anywhere around a listener. Results with such systems are impressive,
but they sometimes tire the listeners. The other possibility is to
create an ambient field surrounding a listener as in a concert hall.
SUPER DIRECTIVITY
If accurate localization everywhere around the listener is desired,
more than two channels and speakers have to be used. Many experiments
have shown that the four channels used in quadraphony are not enough
[MCA68] and the need for more channels is obvious (figure 13). The
first step is to place a fifth channel between the front left and
right speakers. Sound localization with this system is somewhat
confusing and the result is not very good [GTGP76]. A six-channel
system is possible. The extra speakers can be placed between the left
and right channels in both the front and rear. In this layout the new
speakers provide a real source at the point where phantom accuracy is
highest. The alternative method is to place the new speakers where
phantom accuracy is lowest, at the sides of the listener. Some
experimenters [GTGP76] prefer this six-channel configuration over the
first one.
In rectangular quadraphony the most accurate phantom localization is
achieved in the front quadrant. The image is easily perceived and the
shift with inter-channel level difference is smooth and sharp. If
out-of-phase phantoms are used the image is unstable and front-rear
confusions are common. The sound source has also been localized
outside the speakers. The phantom source behavior in the rear
quadrant is very similar in both in-phase and out-of-phase cases. The
shift with inter-channel level difference is sometimes sharper, with
rear-front confusions increasing slightly. In the side quadrants the
localization of a phantom source is almost impossible to achieve
[RICA79].
|
Figure 14 Preferred angles of front speakers
|
Figure 15 Phantom source localization in four speaker
system between front and right side speaker according
to level difference [RICA79]
|
In the arrangement presented on the right side of figure 4.2, phantom
source localization is almost as good as in the frontal section of
the rectangular quadraphony system. A phantom source generated in
phase shifts smoothly from the center to the side speaker according
to the difference in sound level. Front-rear confusion is a little
higher than in the front quadrant of the rectangular system.
Out-of-phase signals between the front and rear speakers caused an
extremely vague phantom, and the rate of front-rear confusion was
high. The behavior of the front-side speakers is presented in figure
15. If the levels of the signals are nearly equal, the sound is
localized to the rear and the opposite side. Sources in the two rear
quadrants have very unstable, non-localized phantoms, and the
rear-front confusion rate is also very high. This system is totally
useless for creating any localized rear source [RICA79].
Inter-channel phase difference
It is a known phenomenon that a phase difference can create a phantom
source shift toward the leading speaker. Compared with inter-channel
level difference, this method of controlling the phantom source is
rarely used. The reason may be that experiments have shown that
localization based on inter-channel phase is difficult to achieve and
the stereo image is not well defined. Theoretically the inter-channel
phase delay can be reduced to an inter-channel level difference
[FOED8X].
|
Figure 16 Stereophonic system geometry
|
The stereophonic geometry presented in figure 16 has a phase delay
(in radians) between the speakers at the listening point. The
wavefront H(x) [BBE85] generated along the x-axis is
(1)
where L and R are the amplitudes of the left and right channels,
k = 2π/λ is the wave constant and λ is the wavelength. For the
listening geometry in figure 16 and the frequencies to be
considered [BBE85]
(2)
(3)
Substituting equations (2) and (3) into equation (1), neglecting the
terms that carry no directional information as well as the radial
amplitude reduction in a divergent wavefront, assuming that speaker
polar diagram variations are small in the area occupied by the head,
and considering that there is no level difference between the left
and right sources, equation (1) can be written
(4)
The
equation (4) has the maximum
(5)
and
respectively
(6)
If the delay is equal to 0, the maximum is at x=0 on the center axis
of the speakers, and if the delay is nonzero the maximum shifts
toward the leading speaker.
The direction of the arriving sound can be calculated from the
equation [CDV57]
(7)
where Le and Re are the average instantaneous sound pressures from
the left and right loudspeakers at the left and right ears
respectively. Assuming the listener is on the center axis of the
speakers, the sound pressures of the interference pattern due to the
phase difference can be calculated from equation (4) at the
positions of the ears, -xm ≤ x ≤ xm. Substituting the calculated
values into equation (7), the estimated direction pattern can be
calculated. The sound pressures over the distance between the ears,
14 cm, are calculated in figure xx with the geometry of figure 16
set at w=230 cm, h=200 cm and f=500 Hz. The calculated and measured
directions are in figure 17. In the 500 Hz experiment one-third
octave pink noise was used.
The phase difference is a usable cue for sound localization only at
frequencies below 1.5 kHz, because of the size of the head. At high
frequencies the interference pattern has more than one minimum or
maximum within the distance between the ears, and the pressures are
not proportional to the pressures of the speakers.
|
Figure 17 Phantom source positioning based on
inter-channel phase difference in geometry presented
in figure 16 (frequency 500 Hz) [FOED8X].
|
AMBIANCE
Besides accurate 360 degree localization, another objective of a
surround sound system is to create an ambient sound field with the
rear channels. In this case accurate localization is provided only in
front of the listeners, and for all other directions the sound is not
localized. The task is now somewhat easier, and a satisfactory result
can be achieved with as few as three channels. The best ambiance can
be achieved by placing the surround speakers at the sides of the
listeners [NMKOS71] instead of at the rear. If more channels are
used, rear channels can also be used in conjunction with the side
speakers. The relative quality of sound reproduction compared to the
number of speakers is presented in figure 13. In that figure super
directivity has been studied, but the sound source is an orchestra
located in front of the listeners, with the side and rear speakers
producing ambiance signals.
|

|
REFERENCES
|
[ALHA76]
|
Alpo Halme, "Rakennus- ja huoneakustiikka", Otakustantamo, third
unchanged edition, 1987
|
[BBE85]
|
J. C. Bennett, K. Barker, F. O. Edeko, "A New Approach to the
Assessment of Stereophonic Sound System Performance", Journal of
Audio Engineering Society, vol 33, pp. 314...321, May 1985
|
[CDV57]
|
H. M. Clark, G. F. Dutton, P. B. Vanderlyn, "The "Stereosonic"
Recording and Reproducing System", The Proceedings of the
Institution of Electrical Engineers, vol. 104, Part B,
pp. 417...430, 1957
|
[FOED8X]
|
F. O. Edeko, "Image Localization and Interchannel Phase
Difference", Electronics & Wireless World, vol. , pp.
799...802
|
[GTGP76]
|
G. Theile, G. Plenge, "Localization of Lateral Phantom
Sources", conference presentation, Audio Engineering Society,
Zurich, Switzerland, March 2...5, 1976
|
[HBR88]
|
E. R. Hafter, T. N. Buell, V. M. Richards, "Onset-Coding in
Lateralization: Its Form, Site, and Function", in Auditory
Function, G. M. Edelman, W. E. Gall, W. M. Cowan, Eds., Wiley, New
York, pp. 647...676, 1988
|
[HECH79]
|
|
[HOHA94]
|
Hok-Loe
Han, "Measuring a Dummy Head in Search of Pinna Cues",
Journal of Audio Engineering Society, vol 42 no 1/2 pp. 15...37,
January 1994
|
[JBWL86]
|
Jens Blauert and Werner Lindemann, "Auditory Spaciousness: Some
Further Psychoacoustic Analyses", J. Acoust. Soc. Am., pp.
533...542, 1986
|
[JEBL70]
|
Jens
Blauert, "Sound Localization in the Median Plane",
Acustica, vol 22, pp. 205...213, 1969/1970
|
[JEBO85]
|
Jeffrey Borish, "An Auditorium Simulator for Domestic Use",
Journal of Audio Engineering Society, vol 33 no 5, pp. 330...341,
May 1985
|
[JUTO84]
|
Juha
Törönen, "Kaiuttimien tuotekehitys- ja tutkimusmenetelmien
kehitysmahdollisuudet", Valtion teknillinen tutkimuskeskus,
Tiedotteita 273, Espoo 1984
|
[MBAM81]
|
M.
Barron and A. H. Marshall, "Spatial Impression Due to Early
Lateral Reflections in Concert Halls: The Derivation of A Physical
Measure", Journal of Sound and Vibration, 77(2), pp.
211...232, 1981
|
[MCA68]
|
M. Camras, "Approach to Recreating a Sound Field", J.
Acoust. Soc. Am., vol 43, pp. 1425...1431, 1968
|
[NMKOS71]
|
Takeshi Nakayama, Tanetoshi Miura, Osamu Kosaka, Michio Okamoto and
Takeo Shiga, "Subjective Assessment of Multichannel
Reproduction", Journal of Audio Engineering Society, vol
19, October 1971
|
[RICA79]
|
Richard
Cabot, "A Triphonic Sound Reproduction System Using Coincident
Microphones", Journal of Audio Engineering Society, vol 27, pp.
965...969, December 1979
|
[SØNI93]
|
Søren H. Nielsen, "Auditory Distance Perception in Different
Rooms", Journal of Audio Engineering Society, vol. 41, pp.
755...770, October 1993
|
[ZWFA90]
|
E. Zwicker, H. Fastl, "Psychoacoustics, Facts and Models",
Springer-Verlag, 1990
|