%-*- mode: Latex; abbrev-mode: true; auto-fill-function: do-auto-fill -*-

% ToDo:
% add remaining figures

%include lhs2TeX.fmt
%include myFormat.fmt

\chapter{Sound and Signals}
\label{ch:signals}

In this chapter we study the fundamental nature of sound and its basic mathematical representation as a signal. We also discuss discrete digital representations of a signal, which form the basis of modern sound synthesis and audio processing.

% Taken from Chapter 2 of the text, and
% Blair School of Music (BSoM) http://www.computermusicresource.com/

\section{The Nature of Sound}
\label{sec:sound}

Before studying digital audio, it's important that we first know what \emph{sound} is. In essence, sound is the rapid compression and relaxation of air, which travels as a \emph{wave} through the air from the physical source of the sound to, ultimately, our ears. The physical source of the sound could be the vibration of our vocal cords (resulting in speech or singing), the vibration of a speaker cone, the vibration of a car engine, the vibration of a string in a piano or violin, the vibration of the reed in a saxophone or of the lips when playing a trumpet, or even the (brief and chaotic) vibrations that result when our hands come together as we clap.

The ``compression and relaxation'' of the air (or of a coiled spring) is called a \emph{longitudinal} wave, in which the vibrations occur parallel to the direction of travel of the wave. In contrast, a rope that is fixed at one end and shaken at the other, and a wave in the ocean, are examples of a \emph{transverse} wave, in which the movement of the rope or water is perpendicular to the direction the wave is traveling. [Note: There are some great animations of these two kinds of waves at: \newline
\verb|http://www.computermusicresource.com/what.is.sound.html|.]

If the rate and amplitude of the sound are within a suitable range, we can \emph{hear} the sound---i.e.\ it is \emph{audible sound}. ``Hearing'' results when the vibrating air waves cause our ear drum to vibrate, in turn stimulating nerves that enter our brain. Sound above our hearing range (i.e.\ vibration that is too quick to induce any nerve impulses) is called \emph{ultrasonic sound}, and sound below our hearing range is said to be \emph{infrasonic}.

Staying within the analog world, sound can also be turned into an \emph{electrical} signal using a \emph{microphone} (or ``mic'' for short). Several common kinds of microphones are:
\begin{enumerate}
\item Carbon microphone. Based on the resistance of a pocket of carbon particles that are compressed and relaxed by the sound waves hitting a diaphragm.
\item Condenser microphone. Based on the capacitance between two diaphragms, one being vibrated by the sound.
\item Dynamic microphone. Based on the inductance of a coil of wire suspended in a magnetic field (the inverse of a speaker).
\item Piezoelectric microphone. Based on the property of certain crystals to induce current when they are bent.
\end{enumerate}

\begin{figure}[hbtp]
\centering
\includegraphics[height=4in,angle=270]{pics/sinewave.eps}
\caption{A Sine Wave}
\label{fig:sine-wave}
\end{figure}

Perhaps the most common and natural way to represent a wave diagrammatically, whether it be a sound wave or electrical wave, longitudinal or transverse, is as a \emph{graph} of its amplitude vs.\ time. For example, Figure \ref{fig:sine-wave} shows a \emph{sinusoidal wave} of 1000 cycles per second, with an amplitude that varies between +1 and -1.
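Such a wave is easy to express in Haskell. As a minimal sketch (the name |sineWave| is ours, not part of any library), the wave of Figure \ref{fig:sine-wave} is simply a function from time to amplitude:

\begin{code}
-- A 1000 Hz sine wave with peak amplitude 1, as a function of
-- time in seconds.
sineWave :: Double -> Double
sineWave t = sin (2 * pi * 1000 * t)
\end{code}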
A sinusoidal wave follows precisely the definition of the mathematical sine function, but also relates strongly, as we shall soon see, to the vibration of sound produced by most musical instruments. In the remainder of this text, we will refer to a sinusoidal wave simply as a sine wave.

%% \begin{figure*}
%% \centerline{
%% \epsfysize=2in
%% \epsfbox{pics/sinewave.eps}
%% }
%% \caption{A Sine Wave}
%% \label{fig:sine-wave}
%% \end{figure*}

%% Perhaps the most natural way to draw sound is the same as for an
%% electrical signal---that is, as a \emph{graph} of its amplitude
%% vs.\ time. For example, see Figure \ref{fig:signal-graph}. This same
%% representation can be used to represent both logitudinal and
%% transverse waves.

\emph{Acoustics} is the study of the properties, in particular the propagation and reflection, of sound. \emph{Psychoacoustics} is the study of the mind's interpretation of sound, which is not always as tidy as the physical properties that are manifest in acoustics. Obviously both of these are important areas of study for music in general, and therefore play an important role in generating or simulating music with a computer.

The speed of sound can vary considerably, depending on the material, the temperature, the humidity, and so on. For example, in dry air at room temperature (68 degrees Fahrenheit), sound travels at a rate of 1,125 feet (343 meters) per second, or 768 miles (1,236 kilometers) per hour. Perhaps surprisingly, the speed of sound varies little with respect to air pressure, although it does vary with temperature.

The reflection and absorption of sound is a much more difficult topic, since it depends so much on the material, the shape and thickness of the material, and the frequency of the sound. Modeling the acoustics of a concert hall well, for example, is quite challenging. To understand how much such reflections can affect the overall sound that we hear, consider a concert hall that is 200 feet long and 100 feet wide. Based on the speed of sound given above, it will take a sound wave $\nicefrac{2\times200}{1125} = 0.355$ seconds to travel from the front of the room to the back of the room and back to the front again. That $\nicefrac{1}{3}$ of a second, if loud enough, would result in a significant distortion of the music, and corresponds to about one beat with a metronome set at 168.

With respect to our interpretation of music, sound has (at least) three key properties:
\begin{enumerate}
\item \emph{Frequency} (perceived as \emph{pitch}).
\item \emph{Amplitude} (perceived as \emph{loudness}).
\item \emph{Spectrum} (perceived as \emph{timbre}).
\end{enumerate}
We discuss each of these in the sections that follow.

%% \subsection{Review of Trigonometric Identities}

%% In preparation for what follows, we quickly review some basic
%% properties of trigonometric functions that are useful for audio
%% processing. In general, all of the transcendental functions have a
%% use in audio processing and computer music applications, but our focus
%% here is on sine and cosine.

\subsection{Frequency and Period}
\label{sec:frequency}

The \emph{frequency} $f$ is simply the rate of the vibrations (or repetitions, or cycles) of the sound, and is the inverse of the \emph{period} (or duration, or wavelength) $p$ of each of the vibrations:
\[ f = \frac{1}{p} \]
Frequency is measured in \emph{Hertz} (abbreviated Hz), where 1 Hz is defined as one cycle per second.
For example, the sound wave in Figure \ref{fig:sine-wave} has a frequency of 1000 Hz (i.e.\ 1 kHz) and a period of $\nicefrac{1}{1000}$ second (i.e.\ 1 ms).

In trigonometry, functions like sine and cosine are typically applied to angles that range from 0 to 360 degrees. In audio processing (and signal processing in general) angles are instead usually measured in \emph{radians}, where $2\pi$ radians is equal to $360^\circ$. Since the sine function has a period of $2\pi$ and a frequency of $\nicefrac{1}{2\pi}$, it repeats itself every $2\pi$ radians:
\[ \sin (2\pi k + \theta) = \sin \theta \]
for any integer $k$.

But for our purposes it is better to parameterize these functions over frequency as follows. Since $\sin(2\pi t)$ covers one full cycle in one second, i.e.\ has a frequency of 1 Hz, it makes sense that $\sin(2\pi f t)$ covers $f$ cycles in one second, i.e.\ has a frequency of $f$. Indeed, in signal processing the quantity $\omega$ is defined as:
\[ \omega = 2 \pi f \]
That is, a pure sine wave as a function of time behaves as $\sin(\omega t)$.

Finally, it is convenient to add a \emph{phase} (or \emph{phase angle}) to our formula, which effectively shifts the sine wave in time. The phase is usually represented by $\phi$. Adding a multiplicative factor $A$ for amplitude (see next section), we arrive at our final formula for a sine wave as a function of time:
\[ s(t) = A\sin(\omega t + \phi) \]
A negative value for $\phi$ has the effect of ``delaying'' the sine wave, whereas a positive value has the effect of ``starting early.'' Note also that this equation holds for negative values of $t$.

All of the above can be related to cosine by recalling the following identity:
\[ \sin(\omega t + \dfrac{\pi}{2}) = \cos(\omega t) \]
More generally:
\[ A \sin(\omega t + \phi) = a\cos(\omega t) + b\sin(\omega t) \]
Given $a$ and $b$ we can solve for $A$ and $\phi$:
\[\begin{array}{lcl}
A &=& \sqrt{a^2 + b^2} \\[.05in]
\phi &=& \tan^{-1} \dfrac{a}{b}
\end{array}\]
Given $A$ and $\phi$ we can also solve for $a$ and $b$:
\[\begin{array}{lcl}
a &=& A\sin(\phi) \\
b &=& A\cos(\phi)
\end{array}\]

\subsection{Amplitude and Loudness}
\label{sec:amplitude}

Amplitude can be measured in several ways. The \emph{peak amplitude} of a signal is its maximum deviation from zero; for example, our sine wave in Figure \ref{fig:sine-wave} has a peak amplitude of 1. But different signals having the same peak amplitude have more or less ``energy,'' depending on their ``shape.'' For example, Figure \ref{fig:rms} shows four kinds of signals: a sine wave, a square wave, a sawtooth wave, and a triangular wave (whose names are suitably descriptive). Each of them has a peak amplitude of 1. But, intuitively, one would expect the square wave, for example, to have more ``energy,'' or ``power,'' than a sine wave, because it is ``fatter.'' In fact, its value is everywhere either +1 or -1.

\begin{figure}[hbtp]
\centering
\includegraphics[height=3in,angle=270]{pics/sine_rms.eps}
\includegraphics[height=3in,angle=270]{pics/square_rms.eps}
\includegraphics[height=3in,angle=270]{pics/sawtooth_rms.eps}
\includegraphics[height=3in,angle=270]{pics/triangle_rms.eps}
\caption{RMS Amplitude for Different Signals}
\label{fig:rms}
\end{figure}
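Each of these four shapes is easily described as a function of time. Here is a sketch (the names are ours), with |t| ranging over one cycle from 0 to 1:

\begin{code}
-- One cycle of each waveform in Figure [fig:rms]; all four have
-- a peak amplitude of 1, with t ranging over [0,1).
sine, square, sawtooth, triangle :: Double -> Double
sine     t = sin (2 * pi * t)
square   t = if t < 0.5 then 1 else -1
sawtooth t = 2 * t - 1                                 -- ramps from -1 up to 1
triangle t = if t < 0.5 then 4 * t - 1 else 3 - 4 * t  -- up, then back down
\end{code}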
To measure this characteristic of a signal, scientists and engineers often refer to the \emph{root-mean-square} amplitude, or RMS. Mathematically, the root-mean-square is the square root of the mean of the squared values of a given quantity. If $x$ is a discrete quantity given by the values $x_1, x_2, ..., x_n$, the formula for RMS is:
\[ x_{\rm RMS} = \sqrt{\frac{x_1^2 + x_2^2 + ... + x_n^2}{n}} \]
And if $f$ is a continuous function, its RMS value over the interval $T_1 \leq t \leq T_2$ is given by:
\[ \sqrt{{\frac{1}{T_2-T_1}}\int_{T_1}^{T_2}f(t)^2\,dt} \]
For a sine wave, it can be shown that the RMS value is approximately 0.707 of the peak value. For a square wave, it is 1.0. And for both a sawtooth wave and a triangular wave, it is approximately 0.577. Figure \ref{fig:rms} shows these RMS values superimposed on each of the four signals.

Another way to measure amplitude is to use a relative logarithmic scale that more aptly reflects how we hear sound. This is usually done by measuring the sound level (usually in RMS) with respect to some reference level. The number of \emph{decibels} (dB) of sound is given by:
\[ S_{dB} = 10 \log_{10}\frac{S}{R} \]
where $S$ is the RMS sound level, and $R$ is the RMS reference level. The accepted reference level for the human ear is $10^{-12}$ watts per square meter, which is roughly the threshold of hearing.

A related concept is the measure of how much useful information is in a signal relative to the ``noise.'' The \emph{signal-to-noise ratio}, or $\mathit{SNR}$, is defined as the ratio of the \emph{power} of each of these signals, which is the square of the RMS value:
\[ \mathit{SNR} = \left(\frac{S}{N}\right)^2 \]
where $S$ and $N$ are the RMS values of the signal and noise, respectively. As is often the case, it is better to express this on a logarithmic scale, as follows:
\[\begin{array}{lcl}
\mathit{SNR}_{dB} &=& 10 \log_{10}\left(\dfrac{S}{N}\right)^2 \\[0.12in]
&=& 20 \log_{10}\dfrac{S}{N}
\end{array}\]

The \emph{dynamic range} of a system is the difference between the smallest and largest values that it can process. Because this range is often very large, it is usually measured in decibels, which is a logarithmic quantity. The ear, for example, has a truly remarkable dynamic range---about 130 dB. To get some feel for this, silence should be considered 0 dB, a whisper 30 dB, normal conversation about 60 dB, loud music 80 dB, a subway train 90 dB, and a jet plane taking off or a very loud rock concert 120 dB or higher. Note that if you double the sound level, the decibels increase by about 3 dB, whereas a million-fold increase corresponds to 60 dB:
\[\begin{array}{lclcl}
10 \log_{10}2 &=& 10 \times 0.301029996 &\cong& 3 \\
10 \log_{10}10^6 &=& 10 \times 6 &=& 60 \\
\end{array}\]
So the ear is truly adaptive! (The eye also has a large dynamic range with respect to light intensity, but not quite as much as the ear, and its response time is much slower.)

\begin{figure}[hbtp]
\centering
\includegraphics[height=4in]{pics/equal_loudness_contour.eps}
\caption{Fletcher-Munson Equal Loudness Contour}
\label{fig:fletcher-munson}
\end{figure}
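The definitions above translate directly into code. The following sketch (function names ours) computes the RMS of a list of samples, along with the decibel formula relative to a reference level:

\begin{code}
-- Root-mean-square of a list of sample values.
rms :: [Double] -> Double
rms xs = sqrt (sum (map (^2) xs) / fromIntegral (length xs))

-- Sound level in decibels of level s relative to reference
-- level r, as in the formula given above.
dB :: Double -> Double -> Double
dB s r = 10 * logBase 10 (s / r)
\end{code}

As a quick check, applying |rms| to 1000 evenly spaced samples of one cycle of the |sine| function defined earlier yields approximately 0.707, as claimed.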
Loudness is the perceived measure of amplitude, or volume, of sound, and is thus subjective. It is most closely aligned with RMS amplitude, with one important exception: loudness depends somewhat on frequency! Of course that's obvious for really high and really low frequencies (since at some point we can't hear them at all), but in between things aren't constant either. Furthermore, no two humans are the same. Figure \ref{fig:fletcher-munson} shows the \emph{Fletcher-Munson Equal-Loudness Contour}, which reflects the perceived equality of sound intensity by the average human ear with respect to frequency. Note from this figure that:
\begin{itemize}
\item The human ear is less sensitive to low frequencies.
\item The maximum sensitivity is around 3-4 kHz, which roughly corresponds to the resonance of the auditory canal.
\end{itemize}

%% [See: \newline
%% \verb|http://hyperphysics.phy-astr.gsu.edu/Hbase/sound/earsens.html|
%% and
%% \verb|http://hyperphysics.phy-astr.gsu.edu/Hbase/hframe.html|]
%% \verb|http://www2.sfu.ca/sonic-studio/handbook/Equal_Loudness_Contours.html|.]

Another important psychoacoustical property is captured in the \emph{Weber-Fechner Law}, which states that the \emph{just noticeable difference} (jnd) in a quantity---i.e.\ the minimal change necessary for humans to notice something in a cognitive sense---is a relative constant, independent of the absolute level. That is, the ratio of the change to the absolute measure of that quantity is constant:
\[ \frac{\Delta q}{q} = k \]
The jnd for loudness happens to be about 1 dB, which is another reason why the decibel scale is so convenient. 1 dB corresponds to a sound level ratio of 1.25892541. So, in order for a person to ``just notice'' an increase in loudness, one has to increase the sound level by about 25\%. If that seems high to you, it's because your ear is so adaptive that you are not even aware of it.

\subsection{Frequency Spectrum}
\label{sec:spectrum}

Humans can hear sound approximately in the range 20 Hz to 20,000 Hz (i.e.\ 20 kHz). This is a dynamic range in frequency of a factor of 1000, or 30 dB. Different people can hear different degrees of this range (I can hear very low tones well, but not very high ones). On a piano, the fundamental frequency of the lowest note is 27.5 Hz, middle (concert) A is 440 Hz, and the top-most note is about 4 kHz. Later we will learn that these notes also contain \emph{overtones}---multiples of the fundamental frequency---that contribute to the \emph{timbre}, or sound quality, that distinguishes one instrument from another. (Overtones are also called \emph{harmonics} or \emph{partials}.)

The \emph{phase}, or time delay, of a signal is important too, and comes into play when we start mixing signals together, which can happen naturally, deliberately, from reverberations (room acoustics), and so on. Recall that a pure sine wave can be expressed as $\sin(\omega t + \phi)$, where $\phi$ is the \emph{phase angle}. Manipulating the phase angle is common in additive synthesis and amplitude modulation, topics to be covered in later chapters.

\begin{figure}[hbtp]
\centering
\includegraphics[height=3in,angle=270]{pics/sine_spect1.eps}
\begin{center} (a) Spectral plot of pure sine wave \end{center}
\includegraphics[height=3in,angle=270]{pics/sine_spect2.eps}
\begin{center} (b) Spectral plot of a noisy sine wave \end{center}
\includegraphics[height=3in,angle=270]{pics/sine_spect3.eps}
\begin{center} (c) Spectral plot of a musical tone \end{center}
\caption{Spectral Plots of Different Signals}
\label{fig:frequency-spectrum}
\end{figure}
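As a sketch of how such mixing can be expressed in code (the name |mixTone| is ours), here is a function that sums a list of sinusoidal partials, each with its own amplitude, frequency, and phase:

\begin{code}
-- A composite tone built from a list of partials, each given as
-- (amplitude, frequency in Hz, phase in radians).
mixTone :: [(Double, Double, Double)] -> Double -> Double
mixTone partials t =
  sum [ a * sin (2 * pi * f * t + phi) | (a, f, phi) <- partials ]
\end{code}

For example, applying |mixTone| to the partials $(\nicefrac{1}{k}, 440k, 0)$ for $k = 1, ..., 6$ yields a 440 Hz tone with a harmonically decaying series of overtones, of the sort discussed below.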
A key point is that most sounds do not consist of a single, pure sine wave---rather, they are a combination of many frequencies, and at varying phases relative to one another. Thus it is helpful to talk of a signal's \emph{frequency spectrum}, or spectral content. If we have a regular repetitive sound (called a \emph{periodic signal}) we can plot its spectral content instead of its time-varying graph. For a pure sine wave, this looks like an impulse function, as shown in Figure \ref{fig:frequency-spectrum}a. But for a richer sound, it gets more complicated.

First, the distribution of the energy is not typically a pure impulse, meaning that the signal might vary slightly above and below a particular frequency, and thus its frequency spectrum typically looks more like Figure \ref{fig:frequency-spectrum}b. In addition, a typical sound has many different frequencies associated with it, not just one. Even for an instrument playing a single note, this will include not just the perceived pitch, which is called the \emph{fundamental frequency}, but also many \emph{overtones} (or harmonics) which are multiples of the fundamental, as shown in Figure \ref{fig:frequency-spectrum}c. The \emph{natural harmonic series}, which is often approximated in nature, has a harmonically decaying series of overtones.

What's more, the articulation of a note by a performer on an instrument causes these overtones to vary in relative size over time. There are several ways to visualize this graphically, and Figure \ref{fig:time-varying-spectrum} shows two of them. In \ref{fig:time-varying-spectrum}a, shading is used to show the varying amplitude over time. And in \ref{fig:time-varying-spectrum}b, a 3D projection is used.

\begin{figure}[hbtp]
\centering
\includegraphics[height=4in,angle=270]{pics/spectrum_map.eps}
\begin{center} (a) Using shading \end{center}
\includegraphics[height=4in,angle=270]{pics/spectrum_mesh.eps}
\begin{center} (b) Using 3D projection \end{center}
\caption{Time-Varying Spectral Plots}
\label{fig:time-varying-spectrum}
\end{figure}

The precise blend of the overtones, their phases, and how they vary over time, is primarily what distinguishes a particular note, say concert A, on a piano from the same note on a guitar, a violin, a saxophone, and so on. We will have much more to say about these issues in later chapters. [See pictures at: \newline
\verb|http://www.computermusicresource.com/spectrum.html|.]

% This also relates to the envelope that “shapes” a sound. [see BSoM
% for more ideas on this]

\section{Digital Audio}
\label{sec:digital-audio}

The preceding discussion has assumed that sound is a continuous quantity, which of course it is, and thus we represent it using continuous mathematical functions. If we were using an analog computer, we could continue with this representation, and create electronic music accordingly. Indeed, the earliest electronic synthesizers, such as the \emph{Moog synthesizer} of the 1960s, were completely analog. However, most computers today are \emph{digital}, which requires representing sound (or signals in general) using digital values. The simplest way to do this is to represent a continuous signal as a \emph{sequence of discrete samples} of the signal of interest. An \emph{analog-to-digital converter}, or ADC, is a device that converts an instantaneous sample of a continuous signal into a binary value. The microphone input on a computer, for example, connects to an ADC.

Normally the discrete samples are taken at a fixed \emph{sampling rate}. Choosing a proper sampling rate is quite important. If it is too low, we will not acquire sufficient samples to adequately represent the signal of interest. And if the rate is too high, it may be overkill, thus wasting precious computing resources (in both time and memory consumption). Intuitively, it seems that the highest frequency signal that we could represent using a sampling rate $r$ would have a frequency of $\nicefrac{r}{2}$, in which case the result would have the appearance of a square wave, as shown in Figure \ref{fig:sample-rate}a.
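Sampling itself is easy to express in code. The following sketch (names ours) takes $n$ samples of a continuous signal, represented as a function of time, at sampling rate $r$:

\begin{code}
-- Take n samples of a continuous signal at sampling rate r
-- (samples per second); sample k is taken at time k/r.
sampleAt :: Double -> Int -> (Double -> Double) -> [Double]
sampleAt r n sig = [ sig (fromIntegral k / r) | k <- [0 .. n - 1] ]
\end{code}

Applying |sampleAt 1000 4| to the 1000 Hz |sineWave| defined at the beginning of this chapter yields (essentially) the same value at every sample point---a 0 Hz result, which is precisely the situation of Figure \ref{fig:sample-rate}b.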
Indeed, it is easy to see that problems could arise if we sampled at a rate significantly lower than the frequency of the signal, as shown in Figures \ref{fig:sample-rate}b and \ref{fig:sample-rate}c for sampling rates equal to, and one-half of, the frequency of the signal of interest---in both cases the result is a sampled signal of 0 Hz!

\begin{figure}[hbtp]
\centering
\includegraphics[height=3.1in,angle=270]{pics/aliasing_2f.eps}
\begin{center} (a) \end{center}
\includegraphics[height=3.1in,angle=270]{pics/aliasing_f.eps}
\begin{center} (b) \end{center}
\includegraphics[height=3.1in,angle=270]{pics/aliasing_half-f.eps}
\begin{center} (c) \end{center}
\caption{Choice of Sampling Rate}
\label{fig:sample-rate}
\end{figure}

Indeed, this observation is captured in what is known as the \emph{Nyquist-Shannon Sampling Theorem}, which, stated informally, says that the accurate reproduction of an analog signal (no matter how complicated) requires a sampling rate that is at least twice the highest frequency of the signal of interest. For example, for audio signals, if the highest frequency humans can hear is 20 kHz, then we need to sample at a rate of at least 40 kHz for a faithful reproduction of sound. In fact, CDs are recorded at 44.1 kHz. But many people feel that this rate is too low, as some people can hear beyond 20 kHz. Another recording studio standard is 48 kHz. Interestingly, a good analog tape recorder from generations ago was able to record signals with frequency content even higher than this---perhaps digital is not always better!

\subsection{From Continuous to Discrete}
\label{sec:discrete}

Recall the definition of a sine wave from Section \ref{sec:frequency}:
\[ s(t) = A\sin(\omega t + \phi) \]
We can easily and intuitively convert this to the discrete domain by replacing the time $t$ with the quantity $\nicefrac{n}{r}$, where $n$ is the integer index into the sequence of discrete samples, and $r$ is the sampling rate discussed above. If we use $s[n]$ to denote the $(n+1)^{\rm th}$ sample of the signal, we have:
\[ s[n] = A\sin\left(\frac{\omega n}{r} + \phi\right),\ \ \ \ \ \ \ \ n = 0, 1, ..., \infty \]
Thus $s[n]$ corresponds to the signal's value at time $\nicefrac{n}{r}$.

\subsection{Fixed-Waveform Table-Lookup Synthesis}
\label{sec:wavetable}

One of the most fundamental questions in digital audio is how to generate a sine wave as efficiently as possible, or, in general, how to generate a fixed periodic signal of any form (sine wave, square wave, sawtooth wave, even a sampled sound bite). A common and efficient way to generate a periodic signal is through \emph{fixed-waveform table-lookup synthesis}. The idea is very simple: store in a table the samples of a desired periodic signal, and then index through the table at a suitable rate to reproduce that signal at some desired frequency. The table is often called a \emph{wavetable}. In general, if we let:
\[\begin{array}{lcl}
L &=& {\rm table\ length} \\
f &=& {\rm resulting\ frequency} \\
i &=& {\rm indexing\ increment} \\
r &=& {\rm sample\ rate}
\end{array}\]
then we have:
\[ f = \frac{i r}{L} \]
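As a sketch of the two halves of this idea (names ours): building a wavetable containing one cycle of a sine wave, and computing the index increment $i$ for a desired frequency by solving the equation above for $i$:

\begin{code}
-- A wavetable holding one cycle of a sine wave, as a table
-- (here simply a list) of length len.
mkSineTable :: Int -> [Double]
mkSineTable len =
  [ sin (2 * pi * fromIntegral k / fromIntegral len) | k <- [0 .. len - 1] ]

-- The index increment needed to reproduce frequency f at sample
-- rate r from a table of length len:  i = f * len / r.
increment :: Double -> Double -> Int -> Double
increment f r len = f * fromIntegral len / r
\end{code}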
For example, suppose the table contains 8196 samples. If the sample rate is 44.1 kHz, how do we generate a tone of, say, 440 Hz? Plugging in the numbers and solving the above equation for $i$, we get:
\[\begin{array}{lcl}
440 &=& \dfrac{i \times 44.1 {\rm kHz}}{8196} \\[.1in]
i &=& \dfrac{440 \times 8196}{44.1 {\rm kHz}} \\[.1in]
&=& 81.77
\end{array}\]
So, if we were to sample approximately every 81.77$^{\rm th}$ value in the table, we would generate a signal of 440 Hz.

Now suppose the table $T$ is a vector, and $T[n]$ is its $n$th element. Let's call the exact (in general non-integral) position in the table corresponding to the continuous signal the \emph{phase}, and the actual index into the table the \emph{phase index} $p$. The computation of successive values of the phase index and output signal $s$ is then captured by these equations:
\[\begin{array}{lcl}
p_0 &=& \lfloor \phi_0 + 0.5 \rfloor \\
p_{n+1} &=& (p_n + i) \bmod L \\
s_n &=& T [\ \lfloor p_n + 0.5 \rfloor\ ]
\end{array}\]
$\lfloor a+0.5 \rfloor$ denotes the floor of $a+0.5$, which effectively rounds $a$ to the nearest integer. $\phi_0$ is the initial phase angle (recall the earlier discussion), so $p_0$ is the initial index into the table that specifies where the fixed waveform should begin.

Instead of rounding the index, one could do better by \emph{interpolating} between values in the table, at the expense of efficiency. In practice, rounding the index is often good enough. Another way to increase accuracy is to simply increase the size of the table.

\subsection{Aliasing}
\label{sec:aliasing}

Earlier we saw examples of problems that can arise if the sampling rate is not high enough. We saw that if we sample a sine wave at twice its frequency, we can suitably capture that frequency. If we sample at exactly its frequency, we get 0 Hz. But what happens in between? Consider a sampling rate ever-so-slightly higher or lower than the sine wave's fundamental frequency---in both cases, this will result in a frequency much lower than the original signal, as shown in Figures \ref{fig:aliasing1} and \ref{fig:aliasing2}. This is analogous to the effect of seeing spinning objects under fluorescent or LED light, or old motion pictures of the spokes in the wheels of horse-drawn carriages.

\begin{figure}[hbtp]
\centering
\includegraphics[height=4in,angle=270]{pics/aliasing_lowf1.eps}
\includegraphics[height=4in,angle=270]{pics/aliasing_lowf2.eps}
\caption{Aliasing 1}
\label{fig:aliasing1}
\end{figure}

\begin{figure}[hbtp]
\centering
\includegraphics[height=4in,angle=270]{pics/aliasing_lowf5.eps}
\includegraphics[height=4in,angle=270]{pics/aliasing_lowf6.eps}
\caption{Aliasing 2}
\label{fig:aliasing2}
\end{figure}

These figures suggest the following. Suppose that $m$ is one-half the sampling rate. Then:
\[\begin{array}{lll}
\hline \\
{\rm Original\ signal} && {\rm Reproduced\ signal} \\
\hline
0-m && 0-m \\
m-2m && m-0 \\
2m-3m && 0-m \\
3m-4m && m-0 \\
\cdots && \cdots \\
\hline
\end{array}\]
This phenomenon is called \emph{aliasing}, or \emph{foldover} of the signal onto itself. This is not good! In particular, it means that audio signals in the ultrasonic range will get ``folded'' into the audible range. To solve this problem, we can add an analog \emph{low-pass filter} in front of the ADC---usually called an \emph{anti-aliasing} filter---to eliminate all but the audible sound before it is digitized. In practice, however, this can be tricky. For example, a steep analog filter introduces \emph{phase distortion} (i.e.\ frequency-dependent time delays), and early digital recordings were notorious for the ``harsh sound'' that resulted. This can be fixed by using a filter with less steepness (but resulting in more aliasing), or using a time correlation filter to compensate, or using a technique called \emph{oversampling}, which is beyond the scope of this text.
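The foldover pattern in the table above can also be captured in a small function. The following sketch (the name |alias| is ours) reduces an input frequency modulo the sampling rate and then reflects it about the half-sampling-rate point $m$:

\begin{code}
-- The frequency at which a signal of frequency f is reproduced
-- when sampled at rate r, per the foldover table above.
alias :: Double -> Double -> Double
alias f r = if f' <= m then f' else r - f'
  where m  = r / 2
        f' = f - r * fromIntegral (floor (f / r) :: Int)  -- f reduced mod r
\end{code}

For instance, |alias 25000 40000| is 15000: a 25 kHz signal sampled at 40 kHz folds down to an audible 15 kHz.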
A similar problem occurs at the other end of the digital audio process---i.e.\ when we reconstruct an analog signal from a digital signal using a \emph{digital-to-analog converter}, or DAC. The digital representation of a signal can be viewed mathematically as a stepwise approximation to the real signal, as shown in Figure \ref{fig:no-aliasing}, where the sampling rate is ten times the frequency of interest. As discussed earlier, at the highest frequency (i.e.\ at one-half the sampling rate), we get a square wave. As we will see in Chapter~\ref{ch:spectrum-analysis}, a square wave can be represented mathematically as the sum of an infinite sequence of sine waves, consisting of the fundamental frequency and all of its odd harmonics. These harmonics can enter the ultrasonic region, causing potential havoc in the analog circuitry, or in a dog's ear (dogs can hear frequencies much higher than humans can). The solution is to add yet another low-pass filter, called an \emph{anti-imaging} or \emph{smoothing} filter, to the output of the DAC. In effect, this filter ``connects the dots,'' or interpolates, between successive values of the stepwise approximation.

\begin{figure}[hbtp]
\centering
\includegraphics[height=3.2in,angle=270]{pics/noaliasing.eps}
\caption{A Properly Sampled Signal}
\label{fig:no-aliasing}
\end{figure}

In any case, a basic block diagram of a typical digital audio system---from sound input to sound output---is shown in Figure \ref{fig:DAW-block-diagram}.

\begin{figure}
\centering
\includegraphics[height=4.0in]{pics/DAWBlockDiagram.eps}
\caption{Block Diagram of Typical Digital Audio System}
\label{fig:DAW-block-diagram}
\end{figure}

\subsection{Quantization Error}
\label{sec:quantization}

In terms of amplitude, remember that we are using digital numbers to represent an analog signal. For conventional CDs, 16 bits of precision are used. If we were to compute and then ``listen to'' the round-off errors that are induced, we would hear subtle imperfections, called \emph{quantization error}, or more commonly, ``noise.'' One might compare this to ``hiss'' on a tape recorder (which is due to the molecular disarray of the magnetic recording medium), but there are important differences. First of all, when there is no sound, there is no quantization error in a digital signal, but there is still hiss on a tape. Also, when the signal is very low and regular, the quantization error becomes somewhat regular as well, and is thus audible as something different from hiss. Indeed, it's only when the signal is loud and complex that quantization error compares favorably to tape hiss.

One solution to the problem of low signal levels mentioned above is to purposely introduce noise into the system to make the signal less predictable. This fortuitous use of noise deserves a better name, and indeed it is called \emph{dither}.
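To see quantization error concretely, here is a sketch (names ours) that rounds a sample in the range $\pm 1$ to $n$ bits of precision; the ``noise'' described above is simply the difference between a sample and its quantized version:

\begin{code}
-- Round a sample in [-1,1] to n bits of precision.
quantize :: Int -> Double -> Double
quantize n x = fromIntegral (round (x * m) :: Integer) / m
  where m = 2 ^^ (n - 1)

-- The quantization error ("noise") for a given sample.
quantError :: Int -> Double -> Double
quantError n x = x - quantize n x
\end{code}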
\subsection{Dynamic Range}
\label{sec:dynamic-range}

What is the dynamic range of an $n$-bit digital audio system? If we think of quantization error as noise, it makes sense to use the equation for $\mathit{SNR}_{dB}$ given in Section \ref{sec:amplitude}:
\[ \mathit{SNR}_{dB} = 20 \log_{10}\frac{S}{N} \]
But what should $N$ be, i.e.\ the quantization error? Given a signal amplitude range of $\pm a$ and $n$ bits of resolution, that range is divided into $2^n$ steps, each of size $\nicefrac{2a}{2^n}$. Taking the full range $2a$ as the signal and one step as the quantization error, the dynamic range is:
\[\begin{array}{lcl}
20 \log_{10}\left(\dfrac{2a}{\nicefrac{2a}{2^n}}\right) &=& 20 \times\log_{10}(2^n) \\
&=& 20 \times n \times\log_{10} (2) \\[.02in]
&\approx& 20 \times n \times (0.3) \\[.02in]
&=& 6n
\end{array}\]
For example, a 16-bit digital audio system results in a dynamic range of 96 dB, which is pretty good, although a 20-bit system yields 120 dB, corresponding to the dynamic range of the human ear.

\vspace{.1in}\hrule

\begin{exercise}{\em
For each of the following, say whether it is a longitudinal wave or a transverse wave:
\begin{itemize}
\item A vibrating violin string.
\item Stop-and-go traffic on a highway.
\item ``The wave'' in a crowd at a stadium.
\item ``Water hammer'' in the plumbing of your house.
\item The wave caused by a stone falling in a pond.
\item A radio wave.
\end{itemize}
}
\end{exercise}

\begin{exercise}{\em
You see a lightning strike, and 5 seconds later you hear the thunder. How far away is the lightning?
}
\end{exercise}

\begin{exercise}{\em
You clap your hands in a canyon, and 2 seconds later you hear an echo. How far away is the canyon wall?
}
\end{exercise}

\begin{exercise}{\em
By what factor must one increase the RMS level of a signal to yield a 10 dB increase in sound level?
}
\end{exercise}

\begin{exercise}{\em
A dog can hear in the range 60--45,000 Hz, and a bat 2,000--110,000 Hz. In terms of the frequency response, what are the corresponding dynamic ranges for these two animals, and how do they compare to that of humans?
}
\end{exercise}

\begin{exercise}{\em
What is the maximum number of audible overtones in a note whose fundamental frequency is 100 Hz? 500 Hz? 1500 Hz? 5 kHz?
}
\end{exercise}

\begin{exercise}{\em
Consider a continuous input signal whose frequency is $f$. Devise a formula for the frequency $r$ of the reproduced signal given a sample rate $s$.
}
\end{exercise}

\begin{exercise}{\em
How much memory is needed to record 3 minutes of stereo sound using 16-bit samples taken at a rate of 44.1 kHz?
}
\end{exercise}

\begin{exercise}{\em
If we want the best possible sound, how large should the table be using fixed-waveform table-lookup synthesis, in order to cover the audible frequency range?
}
\end{exercise}

\begin{exercise}{\em
The Doppler effect occurs when a sound source is in motion. For example, as a police car moves toward you, its siren sounds higher than it really is, and as it goes past you, it gets lower. How fast would a police car have to go to change a siren whose frequency is the same as concert A to a pitch an octave higher (i.e.\ twice the frequency)? At that speed, what frequency would we hear after the police car passes us?
}
\end{exercise}

\vspace{.1in}\hrule

\out{
-------------------------
Oversampling

Oversampling is a simple “trick” that improves dynamic range as well as anti-aliasing. The idea is to interpolate between digital samples. This became popular in early CD players.

More recently so-called “1-bit oversampling” has become popular. The idea here is to represent signals using a single bit of quantization, but sample at a much higher rate. This trade-off in “information content” is well-known mathematically, and in practice it greatly simplifies the anti-aliasing problem, because the filter that is needed can be far less steep (since the higher rate takes us way out of the audible range).
}