Sampling Sounds

Most sounds nowadays are stored digitally (if you don't count ancient LPs and non-digital magnetic tapes). Let's see how it's done.

First, sound is converted to electrical current using a microphone. Continuous oscillations of air pressure become continuous oscillations of voltage in an electrical circuit. This fast-changing voltage is then converted into a series of numbers by a digitizer. A digitizer acts like a very fast digital voltmeter. It makes thousands of measurements per second. Each measurement results in a number that can be stored digitally (that is, only a finite number of significant digits of this number are recorded). This number is called a sample and the whole conversion of sound to a series of numbers is called sampling.

If you have a sound card in your computer, it has a digitizer. If you plug in a microphone, the sound card will produce a stream of samples that can be retrieved by your software (see example).

Fig. Sampling a continuous signal. Blue line corresponds to changing voltage (for instance, from a microphone), red segments correspond to samples.

Example

To give you an example--CD-quality music recording is created by sampling the sound 44,100 times per second and storing each sample as a 16-bit binary number (twice as much for a stereo recording). So an hour of stereo music is equivalent to 3,600 * 44,100 * 2 = 317,520,000 samples or 635,040,000 bytes. That's over half a gigabyte--which is more or less the capacity of a standard CD. (You can drastically reduce storage requirements if you apply some clever compression scheme--for instance MP3).

Vectors and Wavelets

There are many ways to learn about the Fourier transform. However, since you are presumably a programmer who deals with binary numbers, the digital approach will be more to the point.

Sound acquisition in a computer is usually implemented by getting a buffer full of samples at a time--an array of integers, each integer corresponding to the amplitude of the sound at a particular moment. In mathematics, an array of numbers is called a vector (that's why the C++ Standard Library calls its dynamic array a vector). More precisely, a vector is an abstract entity that can be represented as an array of numbers in a particular coordinate system. Sound analysis can be thought of a representing the same vector of samples in different coordinate systems.

Vector Refresher

Think of a three-dimensional vector. Imagine it as an arrow of some length, pointing in some direction in space. If you have a coordinate system, you can project this vector, call it v, on the three axes and get three numbers, x, y, and z. These are the coordinates of this vector in this particular coordinate system.

Fig. Coordinates of a 3-d vector in a particular coordinate system

However, you are not bound to one particular coordinate system. The same vector can be represented in a different coordinate system by a different set of numbers. A coordinate system in a 3-dimensional space is nothing but a set of three vectors. These are called the basis vectors. We are used to orthonormal coordinate systems in which the three basis vectors are orthogonal to each other and have unit length. In general, however, this is not necessary. What really matters is for the three vectors to be linearly independent to form a good basis. Vectors are linearly independent if none of them can be represented as a linear combination of the others. In particular, you can't have one vector lying in the plane formed by the two other vectors. Such a trio wouldn't form a good coordinate system in 3-d.

Imagine now that you have a new set of three basis vectors, i₁, i₂, and i₃, and you want to find out the coordinates of v in this new basis. A coordinate of v with respect of a basis vector i_k is equal to itsprojection onto i_k. So to get the three coordinates of v in the new basis, you have to project v onto the three basis vectors, i₁, i₂, and i₃.

How do you calculate these projections? That's easy. First, each of the new basis vectors must be represented as three numbers--these are the coordinates of the new basis vectors in our original coordinate system. For instance i₁ = (x₁, y₁, z₁). The projection of v = (x, y, z) onto i₁ is equal to

(v * i₁) = x * x₁ + y * y₁ + z * z₁

This is called a scalar product of two vectors.

All these notions generalize easily to n-dimensional spaces. An n-dimensional vector can be represented as n numbers. A basis in an n-dimensional vector space consists of n linearly independent vectors. To calculate the coordinates of a vector v in a new basis, i_k (k = 1...n), we have to calculate scalar products of v with each of the n basis vectors.

Wavelet Analysis

Although wavelet analysis is only a recent discovery, it is in principle simpler than its predecessor, Fourier analysis, especially in the digital domain.

Consider a fixed-size buffer that is the result of sampling a sound. If the buffer contains n samples, it can be viewed as an n-dimensional vector. Each sample becomes a coordinate of this vector in some n-dimensional space. For instance, if we only had three-sample buffers at our disposal, the values of the samples would correspond to the x, y, z coordinates of a 3-d vector. Normally we have hundreds or thousands of samples in a single buffer, so it's really hard to imagine the multi-dimensional space in which they would form a vector. But the math is still the same.

The wavelet transform is obtained by re-calculating the coordinates of this vector of samples using a different basis in n-dimensional space. The result is, again, n numbers. We start with n numbers and we end up with n numbers, so what's the big deal? It turns out that, depending on the choice of the basis, these new numbers may be easier to interpret. They can tell us something about the sampled sound that was not immediately obvious. Or we may drop some of these numbers for transmission or storage--we would have compressed the sound in a particular way.

8-Point Wavelet Transform

Here's a little example. Suppose that we are dealing with small buffers, each containing 8 consecutive samples. Instead of drawing an 8-dimensional vector to represent each buffer (on a 2-dimensional screen?), I will draw the 8 samples as 8 vertical segments, one after another.

Fig. A representation of a buffer with 8 samples

The eight basis vectors can be represented in the same fashion. Here's an example that is often used in wavelet analysis, the so-called Haar basis.

Fig. Haar basis in 8 dimensions

This particular basis is orthogonal--the scalar product between any two basis elements is zero. Moreover, the "square" of each element (the scalar product with itself) is equal to one. For instance, the last basis vector in the picture has the following coordinates

i₈ = (0, 0, 0, 0, 0, 0, a, -a), a = 1/sqrt(2)

and its square is

(i₈*i₈) = 0² + 0² + 0² + 0² + 0² + 0² + a² + a² = 2 * a² = 1

A scalar product of an arbitrary vector v = (v₁, v₂, v₃, v₄, v₅, v₆, v₇, v₈) with the nth basis vector, i_n is called the nth wavelet coefficient. For instance, let's calculate the 8th wavelet coefficient of v, or the scalar product of v with i₈:

w₈ = (v * i₈) = (v₇ - v₈)/sqrt(2)

It is simply the difference between the seventh and the eighth vector components (or samples in our buffer of samples). Notice that if the signal changes very slowly, this difference will be very small--the 8th wavelet coefficient will be small.

Just by looking at the pictures of the basis vectors, you can figure out that the first wavelet coefficient, or the scalar product of v with i₁, tells you the average amplitude of the signal; the second tells you how much difference there is between the first half of the signal and the second half; and so on. If you decide to drop the last four coefficients, you'll just lose the information about some fine details of the signal, but you'll still be able to recover its overall shape. That's the principle of wavelet compression in a nutshell.

The set of wavelet coefficients, w₁, w₂, ... w₈ is called the wavelet transform of the signal v. Since we are operating on 8-sample buffers, this is called an 8-point wavelet transform.

How do we recover the signal given its wavelet coefficients? Simple: multiply the basis vectors by these coefficients and add them up:

v = w₁ i₁ + w₂ i₂ + ... + w₈ i₈

For instance, if w₁ = 10, w₂ = -4, and the rest is zero, the reconstructed signal will be:

v = 10 * i₁ - 4 * i₂

Taking into account proper normalization of these basis vectors, the result is:

v = (3, 3, 3, 3, 7, 7, 7, 7)/sqrt(2)

Fourier Analysis

A Fourier transform is a special case of a wavelet transform with basis vectors defined by trigonometric functions--sine and cosine. What's so special about trigonometric functions? If you've gone through theharmonic oscillator digression, you might have a pretty good idea. When things oscillate, their displacements are governed by trigonometric functions.

Resonance

When a sound wave interacts with an oscillator, it may excite it or not. When it excites it, the oscillator starts oscillating with its own characteristic frequency. What does it tell you about the sound wave? It tells you that this particular sound wave contains this particular frequency. It could be a complicated wave, not even resembling a sine wave, but somehow it does have in itself a component of a given frequency. Since it causes the oscillator to resonate, it must continually pass some of its energy to it. So, on average, when the oscillator is moving one way (the first half of its period) there is more "pushing" by the sound wave, and when it is moving the other way (the second half of its period) there is more "pulling" from the sound wave. The wave "swings" the oscillator.

We can test a given sound mathematically to find out whether it contains a certain frequency (whether it will swing the oscillator with this particular characteristic frequency). We just have to add up (accumulate) the consecutive sound samples multiplied by special weights. We'll make the weights positive when the oscillator is in the first half of its period and negative when its in the second half of its period. Since the energy transfer is best when the oscillator is in the middle of its swing, we'll make the weight highest there. Taking all this into account, the best weighing function will be, you guessed it, a sine. To test for a particular frequency, we'll use the sine wave of that frequency.
Weighted sum of samples

Fig. Summing samples (blue) with appropriate weights (red)

The accumulated sum will be close to zero if the sound doesn't contain a given frequency. Positive samples will, on average, cancel negative samples. But if there is a correlation between the sound and the weighing function, the contributions will accumulate and the result will possibly be a large number.

If we were to physically analyze a sound, we could create a set of oscillators, each tuned to a slightly different frequency, and let the sound swing them. Whatever frequencies were contained in the sound would excite the corresponding oscillators. This is essentially what happens in our ear.

Alternatively, instead of building a set of oscillators, we could analyze the sound wave mathematically, to find out which oscillators would be excited. Each oscillator corresponds to a sinusoidal weight function of specific frequency. If the sound is digitized, we would multiply its samples by the corresponding weights and add them all up (without digitizing we would have to multiply the sound wave by the sine function and then integrate it). The resulting number tells us how much of a given frequency there is in this particular sound wave. A set of such numbers, corresponding to various frequencies, is called a Fourier transform (more precisely, a Digital Fourier Transform, DFT) or a frequency spectrum of the sound.

What does the Fourier transform have to do with the wavelet transform? Look at the weights we use in our analysis. These are nothing else but some very particular basis vectors. They just happen to follow the shape of a sine function (compare it with the Haar basis).

Fig. Wavelet base vectors corresponding to a Digital Fourier Transform

Notice that it's not enough to use the sine-shaped functions. For completeness, we have to add the cosine-shaped functions as well. Even though they correspond to the same frequencies as the sine-shaped ones, they describe the out-of-phase swings of the oscillators. As you can see, the Fourier basis can be obtained by sampling sine and cosine waves.

Fourier Synthesis

It so happens that the Fourier wavelets form an orthogonal system--if you take a scalar product of any two of them, you'll get zero. Also, with proper normalization, you can make the squares of these vectors equal to one. N of these vectors form a perfect orthonormal basis in N-dimensional space. In other words, any buffer with N samples can be projected onto these N vectors giving N numbers (which are called Fourier coefficients). These numbers form the digital Fourier transform of the buffer.

Conversely, given those numbers we can reconstitute the original buffer. It is enough to multiply each basis vector by its corresponding Fourier coefficient and add them all up to obtain one N-dimensional vector exactly equal to the original sample buffer. The process of reconstructing the sound wave from its Fourier coefficients is called Fourier synthesis. Since each basis vector is a sampled sine (cosine) wave of a given frequency, you see that by adding up these waves with appropriate coefficients you can re-create any sound wave. In this sense a sound wave is the sum of the pure frequencies contained in it.

We now know (theoretically) how to extract these frequencies (Fourier analysis) as well as synthesize arbitrary sounds from them (Fourier synthesis). It's time for some practical algorithms. But before we get to it, we have to learn some complex numbers.

Complex Numbers

The beauty of physics and mathematics is that everything is somehow related to everything else, and that there are so many ways of looking at the same thing, each providing new insights. Take for instance the Fourier transform. The physics of a harmonic oscillator is described by a differential equation whose solutions are trigonometric functions. A sound wave may excite a given harmonic oscillator by periodically pushing and pulling at it. When we digitize the sound, we turn it into an N-dimensional vector. The amount of excitation this sound impairs on a particular oscillator is calculated by taking the scalar product of this vector with a special basis vector in N dimensions. This basis vector is a result of sampling a trigonometric function that describes the movement of the oscillator.

Trigonometric functions are easy to understand but difficult to handle. Fortunately, there is a better language to describe the solutions to a harmonic oscillator--complex numbers. But before we get there, let's look more closely at the differential equations that describe oscillating and non-oscillating motions.

Of Rabbits and Exponentials

A population of rabbits has this peculiar property that the more rabbits there are the more they multiply. The rate of growth of the population at any moment of time is proportional to its size. In mathematics, the rate of growth is described by a derivative. The larger the derivative, the faster something grows.

Let's denote the population of rabbits at time t by p(t). Its derivative is denoted by dp(t)/dt. What we have just have said about the law of population growth of rabbits can be written in the form of a differential equation:

dp(t)/dt = α*p(t)

where α is some proportionality constant. There is a function in mathematics that can be used to solve this equation, it's called the exponential. It is defined as e to the power of t, with e being the special number called "the base of the natural logarithms" and approximately equal to 2.71828. The derivative of the exponential function, e^t, is equal to the function itself:

de^t/dt = e^t

It's a function which: the bigger it is, the faster it grows.

Our population of rabbits will therefore grow exponentially, that is, it will follow an exponential function:

p(t) = A*e^αt

with A being the size of the population at t=0 (remember that e⁰ = 1). The parameter α is called theexponent of the exponential function.

Notice that the second derivative (the derivative of the derivative) of the exponential function is still equal to the function itself. If we include an arbitrary exponent, we get:

d²e^αt/dt² = α²e^αt

In other words, taking the second derivative of the exponential function is equivalent to multiplying it by the square of the exponential. In general, it's a very useful thing to know when solving differential equations: the n-th derivative of the exponential function is equal to the function itself multiplied by the n-th power of its exponential.

Now back to our oscillator. Remember the differential equation it fulfilled?

d²x(t)/dt² = -ω²x(t)

Darn! We are so close! If it weren't for this minus sign, we could solve this equation using an exponential. All we need is a special number, call it i, whose square would be equal to minus one. We could then write:

x(t) = A*e^iωt

with the understanding that (iω)² = -ω².

The Imaginary Unit

It turns out that mathematicians have thought of numbers whose squares are negative. I will not go into all the (very interesting) mathematical subtleties here. The important thing is how one performs calculations on such numbers.

It is enough to take it at face value that there is such an entity called the imaginary unit, i (engineers often call it j), whose only interesting property is that its square is -1. In other words, i is the square root of -1. Now that you've suspended your disbelief, you can do all kinds of calculations usingcomplex numbers of the form

z = x + iy

where x and y are regular (real) numbers. The x number is called the real part, and thy y number theimaginary part of z. For instance, you can add two such numbers, z and w,

z + w = (x + iy)+(v + iu) = x + v + i(y + u)

or multiply them,

z * w = (x + iy)(v + iu) = xv + iyv + ixu + (i*i)yu
= xv - yu +i(yv + xu)

Just like this! Notice that the results of arithmetic operations on complex numbers are again complex numbers.

In order to see how division works, let's first introduce an operation called complex conjugation. The complex conjugate of z is z^* = (x - iy). We just flip the sign in front of i. When you multiply a complex number by its complex conjugate, you always get a real positive number. Check this out:

z z^* = (x + iy)(x - iy) = x² + iyx - ixy -(i*i)y² = x² + y²

The square root of this number, sqrt (x² + y²) is called the modulus of a complex number.

Dividing by a complex number z is equivalent of multiplying it by its complex conjugate divided by modulus square. Indeed, try multiplying z by this:

z^*/(z z^*)

and you will get 1. Therefore this is indeed the inverse of z.

Now back to our differential equations. The solution of the harmonic oscillator using complex numbers is indeed:

x(t) = A*e^iωt

Let's look at the properties of the complex exponential with an imaginary exponent. First of all, its modulus is one, because

e^it(e^it)^* = e^ite^-it = e^it-it = e⁰ = 1

No magic here: complex conjugation flips the sign in front of i, the multiplication of powers with the same base is equivalent to the summing of the exponents, and finally, anything to the power of zero is equal to one.

And here's the tour de force: e^it is a complex number, so, by definition, it must be of the form x + iy. So what are the values of x and y? Guess what, they are the good old trigonometric functions:

e^it = cos(t) + i sin(t)

Do you see that the modulus of this complex number is one? (Hint: it's a simple trigonometric identity based on the Pythagorean theorem.)

We have made a full circle (there's a subtle pun in here). We knew that the solutions to the equation of the harmonic oscillator could be expressed in terms of trigonometric functions. On the other hand, we were able to solve this equation using a complex exponential. Now we see that this single exponential describes both the sine and the cosine oscillations in one compact package. It's the same physics, but the new notation is a lot easier, once you get used to it.

Trigonometry Made Simple

Remember this identity?

sin (a + b) = sin (a) cos (b) + cos (a) sin (b)

We used it in our discussion of the harmonic oscillator. Frankly, I can never remember those trigonometric identities. But with complex numbers they are trivial to derive. For instance, you know that

e^{i(a + b)} = e^iae^ib

Using our exponential/trigonometric equivalence, we get

e^{i(a + b)} = cos (a+b) + i sin (a+b)

and

e^iae^ib = (cos (a) + i sin (a)) (cos (b) + i sin (b))
= cos (a) cos (b) - sin (a) sin (b)
+ i (sin (a) cos (b) + cos (a) sin (b))

Comparing these two, we get two separate trigonometric identities for the price of one.

cos (a + b) = cos (a) cos (b) - sin (a) sin (b)

sin (a + b) = sin (a) cos (b) + cos (a) sin (b)

since both the real parts and the imaginary parts have to be independently equal.

Finally, since they are described by pairs of real numbers, complex numbers can be viewed as 2-dimensional vectors. A complex number with modulus one is equivalent to a unit vector. In particular, a complex exponential, e^it, can be seen as a unit vector at an angle t from the x-axis, see Fig. This angle is called the phase of the complex number.

Fig. A complex number of modulus one and phase t represented as a vector on a unit circle

In general, any complex number, z = x + i y can be represented as the modulus r times the complex exponential of the phase φ,

z = r e^iφ.

Using this representation, the most general solution of a harmonic oscillator, A*e^{iωt + φ} can be viewed as a vector moving around a circle of radius A with constant angular speed ω. The initial phase, φ, defines the starting point of the movement.

Digital Fourier Transform

Exponential Basis

As I explained earlier, the Fourier basis consists of sampled sine and cosine functions. Using complex numbers, these two can be combined into a single exponential, so the complex Fourier basis consists of sampled complex exponentials.

Let's start with a somewhat trivial example of a 4-point Fourier transform. We'll need 4 basis vectors,e_k, k = 0...3, which are sampled exponentials of increasing frequencies.

e_k = (e^-2πi(0*k)/4, e^-2πi(1*k)/4, e^-2πi(2*k)/4, e^-2πi(3*k)/4)

These are four samples of the exponential e^-2πitk/4 taken at t = 0, 1, 2, 3.

Here are all four of them:

e₀ = (1, 1, 1, 1)
e₁ = (1, e^-2πi/4, e^-2πi2/4, e^-2πi3/4)
e₂ = (1, e^-2πi2/4, e^-2πi4/4, e^-2πi6/4)
e₃ = (1, e^-2πi3/4, e^-2πi6/4, e^-2πi9/4)

As an example, the real part of e₃ is

(1, cos(-2π3/4), cos(-2π6/4), cos(-2π9/4)

which is a sampled cosine function. In fact, using simple trigonometric identities, like cos(π/2) = 0, etc., one can simplify the real and imaginary parts to:

Re (e₀) = (1, 1, 1, 1)	Im (e₀) = (0, 0, 0, 0)
Re (e₁) = (1, 0, -1, 0)	Im (e₁) = (0, -1, 0, 1)
Re (e₂) = (1, -1, 1, -1)	Im (e₂) = (0, 0, 0, 0)
Re (e₃) = (1, 0, -1, 0)	Im (e₃) = (0, 1, 0, -1)

Notice that they are not all independent, in fact there can only be 4 independent basis vectors in 4-d. (The original complex exponentials, however, are independent in a 4-d complex space. They could be used to Fourier transform a complex sequence. It just so happens that all our samples have zero imaginary parts.)

Below is a picture of the real parts of the basis vectors, overlapped with the appropriate trigonometric functions whose samples they form.

Fig. The real part of the four basis vectors

The fact that these vectors are so regular is the basis of the FFT algorithm. But before we get to that, let's introduce some convenient notation. All our basis vectors in N-dimensions can be expressed in terms of various powers of one number, W_N,

W_N = e^-2πi/N

For instance, for N=4 we have

e_k = (W₄^k*0, W₄^k*1, W₄^k*2, W₄^k*3)

Check this out for yourself:

e^-2πik*2/4=(e^-2πi/4)^k*2 =W₄^k*2

So this is the final form of our N-point Fourier basis

e_k[n] = W_N^kn, n = 0, 1, ... N-1

and the kth component of the DFT of the buffer v[n] is the scalar product of the kth basis vector with our vector of samples:

V[k] = Σ_n=0...N-1 W^kn v[n]

(the summation is from n=0 to N-1).

Properties of the Digital Fourier Transform

What can you do with the Fourier transform? You can display it graphically. Or you can modify it and then reverse-transform it. You may, for instance, cut off higher (or lower) frequencies. Or you can remove a particular frequency that is polluting your signal. This is called filtering the signal.

The reverse of the digital Fourier transform (DFT)is calculated simply by applying the Fourier transform again, and reversing the resulting buffer (to be precise, because of our normalization, you should also divide the resulting samples by N). By reversing I mean reading the buffer from back to front.

The components of the DFT are complex numbers. They have a modulus and a phase:

The modulus of V[k], sqrt (V[k]*V^*[k]), describes the intensity of the particular frequency corresponding to k (more precisely, because of our normalization, you should divide the modulus by sqrt(N))
The phase of a given component describes the phase shift of this component--at what angle it starts its oscillations

In most cases, the phase is of no relevance--our ears cannot detect it. However, if you have a stereo pair of sources, your ears can detect the phase difference between them. So if you do some signal processing involving multiple channels, you have to take the phases into account.

What is the relation between k--the index of the Fourier component--and its frequency? Here is the function describing the k-th basis vector, e^-2πikt/N. We sample it at t = 0, 1, 2,...,N-1. Let's just look at its real part, cos(2πkt/N) (cosine is symmetric, so the sign of its argument doesn't matter). Cosine starts repeating itself after 2π, which corresponds to kt/N = 1 or t = N/k. So how many such repetitions fit in a segment of length N? Exactly k. In other words, our k-th basis vector has k periods per buffer. But what time interval corresponds to one buffer? That depends on the sampling speed. Ifsps is the number of samples per second, then sps/N is the number of buffers per second. Therefore k * sps / N is the number of periods per second, or the frequency.

Fast Fourier Transform (FFT)

The naive implementation of the N-point digital Fourier transform involves calculating the scalar product of the sample buffer (treated as an N-dimensional vector) with N separate basis vectors. Since each scalar product involves N multiplications and N additions, the total time is proportional to N² (in other words, it's an O(N²) algorithm). However, it turns out that by cleverly re-arranging these operations, one can optimize the algorithm down to O(N log(N)), which for large N makes a huge difference. The optimized version of the algorithm is called the fast Fourier transform, or the FFT.

Let's do some back of the envelope calculations. Suppose that we want to do a real-time Fourier transform of one channel of CD-quality sound. That's 44k samples per second. Suppose also that we have a 1k buffer that is being re-filled with data 44 times per second. To generate a 1000-point Fourier transform we would have to perform 2 million floating-point operations (1M multiplications and 1M additions). To keep up with incoming buffers, we would need at least 88M flops (floating-point operations per second). Now, if you are lucky enough to have a 100 Mflop machine, that might be fine, but consider that you'd be left with very little processing power to spare.

Using the FFT, on the other hand, we would perform on the order of 2*1000*log₂(1000) operations per buffer, which is more like 20,000. Which requires 880k flops--less than 1 Mflop! A hundred-fold speedup.

The standard strategy to speed up an algorithm is to divide and conquer. We have to find some way to group the terms in the equation

V[k] = Σ_n=0..N-1 W_N^kn v[n]

Let's see what happens when we separate odd ns from even ns (from now on, let's assume that N is even):

V[k] = Σ_{n even} W_N^kn v[n] + Σ_{n odd} W_N^kn v[n]
= Σ_r=0..N/2-1 W_N^k(2r) v[2r] + Σ_r=0..N/2-1 W_N^k(2r+1) v[2r+1]
= Σ_r=0..N/2-1 W_N^k(2r) v[2r] + Σ_r=0..N/2-1 W_N^k(2r) W_N^k v[2r+1]
= Σ_r=0..N/2-1 W_N^k(2r) v[2r] + W_N^k Σ_r=0..N/2-1 W_N^k(2r) v[2r+1]
= (Σ_r=0..N/2-1 W_N/2^kr v[2r])
+ W_N^k (Σ_r=0..N/2-1 W_N/2^kr v[2r+1])

where we have used one crucial identity:

W_N^k(2r) = e^-2πi*2kr/N
= e^{-2πi*kr/(N/2)} = W_N/2^kr

Notice an interesting thing: the two sums are nothing else but N/2-point Fourier transforms of, respectively, the even subset and the odd subset of samples. Terms with k greater or equal N/2 can be reduced using another identity:

W_N/2^m+N/2 = W_N/2^mW_N/2^N/2 = W_N/2^m

which is true because W_m^m = e^-2πi = cos(-2π) + i sin(-2π)= 1.

If we start with N that is a power of 2, we can apply this subdivision recursively until we get down to 2-point transforms.

We can also go backwards, starting with the 2-point transform:

V[k] = W₂^0*k v[0] + W₂^1*k v[1], k=0,1

The two components are:

V[0] = W₂⁰ v[0] + W₂⁰ v[1] = v[0] + W₂⁰ v[1]
V[1] = W₂⁰ v[0] + W₂¹ v[1] = v[0] + W₂¹ v[1]

We can represent the two equations for the components of the 2-point transform graphically using the, so called, butterfly

Fig. Butterfly calculation

Furthermore, using the divide and conquer strategy, a 4-point transform can be reduced to two 2-point transforms: one for even elements, one for odd elements. The odd one will be multiplied by W₄^k. Diagrammatically, this can be represented as two levels of butterflies. Notice that using the identity W_N/2ⁿ = W_N²ⁿ, we can always express all the multipliers as powers of the same W_N (in this case we choose N=4).

Fig. Diagrammatical representation of the 4-point Fourier transform calculation

I encourage the reader to derive the analogous diagrammatical representation for N=8. What will become obvious is that all the butterflies have similar form:

Fig. Generic butterfly graph

This graph can be further simplified using this identity:

W_N^s+N/2 = W_N^s W_N^N/2 = -W_N^s

which is true because

W_N^N/2 = e^-2πi(N/2)/N = e^-πi = cos(-π) + isin(-π) = -1

Here's the simplified butterfly:

Fig. Simplified generic butterfly

Using this result, we can now simplify our 4-point diagram.

Fig. 4-point FFT calculation

This diagram is the essence of the FFT algorithm. The main trick is that you don't calculate each component of the Fourier transform separately. That would involve unnecessary repetition of a substantial number of calculations. Instead, you do your calculations in stages. At each stage you start with N (in general complex) numbers and "butterfly" them to obtain a new set of N complex numbers. Those numbers, in turn, become the input for the next stage. The calculation of a 4-point FFT involves two stages. The input of the first stage are the 4 original samples. The output of the second stage are the 4 components of the Fourier transform. Notice that each stage involves N/2 complex multiplications (or N real multiplications), N/2 sign inversions (multiplication by -1), and N complex additions. So each stage can be done in O(N) time. The number of stages is log₂N (which, since N is a power of 2, is the exponent m in N = 2^m). Altogether, the FFT requires on the order of O(N logN) calculations.

Moreover, the calculations can be done in-place, using a single buffer of N complex numbers. The trick is to initialize this buffer with appropriately scrambled samples. For N=4, the order of samples is v[0], v[2], v[1], v[3]. In general, according to our basic identity, we first divide the samples into two groups, even ones and odd ones. Applying this division recursively, we split these groups of samples into two groups each by selecting every other sample. For instance, the group (0, 2, 4, 6, 8, 10, ... 2N-2) will be split into (0, 4, 8, ...) and (2, 6, 10, ...), and so on. If you write these numbers in binary notation, you'll see that the first split (odd/even) is done according to the lowest bit; the second split is done according to the second lowest bit, and so on. So if we start with the sequence of, say, 8 consecutive binary numbers:

000, 001, 010, 011, 100, 101, 110, 111

we will first scramble them like this:

[even] (000, 010, 100, 110), [odd] (001, 011, 101, 111)

then we'll scramble the groups:

((000, 100), (010, 110)), (001, 101), (011, 111))

which gives the result:

000, 100, 010, 110, 001, 101, 011, 111

This is equivalent to sorting the numbers in bit-reversed order--if you reverse bits in each number (for instance, 110 becomes 011, and so on), you'll get a set of consecutive numbers.

So this is how the FFT algorithm works (more precisely, this is the decimation-in-time in-place FFT algorithm).

Select N that is a power of two. You'll be calculating an N-point FFT.
Gather your samples into a buffer of size N
Sort the samples in bit-reversed order and put them in a complex N-point buffer (set the imaginary parts to zero)
Apply the first stage butterfly using adjacent pairs of numbers in the buffer
Apply the second stage butterfly using pairs that are separated by 2
Apply the third stage butterfly using pairs that are separated by 4
Continue butterflying the numbers in your buffer until you get to separation of N/2
The buffer will contain the Fourier transform

Implementation

We will start by initializing some data structures and pre-computing some constants in the constructor of the FFT object.

// Use complex numbers from Standard Library
#include <complex>
typedef std::complex<double> Complex;

// Points must be a power of 2
Fft::Fft (int Points, long sampleRate)
    : _Points (Points), _sampleRate (sampleRate)
{
    _sqrtPoints = sqrt((double)_Points);
    // calculate binary log
    _logPoints = 0;
    Points--;
    while (Points != 0)
    {
        Points >>= 1;
        _logPoints++;
    }
    // This is where the original samples will be stored
    _aTape = new double [_Points];
    for (int i = 0; i < _Points; i++)
        _aTape[i] = 0;
    // This is the in-place FFT buffer
    _X = new Complex [_Points];

    // Precompute complex exponentials for each stage
    _W = new Complex * [_logPoints+1];
    int _2_l = 2;
    for (int l = 1; l <= _logPoints; l++)
    {
        _W[l] = new Complex [_Points];

        for ( int i = 0; i < _Points; i++ )
        {
            double re =  cos (2. * PI * i / _2_l);
            double im = -sin (2. * PI * i / _2_l);
            _W[l][i] = Complex (re, im);
        }
        _2_l *= 2;
    }

    // prepare bit-reversed mapping
    _aBitRev = new int [_Points];
    int rev = 0;
    int halfPoints = _Points/2;
    for (i = 0; i < _Points - 1; i++)
    {
        _aBitRev[i] = rev;
        int mask = halfPoints;
        // add 1 backwards
        while (rev >= mask)
        {
            rev -= mask; // turn off this bit
            mask >>= 1;
        }
        rev += mask;
    }
    _aBitRev [_Points-1] = _Points-1;
}

The FFT buffer is filled with samples from the "tape" in bit-reversed order

    for (i = 0; i < _Points; i++)
        PutAt (i, _aTape[i]);

The bit reversal is done inside PutAt, which also converts real samples into complex numbers (with the imaginary part set to zero):

void Fft::PutAt (int i, double val)
{
    _X [_aBitRev[i]] = Complex (val);
}

The calculation of the FFT is relatively simple

//
//               0   1   2   3   4   5   6   7
//  level   1
//  step    1                                     0
//  increm  2                                   W 
//  j = 0        <--->   <--->   <--->   <--->   1
//  level   2
//  step    2
//  increm  4                                     0
//  j = 0        <------->       <------->      W      1
//  j = 1            <------->       <------->   2   W
//  level   3                                         2
//  step    4
//  increm  8                                     0
//  j = 0        <--------------->              W      1
//  j = 1            <--------------->           3   W      2
//  j = 2                <--------------->            3   W      3
//  j = 3                    <--------------->             3   W
//                                                              3
// 

void Fft::Transform ()
{
    // step = 2 ^ (level-1)
    // increm = 2 ^ level;
    int step = 1;
    for (int level = 1; level <= _logPoints; level++)
    {
        int increm = step * 2;
        for (int j = 0; j < step; j++)
        {
            // U = exp ( - 2 PI j / 2 ^ level )
            Complex U = _W [level][j];
            for (int i = j; i < _Points; i += increm)
            {
                // in-place butterfly
                // Xnew[i]      = X[i] + U * X[i+step]
                // Xnew[i+step] = X[i] - U * X[i+step]
                Complex T = U;
                T *= _X [i+step];
                _X [i+step] = _X[i];
                _X [i+step] -= T;
                _X [i] += T;
            }
        }
        step *= 2;
    }
}

The variable step is the "spread" of the butterfly--distance between the two inputs of the butterfly. It starts with 1 and doubles at every level.

At each level we have to calculate step bunches of butterflies, each bunch consisting of butterflies separated by the distance of increm (increm is twice the step). The calculation is organized into bunches, because each bunch shares the same multiplier W.

The source of the FFT class is available for downloading. You can also download the sources of our real-time Frequency Analyzer which uses the FFT algorithm to analyze the sound acquired through your PC's microphone (if you have one).

Source codes

COMPLEX_H

#if !defined COMPLEX_H
#define COMPLEX_H
//------------------------------------
// complex.h
// Complex number
// (c) Reliable Software, 1996
//------------------------------------

#include <math.h>

class Complex
{
public:
Complex () {}
Complex (double re): _re(re), _im(0.0) {}
Complex (double re, double im): _re(re), _im(im) {}
double Re () const { return _re; }
double Im () const { return _im; }
void operator += (const Complex& c)
{
_re += c._re;
_im += c._im;
}
void operator -= (const Complex& c)
{
_re -= c._re;
_im -= c._im;
}
void operator *= (const Complex& c)
{
double reT = c._re * _re - c._im * _im;
_im = c._re * _im + c._im * _re;
_re = reT;
}
Complex operator- ()
{
return Complex (-_re, -_im);
}
double Mod () const { return sqrt (_re * _re + _im * _im); }
private:
double _re;
double _im;
};

#endif

fft.cpp

//------------------------------------
// fft.cpp
// The implementation of the
// Fast Fourier Transform algorithm
// (c) Reliable Software, 1996
//------------------------------------
#include "fft.h"
#include "recorder.h"

// log (1) = 0, log(2) = 1, log(3) = 2, log(4) = 2 ...

#define PI (2.0 * asin(1.0))

// Points must be a power of 2

Fft::Fft (int Points, long sampleRate)
: _Points (Points), _sampleRate (sampleRate)
{
_aTape = new double [_Points];
#if 0
// 1 kHz calibration wave
for (int i = 0; i < _Points; i++)
_aTape[i] = 1600 * sin (2 * PI * 1000. * i / _sampleRate);
#else
for (int i = 0; i < _Points; i++)
_aTape[i] = 0;
#endif
_sqrtPoints = sqrt((double)_Points);
// calculate binary log
_logPoints = 0;
Points--;
while (Points != 0)
{
Points >>= 1;
_logPoints++;
}

FFT_H

#if !defined FFT_H
#define FFT_H
//------------------------------------
// fft.h
// Fast Fourier Transform
// (c) Reliable Software, 1996
//------------------------------------

#include "complex.h"
#include "wassert.h"

class SampleIter;

class Fft
{
public:
Fft (int Points, long sampleRate);
~Fft ();
int Points () const { return _Points; }
void Transform ();
void CopyIn (SampleIter& iter);

double GetIntensity (int i) const
{
Assert (i < _Points);
return _X[i].Mod()/_sqrtPoints;
}

int GetFrequency (int point) const
{
// return frequency in Hz of a given point
Assert (point < _Points);
long x =_sampleRate * point;
return x / _Points;
}

int HzToPoint (int freq) const
{
return (long)_Points * freq / _sampleRate;
}

int MaxFreq() const { return _sampleRate; }

int Tape (int i) const
{
Assert (i < _Points);
return (int) _aTape[i];
}

private:

void PutAt ( int i, double val )
{
_X [_aBitRev[i]] = Complex (val);
}

int _Points;
long _sampleRate;
int _logPoints;
double _sqrtPoints;
int *_aBitRev; // bit reverse vector
Complex *_X; // in-place fft array
Complex **_W; // exponentials
double *_aTape; // recording tape
};

#endif

ERKUTLU SCIENCE CENTER

29 Eylül 2012 Cumartesi

Signal Processing