Daqarta
Data AcQuisition And Real-Time Analysis
Scope - Spectrum - Spectrogram - Signal Generator
Software for Windows
Science with your Sound Card!

# Gut-Level Fourier Transforms - Part 7:

## Spectrograms

In our discussion of windowing to reduce spectral leakage, we saw how window functions taper the sampled data set smoothly to zero at the ends. This is needed when the input is a continuous repeating waveform, in order to "fake" its repeat interval into matching the DFT or FFT data set length. However, many real-world signals contain transients that might get tapered to nothing if they happen to fall near the beginning or end of a given data set. How do we handle those?

If the signal contains only a transient, such as a recording of a brief impact, and it is short enough to fit entirely into the data set (that is, it goes to zero, or nearly so, by the end of the set), then no windowing is needed. Because it goes to zero, there would be no apparent discontinuity if we repeated this data set, so there would be no spectral leakage problem.

But many transients take a long time to decay, and may not fit into the maximum FFT size of your spectrum analyzer. And many more signals, such as speech, are full of transients mixed with tonal components. How do we analyze those? The general term for the study of time-varying spectra is joint time-frequency analysis, often abbreviated as JTFA. The most popular tool for this is the spectrogram, which employs the FFT in a form called the short-time Fourier transform (STFT).

The spectrogram is a general-purpose method for viewing time-varying spectra. It shows frequency versus time, with intensity encoded as color or brightness to represent a third dimension. Essentially, it is made up of vertical stripes, each of which represents one conventional FFT spectrum turned on end. Frequency runs up the screen, and magnitude comes "out of" the screen by virtue of the color or brightness code. Figure 1 shows a short section of a typical speech spectrogram, with the color range reduced for broader monitor compatibility and to minimize image file size.

Fig. 1: Spectrogram of spoken "Spectrogram"

You can think of the spectrogram as similar to ordinary sheet music, where time flows to the right, and higher notes are higher up the page. The difference is that the spectrogram encodes the loudness of each component as brightness or color, and the duration of a "note" is shown directly instead of symbolically.

Figure 1 is the spectrogram of the spoken word "spectrogram", with the approximate locations of the syllables shown on the X axis. Note the high frequency noise energy in the sibilant "s", and the loud plosive of the "p". The stacked horizontal streaks are harmonics of the glottal impulse generated by the vocal cords. Changes in the shape of the vocal tract result in different filter characteristics, which create the formant patterns that we recognize as speech.

But just standing a bunch of spectra on end does not by itself solve the window issue: Since the signal may contain continuous tones, we still need windowing to prevent spectral leakage. And since it may also contain transients, we need to worry about the window squashing them. The solution is overlap processing.

In overlap processing, we don't take all-new data for each successive FFT; we re-use some of the old data each time. Suppose we are taking 1024-sample data sets for our FFTs. The first set would be samples 0 to 1023, just as usual. But with 50% overlap, the next set would run from 512 to 1535... only 512 new samples, plus 512 old ones.
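As a rough sketch of that bookkeeping, the Python snippet below lists the first and last sample of each data set for a given FFT size and overlap. The function name `frame_bounds` and its parameters are made up for illustration; they are not part of any particular analyzer.

```python
def frame_bounds(total_samples, n_fft=1024, overlap_samples=512):
    """Return (first, last) sample indices for each overlapped FFT data set."""
    step = n_fft - overlap_samples  # new samples taken for each FFT
    if step < 1:
        raise ValueError("overlap must leave at least one new sample per FFT")
    last_start = total_samples - n_fft
    return [(s, s + n_fft - 1) for s in range(0, last_start + 1, step)]


# 50% overlap on 1024-sample sets, as in the text
print(frame_bounds(2560)[:3])  # [(0, 1023), (512, 1535), (1024, 2047)]
```

Setting `overlap_samples` to `n_fft - 1` gives the maximum possible overlap, with only one new sample per FFT.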

The selected window function is applied to each set independently. Consider a transient that arrives near the end of the first data set: It may get flattened by the taper of the first window, but it will be near the center of the second one. So we have here the best of both worlds: Windows reduce leakage on continuous tones, yet transients still get through.
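To put rough numbers on this, the sketch below looks up the window weight that a single-sample transient would receive in two successive 50%-overlapped data sets. It assumes a Hanning window (via NumPy's `np.hanning`) and a hypothetical transient at sample 1000; both are choices made for illustration only.

```python
import numpy as np

def window_weight(sample_index, frame_start, n_fft=1024):
    """Window weight applied to one sample within the data set starting at frame_start."""
    window = np.hanning(n_fft)  # assumed window type
    return window[sample_index - frame_start]


transient = 1000  # hypothetical transient position, near the end of the first set
print(window_weight(transient, frame_start=0))    # under 0.01: nearly tapered away
print(window_weight(transient, frame_start=512))  # over 0.99: near the window center
```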

What about resolution? We already know that larger FFTs give finer frequency resolution, but now we must also consider temporal resolution. Compare a 1024-sample FFT to a 256-sample FFT, both running at 10240 samples per second. The spectral line spacing is Fs / N, so we get a frequency resolution of 10 Hz in the first case, and 40 Hz in the second.

But temporal resolution is the reciprocal of frequency resolution. There is a sort of "uncertainty principle" at work here, much like Heisenberg's: If we want more temporal detail, we have to give up spectral detail, and vice-versa. Imagine a very short tone burst that only lasts a few dozen samples. As long as the whole burst is included in one FFT, we will get essentially the same spectrum whether it happens near the start, middle, or end (neglecting window taper for the moment). So we can't say for sure when the burst occurs with any precision greater than the time for the N samples: 0.10 second for 1024 samples, versus 0.025 second for 256.
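The arithmetic for the two FFT sizes above can be checked with a few lines of Python. This is only a restatement of Fs / N and N / Fs; the helper names are invented for this sketch.

```python
def line_spacing_hz(fs, n_fft):
    """Spectral line spacing (frequency resolution) in Hz: Fs / N."""
    return fs / n_fft


def frame_duration_s(fs, n_fft):
    """Time spanned by one FFT data set in seconds: N / Fs."""
    return n_fft / fs


fs = 10240
for n_fft in (1024, 256):
    df = line_spacing_hz(fs, n_fft)
    dt = frame_duration_s(fs, n_fft)
    # The product df * dt is 1 for any N, which is the trade-off in a nutshell.
    print(n_fft, df, dt, df * dt)
```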

The obvious solution seems to be lots of overlap: At maximum overlap, we take only one new sample for each FFT, and the rest are old data. The FFT slowly slides through the data record, one new sample at a time, and continues to show nothing until it eventually encounters the short tone burst. As soon as we see any activity, voilà, we know where the start of the burst is. As we keep sliding, we can likewise determine the end of the burst when the spectral activity returns to zero.

Of course, we need to use a window function to reduce spectral leakage when the burst is only partially within the FFT, and this will make the start and stop a bit more gradual. But aside from that it seems like we can have our cake, and eat it too, by using large FFTs with high overlaps.

Not quite. Look at the following two spectrograms. The same input signal was used for both: A 2 kHz tone burst, with a 4 kHz tone burst starting 20 msec later. The spectrogram in Figure 2 used 1024-sample FFTs, while Figure 3 used 256-sample FFTs. Overlaps were adjusted to give the same time axis, and the 256-sample FFTs were given x4 vertical magnification (simple pixel duplication) to get matching frequency axes. Both used Hanning windows.

Fig. 2: 1024-sample Spectrogram

Fig. 3: 256-sample Spectrogram

It should be clear from this that the larger FFTs really do have poorer temporal resolution, even using overlap. In both figures you can see that the 4 kHz tone starts 20 msec after the 2 kHz tone, but with the larger FFTs it's a lot harder to say exactly when it started. What happened here?

Since the larger FFT has a longer "reach" in time, it slides into each burst well in advance of the smaller FFT, and slides past it much later. For a fairly simple and well controlled situation like this test, it is certainly possible to use the larger FFT for the better spectral resolution (which wasn't particularly needed here). But if the signal becomes very "busy", such that multiple events happen within one FFT duration, this is less satisfactory.
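The longer "reach" can be seen numerically with a small NumPy experiment. This is a sketch under assumed conditions, not a reconstruction of the figures: the 10240 Hz sample rate is borrowed from the earlier example, and the 50 msec burst length, the 16-sample step, and the 10% detection threshold are all arbitrary choices. It slides a Hanning-windowed FFT along the signal and reports how long the 2 kHz line appears to be "on".

```python
import numpy as np

FS = 10240  # assumed sample rate, borrowed from the earlier example


def make_bursts(fs=FS, total=4096, start=1536, length=512):
    """2 kHz burst plus a 4 kHz burst starting 20 msec later (burst length assumed)."""
    n = np.arange(total)
    x = np.zeros(total)
    delay = int(round(0.020 * fs))  # 20 msec expressed in samples
    on_2k = (n >= start) & (n < start + length)
    on_4k = (n >= start + delay) & (n < start + delay + length)
    x[on_2k] += np.sin(2 * np.pi * 2000 * n[on_2k] / fs)
    x[on_4k] += np.sin(2 * np.pi * 4000 * n[on_4k] / fs)
    return x


def apparent_duration(x, n_fft, tone_hz, fs=FS, hop=16, threshold=0.1):
    """Seconds during which the tone's spectral line exceeds threshold * its peak."""
    window = np.hanning(n_fft)
    k = int(round(tone_hz * n_fft / fs))  # FFT bin nearest the tone
    starts = range(0, len(x) - n_fft + 1, hop)
    mags = np.array([abs(np.fft.rfft(x[s:s + n_fft] * window)[k]) for s in starts])
    active = mags > threshold * mags.max()
    return active.sum() * hop / fs


x = make_bursts()
for n_fft in (1024, 256):
    print(n_fft, apparent_duration(x, n_fft, 2000))
```

The true burst lasts 0.05 second in this setup. Both FFT sizes report something longer, but the 1024-sample FFT stretches it considerably more than the 256-sample FFT does.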

Still, for the vast majority of applications, you can usually reach a compromise by adjusting the FFT size and the sample rate, along with the overlap. Figure 1 uses 1024-point FFTs, but you can clearly resolve the features of the spoken word "spectrogram". To get this overall view the overlap was set to 900 samples, but it could be set even higher to see additional detail in individual syllables.

If a signal has been recorded, it is easy to go back later and adjust the overlap and FFT size to get the best results. But real-time operation requires fast processing for high overlaps. The maximum overlap that can be achieved in a live situation is limited by how long it takes the system to perform one FFT and display it in the vertical stripe format of the spectrogram. That time multiplied by the sample rate gives the number of samples that will arrive during the processing interval. The system must therefore advance the FFT window along the incoming waveform by at least that many samples, or it will continually fall farther behind.
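That limit is easy to estimate. The sketch below assumes a hypothetical system that needs 5 msec to compute and draw one FFT stripe; the function name and the timing figure are illustrative only, and would have to be measured on a real system.

```python
import math


def max_realtime_overlap(n_fft, fs, seconds_per_fft):
    """Return (minimum step, maximum overlap) in samples for live operation."""
    # Samples that arrive while one FFT is computed and displayed
    min_step = max(1, math.ceil(seconds_per_fft * fs))
    # If min_step exceeds n_fft, the system cannot keep up even with no overlap
    max_overlap = max(0, n_fft - min_step)
    return min_step, max_overlap


# Hypothetical: 5 msec per FFT and display, 1024-sample FFTs at 10240 samples/sec
print(max_realtime_overlap(1024, 10240, 0.005))  # (52, 972)
```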

You can use the author's Daqarta for Windows software to turn your Windows sound card into a full-fledged spectrum analyzer, with a built-in 256-color spectrogram capability. You can also record data for later analysis with full overlap control via the DDisk Read Step Size.

All Daqarta features are free to use for 30 days or 30 sessions, after which it becomes a freeware signal generator... with full analysis capabilities. (Only the sound card inputs are ignored.)

INTERSTELLAR RESEARCH:
Over 30 Years of Innovative Instrumentation
© Copyright 2007 - 2017 by Interstellar Research
All rights reserved