Describes frequency and pitch detection for pitch detection on Arduino (frequency.cpp
, pitch.cpp
). The core of the application: the frequency detection algorithm. Part of the project Arduino Pitch Detector.\(\)
Each musical instruments create an unique set of harmonics [demo]. In the commonly used Even Tempered Scale, the A4 key on the piano corresponds a fundamental frequency \(f_0=440\mathrm{\ Hz}\). Other frequencies follow as:
$$ f=2^{\frac{n}{12}}\times 440\,\rm{Hz} $$ where \(n\) is the number of half-steps from middle A (A4).
Designs considered
The main challenge of this project is to detect the fundamental frequency of the notes played using an embedded system. The fundamental frequency is defined as the lowest frequency produced by an oscillation.
The following three methods were considered
- Using a time-domain feature such as zero crossings. This means that you find the distance between when the waveform goes from negative to positive the first time and when it does that a second time.
- Using autocorrelation to determine the frequency of instrumental sounds as published by Judith C. Brown and Bin Zhang [Brown, Monti]. Autocorrelation is a math tool for finding repeating patterns. It estimates the degree to which a signal and a time lagged version of itself are correlated. A high correlation indicates a periodicity in the signal at the corresponding time lag.
- An alternate method of calculating autocorrelation is by using a Fast Fourier Transform and approaching it similar to convolution. To get cross-correlation instead of convolution, I time-reverse one of the signals before doing the FFT, or take the complex conjugate of one of the signals after the FFT as shown in $$ R_{xx}(k) = \mathcal{F}^{-1}\left(\mathcal{F}(t)\times\mathcal{F}(t+k)^*\right) $$
A literature study revealed that using time-domain features (1) will not perform well for musical instruments, such as a clarinet, that produce harmonics that are stronger than the fundamental frequency.
Brown’s method (2) is more promising. It calculates the autocorrelation \(R_{xx}\) at lag \(k\) by the equation [wiki, Lyon]
$$ \begin{align} R_{xx}(k) & =\frac{1}{\sigma^2} \sum_{t=1}^N(s_t-\bar{s})(s_{t+k}-\bar{s})\\ \rm{where}\quad \bar{s}&=\frac{1}{N}\sum_{t=1}^Ns_t,\quad \sigma=\sqrt{\frac{1}{N}\sum_{t=1}^N(s_t-\bar{s})^2}\nonumber \end{align} $$
The symbols:
- \(s\) are audio samples
- \(N\) is the total number of samples
- \(k\) is the lag
- \(\bar{s}\) is the mean signal value
- \(\sigma^2\) is a normalization factor.
However, calculating the autocorrelation requires \(2N\) subtractions, \(N\) additions, \(2N\) multiplications, and \(N\) divisions. This is likely to exceed the design constraints.
The alternate method of calculating autocorrelation (3) reduces the processing requirement to \(N-\log(N)\), but the algorithm uses significantly more memory. This leaves less memory to store audio samples thereby reducing the window size and consequently limits the ability to recognize low frequencies.
Once it determines the frequency, the MIDI pitch \(m\) follows as
$$ m = 69+12\log_2\frac{f}{440} $$
Design to find the frequency
To detect the fundamental frequency, I simplified Brown’s method by making two assumptions
- The signal has no DC bias, \(\bar{s}=0\).
- We’re only interested in the lag for which the autocorrelation peaks, not the absolute value of the autocorrelation. Therefore, the normalization factor \(\sigma^2\) that is independent of the lag \(k\) can be ignored.
- If the term \(t+k\) extends past the length of the series, the series is considered to be \(0\).
Based on these assumptions, the autocorrelation can be estimated as:
$$ R_{xx}(k) = \sum_{t=1}^Ns_t\,s_{t+k} $$
The figure below shows a visualization of the term . The original waveform is shown in blue, and the time lagged version in red. The black curve shows the multiplication of these signals.
The plot below shows an example of the estimated autocorrelation for \(R_{xx}(k)\) as a function of the lag \(k\). By definition the maximum autocorrelation \(R_{xx}(0)\) is at lag \(k=0\).
I ported my frequency detection code to GNU Octave to enhance my visual understanding of the algorithm. This was especially helpful in determining the empirical threshold for the peak finding algorithm.
A peak finding algorithm looks for the first peak that exceeds a threshold at \(\frac{2}{3}R_{xx}(0)\). The corresponding lag \(k_0\) is considered the period time \(T_0\). The fundamental frequency \(f_0\) follows as the inverse of \(T_0\).
The listing below shows a code fragment from frequency.cpp
that implements the autocorrelation function.
Design to find the peak
By definition the autocorrelation is maximum at lag \(k=0\). If we find the maximum value for \(R_{xx}(k)\), for \(0 \lt k \lt N\), then \(k\) is the period time. This requires calculating \(R_{xx}(k)\) for all values of \(k\).
To make it faster, I accept the first maximum that is above \(\frac{2}{3}R_{xx}(0)\). The first peak that exceeds this value is considered the period time \(T_0\). The algorithm is shown below.
Simulation
A complementary simulation in GNU Octave visualizes the algorithm, making the process easier to understand and fine tune. The video below shows the calculation of \(R_{xx}(k)\), and the peak finding algorithm. To run the simulation yourself, load the file simulation/file2pitch.m
. in GNU Octave or Matlab.
Analyzing accuracy
Analysis revealed that the sample rate and window size determine the maximum and minimum frequency that can be recognized. These variables can be configured in config.h
.
- If the sample rate is to low, high frequencies will only have a few audio samples per period, causing these frequencies not to be accurately recognized.
- The window size is the number of audio samples that are processed at the time in the autocorrelation loop. If the windows size is too small, low frequencies cannot be recognized.
- The delay is caused by the sampling of the input, calculating the frequency and the segmentation algorithm. The highest delay occurs at the lowest frequency, approximately 60 milliseconds. This was noticeable but acceptable. I observed that my simple synthesizer software introduced a noticeable additional delay. The delay was minimized by sampling audio while doing the calculations, and by stopping the autocorrelation as soon as the frequency could be determined.
Range
The project’s aim is to recognize the range of notes produced by a B♭ clarinet. This clarinet can produce notes from E♭3 to G6, corresponding to a fundamental frequencies \(f_L\) and \(f_H\) $$ \shaded{ \left\{ \begin{align} f_L &= 155.6 \, \rm{Hz} \nonumber \\ f_H &= 1.568 \, \rm{kHz} \nonumber \end{align} \right. } $$
For the 12-note-per-octave equal-tempered scale, each note or semi-tone is \(5.946309436\%\) “higher” in frequency than the previous note. Given that the frequency will be rounded to a note pitch, we can allow for an error rate \(\varepsilon\) $$ \varepsilon \approx 0.05946\% $$
The highest frequency \(f_H\), determines the sample rate. To stay within the error rate \(\varepsilon\), the sample rate \(f’_{s}\) follows as $$ f_s’ = \frac{f_H}{\varepsilon} = \frac{1568}{0.05946} \approx 26.37\,\rm{kHz} $$
The Arduino can only sample a signal at \(2^a\,\frac{16\,\rm{MHz}}{128\times 13}\), where \(a\in \mathbb{N}\). As a consequence, the sampling rate has to be rounded up to \(f_{s}^{\prime\prime}\) $$ f_s^{\prime\prime} = 38.461\,\rm{kHz} $$
The lowest frequency \(f_L\) determines the window size, where the number of audio samples \(N^\prime\) should cover at least twice period time of the lowest frequency \(f_L\) $$ N^\prime = 2\,\frac{f_s^{\prime\prime}}{f_L} = 2\,\frac{38461}{155.6} \approx 495 $$
Each audio sample requires 1 byte of the Arduino’s SDRAM. With only about \(200\) bytes left available to store audio samples \(N’\) will not fit.
Alternative
Instead, we use the available \(200\) bytes to store the samples, so the window size \(N\) is $$ \shaded{ N = 200 } $$
In order to recognize the lowest frequency \(f_L\), the sample frequency \(f_s^{””}\) follows as $$ \begin{align} f_s^{”’} &\le f_L\,\frac{N}{2} \nonumber \\ &\le 155.6\,\frac{200}{2} \nonumber \\ &\le 15.560\,\rm{Hz} \end{align} $$
For the Arduino sample rate scaler, this needs to be rounded down to \(f_s\) $$ \shaded{ f_s = 9.615\,\rm{kHz} } $$
The resulting frequency range can be expressed as $$ \begin{align} \frac{f_s}{N/2} \lt &f \lt \Delta\varepsilon\,f_s \nonumber \\ \frac{9615}{200/2} \lt &f \lt 0.0595\times 9615 \nonumber \\ 96.2\,\rm{Hz} \lt &f \lt 572\,\rm{Hz} \end{align} $$
This implies that it will only reach D♭5. Let’s see how we can improve the accuracy.
Measurements
Low notes are meased accurately, but errors increase with frequency.
Improvements
With the base algorithm in place, time has come to focus on improvements.
Improving speed
The fundamental frequency requires calculating \(R_{xx}\) for all values of \(k\) . However, the possible values of \(k\) are limited by the window size and sample frequency.
- The window size limits the lowest frequency, while
- the sample frequency limits the highest frequency.
The range for the lag \(k\) is determined by the highest frequency \(f_H\) and the windows size \(N\) $$ \begin{align} \frac{f_s}{f_H} \leq &k \leq \frac{N}{2} \nonumber \\[0.5em] \implies \frac{9615}{1568} \leq &k \leq \frac{200}{2} \end{align} $$
Rounding down the minimum and rounding up the maximum values, the range for the lag \(k\) follows as $$ \shaded{ 6 \leq k \leq 62 } $$
Improving accuracy of high notes
We can improve the accuracy of especially the high notes, by using interpolation. Fitting a parabolic curve to the prior, current and next autocorrelation values \(k_1,k_2,k_3\). The value for \(k\) that corresponds to the top of the parabola, is the estimate lag \(k_m\).
The difference between the estimated autocorrelation value \(k_m\) and the measured factor \(k_2\) is the correction factor \(\delta\). [^1][^2] [^1]: [Polynomial Interpolation, Abdulkadir Hassen [^2]: Cross-Correlation, Douglas Lyon] $$ \delta = \frac{k_3-k_1}{2(2k_2-k_1-k_3)} $$
The corresponding sample window size \(N’\), is determined by the lowest frequency \(F_L\) $$ N’ = 2\frac{f_{s}’}{f_L}=2\,\frac{9615}{155.6}\approx 125 $$
I rounded the window size \(N”\) up to 200 bytes. $$ \shaded{ N = 200 } $$
Measurements
The accuracy of high note dramatically improves as shown below.
Improving accuracy of low notes
As the peak finding algoritm considers higher values of the lag \(k\), the autocorrelation values decrease because of the zeroes introduced in the shifted signal.
I tried normalizing the autocorrelation for these introduced zeroes, by multiplying with a normalization factor of \(\frac{N}{N-k}\). As a result the normalized autocorrelation can be expressed as $$ R_{xx}(k)=\frac{N}{N-k}\sum_{t=1}^Ns_t\,s_{t+k} $$
Measurements
For the clarinet the nomalization makes it drop some high notes as shown in the figure below. The clarinet doesn’t benefit from improving accuracy on low notes, because the lowest notes it can play is \(155.6\,\rm{Hz}\) compared to the Arduino that can detect up to \(96\,\rm{Hz}\).
Piano
The results for piano samples using interpolation are shown below for reference.
For piano it greatly benefits the low notes as shown below.