COST 276 Workshop, March 31 - April 1, 2003
A STUDY OF TEMPO TRACKING ALGORITHMS FROM POLYPHONIC MUSIC SIGNALS
M. Alonso, B. David, G. Richard
École Nationale Supérieure des Télécommunications (ENST)
Dépt. Traitement du Signal et des Images
46, Rue Barrault, 75634 Paris Cedex 13
Email: {miguel.alonso, bertrand.david, gael.richard}@enst.fr
ABSTRACT
In this paper, algorithms for tempo tracking from polyphonic music signals are introduced. These new methods are based on the association of a filter bank with robust pitch detection algorithms such as the spectral sum or spectral product. These algorithms are further improved by using an onset detector in each band. They are then compared to two of the most reliable methods in the literature using a small, manually annotated database of short sequences of music signals. It is shown that, despite their simplicity, these new approaches are very efficient and outperform the tested methods.
1. INTRODUCTION
The enormous amount of unstructured multimedia data available nowadays, and the spread of its use as a data source in many applications, are introducing new challenges to researchers in information and signal processing. The continuously growing amount of digital multimedia information increases the difficulty of its access and management, thus hampering its practical usefulness. As a consequence, there is a clear need for content-based multimedia data indexing, processing and retrieval techniques.

If multimedia and multimodal approaches represent essential challenges, the more classical approach consisting in building new analysis or indexing technologies for a given medium is still needed to overcome the limitations of today's approaches. For example, in audio, numerous problems remain in extracting high-level descriptors directly from polyphonic music signals. The tempo (or beat) is one of the most important descriptors, since many applications can be derived from the automatic recognition of the rhythmic structure of a music signal:
• automatic rhythmic alignment of multiple instruments, channels or musical pieces (for mixing or karaoke),
• automatic indexing, segmentation and style classification of music databases,
• beat-driven computer graphics (virtual dancers, etc.).

Tempo or beat analysis of musical signals is a domain of research that receives growing interest, as shown by the variety of recent publications [8], [10], [1], [12], [6]. This problem is apparently simple (most people, even without any musical knowledge, have no difficulty finding the beat of a musical performance). However, automatic recognition is more complex, especially for music styles that do not include strong rhythmic patterns (such as classical or jazz music, for example).

While earlier approaches focused on MIDI signals (or simple real audio signals such as purely percussive signals [9]), today's approaches deal directly with polyphonic music. Scheirer [8] proposed a method associating a filter bank with a set of comb filters. Simpler methods were introduced by Seppänen [10] using sound onset detection, and by Tzanetakis [12] in the context of musical genre classification. Another approach was proposed by Goto [2] to infer the hierarchical beat structure. In most of these works, the rhythm (or periodicity) detection is based on a simple inter-onset time detection or on the traditional autocorrelation method.

In this paper, several algorithms for tempo tracking are introduced. These methods are based on the association of a filter bank with robust pitch detection algorithms such as the spectral sum or spectral product. The performances of these algorithms are evaluated against some of the most reliable algorithms in the literature using a small, manually annotated database of short sequences of music signals. The paper is organized as follows: the next section describes our new algorithms, including some minor improvements of the original Scheirer algorithm. Results of these algorithms compared to two methods from the literature are given in section 3. Finally, in section 4 we suggest some conclusions.
2. TEMPO ESTIMATION ALGORITHMS
Most of the algorithms designed to estimate the tempo of musical pieces [8, 7, 3, 12] are based on the same basic steps. In particular, they all process several frequency bands separately, combining the results in the end. According to the experimental results found in [8], this assumption preserves rhythm perception for most music signals, and it has proven efficient for many kinds of frequency band decompositions. These basic steps are described below:

1. subband decomposition of the signal, provided by a filter bank,
2. onset detection in each subband,
3. estimation of the periodicity in each subband,
4. combination of the results to obtain the general tempo.

The differences between the algorithms found in the literature lie in the implementations of those steps. For instance, the well-established Scheirer algorithm uses a six-band IIR filter bank for the first stage. Nevertheless, we also found an eight-band filter bank in [7] and a 21 nearly-critical-band filter bank in [4]. There are also different techniques used to detect the onset times: based on half-wave or full-wave rectification, using the envelope or its squared value, deriving the difference function or the relative difference function, applying thresholding or not. According to these descriptions, the structure of the whole system appears as sketched in the flow diagram of Figure 1.
Fig. 1. Proposed tempo estimation system flow diagram.

Our tempo estimation approach uses the general psychoacoustic simplification proposed by Scheirer [8] and also adopted by Paulus [7]. The input signal to the tempo estimation system is first divided into eight non-overlapping frequency bands using a filter bank of sixth-order Butterworth filters. The lowest band is obtained by low-pass filtering with a cutoff frequency of 100 Hz; the seven higher bands are logarithmically distributed in frequency between 100 Hz and half the sampling frequency (8000 Hz in our experiments), as suggested by Paulus in [7]. We obtain the subband signals $x_k(t)$, where $k = 0,\dots,7$.

Next, extraction of the signal envelope is carried out. This is a fundamental step, aiming at precisely finding the sound onset points provided to the tempo estimation algorithms. This task is accomplished using a system mostly based on the onset detector proposed by Klapuri in [3, 4]; a flow diagram is presented in Fig. 2.
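As an illustration, the following Python sketch shows one way such a subband decomposition could be realized; the function name and the use of scipy are our own assumptions, not details of the original implementation.

```python
import numpy as np
from scipy import signal

def filterbank(x, fs=16000, n_bands=8, f_low=100.0):
    """Split x into a low-pass band plus logarithmically spaced bands."""
    edges = np.geomspace(f_low, fs / 2, n_bands)      # 100 Hz ... Nyquist
    bands = []
    # Lowest band: low-pass with a 100 Hz cutoff.
    b, a = signal.butter(6, edges[0] / (fs / 2), btype="low")
    bands.append(signal.lfilter(b, a, x))
    for k in range(n_bands - 1):
        lo = edges[k] / (fs / 2)
        hi = min(edges[k + 1] / (fs / 2), 0.999)      # keep the edge below Nyquist
        # Note: scipy doubles the design order for band-pass filters.
        b, a = signal.butter(6, [lo, hi], btype="band")
        bands.append(signal.lfilter(b, a, x))
    return bands                                       # x_k(t), k = 0, ..., 7
```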
Fig. 2. Envelope extraction and onset detector flow diagram.

In each subband, the input signal is first half-wave rectified and squared. Then, amplitude envelopes $A_k(t)$ at each frequency channel are calculated by convolving the subband signals with a 100 ms descending half-Hanning window (a linear-phase low-pass filter), and the output of each band is decimated by a factor of 16 in order to reduce the computational burden of the following stages. For the bandwise onset detection we use the first-order relative difference function $W_k(t)$, which gives the amount of change in the signal in relation to its absolute level. This is equivalent to differentiating the logarithm of the amplitude envelope, as given by Eq. (1):

$$W_k(t) = \frac{d}{dt}\left(\log(A_k(t))\right) \qquad (1)$$

The relative difference function is a psychoacoustically relevant measure, since the perceived increase in signal amplitude is relative to its level: the same amount of increase is more prominent in a quiet signal [3, 4]. Hence, we detect onset components by a peak-picking operation, which looks for peaks above a given threshold. The threshold value was found experimentally to be around $1.5\,\sigma_{W_k}$, where $\sigma_{W_k}$ stands for the standard deviation of the signal $W_k(t)$.

Up to this point, two of the proposed tempo estimation systems (spectral and summary autocovariance function) follow the same principle. From here on they are treated separately: one operates in the frequency domain, the other in the time domain.
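The following sketch illustrates the envelope extraction and onset detection chain of Fig. 2 under the stated parameters (half-wave rectification and squaring, 100 ms descending half-Hanning window, decimation by 16, and a $1.5\,\sigma$ threshold on $W_k(t)$); all names are illustrative.

```python
import numpy as np
from scipy import signal

def onset_signal(x_k, fs=16000, decim=16, eps=1e-10):
    """Envelope extraction and onset detection for one subband (Fig. 2)."""
    rect = np.maximum(x_k, 0.0) ** 2                  # half-wave rectify, square
    n = int(0.100 * fs)
    w = np.hanning(2 * n)[n:]                         # 100 ms descending half-Hanning
    env = signal.lfilter(w / w.sum(), [1.0], rect)    # amplitude envelope A_k(t)
    env = env[::decim]                                # decimate by 16
    w_k = np.diff(np.log(env + eps))                  # relative difference, Eq. (1)
    peaks, _ = signal.find_peaks(w_k, height=1.5 * np.std(w_k))
    onsets = np.zeros_like(w_k)
    onsets[peaks] = w_k[peaks]                        # quasi-periodic pulse train
    return onsets
```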
2.1. Spectral methods
Due to their strong relationship, two different spectral tempo estimation methods are presented: the harmonic spectral sum and the harmonic spectral product. Both of these methods come from traditional pitch determination techniques. At the output of the onset detection block, the subband signals have the appearance of a quasi-periodic pulse train. In order to find the bandwise fundamental frequency of such pulse trains, the Fourier transform of the subband signals, $X_k(e^{j\omega_n})$, is calculated. Prior to the FFT calculation, the subband signals are zero-padded to a size $l_x$ given by:

$$l_x = 2^{\lfloor \log_2(\mathrm{length}(x_k(t))) \rfloor + 2} \qquad (2)$$

where $\lfloor \cdot \rfloor$ stands for the integer part.
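In code, Eq. (2) amounts to padding each subband onset signal to a power of two, two octaves above its length (a minimal illustration with stand-in data):

```python
import numpy as np

def padded_length(x_k):
    """FFT size l_x = 2**(floor(log2(len(x_k))) + 2), Eq. (2)."""
    return 2 ** (int(np.floor(np.log2(len(x_k)))) + 2)

x_k = np.random.rand(9000)                    # stand-in for a subband onset signal
X_k = np.fft.rfft(x_k, n=padded_length(x_k))
```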
2.1.1. Spectral Sum
The spectral sum is a reliable pitch determination technique. Its principle lies in the assumption that the power spectrum of the signal is formed of strong harmonics located at integer multiples of the signal's fundamental frequency. For the purpose of finding this frequency, the power spectrum is compressed by a factor $l$, then the obtained spectra are added. In normalized frequency, this is indicated by Eq. (3):

$$S_k(e^{j\omega_n}) = \sum_{l=1}^{M} \left| X_k(e^{jl\omega_n}) \right|^2 \quad \text{for } \omega_n < \frac{\pi}{M} \qquad (3)$$

Consequently, the signal's fundamental frequency is strongly reinforced. In addition, all subband spectral sums are added together, which strengthens the fundamental frequency even more; it appears as an easily detectable prominent peak, as depicted in Fig. 3 for a particular music signal. There, we can clearly see the most salient peak located roughly at a frequency of 1.05 Hz, which corresponds to a beat rate of 63 beats per minute (BPM).
Fig. 3. Spectral sum of a music signal (frequency in Hz vs. normalized magnitude in dB).

In our implementation, $M$ was set to 6 and the fundamental frequency search was carried out in the interval ranging from 5/6 Hz to 5 Hz, which corresponds to a beat rate between 50 and 300 BPM.
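A minimal sketch of the spectral sum of Eq. (3) and of the peak search over the 50-300 BPM range might look as follows; the helper names and the stand-in data are assumptions made for illustration.

```python
import numpy as np

def spectral_sum(X, M=6):
    """Harmonic spectral sum, Eq. (3), restricted to omega_n < pi/M."""
    n = len(X) // M
    P = np.abs(X) ** 2
    idx = np.arange(n)
    return sum(P[l * idx] for l in range(1, M + 1))

# Usage: add the bandwise sums and locate the most salient peak in
# the 5/6 Hz - 5 Hz range (fs_env = 16000/16 = 1000 Hz after decimation).
fs_env, l_x = 1000.0, 2 ** 15
onset_bands = [np.random.rand(8000) for _ in range(8)]      # stand-ins
S = sum(spectral_sum(np.fft.rfft(x, n=l_x)) for x in onset_bands)
f = np.arange(len(S)) * fs_env / l_x                        # Hz per FFT bin
mask = (f >= 50 / 60) & (f <= 300 / 60)
tempo_bpm = 60 * f[mask][np.argmax(S[mask])]
```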
2.1.2. Spectral Product
As briefly mentioned, another method for pitch estimation closely related to the spectral sum is the spectral product. In normalized frequency, it is defined by Eq. (4):

$$S_k(e^{j\omega_n}) = \prod_{l=1}^{M} \left| X_k(e^{jl\omega_n}) \right|^2 \quad \text{for } \omega_n < \frac{\pi}{M} \qquad (4)$$

As in the preceding method, for the spectral product implementation $M$ was set to 6 and the fundamental frequency search was performed within the same frequency interval. Fig. 4 depicts the result for the aforesaid signal. Once again, we can see the most salient peak located at about the same frequency of 1.05 Hz. Note, however, that the spectral product method yields a prominent peak that is much higher relative to the secondary peaks than the one given by the spectral sum method.
Fig. 4. Spectral product of a music signal (frequency in Hz vs. normalized magnitude in dB).
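The product of Eq. (4) differs from the sum only in the combination rule. A sketch follows; computing it in the log domain is a numerical-stability choice of this illustration, not a detail stated in the paper.

```python
import numpy as np

def spectral_product(X, M=6, eps=1e-12):
    """Harmonic spectral product, Eq. (4), computed in the log domain."""
    n = len(X) // M                                   # restrict to omega_n < pi/M
    logP = np.log(np.abs(X[: M * n]) ** 2 + eps)      # log power spectrum
    idx = np.arange(n)
    return np.exp(sum(logP[l * idx] for l in range(1, M + 1)))
```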
2.2. Summary Autocovariance Function
The conception of this method was suggested by the work done in the multipitch detection field by [11]. The bandwise pulse-train-like signals at the output of the sound onset detector are convolved with a 100 ms odd-length Hanning window. Then, the autocovariance function $\Gamma_k(\tau)$ of the subband signals is computed, as given by Eq. (5):

$$\Gamma_k(\tau) = \sum_{t} \left[ x_k(t+\tau) - \bar{x}_k \right]\left[ x_k(t) - \bar{x}_k \right] \qquad (5)$$

Then, the bandwise autocovariance functions are added together to form the summary autocovariance function, $\mathrm{SACVF}(\tau)$, obtained by Eq. (6):

$$\mathrm{SACVF}(\tau) = \sum_{k=0}^{7} \Gamma_k(\tau) \qquad (6)$$
As in the case of the spectral methods, we are only interested in detecting the periodicities of the subband signals $x_k(t)$, which correspond to the most salient peaks in $\mathrm{SACVF}(\tau)$. In addition, we are only concerned with finding a beat rate between 50 and 300 BPM; thus, the time lag $\tau$ varies only within the range of 200 to 1250 ms. For our SACVF implementation, this result is shown in Fig. 5, where the aforesaid signal has a dominant periodicity, depicted by the most salient peak, at a lag of approximately 950 ms, which nearly corresponds to a fundamental frequency of 1.05 Hz.
Fig. 5. Summary autocovariance function of a music signal (time lag in ms vs. normalized magnitude).

Finally, to ensure a better tempo extraction, the three most salient peaks are detected and a relation of multiplicity between them is searched for via a simple numerical algorithm. For instance, in Fig. 5 the first, second and fourth peaks are the most prominent, and they bear a strong relationship to each other: the second peak is located at a lag practically twice that of the first one, and the fourth peak is located at a lag nearly four times that of the first one. This allows a more robust decision. If no relation of multiplicity is found among the detected peaks, the most salient one is taken as the right tempo.
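The following sketch summarizes the method under the stated assumptions (100 ms odd-length Hanning smoothing, Eqs. (5)-(6), lags restricted to 200-1250 ms); the function name and envelope rate are illustrative.

```python
import numpy as np

def summary_acvf(onset_bands, fs_env=1000.0):
    """SACVF(tau) over the 200-1250 ms lag range, Eqs. (5)-(6)."""
    lags = np.arange(int(0.200 * fs_env), int(1.250 * fs_env))
    win = np.hanning(int(0.100 * fs_env) | 1)         # 100 ms odd-length Hanning
    total = np.zeros(len(lags))
    for x in onset_bands:
        x = np.convolve(x, win, mode="same")          # smooth the pulse train
        x = x - x.mean()                              # center, as in Eq. (5)
        acv = np.correlate(x, x, mode="full")[len(x) - 1:]
        total += acv[lags]                            # Gamma_k(tau), summed over k
    return lags / fs_env, total                       # lag in seconds, SACVF
```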
2.3. Bank of resonators
An alternative method dedicated to the determination of the tempo is given in [8] and [5]. In order to estimate the onset periodicity (pulse) in each subband, a so-called bank of resonators is used. These resonators are simply oversampled versions of first-order autoregressive filters. For instance, the $k$th filter has a z-transfer function of the form:

$$H_k(z) = \frac{\beta_k}{1 - \alpha_k z^{-T_k}} \qquad (7)$$

The principle of the method is the following. The impulse response of this filter is zero except at time indices that are multiples of the oversampling factor $T_k$. When the filter receives a periodic pulse train of period $T_k$, the output samples accumulate and the output level increases.

The factors $T_k$ are set as integer numbers of samples in order to cover the whole range of analysed tempi, from $T_1$ to $T_M$. According to the principle described above, the $\beta_k$ and $\alpha_k$ coefficients are chosen constant across all filters in order to be able to compare the nonzero outputs. For a $T_k$-periodic pulse train, the transient time of the $k$th filter is thus of the order of

$$\tau_k = \frac{-3 T_k}{\ln \alpha}$$
In our implementation, $\alpha$ is set to ensure that the maximum transient time $\tau_M$ is far less than the length of the analysis window. The tempo is determined following these main steps:

1. computation of the responses $y_k(t)$ of each filter to the centered amplitude envelopes,
2. estimation of the mean power $\sigma_k^2$ of $y_k$,
3. extraction of the time indices $t_i$, $i = 1,\dots,N_k$, such that $y_k(t_i) > 3\sigma_k$, and computation of the mean power $P_k = \frac{1}{N_k}\sum_{i=1}^{N_k} y_k(t_i)^2$,
4. extraction of the tempo corresponding to the factor $T_u$ with $u = \arg\max_k P_k$.
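A minimal sketch of this resonator bank, following Eq. (7) and steps 1-4 above; the values of alpha and beta and the helper names are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy import signal

def resonator_tempo(env, fs_env=1000.0, bpm_min=50, bpm_max=300,
                    alpha=0.9, beta=1.0):
    """Comb-filter (resonator) bank tempo detector, Eq. (7) and steps 1-4."""
    env = env - env.mean()                            # centered amplitude envelope
    periods = np.arange(int(fs_env * 60 / bpm_max),   # T_1 ... T_M in samples
                        int(fs_env * 60 / bpm_min) + 1)
    P = np.zeros(len(periods))
    for i, T in enumerate(periods):
        a = np.zeros(T + 1)
        a[0], a[T] = 1.0, -alpha                      # denominator 1 - alpha z^{-T}
        y = signal.lfilter([beta], a, env)            # step 1: filter response y_k
        sigma = np.sqrt(np.mean(y ** 2))              # step 2: mean power
        big = y[y > 3.0 * sigma]                      # step 3: samples above 3 sigma
        if big.size:
            P[i] = np.mean(big ** 2)                  # mean power P_k over those
    T_u = periods[np.argmax(P)]                       # step 4: arg max_k P_k
    return 60.0 * fs_env / T_u                        # tempo in BPM
```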
3. SIMULATION RESULTS
3.1. Sound database
The database used for evaluation consists of 55 short segments of musical signals (each 10 seconds long). The short musical excerpts were chosen to represent different styles: classical music (23 % of the database), rock or modern pop music (33 %), traditional songs (12 %), Latin/Cuban music (12 %) and jazz (20 %). All signals are sampled at 16 kHz. This sound database has been manually annotated by skilled musicians. The procedure for manually estimating the tempo is the following:

• the musician listens to a musical segment using headphones (sometimes several times in a row, to become accustomed to the tempo),
• while listening, he/she taps the tempo,
• the tapping signal is recorded and the tempo is automatically extracted from it,
• all tempi are finally checked manually; however, due to the impulsive nature of the tapping signals, no errors were found after the automatic extraction.
Method               Correct estimation (%)
Scheirer                       76
Scheirer modified              85
Autocovariance                 87
Paulus                         74
Spectral sum                   87
Spectral product               89

Table 1. Performance obtained for several tempo tracking algorithms.
3.2. Results
This section gives the results of several algorithms on the database described in the previous section. Despite the limited size of our database, this test gives a good indication of the performance of each algorithm. The estimation provided by an algorithm is labelled correct when it differs by less than 5 % from the original tempo, without counting doubling or halving as errors. An estimated tempo $T_e$ is thus labelled correct if:

$$0.95\,\alpha T < T_e < 1.05\,\alpha T \qquad (8)$$

where $\alpha = 0.5$, $1$ or $2$, and where $T$ is the valid (manually measured) tempo.

Table 1 gives the results obtained for six algorithms: the original Scheirer algorithm [8], a modified version of it based on a new resonator bank (see section 2.3), the approach introduced by Paulus [7], the autocovariance method, the spectral sum and the spectral product methods. While there are no significant differences between the four best methods, the approaches introduced in this paper outperform the classical approach of Scheirer and the recently introduced method by Paulus. It is nevertheless important to note that the database used is small, and complementary tests on an extended database will be needed to confirm these initial results.
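For reference, the correctness criterion of Eq. (8) can be expressed as a one-line check (a sketch; the function name is ours):

```python
def is_correct(T_est, T_ref, alphas=(0.5, 1.0, 2.0)):
    """Eq. (8): within 5 % of the annotated tempo, its half or its double."""
    return any(0.95 * a * T_ref < T_est < 1.05 * a * T_ref for a in alphas)
```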
4. CONCLUSION
This paper has proposed several new algorithms for tempo tracking and has compared them to a number of methods described in the literature. Although the dataset used for evaluation is rather limited, the methods introduced are seen to be very accurate. Future work will include an evaluation of these algorithms on a larger dataset, the extension of our best algorithm to real-time tempo tracking, and its adaptation to small analysis windows. Finally, retrieval experiments will be conducted based solely on the rhythmic pattern.
5. REFERENCES
[1] S.E. Dixon. A beat tracking system for audio signals. Austrian Research Institute for Artificial Intelligence, Vienna, pages 311-320, Apr. 2000.
[2] M. Goto and Y. Muraoka. An audio-based real-time beat tracking system and its applications. In Proceedings of the International Computer Music Conference, 1998.
[3] A. Klapuri. Automatic transcription of music. Master's thesis, Department of Information Technology, Tampere University of Technology, 1998.
[4] A. Klapuri. Sound onset detection by applying psychoacoustic knowledge. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 3089-3092, 1999.
[5] E.W. Large and J.F. Kolen. Resonance and the perception of musical meter. Connection Science, 6:177-208, 1994.
[6] J. Laroche. Estimating tempo, swing and beat locations in audio recordings. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 131-135, 2001.
[7] J. Paulus and A. Klapuri. Measuring the similarity of rhythmic patterns. In 3rd International Conference on Music Information Retrieval (ISMIR), 2002.
[8] E.D. Scheirer. Tempo and beat analysis of acoustic musical signals. J. Acoust. Soc. Am., 103(1):588-601, Jan. 1998.
[9] A. Schloss. On the automatic transcription of percussive music. PhD thesis, Stanford University, 1985.
[10] J. Seppänen. Tatum grid analysis of musical signals. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, pages 21-24, Oct. 2001.
[11] T. Tolonen and M. Karjalainen. A computationally efficient multipitch analysis model. IEEE Trans. Speech and Audio Processing, 8(6):708-716, Nov. 2000.
[12] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Trans. Speech and Audio Processing, 10(5):293-301, Jul. 2002.