This is the repo for the paper titled "Local Periodicity-Based Beat Tracking for Expressive Classical Piano Music", accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing.
As reported in our ISMIR late-breaking demo paper, "What Can Go Wrong When Conducting Beat Tracking Experiments" [4], there are underlying technical issues with the madmom library [3] that, in combination with the inconsistent audio file formats (i.e., data types and sampling rates) in the ASAP dataset [1, 2], may largely influence the evaluation results. We report here the updated results after fixing these technical issues. Note that the differences are confined to the real-activation experiments on ASAP. For the Maz-5 dataset, which has a consistent audio file format, all results remain the same. Also note that the main message of the paper is unchanged: compared to the commonly adopted HMM-based post-processing trackers (PPTs), PLPDP particularly enhances recall at the cost of lower precision, resulting in an overall improvement in F1-score.
To model the periodicity of beats, state-of-the-art beat tracking systems use “post-processing trackers” (PPTs) that rely on several empirically determined global assumptions for tempo transition, which work well for music with steady tempo. For expressive classical music, however, these assumptions can be too rigid. With two large datasets of Western classical piano music, namely the Aligned Scores and Performances (ASAP) dataset and a dataset of Chopin’s Mazurkas (Maz-5), we report on experiments showing the failure of existing PPTs to cope with local tempo changes, thus calling for new methods. In this paper, we propose a new local periodicity-based PPT, called predominant local pulse-based dynamic programming (PLPDP) tracking, that allows for more flexible tempo transitions. Specifically, the new PPT incorporates a method called “predominant local pulses” (PLP) in combination with a dynamic programming (DP) component to jointly consider the locally detected periodicity and beat activation strength at each time instant. Accordingly, PLPDP accounts for the local periodicity, rather than relying on a global tempo assumption. Compared to existing PPTs, PLPDP particularly enhances the recall values at the cost of a lower precision, resulting in an overall improvement of F1-score for beat tracking in ASAP (from 0.473 to 0.493) and Maz-5 (from 0.595 to 0.838).
In this repo we include one recording from the ASAP dataset [1, 2] as an example to demonstrate the usage of the inference code. Users may run the inference code directly to see how it generates both estimated beats and qualitative results using the post-processing trackers (i.e., SPPK, DP, HMM, HMMT0, and PLPDP) considered in the paper.
The following figure shows the beat tracking F1-score of single kernel PLPDP using kernel sizes from 1-20 seconds. The F1-scores of combined kernel PLPDP (i.e.,
Finally, as illustrated in our paper Section III-B, despite that
Grid search experiments are conducted to investigate the performance of HMM with the tempo transition lambda ranging from 0 to 100 in steps of 5. The following figure shows the results for the real datasets. The results of PLPDP are also plotted as horizontal dashed lines for comparison.
We can see that despite that HMMs with lambda
On the other hand, the Figure below shows the grid search results of synthetic datasets using different tempo transition lambdas. It can be seen that in synthetic experiments, for both ASAP and Maz-5, lambda
We summarize the main ideas as follows:
- The results of Figures 3 and 4 reveal the remarkable influence of the tempo transition lambda on the performance of HMM for expressive music. They also validate PLPDP’s superiority over HMMs on both datasets when using real activation functions.
- The different lambda preferences of ASAP and Maz-5 in the real-activation experiments suggest that adjusting lambda to the characteristics of the data may be impractical in real use cases. Because the characteristics of expressive classical music can vary dramatically, the optimal lambda for a dataset may still be problematic for individual pieces.
- Again, note that a similar “global” vs. “local” dilemma also appears within individual expressive pieces. It is common for an expressive piece to have regions with stable tempo and regions with dramatic tempo changes, so a “best lambda” for the whole piece may be problematic for specific regions.
- The synthetic experiments also reveal the impracticality of adjusting lambda based on real activations. Real ASAP prefers lambda = 90, which is dramatically different from what synthetic ASAP prefers (i.e., lambda = 5). This shows that the interaction between imperfect beat activations and limited tempo-related assumptions makes error analysis and parameter tuning difficult.
- This again shows the value of synthetic experiments. The “perfect” synthetic activations allow us to exclude the effects of imperfect beat activations and to see the limitations of the post-processing trackers (PPTs) more clearly.
- Because the PPTs are known to rely on different assumptions (i.e., “local periodicity” vs. “global tempo transition settings”), and because our goal in this work is to investigate the limitations of PPTs rather than to compare them after parameter optimization, we decided to keep the original baselines (i.e., HMMT0, HMM) and to provide these grid search discussions in our repository as a supplement.
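To give some intuition for why lambda matters so much, the sketch below assumes the exponential tempo-transition model commonly used in bar-pointer HMM beat trackers, where the weight of changing the beat interval is `exp(-lambda * |new / old - 1|)`. This form is an assumption for illustration and is not taken from this repository; the function name `tempo_transition_probs` and the candidate intervals are hypothetical.

```python
import math


def tempo_transition_probs(from_interval, to_intervals, transition_lambda):
    """Normalized probabilities of moving from one beat interval to each candidate.

    Assumed model: weight = exp(-lambda * |to / from - 1|), then normalized.
    """
    weights = [
        math.exp(-transition_lambda * abs(to / from_interval - 1.0))
        for to in to_intervals
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical beat intervals in frames: unchanged, 10% longer, 20% longer.
candidates = [50, 55, 60]
for lam in (0, 5, 90):
    probs = tempo_transition_probs(50, candidates, lam)
    print(lam, [round(p, 4) for p in probs])
```

Under this assumed model, lambda = 0 treats all tempo changes as equally likely, while a large lambda concentrates almost all probability on keeping the tempo unchanged, which is one way to read the "global" assumption discussed above.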
Because conventional evaluation metrics generally assume a fixed relation between estimated beats and reference beats (e.g., double tempo or onbeat throughout a whole sequence), they cannot reflect the "metric-level switching" behavior of PPTs for expressive music. We have recently proposed an analysis method to complement existing metrics [3]. Briefly, the proposed method, the annotation coverage ratio (ACR), calculates for each musical piece how the reference beats are "covered" (i.e., detected) by the estimated beats. For example, if the estimated beats switch to double the tempo of the reference beats for half of the time, an existing metric such as AMLt would give a score of 0.5, whereas ACR reveals 0.5 for onbeat, 0.5 for double tempo, and 1.0 for any tempo.
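The double-tempo example above can be reproduced with a deliberately simplified coverage computation. This is only an illustration of the idea and not the ACR definition from [3]: it handles only onbeat and faster harmonic levels (not half tempo or offbeat variants), and the function names, the labels, and the 70 ms tolerance are assumptions.

```python
def is_covered(ref_beat, est_beats, tol):
    """True if some estimated beat lies within `tol` seconds of the reference beat."""
    return any(abs(ref_beat - e) <= tol for e in est_beats)


def coverage_ratios(ref_beats, est_beats, tol=0.07):
    """Share of reference inter-beat intervals covered at each metric level."""
    labels = {0: "onbeat", 1: "double", 2: "triple"}
    counts = {"onbeat": 0, "double": 0, "triple": 0, "other": 0}
    n_intervals = len(ref_beats) - 1
    covered = 0
    for start, end in zip(ref_beats[:-1], ref_beats[1:]):
        # An interval counts only if both of its reference beats are detected.
        if not (is_covered(start, est_beats, tol) and is_covered(end, est_beats, tol)):
            continue
        covered += 1
        # Extra estimated beats strictly inside the interval indicate a faster level.
        inside = sum(1 for e in est_beats if start + tol < e < end - tol)
        counts[labels.get(inside, "other")] += 1
    ratios = {level: n / n_intervals for level, n in counts.items()}
    ratios["any"] = covered / n_intervals
    return ratios


# Reference beats every 0.5 s; estimates are onbeat until 2.0 s, then double tempo.
ref = [0.5 * i for i in range(9)]
est = [0.5 * i for i in range(5)] + [2.0 + 0.25 * i for i in range(1, 9)]
print(coverage_ratios(ref, est))
```

For this toy input the sketch assigns half of the intervals to onbeat and half to double tempo, with every interval covered at some level.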
From Table 1, we can see potential inconsistencies between conventional metrics. For example, for Maz-5, HMMT0 achieves a higher F1-score (0.595) than DP does (0.488), while DP achieves much higher CMLt and AMLt than HMMT0. With ACR, the inconsistency becomes explainable. As HMMT0 switches between subharmonic tempi (i.e., half, third, quarter), its CMLt and AMLt scores can only be low. However, due to its high precision, HMMT0 still gets a high F1-score. It can also be observed that none of the existing PPTs really learns to correctly determine the local tempo of expressive musical pieces, and that different PPTs behave differently based on their assumptions. The ACR results explain the high recall of PLPDP and the high precision of HMMs. As PLPDP relies on local periodicity calculated from locally windowed activation peaks, for the ASAP dataset, with its large number of non-beat activation peaks, PLPDP is prone to detecting faster harmonic tempi, which results in high recall and low precision. On the other hand, HMMs tend to tap slower on both datasets and therefore achieve higher precision and lower recall.
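The link between metric level and precision/recall can be checked with a small tolerance-window F-measure. This is a minimal sketch and not the evaluation code used for the reported results; the function name, the greedy one-to-one matching, and the 70 ms tolerance are assumptions.

```python
def beat_f_measure(ref_beats, est_beats, tol=0.07):
    """Precision, recall and F1 with one-to-one matching inside a +/- tol window."""
    unmatched = list(est_beats)
    hits = 0
    for ref in ref_beats:
        in_window = [e for e in unmatched if abs(e - ref) <= tol]
        if in_window:
            # Each estimated beat may match at most one reference beat.
            unmatched.remove(min(in_window, key=lambda e: abs(e - ref)))
            hits += 1
    precision = hits / len(est_beats) if est_beats else 0.0
    recall = hits / len(ref_beats) if ref_beats else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


ref = [0.5 * i for i in range(9)]            # reference beats every 0.5 s
double_tempo = [0.25 * i for i in range(17)] # tapping twice as fast
half_tempo = [1.0 * i for i in range(5)]     # tapping half as fast

print("double:", beat_f_measure(ref, double_tempo))
print("half:  ", beat_f_measure(ref, half_tempo))
```

In this toy case the double-tempo estimate reaches a recall of 1.0 with a precision of 9/17, while the half-tempo estimate reaches a precision of 1.0 with a recall of 5/9, mirroring the behaviors of PLPDP and the HMMs described above.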
[1] F. Foscarin, A. McLeod, P. Rigaux, F. Jacquemard, and M. Sakai, “ASAP: A dataset of aligned scores and performances for piano transcription,” in Proc. Int. Soc. Music Inf. Retr. Conf., 2020, pp. 53.
[2] https://github.com/fosfrancesco/asap-dataset
[3] C. Y. Chiu, M. Müller, M. E. P. Davies, A. W. Y. Su and Y. H. Yang, "An Analysis Method for Metric-Level Switching in Beat Tracking," in IEEE Signal Processing Letters, vol. 29, pp. 2153-2157, 2022.