Multimodal image and audio music transcription

Carlos de la Fuente (cdlf4@alu.ua.es) · Jose J. Valero-Mas (jjvalero@dlsi.ua.es) · Francisco J. Castellanos (fcastellanos@dlsi.ua.es) · Jorge Calvo-Zaragoza (jcalvo@dlsi.ua.es)
Department of Software and Computing Systems, University of Alicante, Alicante, Spain
International Journal of Multimedia Information Retrieval (2022) 11:77–84 · https://doi.org/10.1007/s13735-021-00221-6

Abstract

Optical Music Recognition (OMR) and Automatic Music Transcription (AMT) stand for the research fields that aim at obtaining a structured digital representation from sheet music images and acoustic recordings, respectively. While these fields have traditionally evolved independently, the fact that both tasks may share the same output representation poses the question of whether they could be combined in a synergistic manner to exploit the individual transcription advantages depicted by each modality. To evaluate this hypothesis, this paper presents a multimodal framework that combines the predictions from two neural end-to-end OMR and AMT systems by considering a local alignment approach. We assess several experimental scenarios with monophonic music pieces to evaluate our approach under different conditions of the individual transcription systems. In general, the multimodal framework clearly outperforms the single recognition modalities, attaining a relative improvement close to 40% in the best case. Our initial premise is, therefore, validated, thus opening avenues for further research in multimodal OMR-AMT transcription.

Keywords: Multimodal recognition · Automatic music transcription · Optical music recognition · Deep learning

1 Introduction

Bringing music sources into a structured digital representation, typically known as transcription, remains one of the key, yet challenging, tasks in the Music Information Retrieval (MIR) field [17,21]. Such digitization not only improves music heritage preservation and dissemination [11], but it also enables the use of computer-based tools which allow indexing, analysis, and retrieval, among many other tasks [20].

In this context, two particular research lines stand out within the MIR community: on the one hand, when tackling music score images, the field of Optical Music Recognition (OMR) investigates how to computationally read these documents and store their music information in a symbolic format [3]; on the other hand, when considering acoustic music signals, Automatic Music Transcription (AMT) represents the field devoted to the research on computational methods for transcribing them into some form of structured digital music notation [1]. It must be remarked that, despite pursuing the same goal, these two fields have been developed separately due to the different nature of the source data.

Multimodal recognition frameworks, understood as those which take as input multiple representations or modalities of the same piece of data, have proved to generally achieve better results than their respective single-modality systems [25]. In such schemes, it is assumed that the different modalities provide complementary information to the system, which eventually results in an enhancement of the overall recognition performance. Such approaches are generally classified in one of two fashions [7]: (i) those in which the individual features of the modalities are directly merged, with the constraint of requiring the input elements to be synchronized to some extent (feature or early-fusion level); or (ii) those in which the merging process is done with the hypotheses obtained by each individual modality, thus not requiring both systems to be synchronized (decision or late-fusion level).
Regarding the MIR field, this premise has also been explored in particular cases such as music recommendation, artist identification or instrument classification, among others [22]. Music transcription is no stranger to this trend and has also contemplated the use of multimodality as a means of overcoming the glass ceiling reached by single-modality approaches. For instance, research on AMT has considered the use of additional sources of information such as onset events, harmonic information, or timbre [2]. Nevertheless, to the best of our knowledge, no existing work has considered that a given score image and its acoustic performance may be regarded as two different modalities of the same piece to be transcribed. Under this premise, transcription results may be enhanced if the individual, and somehow complementary, descriptions by the OMR and AMT systems are adequately combined.

While this idea might have been discussed in the past, we consider that classical formulations of both OMR and AMT frameworks did not allow exploring a multimodal approach. However, recent developments in these fields define both tasks in terms of a sequence labeling problem [10], thus enabling research on the combined paradigm. Note that, when addressing transcription tasks within this formulation, the input data (either image or audio) is directly decoded into a sequence of music-notation symbols, typically by means of neural end-to-end systems [4,19].

One could argue whether it is practical, or even realistic, to have both the acoustic and image representations of the piece to be transcribed. We assume, however, that for a music practitioner it would be, at least, more appealing to play a composition reading a music sheet rather than manually transcribing it. Note that we find the same scenario in the field of Handwritten Text Recognition, where producing an utterance out of a written text and using a speech recognition system for then fusing the decisions required less effort than manually transcribing the text or correcting the errors produced by the text recognition system [8].

This work explores whether the transcription results of a multimodal combination of sheet scores and acoustic performances of music pieces improve those of the stand-alone modalities. For that, we propose a decision-level fusion policy based on the combination of the most probable symbol sequences depicted by two end-to-end OMR and AMT systems. The experiments have been performed with a corpus of monophonic music considering multiple scenarios which differ in the manner the individual transcription systems are trained, hence allowing a thorough analysis of the proposal. The results obtained prove that the combined approach improves the transcription capabilities with respect to single-modality systems in cases in which their individual performances do not remarkably differ. This fact validates our initial premise and poses new research questions to be addressed and explored.

The rest of the paper is structured as follows: Sect. 2 contextualizes the work within the related literature; Sect. 3 describes our multimodal framework; Sect. 4 presents the experimental set-up considered as well as the results and discussion; finally, Sect. 5 concludes the work and poses future research.

2 Related work

While multimodal transcription approaches based on the combination of OMR and AMT have not yet been explored in the MIR field, we may find some research examples in the related areas of Text Recognition (TR) and Automatic Speech Recognition (ASR). It must be noted that the multimodal fusion in these cases is also carried out at the decision level, keeping the commented advantage of not requiring multimodal training data for the underlying models.

One of the first examples in this regard is the proposal by Singh et al. [23], in which TR and ASR were fused in the context of postal code recognition using a heuristic approach based on the Edit distance [14]. More recent approaches related to handwritten manuscripts have resorted to probabilistic frameworks for merging the individual hypotheses of the systems, such as those using confusion networks [8] or word-graph hypothesis spaces [9].
It is worth noting that this type of multimodality may also be found in other fields such as Gesture Recognition (GR). For instance, the work by Pitsikalis et al. [16] improves the recognition rate by re-scoring the different hypotheses of the GR model with information from an ASR system. Within this same context, other works have explored the alignment of different hypotheses using Dynamic Programming approaches [15] or, again, a confusion network framework [13].

In this work, we tackle this multimodal music transcription problem considering the alignment, at a sequence level, of the individual hypotheses depicted by stand-alone end-to-end OMR and AMT systems. As will be shown, when adequately configured, this approach is capable of successfully improving the recognition rate of the single-modality transcription systems.

3 Methodology

We consider two neural end-to-end transcription systems as the base OMR and AMT methods for validating our fusion proposal. As commented, these particular approaches are chosen because they allow a common formulation of the individual modalities, thus facilitating the definition of a fusion policy. Note that, in this case, the combination policy works at a decision, or sequence, level, as can be observed in Fig. 1. To properly describe these design principles, we shall introduce some notation.

Fig. 1 Graphical description of the scheme proposed: for a given music piece, a score image x^i and an audio signal (as a CQT spectrogram) x^a are provided to the OMR and AMT systems, retrieving sequences ẑ^i_m and ẑ^a_m, respectively. The multimodal fusion policy eventually produces the sequence ẑ_m.

Let T = {(x_m, z_m) : x_m ∈ X, z_m ∈ Z}_{m=1}^{|T|} represent a set of data where sample x_m, drawn from space X, corresponds to symbol sequence z_m = (z_{m1}, ..., z_{mN}) from space Z, considering the underlying function g : X → Z. Note that the latter space is defined as Z = Σ*, where Σ represents the score-level symbol vocabulary.

Since we are dealing with two sources of information, we have different representation spaces X^i and X^a, with vocabularies Σ^i and Σ^a related to the image scores and audio signals, respectively. While not strictly necessary, for simplicity we constrain both systems to consider the same vocabulary, i.e., Σ^i = Σ^a. Also note that, for a given m-th element, while the staff x^i_m ∈ X^i and audio x^a_m ∈ X^a signals depict a different origin, the target sequence z_m ∈ Z is deemed to be the same.

3.1 Neural end-to-end base recognition systems

Concerning the recognition architectures, we consider a Convolutional Recurrent Neural Network (CRNN) scheme to approximate g(·). Recent works have applied this approach to both OMR [5,6] and AMT [18,19] transcription systems with remarkably successful results. Hence, we shall resort to these works to define our baseline single-modality transcription architectures within the multimodal framework.

More in depth, a CRNN architecture is formed by an initial block of convolutional layers devised to learn the adequate features for the task at issue, followed by another group of recurrent layers that model their temporal dependencies. To achieve an end-to-end system with such an architecture, CRNN models are trained using the Connectionist Temporal Classification (CTC) algorithm [10]. In a practical sense, this algorithm only requires the different input signals and their associated transcripts as sequences of symbols, without any specific input-output alignment at a finer level. Note that CTC requires the inclusion of an additional "blank" symbol within the Σ vocabulary, i.e., Σ' = Σ ∪ {blank}, due to its training procedure.

Since CTC assumes that the architecture contains a fully-connected layer of |Σ'| outputs with a softmax activation, the actual output is a posteriogram with a number of frames given by the recurrent stage and |Σ'| activations each. Most commonly, the final prediction is obtained out of this posteriogram using a greedy approach which retrieves the most probable symbol per step, followed by a squash function which merges consecutive repeated symbols and removes the blank label. In our case, we slightly modify this decoding approach to allow the multimodal fusion of both sources of information.
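To make the decoding step concrete, the following is a minimal Python/NumPy sketch of the greedy decoding and squash procedure just described, extended with the per-symbol probability averaging that the fusion policy of Sect. 3.2 relies on. The posteriogram layout (frames by vocabulary entries) and the convention that the blank symbol occupies index 0 are assumptions of this sketch, not details stated in the paper.

```python
import numpy as np

BLANK = 0  # assumed index of the CTC "blank" symbol in the vocabulary


def greedy_ctc_decode(posteriogram: np.ndarray):
    """Greedy CTC decoding with per-symbol probabilities.

    `posteriogram` is a (frames, |vocabulary|) array of per-frame symbol
    probabilities. The most probable symbol is taken per frame, runs of
    repeated symbols are squashed (their frame probabilities averaged),
    and blank symbols are finally removed.
    """
    frame_syms = posteriogram.argmax(axis=1)
    frame_probs = posteriogram.max(axis=1)

    symbols, probs = [], []
    prev, run = None, []
    for sym, p in zip(frame_syms, frame_probs):
        if sym != prev and run:          # a run of repeated symbols ended
            if prev != BLANK:
                symbols.append(int(prev))
                probs.append(float(np.mean(run)))
            run = []
        run.append(p)
        prev = sym
    if run and prev != BLANK:            # flush the last run
        symbols.append(int(prev))
        probs.append(float(np.mean(run)))
    return symbols, probs
```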
3.2 Multimodal fusion policy

The proposed policy takes as its starting point the posteriograms of the two recognition modalities, OMR and AMT. A greedy decoding policy is applied to each posteriogram for obtaining its most probable symbol per frame together with the per-symbol probabilities.

After that, the CTC squash function merges consecutive symbols for each modality, with the particularity of deriving the per-symbol probability by averaging the individual probability values of the merged symbols. For example, when any of the models obtains a sequence in which the same symbol is predicted for 4 consecutive frames, the algorithm combines them and computes the average probability of the involved frames. After that, the blank symbols estimated by CTC are also removed, retrieving predictions ẑ^i_m and ẑ^a_m, which correspond to the image and audio recognition models, respectively.

Since sequences ẑ^i_m and ẑ^a_m may not match in terms of length, it is necessary to align both estimations before merging them. Hence, we consider the Smith-Waterman (SW) local alignment algorithm [24], which performs a search for the most similar regions between pairs of sequences.

Eventually, the final estimation ẑ_m is obtained from these two aligned sequences following these premises: (i) if both sequences match on a token, it is included in the resulting estimation; (ii) if the sequences disagree on a token, the one with the highest probability is included in the estimation; (iii) if one of the sequences misses a symbol, that of the other sequence is included in the estimation.
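As an illustration of the fusion policy, the sketch below aligns the two decoded hypotheses and applies premises (i)-(iii). For brevity, a plain global dynamic-programming alignment is used here as a simplified stand-in for the Smith-Waterman local alignment employed in the paper; each hypothesis is assumed to be a list of (symbol, probability) pairs such as those produced by the decoding step above.

```python
from typing import List, Tuple

Sym = Tuple[str, float]  # (symbol, per-symbol probability)


def align(a: List[Sym], b: List[Sym]):
    """Edit-distance alignment of two decoded sequences.

    Returns aligned pairs; a missing symbol is reported as None. This
    global alignment is a simplified stand-in for Smith-Waterman.
    """
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1][0] == b[j - 1][0] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    pairs, i, j = [], n, m  # backtrace into aligned pairs
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1][0] != b[j - 1][0]):
            pairs.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((a[i - 1], None)); i -= 1
        else:
            pairs.append((None, b[j - 1])); j -= 1
    return list(reversed(pairs))


def fuse(omr: List[Sym], amt: List[Sym]) -> List[str]:
    """Merge the two aligned hypotheses following premises (i)-(iii)."""
    fused = []
    for o, a in align(omr, amt):
        if o is None:              # (iii) symbol missing in the OMR sequence
            fused.append(a[0])
        elif a is None:            # (iii) symbol missing in the AMT sequence
            fused.append(o[0])
        elif o[0] == a[0]:         # (i) both modalities agree on the token
            fused.append(o[0])
        else:                      # (ii) disagreement: keep the most probable token
            fused.append(o[0] if o[1] >= a[1] else a[0])
    return fused
```

Note that the actual Smith-Waterman procedure additionally rewards locally matching regions rather than scoring one full-length alignment; the stand-in above is only meant to make premises (i)-(iii) explicit.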
4 Experiments

Having defined the individual recognition systems as well as the multimodal fusion proposal, this section presents the experimental part of the work. For that, we introduce the CRNN schemes considered for OMR and AMT, we describe the corpus and metrics for the evaluation, and finally we present and discuss the results obtained. As previously stated, the combination of OMR and AMT has not been previously addressed in the MIR field. Hence, the experimental section of the work focuses on comparing the performance of the multimodal approach against that of the individual transcription models, given that no other results can be reported from the literature.

4.1 CRNN models

The different CRNN topologies considered for both the OMR and the AMT systems are described in Table 1. These configurations are based on those used by recent works addressing the individual OMR and AMT tasks as a sequence labeling problem with deep neural networks [4,19]. It is important to highlight that these architectures can be considered as the state of the art in the aforementioned transcription tasks, thus being good representatives of the attainable performance in each of the baseline cases. Note that, as aforementioned, the last recurrent layer of the schemes is connected to a dense unit with |Σ^i| + 1 = |Σ^a| + 1 output neurons and a softmax activation.

These architectures were trained using the backpropagation method driven by CTC for 115 epochs using the ADAM optimizer [12]. The batch size was fixed to 16 for the OMR system, while for the AMT one it was set to 1 because of being more memory-intensive.

Table 1 CRNN configurations considered

OMR model:
  Layers 1-4 (convolutional): Conv(64, 5×5), Conv(64, 5×5), Conv(128, 3×3), Conv(128, 3×3); each followed by BatchNorm, LeakyReLU(0.20), and MaxPool (2×2 for the first layer, 1×2 for the remaining ones).
  Layers 5-6 (recurrent): BLSTM(256) with Dropout(0.50) each.

AMT model:
  Layers 1-2 (convolutional): Conv(8, 2×10) and Conv(8, 5×8); each followed by BatchNorm, LeakyReLU(0.20), and MaxPool(1×2).
  Layers 3-4 (recurrent): BLSTM(256) with Dropout(0.50) each.

Notation: Conv(f, w×h) stands for a convolution layer of f filters of size w×h pixels, BatchNorm performs the normalization of the batch, LeakyReLU(α) represents a leaky rectified linear unit activation with negative slope value α, MaxPool(w_p×h_p) stands for the max-pooling operator of dimensions w_p×h_p pixels, BLSTM(n) denotes a bidirectional long short-term memory unit with n neurons, and Dropout(d) performs the dropout operation with probability d.
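As a concrete reading of Table 1, the following PyTorch sketch builds the OMR CRNN (the AMT one is analogous, with its own convolutional block). The layer hyper-parameters follow the table; the single input channel, the zero-padding, the (height, width) reading of the pooling sizes, and the reshaping before the recurrent block are assumptions rather than details stated in the paper.

```python
import torch
import torch.nn as nn


class OMRCRNN(nn.Module):
    """Sketch of the OMR CRNN of Table 1 (the AMT model is analogous)."""

    def __init__(self, vocab_size: int, img_height: int = 64):
        super().__init__()
        specs = [  # (filters, kernel, pool) per convolutional layer
            (64, (5, 5), (2, 2)),
            (64, (5, 5), (2, 1)),
            (128, (3, 3), (2, 1)),
            (128, (3, 3), (2, 1)),
        ]
        layers, in_ch = [], 1
        for filters, kernel, pool in specs:
            layers += [
                nn.Conv2d(in_ch, filters, kernel,
                          padding=(kernel[0] // 2, kernel[1] // 2)),
                nn.BatchNorm2d(filters),
                nn.LeakyReLU(0.20),
                nn.MaxPool2d(pool),
            ]
            in_ch = filters
        self.conv = nn.Sequential(*layers)
        feat = 128 * (img_height // 16)  # image height is halved by each pooling
        self.rnn = nn.LSTM(feat, 256, num_layers=2, bidirectional=True,
                           dropout=0.5, batch_first=True)
        # Dense output with |vocab| + 1 units: the extra unit is the CTC blank
        self.out = nn.Linear(2 * 256, vocab_size + 1)

    def forward(self, x):            # x: (batch, 1, height, width)
        f = self.conv(x)             # (batch, 128, height', frames)
        f = f.permute(0, 3, 1, 2)    # (batch, frames, channels, height')
        f = f.flatten(2)             # (batch, frames, channels * height')
        h, _ = self.rnn(f)
        return self.out(h).log_softmax(-1)   # per-frame log-probabilities
```

Training would then combine this module with torch.nn.CTCLoss applied to the per-frame log-probabilities, matching the CTC-driven optimization described above.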
4.2 Materials

For the evaluation of our approach, we considered the Camera-based Printed Images of Music Staves (Camera-PrIMuS) database [4]. This corpus contains 87,678 real music staves of monophonic incipits (short sequences of notes, typically the first measures of a piece, used for indexing and identifying a melody or musical work) extracted from the Répertoire International des Sources Musicales (RISM). For each incipit, different representations are provided: an image with the rendered score (both plain and with artificial distortions), several encoding formats for the symbol information, and a MIDI file of the content. Although this dataset does not represent the hardest challenge for OMR or AMT, it provides both audio and images of the same pieces while allowing an artificial control of the performances for studying different scenarios.

Regarding the particular type of data used by each recognition model, the OMR system takes as input the artificially distorted staff image of the incipit, scaled to a height of 64 pixels while maintaining the aspect ratio. Concerning the AMT model, an audio file is synthesized from the MIDI file of each incipit with the FluidSynth software (https://www.fluidsynth.org/) and a piano timbre, considering a sampling rate of 22,050 Hz; then a time-frequency representation is obtained by means of the Constant-Q Transform with a hop length of 512 samples, 120 bins, and 24 bins per octave. This result is embedded as an image whose height is scaled to 256 pixels, maintaining the aspect ratio.

An initial data curation process was applied to the corpus for discarding samples which may cause a conflict in the combination (this is the case of samples containing long multi-rests, which barely extend the length of the score image but take many frames in the audio signal), resulting in 67,000 incipits. Since this reduced set still contains a considerably large number of elements, which would make the experiments take a considerable amount of memory and time, we randomly selected approximately a third of this curated set, resulting in 22,285 incipits with a label space of |Σ^i| = |Σ^a| = 1,180 tokens. Eventually, we derive three partitions (train, validation, and test) which correspond to 60%, 20%, and 20% of the latter amount of data, respectively.

With regard to the performance evaluation, we considered the Symbol Error Rate (SER) as in other neural end-to-end transcription systems [4,19]. This measure is defined as

SER (%) = \frac{\sum_{m=1}^{|S|} \mathrm{ED}(\hat{z}_m, z_m)}{\sum_{m=1}^{|S|} |z_m|}    (1)

where ED(·,·) stands for the string Edit distance, S is a set of test data, and z_m and ẑ_m are the target and estimated sequences, respectively.
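A direct implementation of Eq. (1) is straightforward; the sketch below computes the SER from lists of reference and estimated symbol sequences. The edit-distance routine is the standard Levenshtein recurrence, not code from the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, sa in enumerate(a, 1):
        cur = [i]
        for j, sb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (sa != sb)))   # substitution
        prev = cur
    return prev[-1]


def symbol_error_rate(references, hypotheses):
    """SER as in Eq. (1), expressed as a percentage."""
    errors = sum(edit_distance(hyp, ref)
                 for ref, hyp in zip(references, hypotheses))
    length = sum(len(ref) for ref in references)
    return 100.0 * errors / length
```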
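Returning to the audio front end described earlier in this section, a minimal sketch of how each incipit could be rendered and converted into a CQT representation is given below. The FluidSynth command-line invocation, the SoundFont path, and the file-naming convention are assumptions; the sampling rate, hop length, and bin settings follow the values reported above, and the final rescaling of the CQT to a 256-pixel-high image is omitted.

```python
import subprocess
import numpy as np
import librosa


def midi_to_cqt(midi_path: str, soundfont: str = "piano.sf2") -> np.ndarray:
    """Render a MIDI incipit with FluidSynth and compute its CQT magnitude."""
    wav_path = midi_path.replace(".mid", ".wav")
    # Render the MIDI file to audio with a piano SoundFont (FluidSynth CLI)
    subprocess.run(["fluidsynth", "-ni", soundfont, midi_path,
                    "-F", wav_path, "-r", "22050"], check=True)
    y, sr = librosa.load(wav_path, sr=22050)
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512,
                             n_bins=120, bins_per_octave=24))
    return cqt  # (120 bins, frames); rescaling to a 256-pixel image omitted
```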
4.3 Results

In preliminary experimentation, when training both the OMR and AMT systems with the same amount of data, the former depicted a remarkably better performance. This fact hindered the possible improvement of the multimodal proposal, as the AMT recognition model rarely corrected any flaw of the (almost perfect) OMR one. Thus, we propose four controlled scenarios with the goal of thoroughly analyzing the multimodal transcription proposal.

For the sake of compactness, all the results are depicted in Table 2, while the following sections provide an individual analysis for each case. A last additional section further explores the results to analyze the error typology of each transcription method as well as the incorrect hypotheses the fusion policy is able to correct.

Table 2 Symbol error rate (%) results for the OMR, AMT, and fusion policy for the scenarios considered

Scenario   OMR (%)   AMT (%)   Fusion (%)
A          26.09     27.53     18.56
B          18.57     27.53     15.14
C          10.82     11.64      6.64
D           2.38     27.53      5.70

4.3.1 Scenario A: SER_OMR ∼ SER_AMT

This first scenario poses the case in which the OMR and AMT systems depict a similar performance. For obtaining such a situation, we reduced the training data of the OMR system to, approximately, 2% of the initial partition considered, while that of the AMT system remained unaltered. Under these conditions, the individual OMR and AMT frameworks achieve error rates of 26.09% and 27.53%, respectively.

As may be checked, the proposed fusion policy reduces the error rate to a figure of 18.56%, which supposes a relative error decrease of approximately 28.86% with respect to that of the OMR system. This fact suggests that the fusion policy exhibits a synergistic behavior in which the resulting sequence takes the most accurate estimations of the OMR and AMT transcription methods.

4.3.2 Scenario B: SER_OMR < SER_AMT

The second scenario shows the case in which the individual performance of one of the transcription systems is considerably superior to that of the other one. For that, we reduced the training data devoted to the OMR system to, approximately, 3% of the initial partition considered, leaving the AMT one unaltered.

With this particular configuration, the starting point is that OMR improves the error rate of AMT by, approximately, 9%. While such a difference may, in principle, suggest that no improvement would be expected, it is eventually observed that the fusion decreases the error rate to 15.14%, which supposes a relative improvement of almost 19% with respect to the OMR system.

This experiment shows that, even in cases where one modality depicts a better performance than the other one, there is still a margin for improvement.

4.3.3 Scenario C: SER_OMR ∼ SER_AMT

The third posed scenario considers the case in which both transcription systems also achieve similar recognition rates, but with a remarkably better performance than that shown in Scenario A. To artificially increase the performance of the AMT process, we removed from the test set the music incipits whose error was superior to 30% according to this model. After the process, the number of elements in this test partition is reduced to 60% of the initial size, while the other partitions remain as in Scenario B.

In this case, the error rates depicted by the individual systems range between 10% and 11%, which already represent competitive transcription figures, at least for this type of architecture. However, when combining both modalities, the error rate decreases to 6.64%, which represents a relative improvement of, roughly, 40%.

This particular experiment proves that, even in cases where both stand-alone transcription methods report competitive performances, the multimodal framework may bring a noticeable benefit to the recognition process.
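For clarity, the relative improvements quoted above correspond to the error reduction with respect to the best single modality in each scenario; with the figures of Table 2:

```latex
% Relative error improvement with respect to the best single modality
\mathrm{A:}\; \frac{26.09 - 18.56}{26.09} \approx 28.9\%, \qquad
\mathrm{B:}\; \frac{18.57 - 15.14}{18.57} \approx 18.5\%, \qquad
\mathrm{C:}\; \frac{10.82 - 6.64}{10.82} \approx 38.6\%
```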
4.3.4 Scenario D: SER_OMR ≪ SER_AMT

In this last scenario, we pose the case where one of the systems greatly outperforms the other one. For that, we have considered the original data partitions introduced in Sect. 4.2 for both the OMR and AMT transcription systems.

In this particular case, it may be observed that the OMR model achieves an individual SER of 2.38%, while the AMT one remains at 27.53%. As expected, when fusing the two sources of information, the error increases to 5.70%, which supposes a remarkable performance decrease compared to the system achieving the best results, i.e., the OMR one.

Not surprisingly, these results show that when one of the modalities has very limited room for improvement, the multimodal framework is not expected to bring any benefit.

4.3.5 Multimodal fusion example

The previously posed scenarios show the performance of the proposed multimodal music transcription framework on a macroscopic level. Hence, we shall now analyze in detail the actual behavior of the method. For that, Table 3 shows an example of the results obtained for a given incipit with the OMR and AMT systems, as well as with the multimodal fusion proposed. The reference transcription is also provided.

Table 3 Example of the multimodal fusion on a music incipit. The OMR and AMT columns depict the sequences estimated by the stand-alone systems, while the Fusion column shows the combined estimation. The ground-truth transcription is also provided.

OMR                  AMT                   Fusion                Ground truth
Clef-G2              Clef-C1               Clef-G2               Clef-G2
KeySignature-FM      –                     KeySignature-FM       KeySignature-FM
TimeSignature-C      TimeSignature-C       TimeSignature-C       TimeSignature-C
Rest-half            Rest-half             Rest-half             Rest-half
Note-A4_eighth       Note-A4_eighth        Note-A4_eighth        Note-A4_eighth
Note-D5_eighth       Note-D5_eighth        Note-D5_eighth        Note-D5_eighth
Note-D5_sixteenth    Note-D5_sixteenth     Note-D5_sixteenth     Note-D5_sixteenth
Note-C5_sixteenth    Note-C#5_sixteenth    Note-C#5_sixteenth    Note-C#5_sixteenth
Note-D5_sixteenth    Note-D5_sixteenth     Note-D5_sixteenth     Note-D5_sixteenth
Note-E5_sixteenth    Note-E5_sixteenth     Note-E5_sixteenth     Note-E5_sixteenth
Barline              Barline               Barline               Barline
Note-F5_eighth       Note-F5_eighth        Note-F5_eighth        Note-F5_eighth
Note-D5_eighth       Note-D5_eighth        Note-D5_eighth        Note-D5_eighth
Rest-eighth          Rest-eighth           Rest-eighth           Rest-eighth
Note-C5_eighth       Note-C#5_eighth       Note-C#5_eighth       Note-C#5_eighth
Note-D5_eighth       Note-D5_eighth        Note-D5_eighth        Note-D5_eighth
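Purely as an illustration, re-using the hypothetical fuse() sketch from Sect. 3.2 on a fragment inspired by Table 3 (the confidence values are invented for the example; the paper reports only the symbols):

```python
# Invented confidences; only the symbols come from Table 3.
omr = [("Clef-G2", 0.98), ("Note-C5_sixteenth", 0.60), ("Barline", 0.97)]
amt = [("Clef-C1", 0.45), ("Note-C#5_sixteenth", 0.92), ("Barline", 0.95)]

print(fuse(omr, amt))
# ['Clef-G2', 'Note-C#5_sixteenth', 'Barline']
# The clef is taken from the more confident OMR hypothesis, while the
# altered note is recovered from the AMT estimation, mirroring the
# behaviour discussed below for the full incipit.
```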
A first point which can be observed is that, for this particular case, there is a strong agreement between the OMR and AMT modalities, there being only four cases in which the two sequences estimate different labels: one related to the clef, another one to the key signature, and the remaining ones related to actual music notes. We shall now examine how these conflicts are solved by the merging policy.

Focusing on the clef and key errors, note that the devised fusion policy estimates the correct labels to be the ones provided by the OMR recognition system. Given that this disagreement is solved, in a broad sense, by taking the token with a superior probability among the different modalities, it is possible to affirm that the OMR system performs better on this particular information than the AMT one. This conclusion is not strange since these two data (clef and key) are explicitly drawn in the score image while, for the case of audio data, this information must be inferred.

Furthermore, the errors present in the notes of the piece are better estimated by the AMT system rather than the OMR one. Again, this behavior is very intuitive since, while the note information is explicitly present in the audio data, in a score some information is elided due to the graphical representation rules. As an example, if the music piece depicts pitch alterations (sharp and/or flat notes), this information is explicitly engraved in the key signature of the piece and not represented with the notes to be recognized; oppositely, the acoustic data directly contains the note with its possible alteration in the audio stream.

Finally, it must be remarked that the relative improvement in terms of error rate of almost 40% achieved in Scenario C supports the initial hypothesis that the multimodal combination of OMR and AMT technologies may enhance the performance of stand-alone systems, at least in some particular scenarios where there is margin for improvement. This fact endorses the idea of further studying this new multimodal image and audio paradigm for music transcription tasks.

5 Conclusions

Music transcription, understood as obtaining a structured digital representation of the content of a given music source, is deemed a key challenge in the Music Information Retrieval (MIR) field for its applicability in a wide range of tasks including music heritage preservation, dissemination, and analysis, among others.

Within this MIR field, depending on the nature of the data at issue, transcription is approached from either the Optical Music Recognition (OMR) perspective, if dealing with image scores, or the so-called Automatic Music Transcription (AMT), when tackling acoustic recordings. While these fields have historically evolved separately, the fact that both tasks may represent their expected outputs in the same way allows developing a synergistic framework with which to achieve a more accurate transcription.
This work presents a first proposal that combines the predictions depicted by a couple of neural end-to-end OMR and AMT systems considering a local alignment approach over different scenarios dealing with monophonic music data. The results obtained validate our initial hypothesis that the multimodal combination of these two sources of information is capable of retrieving an improved transcription result. While the actual improvement depends on the scenario considered, our results attain up to around 40% of relative error improvement with respect to the single-modality transcription systems. It must also be pointed out that, out of the different scenarios posed, the only case in which the proposed multimodal fusion does not imply any benefit is when one of the modalities remarkably outperforms the other one and reaches an almost perfect performance.

In light of these results, different research avenues may be explored to further improve the results obtained. The first one is the actual combination of the hypotheses depicted by the individual systems in a probabilistic framework, such as that of word graphs or confusion networks. In addition, while these proposals work on a prediction-level combination, the case in which this fusion is done in previous stages of the pipeline, for instance the feature extraction one, may also be explored. Finally, experimentation may also be extended to more challenging data such as handwritten scores, different instrumentation, or polyphonic music.

Author Contributions: C.F., J.J.V.-M., F.J.C. and J.C.-Z. made equal contributions as regards the conception of the work, the experimental work, the data analysis, and the writing of the paper.

Funding: Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This research was partially funded by the Spanish "Ministerio de Ciencia e Innovación" through project MultiScore (PID2020-118447RA-I00). The first author acknowledges the support from the Spanish "Ministerio de Educación y Formación Profesional" through grant 20CO1/000966. The second and third authors acknowledge support from the "Programa I+D+i de la Generalitat Valenciana" through grants ACIF/2019/042 and APOSTD/2020/256, respectively.

Data availability: Data are available from the authors upon request.

Declarations

Conflict of interest: The authors declare that they have no conflict of interest.

Ethical approval: This paper contains no cases of studies with human participants performed by any of the authors.

Code availability: Not applicable.

Consent to participate: Not applicable.

Consent for publication: Not applicable.

Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Benetos E, Dixon S, Duan Z, Ewert S (2018) Automatic music transcription: an overview. IEEE Signal Process Mag 36(1):20–30
2. Benetos E, Dixon S, Giannoulis D, Kirchhoff H, Klapuri A (2013) Automatic music transcription: challenges and future directions. J Intell Inf Syst 41(3):407–434
3. Calvo-Zaragoza J, Hajič J Jr, Pacha A (2020) Understanding optical music recognition. ACM Comput Surv (CSUR) 53(4):1–35
4. Calvo-Zaragoza J, Rizo D (2018) Camera-PrIMuS: neural end-to-end optical music recognition on realistic monophonic scores. In: Proceedings of the 19th international society for music information retrieval conference, Paris, France, pp 248–255
5. Calvo-Zaragoza J, Toselli AH, Vidal E (2017) Handwritten music recognition for mensural notation: formulation, data and baseline results. In: 14th IAPR international conference on document analysis and recognition, vol 1, pp 1081–1086
6. Calvo-Zaragoza J, Valero-Mas JJ, Pertusa A (2017) End-to-end optical music recognition using neural networks. In: Proceedings of the 18th international society for music information retrieval conference, Suzhou, China, pp 472–477
7. Dumas B, Signer B, Lalanne D (2012) Fusion in multimodal interactive systems: an HMM-based algorithm for user-induced adaptation. In: Proceedings of the 4th ACM SIGCHI symposium on engineering interactive computing systems, pp 15–24
8. Granell E, Martínez-Hinarejos CD (2015) Multimodal output combination for transcribing historical handwritten documents. In: International conference on computer analysis of images and patterns, Springer, pp 246–260
9. Granell E, Martínez-Hinarejos CD, Romero V (2018) Improving transcription of manuscripts with multimodality and interaction. In: Proceedings of IberSPEECH, pp 92–96
10. Graves A, Fernández S, Gomez F, Schmidhuber J (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd international conference on machine learning, New York, USA, pp 369–376
11. Iñesta JM, Ponce de León PJ, Rizo D, Oncina J, Micó L, Rico-Juan JR, Pérez-Sancho C, Pertusa A (2018) Hispamus: handwritten Spanish music heritage preservation by automatic transcription. In: 1st international workshop on reading music systems, pp 17–18
12. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: 3rd international conference on learning representations, San Diego, USA
13. Kristensson PO, Vertanen K (2011) Asynchronous multimodal text entry using speech and gesture keyboards. In: Twelfth annual conference of the international speech communication association
14. Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10(8):707–710
15. Miki M, Kitaoka N, Miyajima C, Nishino T, Takeda K (2014) Improvement of multimodal gesture and speech recognition performance using time intervals between gestures and accompanying speech. EURASIP J Audio Speech Music Process 2014(1):1–7
16. Pitsikalis V, Katsamanis A, Theodorakis S, Maragos P (2017) Multimodal gesture recognition via multiple hypotheses rescoring. In: Escalera S, Guyon I, Athitsos V (eds) Gesture recognition. Springer, Cham, pp 467–496
17. Rebelo A, Fujinaga I, Paszkiewicz F, Marcal AR, Guedes C, Cardoso JS (2012) Optical music recognition: state-of-the-art and open issues. Int J Multimed Inf Retr 1(3):173–190
18. Román MA, Pertusa A, Calvo-Zaragoza J (2020) Data representations for audio-to-score monophonic music transcription. Expert Syst Appl 162:113769
19. Román M, Pertusa A, Calvo-Zaragoza J (2019) A holistic approach to polyphonic music transcription with neural networks. In: Proceedings of the 20th international society for music information retrieval conference, Delft, The Netherlands, pp 731–737
20. Schedl M, Gómez E, Urbano J (2014) Music information retrieval: recent developments and applications. Found Trends Inf Retr 8:127–261. https://doi.org/10.1561/1500000042
21. Serra X, Magas M, Benetos E, Chudy M, Dixon S, Flexer A, Gómez E, Gouyon F, Herrera P, Jordà S, et al (2013) Roadmap for music information research. The MIReS Consortium. Creative Commons BY-NC-ND 3.0 license
22. Simonetta F, Ntalampiras S, Avanzini F (2019) Multimodal music information processing and retrieval: survey and future challenges. In: International workshop on multilayer music representation and processing, pp 10–18
23. Singh A, Sangwan A, Hansen JHL (2012) Improved parcel sorting by combining automatic speech and character recognition. In: 2012 IEEE international conference on emerging signal processing applications, pp 52–55. https://doi.org/10.1109/ESPA.2012.6152444
24. Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147(1):195–197
25. Toselli AH, Vidal E, Casacuberta F (2011) Multimodal interactive pattern recognition and applications. Springer Science & Business Media, Berlin
Publisher's Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Loading next page...
 
/lp/springer-journals/multimodal-image-and-audio-music-transcription-eL7umfBtA1

References (31)

Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2021
ISSN
2192-6611
eISSN
2192-662X
DOI
10.1007/s13735-021-00221-6
Publisher site
See Article on Publisher Site

Abstract

Optical Music Recognition (OMR) and Automatic Music Transcription (AMT) stand for the research fields that aim at obtaining a structured digital representation from sheet music images and acoustic recordings, respectively. While these fields have traditionally evolved independently, the fact that both tasks may share the same output representation poses the question of whether they could be combined in a synergistic manner to exploit the individual transcription advantages depicted by each modality. To evaluate this hypothesis, this paper presents a multimodal framework that combines the predictions from two neural end-to-end OMR and AMT systems by considering a local alignment approach. We assess several experimental scenarios with monophonic music pieces to evaluate our approach under different conditions of the individual transcription systems. In general, the multimodal framework clearly outperforms the single recognition modalities, attaining a relative improvement close to 40% in the best case. Our initial premise is, therefore, validated, thus opening avenues for further research in multimodal OMR-AMT transcription. Keywords Multimodal recognition · Automatic music transcription · Optical music recognition and deep learning 1 Introduction format [3]; on the other hand, when considering acoustic music signals, Automatic Music Transcription (AMT) rep- Bringing music sources into a structured digital represen- resents the field devoted to the research on computational tation, typically known as transcription, remains as one methods for transcribing them into some form of structured of the key, yet challenging, tasks in the Music Informa- digital music notation [1]. It must be remarked that, despite tion Retrieval (MIR) field [17,21]. Such digitization not pursuing the same goal, these two fields have been developed only improves music heritage preservation and dissemina- separately due to the different nature of the source data. tion [11], but it also enables the use of computer-based tools Multimodal recognition frameworks, understood as those which allow indexing, analysis, and retrieval, among many which take as input multiple representations or modalities of other tasks [20]. the same piece of data, have proved to generally achieve bet- In this context, two particular research lines stand out ter results than their respective single-modality systems [25]. within the MIR community: on the one hand, when tackling In such schemes, it is assumed that the different modalities music scores images, the field of Optical Music Recogni- provide complementary information to the system, which tion (OMR) investigates how to computationally read these eventually results in an enhancement of the overall recogni- documents and store their music information in a symbolic tion performance. Such approaches are generally classified in one of these fashions [7]: (i) those in which the individual B Carlos de la Fuente features of the modalities are directly merged with the con- cdlf4@alu.ua.es strain of requiring the input elements to be synchronized to Jose J. Valero-Mas some extent (feature or early-fusion level); or those in which jjvalero@dlsi.ua.es the merging process is done with the hypotheses obtained by Francisco J. Castellanos each individual modality, thus not requiring both systems to fcastellanos@dlsi.ua.es be synchronized (decision or late-fusion level). 
Jorge Calvo-Zaragoza Regarding the MIR field, this premise has also been jcalvo@dlsi.ua.es explored in particular cases as music recommendation, artist identification or instrument classification, among others [22]. Department of Software and Computing Systems, University of Alicante, Alicante, Spain 123 78 International Journal of Multimedia Information Retrieval (2022) 11:77–84 Music transcription is no strange and has also contemplated experimental set-up considered as well as results and dis- the use of multimodality as a means of solving certain glass cussion; finally, Sect. 5 concludes the work and poses future ceiling reached in single-modality approaches. For instance, research. research on AMT has considered the use of additional sources of information as, for instance, onset events, harmonic infor- mation, or timbre [2]. Nevertheless, to our best knowledge, 2 Related work no existing work has considered that a given score image and its acoustic performance may be considered two differ- While multimodal transcription approaches based on the ent modalities of the same piece to be transcribed. Under this combination of OMR and AMT have not been yet explored premise, transcription results may be enhanced if the individ- in the MIR field, we may find some research examples in the ual, and somehow complementary, descriptions by the OMR related areas of Text Recognition (TR) and Automatic Speech and AMT systems are adequately combined. Recognition (ASR). It must be noted that the multimodal While this idea might have been discussed in the past, we fusion in these cases is also carried out at the decision level, consider that classical formulations of both OMR and AMT keeping the commented advantage of not requiring multi- frameworks did not allow exploring a multimodal approach. modal training data for the underlying models. However, recent developments in these fields define both One of the first examples in this regard is the proposal by tasks in terms of a sequence labeling problem [10], thus Singh et al. [23], in which TR and ASR where fused in the enabling research on the combined paradigm. Note that context of postal code recognition using a heuristic approach when addressing transcription tasks within this formulation, based on the Edit distance [14]. More recent approaches the input data (either image or audio) is directly decoded related to handwritten manuscripts have resorted to proba- into a sequence of music-notation symbols, having this bilistic frameworks for merging the individual hypotheses typically been carried out considering neural end-to-end sys- by the systems as those of using confusion networks [8]or tems [4,19]. the word-graph hypothesis spaces [9]. One could argue whether it may be practical, or even real- It is worth noting that this type of multimodality may be istic, having both the acoustic and image representations of also found in other fields as now the Gesture Recognition the piece to be transcribed. We assume, however, that for a (GR) one. For instance, the work by Pitsikalis et al. [16] music practitioner it would be, at least, more appealing to improves the recognition rate by re-scoring the different play a composition reading a music sheet rather than man- hypotheses of the GR model with information from an ASR ually transcribing it. Note that we find the same scenario in system. 
Within this same context other works have explored the field of Handwritten Text Recognition, where producing the alignment of different hypotheses using Dynamic Pro- a uttering out of a written text and using a speech recogni- gramming approaches [15] or, again, a confusion networks tion system for then fusing the decisions required less effort framework [13]. than manually transcribing the text or correcting the errors In this work, we tackle this multimodal music transcrip- produced by the text recognition system [8]. tion problem considering the alignment, at a sequence level, This work explores and studies whether the transcrip- of the individual hypotheses depicted by stand-alone end- tion results of a multimodal combination of sheet scores and to-end OMR and AMT systems. As it will be shown, when acoustic performances of music pieces improves those of the adequately configured, this approach is capable of success- stand-alone modalities. For that, we propose a decision-level fully improving the recognition rate of the single-modality fusion policy based on the combination of the most proba- transcription systems. ble symbol sequences depicted by two end-to-end OMR and AMT systems. The experiments have been performed with a corpus of monophonic music considering multiple scenar- ios which differ in the manner the individual transcription 3 Methodology systems are trained, hence allowing a thorough analysis of the proposal. The results obtained prove that the combined We consider two neural end-to-end transcription systems approach improves the transcription capabilities with respect as the base OMR and AMT methods for validating our to single-modality systems in cases in which their individual fusion proposal. As commented, the choice of these partic- performances do not remarkably differ. This fact validates ular approaches is that they allow a common formulation of our initial premise and poses new research questions to be the individual modalities, thus facilitating the definition of a addressed and explored. fusion policy. Note that, in this case, the combination policy The rest of the paper is structured as follows: Sect. 2 works at a decision, or sequence, level, as it can be observed contextualizes the work within the related literature; Sect. 3 in Fig. 1. To properly describe these design principles, we describes our multimodal framework; Sect. 4 presents the shall introduce some notation. 123 International Journal of Multimedia Information Retrieval (2022) 11:77–84 79 Fig. 1 Graphical description of the scheme proposed. For a given music piece, a score image x and an audio signal (as a CQT spectrogram) x are provided to the OMR and AMT systems, retrieving sequences a i z and z , respectively. The m m multimodal fusion policy eventually produces the sequence z |T | Let T = {(x , z ) : x ∈ X , z ∈ Z} represent a the actual output is a posteriogram with a number of frames m m m m m=1 given by the recurrent stage and | | activations each. Most set of data where sample x drawn from space X corresponds to symbol sequence z = z ,..., z from space Z commonly, the final prediction is obtained out of this pos- m m1 mN teriogram using a greedy approach which retrieves the most considering the underlying function g : X → Z. Note that probable symbol per step and a posterior squash function the latter space is defined as Z =  where  represents the score-level symbol vocabulary. which merges consecutive repeated symbols and removes the blank label. 
In our case, we slightly modify this decoding Since we are dealing with two sources of information, we i a have different representation spaces X and X with vocab- approach for allowing the multimodal fusion of both sources i a of information. ularies  and  related to the image scores and audio signals, respectively. While not strictly necessary, for sim- plicity we are constraining both systems to consider the same 3.2 Multimodal fusion policy i a vocabulary, i.e.,  =  . Also note that, for a given m-th i i a a element, while staff x ∈ X and audio x ∈ X sig- m m The proposed policy takes as starting point the posteriograms nals depict a different origin, the target sequence z ∈ Z of the two recognition modalities, OMR and AMT. For each is deemed to be the same. posteriogram, a greedy decoding policy is applied to each of them for obtaining their most probable symbols per frame 3.1 Neural end-to-end base recognition systems together with their per-symbol probabilities. After that, the CTC squash function merges consecutive Concerning the recognition architectures, we consider a Con- symbols for each modality with the particularity of deriving volutional Recurrent Neural Network (CRNN) scheme to the per-symbol probability by averaging the individual prob- approximate g (·). Recent works have applied this approach ability values of the merged symbols. For example, when any to both OMR [5,6] and AMT [18,19] transcription systems of the models obtains a sequence in which the same symbol with remarkably successful results. Hence, we shall resort to is predicted for 4 consecutive frames, the algorithm com- these works to define our baseline single-modality transcrip- bines them and computes the average probabilities of these tion architectures within the multimodal framework. involved frames. After that, the blank symbols estimated by i a More in depth, a CRNN architecture is formed by an initial CTC are also removed, retrieving predictions z and z , m m block of convolutional layers devised to learn the adequate which correspond to the image and audio recognition mod- features for the task at issue followed by another group of els, respectively. i a recurrent layers that model their temporal dependencies. To Since sequences z and z may not match in terms of m m achieve an end-to-end system with such architecture, CRNN length, it is necessary to align both estimations for merging models are trained using the Connectionist Temporal Clas- them. Hence, we consider the Smith-Waterman (SW) local sification (CTC) algorithm [10]. In a practical sense, this alignment algorithm [24], which performs a search for the algorithm only requires the different input signals and their most similar regions between pairs of sequences. associated transcripts as sequences of symbols, without any Eventually, the final estimation z is obtained from these specific input-output alignment at a finer level. Note that CTC two aligned sequences following these premises: (i) if both requires the inclusion of an additional “blank” symbol within sequences match on a token, it is included in the resulting the  vocabulary, i.e.,  =  ∪ {blank} due to its training estimation; (ii) if the sequences disagree on a token, the one procedure. with the highest probability is included in the estimation; Since CTC assumes that the architecture contains a fully- (iii) if one of the sequences misses a symbol, that of the connected layer of | | outputs with a softmax activation, other sequence is included in the estimation. 
123 80 International Journal of Multimedia Information Retrieval (2022) 11:77–84 4 Experiments Regarding the particular type of data used by each recog- nition model, the OMR system takes as input the artificially Having defined the individual recognition systems as well distorted staff image of the incipit scaled to a height of 64 as the multimodal fusion proposal, this section presents the pixels, while maintaining the aspect ratio. Concerning the experimental part of the work. For that, we introduce the AMT model, an audio file is synthesized from the MIDI file CRNN schemes considered for OMR and AMT, we describe for each incipit with the FluidSynth software and a piano the corpus and metrics for the evaluation, and finally we timbre, considering a sampling rate of 22,050 Hz; then a present and discuss the results obtained. As previously stated, time-frequency representation is obtained by means of the the combination of OMR and AMT has not been previously Constant-Q Transform with a hop length of 512 samples, addressed in the MIR field. Hence, the experimental section 120 bins, and 24 bins per octave. This result is embedded as of the work focuses on comparing the performance of the an image whose height is scaled to 256 pixels, maintaining multimodal approach against that of the individual transcrip- the aspect ratio. tion models, given that no other results can be reported from An initial data curation process was applied to the cor- the literature. pus for discarding samples which may cause a conflict in the combination, resulting in 67,000 incipits. Since this reduced set still contains a considerably large amount of elements, we 4.1 CRNN models randomly selected approximately a third of this curated set for our experiments to take a considerable amount of mem- The different CRNN topologies considered for both the OMR ory and time, resulting in 22,285 incipits with a label space and the AMT systems are described in Table 1. These config- i a of | |=| |= 1, 180 tokens. Eventually, we derive three urations are based on those used by recent works addressing partitions—train, validation, and test—which correspond to the individual OMR and AMT tasks as a sequence labeling the 60%, 20%, and 20% of the latter amount of data, respec- problem with deep neural networks [4,19]. It is important tively. to highlight that these architectures can be considered as the With regard to the performance evaluation, we considered state of the art in the aforementioned transcription tasks, thus the Symbol Error Rate (SER) as in other neural end-to-end being good representatives of the attainable performance in transcription systems [4,19]. This measure is defined as: each of the baseline cases. Note that, as aforementioned, the last recurrent layer of the schemes is connected to a dense |S| i a unit with | |+ 1 =| |+ 1 output neurons and a softmax ED z , z m=1 m SER (%) = (1) |S| activation. |z | m=1 These architectures were trained using the backpropaga- tion method driven by CTC for 115 epochs using the ADAM where ED (·, ·) stands for the string Edit distance, S a set of optimizer [12]. Batch size was fixed to 16 for the OMR sys- test data, and z and z the target and estimated sequences, tem, while for the AMT it was set 1 because of being more respectively. memory-intensive. 
4.3 Results 4.2 Materials In preliminary experimentation, when training both the OMR For the evaluation of our approach, we considered the and AMT systems with the same amount of data, the former Camera-based Printed Images of Music Staves (Camera- one depicted a remarkably better performance. This fact hin- PrIMuS) database [4]. This corpus contains 87, 678 real dered the possible improvement of the multimodal proposal music staves of monophonic incipits extracted from the as the AMT recognition model rarely corrected any flaw of Répertoire International des Sources Musicales (RISM). For the (almost perfect) OMR one. Thus, we propose four con- each incipit, different representations are provided: an image trolled scenarios with the goal of thoroughly analyzing the with the rendered score (both plain and with artificial distor- multimodal transcription proposal. tions), several encoding formats for the symbol information, For the sake of compactness, all the results are depicted and a MIDI file of the content. Although this dataset does not in Table 2 while the following sections provide an individ- represent the hardest challenge for OMR or AMT, it provides ual analysis for each case. A last additional section further both audio and images of the same pieces while allowing an explores the results to analyze the error typology by each artificial control of the performances for studying different scenarios. https://www.fluidsynth.org/. This is the case of samples containing long multi-rests, which barely Short sequence of notes, typically the first measures of the piece, used extend the length of the score image but take many frames in the audio for indexing and identifying a melody or musical work. signal. 123 International Journal of Multimedia Information Retrieval (2022) 11:77–84 81 Table 1 CRNN configurations considered Model Layer 1 Layer 2 Layer 3 Layer 4 Layer 5 Layer 6 OMR Conv(64, 5 × 5) Conv(64, 5 × 5) Conv(128, 3 × 3) Conv(128, 3 × 3) BatchNorm BatchNorm BatchNorm BatchNorm BLSTM(256) BLSTM(256) LeakyReLU(0.20) LeakyReLU(0.20) LeakyReLU(0.20) LeakyReLU(0.20) Dropout(0.50) Dropout(0.50) MaxPool(2 × 2) MaxPool(1 × 2) MaxPool(1 × 2) MaxPool(1 × 2) AMT Conv(8, 2 × 10) Conv(8, 5 × 8) BatchNorm BatchNorm BLSTM(256) BLSTM(256) LeakyReLU(0.20) LeakyReLU(0.20) Dropout(0.50) Dropout(0.50) MaxPool(1 × 2) MaxPool(1 × 2) Notation: Conv( f ,w × h) stands for a convolution layer of f filters of size w × h pixels, BatchNorm performs the normalization of the batch, LeakyReLU(α) represents a leaky rectified linear unit activation with negative slope value of α, MaxPool2D(w × h ) stands for the max-pooling p p operator of dimensions w × h pixels, BLSTM(n) denotes a bidirectional long short-term memory unit with n neurons, and Dropout(d) performs p p the dropout operation with d probability Table 2 Symbol error rate (%) results for the OMR, AMT, and fusion mately, a 3% of the initial partition considered, remaining policy for the scenarios considered AMT unaltered. With this particular configuration the starting point is that Scenario OMR (%) AMT (%) Fusion (%) OMR improves the error rate of AMT in, approximately, a A26.09 27.53 18.56 9%. While such difference may, in principle, suggest that no B18.57 27.53 15.14 improvement would be expected, it is eventually observed C10.82 11.64 6.64 that the fusion decreases the error rate to 15.14%, which D2.38 27.53 5.70 supposes a relative improvement of almost 19% with respect to the OMR system. 
4.3 Results

In preliminary experimentation, when training both the OMR and AMT systems with the same amount of data, the former depicted a remarkably better performance. This fact hindered the possible improvement of the multimodal proposal, as the AMT recognition model rarely corrected any flaw of the (almost perfect) OMR one. Thus, we propose four controlled scenarios with the goal of thoroughly analyzing the multimodal transcription proposal.

For the sake of compactness, all the results are depicted in Table 2, while the following sections provide an individual analysis for each case. A last additional section further explores the results to analyze the error typology of each transcription method as well as the incorrect hypotheses the fusion policy is able to correct.

Table 2 Symbol error rate (%) results for the OMR, AMT, and fusion policy for the scenarios considered

Scenario   OMR (%)   AMT (%)   Fusion (%)
A          26.09     27.53     18.56
B          18.57     27.53     15.14
C          10.82     11.64      6.64
D           2.38     27.53      5.70

4.3.1 Scenario A: SER_OMR ∼ SER_AMT

This first scenario poses the case in which the OMR and AMT systems depict a similar performance. To obtain such a situation, we reduced the training data of the OMR system to, approximately, 2% of the initial partition, while that of the AMT system remained unaltered. Under these conditions, the individual OMR and AMT frameworks achieve error rates of 26.09% and 27.53%, respectively.

As it may be checked, the proposed fusion policy reduces the error rate to a figure of 18.56%, which supposes a relative error decrease of approximately 28.86% with respect to that of the OMR system. This fact suggests that the fusion policy exhibits a synergistic behavior in which the resulting sequence takes the most accurate estimations of the OMR and AMT transcription methods.

4.3.2 Scenario B: SER_OMR < SER_AMT

The second scenario shows the case in which the individual performance of one of the transcription systems is considerably superior to that of the other one. For that, we reduced the training data devoted to the OMR system to, approximately, 3% of the initial partition, with the AMT training data remaining unaltered.

With this particular configuration, the starting point is that the OMR system improves on the error rate of the AMT one by, approximately, 9%. While such a difference may, in principle, suggest that no improvement would be expected, it is eventually observed that the fusion decreases the error rate to 15.14%, which supposes a relative improvement of almost 19% with respect to the OMR system.

This experiment shows that, even in cases where one modality depicts a better performance than the other, there is still a margin for improvement.

4.3.3 Scenario C: SER_OMR ∼ SER_AMT

The third posed scenario considers the case in which both transcription systems also achieve similar recognition rates, but with a remarkably better performance than those shown in Scenario A. To artificially increase the performance of the AMT process, we removed from the test set the music incipits whose error was superior to 30% according to this model. After the process, the number of elements in this test partition is reduced to 60% of the initial size, while the other partitions remain as in Scenario B.

In this case, the error rates depicted by the individual systems range between 10% and 11%, which already represent competitive transcription figures, at least for this type of architecture. However, when combining both modalities, the error rate decreases to 6.64%, which represents a relative improvement of, roughly, 40%.

This particular experiment proves that, even in cases where both stand-alone transcription methods report competitive performances, the multimodal framework may provide a noticeable benefit in the recognition process.
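To make the reported figures easier to trace, the relative improvements quoted for Scenarios B and C follow directly from Table 2 when the fusion error is compared against the best single modality (the OMR system in both cases):

\frac{18.57 - 15.14}{18.57} \approx 18.5\,\%, \qquad \frac{10.82 - 6.64}{10.82} \approx 38.6\,\%

that is, the "almost 19%" and "roughly 40%" improvements discussed above.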
4.3.4 Scenario D: SER_OMR ≪ SER_AMT

In this last scenario, we pose the case where one of the systems greatly outperforms the other one. For that, we have considered the original data partitions introduced in Sect. 4.2 for both the OMR and AMT transcription systems.

In this particular case, it may be observed that the OMR model achieves an individual SER of 2.38%, while the AMT one remains at 27.53%. As expected, when fusing the two sources of information, the error increases to 5.70%, which supposes a remarkable performance decrease compared to the system achieving the best results, i.e., the OMR one. Not surprisingly, these results show that, when one of the modalities has very limited room for improvement, the multimodal framework is not expected to bring any benefit.

4.3.5 Multimodal fusion example

The previously posed scenarios show the performance of the proposed multimodal music transcription framework at a macroscopic level. Hence, we shall now analyze the actual behavior of the method in detail. For that, Table 3 shows an example of the results obtained for a given incipit with the OMR and AMT systems, as well as with the multimodal fusion proposed. The reference transcription is also provided.

Table 3 Example of the multimodal fusion on a music incipit

OMR                 AMT                  Fusion               Ground truth
Clef-G2             Clef-C1              Clef-G2              Clef-G2
KeySignature-FM     –                    KeySignature-FM      KeySignature-FM
TimeSignature-C     TimeSignature-C      TimeSignature-C      TimeSignature-C
Rest-half           Rest-half            Rest-half            Rest-half
Note-A4_eighth      Note-A4_eighth       Note-A4_eighth       Note-A4_eighth
Note-D5_eighth      Note-D5_eighth       Note-D5_eighth       Note-D5_eighth
Note-D5_sixteenth   Note-D5_sixteenth    Note-D5_sixteenth    Note-D5_sixteenth
Note-C5_sixteenth   Note-C#5_sixteenth   Note-C#5_sixteenth   Note-C#5_sixteenth
Note-D5_sixteenth   Note-D5_sixteenth    Note-D5_sixteenth    Note-D5_sixteenth
Note-E5_sixteenth   Note-E5_sixteenth    Note-E5_sixteenth    Note-E5_sixteenth
Barline             Barline              Barline              Barline
Note-F5_eighth      Note-F5_eighth       Note-F5_eighth       Note-F5_eighth
Note-D5_eighth      Note-D5_eighth       Note-D5_eighth       Note-D5_eighth
Rest-eighth         Rest-eighth          Rest-eighth          Rest-eighth
Note-C5_eighth      Note-C#5_eighth      Note-C#5_eighth      Note-C#5_eighth
Note-D5_eighth      Note-D5_eighth       Note-D5_eighth       Note-D5_eighth

The OMR and AMT columns depict the sequences estimated by the stand-alone systems, while the Fusion column shows the combined estimation. The ground-truth transcription is also provided. The disagreements between modalities are the clef, the key signature, and the two C5/C#5 note tokens.

A first point which can be observed is that, for this particular case, there is a strong agreement between the OMR and AMT modalities, there being only four cases in which the two sequences estimate different labels: one related to the clef, another one to the key signature, and the remaining two related to actual music notes. We shall now examine how these conflicts are solved by the merging policy.

Focusing on the clef and key errors, note that the devised fusion policy estimates the correct labels to be the ones provided by the OMR recognition system. Given that this disagreement is solved, broadly speaking, by taking the token with the higher probability among the different modalities, it is possible to affirm that the OMR system performs better on this particular information than the AMT one. This conclusion is not strange, since these two pieces of information (clef and key) are explicitly drawn in the score image while, in the case of audio data, they must be inferred.

Furthermore, the errors present in the notes of the piece are better estimated by the AMT system than by the OMR one. Again, this behavior is very intuitive since, while the note information is explicitly present in the audio data, in a score some information is elided due to graphical representation rules. As an example, if the music piece depicts pitch alterations (sharp and/or flat notes), this information is explicitly engraved in the key signature of the piece and not represented with the notes to be recognized; oppositely, the acoustic data directly contains the note with its possible alteration in the audio stream.
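To make the merging step more concrete, the sketch below illustrates one way such a decision-level combination could be implemented: the two hypotheses are aligned and, wherever they disagree, the token to which its model assigns the higher probability is kept. This is only an illustration under simplifying assumptions; it relies on Python's difflib matching rather than a Smith–Waterman-style local alignment [24] as in the actual proposal, and it assumes that per-token confidences are available from the decoding of each system.

```python
# Hypothetical sketch of a decision-level OMR/AMT fusion; not the authors' implementation.
from difflib import SequenceMatcher
from itertools import zip_longest


def fuse(omr_tokens, omr_probs, amt_tokens, amt_probs):
    """Align both hypotheses; where they disagree, keep the more confident token."""
    fused = []
    sm = SequenceMatcher(a=omr_tokens, b=amt_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":                  # both modalities agree: copy the tokens
            fused.extend(omr_tokens[i1:i2])
        else:                               # disagreement: compare per-token confidences
            for i, j in zip_longest(range(i1, i2), range(j1, j2)):
                if j is None or (i is not None and omr_probs[i] >= amt_probs[j]):
                    fused.append(omr_tokens[i])
                else:
                    fused.append(amt_tokens[j])
    return fused


# Toy usage mirroring the clef and C5/C#5 disagreements of Table 3
# (the confidence values are made up for the example).
omr = ["Clef-G2", "KeySignature-FM", "Note-C5_eighth", "Note-D5_eighth"]
amt = ["Clef-C1", "KeySignature-FM", "Note-C#5_eighth", "Note-D5_eighth"]
print(fuse(omr, [0.9, 0.9, 0.6, 0.9], amt, [0.4, 0.8, 0.9, 0.9]))
# ['Clef-G2', 'KeySignature-FM', 'Note-C#5_eighth', 'Note-D5_eighth']
```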
Finally, it must be remarked that the relative improvement in terms of error rate of almost 40% achieved in Scenario C supports the initial hypothesis that the multimodal combination of OMR and AMT technologies may enhance the transcription attained by stand-alone systems, at least in particular scenarios where there is margin for improvement. This fact endorses the idea of further studying this new multimodal image and audio paradigm for music transcription tasks.

5 Conclusions

Music transcription, understood as obtaining a structured digital representation of the content of a given music source, is deemed a key challenge in the Music Information Retrieval (MIR) field for its applicability to a wide range of tasks, including music heritage preservation, dissemination, and analysis, among others.

Within this MIR field, depending on the nature of the data at issue, transcription is approached either from the Optical Music Recognition (OMR) perspective, when dealing with score images, or from the so-called Automatic Music Transcription (AMT) one, when tackling acoustic recordings. While these fields have historically evolved separately, the fact that both tasks may represent their expected outputs in the same way allows developing a synergistic framework with which to achieve a more accurate transcription.

This work presents a first proposal that combines the predictions depicted by a pair of neural end-to-end OMR and AMT systems considering a local alignment approach over different scenarios dealing with monophonic music data.
The results obtained validate our initial hypothesis that the multimodal combination of these two sources of information is capable of retrieving an improved transcription result. While the actual improvement depends on the scenario considered, our results attain up to around 40% of relative error improvement with respect to the single-modality transcription systems. It must also be pointed out that, out of the different scenarios posed, the only case in which the proposed multimodal fusion does not imply any benefit is when one of the modalities remarkably outperforms the other one and reaches an almost perfect performance.

In light of these results, different research avenues may be explored to further improve the results obtained. The first one is the actual combination of the hypotheses depicted by the individual systems in a probabilistic framework, such as that of word graphs or confusion networks. In addition, while the present proposal works on a prediction-level combination, the case in which this fusion is performed at earlier stages of the pipeline, for instance at the feature extraction level, may also be explored. Finally, the experimentation may also be extended to more challenging data such as handwritten scores, different instrumentation, or polyphonic music.

Author Contributions C.F., J.J.V.-M., F.J.C. and J.C.-Z. made equal contributions as regards the conception of the work, the experimental work, the data analysis, and the writing of the paper.

Funding Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This research was partially funded by the Spanish "Ministerio de Ciencia e Innovación" through project MultiScore (PID2020-118447RA-I00). The first author acknowledges the support from the Spanish "Ministerio de Educación y Formación Profesional" through grant 20CO1/000966. The second and third authors acknowledge support from the "Programa I+D+i de la Generalitat Valenciana" through grants ACIF/2019/042 and APOSTD/2020/256, respectively.

Data availability Data are available from the authors upon request.

Declarations

Conflict of interest The authors declare that they have no conflict of interest.

Ethical approval This paper contains no cases of studies with human participants performed by any of the authors.

Code availability Not applicable.

Consent to participate Not applicable.

Consent for publication Not applicable.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Benetos E, Dixon S, Duan Z, Ewert S (2018) Automatic music transcription: an overview. IEEE Signal Process Mag 36(1):20–30
2. Benetos E, Dixon S, Giannoulis D, Kirchhoff H, Klapuri A (2013) Automatic music transcription: challenges and future directions. J Intell Inf Syst 41(3):407–434
3. Calvo-Zaragoza J, Hajič J Jr, Pacha A (2020) Understanding optical music recognition. ACM Comput Surv (CSUR) 53(4):1–35
4. Calvo-Zaragoza J, Rizo D (2018) Camera-PrIMuS: neural end-to-end optical music recognition on realistic monophonic scores. In: Proceedings of the 19th international society for music information retrieval conference, pp. 248–255. Paris, France
5. Calvo-Zaragoza J, Toselli AH, Vidal E (2017) Handwritten music recognition for mensural notation: formulation, data and baseline results. In: 14th IAPR international conference on document analysis and recognition, vol. 1, pp. 1081–1086
6. Calvo-Zaragoza J, Valero-Mas JJ, Pertusa A (2017) End-to-end optical music recognition using neural networks. In: Proceedings of the 18th international society for music information retrieval conference, pp. 472–477. Suzhou, China
7. Dumas B, Signer B, Lalanne D (2012) Fusion in multimodal interactive systems: an HMM-based algorithm for user-induced adaptation. In: Proceedings of the 4th ACM SIGCHI symposium on engineering interactive computing systems, pp. 15–24
8. Granell E, Martínez-Hinarejos CD (2015) Multimodal output combination for transcribing historical handwritten documents. In: International conference on computer analysis of images and patterns, pp. 246–260. Springer
9. Granell E, Martínez-Hinarejos CD, Romero V (2018) Improving transcription of manuscripts with multimodality and interaction. In: Proceedings of IberSPEECH, pp. 92–96
10. Graves A, Fernández S, Gomez F, Schmidhuber J (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd international conference on machine learning, pp. 369–376. New York, USA
11. Iñesta JM, Ponce de León PJ, Rizo D, Oncina J, Micó L, Rico-Juan JR, Pérez-Sancho C, Pertusa A (2018) Hispamus: handwritten Spanish music heritage preservation by automatic transcription. In: 1st International workshop on reading music systems, pp. 17–18
12. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: 3rd International conference on learning representations. San Diego, USA
13. Kristensson PO, Vertanen K (2011) Asynchronous multimodal text entry using speech and gesture keyboards. In: Twelfth annual conference of the international speech communication association
14. Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10(8):707–710
15. Miki M, Kitaoka N, Miyajima C, Nishino T, Takeda K (2014) Improvement of multimodal gesture and speech recognition performance using time intervals between gestures and accompanying speech. EURASIP J Audio Speech Music Process 2014(1):1–7
16. Pitsikalis V, Katsamanis A, Theodorakis S, Maragos P (2017) Multimodal gesture recognition via multiple hypotheses rescoring. In: Escalera S, Guyon I, Athitsos V (eds) Gesture recognition. Springer, Cham, pp. 467–496
17. Rebelo A, Fujinaga I, Paszkiewicz F, Marcal AR, Guedes C, Cardoso JS (2012) Optical music recognition: state-of-the-art and open issues. Int J Multimed Inf Retr 1(3):173–190
18. Román MA, Pertusa A, Calvo-Zaragoza J (2020) Data representations for audio-to-score monophonic music transcription. Exp Syst Appl 162:113769
19. Román M, Pertusa A, Calvo-Zaragoza J (2019) A holistic approach to polyphonic music transcription with neural networks. In: Proceedings of the 20th international society for music information retrieval conference, pp. 731–737. Delft, The Netherlands
20. Schedl M, Gómez E, Urbano J (2014) Music information retrieval: recent developments and applications. Found Trends Inf Retr 8:127–261. https://doi.org/10.1561/1500000042
21. Serra X, Magas M, Benetos E, Chudy M, Dixon S, Flexer A, Gómez E, Gouyon F, Herrera P, Jordà S, et al (2013) Roadmap for music information research. The MIReS Consortium. Creative Commons BY-NC-ND 3.0 license
22. Simonetta F, Ntalampiras S, Avanzini F (2019) Multimodal music information processing and retrieval: survey and future challenges. In: International workshop on multilayer music representation and processing, pp. 10–18
23. Singh A, Sangwan A, Hansen JHL (2012) Improved parcel sorting by combining automatic speech and character recognition. In: 2012 IEEE international conference on emerging signal processing applications, pp. 52–55. https://doi.org/10.1109/ESPA.2012.6152444
24. Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147(1):195–197
25. Toselli AH, Vidal E, Casacuberta F (2011) Multimodal interactive pattern recognition and applications. Springer Science & Business Media, Berlin

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
