Full-Band Quasi-Harmonic Analysis and Synthesis of Musical Instrument Sounds with Adaptive Sinusoids

Marcelo Caetano; George P. Kafentzis; Athanasios Mouchtaris; Yannis Stylianou

doi:10.3390/app6050127

Full-Band Quasi-Harmonic Analysis and Synthesis of Musical Instrument Sounds with Adaptive Sinusoids

Caetano, Marcelo;Kafentzis, George P.;Mouchtaris, Athanasios;Stylianou, Yannis 2016-05-02 00:00:00 applied sciences Article Full-Band Quasi-Harmonic Analysis and Synthesis of Musical Instrument Sounds with Adaptive Sinusoids 1, 2 2,3 2 Marcelo Caetano *, George P. Kafentzis , Athanasios Mouchtaris and Yannis Stylianou Sound and Music Computing Group, Institute for Systems and Computer Engineering, Technology and Science (INESC TEC), 4200-465 Porto, Portugal Multimedia Informatics Lab, Department of Computer Science, University of Crete, 700-13 Heraklion, Greece; kafentz@csd.uoc.gr (G.P.K.); mouchtar@ics.forth.gr (A.M.); yannis@csd.uoc.gr (Y.S.) Signal Processing Laboratory, Institute of Computer Science, Foundation for Technology & Research-Hellas (FORTH), 700-13 Heraklion, Greece * Correspondence: mcaetano@inesctec.pt; Tel.: +351-22-209-4217 Academic Editor: Vesa Valimaki Received: 16 February 2016; Accepted: 19 April 2016; Published: 2 May 2016 Abstract: Sinusoids are widely used to represent the oscillatory modes of musical instrument sounds in both analysis and synthesis. However, musical instrument sounds feature transients and instrumental noise that are poorly modeled with quasi-stationary sinusoids, requiring spectral decomposition and further dedicated modeling. In this work, we propose a full-band representation that ﬁts sinusoids across the entire spectrum. We use the extended adaptive Quasi-Harmonic Model (eaQHM) to iteratively estimate amplitude- and frequency-modulated (AM–FM) sinusoids able to capture challenging features such as sharp attacks, transients, and instrumental noise. We use the signal-to-reconstruction-error ratio (SRER) as the objective measure for the analysis and synthesis of 89 musical instrument sounds from different instrumental families. We compare against quasi-stationary sinusoids and exponentially damped sinusoids. First, we show that the SRER increases with adaptation in eaQHM. Then, we show that full-band modeling with eaQHM captures partials at the higher frequency end of the spectrum that are neglected by spectral decomposition. Finally, we demonstrate that a frame size equal to three periods of the fundamental frequency results in the highest SRER with AM–FM sinusoids from eaQHM. A listening test conﬁrmed that the musical instrument sounds resynthesized from full-band analysis with eaQHM are virtually perceptually indistinguishable from the original recordings. Keywords: musical instruments; analysis and synthesis; sinusoidal modeling; AM–FM sinusoids; adaptive modeling; nonstationary sinusoids; full-band modeling PACS: 43.75.Zz; 43.75.De; 43.75.Ef; 43.75.Fg; 43.75.Gh; 43.75.Kk; 43.75.Mn; 43.75.Pq; 43.75.Qr 1. Introduction Sinusoidal models are widely used in the analysis [1,2], synthesis [2,3], and transformation [4,5] of musical instrument sounds. The musical instrument sound is modeled by a waveform consisting of a sum of time-varying sinusoids parameterized by their amplitudes, frequencies, and phases [1–3]. Sinusoidal analysis consists of the estimation of parameters, synthesis comprises techniques to retrieve a waveform from the analysis parameters, and transformations are performed as changes of the parameter values. The time-varying sinusoids, called partials, represent how the oscillatory modes of the musical instrument change with time, resulting in a ﬂexible representation with perceptually meaningful parameters. The parameters completely describe each partial, which can be manipulated independently. Appl. Sci. 2016, 6, 127; doi:10.3390/app6050127 www.mdpi.com/journal/applsci Appl. Sci. 2016, 6, 127 2 of 20 Several important features can be directly estimated from the analysis parameters, such as fundamental frequency, spectral centroid, inharmonicity, spectral ﬂux, onset asynchrony, among many others [2]. The model parameters can also be used in musical instrument classiﬁcation, recognition, and identification [6], vibrato detection [7], onset detection [8], source separation [9], audio restoration [10], and audio coding [11]. Typical transformations are pitch shifting, time scaling [12], and musical instrument sound morphing [5]. Additionally, the parameters from sinusoidal models can be used to estimate alternative representations of musical instrument sounds, such as spectral envelopes [13] and the source-ﬁlter model [14,15]. The quality of the representation is critical and can impact the results for the above applications. In general, sinusoidal models render a close representation of musical instrument sounds because most pitched musical instruments are designed to present very clear modes of vibration [16]. However, sinusoidal models do not result in perfect reconstruction upon resynthesis, leaving a modeling residual that contains whatever was not captured by the sinusoids [17]. Musical instrument sounds have particularly challenging features to represent with sinusoids, such as sharp attacks, transients, inharmonicity, and instrumental noise [16]. Percussive sounds produced by plucking strings (such as harpsichords, harps, and the pizzicato playing technique) or striking percussion instruments (such as drums, idiophones, or the piano) feature sharp onsets with highly nonstationary oscillations that die out very quickly, called transients [18]. Flute sounds characteristically comprise partials on top of breathing noise [16]. The reed in woodwind instruments presents a highly nonlinear behavior that also results in attack transients [19], while the stiffness of piano strings results in a slightly inharmonic spectrum [18]. The residual from most sinusoidal representations of musical instrument sounds contains perceptually important information [17]. However, the extent of this information ultimately depends on what the sinusoids are able to capture. The standard sinusoidal model (SM) [1,20] was developed as a parametric extension of the short-time Fourier transform (STFT) so both analysis and synthesis present the same time-frequency limitations as the Discrete Fourier Transform (DFT) [21]. The parameters are estimated with well-known techniques, such as peak-picking and parabolic interpolation [20,22], and then connected across overlapping frames (partial tracking [23]). Peak-picking is known to bias the estimation of parameters because errors in the estimation of frequencies can bias the estimation of amplitudes [22,24]. Additionally, the inherent time-frequency uncertainty of the DFT further limits the estimation because long analysis windows blur the temporal resolution to improve the frequency resolution and vice-versa [21]. The SM uses quasi-stationary sinusoids (QSS) under the assuption that the partials are relatively stable inside each frame. QSS can accurately capture the lower frequencies because these have fewer periods inside each frame and thus less temporal variation. However, higher frequencies have more periods inside each frame with potentially more temporal variation lost by QSS. Additionally, the parameters of QSS are estimated using the center of the frame as the reference and the values are less accurate towards the edges because the DFT has a stationary basis [25]. This results in the loss of sharpness of attack known as pre-echo. The lack of transients and noise is perceptually noticeable in musical instrument sounds represented with QSS [17,26]. Serra and Smith [1] proposed to decompose the musical instrument sound into a sinusoidal component represented with QSS and a residual component obtained by subtraction of the sinusoidal component from the original recording. This residual is assumed to be noise not captured by the sinusoids and commonly modeled by ﬁltering white noise with a time-varying ﬁlter that emulates the spectral characteristics of the residual component [1,17]. However, the residual contains both errors in parameter estimation and transients plus noise missed by the QSS [27]. The time-frequency resolution trade-off imposes severe limits on the detection of transients with the DFT. Transients are essentially localized in time and usually require shorter frames which blur the peaks in the spectrum. Daudet [28] reviews several techniques to detect and extract transients with sinusoidal models. Multi-resolution techniques [29,30] use multiple frame sizes to circumvent Appl. Sci. 2016, 6, 127 3 of 20 the time-frequency uncertainty and to detect modulations at different time scales. Transient modeling synthesis (TMS) [26,27,31] decomposes sounds into sinusoids plus transients plus noise and models each separately. TMS performs sinusoidal plus residual decomposition with QSS and then extracts the transients from the residual. An alternative to multiresolution techniques is the use of high-resolution techniques based on total least squares [32] such as ESPRIT [33], MUSIC [34], and RELAX [35] to ﬁt exponentially damped sinusoids (EDS). EDS are widely used to represent musical instrument sounds [11,36,37]. EDS are sinusoids with stationary (i.e., constant) frequencies modulated in amplitude by an exponential function. The exponentially decaying amplitude envelope from EDS is considered suitable to represent percussive sounds when the beginning of the frame is synchronized with the onsets [38]. However, EDS requires additional partials when there is no synchronization, which increases the complexity of the representation. ESPRIT decomposes the signal space into sinusoidal and residual, further ranking the sinusoids by decreasing magnitude of eigenvalue (i.e., spectral energy). Therefore, the ﬁrst K sinusoids maximize the energy upon resynthesis regardless of their frequencies. Both the SM and EDS rely on sinusoids with stationary frequencies, which are not appropriate to represent nonstationary oscillations [21]. Time-frequency reassignment [39–41] was developed to estimate nonstationary sinusoids. Polynomial phase signals [20,25] such as splines [21] are commonly used as an alternative to stationary sinusoids. McAulay and Quatieri [20] were among the ﬁrst to interpolate the phase values estimated at the center of the analysis window across frames with cubic polynomials to obtain nonstationary sinusoids inside each frame. Girin et al. [42] investigated the impact of the order of the polynomial used to represent the phase and concluded that order ﬁve does not improve the modeling performance sufﬁciently to justify the increased complexity. However, even nonstationary sinusoids leave a residual with perceptually important information that requires further modeling [25]. Sinusoidal models rely on spectral decomposition assuming that the lower end of the spectrum can be modeled with sinusoids while the higher end essentially contains noise. The estimation of the separation between the sinusoidal and residual components has proved difﬁcult [27]. Ultimately, spectral decomposition misses partials on the higher end of the spectrum because the separation is artiﬁcial, depending on the spectrum estimation method rather than the spectral characteristics of musical instrument sounds. We consider spectral decomposition to be a consequence of artifacts from previous sinusoidal models instead of an acoustic property of musical instruments. Therefore, we propose the full-band modeling of musical instrument sounds with adaptive sinusoids as an alternative to spectral decomposition. Adaptive sinusoids (AS) are nonstationary sinusoids estimated to ﬁt the signal being analyzed usually via an iterative parameter re-estimation process. AS have been used to model speech [43–46] and musical instrument sounds [25,47]. Pantazis [45,48] developed the adaptive Quasi-Harmonic Model (aQHM), which iteratively adapts the frequency trajectories of all sinusoids at the same time based on the Quasi-Harmonic Model (QHM). Adaptation improves the ﬁt of a spectral template via an iterative least-squares (LS) parameter estimation followed by frequency correction. Later, Kafentzis [43] devised the extended adaptive Quasi-Harmonic Model (eaQHM), capable of adapting both amplitude and frequency trajectories of all sinusoids iteratively. In eaQHM, adaptation is equivalent to the iterative projection of the original waveform onto nonstationary basis functions that are locally adapted to the time-varying characteristics of the sound, capable of modeling sudden changes such as sharp attacks, transients, and instrumental noise. In a previous work [47], we showed that eaQHM is capable of retaining the sharpness of the attack of percussive sounds. In this work, we propose full-band modeling with eaQHM for a high-quality analysis and synthesis of isolated musical instrument sounds with a single component. We compare our method to QSS estimated with the standard SM [20] and EDS estimated with ESPRIT [36]. In the next section, we discuss the differences in full-band spectral modeling and traditional decomposition for musical instrument sounds. Next, we describe the full-band quasi-harmonic adaptive sinusoidal modeling Appl. Sci. 2016, 6, 127 4 of 20 behind eaQHM. Then, we present the experimental setup, describe the musical instrument sound database used in this work and the analysis parameters. We proceed to the experiments, present the results, and evaluate the performance of QSS, EDS, and eaQHM in modeling musical instrument sounds. Finally, we discuss the results and present conclusions and perspectives for future work. 2. Full-Band Modeling Spectrum decomposition splits the spectrum of musical instrument sounds into a sinusoidal component and a residual as illustrated in Figure 1a. Spectrum decomposition assumes that there are partials only up to a certain cutoff frequency f , above which there is only noise. Figure 1a represents the spectral peaks as spikes on top of colored noise (wide light grey frequency bands) and f as the separation between the sinusoidal and residual components. Therefore, f determines the number of sinusoids because only the peaks at the lower frequency end of the spectrum are represented with sinusoids (narrow dark grey bars) and the rest is considered wide-band and stochastic noise existing across the whole range of the spectrum. There is noise between the spectral peaks and at the higher end of the spectrum. In a previous study [17], we showed that the residual from the SM is perceptually different from ﬁltered (colored) white noise. Figure 1a shows that there are spectral peaks left in the residual because the spectral peaks above f are buried under the estimation noise ﬂoor (and sidelobes). Consequently, the residual from sinusoidal models that rely on spectral decomposition such as the SM is perceptually different from ﬁltered white noise. PSD PSD Fs/2 Fs/2 Sinusoidal Residual Full-Band Harmonic Template Spectral Peaks Noise Sinusoids Frequency Frequency (a) Spectrum Decomposition (b) Full-Band Harmonic Template Figure 1. Illustration of the spectral decomposition and full-band modeling paradigms. From an acoustic point of view, the physical behavior of musical instruments can be modeled as the interaction between an excitation and a resonator (the body of the instrument) [16]. This excitation is responsible for the oscillatory modes whose amplitudes are shaped by the frequency response of the resonator. The excitation signal commonly contains discontinuities, resulting in wide-band spectra. For instance, the vibration of the reed in woodwinds can be approximated by a square wave [49], the friction between the bow and the strings results in an excitation similar to a sawtooth wave [16], the strike in percussion instruments can be approximated by a pulse [2], while the vibration of the lips in brass instruments results in a sequence of pulses [50] (somewhat similar to the glottal excitation, which is also wide band [46]). Figure 1b illustrates a full-band harmonic template spanning the entire frequency range, ﬁtting sinusoids to spectral peaks in the vicinity of harmonics of the fundamental frequency f . The spectrum of musical instruments is known to present deviations from perfect harmonicity [16], but quasi-harmonicity is supported by previous studies [51] that found deviations as small as 1%. In this work, the full-band harmonic template becomes quasi-harmonic after the estimation of parameters via least-squares followed by a frequency correction mechanism (see details in Section 3.1). Therefore, full-band spectral modeling assumes that both the excitation and the instrumental noise are wide band. Appl. Sci. 2016, 6, 127 5 of 20 3. Adaptive Sinusoidal Modeling with eaQHM In what follows, x (n) is the original sound waveform and x ˆ (n) is the sinusoidal model with sample index n. Then, the following relation holds: x (n) = x ˆ (n) + e (n) , (1) where e (n) is the modeling error or residual. Each frame of x (n) is x (n, m) = x (n) w (n mH) , m = 0, , M 1, (2) where m is the frame number, M is the number of frames, and H is the hop size. The analysis window w (n) has L samples and it deﬁnes the frame size. Typically, H < L such that the frames m overlap. Figure 2 presents an overview of the modeling steps in eaQHM. The feedback loop illustrates the adaptation cycle, where x ˆ (n) gets closer to x (n) with each iteration. The iterative process stops when the ﬁt improves by less than a threshold #. The dark blocks represent parameter estimation based on the quasi-harmonic model (QHM), followed by interpolation of the parameters across frames before additive [1] resynthesis (instead of overlap add (OLA) [52]). The resulting time-varying sinusoids are used as nonstationary basis functions for the next iteration, so the adaptation procedure illustrated in Figure 3 iteratively projects x (n) onto x (n). Next, QHM is summarized, followed by parameter interpolation and then eaQHM. Least Frequency Parameter Resynthesis x(t) Windowing x(t) Squares Interpolation Correction Time-Varying Basis Functions Figure 2. Block diagram depicting the modeling steps in the extended adaptive Quasi-Harmonic Model (eaQHM). The blocks with a dark background correspond to parameter estimation, while the feedback loop illustrates adaptation as iteration cycles around the loop. See text for the explanation of the symbols. F Frequency F Parameter F correction Interpolation Iteration 1 T T T Original Model becomes Basis Basis Model Parameter F F F Frequency Window Interpolation correction Iteration 2 T T Figure 3. Illustration of the adaptation of the frequency trajectory of a sinusoidal partial inside the analysis window in eaQHM. The ﬁgure depicts the ﬁrst and second iterations of eaQHM around the loop in Figure 2, showing local adaptation as the iterative projection of the original waveform onto the model. 3.1. The Quasi-Harmonic Model (QHM) j2pn /f k s ˆ QHM [48] projects x (n, m) onto a template of sinusoids e with constant frequencies f and sampling frequency f . QHM estimates the parameters of x ˆ (n, m) using s Appl. Sci. 2016, 6, 127 6 of 20 j2p f n/ f x ˆ n, m = a + nb e , (3) ( ) ( ) å k k k=K where k is the partial number, K is the number of real sinusoids, a the complex amplitude and b is k k th j2pn /f k s the complex slope of the k sinusoid. The term nb arises from the derivative of e with respect to frequency. The constant frequencies f deﬁne the spectral template used by QHM to ﬁt the analysis parameters a and b by least-squares (LS) [44,45]. In principle, any set of frequencies f can be used k k k because the estimation of a and b also provides a means of correcting the initial frequency values f k k k by making f converge to nearby frequencies f present in the signal frame. The mismatch between k k ˆ ˆ f and f leads to an estimation error h = f f . Pantazis et al. [48] showed that QHM provides k k k k k an estimate of h given by f Refa g Imfb g Imfa g Refb g s k k k k h ˆ = , (4) 2p ja j which corresponds to the frequency correction block in Figure 2. Then x (n, m) is locally synthesized as ˆ ˆ j 2pF n/ f +f ( s ) k k ˆ ˆ x (n, m) = a e , (5) å k k=K ˆ ˆ ˆ 6 where a ˆ = ja j, F = f + h ˆ , and f = a are constant inside the frame m. k k k k k k k The full-band harmonic spectral template shown in Figure 1b is obtained by setting f = k f with k 0 k an integer and 1 k /2 f . The f is not necessary to estimate the parameters, but it improves 0 0 the ﬁt because the initial full-band harmonic template approximates better the spectrum of isolated quasi-harmonic sounds. QHM assumes that the sound being analyzed contains a single source, so, for isolated notes from pitched musical instruments, a constant f is used across all frames m. 3.2. Parameter Interpolation across Frames ˆ ˆ The model parameters a ˆ , F , and f from Equation (5) are estimated as samples at the frame k k k 1 ˆ rate /H of the amplitude- and frequency-modulation (AM–FM) functions a ˆ n and f n = ( ) ( ) k k 2p ˆ ˆ /f F (n) + f , which describe, respectively, the long-term amplitude and frequency temporal k k variation of each sinusoid k. For each frame m, a ˆ t, m and F t, m are estimated using the sample ( ) ( ) k k index at the center of the frame n = t as reference. Resynthesis of x ˆ (n, m) requires a ˆ (n, m) and F n, m at the signal sampling rate f . Equation (5) uses constant values, resulting in locally stationary ( ) sinusoids with constant amplitudes and frequencies inside each frame m. However, the parameter values might vary across frames, resulting in discontinuities such as a ˆ (t, m) 6= a ˆ (t, m + 1) due to temporal variations happening at the frame rate /H. k k OLA resynthesis [52] uses the analysis window w (n) to taper discontinuities at the frame boundaries by resynthesizing x ˆ (n, m) = x ˆ (n) w (n) for each m similarly to Equation (2) and then overlap-adding ˆ ˆ x (n, m) across m to obtain x (n). Additive synthesis is an alternative to OLA that results in smoother temporal variation [20] by ˆ ˆ ﬁrst interpolating a (t, m) and f (t, m) across m and then summing over k. In this case, a (n) is k k k obtained by linear interpolation of a ˆ (t, m) and a ˆ (t, m + 1). Recursive calculation across m results in k k a piece-wise linear approximation of a (n). F (n) is estimated via piece-wise polynomial interpolation k k ˆ ˆ ˆ of F (t, m) across m with quadratic splines, and f (n) is obtained integrating F (n) in two steps k k k ˆ ¯ because f (t, m) is wrapped around 2p across m. First, f (n) is calculated as k k m+1 2p ¯ ˆ f (n) = f (t, m) + F (u) . (6) k k å k u=m Appl. Sci. 2016, 6, 127 7 of 20 ¯ ¯ ˆ The calculation of f n using Equation (6) does not guarantee that f t, m + 1 = f t, m + 1 + 2pP, ( ) ( ) ( ) k k k with P the closest integer to unwrap the phase (see details in [45]). Thus, f (n) is calculated as m+1 2p p (u mt) ˆ ˆ ˆ f (n) = f (t, m) + F (u) + g sin , (7) k k å k f (m + 1) t mt u=m where the term given by the sine function ensures continuity with f (t, m + 1) when g is ˆ ¯ p f (t, m + 1) + P f (t, m + 1) k k g = , (8) 2 (m + 1) t mt ˆ ¯ with P given byjf (t, m + 1) f (t, m + 1)j (see [45]). k k 3.3. The Extended Adaptive Quasi-Harmonic Model (eaQHM) Pantazis et al. [45] proposed adapting the phase of the sinusoids. The adaptive procedure applies LS, frequency correction, and frequency interpolation iteratively (see Figure 2), projecting x (n, m) onto x ˆ (n, m). Figure 3 shows the ﬁrst and second iterations to illustrate adaptation of one sinusoid. Kafentzis et al. [43] adapted both the instantaneous amplitude and the instantaneous phase of x ˆ (n, m) with a similar iterative procedure in eaQHM. The analysis stage uses jF (n,m) ˆ k x ˆ (n, m) = (a + nb ) A (n, m) e , (9) å k k k k=K where A (n, m) and F (n, m) are functions of the time-varying instantaneous amplitude and phase of k k each sinusoid, respectively [43,45], obtained from the parameter interpolation step and deﬁned as a ˆ (n) A (n, m) = , (10a) a (t, m) ˆ ˆ ˆ F (n, m) = f (n) f (t, m) , (10b) k k k where a (n) is the piece-wise linear amplitude and f (n) is estimated using Equation (7). Finally, k k eaQHM models x (n) as a set of amplitude and frequency modulated nonstationary sinusoids given by jf (n) k,i1 x ˆ (n) = a ˆ (n) e , (11) i å k,i1 k=K where a ˆ (n) and f (n) are the instantaneous amplitude and phase from the previous iteration k,i1 k,i1 i 1. Adaptation results from the iterative projection of x (n) onto x ˆ (n) from i 1 as the model x ˆ (n) are used as nonstationary basis functions locally adapted to the time-varying behavior of x n . Note that ( ) jF (n,m) ˆ k Equation (9) is simply Equation (3) with a nonstationary basis A (n, m) e . In fact, Equation (9) represents the next parameter estimation step, which will be again followed by frequency correction as in Figure 2. The convergence criterion for eaQHM is either a maximum number of iterations i or an adaptation threshold # calculated as i1 i SRER SRER < #, (12) i1 SRER where the signal-to-reconstruction-error ratio (SRER) is calculated as RMS (x) RMS (x) SRER = 20 log = 20 log . (13) 10 10 RMS (x x ˆ) RMS (e) Appl. Sci. 2016, 6, 127 8 of 20 The SRER measures the ﬁt between the model x ˆ n and the original recording x n by dividing ( ) ( ) the total energy in x (n) by the energy in the residual e (n). The higher the SRER, the better the ﬁt. Note that # stops adaptation whenever the ﬁt does not improve from iteration i 1 to i regardless of the absolute SRER value. Thus, even sounds from the same instruments can reach different SRER. 4. Experimental Setup We now investigate the full-band representation of musical instrument sounds and the nonstationarity of the adaptive AM–FM sinusoids from eaQHM. We aim to show that spectral decomposition fails to capture partials at the higher end of the spectrum so full-band quasi-harmonic modeling increases the quality of analysis and resynthesis by capturing sinusoids across the full range of the spectrum. Additionally, we aim to show that adaptive AM–FM sinusoids from eaQHM capture nonstationary partials inside the frame. We compare full-band modeling with eaQHM against the SM [1,20] and EDS estimated with ESPRIT [36] using the same number of partials K. We assume that the musical instrument sounds under investigation can be well represented as quasi-harmonic. Thus, we set K to the highest harmonic number k below Nyquist frequency /2 or equivalently the max highest integer K that satisﬁes K f s/2. The fundamental frequency f of all sounds was estimated 0 0 using the sawtooth waveform inspired pitch estimator (SWIPE) [53] because in the experiments the frame size L, the maximum number of partials K , and the full-band harmonic template depend on max f . In the SM, K is the number of spectral peaks modeled by sinusoids. For EDS, ESPRIT uses K to determine the separation between the dimension of the signal space (sinusoidal component) and of the residual. The SM is considered the baseline for comparison due to the quasi-stationary nature of the sinusoids and the need for spectral decomposition. EDS estimated with ESPRIT is considered the state-of-the-art due to the accurate analysis and synthesis and constant frequency of EDS inside the frame m. We present a comparison of the local and global SRER as a function of K and L for the SM and EDS against eaQHM in two experiments. In experiment 1, we vary K from 2 to K max and record the SRER. In experiment 2, we vary L from 3T f to 8T f samples and record the SRER, s s 0 0 where T = /f is the fundamental period. The local SRER is calculated within the ﬁrst frame m = 0, 0 0 where we expect the attack transients to be. The ﬁrst frame is centered at the onset with t = 0 (and the ﬁrst half is zero-padded), so artifacts such as pre-echo (in the ﬁrst half of the frame) are also expected to be captured by the local SRER. The global SRER is calculated across all frames, thus considering the whole sound signal x ˆ n . Next, we describe the musical instrument sounds modeled and the selection ( ) of parameter values for the algorithms. 4.1. The Musical Instrument Sound Dataset In total, 92 musical instrument sounds were selected. “Popular ” and “Keyboard” musical instruments are from the RWC Music Database: Musical Instrument Sound [54]. All other sounds are from the Vienna Symphonic Library [55] database of musical instrument samples. Table 1 lists the musical instrument sounds used. The recordings were chosen to represent the range of musical instruments commonly found in traditional Western orchestras and in popular recordings. Some instruments feature different registers (alto, baritone, bass, etc). All sounds used belong to the same pitch class (C), ranging in pitch height from C2 (f 0 65 Hz) to C6 (f 0 1046 Hz). The dynamics is indicated as forte (“f”) or fortissimo (“ff”), and the duration of most sounds is less than 2 s. Normal attack (“na”) and no vibrato (“nv”) were chosen whenever available. Presence of vibrato (“vib”), progressive attack (“pa”), and slow attack (“sa”) are indicated, as well as different playing modes such as staccato (“stacc”), sforzando (“sforz”), and pizzicato (“pz”), achieved by plucking string instruments. Extended techniques were also included, such as tongue ram (“tr ”) for the ﬂute, près de la table (“pdlt”) for the harp, muted (“mu”) strings, and bowed idiophones (vibraphone, xylophone, etc.) for short (“sh”) and long (“lg”) sounds. Appl. Sci. 2016, 6, 127 9 of 20 Different mallet materials such as metal (“met”), plastic (“pl”), and wood (“wo”) and hardness such as soft (“so”), medium (“med”), and hard (“ha”) are indicated. Table 1. Musical instrument sounds used in all experiments. See text in Section 4.1 for a description of the terms in brackets. Sounds in bold were used in the listening test described in Section 6. The quasi-harmonic model (QHM) failed for the sounds in italics marked *. Family Musical Instrument Sounds Bass Trombone (C3 f nv na), Bass Trombone (C3 f stac), Bass Trumpet (C3 f na vib), Cimbasso (C3 f nv na), Cimbasso (C3 f stac), Contrabass Trombone* (C2] f stac), Contrabass Tuba (C3 f na), Contrabass Tuba (C3 f stac), Cornet (C4 f), French Horn Brass (C3 f nv na), French Horn (C3 f stac), Piccolo Trumpet (C5 f nv na), Piccolo Trumpet (C5 f stac), Tenor Trombone (C3 f na vib), Tenor Trombone (C3 f nv sa), Tenor Trombone (C3 f stac), C Trumpet (C4 f nv na), C Trumpet (C4 f stac), Tuba (C3 f vib na), Tuba (C3 f stac), Wagner Tuba (C3 f na), Wagner Tuba (C3 f stac) Alto Flute (C4 f vib na), Bass Clarinet (C3 f na), Bass Clarinet (C3 f sforz), Bass Clarinet (C3 f stac), Bassoon (C3 f na), Bassoon (C3 f stac), Clarinet (C4 f na), Clarinet Woodwinds (C4 f stac), Contra Bassoon* (C2 f stac), Contra Bassoon* (C2 f sforz), English Horn (C4 f na), English Horn (C4 f stac), Flute (C4 f nv na), Flute (C4 f stac), Flute (C4 f tr), Flute (C4 f vib na), Oboe 1 (C4 f stac), Oboe 2 (C4 f nv na), Oboe (C4 f pa), Piccolo Flute (C6] f vib sforz), Piccolo Flute (C6 f nv ha ff) Plucked Cello (C3 f pz vib), Harp (C3 f), Harp (C3 f pdlt), Harp (C3 f mu), Viola (C3 f pz vib), Strings Violin (C4 f pz mu) Bowed Cello (C3 f vib), Cello (C3 f stac), Viola (C3 f vib), Viola (C4 f stac), Violin (C4 f), Strings Violin (C4] ff vib), Violin (C4 f stac) Glockenspiel (C4 f), Glockenspiel (C6 f wo), Glockenspiel (C6 f pl), Glockenspiel (C6 f met), Marimba (C4 f), Vibraphone (C4 f ha 0), Vibraphone (C4 f ha fa), Struck Vibraphone (C4 f ha sl), Vibraphone (C4 f med 0), Vibraphone (C4 f med fa), Percussion Vibraphone (C4 f med 0 mu), Vibraphone (C4 f med sl), Vibraphone (C4 f so 0), Vibraphone (C4 f so fa), Xylophone (C5 f GA L), Xylophone (C5 met), Xylophone (C5 f HO L), Xylophone (C5 f mP L), Xylophone (C5 f wP L) Bowed Vibraphone (C4 f sh vib), Vibraphone (C4 f sh nv), Vibraphone (C4 f lg nv) Percussion Accordion (C3] f), Acoustic Guitar (C3 f), Baritone Sax (C3 f), Bass Harmonica Popular (C3] f), Chromatic Harmonica (C4 f), Classic Guitar (C3 f), Mandolin (C4 f), Pan Flute (C5 f), Tenor Sax (C3] f), Ukulele (C4 f) Keyboard Celesta (C3 f na nv), Celesta (C3 f stac), Clavinet (C3 f), Piano (C3 f) In what follows, we will present the results for 89 sounds because QHM failed to adapt for the three sounds marked * in Table 1. The estimation of parameters for QHM uses LS [45]. The matrix inversion fails numerically when the matrix is close to singular (see [44]). The fundamental frequency (C2 65 Hz) of these sounds determines a full-band harmonic spectral template whose frequencies are separated by C2, which results in singular matrices. 4.2. Analysis Parameters The parameter estimation for the SM follows [20] with a Hann window for analysis, and phase interpolation across frames via cubic splines followed by additive resynthesis. The estimation of parameters for EDS uses ESPRIT with a rectangular window for analysis and OLA resynthesis [36]. Parameter estimation in eaQHM used a Hann window for analysis and additive resynthesis following Equation (11). In all experiments, # in Equation (12) is set to 0.01 and f = 16 kHz for all sounds. The step size for analysis (and OLA synthesis) was H = 16 samples for all algorithms, corresponding Appl. Sci. 2016, 6, 127 10 of 20 to 1 ms. The frame size is L = qT f samples with q an integer. The size of the FFT for the SM is kept 0 s constant at N = 4096 samples with zero padding. 5. Results and Discussion 5.1. Adaptation Cycles in eaQHM Figure 4 shows the global and local SRER as a function of the number of adaptation cycles (iterations). Each plot was averaged across the sounds indicated, while the plot “all instruments” is an average of the previously shown. The SRER increases quickly after a few iterations, slowly converging to a ﬁnal value considerably higher than before adaptation. Iteration 0 corresponds to QHM initialized with the full-band harmonic template, thus Figure 4 demonstrates that the adaptation of the sinusoids by eaQHM increases the SRER when compared to QHM. Brass Woodwinds 70 80 Global Global Local Local 60 70 50 60 40 50 30 40 20 30 10 20 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 Number of Adaptations Number of Adaptations Bowed Strings Plucked Strings 80 70 Global Global Local Local 10 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 Number of Adaptations Number of Adaptations Struck Percussion Popular 65 65 Global Global 60 Local Local 20 15 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 Number of Adaptations Number of Adaptations Figure 4. Cont. SRER (dB) SRER (dB) SRER (dB) SRER (dB) SRER (dB) SRER (dB) Appl. Sci. 2016, 6, 127 11 of 20 Keyboard All Instruments 60 70 Global Global 55 Local Local 45 50 30 30 15 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 Number of Adaptations Number of Adaptations Figure 4. Plot of the signal-to-reconstruction-error ratio (SRER) as a function of number of adaptations to illustrate how adaptation increases the SRER in eaQHM. Iteration 0 corresponds to QHM initialized with the full-band harmonic spectral template. 5.2. Experiment 1: Variation Across K (Constant L = 3T f ) We ran each algorithm varying K (the frame size was kept at L = 3T f ) and recorded the resulting 0 s local and global SRER values. We started from K = 2 and increased K by two partials up to K . max Figure 5 shows the local and global SRER (averaged across sounds) as a function of K for the SM, EDS, and eaQHM. Sounds with different f values have different K . Figure 5 shows that the addition of max partials for the SM does not result in an increase in SRER after a certain K. EDS tends to continuously increase the SRER with more partials that capture more spectral energy. Finally, eaQHM increases the SRER up to K . max Brass C3 Woodwinds C3 Bowed Strings C4 70 70 50 SM SM SM EDS EDS EDS 60 60 eaQHM eaQHM eaQHM 50 50 40 40 30 30 20 20 10 10 0 0 0 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 5 10 15 20 25 30 35 Number of Partials Number of Partials Number of Partials Brass C3 Woodwinds C3 Bowed Strings C4 80 80 80 SM SM SM 70 EDS 70 EDS EDS eaQHM eaQHM eaQHM 60 60 50 50 40 40 30 30 20 20 10 10 0 0 10 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 5 10 15 20 25 30 35 Number of Partials Number of Partials Number of Partials Figure 5. Cont. Global SRER (dB) Local SRER (dB) SRER (dB) Global SRER (dB) Local SRER (dB) SRER (dB) Global SRER (dB) Local SRER (dB) Appl. Sci. 2016, 6, 127 12 of 20 Plucked Strings C3 Struck Percussion C4 Keyboard C3 80 70 60 SM SM SM 70 EDS 60 EDS EDS eaQHM eaQHM eaQHM 60 50 50 40 40 30 30 20 20 10 10 0 0 −10 −10 0 10 20 30 40 50 60 70 5 10 15 20 25 30 35 0 10 20 30 40 50 60 70 Number of Partials Number of Partials Number of Partials Plucked Strings C3 Struck Percussion C4 Keyboard C3 80 70 60 SM SM SM 70 EDS 60 EDS EDS eaQHM eaQHM eaQHM 60 50 50 40 40 30 30 30 20 20 10 10 0 0 −10 0 0 10 20 30 40 50 60 70 5 10 15 20 25 30 35 0 10 20 30 40 50 60 70 Number of Partials Number of Partials Number of Partials Figure 5. Comparison between local and global SRER as a function of the number of partials for the three models (the standard sinusoidal model (SM), exponentially damped sinusoids (EDS), and eaQHM). The bars around the mean are the standard deviation across different sounds from the family indicated. The distributions are not symmetrical as suggested by the bars. The SM, EDS, and eaQHM use different analysis and different synthesis methods, which partially explains the different behavior under variation of K. More importantly, the addition of partials for each algorithm uses different criteria. Both the SM and EDS use spectral energy as a criterion, while eaQHM uses the frequencies of the sinusoids assuming quasi-harmonicity. In the SM, a new sinusoid is selected as the next spectral peak (increasing frequency) with spectral energy above a selected threshold regardless of the frequency of the peak. In fact, the frequency is estimated from the peak afterwards. For EDS, K determines the number of sinusoids used upon resynthesis. However, ESPRIT ranks the sinusoids by decreasing eigenvalue rather than the frequency, adding partials with high spectral energy that will increase the ﬁt of the reconstruction. The frequencies of the new partials are not constrained by harmonicity. Finally, eaQHM uses the spectral template to search for nearby spectral peaks with LS and frequency correction. The sinusoids will converge to spectral peaks in the neighborhood of the harmonic template with K harmonically related partials starting from f . Therefore, K in eaQHM 0 max corresponds to full-band analysis and synthesis but not necessarily for the SM or EDS. 5.3. Experiment 2: Variation Across L (Constant K = K ) max We ran each algorithm varying L from 3T f to 8T f with a constant number of partials K 0 s 0 s max and measured the resulting local and global SRER. In the literature [46], L = 3T f is considered 0 s a reasonable value for speech and audio signals when using the SM. We are unaware of a systematic investigation of how L affects modeling accuracy for EDS. Figure 6 shows the local and global SRER (averaged across sounds) as a function of L expressed as q times T f , so sounds with different f 0 s 0 values have different frame size L in samples. Figure 6 shows that the SRER decreases with L for all algorithms. The SM seldom outperforms EDS or eaQHM, but it is more robust against variations of L. For the SM, L affects both spectral estimation and temporal representation. In the FFT, L determines the trade-off between temporal and spectral resolution, which affects the performance of the peak picking algorithm for parameter estimation. The temporal representation is affected because the parameters are an average across L referenced to the center of the frame. In turn, ESPRIT estimates EDS with constant frequency inside the frames referenced to the beginning of the frame, thus L affects the temporal modeling accuracy more than the Global SRER (dB) Local SRER (dB) Global SRER (dB) Local SRER (dB) Local SRER (dB) Global SRER (dB) Appl. Sci. 2016, 6, 127 13 of 20 spectral estimation. However, the addition of sinusoids might compensate for the stationary frequency of EDS inside the frame. Finally, the SRER for eaQHM decreases considerably when L increases because L adversely affects the frequency correction and interpolation mechanisms. Frequency correction is applied at the center of the analysis frame and eaQHM uses spline interpolation to capture frequency modulations across frames. Thus, adaptation improves the ﬁt more slowly for longer L, generally reaching a lower absolute SRER value. Brass C3 Woodwinds C3 Bowed Strings C4 70 60 50 SM SM SM EDS EDS EDS eaQHM eaQHM eaQHM 0 10 5 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 Window Size (x T0) Window Size (x T0) Window Size (x T0) Brass C3 Woodwinds C3 Bowed Strings C4 80 80 80 SM SM SM EDS EDS EDS 70 70 eaQHM eaQHM eaQHM 60 60 50 50 40 40 30 30 10 20 20 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 Window Size (x T0) Window Size (x T0) Window Size (x T0) Plucked Strings C3 Struck Percussion C4 Keyboard C3 80 70 60 SM SM SM EDS EDS EDS 70 50 eaQHM eaQHM eaQHM 60 40 50 30 40 20 30 10 20 0 10 10 −10 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 Window Size (x T0) Window Size (x T0) Window Size (x T0) Plucked Strings C3 Struck Percussion C4 Keyboard C3 80 70 60 SM SM SM EDS EDS EDS eaQHM eaQHM eaQHM 50 45 30 30 10 10 15 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 Window Size (x T0) Window Size (x T0) Window Size (x T0) Figure 6. Comparison between local and global SRER as a function of the size of the frame for the three models (SM, EDS, and eaQHM). The bars around the mean are the standard deviation across different sounds from the family indicated. The distributions are not symmetrical as suggested by the bars. 5.4. Full-Band Quasi-Harmonic Analysis with AM–FM Sinusoids To simplify the comparison and reduce the information, we present the differences of SRER instead of absolute SRER values. For each sound, we subtract the absolute SRER values (in dB) for the SM and EDS from that of eaQHM to obtain the differences of SRER. The local value measures the ﬁt for the attack and the global value measures the overall ﬁt. Positive values indicate that eaQHM Global SRER (dB) Local SRER (dB) Global SRER (dB) Local SRER (dB) Global SRER (dB) Local SRER (dB) Global SRER (dB) Local SRER (dB) Local SRER (dB) Global SRER (dB) Global SRER (dB) Local SRER (dB) Appl. Sci. 2016, 6, 127 14 of 20 results in higher SRER than the other method for that particular sound, while a negative value means the opposite. The different SRER values are averaged across all musical instruments that belong to the family indicated. Table 2 shows the comparison of eaQHM against EDS and the SM with K = K max and L = 3T f clustered by instrumental family. The distributions are not symmetrical around the mean as suggested by the standard deviation. Table 2. Local and global difference of signal-to-reconstruction-error ratio (SRER) comparing eaQHM with exponentially damped sinusoids (EDS) and eaQHM with the standard sinusoidal model (SM) for the frame size L = 3T f and number of partials K = K . The three C2 sounds are not included. 0 s max SRER (eaQHM-EDS) SRER (eaQHM-SM) Family Local (dB) Global (dB) Local (dB) Global (dB) Brass 9.4 7.0 12.5 6.8 27.3 5.8 31.9 4.0 Woodwinds 7.8 3.9 22.0 5.9 30.9 7.5 36.1 4.7 Bowed Strings 12.2 4.2 24.1 6.7 35.0 4.7 40.0 4.7 Plucked Strings 8.3 5.0 4.7 3.4 49.5 4.3 46.6 5.1 Bowed Percussion 2.7 2.5 16.3 2.2 12.7 2.6 37.6 3.6 Struck Percussion 10.5 4.8 10.1 2.6 28.6 13.3 26.0 11.3 Popular 6.3 3.3 11.9 7.0 26.5 10.8 27.5 11.6 Keyboard 5.7 3.4 5.4 4.3 37.0 8.0 34.6 2.0 Total 5.3 2.4 13.2 3.3 31.0 7.1 35.0 5.9 Thus, Table 2 summarizes the result of full-band quasi-harmonic analysis with adaptive AM–FM sinusoids from eaQHM comparing with the SM and EDS under the same conditions, namely the same number of sinusoids K = K and frame size L = 3T f . When eaQHM is compared to max 0 s the SM, both local and global difference SRER are positive for all families. This means that full-band quasi-harmonic modeling with eaQHM results in a better ﬁt for the analysis and synthesis of musical instrument sounds. When eaQHM is compared to EDS, all global difference SRER are positive and all local difference SRER are positive except for Brass and Bowed Percussion. Thus, EDS can ﬁt the attack of Brass and Bowed Percussion better than eaQHM. The exponential amplitude envelope of EDS is considered suitable to model percussive sounds with sharp attacks such as harps, pianos, and marimbas [36,37]. The musical instrument families that contain percussive sounds are Plucked strings, Struck percussion, and Keyboard. Table 2 shows that eaQHM outperformed EDS locally and globally for all percussive sounds. The ability to adapt the amplitude of the sinusoidal partials to the local characteristics of the waveform makes eaQHM extremely ﬂexible to ﬁt both percussive and nonpercussive musical instrument sounds. On the other hand, both Brass and Bowed Percussion present slow attacks typically lasting longer than one frame L. Note that /f = 3T 22 ms for C3 ( f 131 Hz) while Bowed 0 0 Percussion can have attacks longer than 100 ms. Therefore, one frame L = 3T f does not measure the 0 s ﬁt for the entire duration of the attack. Note that the local SRER is important because the global SRER measures the overall ﬁt without indication of where the differences lie in the waveform. For musical instrument sounds, differences in the attack impact the results differently than elsewhere because the attack is among the most important perceptual features in dissimilarity judgment [56–58]. Consequently, when comparing two models with the global SRER, it is only safe to say that a higher SRER indicates that resynthesis results in a waveform that is closer to the original recording. 5.5. Full-Band Modeling and Quasi-Harmonicity Time-frequency transforms such as the STFT represent L samples in a frame with N DFT coefﬁcients provided that N L. Note that N 2 C, corresponding to p = 2N real numbers. There is signal expansion whenever the representation uses p parameters to represent L samples and p > L. Sinusoidal models represent L samples in a frame with K sinusoids. In turn, each sinusoid is described Appl. Sci. 2016, 6, 127 15 of 20 by p parameters, requiring pK parameters to represent L samples. Therefore, there is a maximum number of sinusoids to represent a frame without signal expansion. For example, white noise has a ﬂat spectrum across that would take a large number of sinusoids close together in frequency resulting in signal expansion. The pK parameters to represent L samples can be interpreted as the degrees of freedom of the ﬁt. As a general rule, more parameters mean greater ﬂexibility of representation (hence potentially a better ﬁt), but with the risk of over-ﬁtting. Table 3 shows a comparison of the number of real parameters p (per sinusoid k per frame m) for the analysis and synthesis stages of the SM, EDS, and eaQHM. Note that eaQHM and EDS require more parameters than the SM at the analysis stage, but eaQHM and the SM require fewer parameters than EDS for the synthesis stage. The difference is due to the resynthesis strategy used by each algorithm. EDS uses OLA resynthesis, which requires all analysis parameters for resynthesis, while both eaQHM and the SM use additive resynthesis. Table 3. Comparison of the number of real parameters p per sinusoid k per frame m for the analysis and synthesis stages of the SM, EDS, and eaQHM. The table presents the number of real parameters p to estimate and to resynthesize each sinusoid inside a frame. Number of Real Parameters p Per Sinusoid k Per Frame m SM EDS eaQHM Analysis p = 3 p = 4 p = 4 Synthesis p = 3 p = 4 p = 3 Harmonicity of the partials guarantees that there are no signal expansions in full-band modeling 1 f with sinusoids. Consider L = qT f with q an integer and T = /f . Using K s/2 f 0 s 0 0 max 0 (p f ) quasi-harmonic partials and p parameters per partial, it takes at most pK = /2 f numbers max 0 (q f ) (pK ) p to represent L = qT f = s /f samples, which gives the ratio r = max /L = /2q. Table 3 shows 0 s 0 that analysis with eaQHM requires p = 4 real parameters. Thus, a frame size with q > 2 is enough to guarantee no signal expansion. This result is due to the full-band paradigm using K harmonically max related partials, not a particular model. The advantage of full-band modeling results from the use of one single component instead of decomposition. Table 4 compares the complexity of SM, EDS, and eaQHM in Big-O notation. The complexity of SM is O (N log N), which is the complexity of the FFT algorithm for size N inputs. ESPRIT estimates the parameters of EDS with singular value decomposition (SVD), whose algorithmic complexity is 2 3 O(L + K ) for an L by K matrix (frame size versus the number of sinusoids). Adaptation in eaQHM is an iterative ﬁt where each iteration i requires running the model again as described in Section 3. For each iteration i, eaQHM estimates the parameters with least squares (LS) via calculation of the pseudoinverse matrix using QR decomposition. The algorithmic complexity of QR decomposition is O K for a square matrix of size K (the number of sinusoids). Adaptation of the sinusoids in eaQHM can result in over-ﬁtting. The amplitude and frequency modulations capture temporal variations inside the frame such as transients and instrumental noise around the partials. However, adaptation must not capture noise resulting from sources such as quantization, which is extraneous to the sound. Ideally, the residual should contain only external additive noise without any perceptually important information from the sound [17]. Table 4. Comparison of algorithmic complexity in Big-O notation. The table presents the complexity as a function of the size of the input N, L, and K and the number of iterations i. See text for details. Algorithmic Complexity SM EDS eaQHM 2 3 3 Complexity O (N log N) O(L + K ) O iK Appl. Sci. 2016, 6, 127 16 of 20 6. Evaluation of Perceptual Transparency with a Listening Test We performed a listening test to validate the full-band representation of musical instrument sounds with eaQHM. The aim of the test was to evaluate whether full-band modeling with eaQHM resulted in resynthesized musical instrument sounds that are perceptually indistinguishable from the original recordings. The 21 sounds in bold in Table 1 were selected for the listening test, which presented pairs original and resynthesis. The participants were instructed to listen to each pair as many times as necessary and to answer the question “Can you tell the difference between the two sounds in each pair?” Full-band (FB) resynthesis with eaQHM (using a harmonic template with K = K max sinusoids) was used for all 21 musical instrument sounds. For nine of these sounds, half-band (HB) resynthesis with eaQHM (using a harmonic template with K = max/2 sinusoids) was also included as control group to test the aptitude of the listeners and compare against the FB version. All HB versions were placed at random positions among the FB, so the test presented 30 pairs overall. The listening test can be accessed at [59]. In total, 20 people aged between 26 and 40 took the test. The participants declared themselves as experienced with listening tests and familiar with signal processing techniques. Figure 7 shows the result of the listening test as the percentage of the people who answered “no” to the question, indicating that they cannot tell the difference between the original recording and the resynthesis. In general, the result of the listening test shows that full-band modeling with eaQHM results in perceptually indistinguishable resynthesis for most musical instrument sounds tested. The ﬁgure indicates that 10 out of the 21 FB sounds tested were rated perceptually identical to the original by 100% of the listeners. As expected, most HB sounds fall under 30% (except Tenor Trombone) and most FB sounds lie above 70% (except Pan Flute). Table 1 shows that Tenor Trombone is played at C3 and Pan Flute at C5. The Tenor Trombone sound is not bright, which indicates that there is little spectral energy at the higher frequency end of the spectrum. Thus, the HB version synthesized with fewer partials than K was perceived as identical to the original by max some listeners. The Pan Flute sound contains a characteristic breathing noise captured as AM–FM elements in eaQHM. However, the breathing noise in the full-band version sounds brighter than the original recording and most listeners were able to tell the difference. Perceptual Similarity of Full-Band Modeling 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 FB 0.1 HB Figure 7. Result of the listening test on perceptual similarity of full-band (FB) and half-band (HB) resynthesis with eaQHM compared to the original recording. The sounds used in the listening test appear in bold in Table 1. Acoustic Guitar Bass Harmonica Accordion Bass Clarinet Cello Celesta Flute Piano Tenor Sax Tenor Trombone Ukulele C Trumpet Pan Flute Oboe 1 Oboe 2 Harp Glockenspiel French Horn Vibraphone Xylophone Viola Percentage (%) Appl. Sci. 2016, 6, 127 17 of 20 7. Conclusions We proposed the full-band quasi-harmonic modeling of musical instrument sounds with adaptive AM–FM sinusoids from eaQHM as an alternative to spectrum decomposition. We used the SRER to measure the ﬁt of the sinusoidal model to the original recording of 89 percussive and nonpercussive musical instruments sounds from different families. We showed that full-band modeling with eaQHM results in higher global SRER values when compared to the standard SM and to EDS estimated with ESPRIT for K sinusoids and frame size L = 3T f . EDS resulted in higher local SRER than eaQHM max 0 s for two of nine instrumental families, namely Brass and Bowed Percussion. A listening test conﬁrmed that full-band modeling with eaQHM resulted in perceptually indistinguishable resynthesis for most musical instrument sounds tested. Future work should investigate a method to prevent over-ﬁtting with eaQHM. Additionally, the use of least-squares to estimate the parameters leads to matrices that are badly conditioned for sounds with low fundamental frequencies. A more robust estimation method to prevent bad-conditioning would improve the stability of eaQHM. Currently, eaQHM can only estimate the parameters of isolated sounds. We intend to develop a method for polyphonic instruments and music. Future work also involves using eaQHM in musical instrument sound transformation, estimation of musical expressivity features such as vibrato, and solo instrumental music. The companion webpage [60] contains sound examples. Finally, the proposal of a full-band representation of musical instrument sounds with adaptive sinusoids motivates further investigation on full-band extensions of other sinusoidal methods, such as SM and EDS used here. Acknowledgments: This work was partly supported by project “NORTE-01-0145-FEDER-000020” ﬁnanced by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF) and by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant 644283. The latter project also supplied funds for covering the costs to publish in open access. Author Contributions: Marcelo Caetano conceived and designed the experiments, analyzed the data, and wrote the manuscript. George P. Kafentzis performed the experiments, helped analyze the results, and revised the manuscript. Athanasios Mouchtaris supervised the research and revised the manuscript. Yannis Stylianou supervised the research. Conﬂicts of Interest: The authors declare no conﬂict of interest. References 1. Serra, X.; Smith, J.O. Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Comput. Music J. 1990, 14, 49–56. 2. Beauchamp, J.W. Analysis and synthesis of musical instrument sounds. In Analysis, Synthesis, and Perception of Musical Sounds; Beauchamp, J.W., Ed.; Modern Acoustics and Signal Processing; Springer: New York, NY, USA, 2007; pp. 1–89. 3. Quatieri, T.; McAuley, R. Audio signal processing based on sinusoidal analysis/synthesis. In Applications of Digital Signal Processing to Audio and Acoustics; Kahrs, M., Brandenburg, K., Eds.; Kluwer Academic Publishers: Berlin/Heidelberg, Germany, 2002; Chapter 9, pp. 343–416. 4. Serra, X.; Bonada, J. Sound Transformations based on the SMS high level attributes. Proc. Digit. Audio Eff. Workshop 1998, 5. Available online: http://mtg.upf.edu/ﬁles/publications/dafx98-1.pdf (accessed on 26 April 2016). 5. Caetano, M.; Rodet, X. Musical Instrument sound morphing guided by perceptually motivated features. IEEE Trans. Audio Speech Lang. Process. 2013, 21, 1666–1675. 6. Barbedo, J.; Tzanetakis, G. Musical instrument classiﬁcation using individual partials. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 111–122. 7. Herrera, P.; Bonada, J. Vibrato Extraction and parameterization in the spectral modeling synthesis framework. Proc. Digit. Audio Eff. Workshop 1998, 99. Available online: http://www.mtg.upf.edu/ﬁles/publications/ dafx98-perfe.pdf (accessed on 26 April 2016). Appl. Sci. 2016, 6, 127 18 of 20 8. Glover, J.; Lazzarini, V.; Timoney, J. Real-time detection of musical onsets with linear prediction and sinusoidal modeling. EURASIP J. Adv. Signal Process. 2011, doi:10.1186/1687-6180-2011-68. 9. Virtanen, T.; Klapuri, A. Separation of harmonic sound sources using sinusoidal modeling. In Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Istanbul, Turkey, 5–9 June 2000; Volume 2, pp. II765–II768. 10. Lagrange, M.; Marchand, S.; Rault, J.B. Long interpolation of audio signals using linear prediction in sinusoidal modeling. J. Audio Eng. Soc. 2005, 53, 891–905. 11. Hermus, K.; Verhelst, W.; Lemmerling, P.; Wambacq, P.; Huffel, S.V. Perceptual audio modeling with exponentially damped sinusoids. Signal Process. 2005, 85, 163–176. 12. Nsabimana, F.; Zolzer, U. Audio signal decomposition for pitch and time scaling. In Proceedings of the International Symposium on Communications, Control, and Signal Processing (ISCCSP), St Julians, Malta, 12–14 March 2008; pp. 1285–1290. 13. El-Jaroudi, A.; Makhoul, J. Discrete all-pole modeling. IEEE Trans. Commun. Technol. 1969, 39, 481–488. 14. Caetano, M.; Rodet, X. A source-ﬁlter model for musical instrument sound transformation. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 137–140. 15. Wen, X.; Sandler, M. Source-Filter Modeling in the Sinusoidal Domain. J. Audio Eng. Soc. 2010, 58, 795–808. 16. Fletcher, N.H.; Rossing, T.D. The Physics of Musical Instruments, 2nd ed.; Springer: New York, NY, USA, 1998. 17. Caetano, M.; Kafentzis, G.P.; Degottex, G.; Mouchtaris, A.; Stylianou, Y. Evaluating how well ﬁltered white noise models the residual from sinusoidal modeling of musical instrument sounds. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 20–23 October 2013; pp. 1–4. 18. Bader, R.; Hansen, U. Modeling of musical instruments. In Handbook of Signal Processing in Acoustics; Havelock, D., Kuwano, S., Vorländer, M., Eds.; Springer: New York, NY, USA, 2009; pp. 419–446. 19. Fletcher, N.H. The nonlinear physics of musical instruments. Rep. Prog. Phys. 1999, 62, 723–764. 20. McAulay, R.J.; Quatieri, T.F. Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Acoust. Speech Signal Process. 1986, 34, 744–754. 21. Green, R.A.; Haq, A. B-spline enhanced time-spectrum analysis. Signal Process. 2005, 85, 681–692. 22. Belega, D.; Petri, D. Frequency estimation by two- or three-point interpolated Fourier algorithms based on cosine windows. Signal Process. 2015, 117, 115–125. 23. Prudat, Y.; Vesin, J.M. Multi-signal extension of adaptive frequency tracking algorithms. Signal Process. 2009, 89, 96–973. 24. Candan, Ç. Fine resolution frequency estimation from three DFT samples: Case of windowed data. Signal Process. 2015, 114, 245–250. 25. Röbel, A. Adaptive additive modeling with continuous parameter trajectories. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 1440–1453. 26. Verma, T.S.; Meng, T.H.Y. Extending spectral modeling synthesis with transient modeling synthesis. Comput. Music J. 2000, 24, 47–59. 27. Laurenti, N.; De Poli, G.; Montagner, D. A nonlinear method for stochastic spectrum estimation in the modeling of musical sounds. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 531–541. 28. Daudet, L. A review on techniques for the extraction of transients in musical signals. Proc. Int. Symp. Comput. Music Model. Retr. 2006, 3902, 219–232. 29. Jang, H.; Park, J.S. Multiresolution sinusoidal model with dynamic segmentation for timescale modiﬁcation of polyphonic audio signals. IEEE Trans. Speech Audio Process. 2005, 13, 254–262. 30. Beltrán, J.R.; de León, J.P. Estimation of the instantaneous amplitude and the instantaneous frequency of audio signals using complex wavelets. Signal Process. 2010, 90, 3093–3109. 31. Levine, S.N.; Smith, J.O. A compact and malleable sines+transients+noise model for sound. In Analysis, Synthesis, and Perception of Musical Sounds; Beauchamp, J.W., Ed.; Modern Acoustics and Signal Processing; Springer: New York, NY, USA, 2007; pp. 145–174. 32. Markovsky, I.; Huffel, S.V. Overview of total least-squares methods. Signal Process. 2007, 87, 2283–2302. 33. Roy, R.; Kailath, T. ESPRIT-estimation of signal parameters via rotational invariance techniques. IEEE Trans. Acoust. Speech Signal Process 1989, 37, 984–995. Appl. Sci. 2016, 6, 127 19 of 20 34. Van Huffel, S.; Park, H.; Rosen, J. Formulation and solution of structured total least norm problems for parameter estimation. IEEE Trans. Signal Process. 1996, 44, 2464–2474. 35. Liu, Z.S.; Li, J.; Stoica, P. RELAX-based estimation of damped sinusoidal signal parameters. Signal Process. 1997, 62, 311–321. 36. Nieuwenhuijse, J.; Heusens, R.; Deprettere, E.F. Robust exponential modeling of audio signals. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Seattle, WA, USA, 12–15 May 1998; Volume 6, pp. 3581–3584. 37. Badeau, R.; Boyer, R.; David, B. EDS Parametric Modeling And Tracking of Audio Signals. In Proceedings of the 5th International Conference on Digital Audio Effects (DAFx), Hambourg, Germany, 26–28 September 2002; pp. 26–28. 38. Jensen, J.; Heusdens, R. A comparison of sinusoidal model variants for speech and audio representation. In Proceedings of the 2002 11th European Signal Processing Conference (EUSIPCO), Toulouse, France, 3–6 September 2002; pp. 1–4. 39. Auger, F.; Flandrin, P. Improving the readability of time-frequency and time-scale representations by the reassignment method. IEEE Trans. Signal Process. 1995, 43, 1068–1089. 40. Fulop, S.A.; Fitz, K. Algorithms for computing the time-corrected instantaneous frequency (reassigned) spectrogram, with applications. J. Acoust. Soc. Am. 2006, 119, 360–371. 41. Li, X.; Bi, G. The reassigned local polynomial periodogram and its properties. Signal Process. 2009, 89, 206–217. 42. Girin, L.; Marchand, S.; Di Martino, J.; Röbel, A.; Peeters, G. Comparing the order of a polynomial phase model for the synthesis of quasi-harmonic audio signals. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 19–22 October 2003; pp. 193–196. 43. Kafentzis, G.P.; Pantazis, Y.; Rosec, O.; Stylianou, Y. An extension of the adaptive quasi-harmonic model. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech, and Signal Processing, Kyoto, Japan, 25–30 March 2012; pp. 4605–4608. 44. Kafentzis, G.P.; Rosec, O.; Stylianou, Y. On the modeling of voiceless stop sounds of speech using adaptive quasi-harmonic models. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Portland, OR, USA, 9–13 September 2012. 45. Pantazis, Y.; Rosec, O.; Stylianou, Y. Adaptive AM–FM signal decomposition with application to speech analysis. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 290–300. 46. Degottex, G.; Stylianou, Y. Analysis and synthesis of speech using an adaptive full-band harmonic model. IEEE Trans. Audio Speech Lang. Process. 2013, 21, 2085–2095. 47. Caetano, M.; Kafentzis, G.P.; Mouchtaris, A.; Stylianou, Y. Adaptive sinusoidal modeling of percussive musical instrument sounds. In Proceedings of the European Signal Processing Conference (EUSIPCO), Marrakech, Morocco, 9–13 September 2013; pp. 1–5. 48. Pantazis, Y.; Rosec, O.; Stylianou, Y. On the Properties of a time-varying quasi-harmonic model of speech. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Brisbane, Australia, 22–26 September 2008; pp. 1044–1047. 49. Smyth, T.; Abel, J.S. Toward an estimation of the clarinet reed pulse from instrument performance. J. Acoust. Soc. Am. 2012, 131, 4799–4810. 50. Smyth, T.; Scott, F. Trombone synthesis by model and measurement. EURASIP J. Adv. Signal Process. 2011, doi:10.1155/2011/151436. 51. Brown, J.C. Frequency ratios of spectral components of musical sounds. J. Acoust. Soc. Am. 1996, 99, 1210–1218. 52. Borss, C.; Martin, R. On the construction of window functions with constant overlap-add constraint for arbitrary window shifts. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 337–340. 53. Camacho, A.; Flory, H.Y. A sawtooth waveform inspired pitch estimator for speech and music. J. Acoust. Soc. Am. 2008, 124, 1638–1652. Appl. Sci. 2016, 6, 127 20 of 20 54. Goto, M.; Hashiguchi, H.; Nishimura, T.; Oka, R. RWC Music Database: Music Genre Database and Musical Instrument Sound Database. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), Baltimore, MD, USA, 26–30 October 2003; pp. 229–230. Available online: http://staff.aist.go.jp/m.goto/RWC-MDB/ (accessed on 26 April 2016). 55. Vienna Symphonic Library–GmbH. Available online: http://www.vsl.co.at/ (accessed on 26 April 2016). 56. Grey, J.M.; Gordon, J.W. Multidimensional perceptual scaling of musical timbre. J. Acoust. Soc. Am. 1977, 61, 1270–1277. 57. Krumhansl, C.L. Why is musical timbre so hard to understand? In Structure and Perception of Electroacoustic Sound and Music; Nielzén, S., Olsson, O., Eds.; Excerpta Medica: New York, NY, USA, 1989; pp. 43–54. 58. McAdams, S.; Giordano, B.L. The perception of musical timbre. In The Oxford Handbook of Music Psychology; Hallam, S., Cross, I., Thaut, M., Eds.; Oxford University Press: New York, NY, USA, 2009; pp. 72–80. 59. Listening Test. Webpage for the Listening Test. Available online: http://ixion.csd.uoc.gr/kafentz/listest/ pmwiki.php?n=Main.JMusLT (accessed on 26 April 2016). 60. AdaptiveSinMus. Companion webpage with sound examples. Available online: http://www.csd.uoc.gr/ kafentz/listest/pmwiki.php?n=Main.AdaptiveSinMus (accessed on 26 April 2016). c 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/). http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Applied Sciences Multidisciplinary Digital Publishing Institute http://www.deepdyve.com/lp/multidisciplinary-digital-publishing-institute/full-band-quasi-harmonic-analysis-and-synthesis-of-musical-instrument-kurCrTSxqa

Loading next page...

References (68)

N. Fletcher (1999)
The nonlinear physics of musical instruments
Reports on Progress in Physics, 62
T. Quatieri, R. McAulay (2002)
Audio Signal Processing Based on Sinusoidal Analysis/Synthesis
H. Jang, Ju Park (2005)
Multiresolution sinusoidal model with dynamic segmentation for timescale modification of polyphonic audio signals
IEEE Transactions on Speech and Audio Processing, 13
J. Grey (1977)
Multidimensional perceptual scaling of musical timbres.
The Journal of the Acoustical Society of America, 61 5
G. Degottex, Y. Stylianou (2013)
Analysis and Synthesis of Speech Using an Adaptive Full-Band Harmonic Model
IEEE Transactions on Audio, Speech, and Language Processing, 21
F. Nsabimana, U. Zolzer (2008)
Audio signal decomposition for pitch and time scaling
2008 3rd International Symposium on Communications, Control and Signal Processing
Y. Prudat, J. Vesin (2009)
Multi-signal extension of adaptive frequency tracking algorithms
Signal Process., 89
Xiumei Li, G. Bi (2009)
The reassigned local polynomial periodogram and its properties
Signal Process., 89
T. Verma, T. Meng (2000)
Extending Spectral Modeling Synthesis with Transient Modeling Synthesis
Computer Music Journal, 24
(1989)
Why is musical timbre so hard to understand? In Structure and Perception of Electroacoustic Sound and Music
Xue Wen, M. Sandler (2009)
Source―Filter Modeling in the Sinusoid Domain
Journal of The Audio Engineering Society, 58
J. Barbedo, G. Tzanetakis (2011)
Musical Instrument Classification Using Individual Partials
IEEE Transactions on Audio, Speech, and Language Processing, 19
Zheng-She Liu, Jian Li, P. Stoica (1997)
RELAX-based estimation of damped sinusoidal signal parameters
Signal Process., 62
Companion webpage with sound examples Available online: http://www.csd.uoc.gr/ kafentz/listest/pmwiki.php?n=Main.AdaptiveSinMus (accessed on 26 April 2016). c 2016 by the authors; licensee MDPI
George Kafentzis, Yannis Pantazis, O. Rosec, Y. Stylianou (2012)
An extension of the adaptive Quasi-Harmonic Model
2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
T. Smyth, J. Abel (2012)
Toward an estimation of the clarinet reed pulse from instrument performance.
The Journal of the Acoustical Society of America, 131 6
R. McAulay, T. Quatieri (1986)
Speech analysis/Synthesis based on a sinusoidal representation
IEEE Trans. Acoust. Speech Signal Process., 34
R. Bader, U. Hansen (2008)
Modeling of Musical Instruments
Xavier Serra, J. Smith (1990)
Spectral modeling synthesis: A sound analysis/synthesis based on a deterministic plus stochastic decomposition
Computer Music Journal, 14
Arturo Camacho, J. Harris (2008)
A sawtooth waveform inspired pitch estimator for speech and music.
The Journal of the Acoustical Society of America, 124 3
D. Hall (2007)
Analysis, Synthesis, and Perception of Musical Sounds
Journal of the Acoustical Society of America, 121
J. Beauchamp (2007)
Analysis and Synthesis of Musical Instrument Sounds
N. Fletcher, T. Rossing (1991)
Music Producers.(Book Reviews: The Physics of Musical Instruments.)
Marcelo Caetano, X. Rodet (2013)
Musical Instrument Sound Morphing Guided by Perceptually Motivated Features
IEEE Transactions on Audio, Speech, and Language Processing, 21
T. Smyth, Frederick Scott (2011)
Trombone Synthesis by Model and Measurement
EURASIP Journal on Advances in Signal Processing, 2011
R. Green, Adnanul Haq (2005)
B-spline enhanced time-spectrum analysis
Signal Process., 85
Xavier Serra, Jacqui Smith, A. Aydin (1990)
A sound analysis/synthesis system based on a deterministic plus stochastic decomposition
N. Laurenti, Giovanni Poli, Daniele Montagner (2007)
A Nonlinear Method for Stochastic Spectrum Estimation in the Modeling of Musical Sounds
IEEE Transactions on Audio, Speech, and Language Processing, 15
R. Badeau, R. Boyer, B. David (2002)
EDS PARAMETRIC MODELING AND TRACKING OF AUDIO SIGNALS
T. Virtanen, Anssi Klapuri (2000)
Separation of harmonic sound sources using sinusoidal modeling
2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100), 2
I. Markovsky, S. Huffel (2007)
Overview of total least-squares methods
Signal Process., 87
Marcelo Caetano, George Kafentzis, A. Mouchtaris, Y. Stylianou (2013)
Adaptive sinusoidal modeling of percussive musical instrument sounds
21st European Signal Processing Conference (EUSIPCO 2013)
John Glover, Victor Lazzarini, J. Timoney (2011)
Real-time detection of musical onsets with linear prediction and sinusoidal modeling
EURASIP Journal on Advances in Signal Processing, 2011
Judith Brown (1996)
Frequency ratios of spectral components of musical sounds
Journal of the Acoustical Society of America, 99
Vienna Symphonic Library – GmbH
Yannis Pantazis, O. Rosec, Y. Stylianou (2011)
Adaptive AM–FM Signal Decomposition With Application to Speech Analysis
IEEE Transactions on Audio, Speech, and Language Processing, 19
George Kafentzis, O. Rosec, Y. Stylianou (2012)
On the Modeling of Voiceless Stop Sounds of Speech using Adaptive Quasi-Harmonic Models
L. Daudet (2005)
A Review on Techniques for the Extraction of Transients in Musical Signals
Marcelo Caetano, X. Rodet (2012)
A source-filter model for musical instrument sound transformation
2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
(2016)
Listening Test Webpage for the Listening Test Available online: http://ixion
S. Huffel, Haesun Park, J. Rosen (1996)
Formulation and solution of structured total least norm problems for parameter estimation
IEEE Trans. Signal Process., 44
J. Jensen, R. Heusdens (2002)
A comparison of sinusoidal model variants for speech and audio representation
2002 11th European Signal Processing Conference
Masataka Goto, Hiroki Hashiguchi, T. Nishimura, R. Oka (2003)
RWC Music Database: Music genre database and musical instrument sound database
, 2003
José Blázquez, Jesús León (2010)
Estimation of the instantaneous amplitude and the instantaneous frequency of audio signals using complex wavelets
Signal Process., 90
S. Hallam, I. Cross, M. Thaut (2011)
The Oxford handbook of music psychology
P. Herrera, J. Bonada (1998)
Vibrato Extraction and Parameterization in the Spectral Modeling Synthesis framework
Yannis Pantazis, O. Rosec, Y. Stylianou (2008)
On the properties of a time-varying quasi-harmonic model of speech
Ç. Candan (2015)
Fine resolution frequency estimation from three DFT samples: Case of windowed data
Signal Process., 114
Joost Nieuwenhuijse, R. Heusdens, E. Deprettere (1998)
Robust exponential modeling of audio signals
Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181), 6
K. Hermus, W. Verhelst, P. Lemmerling, P. Wambacq, S. Huffel (2005)
Perceptual audio modeling with exponentially damped sinusoids
Signal Process., 85
(2009)
Webpage for the Listening Test Companion webpage with sound examples
(2007)
Analysis and synthesis of musical instrument sounds. In Analysis, Synthesis, and Perception of Musical Sounds; Beauchamp, J.W., Ed.; Modern Acoustics and Signal Processing
R. Roy, T. Kailath (1989)
ESPRIT-estimation of signal parameters via rotational invariance techniques
IEEE Trans. Acoust. Speech Signal Process., 37
(1989)
Why is musical timbre so hard to understand ?
D. Belega, D. Petri (2015)
Frequency estimation by two- or three-point interpolated Fourier algorithms based on cosine windows
Signal Process., 117
S. Fulop, K. Fitz (2004)
Algorithms for computing the time-corrected instantaneous frequency (reassigned) spectrogram, with applications.
The Journal of the Acoustical Society of America, 119 1
S. McAdams, Bruno Giordano (2008)
The Perception of Musical Timbre
Marcelo Caetano, George Kafentzis, G. Degottex, A. Mouchtaris, Y. Stylianou (2013)
Evaluating how well filtered white noise models the residual from sinusoidal modeling of musical instrument sounds
2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
C. Borß, Rainer Martin (2012)
On the construction of window functions with constant-overlap-add constraint for arbitrary window shifts
2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
M. Lagrange, S. Marchand, J. Rault (2005)
Long Interpolation of Audio Signals Using Linear Prediction in Sinusoidal Modeling
Journal of The Audio Engineering Society, 53
F. Auger, P. Flandrin (1995)
Improving the readability of time-frequency and time-scale representations by the reassignment method
IEEE Trans. Signal Process., 43
Xavier Serra, J. Bonada (1998)
Sound transformations based on the SMS high level attributes
Laurent Girin, S. Marchand, J. Martino, A. Robel, G. Peeters (2003)
Comparing the order of a polynomial phase model for the synthesis of quasi-harmonic audio signals
2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE Cat. No.03TH8684)
S. Levine, J. Smith (2007)
A Compact and Malleable Sines+Transients+Noise Model for Sound
A. Röbel (2006)
Adaptive additive modeling with continuous parameter trajectories
IEEE Transactions on Audio, Speech, and Language Processing, 14
A. El-Jaroudi, J. Makhoul (1991)
Discrete all-pole modeling
IEEE Trans. Signal Process., 39
(2016)
Available online: http://www.vsl.co.at/ (accessed on 26
R. Badeau, R. Boyer, B. David (2002)
PARAMETRIC MODELING AND TRACKING OF AUDIO SIGNALS

Publisher: Multidisciplinary Digital Publishing Institute
Copyright: © 1996-2019 MDPI (Basel, Switzerland) unless otherwise stated
ISSN: 2076-3417
DOI: 10.3390/app6050127
Publisher site: See Article on Publisher Site

Abstract

applied sciences Article Full-Band Quasi-Harmonic Analysis and Synthesis of Musical Instrument Sounds with Adaptive Sinusoids 1, 2 2,3 2 Marcelo Caetano *, George P. Kafentzis , Athanasios Mouchtaris and Yannis Stylianou Sound and Music Computing Group, Institute for Systems and Computer Engineering, Technology and Science (INESC TEC), 4200-465 Porto, Portugal Multimedia Informatics Lab, Department of Computer Science, University of Crete, 700-13 Heraklion, Greece; kafentz@csd.uoc.gr (G.P.K.); mouchtar@ics.forth.gr (A.M.); yannis@csd.uoc.gr (Y.S.) Signal Processing Laboratory, Institute of Computer Science, Foundation for Technology & Research-Hellas (FORTH), 700-13 Heraklion, Greece * Correspondence: mcaetano@inesctec.pt; Tel.: +351-22-209-4217 Academic Editor: Vesa Valimaki Received: 16 February 2016; Accepted: 19 April 2016; Published: 2 May 2016 Abstract: Sinusoids are widely used to represent the oscillatory modes of musical instrument sounds in both analysis and synthesis. However, musical instrument sounds feature transients and instrumental noise that are poorly modeled with quasi-stationary sinusoids, requiring spectral decomposition and further dedicated modeling. In this work, we propose a full-band representation that ﬁts sinusoids across the entire spectrum. We use the extended adaptive Quasi-Harmonic Model (eaQHM) to iteratively estimate amplitude- and frequency-modulated (AM–FM) sinusoids able to capture challenging features such as sharp attacks, transients, and instrumental noise. We use the signal-to-reconstruction-error ratio (SRER) as the objective measure for the analysis and synthesis of 89 musical instrument sounds from different instrumental families. We compare against quasi-stationary sinusoids and exponentially damped sinusoids. First, we show that the SRER increases with adaptation in eaQHM. Then, we show that full-band modeling with eaQHM captures partials at the higher frequency end of the spectrum that are neglected by spectral decomposition. Finally, we demonstrate that a frame size equal to three periods of the fundamental frequency results in the highest SRER with AM–FM sinusoids from eaQHM. A listening test conﬁrmed that the musical instrument sounds resynthesized from full-band analysis with eaQHM are virtually perceptually indistinguishable from the original recordings. Keywords: musical instruments; analysis and synthesis; sinusoidal modeling; AM–FM sinusoids; adaptive modeling; nonstationary sinusoids; full-band modeling PACS: 43.75.Zz; 43.75.De; 43.75.Ef; 43.75.Fg; 43.75.Gh; 43.75.Kk; 43.75.Mn; 43.75.Pq; 43.75.Qr 1. Introduction Sinusoidal models are widely used in the analysis [1,2], synthesis [2,3], and transformation [4,5] of musical instrument sounds. The musical instrument sound is modeled by a waveform consisting of a sum of time-varying sinusoids parameterized by their amplitudes, frequencies, and phases [1–3]. Sinusoidal analysis consists of the estimation of parameters, synthesis comprises techniques to retrieve a waveform from the analysis parameters, and transformations are performed as changes of the parameter values. The time-varying sinusoids, called partials, represent how the oscillatory modes of the musical instrument change with time, resulting in a ﬂexible representation with perceptually meaningful parameters. The parameters completely describe each partial, which can be manipulated independently. Appl. Sci. 2016, 6, 127; doi:10.3390/app6050127 www.mdpi.com/journal/applsci Appl. Sci. 2016, 6, 127 2 of 20 Several important features can be directly estimated from the analysis parameters, such as fundamental frequency, spectral centroid, inharmonicity, spectral ﬂux, onset asynchrony, among many others [2]. The model parameters can also be used in musical instrument classiﬁcation, recognition, and identification [6], vibrato detection [7], onset detection [8], source separation [9], audio restoration [10], and audio coding [11]. Typical transformations are pitch shifting, time scaling [12], and musical instrument sound morphing [5]. Additionally, the parameters from sinusoidal models can be used to estimate alternative representations of musical instrument sounds, such as spectral envelopes [13] and the source-ﬁlter model [14,15]. The quality of the representation is critical and can impact the results for the above applications. In general, sinusoidal models render a close representation of musical instrument sounds because most pitched musical instruments are designed to present very clear modes of vibration [16]. However, sinusoidal models do not result in perfect reconstruction upon resynthesis, leaving a modeling residual that contains whatever was not captured by the sinusoids [17]. Musical instrument sounds have particularly challenging features to represent with sinusoids, such as sharp attacks, transients, inharmonicity, and instrumental noise [16]. Percussive sounds produced by plucking strings (such as harpsichords, harps, and the pizzicato playing technique) or striking percussion instruments (such as drums, idiophones, or the piano) feature sharp onsets with highly nonstationary oscillations that die out very quickly, called transients [18]. Flute sounds characteristically comprise partials on top of breathing noise [16]. The reed in woodwind instruments presents a highly nonlinear behavior that also results in attack transients [19], while the stiffness of piano strings results in a slightly inharmonic spectrum [18]. The residual from most sinusoidal representations of musical instrument sounds contains perceptually important information [17]. However, the extent of this information ultimately depends on what the sinusoids are able to capture. The standard sinusoidal model (SM) [1,20] was developed as a parametric extension of the short-time Fourier transform (STFT) so both analysis and synthesis present the same time-frequency limitations as the Discrete Fourier Transform (DFT) [21]. The parameters are estimated with well-known techniques, such as peak-picking and parabolic interpolation [20,22], and then connected across overlapping frames (partial tracking [23]). Peak-picking is known to bias the estimation of parameters because errors in the estimation of frequencies can bias the estimation of amplitudes [22,24]. Additionally, the inherent time-frequency uncertainty of the DFT further limits the estimation because long analysis windows blur the temporal resolution to improve the frequency resolution and vice-versa [21]. The SM uses quasi-stationary sinusoids (QSS) under the assuption that the partials are relatively stable inside each frame. QSS can accurately capture the lower frequencies because these have fewer periods inside each frame and thus less temporal variation. However, higher frequencies have more periods inside each frame with potentially more temporal variation lost by QSS. Additionally, the parameters of QSS are estimated using the center of the frame as the reference and the values are less accurate towards the edges because the DFT has a stationary basis [25]. This results in the loss of sharpness of attack known as pre-echo. The lack of transients and noise is perceptually noticeable in musical instrument sounds represented with QSS [17,26]. Serra and Smith [1] proposed to decompose the musical instrument sound into a sinusoidal component represented with QSS and a residual component obtained by subtraction of the sinusoidal component from the original recording. This residual is assumed to be noise not captured by the sinusoids and commonly modeled by ﬁltering white noise with a time-varying ﬁlter that emulates the spectral characteristics of the residual component [1,17]. However, the residual contains both errors in parameter estimation and transients plus noise missed by the QSS [27]. The time-frequency resolution trade-off imposes severe limits on the detection of transients with the DFT. Transients are essentially localized in time and usually require shorter frames which blur the peaks in the spectrum. Daudet [28] reviews several techniques to detect and extract transients with sinusoidal models. Multi-resolution techniques [29,30] use multiple frame sizes to circumvent Appl. Sci. 2016, 6, 127 3 of 20 the time-frequency uncertainty and to detect modulations at different time scales. Transient modeling synthesis (TMS) [26,27,31] decomposes sounds into sinusoids plus transients plus noise and models each separately. TMS performs sinusoidal plus residual decomposition with QSS and then extracts the transients from the residual. An alternative to multiresolution techniques is the use of high-resolution techniques based on total least squares [32] such as ESPRIT [33], MUSIC [34], and RELAX [35] to ﬁt exponentially damped sinusoids (EDS). EDS are widely used to represent musical instrument sounds [11,36,37]. EDS are sinusoids with stationary (i.e., constant) frequencies modulated in amplitude by an exponential function. The exponentially decaying amplitude envelope from EDS is considered suitable to represent percussive sounds when the beginning of the frame is synchronized with the onsets [38]. However, EDS requires additional partials when there is no synchronization, which increases the complexity of the representation. ESPRIT decomposes the signal space into sinusoidal and residual, further ranking the sinusoids by decreasing magnitude of eigenvalue (i.e., spectral energy). Therefore, the ﬁrst K sinusoids maximize the energy upon resynthesis regardless of their frequencies. Both the SM and EDS rely on sinusoids with stationary frequencies, which are not appropriate to represent nonstationary oscillations [21]. Time-frequency reassignment [39–41] was developed to estimate nonstationary sinusoids. Polynomial phase signals [20,25] such as splines [21] are commonly used as an alternative to stationary sinusoids. McAulay and Quatieri [20] were among the ﬁrst to interpolate the phase values estimated at the center of the analysis window across frames with cubic polynomials to obtain nonstationary sinusoids inside each frame. Girin et al. [42] investigated the impact of the order of the polynomial used to represent the phase and concluded that order ﬁve does not improve the modeling performance sufﬁciently to justify the increased complexity. However, even nonstationary sinusoids leave a residual with perceptually important information that requires further modeling [25]. Sinusoidal models rely on spectral decomposition assuming that the lower end of the spectrum can be modeled with sinusoids while the higher end essentially contains noise. The estimation of the separation between the sinusoidal and residual components has proved difﬁcult [27]. Ultimately, spectral decomposition misses partials on the higher end of the spectrum because the separation is artiﬁcial, depending on the spectrum estimation method rather than the spectral characteristics of musical instrument sounds. We consider spectral decomposition to be a consequence of artifacts from previous sinusoidal models instead of an acoustic property of musical instruments. Therefore, we propose the full-band modeling of musical instrument sounds with adaptive sinusoids as an alternative to spectral decomposition. Adaptive sinusoids (AS) are nonstationary sinusoids estimated to ﬁt the signal being analyzed usually via an iterative parameter re-estimation process. AS have been used to model speech [43–46] and musical instrument sounds [25,47]. Pantazis [45,48] developed the adaptive Quasi-Harmonic Model (aQHM), which iteratively adapts the frequency trajectories of all sinusoids at the same time based on the Quasi-Harmonic Model (QHM). Adaptation improves the ﬁt of a spectral template via an iterative least-squares (LS) parameter estimation followed by frequency correction. Later, Kafentzis [43] devised the extended adaptive Quasi-Harmonic Model (eaQHM), capable of adapting both amplitude and frequency trajectories of all sinusoids iteratively. In eaQHM, adaptation is equivalent to the iterative projection of the original waveform onto nonstationary basis functions that are locally adapted to the time-varying characteristics of the sound, capable of modeling sudden changes such as sharp attacks, transients, and instrumental noise. In a previous work [47], we showed that eaQHM is capable of retaining the sharpness of the attack of percussive sounds. In this work, we propose full-band modeling with eaQHM for a high-quality analysis and synthesis of isolated musical instrument sounds with a single component. We compare our method to QSS estimated with the standard SM [20] and EDS estimated with ESPRIT [36]. In the next section, we discuss the differences in full-band spectral modeling and traditional decomposition for musical instrument sounds. Next, we describe the full-band quasi-harmonic adaptive sinusoidal modeling Appl. Sci. 2016, 6, 127 4 of 20 behind eaQHM. Then, we present the experimental setup, describe the musical instrument sound database used in this work and the analysis parameters. We proceed to the experiments, present the results, and evaluate the performance of QSS, EDS, and eaQHM in modeling musical instrument sounds. Finally, we discuss the results and present conclusions and perspectives for future work. 2. Full-Band Modeling Spectrum decomposition splits the spectrum of musical instrument sounds into a sinusoidal component and a residual as illustrated in Figure 1a. Spectrum decomposition assumes that there are partials only up to a certain cutoff frequency f , above which there is only noise. Figure 1a represents the spectral peaks as spikes on top of colored noise (wide light grey frequency bands) and f as the separation between the sinusoidal and residual components. Therefore, f determines the number of sinusoids because only the peaks at the lower frequency end of the spectrum are represented with sinusoids (narrow dark grey bars) and the rest is considered wide-band and stochastic noise existing across the whole range of the spectrum. There is noise between the spectral peaks and at the higher end of the spectrum. In a previous study [17], we showed that the residual from the SM is perceptually different from ﬁltered (colored) white noise. Figure 1a shows that there are spectral peaks left in the residual because the spectral peaks above f are buried under the estimation noise ﬂoor (and sidelobes). Consequently, the residual from sinusoidal models that rely on spectral decomposition such as the SM is perceptually different from ﬁltered white noise. PSD PSD Fs/2 Fs/2 Sinusoidal Residual Full-Band Harmonic Template Spectral Peaks Noise Sinusoids Frequency Frequency (a) Spectrum Decomposition (b) Full-Band Harmonic Template Figure 1. Illustration of the spectral decomposition and full-band modeling paradigms. From an acoustic point of view, the physical behavior of musical instruments can be modeled as the interaction between an excitation and a resonator (the body of the instrument) [16]. This excitation is responsible for the oscillatory modes whose amplitudes are shaped by the frequency response of the resonator. The excitation signal commonly contains discontinuities, resulting in wide-band spectra. For instance, the vibration of the reed in woodwinds can be approximated by a square wave [49], the friction between the bow and the strings results in an excitation similar to a sawtooth wave [16], the strike in percussion instruments can be approximated by a pulse [2], while the vibration of the lips in brass instruments results in a sequence of pulses [50] (somewhat similar to the glottal excitation, which is also wide band [46]). Figure 1b illustrates a full-band harmonic template spanning the entire frequency range, ﬁtting sinusoids to spectral peaks in the vicinity of harmonics of the fundamental frequency f . The spectrum of musical instruments is known to present deviations from perfect harmonicity [16], but quasi-harmonicity is supported by previous studies [51] that found deviations as small as 1%. In this work, the full-band harmonic template becomes quasi-harmonic after the estimation of parameters via least-squares followed by a frequency correction mechanism (see details in Section 3.1). Therefore, full-band spectral modeling assumes that both the excitation and the instrumental noise are wide band. Appl. Sci. 2016, 6, 127 5 of 20 3. Adaptive Sinusoidal Modeling with eaQHM In what follows, x (n) is the original sound waveform and x ˆ (n) is the sinusoidal model with sample index n. Then, the following relation holds: x (n) = x ˆ (n) + e (n) , (1) where e (n) is the modeling error or residual. Each frame of x (n) is x (n, m) = x (n) w (n mH) , m = 0, , M 1, (2) where m is the frame number, M is the number of frames, and H is the hop size. The analysis window w (n) has L samples and it deﬁnes the frame size. Typically, H < L such that the frames m overlap. Figure 2 presents an overview of the modeling steps in eaQHM. The feedback loop illustrates the adaptation cycle, where x ˆ (n) gets closer to x (n) with each iteration. The iterative process stops when the ﬁt improves by less than a threshold #. The dark blocks represent parameter estimation based on the quasi-harmonic model (QHM), followed by interpolation of the parameters across frames before additive [1] resynthesis (instead of overlap add (OLA) [52]). The resulting time-varying sinusoids are used as nonstationary basis functions for the next iteration, so the adaptation procedure illustrated in Figure 3 iteratively projects x (n) onto x (n). Next, QHM is summarized, followed by parameter interpolation and then eaQHM. Least Frequency Parameter Resynthesis x(t) Windowing x(t) Squares Interpolation Correction Time-Varying Basis Functions Figure 2. Block diagram depicting the modeling steps in the extended adaptive Quasi-Harmonic Model (eaQHM). The blocks with a dark background correspond to parameter estimation, while the feedback loop illustrates adaptation as iteration cycles around the loop. See text for the explanation of the symbols. F Frequency F Parameter F correction Interpolation Iteration 1 T T T Original Model becomes Basis Basis Model Parameter F F F Frequency Window Interpolation correction Iteration 2 T T Figure 3. Illustration of the adaptation of the frequency trajectory of a sinusoidal partial inside the analysis window in eaQHM. The ﬁgure depicts the ﬁrst and second iterations of eaQHM around the loop in Figure 2, showing local adaptation as the iterative projection of the original waveform onto the model. 3.1. The Quasi-Harmonic Model (QHM) j2pn /f k s ˆ QHM [48] projects x (n, m) onto a template of sinusoids e with constant frequencies f and sampling frequency f . QHM estimates the parameters of x ˆ (n, m) using s Appl. Sci. 2016, 6, 127 6 of 20 j2p f n/ f x ˆ n, m = a + nb e , (3) ( ) ( ) å k k k=K where k is the partial number, K is the number of real sinusoids, a the complex amplitude and b is k k th j2pn /f k s the complex slope of the k sinusoid. The term nb arises from the derivative of e with respect to frequency. The constant frequencies f deﬁne the spectral template used by QHM to ﬁt the analysis parameters a and b by least-squares (LS) [44,45]. In principle, any set of frequencies f can be used k k k because the estimation of a and b also provides a means of correcting the initial frequency values f k k k by making f converge to nearby frequencies f present in the signal frame. The mismatch between k k ˆ ˆ f and f leads to an estimation error h = f f . Pantazis et al. [48] showed that QHM provides k k k k k an estimate of h given by f Refa g Imfb g Imfa g Refb g s k k k k h ˆ = , (4) 2p ja j which corresponds to the frequency correction block in Figure 2. Then x (n, m) is locally synthesized as ˆ ˆ j 2pF n/ f +f ( s ) k k ˆ ˆ x (n, m) = a e , (5) å k k=K ˆ ˆ ˆ 6 where a ˆ = ja j, F = f + h ˆ , and f = a are constant inside the frame m. k k k k k k k The full-band harmonic spectral template shown in Figure 1b is obtained by setting f = k f with k 0 k an integer and 1 k /2 f . The f is not necessary to estimate the parameters, but it improves 0 0 the ﬁt because the initial full-band harmonic template approximates better the spectrum of isolated quasi-harmonic sounds. QHM assumes that the sound being analyzed contains a single source, so, for isolated notes from pitched musical instruments, a constant f is used across all frames m. 3.2. Parameter Interpolation across Frames ˆ ˆ The model parameters a ˆ , F , and f from Equation (5) are estimated as samples at the frame k k k 1 ˆ rate /H of the amplitude- and frequency-modulation (AM–FM) functions a ˆ n and f n = ( ) ( ) k k 2p ˆ ˆ /f F (n) + f , which describe, respectively, the long-term amplitude and frequency temporal k k variation of each sinusoid k. For each frame m, a ˆ t, m and F t, m are estimated using the sample ( ) ( ) k k index at the center of the frame n = t as reference. Resynthesis of x ˆ (n, m) requires a ˆ (n, m) and F n, m at the signal sampling rate f . Equation (5) uses constant values, resulting in locally stationary ( ) sinusoids with constant amplitudes and frequencies inside each frame m. However, the parameter values might vary across frames, resulting in discontinuities such as a ˆ (t, m) 6= a ˆ (t, m + 1) due to temporal variations happening at the frame rate /H. k k OLA resynthesis [52] uses the analysis window w (n) to taper discontinuities at the frame boundaries by resynthesizing x ˆ (n, m) = x ˆ (n) w (n) for each m similarly to Equation (2) and then overlap-adding ˆ ˆ x (n, m) across m to obtain x (n). Additive synthesis is an alternative to OLA that results in smoother temporal variation [20] by ˆ ˆ ﬁrst interpolating a (t, m) and f (t, m) across m and then summing over k. In this case, a (n) is k k k obtained by linear interpolation of a ˆ (t, m) and a ˆ (t, m + 1). Recursive calculation across m results in k k a piece-wise linear approximation of a (n). F (n) is estimated via piece-wise polynomial interpolation k k ˆ ˆ ˆ of F (t, m) across m with quadratic splines, and f (n) is obtained integrating F (n) in two steps k k k ˆ ¯ because f (t, m) is wrapped around 2p across m. First, f (n) is calculated as k k m+1 2p ¯ ˆ f (n) = f (t, m) + F (u) . (6) k k å k u=m Appl. Sci. 2016, 6, 127 7 of 20 ¯ ¯ ˆ The calculation of f n using Equation (6) does not guarantee that f t, m + 1 = f t, m + 1 + 2pP, ( ) ( ) ( ) k k k with P the closest integer to unwrap the phase (see details in [45]). Thus, f (n) is calculated as m+1 2p p (u mt) ˆ ˆ ˆ f (n) = f (t, m) + F (u) + g sin , (7) k k å k f (m + 1) t mt u=m where the term given by the sine function ensures continuity with f (t, m + 1) when g is ˆ ¯ p f (t, m + 1) + P f (t, m + 1) k k g = , (8) 2 (m + 1) t mt ˆ ¯ with P given byjf (t, m + 1) f (t, m + 1)j (see [45]). k k 3.3. The Extended Adaptive Quasi-Harmonic Model (eaQHM) Pantazis et al. [45] proposed adapting the phase of the sinusoids. The adaptive procedure applies LS, frequency correction, and frequency interpolation iteratively (see Figure 2), projecting x (n, m) onto x ˆ (n, m). Figure 3 shows the ﬁrst and second iterations to illustrate adaptation of one sinusoid. Kafentzis et al. [43] adapted both the instantaneous amplitude and the instantaneous phase of x ˆ (n, m) with a similar iterative procedure in eaQHM. The analysis stage uses jF (n,m) ˆ k x ˆ (n, m) = (a + nb ) A (n, m) e , (9) å k k k k=K where A (n, m) and F (n, m) are functions of the time-varying instantaneous amplitude and phase of k k each sinusoid, respectively [43,45], obtained from the parameter interpolation step and deﬁned as a ˆ (n) A (n, m) = , (10a) a (t, m) ˆ ˆ ˆ F (n, m) = f (n) f (t, m) , (10b) k k k where a (n) is the piece-wise linear amplitude and f (n) is estimated using Equation (7). Finally, k k eaQHM models x (n) as a set of amplitude and frequency modulated nonstationary sinusoids given by jf (n) k,i1 x ˆ (n) = a ˆ (n) e , (11) i å k,i1 k=K where a ˆ (n) and f (n) are the instantaneous amplitude and phase from the previous iteration k,i1 k,i1 i 1. Adaptation results from the iterative projection of x (n) onto x ˆ (n) from i 1 as the model x ˆ (n) are used as nonstationary basis functions locally adapted to the time-varying behavior of x n . Note that ( ) jF (n,m) ˆ k Equation (9) is simply Equation (3) with a nonstationary basis A (n, m) e . In fact, Equation (9) represents the next parameter estimation step, which will be again followed by frequency correction as in Figure 2. The convergence criterion for eaQHM is either a maximum number of iterations i or an adaptation threshold # calculated as i1 i SRER SRER < #, (12) i1 SRER where the signal-to-reconstruction-error ratio (SRER) is calculated as RMS (x) RMS (x) SRER = 20 log = 20 log . (13) 10 10 RMS (x x ˆ) RMS (e) Appl. Sci. 2016, 6, 127 8 of 20 The SRER measures the ﬁt between the model x ˆ n and the original recording x n by dividing ( ) ( ) the total energy in x (n) by the energy in the residual e (n). The higher the SRER, the better the ﬁt. Note that # stops adaptation whenever the ﬁt does not improve from iteration i 1 to i regardless of the absolute SRER value. Thus, even sounds from the same instruments can reach different SRER. 4. Experimental Setup We now investigate the full-band representation of musical instrument sounds and the nonstationarity of the adaptive AM–FM sinusoids from eaQHM. We aim to show that spectral decomposition fails to capture partials at the higher end of the spectrum so full-band quasi-harmonic modeling increases the quality of analysis and resynthesis by capturing sinusoids across the full range of the spectrum. Additionally, we aim to show that adaptive AM–FM sinusoids from eaQHM capture nonstationary partials inside the frame. We compare full-band modeling with eaQHM against the SM [1,20] and EDS estimated with ESPRIT [36] using the same number of partials K. We assume that the musical instrument sounds under investigation can be well represented as quasi-harmonic. Thus, we set K to the highest harmonic number k below Nyquist frequency /2 or equivalently the max highest integer K that satisﬁes K f s/2. The fundamental frequency f of all sounds was estimated 0 0 using the sawtooth waveform inspired pitch estimator (SWIPE) [53] because in the experiments the frame size L, the maximum number of partials K , and the full-band harmonic template depend on max f . In the SM, K is the number of spectral peaks modeled by sinusoids. For EDS, ESPRIT uses K to determine the separation between the dimension of the signal space (sinusoidal component) and of the residual. The SM is considered the baseline for comparison due to the quasi-stationary nature of the sinusoids and the need for spectral decomposition. EDS estimated with ESPRIT is considered the state-of-the-art due to the accurate analysis and synthesis and constant frequency of EDS inside the frame m. We present a comparison of the local and global SRER as a function of K and L for the SM and EDS against eaQHM in two experiments. In experiment 1, we vary K from 2 to K max and record the SRER. In experiment 2, we vary L from 3T f to 8T f samples and record the SRER, s s 0 0 where T = /f is the fundamental period. The local SRER is calculated within the ﬁrst frame m = 0, 0 0 where we expect the attack transients to be. The ﬁrst frame is centered at the onset with t = 0 (and the ﬁrst half is zero-padded), so artifacts such as pre-echo (in the ﬁrst half of the frame) are also expected to be captured by the local SRER. The global SRER is calculated across all frames, thus considering the whole sound signal x ˆ n . Next, we describe the musical instrument sounds modeled and the selection ( ) of parameter values for the algorithms. 4.1. The Musical Instrument Sound Dataset In total, 92 musical instrument sounds were selected. “Popular ” and “Keyboard” musical instruments are from the RWC Music Database: Musical Instrument Sound [54]. All other sounds are from the Vienna Symphonic Library [55] database of musical instrument samples. Table 1 lists the musical instrument sounds used. The recordings were chosen to represent the range of musical instruments commonly found in traditional Western orchestras and in popular recordings. Some instruments feature different registers (alto, baritone, bass, etc). All sounds used belong to the same pitch class (C), ranging in pitch height from C2 (f 0 65 Hz) to C6 (f 0 1046 Hz). The dynamics is indicated as forte (“f”) or fortissimo (“ff”), and the duration of most sounds is less than 2 s. Normal attack (“na”) and no vibrato (“nv”) were chosen whenever available. Presence of vibrato (“vib”), progressive attack (“pa”), and slow attack (“sa”) are indicated, as well as different playing modes such as staccato (“stacc”), sforzando (“sforz”), and pizzicato (“pz”), achieved by plucking string instruments. Extended techniques were also included, such as tongue ram (“tr ”) for the ﬂute, près de la table (“pdlt”) for the harp, muted (“mu”) strings, and bowed idiophones (vibraphone, xylophone, etc.) for short (“sh”) and long (“lg”) sounds. Appl. Sci. 2016, 6, 127 9 of 20 Different mallet materials such as metal (“met”), plastic (“pl”), and wood (“wo”) and hardness such as soft (“so”), medium (“med”), and hard (“ha”) are indicated. Table 1. Musical instrument sounds used in all experiments. See text in Section 4.1 for a description of the terms in brackets. Sounds in bold were used in the listening test described in Section 6. The quasi-harmonic model (QHM) failed for the sounds in italics marked *. Family Musical Instrument Sounds Bass Trombone (C3 f nv na), Bass Trombone (C3 f stac), Bass Trumpet (C3 f na vib), Cimbasso (C3 f nv na), Cimbasso (C3 f stac), Contrabass Trombone* (C2] f stac), Contrabass Tuba (C3 f na), Contrabass Tuba (C3 f stac), Cornet (C4 f), French Horn Brass (C3 f nv na), French Horn (C3 f stac), Piccolo Trumpet (C5 f nv na), Piccolo Trumpet (C5 f stac), Tenor Trombone (C3 f na vib), Tenor Trombone (C3 f nv sa), Tenor Trombone (C3 f stac), C Trumpet (C4 f nv na), C Trumpet (C4 f stac), Tuba (C3 f vib na), Tuba (C3 f stac), Wagner Tuba (C3 f na), Wagner Tuba (C3 f stac) Alto Flute (C4 f vib na), Bass Clarinet (C3 f na), Bass Clarinet (C3 f sforz), Bass Clarinet (C3 f stac), Bassoon (C3 f na), Bassoon (C3 f stac), Clarinet (C4 f na), Clarinet Woodwinds (C4 f stac), Contra Bassoon* (C2 f stac), Contra Bassoon* (C2 f sforz), English Horn (C4 f na), English Horn (C4 f stac), Flute (C4 f nv na), Flute (C4 f stac), Flute (C4 f tr), Flute (C4 f vib na), Oboe 1 (C4 f stac), Oboe 2 (C4 f nv na), Oboe (C4 f pa), Piccolo Flute (C6] f vib sforz), Piccolo Flute (C6 f nv ha ff) Plucked Cello (C3 f pz vib), Harp (C3 f), Harp (C3 f pdlt), Harp (C3 f mu), Viola (C3 f pz vib), Strings Violin (C4 f pz mu) Bowed Cello (C3 f vib), Cello (C3 f stac), Viola (C3 f vib), Viola (C4 f stac), Violin (C4 f), Strings Violin (C4] ff vib), Violin (C4 f stac) Glockenspiel (C4 f), Glockenspiel (C6 f wo), Glockenspiel (C6 f pl), Glockenspiel (C6 f met), Marimba (C4 f), Vibraphone (C4 f ha 0), Vibraphone (C4 f ha fa), Struck Vibraphone (C4 f ha sl), Vibraphone (C4 f med 0), Vibraphone (C4 f med fa), Percussion Vibraphone (C4 f med 0 mu), Vibraphone (C4 f med sl), Vibraphone (C4 f so 0), Vibraphone (C4 f so fa), Xylophone (C5 f GA L), Xylophone (C5 met), Xylophone (C5 f HO L), Xylophone (C5 f mP L), Xylophone (C5 f wP L) Bowed Vibraphone (C4 f sh vib), Vibraphone (C4 f sh nv), Vibraphone (C4 f lg nv) Percussion Accordion (C3] f), Acoustic Guitar (C3 f), Baritone Sax (C3 f), Bass Harmonica Popular (C3] f), Chromatic Harmonica (C4 f), Classic Guitar (C3 f), Mandolin (C4 f), Pan Flute (C5 f), Tenor Sax (C3] f), Ukulele (C4 f) Keyboard Celesta (C3 f na nv), Celesta (C3 f stac), Clavinet (C3 f), Piano (C3 f) In what follows, we will present the results for 89 sounds because QHM failed to adapt for the three sounds marked * in Table 1. The estimation of parameters for QHM uses LS [45]. The matrix inversion fails numerically when the matrix is close to singular (see [44]). The fundamental frequency (C2 65 Hz) of these sounds determines a full-band harmonic spectral template whose frequencies are separated by C2, which results in singular matrices. 4.2. Analysis Parameters The parameter estimation for the SM follows [20] with a Hann window for analysis, and phase interpolation across frames via cubic splines followed by additive resynthesis. The estimation of parameters for EDS uses ESPRIT with a rectangular window for analysis and OLA resynthesis [36]. Parameter estimation in eaQHM used a Hann window for analysis and additive resynthesis following Equation (11). In all experiments, # in Equation (12) is set to 0.01 and f = 16 kHz for all sounds. The step size for analysis (and OLA synthesis) was H = 16 samples for all algorithms, corresponding Appl. Sci. 2016, 6, 127 10 of 20 to 1 ms. The frame size is L = qT f samples with q an integer. The size of the FFT for the SM is kept 0 s constant at N = 4096 samples with zero padding. 5. Results and Discussion 5.1. Adaptation Cycles in eaQHM Figure 4 shows the global and local SRER as a function of the number of adaptation cycles (iterations). Each plot was averaged across the sounds indicated, while the plot “all instruments” is an average of the previously shown. The SRER increases quickly after a few iterations, slowly converging to a ﬁnal value considerably higher than before adaptation. Iteration 0 corresponds to QHM initialized with the full-band harmonic template, thus Figure 4 demonstrates that the adaptation of the sinusoids by eaQHM increases the SRER when compared to QHM. Brass Woodwinds 70 80 Global Global Local Local 60 70 50 60 40 50 30 40 20 30 10 20 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 Number of Adaptations Number of Adaptations Bowed Strings Plucked Strings 80 70 Global Global Local Local 10 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 Number of Adaptations Number of Adaptations Struck Percussion Popular 65 65 Global Global 60 Local Local 20 15 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 Number of Adaptations Number of Adaptations Figure 4. Cont. SRER (dB) SRER (dB) SRER (dB) SRER (dB) SRER (dB) SRER (dB) Appl. Sci. 2016, 6, 127 11 of 20 Keyboard All Instruments 60 70 Global Global 55 Local Local 45 50 30 30 15 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 Number of Adaptations Number of Adaptations Figure 4. Plot of the signal-to-reconstruction-error ratio (SRER) as a function of number of adaptations to illustrate how adaptation increases the SRER in eaQHM. Iteration 0 corresponds to QHM initialized with the full-band harmonic spectral template. 5.2. Experiment 1: Variation Across K (Constant L = 3T f ) We ran each algorithm varying K (the frame size was kept at L = 3T f ) and recorded the resulting 0 s local and global SRER values. We started from K = 2 and increased K by two partials up to K . max Figure 5 shows the local and global SRER (averaged across sounds) as a function of K for the SM, EDS, and eaQHM. Sounds with different f values have different K . Figure 5 shows that the addition of max partials for the SM does not result in an increase in SRER after a certain K. EDS tends to continuously increase the SRER with more partials that capture more spectral energy. Finally, eaQHM increases the SRER up to K . max Brass C3 Woodwinds C3 Bowed Strings C4 70 70 50 SM SM SM EDS EDS EDS 60 60 eaQHM eaQHM eaQHM 50 50 40 40 30 30 20 20 10 10 0 0 0 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 5 10 15 20 25 30 35 Number of Partials Number of Partials Number of Partials Brass C3 Woodwinds C3 Bowed Strings C4 80 80 80 SM SM SM 70 EDS 70 EDS EDS eaQHM eaQHM eaQHM 60 60 50 50 40 40 30 30 20 20 10 10 0 0 10 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 5 10 15 20 25 30 35 Number of Partials Number of Partials Number of Partials Figure 5. Cont. Global SRER (dB) Local SRER (dB) SRER (dB) Global SRER (dB) Local SRER (dB) SRER (dB) Global SRER (dB) Local SRER (dB) Appl. Sci. 2016, 6, 127 12 of 20 Plucked Strings C3 Struck Percussion C4 Keyboard C3 80 70 60 SM SM SM 70 EDS 60 EDS EDS eaQHM eaQHM eaQHM 60 50 50 40 40 30 30 20 20 10 10 0 0 −10 −10 0 10 20 30 40 50 60 70 5 10 15 20 25 30 35 0 10 20 30 40 50 60 70 Number of Partials Number of Partials Number of Partials Plucked Strings C3 Struck Percussion C4 Keyboard C3 80 70 60 SM SM SM 70 EDS 60 EDS EDS eaQHM eaQHM eaQHM 60 50 50 40 40 30 30 30 20 20 10 10 0 0 −10 0 0 10 20 30 40 50 60 70 5 10 15 20 25 30 35 0 10 20 30 40 50 60 70 Number of Partials Number of Partials Number of Partials Figure 5. Comparison between local and global SRER as a function of the number of partials for the three models (the standard sinusoidal model (SM), exponentially damped sinusoids (EDS), and eaQHM). The bars around the mean are the standard deviation across different sounds from the family indicated. The distributions are not symmetrical as suggested by the bars. The SM, EDS, and eaQHM use different analysis and different synthesis methods, which partially explains the different behavior under variation of K. More importantly, the addition of partials for each algorithm uses different criteria. Both the SM and EDS use spectral energy as a criterion, while eaQHM uses the frequencies of the sinusoids assuming quasi-harmonicity. In the SM, a new sinusoid is selected as the next spectral peak (increasing frequency) with spectral energy above a selected threshold regardless of the frequency of the peak. In fact, the frequency is estimated from the peak afterwards. For EDS, K determines the number of sinusoids used upon resynthesis. However, ESPRIT ranks the sinusoids by decreasing eigenvalue rather than the frequency, adding partials with high spectral energy that will increase the ﬁt of the reconstruction. The frequencies of the new partials are not constrained by harmonicity. Finally, eaQHM uses the spectral template to search for nearby spectral peaks with LS and frequency correction. The sinusoids will converge to spectral peaks in the neighborhood of the harmonic template with K harmonically related partials starting from f . Therefore, K in eaQHM 0 max corresponds to full-band analysis and synthesis but not necessarily for the SM or EDS. 5.3. Experiment 2: Variation Across L (Constant K = K ) max We ran each algorithm varying L from 3T f to 8T f with a constant number of partials K 0 s 0 s max and measured the resulting local and global SRER. In the literature [46], L = 3T f is considered 0 s a reasonable value for speech and audio signals when using the SM. We are unaware of a systematic investigation of how L affects modeling accuracy for EDS. Figure 6 shows the local and global SRER (averaged across sounds) as a function of L expressed as q times T f , so sounds with different f 0 s 0 values have different frame size L in samples. Figure 6 shows that the SRER decreases with L for all algorithms. The SM seldom outperforms EDS or eaQHM, but it is more robust against variations of L. For the SM, L affects both spectral estimation and temporal representation. In the FFT, L determines the trade-off between temporal and spectral resolution, which affects the performance of the peak picking algorithm for parameter estimation. The temporal representation is affected because the parameters are an average across L referenced to the center of the frame. In turn, ESPRIT estimates EDS with constant frequency inside the frames referenced to the beginning of the frame, thus L affects the temporal modeling accuracy more than the Global SRER (dB) Local SRER (dB) Global SRER (dB) Local SRER (dB) Local SRER (dB) Global SRER (dB) Appl. Sci. 2016, 6, 127 13 of 20 spectral estimation. However, the addition of sinusoids might compensate for the stationary frequency of EDS inside the frame. Finally, the SRER for eaQHM decreases considerably when L increases because L adversely affects the frequency correction and interpolation mechanisms. Frequency correction is applied at the center of the analysis frame and eaQHM uses spline interpolation to capture frequency modulations across frames. Thus, adaptation improves the ﬁt more slowly for longer L, generally reaching a lower absolute SRER value. Brass C3 Woodwinds C3 Bowed Strings C4 70 60 50 SM SM SM EDS EDS EDS eaQHM eaQHM eaQHM 0 10 5 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 Window Size (x T0) Window Size (x T0) Window Size (x T0) Brass C3 Woodwinds C3 Bowed Strings C4 80 80 80 SM SM SM EDS EDS EDS 70 70 eaQHM eaQHM eaQHM 60 60 50 50 40 40 30 30 10 20 20 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 Window Size (x T0) Window Size (x T0) Window Size (x T0) Plucked Strings C3 Struck Percussion C4 Keyboard C3 80 70 60 SM SM SM EDS EDS EDS 70 50 eaQHM eaQHM eaQHM 60 40 50 30 40 20 30 10 20 0 10 10 −10 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 Window Size (x T0) Window Size (x T0) Window Size (x T0) Plucked Strings C3 Struck Percussion C4 Keyboard C3 80 70 60 SM SM SM EDS EDS EDS eaQHM eaQHM eaQHM 50 45 30 30 10 10 15 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 Window Size (x T0) Window Size (x T0) Window Size (x T0) Figure 6. Comparison between local and global SRER as a function of the size of the frame for the three models (SM, EDS, and eaQHM). The bars around the mean are the standard deviation across different sounds from the family indicated. The distributions are not symmetrical as suggested by the bars. 5.4. Full-Band Quasi-Harmonic Analysis with AM–FM Sinusoids To simplify the comparison and reduce the information, we present the differences of SRER instead of absolute SRER values. For each sound, we subtract the absolute SRER values (in dB) for the SM and EDS from that of eaQHM to obtain the differences of SRER. The local value measures the ﬁt for the attack and the global value measures the overall ﬁt. Positive values indicate that eaQHM Global SRER (dB) Local SRER (dB) Global SRER (dB) Local SRER (dB) Global SRER (dB) Local SRER (dB) Global SRER (dB) Local SRER (dB) Local SRER (dB) Global SRER (dB) Global SRER (dB) Local SRER (dB) Appl. Sci. 2016, 6, 127 14 of 20 results in higher SRER than the other method for that particular sound, while a negative value means the opposite. The different SRER values are averaged across all musical instruments that belong to the family indicated. Table 2 shows the comparison of eaQHM against EDS and the SM with K = K max and L = 3T f clustered by instrumental family. The distributions are not symmetrical around the mean as suggested by the standard deviation. Table 2. Local and global difference of signal-to-reconstruction-error ratio (SRER) comparing eaQHM with exponentially damped sinusoids (EDS) and eaQHM with the standard sinusoidal model (SM) for the frame size L = 3T f and number of partials K = K . The three C2 sounds are not included. 0 s max SRER (eaQHM-EDS) SRER (eaQHM-SM) Family Local (dB) Global (dB) Local (dB) Global (dB) Brass 9.4 7.0 12.5 6.8 27.3 5.8 31.9 4.0 Woodwinds 7.8 3.9 22.0 5.9 30.9 7.5 36.1 4.7 Bowed Strings 12.2 4.2 24.1 6.7 35.0 4.7 40.0 4.7 Plucked Strings 8.3 5.0 4.7 3.4 49.5 4.3 46.6 5.1 Bowed Percussion 2.7 2.5 16.3 2.2 12.7 2.6 37.6 3.6 Struck Percussion 10.5 4.8 10.1 2.6 28.6 13.3 26.0 11.3 Popular 6.3 3.3 11.9 7.0 26.5 10.8 27.5 11.6 Keyboard 5.7 3.4 5.4 4.3 37.0 8.0 34.6 2.0 Total 5.3 2.4 13.2 3.3 31.0 7.1 35.0 5.9 Thus, Table 2 summarizes the result of full-band quasi-harmonic analysis with adaptive AM–FM sinusoids from eaQHM comparing with the SM and EDS under the same conditions, namely the same number of sinusoids K = K and frame size L = 3T f . When eaQHM is compared to max 0 s the SM, both local and global difference SRER are positive for all families. This means that full-band quasi-harmonic modeling with eaQHM results in a better ﬁt for the analysis and synthesis of musical instrument sounds. When eaQHM is compared to EDS, all global difference SRER are positive and all local difference SRER are positive except for Brass and Bowed Percussion. Thus, EDS can ﬁt the attack of Brass and Bowed Percussion better than eaQHM. The exponential amplitude envelope of EDS is considered suitable to model percussive sounds with sharp attacks such as harps, pianos, and marimbas [36,37]. The musical instrument families that contain percussive sounds are Plucked strings, Struck percussion, and Keyboard. Table 2 shows that eaQHM outperformed EDS locally and globally for all percussive sounds. The ability to adapt the amplitude of the sinusoidal partials to the local characteristics of the waveform makes eaQHM extremely ﬂexible to ﬁt both percussive and nonpercussive musical instrument sounds. On the other hand, both Brass and Bowed Percussion present slow attacks typically lasting longer than one frame L. Note that /f = 3T 22 ms for C3 ( f 131 Hz) while Bowed 0 0 Percussion can have attacks longer than 100 ms. Therefore, one frame L = 3T f does not measure the 0 s ﬁt for the entire duration of the attack. Note that the local SRER is important because the global SRER measures the overall ﬁt without indication of where the differences lie in the waveform. For musical instrument sounds, differences in the attack impact the results differently than elsewhere because the attack is among the most important perceptual features in dissimilarity judgment [56–58]. Consequently, when comparing two models with the global SRER, it is only safe to say that a higher SRER indicates that resynthesis results in a waveform that is closer to the original recording. 5.5. Full-Band Modeling and Quasi-Harmonicity Time-frequency transforms such as the STFT represent L samples in a frame with N DFT coefﬁcients provided that N L. Note that N 2 C, corresponding to p = 2N real numbers. There is signal expansion whenever the representation uses p parameters to represent L samples and p > L. Sinusoidal models represent L samples in a frame with K sinusoids. In turn, each sinusoid is described Appl. Sci. 2016, 6, 127 15 of 20 by p parameters, requiring pK parameters to represent L samples. Therefore, there is a maximum number of sinusoids to represent a frame without signal expansion. For example, white noise has a ﬂat spectrum across that would take a large number of sinusoids close together in frequency resulting in signal expansion. The pK parameters to represent L samples can be interpreted as the degrees of freedom of the ﬁt. As a general rule, more parameters mean greater ﬂexibility of representation (hence potentially a better ﬁt), but with the risk of over-ﬁtting. Table 3 shows a comparison of the number of real parameters p (per sinusoid k per frame m) for the analysis and synthesis stages of the SM, EDS, and eaQHM. Note that eaQHM and EDS require more parameters than the SM at the analysis stage, but eaQHM and the SM require fewer parameters than EDS for the synthesis stage. The difference is due to the resynthesis strategy used by each algorithm. EDS uses OLA resynthesis, which requires all analysis parameters for resynthesis, while both eaQHM and the SM use additive resynthesis. Table 3. Comparison of the number of real parameters p per sinusoid k per frame m for the analysis and synthesis stages of the SM, EDS, and eaQHM. The table presents the number of real parameters p to estimate and to resynthesize each sinusoid inside a frame. Number of Real Parameters p Per Sinusoid k Per Frame m SM EDS eaQHM Analysis p = 3 p = 4 p = 4 Synthesis p = 3 p = 4 p = 3 Harmonicity of the partials guarantees that there are no signal expansions in full-band modeling 1 f with sinusoids. Consider L = qT f with q an integer and T = /f . Using K s/2 f 0 s 0 0 max 0 (p f ) quasi-harmonic partials and p parameters per partial, it takes at most pK = /2 f numbers max 0 (q f ) (pK ) p to represent L = qT f = s /f samples, which gives the ratio r = max /L = /2q. Table 3 shows 0 s 0 that analysis with eaQHM requires p = 4 real parameters. Thus, a frame size with q > 2 is enough to guarantee no signal expansion. This result is due to the full-band paradigm using K harmonically max related partials, not a particular model. The advantage of full-band modeling results from the use of one single component instead of decomposition. Table 4 compares the complexity of SM, EDS, and eaQHM in Big-O notation. The complexity of SM is O (N log N), which is the complexity of the FFT algorithm for size N inputs. ESPRIT estimates the parameters of EDS with singular value decomposition (SVD), whose algorithmic complexity is 2 3 O(L + K ) for an L by K matrix (frame size versus the number of sinusoids). Adaptation in eaQHM is an iterative ﬁt where each iteration i requires running the model again as described in Section 3. For each iteration i, eaQHM estimates the parameters with least squares (LS) via calculation of the pseudoinverse matrix using QR decomposition. The algorithmic complexity of QR decomposition is O K for a square matrix of size K (the number of sinusoids). Adaptation of the sinusoids in eaQHM can result in over-ﬁtting. The amplitude and frequency modulations capture temporal variations inside the frame such as transients and instrumental noise around the partials. However, adaptation must not capture noise resulting from sources such as quantization, which is extraneous to the sound. Ideally, the residual should contain only external additive noise without any perceptually important information from the sound [17]. Table 4. Comparison of algorithmic complexity in Big-O notation. The table presents the complexity as a function of the size of the input N, L, and K and the number of iterations i. See text for details. Algorithmic Complexity SM EDS eaQHM 2 3 3 Complexity O (N log N) O(L + K ) O iK Appl. Sci. 2016, 6, 127 16 of 20 6. Evaluation of Perceptual Transparency with a Listening Test We performed a listening test to validate the full-band representation of musical instrument sounds with eaQHM. The aim of the test was to evaluate whether full-band modeling with eaQHM resulted in resynthesized musical instrument sounds that are perceptually indistinguishable from the original recordings. The 21 sounds in bold in Table 1 were selected for the listening test, which presented pairs original and resynthesis. The participants were instructed to listen to each pair as many times as necessary and to answer the question “Can you tell the difference between the two sounds in each pair?” Full-band (FB) resynthesis with eaQHM (using a harmonic template with K = K max sinusoids) was used for all 21 musical instrument sounds. For nine of these sounds, half-band (HB) resynthesis with eaQHM (using a harmonic template with K = max/2 sinusoids) was also included as control group to test the aptitude of the listeners and compare against the FB version. All HB versions were placed at random positions among the FB, so the test presented 30 pairs overall. The listening test can be accessed at [59]. In total, 20 people aged between 26 and 40 took the test. The participants declared themselves as experienced with listening tests and familiar with signal processing techniques. Figure 7 shows the result of the listening test as the percentage of the people who answered “no” to the question, indicating that they cannot tell the difference between the original recording and the resynthesis. In general, the result of the listening test shows that full-band modeling with eaQHM results in perceptually indistinguishable resynthesis for most musical instrument sounds tested. The ﬁgure indicates that 10 out of the 21 FB sounds tested were rated perceptually identical to the original by 100% of the listeners. As expected, most HB sounds fall under 30% (except Tenor Trombone) and most FB sounds lie above 70% (except Pan Flute). Table 1 shows that Tenor Trombone is played at C3 and Pan Flute at C5. The Tenor Trombone sound is not bright, which indicates that there is little spectral energy at the higher frequency end of the spectrum. Thus, the HB version synthesized with fewer partials than K was perceived as identical to the original by max some listeners. The Pan Flute sound contains a characteristic breathing noise captured as AM–FM elements in eaQHM. However, the breathing noise in the full-band version sounds brighter than the original recording and most listeners were able to tell the difference. Perceptual Similarity of Full-Band Modeling 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 FB 0.1 HB Figure 7. Result of the listening test on perceptual similarity of full-band (FB) and half-band (HB) resynthesis with eaQHM compared to the original recording. The sounds used in the listening test appear in bold in Table 1. Acoustic Guitar Bass Harmonica Accordion Bass Clarinet Cello Celesta Flute Piano Tenor Sax Tenor Trombone Ukulele C Trumpet Pan Flute Oboe 1 Oboe 2 Harp Glockenspiel French Horn Vibraphone Xylophone Viola Percentage (%) Appl. Sci. 2016, 6, 127 17 of 20 7. Conclusions We proposed the full-band quasi-harmonic modeling of musical instrument sounds with adaptive AM–FM sinusoids from eaQHM as an alternative to spectrum decomposition. We used the SRER to measure the ﬁt of the sinusoidal model to the original recording of 89 percussive and nonpercussive musical instruments sounds from different families. We showed that full-band modeling with eaQHM results in higher global SRER values when compared to the standard SM and to EDS estimated with ESPRIT for K sinusoids and frame size L = 3T f . EDS resulted in higher local SRER than eaQHM max 0 s for two of nine instrumental families, namely Brass and Bowed Percussion. A listening test conﬁrmed that full-band modeling with eaQHM resulted in perceptually indistinguishable resynthesis for most musical instrument sounds tested. Future work should investigate a method to prevent over-ﬁtting with eaQHM. Additionally, the use of least-squares to estimate the parameters leads to matrices that are badly conditioned for sounds with low fundamental frequencies. A more robust estimation method to prevent bad-conditioning would improve the stability of eaQHM. Currently, eaQHM can only estimate the parameters of isolated sounds. We intend to develop a method for polyphonic instruments and music. Future work also involves using eaQHM in musical instrument sound transformation, estimation of musical expressivity features such as vibrato, and solo instrumental music. The companion webpage [60] contains sound examples. Finally, the proposal of a full-band representation of musical instrument sounds with adaptive sinusoids motivates further investigation on full-band extensions of other sinusoidal methods, such as SM and EDS used here. Acknowledgments: This work was partly supported by project “NORTE-01-0145-FEDER-000020” ﬁnanced by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF) and by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant 644283. The latter project also supplied funds for covering the costs to publish in open access. Author Contributions: Marcelo Caetano conceived and designed the experiments, analyzed the data, and wrote the manuscript. George P. Kafentzis performed the experiments, helped analyze the results, and revised the manuscript. Athanasios Mouchtaris supervised the research and revised the manuscript. Yannis Stylianou supervised the research. Conﬂicts of Interest: The authors declare no conﬂict of interest. References 1. Serra, X.; Smith, J.O. Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Comput. Music J. 1990, 14, 49–56. 2. Beauchamp, J.W. Analysis and synthesis of musical instrument sounds. In Analysis, Synthesis, and Perception of Musical Sounds; Beauchamp, J.W., Ed.; Modern Acoustics and Signal Processing; Springer: New York, NY, USA, 2007; pp. 1–89. 3. Quatieri, T.; McAuley, R. Audio signal processing based on sinusoidal analysis/synthesis. In Applications of Digital Signal Processing to Audio and Acoustics; Kahrs, M., Brandenburg, K., Eds.; Kluwer Academic Publishers: Berlin/Heidelberg, Germany, 2002; Chapter 9, pp. 343–416. 4. Serra, X.; Bonada, J. Sound Transformations based on the SMS high level attributes. Proc. Digit. Audio Eff. Workshop 1998, 5. Available online: http://mtg.upf.edu/ﬁles/publications/dafx98-1.pdf (accessed on 26 April 2016). 5. Caetano, M.; Rodet, X. Musical Instrument sound morphing guided by perceptually motivated features. IEEE Trans. Audio Speech Lang. Process. 2013, 21, 1666–1675. 6. Barbedo, J.; Tzanetakis, G. Musical instrument classiﬁcation using individual partials. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 111–122. 7. Herrera, P.; Bonada, J. Vibrato Extraction and parameterization in the spectral modeling synthesis framework. Proc. Digit. Audio Eff. Workshop 1998, 99. Available online: http://www.mtg.upf.edu/ﬁles/publications/ dafx98-perfe.pdf (accessed on 26 April 2016). Appl. Sci. 2016, 6, 127 18 of 20 8. Glover, J.; Lazzarini, V.; Timoney, J. Real-time detection of musical onsets with linear prediction and sinusoidal modeling. EURASIP J. Adv. Signal Process. 2011, doi:10.1186/1687-6180-2011-68. 9. Virtanen, T.; Klapuri, A. Separation of harmonic sound sources using sinusoidal modeling. In Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Istanbul, Turkey, 5–9 June 2000; Volume 2, pp. II765–II768. 10. Lagrange, M.; Marchand, S.; Rault, J.B. Long interpolation of audio signals using linear prediction in sinusoidal modeling. J. Audio Eng. Soc. 2005, 53, 891–905. 11. Hermus, K.; Verhelst, W.; Lemmerling, P.; Wambacq, P.; Huffel, S.V. Perceptual audio modeling with exponentially damped sinusoids. Signal Process. 2005, 85, 163–176. 12. Nsabimana, F.; Zolzer, U. Audio signal decomposition for pitch and time scaling. In Proceedings of the International Symposium on Communications, Control, and Signal Processing (ISCCSP), St Julians, Malta, 12–14 March 2008; pp. 1285–1290. 13. El-Jaroudi, A.; Makhoul, J. Discrete all-pole modeling. IEEE Trans. Commun. Technol. 1969, 39, 481–488. 14. Caetano, M.; Rodet, X. A source-ﬁlter model for musical instrument sound transformation. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 137–140. 15. Wen, X.; Sandler, M. Source-Filter Modeling in the Sinusoidal Domain. J. Audio Eng. Soc. 2010, 58, 795–808. 16. Fletcher, N.H.; Rossing, T.D. The Physics of Musical Instruments, 2nd ed.; Springer: New York, NY, USA, 1998. 17. Caetano, M.; Kafentzis, G.P.; Degottex, G.; Mouchtaris, A.; Stylianou, Y. Evaluating how well ﬁltered white noise models the residual from sinusoidal modeling of musical instrument sounds. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 20–23 October 2013; pp. 1–4. 18. Bader, R.; Hansen, U. Modeling of musical instruments. In Handbook of Signal Processing in Acoustics; Havelock, D., Kuwano, S., Vorländer, M., Eds.; Springer: New York, NY, USA, 2009; pp. 419–446. 19. Fletcher, N.H. The nonlinear physics of musical instruments. Rep. Prog. Phys. 1999, 62, 723–764. 20. McAulay, R.J.; Quatieri, T.F. Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Acoust. Speech Signal Process. 1986, 34, 744–754. 21. Green, R.A.; Haq, A. B-spline enhanced time-spectrum analysis. Signal Process. 2005, 85, 681–692. 22. Belega, D.; Petri, D. Frequency estimation by two- or three-point interpolated Fourier algorithms based on cosine windows. Signal Process. 2015, 117, 115–125. 23. Prudat, Y.; Vesin, J.M. Multi-signal extension of adaptive frequency tracking algorithms. Signal Process. 2009, 89, 96–973. 24. Candan, Ç. Fine resolution frequency estimation from three DFT samples: Case of windowed data. Signal Process. 2015, 114, 245–250. 25. Röbel, A. Adaptive additive modeling with continuous parameter trajectories. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 1440–1453. 26. Verma, T.S.; Meng, T.H.Y. Extending spectral modeling synthesis with transient modeling synthesis. Comput. Music J. 2000, 24, 47–59. 27. Laurenti, N.; De Poli, G.; Montagner, D. A nonlinear method for stochastic spectrum estimation in the modeling of musical sounds. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 531–541. 28. Daudet, L. A review on techniques for the extraction of transients in musical signals. Proc. Int. Symp. Comput. Music Model. Retr. 2006, 3902, 219–232. 29. Jang, H.; Park, J.S. Multiresolution sinusoidal model with dynamic segmentation for timescale modiﬁcation of polyphonic audio signals. IEEE Trans. Speech Audio Process. 2005, 13, 254–262. 30. Beltrán, J.R.; de León, J.P. Estimation of the instantaneous amplitude and the instantaneous frequency of audio signals using complex wavelets. Signal Process. 2010, 90, 3093–3109. 31. Levine, S.N.; Smith, J.O. A compact and malleable sines+transients+noise model for sound. In Analysis, Synthesis, and Perception of Musical Sounds; Beauchamp, J.W., Ed.; Modern Acoustics and Signal Processing; Springer: New York, NY, USA, 2007; pp. 145–174. 32. Markovsky, I.; Huffel, S.V. Overview of total least-squares methods. Signal Process. 2007, 87, 2283–2302. 33. Roy, R.; Kailath, T. ESPRIT-estimation of signal parameters via rotational invariance techniques. IEEE Trans. Acoust. Speech Signal Process 1989, 37, 984–995. Appl. Sci. 2016, 6, 127 19 of 20 34. Van Huffel, S.; Park, H.; Rosen, J. Formulation and solution of structured total least norm problems for parameter estimation. IEEE Trans. Signal Process. 1996, 44, 2464–2474. 35. Liu, Z.S.; Li, J.; Stoica, P. RELAX-based estimation of damped sinusoidal signal parameters. Signal Process. 1997, 62, 311–321. 36. Nieuwenhuijse, J.; Heusens, R.; Deprettere, E.F. Robust exponential modeling of audio signals. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Seattle, WA, USA, 12–15 May 1998; Volume 6, pp. 3581–3584. 37. Badeau, R.; Boyer, R.; David, B. EDS Parametric Modeling And Tracking of Audio Signals. In Proceedings of the 5th International Conference on Digital Audio Effects (DAFx), Hambourg, Germany, 26–28 September 2002; pp. 26–28. 38. Jensen, J.; Heusdens, R. A comparison of sinusoidal model variants for speech and audio representation. In Proceedings of the 2002 11th European Signal Processing Conference (EUSIPCO), Toulouse, France, 3–6 September 2002; pp. 1–4. 39. Auger, F.; Flandrin, P. Improving the readability of time-frequency and time-scale representations by the reassignment method. IEEE Trans. Signal Process. 1995, 43, 1068–1089. 40. Fulop, S.A.; Fitz, K. Algorithms for computing the time-corrected instantaneous frequency (reassigned) spectrogram, with applications. J. Acoust. Soc. Am. 2006, 119, 360–371. 41. Li, X.; Bi, G. The reassigned local polynomial periodogram and its properties. Signal Process. 2009, 89, 206–217. 42. Girin, L.; Marchand, S.; Di Martino, J.; Röbel, A.; Peeters, G. Comparing the order of a polynomial phase model for the synthesis of quasi-harmonic audio signals. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 19–22 October 2003; pp. 193–196. 43. Kafentzis, G.P.; Pantazis, Y.; Rosec, O.; Stylianou, Y. An extension of the adaptive quasi-harmonic model. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech, and Signal Processing, Kyoto, Japan, 25–30 March 2012; pp. 4605–4608. 44. Kafentzis, G.P.; Rosec, O.; Stylianou, Y. On the modeling of voiceless stop sounds of speech using adaptive quasi-harmonic models. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Portland, OR, USA, 9–13 September 2012. 45. Pantazis, Y.; Rosec, O.; Stylianou, Y. Adaptive AM–FM signal decomposition with application to speech analysis. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 290–300. 46. Degottex, G.; Stylianou, Y. Analysis and synthesis of speech using an adaptive full-band harmonic model. IEEE Trans. Audio Speech Lang. Process. 2013, 21, 2085–2095. 47. Caetano, M.; Kafentzis, G.P.; Mouchtaris, A.; Stylianou, Y. Adaptive sinusoidal modeling of percussive musical instrument sounds. In Proceedings of the European Signal Processing Conference (EUSIPCO), Marrakech, Morocco, 9–13 September 2013; pp. 1–5. 48. Pantazis, Y.; Rosec, O.; Stylianou, Y. On the Properties of a time-varying quasi-harmonic model of speech. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Brisbane, Australia, 22–26 September 2008; pp. 1044–1047. 49. Smyth, T.; Abel, J.S. Toward an estimation of the clarinet reed pulse from instrument performance. J. Acoust. Soc. Am. 2012, 131, 4799–4810. 50. Smyth, T.; Scott, F. Trombone synthesis by model and measurement. EURASIP J. Adv. Signal Process. 2011, doi:10.1155/2011/151436. 51. Brown, J.C. Frequency ratios of spectral components of musical sounds. J. Acoust. Soc. Am. 1996, 99, 1210–1218. 52. Borss, C.; Martin, R. On the construction of window functions with constant overlap-add constraint for arbitrary window shifts. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 337–340. 53. Camacho, A.; Flory, H.Y. A sawtooth waveform inspired pitch estimator for speech and music. J. Acoust. Soc. Am. 2008, 124, 1638–1652. Appl. Sci. 2016, 6, 127 20 of 20 54. Goto, M.; Hashiguchi, H.; Nishimura, T.; Oka, R. RWC Music Database: Music Genre Database and Musical Instrument Sound Database. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), Baltimore, MD, USA, 26–30 October 2003; pp. 229–230. Available online: http://staff.aist.go.jp/m.goto/RWC-MDB/ (accessed on 26 April 2016). 55. Vienna Symphonic Library–GmbH. Available online: http://www.vsl.co.at/ (accessed on 26 April 2016). 56. Grey, J.M.; Gordon, J.W. Multidimensional perceptual scaling of musical timbre. J. Acoust. Soc. Am. 1977, 61, 1270–1277. 57. Krumhansl, C.L. Why is musical timbre so hard to understand? In Structure and Perception of Electroacoustic Sound and Music; Nielzén, S., Olsson, O., Eds.; Excerpta Medica: New York, NY, USA, 1989; pp. 43–54. 58. McAdams, S.; Giordano, B.L. The perception of musical timbre. In The Oxford Handbook of Music Psychology; Hallam, S., Cross, I., Thaut, M., Eds.; Oxford University Press: New York, NY, USA, 2009; pp. 72–80. 59. Listening Test. Webpage for the Listening Test. Available online: http://ixion.csd.uoc.gr/kafentz/listest/ pmwiki.php?n=Main.JMusLT (accessed on 26 April 2016). 60. AdaptiveSinMus. Companion webpage with sound examples. Available online: http://www.csd.uoc.gr/ kafentz/listest/pmwiki.php?n=Main.AdaptiveSinMus (accessed on 26 April 2016). c 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).

Journal

Applied Sciences – Multidisciplinary Digital Publishing Institute

Published: May 2, 2016

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Full-Band Quasi-Harmonic Analysis and Synthesis of Musical Instrument Sounds with Adaptive Sinusoids

Full-Band Quasi-Harmonic Analysis and Synthesis of Musical Instrument Sounds with Adaptive Sinusoids

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Full-Band Quasi-Harmonic Analysis and Synthesis of Musical Instrument Sounds with Adaptive Sinusoids

Full-Band Quasi-Harmonic Analysis and Synthesis of Musical Instrument Sounds with Adaptive Sinusoids

References (68)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies