Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect

Daniel Michelsanti (a,*), Zheng-Hua Tan (a), Sigurdur Sigurdsson (b) and Jesper Jensen (a,b)

(a) Department of Electronic Systems, Aalborg University, Denmark
(b) Oticon A/S, Denmark

(*) Corresponding author. E-mail: danmi@es.aau.dk (D. Michelsanti); zt@es.aau.dk (Z. Tan); ssig@oticon.com (S. Sigurdsson); jje@es.aau.dk, jesj@oticon.com (J. Jensen). ORCID(s): 0000-0002-3575-1600 (D. Michelsanti)

arXiv:1905.12605v1 [eess.AS] 29 May 2019

Keywords: Lombard effect; audio-visual speech enhancement; deep learning; speech quality; speech intelligibility

Abstract

When speaking in presence of background noise, humans reflexively change their way of speaking in order to improve the intelligibility of their speech. This reflex is known as Lombard effect. Collecting speech in Lombard conditions is usually hard and costly. For this reason, speech enhancement systems are generally trained and evaluated on speech recorded in quiet to which noise is artificially added. Since these systems are often used in situations where Lombard speech occurs, in this work we perform an analysis of the impact that Lombard effect has on audio, visual and audio-visual speech enhancement, focusing on deep-learning-based systems, since they represent the current state of the art in the field.

We conduct several experiments using an audio-visual Lombard speech corpus consisting of utterances spoken by 54 different talkers. The results show that training deep-learning-based models with Lombard speech is beneficial in terms of both estimated speech quality and estimated speech intelligibility at low signal to noise ratios, where the visual modality can play an important role in acoustically challenging situations. We also find that a performance difference between genders exists due to the distinct Lombard speech exhibited by males and females, and we analyse it in relation with acoustic and visual features. Furthermore, listening tests conducted with audio-visual stimuli show that the speech quality of the signals processed with systems trained using Lombard speech is statistically significantly better than the one obtained using systems trained with non-Lombard speech at a signal to noise ratio of −5 dB. Regarding speech intelligibility, we find a general tendency of the benefit in training the systems with Lombard speech.

1. Introduction

Speech is perhaps the most common way that people use to communicate with each other. Often, this kind of communication is harmed by several sources of disturbance that may have different nature, such as the presence of competing speakers, the loud music during a party, and the noise inside a car cabin. We refer to the sounds other than the speech of interest as background noise.

Background noise is known to affect two attributes of speech: intelligibility and quality (Loizou, 2007). Both of these aspects are important in a conversation, since poor intelligibility makes it hard to comprehend what a speaker is saying and poor quality may affect speech naturalness and listening effort (Loizou, 2007). Humans tend to tackle the negative effects of background noise by instinctively changing the way of speaking, their speaking style, in a process known as Lombard effect (Lombard, 1911; Zollinger and Brumm, 2011). The changes that can be observed vary widely across individuals (Junqua, 1993; Marxer et al., 2018) and affect multiple dimensions: acoustically, the average fundamental frequency (F0) and the sound energy increase, the spectral tilt flattens due to an energy increment at high frequencies and the centre frequency of the first and second formant (F1 and F2) shifts (Junqua, 1993; Lu and Cooke, 2008); visually, head and face motion are more pronounced and the movements of the lips and jaw are amplified (Vatikiotis-Bateson et al., 2007; Garnier et al., 2010, 2012); temporally, the speech rate changes due to an increase of the vowel duration (Junqua, 1993; Cooke et al., 2014).

Although Lombard effect improves the intelligibility of speech in noise (Summers et al., 1988; Pittman and Wiley, 2001), effective communication might still be challenged by some particular conditions, e.g. the hearing impairment of the listener. In these situations, speech enhancement (SE) algorithms may be applied to the noisy signal aiming at improving speech quality and speech intelligibility. In the literature, several SE techniques have been proposed. Some approaches consider SE as a statistical estimation problem (Loizou, 2007), and include some well-known methods, like the Wiener filtering (Lim and Oppenheim, 1979) and the minimum mean square error estimator of the short-time magnitude spectrum (Ephraim and Malah, 1984). Many improved methods have been proposed, which primarily distinguish themselves by refined statistical speech models (Martin, 2005; Erkelens et al., 2007; Gerkmann and Martin, 2009) or noise models (Martin and Breithaupt, 2003; Loizou, 2007). These techniques, which make statistical assumptions on the distributions of the signals, have been reported to be largely unable to provide speech intelligibility improvements (Hu and Loizou, 2007; Jensen and Hendriks, 2012). As an alternative, data-driven techniques, especially deep learning, do not make any assumptions on the distribution of the speech, of the noise or on the way they are mixed: a learning algorithm is used to find a function that best maps features from degraded speech to features from clean speech.
Over the years, the speech processing community has put a considerable effort into designing training targets and objective functions (Wang et al., 2014; Erdogan et al., 2015; Williamson et al., 2016; Michelsanti et al., 2019b) for different neural network models, including deep neural networks (Xu et al., 2014; Kolbæk et al., 2017), denoising autoencoders (Lu et al., 2013), recurrent neural networks (Weninger et al., 2014), fully convolutional neural networks (Park and Lee, 2017), and generative adversarial networks (Michelsanti and Tan, 2017). These methods represent the current state of the art in the field (Wang and Chen, 2018), and since they use only audio signals, we refer to them as audio-only SE (AO-SE) systems.

Previous studies show that observing the speaker's facial and lip movements contributes to speech perception (Sumby and Pollack, 1954; Erber, 1975; McGurk and MacDonald, 1976). This finding suggests that a SE system could tolerate higher levels of background noise, if visual cues could be used in the enhancement process. This intuition is confirmed by a pioneering study on audio-visual SE (AV-SE) by Girin et al. (2001), where simple geometric features extracted from the video of the speaker's mouth are used. Later, more complex frameworks based on classical statistical approaches have been proposed (Almajai and Milner, 2011; Abel and Hussain, 2014; Abel et al., 2014), and very recently deep learning methods have been used for AV-SE (Hou et al., 2018; Gabbay et al., 2018; Ephrat et al., 2018; Afouras et al., 2018; Owens and Efros, 2018; Morrone et al., 2019).

It is reasonable to think that visual features are mostly helpful for SE when the speech is so degraded that AO-SE systems achieve poor performance, i.e. when background noise heavily dominates over the speech of interest. Since in such acoustical environment spoken communication is particularly hard, we can assume that the speakers are under the influence of Lombard effect. In other words, the input to SE systems in this situation is Lombard speech. Despite this consideration, state-of-the-art SE systems do not take Lombard effect into account, because collecting Lombard speech is usually expensive. The training and the evaluation of the systems are usually performed with speech recorded in quiet and afterwards degraded with additive noise. Previous work shows that speaker (Hansen and Varadarajan, 2009) and speech recognition (Junqua, 1993) systems that ignore Lombard effect achieve sub-optimal performance, also in visual (Heracleous et al., 2013; Marxer et al., 2018) and audio-visual settings (Heracleous et al., 2013). It is therefore of interest to conduct a similar study also in a SE context.

With the objective of providing a more extensive analysis of the impact of Lombard effect on deep-learning-based SE systems, the present work extends a preliminary study (Michelsanti et al., 2019a), providing the following novel contributions. First, new experiments are conducted, where deep-learning-based SE systems trained with Lombard or non-Lombard speech are evaluated on Lombard speech using a cross-validation setting to avoid that a potential intra-speaker variability of the adopted dataset leads to biased conclusions. Then, an investigation of the effect that the inter-speaker variability has on the systems is carried out, both in relation to acoustic as well as visual features. Next, as an example application, a system trained with both Lombard and non-Lombard data using a wide signal-to-noise-ratio (SNR) range is compared with a system trained only on non-Lombard speech, as it is currently done for the state-of-the-art models. Finally, especially since existing objective measures are limited to predict speech quality and intelligibility from the audio signals in isolation, listening tests using audio-visual stimuli have been performed. This test setup, which is generally not employed to evaluate SE systems, is closer to a real-world scenario, where a listener is usually able to look at the face of the talker.

2. Materials: Audio-Visual Speech Corpus and Noise Data

The speech material used in this study is the Lombard GRID corpus (Alghamdi et al., 2018), which is an extension of the popular audio-visual GRID dataset (Cooke et al., 2006). It consists of 55 native speakers of British English (25 males and 30 females) that are between 18 and 30 years old. The sentences pronounced by the talkers adhere to the syntax from the GRID corpus, six-word sentences with the following structure: <command> <color*> <preposition> <letter*> <digit*> <adverb> (Table 1). The words marked with a * are keywords, whereas the others are fillers (Cooke et al., 2006).

Table 1: Sentence structure for the Lombard GRID corpus (Alghamdi et al., 2018). The '*' indicates a keyword. Adapted from (Cooke et al., 2006).

Command   Colour*   Preposition   Letter*       Digit*   Adverb
bin       blue      at            A–Z (no W)    0–9      again
lay       green     by                                   now
place     red       in                                   please
set       white     with                                 soon

Each speaker was recorded while reading a unique set of 50 sentences in non-Lombard (NL) and Lombard (L) conditions (in total, 100 utterances per speaker). In both cases, the audio signals were recorded with a microphone placed in front of the speakers, while the video recordings were collected with two cameras mounted on a helmet to have a frontal and a profile views of the talkers.

In order to induce the Lombard effect, speech shaped noise (SSN) at 80 dB sound pressure level (SPL) was presented to the speakers, while they were reading the sentences to a listener. The presence of a listener, who assured a natural communication environment by asking the participants to repeat the utterances from time to time, was needed, because talkers usually adjust their speech to communicate better with the people they are talking to (Lane and Tranel, 1971; Lu and Cooke, 2008), a process known as external or public loop (Lane and Tranel, 1971). Since talkers tend to regulate their speaking style also based on the level of their own speech, in what is generally called internal or private loop (Lane and Tranel, 1971), the speech signal was mixed with the SSN at a carefully adjusted level, providing a self-monitoring feedback to the speakers.

In our study, the audio and the video signals from the frontal camera were arranged as explained in Section 4 to build training, validation, and test sets. The audio signals have a sampling rate of 16 kHz. The resolution of the frontal video stream is 720×480 pixels with a variable frame rate of around 24 frames per second (FPS). Audio and video signals are temporally aligned.

To generate speech in noise, SSN was added to the audio signals of the Lombard GRID database. SSN was chosen to match the kind of noise used in the database, since, as reported by Hansen and Varadarajan (2009), Lombard effect occurs differently across noise types, although other studies (Lu and Cooke, 2009; Garnier and Henrich, 2014) failed to find such an evidence. The SSN we used was generated as in (Kolbæk et al., 2016), by filtering white noise with a low-order linear predictor, whose coefficients were found using 100 random sentences from the Akustiske Databaser for Dansk (ADFD) speech database.

3. Methodology

In this study, we train and evaluate systems that perform spectral SE using deep learning, as illustrated in Figure 1. The processing pipeline is inspired by Gabbay et al. (2018) and the same as the one used in (Michelsanti et al., 2019a). To have a self-contained exposition, we report the main details of it in this section.

Figure 1: Pipeline of the audio-visual speech enhancement framework used in this study, adapted from (Gabbay et al., 2018), and identical to (Michelsanti et al., 2019a). The deep-learning-based system estimates an ideal amplitude mask from the video of the speaker's mouth and the magnitude spectrogram of the noisy speech. The estimated mask is used to enhance the speech in time-frequency domain. STFT indicates the short-time Fourier transform.

3.1. Audio-Visual Speech Enhancement

We assume to have access to two streams of information: the video of the talker's face, and an audio signal, $y(n) = x(n) + d(n)$, where $x(n)$ is the clean signal of interest, $d(n)$ is an additive noise signal, and $n$ indicates the discrete-time index. The additive noise model presented in time domain, can also be expressed in the time-frequency (TF) domain as $Y(k,l) = X(k,l) + D(k,l)$, where $Y(k,l)$, $X(k,l)$, and $D(k,l)$ are the short-time Fourier transform (STFT) coefficients at frequency bin $k$ and at time frame $l$ of $y(n)$, $x(n)$, and $d(n)$, respectively. Our models adopt a mask approximation approach (Michelsanti et al., 2019b), producing an estimate $\hat{M}(k,l)$ of the ideal amplitude mask, defined as $M(k,l) = |X(k,l)| / |Y(k,l)|$, with the following objective function:

$$J = \frac{1}{TF} \sum_{k,l} \left( M(k,l) - \hat{M}(k,l) \right)^2, \qquad (1)$$

with $k \in \{1, \dots, F\}$, $l \in \{1, \dots, T\}$, and $T \times F$ being the dimension of the training target. Recent preliminary experiments have shown that using this objective function leads to better performance for AV-SE than competing methods (Michelsanti et al., 2019b).

3.2. Preprocessing

In this work, each audio signal was peak-normalised. We used a sample rate of 16 kHz and a 640-point STFT, with a Hamming window of 640 samples (40 ms) and a hop size of 160 samples (10 ms). Only the 321 bins that cover the positive frequencies were used, because of the conjugate symmetry of the STFT.

Each video signal was resampled at a frame rate of 25 FPS using motion interpolation as implemented in FFMPEG. The face of the speaker was detected in every frame using the frontal face detector implemented in the dlib toolkit (King, 2009), consisting of 5 histogram of oriented gradients (HOG) filters and a linear support vector machine (SVM). The bounding box of the single-frame detections was tracked using a Kalman filter. The face was aligned based on 5 landmarks using a model that estimated the position of the corners of the eyes and of the bottom of the nose (King, 2009) and was scaled to 256×256 pixels. The mouth was extracted by cropping the central lower face region of size 128×128 pixels. Each segment of 5 consecutive grayscale video frames spanning a total of 200 ms was paired with the respective 20 consecutive audio frames.

3.3. Neural Network Architecture and Training

The preprocessed audio and video signals, standardised using the mean and the variance from the training set, were used as input to a video and an audio encoders, respectively.
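As a rough illustration of the audio side of Sections 3.1 and 3.2, the following NumPy sketch computes a 640-point STFT with a Hamming window and a 160-sample hop, forms the ideal amplitude mask, and evaluates the objective in Equation (1). It is a minimal sketch and not the authors' code: the framing without padding, the `eps` guard in the division, the function names, and the synthetic white-noise signals are all assumptions made for the example.

```python
import numpy as np

N_FFT = 640  # 40 ms Hamming window at 16 kHz (Section 3.2)
HOP = 160    # 10 ms hop size


def stft(signal):
    """Complex STFT with shape (321, n_frames); no padding (an assumption)."""
    window = np.hamming(N_FFT)
    n_frames = 1 + (len(signal) - N_FFT) // HOP
    frames = np.stack(
        [signal[i * HOP:i * HOP + N_FFT] * window for i in range(n_frames)],
        axis=1,
    )
    # rfft keeps only the 321 non-negative-frequency bins.
    return np.fft.rfft(frames, axis=0)


def ideal_amplitude_mask(clean_stft, noisy_stft, eps=1e-8):
    """M(k, l) = |X(k, l)| / |Y(k, l)|; eps (hypothetical) avoids division by zero."""
    return np.abs(clean_stft) / (np.abs(noisy_stft) + eps)


def mask_objective(mask, mask_estimate):
    """Equation (1): squared error averaged over all T x F mask entries."""
    return float(np.mean((mask - mask_estimate) ** 2))


# Synthetic stand-ins for speech and noise, one second at 16 kHz.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
clean /= np.max(np.abs(clean))  # peak normalisation
noisy = clean + 0.5 * rng.standard_normal(16000)

X, Y = stft(clean), stft(noisy)
M = ideal_amplitude_mask(X, Y)
print(M.shape)                             # (321, 97)
print(mask_objective(M, np.ones_like(M)))  # error of a trivial all-ones estimate
```

In the setup described in the paper, the network would see 20 consecutive frames of such a spectrogram at a time, paired with 5 video frames; the sketch leaves that segmentation out.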
Footnotes to Sections 2 and 3.2: ADFD speech database, https://www.nb.no/sbfil/dok/nst_taledat_dk.pdf; FFMPEG, http://ffmpeg.org

Both encoders consisted of 6 convolutional layers, each of them followed by leaky-ReLU activation functions (Maas et al., 2013) and batch normalisation (Ioffe and Szegedy, 2015). For the video encoder, also max-pooling and 0.25 dropout (Hinton et al., 2012) were adopted. The fusion of the two modalities was accomplished using a sub-network consisting of 3 fully connected layers, followed by leaky-ReLU activations, on the outputs of the 2 encoders. The 321×20 estimated mask was obtained with an audio decoder having 6 transposed convolutional layers followed by leaky-ReLU activations and a ReLU activation as output layer. Skip connections between the layers 1, 3, and 5 of the audio encoder and the corresponding decoder layers were used to avoid that the bottleneck hindered the information flow (Isola et al., 2017). The values of the training target, $M(k,l)$, were limited in the $[0, 10]$ interval (Wang et al., 2014).

The weights of the network were initialised with the Xavier approach (Glorot and Bengio, 2010). The training was performed using the Adam optimiser (Kingma and Ba, 2015) with the objective function in Equation (1) and a batch size of 64. The learning rate, initially set to $4 \cdot 10^{-4}$, was scaled by a factor of 0.5 when the loss increased on the validation set. An early stopping technique was used, by selecting the network that performed the best on the validation set across the 50 epochs used for training.

3.4. Postprocessing

The estimated ideal amplitude mask of an utterance was obtained by concatenating the outputs of the network, obtained by processing non-overlapping consecutive audio-visual paired segments. The estimated mask was point-wise multiplied with the complex-valued STFT spectrogram of the noisy signal and the result inverted using an overlap-add procedure to get the time-domain signal (Allen, 1977; Griffin and Lim, 1984).

3.5. Mono-Modal Speech Enhancement

Until now, we only presented AV-SE systems. In order to understand the relative contribution of the audio and the visual modalities, we also trained networks to perform mono-modal SE, by removing one of the two encoders from the neural network architecture, without changing the other explained settings and procedures. Both AO-SE and video-only SE (VO-SE) systems estimate a mask and apply it to the noisy speech, but they differ in the signals used as input.

4. Experiments

The experiments conducted in this study compare the performance of AO-SE, VO-SE, and AV-SE systems in terms of two widely adopted objective measures: perceptual evaluation of speech quality (PESQ) (Rix et al., 2001), specifically the wideband extension (ITU, 2005) as implemented by Loizou (2007), and extended short-time objective intelligibility (ESTOI) (Jensen and Taal, 2016). PESQ scores, used to estimate speech quality, lie between −0.5 and 4.5, where high values correspond to high speech quality. However, the wideband extension that we use maps these scores to mean opinion score (MOS) values, on a scale from approximately 1 to 4.64. ESTOI scores, which estimate speech intelligibility, practically range from 0 to 1, where high values correspond to high speech intelligibility.

As mentioned before (Section 2), clean speech signals were mixed with SSN to match the noise type used in the Lombard GRID corpus. Current state-of-the-art SE systems are trained with signals at several SNRs to make them robust to various noise levels. We followed a similar methodology and trained our models with two different SNR ranges, narrow (between −20 dB and 5 dB) and wide (between −20 dB and 30 dB). We used these two ranges because on the one hand we would like to assess the performance of SE systems when Lombard speech occurs, and on the other hand we would like to have SNR-independent systems, i.e. systems that also work well at higher SNRs. Such a setup allows us to better understand whether Lombard speech, which is usually not available because it is hard to collect, should be used to train SE systems and which are the advantages and the disadvantages of various training configurations. The models used in this work are shown in Table 2.

Table 2: Models used in this study. The '(w)' is used to distinguish the systems trained with a wide SNR range from the ones trained with a narrow SNR range.

                Training material
                Non-Lombard speech                    Lombard speech
System input    Narrow SNR range   Wide SNR range     Narrow SNR range   Wide SNR range
Vision          VO-NL              VO-NL(w)           VO-L               VO-L(w)
Audio           AO-NL              AO-NL(w)           AO-L               AO-L(w)
Audio-Visual    AV-NL              AV-NL(w)           AV-L               AV-L(w)

Similarly to the work by Marxer et al. (2018), the experiments were conducted adopting a multi-speaker setup, in which all the speakers in the database were used for both training and evaluating the systems. This choice was made for a practical reason. People may exhibit speech characteristics that differ considerably from each other when they speak in presence of noise (Junqua, 1993; Marxer et al., 2018). It is possible to model these differences by training speaker-dependent systems, but this requires a large set of Lombard speech for every speaker. Unfortunately, the audio-visual speech corpus that we use, despite being one of the largest existing audio-visual databases for Lombard speech, only contains 50 utterances per speaker, which are not enough to train a deep-learning-based model.

The experiments were performed according to a stratified five-fold cross-validation procedure (Liu and Özsu, 2009). Specifically, the data was divided into five folds of approximately the same size, four of them used for training and validation, and one for testing. This process was repeated five times for different test sets in order to evaluate the systems on the whole dataset. Before the split, the signals were rearranged to have about the same amount of data for each speaker across the training (≈35 utterances), the validation (≈5 utterances), and the test (≈10 utterances) sets. This ensured that each fold was a good representative of the inter-speaker variations of the whole dataset. For some speakers, some data was missing or corrupted, so we used fewer utterances. Among the 55 speakers, the recordings from speaker s1 were discarded by the database collectors due to technical issues, and the data from speaker s51 was used only in the training set, because only 40 of the utterances could be used. Effectively, 53 speakers were used to evaluate our systems.

4.1. Systems Trained on a Narrow SNR Range

Since we would like to assess the performance of SE systems when Lombard speech occurs, SSN is added to the speech signals from the Lombard GRID corpus at 6 different SNRs, in uniform steps between −20 dB and 5 dB. This choice was driven by the following considerations (Michelsanti et al., 2019a). Since Lombard and non-Lombard utterances from the Lombard GRID corpus have an energy difference between 3 and 13 dB (Marxer et al., 2018), the actual SNR can be computed assuming that the conversational speech level is between 60 and 70 dB sound pressure level (SPL) (Raphael et al., 2007; Moore et al., 2012) and the noise level at 80 dB SPL, like in the recording conditions of the database. The SNR range obtained in this way is between −17 and 3 dB. In the experiments, we used a slightly wider range because of the possible speech level variations caused by the distance between the listener and the speaker.

For all the systems, Lombard speech was used to build the test set, while for training and validation we used Lombard speech for VO-L, AO-L, and AV-L, and non-Lombard speech for VO-NL, AO-NL, and AV-NL (Table 2).

4.1.1. Results and Discussion

Figure 2 shows the cross-validation results in terms of PESQ and ESTOI for all the different systems. On average, every model improves the estimated speech quality and the estimated speech intelligibility of the unprocessed signals, with the exception of VO-NL at 5 dB SNR, which shows an ESTOI score comparable with the one of noisy speech. Another general trend that can be observed is that AV systems outperform the respective AO and VO systems, an expected result since the information that can be exploited using two modalities is no less than the information of the single modalities taken separately.

Figure 2: Cross-validation results in terms of PESQ and ESTOI for the systems trained on a narrow SNR range. At every SNR, there are three pairs of coloured bars with error bars, each of them referring to VO, AO, and AV systems (from left to right). The wide bars in dark colours represent L systems, while the narrow ones in light colours represent NL systems. The heights of each bar and the error bars indicate the average scores and the 95% confidence intervals computed on the pooled data, respectively. The transparent boxes with black edges, overlaying the bars of the other systems, and the error bars indicate the average scores of the unprocessed signals (Unproc.) and their 95% confidence intervals, respectively.

It is worth noting that VO systems' performance changes across SNR, although they do not use the audio signal to estimate the ideal amplitude mask. This is because the estimated mask is applied to the noisy input signal, so the performance depends on the noise level of the input audio signal.

PESQ scores show that the performance that can be obtained with AO systems is comparable with VO systems performance at very low SNRs. Only for SNR ≥ −10 dB, AO models start to perform substantially better than VO models. The difference increases with higher SNRs. Also for ESTOI, this pattern can be observed when SNR ≥ −10 dB, but for SNR ≤ −15 dB VO systems perform better than the respective AO systems, especially at −20 dB SNR where the performance gap is very large. This can be explained by the fact that the noise level is so high that recovering the clean speech only using the noisy audio input is very challenging, and that the visual modality provides a richer information source at this noise level.

For all the modalities, L systems tend to be better than the respective NL systems. The only exception is AO-NL, which have a higher PESQ score than AO-L at −20 dB SNR, but this difference is very modest (0.011). AV-L always outperforms AV-NL in terms of PESQ by a large margin, with more than 5 dB SNR gain, if we consider the performance between −20 dB and −10 dB SNR. On average (Table 3), the performance gap in terms of PESQ between L and NL systems, is greater for the audio-visual case (0.115) than for the audio-only (0.070) and the video-only (0.050) cases, meaning that the speaking style mismatch is more detrimental when both the modalities are used. Regarding ESTOI, the gap between AV-L and AV-NL (0.040) is still the largest, but the one between VO-L and VO-NL (0.037) is greater than the gap between AO-L and AO-NL (0.025): this suggests that the impact of visual differences between Lombard and non-Lombard speech on estimated speech intelligibility is higher than the impact of acoustic differences.

Table 3: Average scores for the systems trained on a narrow SNR range.

PESQ          VO-L    VO-NL   AO-L    AO-NL   AV-L    AV-NL
−20 to 5 dB   1.163   1.113   1.353   1.283   1.446   1.331

ESTOI         VO-L    VO-NL   AO-L    AO-NL   AV-L    AV-NL
−20 to 5 dB   0.372   0.335   0.448   0.423   0.528   0.488

These results suggest that training systems with Lombard speech is beneficial in terms of both estimated speech quality and estimated speech intelligibility. This is in line with and extends our preliminary study (Michelsanti et al., 2019a), where only a subset of the whole database was used to evaluate the models.

4.1.2. Effects of Inter-Speaker Variability

Previous work found a large inter-speaker variability for Lombard speech, especially between male and female speakers (Junqua, 1993). Here, we investigate whether this variability affects the performance of SE systems.

Figure 3 shows the average PESQ and ESTOI scores by gender. Since the scores are computed on different speech material, it may be hard to make a direct comparison between males and females by looking at the absolute performance. Instead, we focus on the gap between L and NL systems averaged across SNRs for same gender. At a first glance, the trends of the different conditions are as expected: L systems are better than the respective NL ones, and AV systems outperform AO systems trained with speech of the same speaking style, in terms of both estimated speech quality and estimated speech intelligibility. We also notice that the scores of VO systems are worse than the AO ones, also for ESTOI. This is because we average across all the SNRs and VO is better than AO only at very low SNRs, but considerably worse for SNR ≥ −5 dB (Figure 2).
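The noisy test material referred to throughout Section 4.1 is obtained by adding noise to speech at six SNRs between −20 dB and 5 dB. A minimal sketch of such mixing is given below. It assumes that the SNR is defined from the average power of the whole utterance and of the whole noise segment; the text does not state the exact convention, so this is an assumption, as are the function name `mix_at_snr` and the use of white noise as a stand-in for speech shaped noise.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`.

    Returns the noisy mixture and the gain applied to the noise.
    """
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve 10*log10(speech_power / (gain**2 * noise_power)) = snr_db for gain.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise, gain


# Six SNRs in uniform 5 dB steps: -20, -15, -10, -5, 0, 5.
NARROW_SNRS_DB = np.linspace(-20, 5, 6)

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # synthetic stand-in for an utterance
noise = rng.standard_normal(16000)   # white noise standing in for SSN

for snr_db in NARROW_SNRS_DB:
    noisy, gain = mix_at_snr(speech, noise, snr_db)
    achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean((gain * noise) ** 2))
    print(f"target {snr_db:6.1f} dB, achieved {achieved:6.1f} dB")
```

Scaling the noise rather than the speech keeps the speech level fixed across conditions; scaling the speech instead would give the same SNR, and the text does not say which of the two was used.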
Figure 3: Cross-validation results for male and female speakers in terms of PESQ and ESTOI.

Figure 4: Mouth aperture (MA) and mouth spreading (MS) from 4 facial landmarks.

The difference between L and NL systems is larger for females than it is for males. This can be observed for all the modalities and it is more noticeable for AV systems, most likely because they account for both audio and visual differences. In order to better understand this behaviour, we provide a more in-depth analysis, investigating the impact that some acoustic and geometric articulatory features have on estimated speech quality and estimated speech intelligibility.

We consider three different features that have already been used to study Lombard speech in previous work (Garnier et al., 2006, 2012; Tang et al., 2015; Alghamdi, 2017): F0, mouth aperture (MA) and mouth spreading (MS). The average F0 for each speaker was estimated with Praat (Boersma and Weenink, 2001), using the default settings for pitch estimation. The average MA and MS per speaker were computed from 4 facial landmarks (Figure 4) obtained with the pose estimation algorithm (Kazemi and Sullivan, 2014), trained on the iBUG 300-W database (Sagonas et al., 2016), implemented in the dlib toolkit (King, 2009).

Let ΔF0, ΔMA, and ΔMS denote the average difference in audio and visual features, respectively, between Lombard and non-Lombard speech. Similarly, let ΔPESQ and ΔESTOI denote the increment in PESQ and ESTOI, respectively, of AV-L with respect to AV-NL. Figure 5 illustrates the relationship between ΔF0, ΔMA, and ΔMS on the one hand and ΔPESQ and ΔESTOI on the other. We notice that on average for each speaker ΔPESQ and ΔESTOI are both positive, with only one exception represented by a male speaker, whose ΔESTOI is slightly less than 0. This indicates that no matter how different the speaking style of a person is in presence of noise, there is a benefit in training a system with Lombard speech. Focusing on the range of the features' variations, most of the speakers have positive ΔMA, ΔMS, and ΔF0. This is in accordance with previous research, which suggests that in Lombard condition there is a tendency to amplify lips' movements and raise the pitch (Garnier et al., 2010, 2012; Junqua, 1993). ΔMA and ΔMS values lie between −2 and 6 pixels, and between −2 and 4 pixels, respectively, for both male and female speakers. Regarding the ΔF0 range, it is wider for females, up to 50 Hz, against the 25 Hz reached by males.

Figure 5: Scatter plots showing the relationship between the audio/visual features and PESQ/ESTOI (panels: ΔPESQ for AV systems and ΔESTOI for AV systems against ΔMA in pixels, ΔMS in pixels, and ΔF0 in Hz). For each circle, which refers to a particular speaker, the y-coordinate indicates the average performance increment of AV-L with respect to AV-NL in terms of PESQ or ESTOI, while the x-coordinate indicates the average increment of audio (fundamental frequency) or visual (mouth aperture and mouth spreading) features in Lombard condition with respect to the respective feature in non-Lombard condition. The lines show the least-squares lines for male speakers (blue), female speakers (red), and all the speakers (yellow). MA, MS, and F0 indicate mouth aperture, mouth spreading, and fundamental frequency, respectively.

Among the three features considered, ΔF0 is the one that seems to be related the most with ΔPESQ and ΔESTOI. This can be seen by comparing the distributions of the circles with the least-squares lines in the plots of Figure 5 or by analysing the correlation between PESQ/ESTOI increments and audio/visual feature increments, using Pearson's and Spearman's correlation coefficients.

Given $n$ pairs of observations $(x_i, y_i)$, with $i \in \{1, \dots, n\}$, from two variables $x$ and $y$, whose sample means are denoted as $\bar{x}$ and $\bar{y}$, respectively, we refer to the Pearson's correlation coefficient as $\rho_P(x, y)$. We have that $-1 \le \rho_P(x, y) \le 1$, where 0 denotes the absence of a linear relationship between the two variables, and −1 and 1 denote a perfect negative linear relationship and a perfect positive linear relationship, respectively. To complement the Pearson's correlation coefficient, we also consider the Spearman's correlation coefficient, $\rho_S(x, y)$, defined as (Sharma, 2005):

$$\rho_S(x, y) = \rho_P(r_x, r_y), \qquad (2)$$

where $r_x$ and $r_y$ indicate rank variables. The advantage of using ranks is that $\rho_S$ allows us to assess whether the relationship between $x$ and $y$ is monotonic (not limited to linear).

As shown in Table 4, for AV systems, ΔF0 has a higher correlation with ΔPESQ ($\rho_P = 0.73$, $\rho_S = 0.73$) and ΔESTOI ($\rho_P = 0.81$, $\rho_S = 0.77$) than ΔMA and ΔMS. We observe that for female speakers, the correlation between the features' increments and the performance measures' increments is usually higher, especially when considering ΔMS, suggesting that some inter-gender difference should be present not only for ΔF0 (whose range is way wider for females as previously stated), but also for visual features.

In Table 4 we also report the correlation coefficients for the single modalities. The correlation of visual features' in-

Table 4

                       ρ_P                     ρ_S
                       all     m       f       all     m       f
ΔPESQ (VO) - ΔMA       0.29    0.32    0.24    0.35    0.30    0.29
ΔPESQ (AO) - ΔMA       0.43    0.49    0.40    0.55    0.49    0.51
ΔPESQ (AV) - ΔMA       0.57    0.59    0.56    0.65    0.52    0.66
ΔESTOI (VO) - ΔMA      0.46    0.19    0.57    0.52    0.16    0.69
ΔESTOI (AO) - ΔMA      0.43    0.47    0.46    0.52    0.52    0.50
ΔESTOI (AV) - ΔMA      0.57    0.53    0.65    0.67    0.47    0.72
ΔPESQ (VO) - ΔMS       0.19    −0.08   0.35    0.12    −0.03   0.31
ΔPESQ (AO) - ΔMS       0.31    0.20    0.45    0.33    0.19    0.54
ΔPESQ (AV) - ΔMS       0.45    0.21    0.68    0.44    0.28    0.71
ΔESTOI (VO) - ΔMS      0.45    −0.12   0.73    0.22    −0.21   0.62

Table 5: Average scores for the systems trained on a wide SNR range.

PESQ          VO-L(w)   VO-NL(w)   AO-L(w)   AO-NL(w)   AV-L(w)   AV-NL(w)
−20 to 5 dB   1.153     1.080      1.346     1.295      1.424     1.323
10 to 30 dB   2.348     2.418      3.127     3.155      3.151     3.169

ESTOI         VO-L(w)   VO-NL(w)   AO-L(w)   AO-NL(w)   AV-L(w)   AV-NL(w)
−20 to 5 dB   0.376     0.330      0.442     0.422      0.517     0.483
10 to 30 dB   0.844     0.825      0.927     0.929      0.928     0.930

Lombard and non-Lombard speech, are preferred. There are several ways to achieve this goal.
For example, it is possi- ESTOI (AO) - MS :30 :05 :47 :22 :07 :48 ble to train a system (with Lombard speech) that works at ESTOI (AV) - MS :47 :02 :72 :34 *:02 :66 low SNRs, and another one (with non-Lombard speech) that PESQ (VO) - F0 :34 :26 :31 :36 :23 :35 works at high SNRs. This approach requires switching be- PESQ (AO) - F0 :62 :53 :58 :61 :52 :61 tween the two systems, which can be problematic, because PESQ (AV) - F0 :73 :58 :75 :73 :59 :80 it involves an online estimation of the SNR. An alternative ESTOI (VO) - F0 :77 :57 :77 :77 :58 :82 ESTOI (AO) - F0 :64 :55 :60 :60 :56 :61 approach is to train general systems with Lombard speech at ESTOI (AV) - F0 :81 :64 :81 :77 :61 :84 low SNRs and non-Lombard speech at high SNRs. We fol- lowed this alternative approach, building such systems and Table 4 studying their strengths and limitations. We also compared Pearson’s ( ) and Spearman’s ( ) correlation coefficients P S them with systems trained only with non-Lombard speech between PESQ/ESTOI increments and audio/visual feature for the whole SNR range, because this is what current state- increments for male speakers (m), female speakers (f), and all the speakers. MA, MS, and F0 indicate mouth aperture, of-the-art systems do. mouth spreading, and fundamental frequency, respectively. The test set was built by mixing additive SSN with Lom- bard speech at 6 SNRs between *20 and 5 dB, and with non-Lombard speech at 5 SNRs between 10 and 30 dB. For (w) (w) (w) crements with PESQ or ESTOI is sometimes higher for VO-NL , AO-NL , and AV-NL , only non-Lombard (w) AO systems than it is for VO systems. This might seem speech was used during training, while for VO-L , AO- (w) (w) counter-intuitive, because AO systems do not use visual in- L , and AV-L , Lombard speech was used with SNR f 5 formation. 
However, correlation does not imply causation dB and non-Lombard speech with SNR g 10 dB, to match (Field, 2013): since visual and acoustic features are corre- the speaking style of the test set (Table 2). The results in lated (Almajai et al., 2006), it is possible that other acoustic terms of PESQ and ESTOI are shown in Figure 6. features, which are not considered in this study even though The relative performance of the systems at SNR f 5 dB they might be correlated with MA and MS, play a role is similar to the one observed for the systems trained on a (w) in the enhancement. Similar considerations can be done for narrow SNR range (Section 4.1): L systems outperform (w) F0, which has a correlation with ESTOI for VO systems the respective NL systems, AV performance is higher than ( = 0:77,  = 0:77) higher than the one for AO systems AO and VO performance, and VO is considerably better than P S ( = 0:64,  = 0:60). By looking at the inter-gender dif- AO only in terms of ESTOI at very low SNRs. P S (w) ferences, we find that, in general, the correlation coefficients When SNR g 10 dB, NL systems perform better than (w) computed for female speakers are higher than the ones com- L systems in terms of PESQ. The difference is on aver- puted for male speakers, especially when considering MS. age (Table 5) larger for VO (0:070) than it is for AO (0:028) In general, a performance difference between genders and AV (0:018). This can be explained by the fact that it is (w) exists when L systems are compared with NL ones, with a harder for VO-L to recognise when non-Lombard speech gap that is larger for females. This is unlikely to be caused by occurs using only the video of the speaker. However, these (w) the small gender imbalance in the training set (23 males and performance gaps are smaller than the ones between L and (w) 30 females). 
Instead, it is reasonable to assume that this re- NL systems at SNR f 5 dB (0:073 for VO, 0:051 for AO, sult is due to the characteristics of the Lombard speech of fe- and 0:101 for AV). male speakers, which shows a large increment of F0, the fea- Regarding ESTOI at SNR g 10 dB, the difference be- ture that correlates the most with the estimated speech qual- tween AO and AV becomes negligible, with VO systems that ity and the estimated speech intelligibility increases, among perform considerably worse. This is because audio features the ones considered. are more informative than visual ones at high SNRs, making AO-SE systems already good to recover speech intelligibil- (w) (w) 4.2. Systems Trained on a Wide SNR Range ity. In addition, the average gaps between NL and L are The models presented in Section 4.1 have been trained to quite small: 0:002 for AO and AV, while for VO it is actually enhance signals when Lombard effect occurs, i.e. at SNRs *0:019. between *20 and 5 dB. However, from a practical perspec- In general, at SNR f 5 dB, the systems that use both tive, SNR-independent systems, capable of enhancing both Lombard and non-Lombard speech for training perform bet- - Page 8 of 16 Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect PESQ (w) VO-L (w) VO-NL 3.8 (w) 3.5 AO-L (w) 3.6 AO-NL (w) AV-L 3.4 (w) AV-NL Unproc. 3.2 2.5 20 25 30 SNR (dB) 1.5 -20 -15 -10 -5 0 5 10 15 20 25 30 SNR (dB) ESTOI (w) VO-L (w) VO-NL 0.8 (w) AO-L (w) AO-NL (w) AV-L 0.6 (w) AV-NL Unproc. 0.4 0.2 -20 -15 -10 -5 0 5 10 15 20 25 30 SNR (dB) Figure 6: As Figure 2, but for the systems trained on a wide SNR range. ter than the ones that only use non-Lombard speech. At sual stimuli. 
Both tests were conducted in a silent room, higher SNRs, their PESQ and ESTOI scores are slightly worse where a MacBookPro11,4 equipped with an external moni- than the ones of the systems trained only with non-Lombard tor, a sound card (Focusrite Scarlett 2i2) and a set of closed speech. However, this performance gap is small, and seems headphones (Beyerdynamic DT770) was used for audio and to be larger for the estimated speech quality than for the es- video playback. The multimedia player (VLC media player timated speech intelligibility. The way we combined non- 3.0.4) was controlled by the subjects with a graphical user in- Lombard and Lombard speech for training seems to be the terface (GUI) modified from MUSHRAM (Vincent, 2005). best solution for an SNR-independent system, although a The processed signals used in this test were from the systems small performance loss may occur at high SNRs. trained on the narrow SNR range previously described (Sec- tion 4.1). All the audio stimuli were normalised according to the two-pass EBU R128 loudness normalisation proce- 5. Listening Tests dure (EBU, 2014), as implemented in ffmpeg-normalize , to Although it has been shown that visual cues have an im- guarantee that signals of different conditions were perceived pact on speech perception (Sumby and Pollack, 1954; McGurk as having the same volume. The subjects were allowed to and MacDonald, 1976), the currently available objective mea- adjust the general loudness to a comfortable level during the sures used to estimate speech quality and speech intelligibil- training session of each test. ity, e.g. PESQ and ESTOI, only take into account the audio signals. Even when listening tests are performed to evaluate 5.1. Speech Quality Test the performance of a SE system, visual stimuli are usually The quality test was carried out by 13 experienced listen- ignored and not presented to the participants (Hussain et al., ers, who volunteered to be part of the study. 
The participants 2017), despite the fact that visual inputs are typically avail- were between 26 and 44 years old, and had self-reported nor- able during practical deployment of SE systems. mal hearing and normal (or corrected to normal) vision. On For these reasons, and in an attempt to evaluate the pro- average, each participant spent approximately 30 minutes to posed AV enhancement systems in a setting as realistic as complete the test. possible, we performed two listening tests, one to assess the https://github.com/slhck/ffmpeg-normalize speech quality and the other to assess the speech intelligi- bility, where the processed audio signals from the Lombard GRID corpus were accompanied by their corresponding vi- - Page 9 of 16 Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect MUSHRA (-5 dB SNR) MUSHRA (5 dB SNR) Small Effect Size Medium Effect Size Large Effect Size 100 100 0:11 f ðd ð < 0:28 0:28 f ðd ð < 0:43 0:43 f ðd ð f 1 90 90 C C C 80 80 Table 6 70 70 Interpretation of the effect size (Cliff’s delta, d ). Adapted from (Vargha and Delaney, 2000). 60 60 50 50 40 40 *5 dB SNR 5 dB SNR 30 30 Comparison p d p d C C 20 20 AO-L - AO-NL < :0083 :30 :0134 :22 10 10 AV-L - AV-NL < :0083 :32 < :0083 :23 0 0 AO-L - AV-L :0498 *:14 :7476 :02 AO-NL - AV-NL < :0083 *:21 :8262 *:02 AO-L - Unproc. :0479 :57 < :0083 :74 AV-L - Unproc. :0134 :59 < :0083 :79 Figure 7: Box plots showing the results of the MUSHRA ex- periments for the signals at *5 dB SNR (left) and at 5 dB SNR Table 7 (right). The red horizontal lines and the diamond markers in- p-values (p) and effect sizes (Cliff’s delta, d ) for the MUSHRA dicate the median and the mean values, respectively. Outliers experiments. The significant level (0.0083) for the p-values is (identified according to the 1.5 interquartile range rule) are corrected with the Bonferroni method. displayed as red crosses. Ref. indicates the reference signals. 
5.1.1. Procedure

The test used the MUlti Stimulus test with Hidden Reference and Anchor (MUSHRA) (ITU, 2003) paradigm to assess the speech quality on a scale from 0 to 100, divided into 5 equal intervals labelled as bad, poor, fair, good, and excellent. No definition of speech quality was provided to the participants. Each subject was presented with 2 sequences of 8 trials each, 4 to evaluate the systems at −5 dB SNR, and 4 to evaluate the systems at 5 dB SNR. Lower SNRs were not considered to ensure that the perceptual quality assessment was not influenced too much by the decrease in intelligibility. One trial consisted of one reference (clean speech signal) and seven other signals to be rated with respect to the reference: 1 hidden reference, 4 systems under test (AO-L, AO-NL, AV-L, AV-NL), 1 unprocessed signal, and 1 hidden anchor (unprocessed signal at −10 dB SNR). The participants were allowed to switch at will between any of the signals inside the same trial. The order of presentation of both the trials and the conditions was randomised, and signals from 4 different randomly chosen speakers were used for each sequence of trials.

Before the actual test, the participants were trained in a special separate session, with the purpose of exposing them to the nature of the impairments and making them familiar with the equipment and the grading system.

5.1.2. Results and Discussion

The average scores assigned by the subjects for each condition are shown in Figure 7 in the form of box plots.

Non-parametric approaches are used to analyse the data (Mendonça and Delikaris-Manias, 2018; Winter et al., 2018), since the assumption of normal distribution of the data is invalid, given the number of participants and their different interpretation of the MUSHRA scale. Specifically, the paired two-sided Wilcoxon signed-rank test (Wilcoxon, 1945) is adopted to determine whether there exists a median difference between the MUSHRA scores obtained for two different conditions. Differences in median are considered significant for p < α/m = 0.0083 (α = 0.05, m = 6), where the significance level is corrected with the Bonferroni method to compensate for multiple hypotheses tests (Field, 2013). The use of p-values as the only analysis strategy has been heavily criticized (Hentschke and Stüttgen, 2011) because statistical significance can be obtained with a big sample size (Sullivan and Feinn, 2012; Moore et al., 2012) even if the magnitude of the effect is negligible (Hentschke and Stüttgen, 2011). For this reason, we complement p-values with a non-parametric measure of the effect size, the Cliff's delta (Cliff, 1993):

$$d_C = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n}[x_i > y_j] - \sum_{i=1}^{m}\sum_{j=1}^{n}[x_i < y_j]}{mn}, \qquad (3)$$

where x_i and y_j are the observations of the samples of sizes m and n to be compared and [P] indicates the Iverson bracket, which is 1 if P is true and 0 otherwise. As reported in Table 6, we consider the effect size to be small if 0.11 ≤ |d_C| < 0.28, medium if 0.28 ≤ |d_C| < 0.43, and large if |d_C| ≥ 0.43, according to the indication by Vargha and Delaney (2000). The p-values and the effect sizes for the comparisons considered in this study are shown in Table 7.

At SNR = −5 dB, a significant (p < 0.0083) medium (0.28 < |d_C| < 0.43) difference exists between Lombard and non-Lombard systems for both the audio-only and the audio-visual cases. The increment in quality when using vision with respect to audio-only systems is perceived by the subjects (|d_C| > 0.11), but it has only a relatively small effect (|d_C| < 0.28). This was expected, since at low SNRs visual cues affect intelligibility more than quality, as also shown by objective measures (Figure 2). More specifically, for non-Lombard systems, this difference is significant and greater than the one found for Lombard systems, meaning that vision contributes more when the enhancement of Lombard speech is performed with systems that were not trained with it. We can notice that there is a large (|d_C| > 0.43) difference between the unprocessed signals and the version enhanced with Lombard systems. However, this difference is not significant, probably due to the heterogeneous interpretation of the MUSHRA scale by the subjects and their preference of the different natures of the impairment (presence of noise or artefacts caused by the enhancement).

At an SNR of 5 dB a small difference between Lombard and non-Lombard systems is observed, although it is not significant in the audio-only case (p = 0.0134). At this noise level, audio-visual systems appear to be indistinguishable (|d_C| < 0.11) from the respective audio-only systems. This confirms the intuition that vision does not help in improving the speech quality at high SNRs. Finally, the difference between the unprocessed signals and the respective enhanced versions using Lombard systems is both large (|d_C| > 0.43) and significant (p < 0.0083), which makes it clear that both AO-L and AV-L improve the speech quality.

5.2. Speech Intelligibility Test

The intelligibility test was carried out by 11 listeners, who volunteered to be part of the study. The participants were between 24 and 65 years old, and had self-reported normal hearing and normal (or corrected to normal) vision. On average, each participant spent approximately 45 minutes to complete the test.

5.2.1. Procedure

Each subject was presented with 2 sequences of 80 audio-visual stimuli from the Lombard GRID corpus: 8 speakers × 4 SNRs (−20, −15, −10, and −5 dB) × 5 processing conditions (unprocessed, AO-L, AO-NL, AV-L, AV-NL). The participants were asked to listen to each stimulus only once and, based on what they heard, they had to select the colour and the digit from a list of options and to write the letter (Table 1). The order of presentation of the stimuli was randomised.

Before the actual test, the participants were trained in a special separate session consisting of a sequence of 40 audio-visual stimuli.

5.2.2. Results and Discussion

The mean percentage of correctly identified keywords as a function of the SNR is shown in Figure 8. We can see that among the three fields, the colour is the easiest word to be identified by the participants. In general, the following trends can be observed. At low SNRs the intelligibility of the signals enhanced with AV systems is higher than the intelligibility obtained with AO systems. This difference substantially diminishes when the SNR increases. There is no big performance difference between L and NL systems, but in general AV-L tends to have higher percentage scores than the other systems. AV-L is also the only system that does not decrease the mean intelligibility scores for all the fields if compared to the unprocessed signals.

Table 8 shows Cliff's deltas and p-values, computed with the paired two-sided Wilcoxon signed-rank test, as in the MUSHRA experiments.

Table 8: p-values (p) and effect sizes (Cliff's delta, d_C) for the mean intelligibility scores for all the keywords obtained in the listening tests.

p                    SNR
Comparison           −20 dB   −15 dB   −10 dB   −5 dB
AO-L - AO-NL         .3066    .4688    .0430    .2539
AV-L - AV-NL         .0625    .8633    .0742    .1055
AO-L - AV-L          .0010    .0117    .5625    .2344
AO-NL - AV-NL        .0527    .0430    .3359    .2070
AO-L - Unproc.       .0332    .0547    .9004    .1250
AV-L - Unproc.       .1270    .0078    .8828    .8828

d_C                  SNR
Comparison           −20 dB   −15 dB   −10 dB   −5 dB
AO-L - AO-NL         −.08     .06      .31      −.31
AV-L - AV-NL         .32      .01      .39      .28
AO-L - AV-L          −.91     −.35     −.17     −.34
AO-NL - AV-NL        −.32     −.37     −.31     .21
AO-L - Unproc.       −.31     .17      −.09     −.26
AV-L - Unproc.       .18      .46      0        .08

The effect sizes support the observations made from Figure 8. Medium and large differences (|d_C| > 0.28) exist between AO and AV systems, especially at low SNRs. While AO-L and AO-NL are indistinguishable (|d_C| < 0.11) for SNR < −10 dB, there is a medium (0.28 ≤ |d_C| < 0.43) difference between AV-L and AV-NL, except for −15 dB SNR (d_C = 0.01). Moreover, the intelligibility increase of AV-L over the unprocessed signals is perceived by the subjects at SNR ≤ −15 dB (|d_C| > 0.11).

Regarding the p-values, if we focus on each SNR separately, the difference between two approaches can be considered significant for p < 0.0083 (cf. Section 5.1.2). This condition is met only when we compare AO-L with AV-L at −20 dB SNR and AV-L with the noisy speech at −15 dB SNR.

There are three main sources of variability that most likely prevent the differences from being significant. First, the variation in lipreading ability among individuals is large and does not directly reflect the variation found in auditory speech perception skills (Summerfield, 1992). Secondly, individuals have very different fusion responses to discrepancy in the auditory and visual syllables (Mallick et al., 2015), which in our case might occur due to the artefacts produced in the enhancement process. Finally, the participants were not exposed to the same utterances processed with the different approaches like in MUSHRA. Since the vocabulary set of the Lombard GRID corpus is small and some words are easier to understand because they contain unambiguous visemes, the intelligibility scores are affected not only by the various processing conditions, but also by the different sentences used.
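The Cliff's delta of Equation (3), used for the effect sizes in Tables 7 and 8, counts how often an observation from one sample exceeds an observation from the other, minus the reverse, normalised by the number of pairs. A minimal pure-Python sketch follows; the function name and the example scores are hypothetical and are not data from the listening tests.

```python
from typing import Sequence


def cliffs_delta(x: Sequence[float], y: Sequence[float]) -> float:
    """Cliff's delta: (#{x_i > y_j} - #{x_i < y_j}) / (m * n)."""
    m, n = len(x), len(y)
    if m == 0 or n == 0:
        raise ValueError("both samples must be non-empty")
    # Iverson brackets summed over all (i, j) pairs; ties contribute 0.
    greater = sum(1 for xi in x for yj in y if xi > yj)
    less = sum(1 for xi in x for yj in y if xi < yj)
    return (greater - less) / (m * n)


# Made-up scores for two conditions, for illustration only.
condition_a = [62.0, 70.0, 55.0, 81.0]
condition_b = [48.0, 66.0, 50.0, 59.0]
print(cliffs_delta(condition_a, condition_b))
```

The value lies in [−1, 1]: it is 1 when every observation in the first sample exceeds every observation in the second, −1 in the opposite case, and 0 when the two directions balance.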
Figure 8: Percentage of correctly identified words obtained in the listening tests for the colour, the letter, and the digit fields, averaged across 11 subjects. The mean intelligibility scores for all the fields are also reported.

6. Conclusion

In this paper, we presented an extensive analysis of the impact of Lombard effect on audio, visual and audio-visual speech enhancement systems based on deep learning. We conducted several experiments using a database consisting of 54 speakers and showed the general benefit of training a system with Lombard speech.

In more detail, we first trained systems with Lombard or non-Lombard speech and evaluated them on Lombard speech adopting a cross-validation setup. The results showed that systems trained with Lombard speech outperform the respective systems trained with non-Lombard speech in terms of both estimated speech quality and estimated speech intelligibility. We also observed a performance difference across speakers, with an evident gap between genders: the performance difference between the systems trained with Lombard speech and the ones trained with non-Lombard speech is larger for females than it is for males. The analysis that we performed suggests that this difference might be primarily due to the large increment in the fundamental frequency that female speakers exhibit from non-Lombard to Lombard conditions.

With the objective of building more general systems able to deal with a wider SNR range, we then trained systems using Lombard and non-Lombard speech and compared them with systems trained only on non-Lombard speech. As in the narrow SNR case, systems that include Lombard speech perform considerably better than the others at low SNRs. At high SNRs, the estimated speech quality and the estimated speech intelligibility obtained with systems trained only with non-Lombard speech are higher, even though the performance gap is very small for the audio and the audio-visual cases. Combining non-Lombard and Lombard speech for training in the way we did guarantees a good compromise for the enhancement performance across all the SNRs.

We also performed subjective listening tests with audio-visual stimuli, in order to evaluate the systems in a situation closer to the real-world scenario, where the listener can see the face of the talker. For the speech quality test, we found significant differences between Lombard and non-Lombard systems at all the used SNRs for the audio-visual case and only at −5 dB SNR for the audio-only case. Regarding the speech intelligibility test, we observed that on average the scores obtained with the audio-visual system trained with Lombard speech are higher than the other processing conditions. However, we were unable to find significant differences in most of the cases, suggesting that in future works more effort should be put into designing new paradigms for speech intelligibility tests to control the several sources of variability caused by the combination of auditory and visual stimuli.

7. Acknowledgements

This work was supported, in part, by the Oticon Foundation.

References

Abel, A., Hussain, A., 2014. Novel two-stage audiovisual speech filtering in noisy environments. Cognitive Computation 6 (2), 200–217.

Abel, A., Hussain, A., Luo, B., 2014. Cognitively inspired speech processing for multimodal hearing technology. In: Proceedings of CICARE. IEEE, pp. 56–63.

Afouras, T., Chung, J. S., Zisserman, A., 2018. The conversation: Deep audio-visual speech enhancement. In: Proceedings of Interspeech. pp. 3244–3248.

Alghamdi, N., 2017. Visual speech enhancement and its application in speech perception training. Ph.D. thesis, University of Sheffield.

Alghamdi, N., Maddock, S., Marxer, R., Barker, J., Brown, G. J., 2018. A corpus of audio-visual Lombard speech with frontal and profile views. The Journal of the Acoustical Society of America 143 (6), EL523–EL529.

Allen, J., 1977. Short term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 25 (3), 235–238.

Almajai, I., Milner, B., 2011. Visually derived Wiener filters for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing 19 (6), 1642–1651.

Almajai, I., Milner, B., Darch, J., 2006. Analysis of correlation between audio and visual speech features for clean audio feature prediction in noise. In: Proceedings of Interspeech/ICSLP. p. 1634.

Boersma, P., Weenink, D., 2001. Praat: doing phonetics by computer. http://www.fon.hum.uva.nl/praat/, accessed: March 20, 2019.

Cliff, N., 1993. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological bulletin 114 (3), 494.

Cooke, M., Barker, J., Cunningham, S., Shao, X., 2006. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America 120 (5), 2421–2424.

Cooke, M., King, S., Garnier, M., Aubanel, V., 2014. The listening talker: A review of human and algorithmic context-induced modifications of speech. Computer Speech & Language 28 (2), 543–571.

EBU, 2014. EBU recommendation R128 - Loudness normalisation and permitted maximum level of audio signals.

Ephraim, Y., Malah, D., 1984. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing 32 (6), 1109–1121.

Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W. T., Rubinstein, M., 2018. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics 37 (4), 112:1–112:11.

Erber, N. P., 1975. Auditory-visual perception of speech. Journal of Speech and Hearing Disorders 40 (4), 481–492.

Erdogan, H., Hershey, J. R., Watanabe, S., Le Roux, J., 2015. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In: Proceedings of ICASSP. IEEE, pp. 708–712.

Erkelens, J. S., Hendriks, R. C., Heusdens, R., Jensen, J., 2007. Minimum mean-square error estimation of discrete Fourier coefficients with generalized Gamma priors. IEEE Transactions on Audio, Speech, and Language Processing 15 (6), 1741–1752.

Field, A., 2013. Discovering statistics using IBM SPSS statistics. Sage.

Gabbay, A., Shamir, A., Peleg, S., 2018. Visual speech enhancement. In: Proceedings of Interspeech. pp. 1170–1174.

Garnier, M., Bailly, L., Dohen, M., Welby, P., Lœvenbruck, H., 2006. An acoustic and articulatory study of Lombard speech: Global effects on the utterance. In: Proceedings of Interspeech/ICSLP. pp. 2246–2249.

Garnier, M., Henrich, N., 2014. Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise? Computer Speech & Language 28 (2), 580–597.

Garnier, M., Henrich, N., Dubois, D., 2010. Influence of sound immersion and communicative interaction on the Lombard effect. Journal of Speech, Language, and Hearing Research 53 (3), 588–608.

Garnier, M., Ménard, L., Richard, G., 2012. Effect of being seen on the production of visible speech cues. A pilot study on Lombard speech. In: Proceedings of Interspeech/ICSLP. pp. 611–614.

Gerkmann, T., Martin, R., 2009. On the statistics of spectral amplitudes after variance reduction by temporal cepstrum smoothing and cepstral nulling. IEEE Transactions on Signal Processing 57 (11), 4165–4174.

Girin, L., Schwartz, J.-L., Feng, G., 2001. Audio-visual enhancement of speech in noise. The Journal of the Acoustical Society of America 109 (6), 3007–3020.

Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of AISTATS. pp. 249–256.

Grancharov, V., Kleijn, W., 2008. Speech Quality Assessment. Springer Berlin Heidelberg, pp. 83–100.

Griffin, D., Lim, J., 1984. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32 (2), 236–243.

Hansen, J. H., Varadarajan, V., 2009. Analysis and compensation of Lombard speech across noise type and levels with application to in-set/out-of-set speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing 17 (2), 366–378.

Hentschke, H., Stüttgen, M. C., 2011. Computation of measures of effect size for neuroscience data sets. European Journal of Neuroscience 34 (12), 1887–1894.

Heracleous, P., Ishi, C. T., Sato, M., Ishiguro, H., Hagita, N., 2013. Analysis of the visual Lombard effect and automatic recognition experiments. Computer Speech & Language 27 (1), 288–300.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R. R., 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

Hou, J.-C., Wang, S.-S., Lai, Y.-H., Lin, J.-C., Tsao, Y., Chang, H.-W., Wang, H.-M., 2018. Audio-visual speech enhancement based on multimodal deep convolutional neural network. IEEE Transactions on Emerging Topics in Computational Intelligence 2 (2), 117–128.

Hu, Y., Loizou, P. C., 2007. A comparative intelligibility study of single-microphone noise reduction algorithms. The Journal of the Acoustical Society of America 122 (3), 1777–1786.

Hussain, A., Barker, J., Marxer, R., Adeel, A., Whitmer, W., Watt, R., Derleth, P., 2017. Towards multi-modal hearing aid design and evaluation in realistic audio-visual settings: Challenges and opportunities. In: Proceedings of CHAT. pp. 29–34.

Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of ICML. pp. 448–456.

Isola, P., Zhu, J.-Y., Zhou, T., Efros, A. A., 2017. Image-to-image translation with conditional adversarial networks. In: Proceedings of CVPR. pp. 1125–1134.

ITU, 2003. Recommendation ITU-R BS.1534-1: Method for the subjective assessment of intermediate quality level of coding systems.

ITU, 2005. Recommendation P.862.2: Wideband extension to recommendation P.862 for the assessment of wideband telephone networks and speech codecs.

Jensen, J., Hendriks, R. C., 2012. Spectral magnitude minimum mean-square error estimation using binary and continuous gain functions. IEEE Transactions on Audio, Speech, and Language Processing 20 (1), 92–102.

Jensen, J., Taal, C. H., 2016. An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (11), 2009–2022.

Junqua, J.-C., 1993. The Lombard reflex and its role on human listeners and automatic speech recognizers. The Journal of the Acoustical Society of America 93 (1), 510–524.

Kazemi, V., Sullivan, J., 2014. One millisecond face alignment with an ensemble of regression trees. In: Proceedings of CVPR. pp. 1867–1874.

King, D. E., 2009. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research 10, 1755–1758.

Kingma, D. P., Ba, J., 2015. Adam: A method for stochastic optimization.

Kolbæk, M., Tan, Z.-H., Jensen, J., 2016. Speech enhancement using long short-term memory based recurrent neural networks for noise robust speaker verification. In: Proceedings of SLT. IEEE, pp. 305–311.

Kolbæk, M., Tan, Z.-H., Jensen, J., 2017. Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems. IEEE/ACM Transactions on Audio, Speech and Language Processing 25 (1), 153–167.

Lane, H., Tranel, B., 1971. The Lombard sign and the role of hearing in speech. Journal of Speech and Hearing Research 14 (4), 677–709.

Lim, J. S., Oppenheim, A. V., 1979. Enhancement and bandwidth compression of noisy speech. Proceedings of the IEEE 67 (12), 1586–1604.

Liu, L., Özsu, M. T., 2009. Encyclopedia of database systems. Vol. 6. Springer New York, NY, USA.

Loizou, P. C., 2007. Speech enhancement: Theory and practice. CRC press.

Lombard, E., 1911. Le signe de l’elevation de la voix. Annales des Maladies de L’Oreille et du Larynx 37 (2), 101–

Lu, X., Tsao, Y., Matsuda, S., Hori, C., 2013. Speech enhancement based on deep denoising autoencoder. In: Proceedings of Interspeech. pp. 436–440.

Lu, Y., Cooke, M., 2008. Speech production modifications produced by competing talkers, babble, and stationary noise. The Journal of the Acoustical Society of America 124 (5), 3261–3275.

Lu, Y., Cooke, M., 2009. Speech production modifications produced in the presence of low-pass and high-pass filtered noise. The Journal of the Acoustical Society of America 126 (3), 1495–1499.

Maas, A. L., Hannun, A. Y., Ng, A. Y., 2013. Rectifier nonlinearities improve neural network acoustic models. In: ICML Workshop on Deep Learning for Audio, Speech and Language Processing.

Mallick, D. B., Magnotti, J. F., Beauchamp, M. S., 2015. Variability and stability in the McGurk effect: Contributions of participants, stimuli, time, and response type. Psychonomic Bulletin & Review 22 (5), 1299–1307.

Martin, R., 2005. Speech enhancement based on minimum mean-square error estimation and supergaussian priors. IEEE Transactions on Speech and Audio Processing 13 (5), 845–856.

Martin, R., Breithaupt, C., 2003. Speech enhancement in the DFT domain using Laplacian speech priors. In: Proceed-
In: Proceedings of ICLR. ings of IWAENC. Vol. 3. pp. 87–90. - Page 14 of 16 Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect Marxer, R., Barker, J., Alghamdi, N., Maddock, S., 2018. Sharma, A., 2005. Text book of correlations and regression. The impact of the Lombard effect on audio and visual Discovery Publishing House. speech recognition systems. Speech Communication 100, Sullivan, G. M., Feinn, R., 2012. Using effect size - or why 58–68. the P value is not enough. Journal of graduate medical McGurk, H., MacDonald, J., 1976. Hearing lips and seeing education 4 (3), 279–282. voices. Nature 264 (5588), 746–748. Sumby, W. H., Pollack, I., 1954. Visual contribution to Mendonça, C., Delikaris-Manias, S., 2018. Statistical tests speech intelligibility in noise. The Journal of the Acous- with mushra data. In: Audio Engineering Society Con- tical Society of America 26 (2), 212–215. vention 144. Audio Engineering Society. Summerfield, Q., 1992. Lipreading and audio-visual speech Michelsanti, D., Tan, Z.-H., 2017. Conditional generative perception. Philosophical Transactions of the Royal Soci- adversarial networks for speech enhancement and noise- ety of London. Series B: Biological Sciences 335 (1273), robust speaker verification. In: Proceedings of Inter- 71–78. speech. pp. 2008–2012. Summers, W. V., Pisoni, D. B., Bernacki, R. H., Pedlow, Michelsanti, D., Tan, Z.-H., Sigurdsson, S., Jensen, J., R. I., Stokes, M. A., 1988. Effects of noise on speech pro- 2019a. Effects of Lombard reflex on the performance duction: Acoustic and perceptual analyses. The Journal of of deep-learning-based audio-visual speech enhancement the Acoustical Society of America 84 (3), 917–928. systems. In: Proceedings of ICASSP. pp. 6615–6619. Tang, L. Y., Hannah, B., Jongman, A., Sereno, J., Wang, Michelsanti, D., Tan, Z.-H., Sigurdsson, S., Jensen, J., Y., Hamarneh, G., 2015. Examining visible articulatory 2019b. 
On training targets and objective functions for features in clear and plain speech. Speech Communication deep-learning-based audio-visual speech enhancement. 75, 1–13. In: Proceedings of ICASSP. pp. 8077–8081. Vargha, A., Delaney, H. D., 2000. A critique and improve- Moore, D. S., McCabe, G. P., Craig, B. A., 2012. Introduc- ment of the CL common language effect size statistics of tion to the Practice of Statistics. WH Freeman New York. McGraw and Wong. Journal of Educational and Behav- ioral Statistics 25 (2), 101–132. Morrone, G., Pasa, L., Tikhanoff, V., Bergamaschi, S., Fadiga, L., Badino, L., 2019. Face landmark-based Vatikiotis-Bateson, E., Barbosa, A. V., Chow, C. Y., Oberg, speaker-independent audio-visual speech enhancement in M., Tan, J., Yehia, H. C., 2007. Audiovisual Lombard multi-talker environments. In: Proceedings of ICASSP. speech: Reconciling production and perception. In: Pro- pp. 6900–6904. ceedings of AVSP. p. 41. Owens, A., Efros, A. A., 2018. Audio-visual scene analysis Vincent, E., 2005. MUSHRAM: A MATLAB interface for with self-supervised multisensory features. In: Proceed- MUSHRA listening tests. http://c4dm.eecs.qmul.ac.uk/ ings of ECCV. pp. 631–648. downloads/#mushram, accessed: March 20, 2019. Park, S. R., Lee, J., 2017. A fully convolutional neural net- Wang, D., Chen, J., 2018. Supervised speech separation work for speech enhancement. In: Proceedings of Inter- based on deep learning: An overview. IEEE/ACM Trans- speech. pp. 1993–1997. actions on Audio, Speech, and Language Processing 26 (10), 1702–1726. Pittman, A. L., Wiley, T. L., 2001. Recognition of speech produced in noise. Journal of Speech, Language, and Wang, Y., Narayanan, A., Wang, D., 2014. On training tar- Hearing Research. gets for supervised speech separation. IEEE/ACM Trans- actions on Audio, Speech, and Language Processing Raphael, L. J., Borden, G. J., Harris, K. S., 2007. Speech 22 (12), 1849–1858. science primer: Physiology, acoustics, and perception of speech. 
Lippincott Williams & Wilkins. Weninger, F., Hershey, J. R., Le Roux, J., Schuller, B., 2014. Discriminatively trained recurrent neural networks Rix, A. W., Beerends, J. G., Hollier, M. P., Hekstra, A. P., for single-channel speech separation. In: Proceedings of 2001. Perceptual evaluation of speech quality (PESQ)-a GlobalSIP. IEEE, pp. 577–581. new method for speech quality assessment of telephone networks and codecs. In: Proceedings of ICASSP. Vol. 2. Wilcoxon, F., 1945. Individual comparisons by ranking IEEE, pp. 749–752. methods. Biometrics Bulletin 1 (6), 80–83. Sagonas, C., Antonakos, E., Tzimiropoulos, G., Zafeiriou, Williamson, D. S., Wang, Y., Wang, D., 2016. Complex ra- S., Pantic, M., 2016. 300 faces in-the-wild challenge: tio masking for monaural speech separation. IEEE/ACM Database and results. Image and Vision Computing 47, Transactions on Audio, Speech and Language Processing 3–18. 24 (3), 483–492. - Page 15 of 16 Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect Winter, F., Wierstorf, H., Hold, C., Krüger, F., Raake, A., Spors, S., 2018. Colouration in local wave field synthe- sis. IEEE/ACM Transactions on Audio, Speech, and Lan- guage Processing 26 (10), 1913–1924. Xu, Y., Du, J., Dai, L.-R., Lee, C.-H., 2014. An experimental study on speech enhancement based on deep neural net- works. IEEE Signal Processing Letters 21 (1), 65–68. Zollinger, S. A., Brumm, H., 2011. The evolution of the Lombard effect: 100 years of psychoacoustic research. Behaviour 148 (11-13), 1173–1198. - Page 16 of 16 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Electrical Engineering and Systems Science arXiv (Cornell University)

Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect

ISSN
0167-6393
eISSN
ARCH-3348
DOI
10.1016/j.specom.2019.10.006

Abstract

Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect

Daniel Michelsanti (a,*), Zheng-Hua Tan (a), Sigurdur Sigurdsson (b) and Jesper Jensen (a,b)
(a) Department of Electronic Systems, Aalborg University, Denmark
(b) Oticon A/S, Denmark
(*) Corresponding author
danmi@es.aau.dk (D. Michelsanti); zt@es.aau.dk (Z. Tan); ssig@oticon.com (S. Sigurdsson); jje@es.aau.dk, jesj@oticon.com (J. Jensen)
ORCID(s): 0000-0002-3575-1600 (D. Michelsanti)
arXiv:1905.12605v1 [eess.AS] 29 May 2019

ARTICLE INFO

Keywords: Lombard effect, audio-visual speech enhancement, deep learning, speech quality, speech intelligibility

ABSTRACT

When speaking in presence of background noise, humans reflexively change their way of speaking in order to improve the intelligibility of their speech. This reflex is known as Lombard effect. Collecting speech in Lombard conditions is usually hard and costly. For this reason, speech enhancement systems are generally trained and evaluated on speech recorded in quiet to which noise is artificially added. Since these systems are often used in situations where Lombard speech occurs, in this work we perform an analysis of the impact that Lombard effect has on audio, visual and audio-visual speech enhancement, focusing on deep-learning-based systems, since they represent the current state of the art in the field.

We conduct several experiments using an audio-visual Lombard speech corpus consisting of utterances spoken by 54 different talkers. The results show that training deep-learning-based models with Lombard speech is beneficial in terms of both estimated speech quality and estimated speech intelligibility at low signal to noise ratios, where the visual modality can play an important role in acoustically challenging situations. We also find that a performance difference between genders exists due to the distinct Lombard speech exhibited by males and females, and we analyse it in relation with acoustic and visual features. Furthermore, listening tests conducted with audio-visual stimuli show that the speech quality of the signals processed with systems trained using Lombard speech is statistically significantly better than the one obtained using systems trained with non-Lombard speech at a signal to noise ratio of −5 dB. Regarding speech intelligibility, we find a general tendency of the benefit in training the systems with Lombard speech.

1. Introduction

Speech is perhaps the most common way that people use to communicate with each other. Often, this kind of communication is harmed by several sources of disturbance that may have different nature, such as the presence of competing speakers, the loud music during a party, and the noise inside a car cabin. We refer to the sounds other than the speech of interest as background noise.

Background noise is known to affect two attributes of speech: intelligibility and quality (Loizou, 2007). Both of these aspects are important in a conversation, since poor intelligibility makes it hard to comprehend what a speaker is saying and poor quality may affect speech naturalness and listening effort (Loizou, 2007). Humans tend to tackle the negative effects of background noise by instinctively changing the way of speaking, their speaking style, in a process known as Lombard effect (Lombard, 1911; Zollinger and Brumm, 2011). The changes that can be observed vary widely across individuals (Junqua, 1993; Marxer et al., 2018) and affect multiple dimensions: acoustically, the average fundamental frequency (F0) and the sound energy increase, the spectral tilt flattens due to an energy increment at high frequencies and the centre frequency of the first and second formant (F1 and F2) shifts (Junqua, 1993; Lu and Cooke, 2008); visually, head and face motion are more pronounced and the movements of the lips and jaw are amplified (Vatikiotis-Bateson et al., 2007; Garnier et al., 2010, 2012); temporally, the speech rate changes due to an increase of the vowel duration (Junqua, 1993; Cooke et al., 2014).

Although Lombard effect improves the intelligibility of speech in noise (Summers et al., 1988; Pittman and Wiley, 2001), effective communication might still be challenged by some particular conditions, e.g. the hearing impairment of the listener. In these situations, speech enhancement (SE) algorithms may be applied to the noisy signal aiming at improving speech quality and speech intelligibility. In the literature, several SE techniques have been proposed. Some approaches consider SE as a statistical estimation problem (Loizou, 2007), and include some well-known methods, like the Wiener filtering (Lim and Oppenheim, 1979) and the minimum mean square error estimator of the short-time magnitude spectrum (Ephraim and Malah, 1984). Many improved methods have been proposed, which primarily distinguish themselves by refined statistical speech models (Martin, 2005; Erkelens et al., 2007; Gerkmann and Martin, 2009) or noise models (Martin and Breithaupt, 2003; Loizou, 2007). These techniques, which make statistical assumptions on the distributions of the signals, have been reported to be largely unable to provide speech intelligibility improvements (Hu and Loizou, 2007; Jensen and Hendriks, 2012). As an alternative, data-driven techniques, especially deep learning, do not make any assumptions on the distribution of the speech, of the noise or on the way they are mixed: a learning algorithm is used to find a function that best maps features from degraded speech to features from clean speech.
Over the years, the speech processing community has put a considerable effort into designing training targets and objective functions (Wang et al., 2014; Erdogan et al., 2015; Williamson et al., 2016; Michelsanti et al., 2019b) for different neural network models, including deep neural networks (Xu et al., 2014; Kolbæk et al., 2017), denoising autoencoders (Lu et al., 2013), recurrent neural networks (Weninger et al., 2014), fully convolutional neural networks (Park and Lee, 2017), and generative adversarial networks (Michelsanti and Tan, 2017). These methods represent the current state of the art in the field (Wang and Chen, 2018), and since they use only audio signals, we refer to them as audio-only SE (AO-SE) systems.

Previous studies show that observing the speaker’s facial and lip movements contributes to speech perception (Sumby and Pollack, 1954; Erber, 1975; McGurk and MacDonald, 1976). This finding suggests that a SE system could tolerate higher levels of background noise, if visual cues could be used in the enhancement process. This intuition is confirmed by a pioneering study on audio-visual SE (AV-SE) by Girin et al. (2001), where simple geometric features extracted from the video of the speaker’s mouth are used. Later, more complex frameworks based on classical statistical approaches have been proposed (Almajai and Milner, 2011; Abel and Hussain, 2014; Abel et al., 2014), and very recently deep learning methods have been used for AV-SE (Hou et al., 2018; Gabbay et al., 2018; Ephrat et al., 2018; Afouras et al., 2018; Owens and Efros, 2018; Morrone et al., 2019).

It is reasonable to think that visual features are mostly helpful for SE when the speech is so degraded that AO-SE systems achieve poor performance, i.e. when background noise heavily dominates over the speech of interest. Since in such acoustical environment spoken communication is particularly hard, we can assume that the speakers are under the influence of Lombard effect. In other words, the input to SE systems in this situation is Lombard speech. Despite this consideration, state-of-the-art SE systems do not take Lombard effect into account, because collecting Lombard speech is usually expensive. The training and the evaluation of the systems are usually performed with speech recorded in quiet and afterwards degraded with additive noise. Previous work shows that speaker (Hansen and Varadarajan, 2009) and speech recognition (Junqua, 1993) systems that ignore Lombard effect achieve sub-optimal performance, also in visual (Heracleous et al., 2013; Marxer et al., 2018) and audio-visual settings (Heracleous et al., 2013). It is therefore of interest to conduct a similar study also in a SE context.

With the objective of providing a more extensive analysis of the impact of Lombard effect on deep-learning-based SE systems, the present work extends a preliminary study (Michelsanti et al., 2019a), providing the following novel contributions. First, new experiments are conducted, where deep-learning-based SE systems trained with Lombard or non-Lombard speech are evaluated on Lombard speech using a cross-validation setting to avoid that a potential intra-speaker variability of the adopted dataset leads to biased conclusions. Then, an investigation of the effect that the inter-speaker variability has on the systems is carried out, both in relation to acoustic as well as visual features. Next, as an example application, a system trained with both Lombard and non-Lombard data using a wide signal-to-noise-ratio (SNR) range is compared with a system trained only on non-Lombard speech, as it is currently done for the state-of-the-art models. Finally, especially since existing objective measures are limited to predict speech quality and intelligibility from the audio signals in isolation, listening tests using audio-visual stimuli have been performed. This test setup, which is generally not employed to evaluate SE systems, is closer to a real-world scenario, where a listener is usually able to look at the face of the talker.

2. Materials: Audio-Visual Speech Corpus and Noise Data

The speech material used in this study is the Lombard GRID corpus (Alghamdi et al., 2018), which is an extension of the popular audio-visual GRID dataset (Cooke et al., 2006). It consists of 55 native speakers of British English (25 males and 30 females) that are between 18 and 30 years old. The sentences pronounced by the talkers adhere to the syntax from the GRID corpus, six-word sentences with the following structure: <command> <color*> <preposition> <letter*> <digit*> <adverb> (Table 1). The words marked with a * are keywords, whereas the others are fillers (Cooke et al., 2006).

Table 1: Sentence structure for the Lombard GRID corpus (Alghamdi et al., 2018). The ‘*’ indicates a keyword. Adapted from (Cooke et al., 2006).

Command   Colour*   Preposition   Letter*       Digit*   Adverb
bin       blue      at            A–Z (no W)    0–9      again
lay       green     by                                   now
place     red       in                                   please
set       white     with                                 soon

Each speaker was recorded while reading a unique set of 50 sentences in non-Lombard (NL) and Lombard (L) conditions (in total, 100 utterances per speaker). In both cases, the audio signals were recorded with a microphone placed in front of the speakers, while the video recordings were collected with two cameras mounted on a helmet to capture frontal and profile views of the talkers.

In order to induce the Lombard effect, speech shaped noise (SSN) at 80 dB sound pressure level (SPL) was presented to the speakers, while they were reading the sentences to a listener. The presence of a listener, who assured a natural communication environment by asking the participants to repeat the utterances from time to time, was needed, because talkers usually adjust their speech to communicate better with the people they are talking to (Lane and Tranel, 1971; Lu and Cooke, 2008), a process known as external or public loop (Lane and Tranel, 1971). Since talkers tend to regulate their speaking style also based on the level of their own speech, in what is generally called internal or private loop (Lane and Tranel, 1971), the speech signal was mixed with the SSN at a carefully adjusted level, providing a self-monitoring feedback to the speakers.

In our study, the audio and the video signals from the frontal camera were arranged as explained in Section 4 to build training, validation, and test sets. The audio signals have a sampling rate of 16 kHz. The resolution of the frontal video stream is 720×480 pixels with a variable frame rate of around 24 frames per second (FPS). Audio and video signals are temporally aligned.

To generate speech in noise, SSN was added to the audio signals of the Lombard GRID database. SSN was chosen to match the kind of noise used in the database, since, as reported by Hansen and Varadarajan (2009), Lombard effect occurs differently across noise types, although other studies (Lu and Cooke, 2009; Garnier and Henrich, 2014) failed to find such evidence. The SSN we used was generated as in (Kolbæk et al., 2016), by filtering white noise with a low-order linear predictor, whose coefficients were found using 100 random sentences from the Akustiske Databaser for Dansk (ADFD) speech database.

3. Methodology

In this study, we train and evaluate systems that perform spectral SE using deep learning, as illustrated in Figure 1. The processing pipeline is inspired by Gabbay et al. (2018) and the same as the one used in (Michelsanti et al., 2019a). To have a self-contained exposition, we report the main details of it in this section.

Figure 1: Pipeline of the audio-visual speech enhancement framework used in this study, adapted from (Gabbay et al., 2018), and identical to (Michelsanti et al., 2019a). The deep-learning-based system estimates an ideal amplitude mask from the video of the speaker’s mouth and the magnitude spectrogram of the noisy speech. The estimated mask is used to enhance the speech in time-frequency domain. STFT indicates the short-time Fourier transform.

3.1. Audio-Visual Speech Enhancement

We assume to have access to two streams of information: the video of the talker’s face, and an audio signal, $y(n) = x(n) + d(n)$, where $x(n)$ is the clean signal of interest, $d(n)$ is an additive noise signal, and $n$ indicates the discrete-time index. The additive noise model, presented in the time domain, can also be expressed in the time-frequency (TF) domain as $Y(k,l) = X(k,l) + D(k,l)$, where $Y(k,l)$, $X(k,l)$, and $D(k,l)$ are the short-time Fourier transform (STFT) coefficients at frequency bin $k$ and at time frame $l$ of $y(n)$, $x(n)$, and $d(n)$, respectively. Our models adopt a mask approximation approach (Michelsanti et al., 2019b), producing an estimate $\widehat{M}(k,l)$ of the ideal amplitude mask, defined as $M(k,l) = |X(k,l)| / |Y(k,l)|$, with the following objective function:

$$J = \frac{1}{TF} \sum_{k,l} \left( M(k,l) - \widehat{M}(k,l) \right)^2, \qquad (1)$$

with $k \in \{1, \ldots, F\}$, $l \in \{1, \ldots, T\}$, and $T \times F$ being the dimension of the training target. Recent preliminary experiments have shown that using this objective function leads to better performance for AV-SE than competing methods (Michelsanti et al., 2019b).

3.2. Preprocessing

In this work, each audio signal was peak-normalised. We used a sample rate of 16 kHz and a 640-point STFT, with a Hamming window of 640 samples (40 ms) and a hop size of 160 samples (10 ms). Only the 321 bins that cover the positive frequencies were used, because of the conjugate symmetry of the STFT.

Each video signal was resampled at a frame rate of 25 FPS using motion interpolation as implemented in FFMPEG. The face of the speaker was detected in every frame using the frontal face detector implemented in the dlib toolkit (King, 2009), consisting of 5 histogram of oriented gradients (HOG) filters and a linear support vector machine (SVM). The bounding box of the single-frame detections was tracked using a Kalman filter. The face was aligned based on 5 landmarks using a model that estimated the position of the corners of the eyes and of the bottom of the nose (King, 2009) and was scaled to 256×256 pixels. The mouth was extracted by cropping the central lower face region of size 128×128 pixels. Each segment of 5 consecutive grayscale video frames spanning a total of 200 ms was paired with the respective 20 consecutive audio frames.

3.3. Neural Network Architecture and Training

The preprocessed video and audio signals, standardised using the mean and the variance from the training set, were used as input to a video and an audio encoder, respectively.
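The audio preprocessing of Section 3.2 and the mask-approximation objective of Equation (1) can be illustrated with a short NumPy sketch. It is not the authors' implementation: the framing without padding, the symmetric `np.hamming` window, the small constant `eps` guarding the division, and all function names are assumptions made for illustration, and random noise stands in for real speech.

```python
import numpy as np

FS = 16_000   # sample rate in Hz (from the text)
N_FFT = 640   # window and FFT length: 640 samples = 40 ms
HOP = 160     # hop size: 160 samples = 10 ms


def stft(signal):
    """Complex STFT, positive-frequency bins only: shape (321, n_frames).

    Assumption: no padding, so trailing samples that do not fill a
    whole window are dropped.
    """
    window = np.hamming(N_FFT)
    n_frames = 1 + (len(signal) - N_FFT) // HOP
    frames = np.stack(
        [signal[i * HOP:i * HOP + N_FFT] * window for i in range(n_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=0)


def ideal_amplitude_mask(clean_stft, noisy_stft, eps=1e-8):
    """M(k, l) = |X(k, l)| / |Y(k, l)|; eps is an added numerical guard."""
    return np.abs(clean_stft) / (np.abs(noisy_stft) + eps)


def mask_objective(target_mask, estimated_mask):
    """Equation (1): squared error averaged over all T x F mask entries."""
    return float(np.mean((target_mask - estimated_mask) ** 2))


rng = np.random.default_rng(0)
clean = rng.standard_normal(FS)          # 1 s stand-in for clean speech
clean /= np.max(np.abs(clean))           # peak normalisation
noisy = clean + rng.standard_normal(FS)  # additive noise model y = x + d

X, Y = stft(clean), stft(noisy)
M = ideal_amplitude_mask(X, Y)
M_hat = np.ones_like(M)                  # dummy "estimate" for illustration
print(X.shape, mask_objective(M, M_hat))
```

With these settings a one-second signal yields 97 frames of 321 bins each. In a real system the dummy estimate would be replaced by the network output, and the objective would be minimised over the training set.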
Footnotes: ADFD speech database: https://www.nb.no/sbfil/dok/nst_taledat_dk.pdf; FFMPEG: http://ffmpeg.org

Both encoders consisted of 6 convolutional layers, each of them followed by leaky-ReLU activation functions (Maas et al., 2013) and batch normalisation (Ioffe and Szegedy, 2015). For the video encoder, also max-pooling and 0.25 dropout (Hinton et al., 2012) were adopted. The fusion of the two modalities was accomplished using a sub-network consisting of 3 fully connected layers, followed by leaky-ReLU activations, on the outputs of the 2 encoders. The 321×20 estimated mask was obtained with an audio decoder having 6 transposed convolutional layers followed by leaky-ReLU activations and a ReLU activation as output layer. Skip connections between the layers 1, 3, and 5 of the audio encoder and the corresponding decoder layers were used to avoid that the bottleneck hindered the information flow (Isola et al., 2017). The values of the training target, $M(k,l)$, were limited in the $[0, 10]$ interval (Wang et al., 2014).

The weights of the network were initialised with the Xavier approach (Glorot and Bengio, 2010). The training was performed using the Adam optimiser (Kingma and Ba, 2015) with the objective function in Equation (1) and a batch size of 64. The learning rate, initially set to $4 \cdot 10^{-4}$, was scaled by a factor of 0.5 when the loss increased on the validation set. An early stopping technique was used, by selecting the network that performed the best on the validation set across the 50 epochs used for training.

3.4. Postprocessing

The estimated ideal amplitude mask of an utterance was obtained by concatenating the outputs of the network, obtained by processing non-overlapping consecutive audio-visual paired segments. The estimated mask was point-wise multiplied with the complex-valued STFT spectrogram of the noisy signal and the result inverted using an overlap-add procedure to get the time-domain signal (Allen, 1977; Griffin and Lim, 1984).

3.5. Mono-Modal Speech Enhancement

Until now, we only presented AV-SE systems. In order to understand the relative contribution of the audio and the visual modalities, we also trained networks to perform mono-modal SE, by removing one of the two encoders from the neural network architecture, without changing the other explained settings and procedures. Both AO-SE and video-only SE (VO-SE) systems estimate a mask and apply it to the noisy speech, but they differ in the signals used as input.

4. Experiments

The experiments conducted in this study compare the performance of AO-SE, VO-SE, and AV-SE systems in terms of two widely adopted objective measures: perceptual evaluation of speech quality (PESQ) (Rix et al., 2001), specifically the wideband extension (ITU, 2005) as implemented by Loizou (2007), and extended short-time objective intelligibility (ESTOI) (Jensen and Taal, 2016). PESQ scores, used to estimate speech quality, lie between −0.5 and 4.5, where high values correspond to high speech quality. However, the wideband extension that we use maps these scores to mean opinion score (MOS) values, on a scale from approximately 1 to 4.64. ESTOI scores, which estimate speech intelligibility, practically range from 0 to 1, where high values correspond to high speech intelligibility.

As mentioned before (Section 2), clean speech signals were mixed with SSN to match the noise type used in the Lombard GRID corpus. Current state-of-the-art SE systems are trained with signals at several SNRs to make them robust to various noise levels. We followed a similar methodology and trained our models with two different SNR ranges, narrow (between −20 dB and 5 dB) and wide (between −20 dB and 30 dB). We used these two ranges because on the one hand we would like to assess the performance of SE systems when Lombard speech occurs, and on the other hand we would like to have SNR-independent systems, i.e. systems that also work well at higher SNRs. Such a setup allows us to better understand whether Lombard speech, which is usually not available because it is hard to collect, should be used to train SE systems and which are the advantages and the disadvantages of various training configurations. The models used in this work are shown in Table 2.

Table 2: Models used in this study. The ‘(w)’ is used to distinguish the systems trained with a wide SNR range from the ones trained with a narrow SNR range.

                      Training material
                Non-Lombard speech           Lombard speech
System input    Narrow SNR    Wide SNR       Narrow SNR    Wide SNR
Vision          VO-NL         VO-NL(w)       VO-L          VO-L(w)
Audio           AO-NL         AO-NL(w)       AO-L          AO-L(w)
Audio-Visual    AV-NL         AV-NL(w)       AV-L          AV-L(w)

Similarly to the work by Marxer et al. (2018), the experiments were conducted adopting a multi-speaker setup, in which all the speakers in the database were used for both training and evaluating the systems. This choice was made for a practical reason. People may exhibit speech characteristics that differ considerably from each other when they speak in presence of noise (Junqua, 1993; Marxer et al., 2018). It is possible to model these differences by training speaker-dependent systems, but this requires a large set of Lombard speech for every speaker. Unfortunately, the audio-visual speech corpus that we use, despite being one of the largest existing audio-visual databases for Lombard speech, only contains 50 utterances per speaker, which are not enough to train a deep-learning-based model.

The experiments were performed according to a stratified five-fold cross-validation procedure (Liu and Özsu, 2009). Specifically, the data was divided into five folds of approximately the same size, four of them used for training and validation, and one for testing. This process was repeated five times for different test sets in order to evaluate the systems on the whole dataset. Before the split, the signals were rearranged to have about the same amount of data for each speaker across the training (≈ 35 utterances), the validation (≈ 5 utterances), and the test (≈ 10 utterances) sets. This ensured that each fold was a good representative of the inter-speaker variations of the whole dataset. For some speakers, some data was missing or corrupted, so we used fewer utterances. Among the 55 speakers, the recordings from speaker s1 were discarded by the database collectors due to technical issues, and the data from speaker s51 was used only in the training set, because only 40 of the utterances could be used. Effectively, 53 speakers were used to evaluate our systems.

4.1. Systems Trained on a Narrow SNR Range

Since we would like to assess the performance of SE systems when Lombard speech occurs, SSN is added to the speech signals from the Lombard GRID corpus at 6 different SNRs, in uniform steps between −20 dB and 5 dB. This choice was driven by the following considerations (Michelsanti et al., 2019a). Since Lombard and non-Lombard utterances from the Lombard GRID corpus have an energy difference between 3 and 13 dB (Marxer et al., 2018), the actual SNR can be computed assuming that the conversational speech level is between 60 and 70 dB sound pressure level (SPL) (Raphael et al., 2007; Moore et al., 2012) and the noise level at 80 dB SPL, like in the recording conditions of the database. The SNR range obtained in this way is between −17 and 3 dB. In the experiments, we used a slightly wider range because of the possible speech level variations caused by the distance between the listener and the speaker.

For all the systems, Lombard speech was used to build the test set, while for training and validation we used Lombard speech for VO-L, AO-L, and AV-L, and non-Lombard speech for VO-NL, AO-NL, and AV-NL (Table 2).

4.1.1. Results and Discussion

Figure 2 shows the cross-validation results in terms of PESQ and ESTOI for all the different systems.

Table 3: Average scores for the systems trained on a narrow SNR range (−20 to 5 dB).

Measure   VO-L     VO-NL    AO-L     AO-NL    AV-L     AV-NL
PESQ      1.163    1.113    1.353    1.283    1.446    1.331
ESTOI     0.372    0.335    0.448    0.423    0.528    0.488

[...] fact that the noise level is so high that recovering the clean speech only using the noisy audio input is very challenging, and that the visual modality provides a richer information source at this noise level.

For all the modalities, L systems tend to be better than the respective NL systems. The only exception is AO-NL, which has a higher PESQ score than AO-L at −20 dB SNR, but this difference is very modest (0.011). AV-L always outperforms AV-NL in terms of PESQ by a large margin, with more than 5 dB SNR gain, if we consider the performance between −20 dB and −10 dB SNR. On average (Table 3), the performance gap in terms of PESQ between L and NL systems is greater for the audio-visual case (0.115) than for the audio-only (0.070) and the video-only (0.050) cases, meaning that the speaking style mismatch is more detrimental when both the modalities are used. Regarding ESTOI, the gap between AV-L and AV-NL (0.040) is still the largest, but the one between VO-L and VO-NL (0.037) is greater than the gap between AO-L and AO-NL (0.025): this suggests that the impact of visual differences between Lombard and non-Lombard speech on estimated speech intelligibility is higher than the impact of acoustic differences.

These results suggest that training systems with Lombard speech is beneficial in terms of both estimated speech quality and estimated speech intelligibility. This is in line
On average, with and extends our preliminary study (Michelsanti et al., every model improves the estimated speech quality and the 2019a), where only a subset of the whole database was used estimated speech intelligibility of the unprocessed signals, to evaluate the models. with the exception of VO-NL at 5 dB SNR, which shows 4.1.2. Effects of Inter-Speaker Variability an ESTOI score comparable with the one of noisy speech. Previous work found a large inter-speaker variability for Another general trend that can be observed is that AV sys- Lombard speech, especially between male and female speak- tems outperform the respective AO and VO systems, an ex- ers (Junqua, 1993). Here, we investigate whether this vari- pected result since the information that can be exploited us- ability affects the performance of SE systems. ing two modalities is no less than the information of the sin- Figure 3 shows the average PESQ and ESTOI scores by gle modalities taken separately. gender. Since the scores are computed on different speech It is worth noting that VO systems’ performance changes material, it may be hard to make a direct comparison be- across SNR, although they do not use the audio signal to esti- tween males and females by looking at the absolute perfor- mate the ideal amplitude mask. This is because the estimated mance. Instead, we focus on the gap between L and NL mask is applied to the noisy input signal, so the performance systems averaged across SNRs for same gender. At a first depends on the noise level of the input audio signal. glance, the trends of the different conditions are as expected: PESQ scores show that the performance that can be ob- L systems are better than the respective NL ones, and AV tained with AO systems is comparable with VO systems per- systems outperform AO systems trained with speech of the formance at very low SNRs. 
Only for SNR g *10 dB, AO same speaking style, in terms of both estimated speech qual- models start to perform substantially better than VO mod- ity and estimated speech intelligibility. We also notice that els. The difference increases with higher SNRs. Also for the scores of VO systems are worse than the AO ones, also ESTOI, this pattern can be observed when SNR g *10 dB, for ESTOI. This is because we average across all the SNRs but for SNR f *15 dB VO systems perform better than the and VO is better than AO only at very low SNRs, but con- respective AO systems, especially at *20 dB SNR where the siderably worse for SNR g *5 dB (Figure 2). performance gap is very large. This can be explained by the - Page 5 of 16 Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect PESQ ESTOI 2.2 0.8 1.16 VO-L VO-L VO-NL VO-NL 1.14 AO-L AO-L 0.7 AO-NL AO-NL 2 AV-L AV-L 1.12 AV-NL AV-NL Unproc. Unproc. 1.1 0.6 1.08 1.8 0.5 1.06 1.04 1.6 0.4 1.02 -20 -15 SNR (dB) 0.3 1.4 0.2 1.2 0.1 1 0 -20 -15 -10 -5 0 5 -20 -15 -10 -5 0 5 SNR (dB) SNR (dB) Figure 2: Cross-validation results in terms of PESQ and ESTOI for the systems trained on a narrow SNR range. At every SNR, there are three pairs of coloured bars with error bars, each of them referring to VO, AO, and AV systems (from left to right). The wide bars in dark colours represent L systems, while the narrow ones in light colours represent NL systems. The heights of each bar and the error bars indicate the average scores and the 95% confidence intervals computed on the pooled data, respectively. The transparent boxes with black edges, overlaying the bars of the other systems, and the error bars indicate the average scores of the unprocessed signals (Unproc.) and their 95% confidence intervals, respectively. PESQ ESTOI 1.5 0.6 1.45 0.5 1.4 1.35 0.4 1.3 1.25 0.3 MA 1.2 0.2 1.15 VO-L VO-NL MS AO-L 1.1 AO-NL 0.1 AV-L 1.05 Figure 4: Mouth aperture (MA) and mouth spreading (MS) AV-NL Unproc. 
from 4 facial landmarks.

[Figure 3 image: two bar-chart panels, PESQ and ESTOI, with Male and Female groups on the x-axis.]

Figure 3: Cross-validation results for male and female speakers in terms of PESQ and ESTOI.

The difference between L and NL systems is larger for females than it is for males. This can be observed for all the modalities and it is more noticeable for AV systems, most likely because they account for both audio and visual differences. In order to better understand this behaviour, we provide a more in-depth analysis, investigating the impact that some acoustic and geometric articulatory features have on estimated speech quality and estimated speech intelligibility.

We consider three different features that have already been used to study Lombard speech in previous work (Garnier et al., 2006, 2012; Tang et al., 2015; Alghamdi, 2017): F0, mouth aperture (MA) and mouth spreading (MS). The average F0 for each speaker was estimated with Praat (Boersma and Weenink, 2001), using the default settings for pitch estimation. The average MA and MS per speaker were computed from 4 facial landmarks (Figure 4) obtained with the pose estimation algorithm (Kazemi and Sullivan, 2014), trained on the iBUG 300-W database (Sagonas et al., 2016), implemented in the dlib toolkit (King, 2009).

Let ΔF0, ΔMA, and ΔMS denote the average difference in audio and visual features, respectively, between Lombard and non-Lombard speech. Similarly, let ΔPESQ and ΔESTOI denote the increment in PESQ and ESTOI, respectively, of AV-L with respect to AV-NL. Figure 5 illustrates the relationship of ΔF0, ΔMA, and ΔMS with ΔPESQ and ΔESTOI. We notice that on average for each speaker ΔPESQ and ΔESTOI are both positive, with only one exception represented by a male speaker, whose ΔESTOI is slightly less than 0. This indicates that no matter how different the speaking style of a person is in presence of noise, there is a benefit in training a system with Lombard speech. Focusing on the range of the features' variations, most of the speakers have positive ΔMA, ΔMS, and ΔF0. This is in accordance with previous research, which suggests that in Lombard condition there is a tendency to amplify lip movements and raise the pitch (Garnier et al., 2010, 2012; Junqua, 1993). ΔMA and ΔMS values lie between -2 and 6 pixels, and between -2 and 4 pixels, respectively, for both male and female speakers. Regarding the ΔF0 range, it is wider for females, up to 50 Hz, against the 25 Hz reached by males.

[Figure 5 image: six scatter panels; y-axes: PESQ for AV systems and ESTOI for AV systems; x-axes: MA (pixels), MS (pixels), and F0 (Hz); legend: Male, Female.]

Figure 5: Scatter plots showing the relationship between the audio/visual features and PESQ/ESTOI. For each circle, which refers to a particular speaker, the y-coordinate indicates the average performance increment of AV-L with respect to AV-NL in terms of PESQ or ESTOI, while the x-coordinate indicates the average increment of audio (fundamental frequency) or visual (mouth aperture and mouth spreading) features in Lombard condition with respect to the respective feature in non-Lombard condition. The lines show the least-squares lines for male speakers (blue), female speakers (red), and all the speakers (yellow). MA, MS, and F0 indicate mouth aperture, mouth spreading, and fundamental frequency, respectively.

Among the three features considered, ΔF0 is the one that seems to be related the most with ΔPESQ and ΔESTOI. This can be seen by comparing the distributions of the circles with the least-squares lines in the plots of Figure 5 or by analysing the correlation between PESQ/ESTOI increments and audio/visual feature increments, using Pearson's and Spearman's correlation coefficients.

Given $n$ pairs of $(x_i, y_i)$ observations, with $i \in \{1, \dots, n\}$, from two variables $x$ and $y$, whose sample means are denoted as $\bar{x}$ and $\bar{y}$, respectively, we refer to the Pearson's correlation coefficient as $\rho_P(x, y)$. We have that $-1 \le \rho_P(x, y) \le 1$, where 0 denotes the absence of a linear relationship between the two variables, and $-1$ and $1$ denote a perfect negative and a perfect positive linear relationship, respectively. To complement the Pearson's correlation coefficient, we also consider the Spearman's correlation coefficient, $\rho_S(x, y)$, defined as (Sharma, 2005):

$$\rho_S(x, y) = \rho_P(r_x, r_y), \qquad (2)$$

where $r_x$ and $r_y$ indicate rank variables. The advantage of using ranks is that $\rho_S$ allows us to assess whether the relationship between $x$ and $y$ is monotonic (not limited to linear).

As shown in Table 4, for AV systems, ΔF0 has a higher correlation with ΔPESQ ($\rho_P = 0.73$, $\rho_S = 0.73$) and ΔESTOI ($\rho_P = 0.81$, $\rho_S = 0.77$) than ΔMA and ΔMS. We observe that for female speakers, the correlation between the features' increments and the performance measures' increments is usually higher, especially when considering ΔMS, suggesting that some inter-gender difference should be present not only for ΔF0 (whose range is much wider for females, as previously stated), but also for visual features.

                       ρ_P                  ρ_S
                       all    m      f      all    m      f
ΔPESQ (VO) - ΔMA       .29    .32    .24    .35    .30    .29
ΔPESQ (AO) - ΔMA       .43    .49    .40    .55    .49    .51
ΔPESQ (AV) - ΔMA       .57    .59    .56    .65    .52    .66
ΔESTOI (VO) - ΔMA      .46    .19    .57    .52    .16    .69
ΔESTOI (AO) - ΔMA      .43    .47    .46    .52    .52    .50
ΔESTOI (AV) - ΔMA      .57    .53    .65    .67    .47    .72
ΔPESQ (VO) - ΔMS       .19   -.08    .35    .12   -.03    .31
ΔPESQ (AO) - ΔMS       .31    .20    .45    .33    .19    .54
ΔPESQ (AV) - ΔMS       .45    .21    .68    .44    .28    .71
ΔESTOI (VO) - ΔMS      .45   -.12    .73    .22   -.21    .62
ΔESTOI (AO) - ΔMS      .30    .05    .47    .22    .07    .48
ΔESTOI (AV) - ΔMS      .47    .02    .72    .34   -.02    .66
ΔPESQ (VO) - ΔF0       .34    .26    .31    .36    .23    .35
ΔPESQ (AO) - ΔF0       .62    .53    .58    .61    .52    .61
ΔPESQ (AV) - ΔF0       .73    .58    .75    .73    .59    .80
ΔESTOI (VO) - ΔF0      .77    .57    .77    .77    .58    .82
ΔESTOI (AO) - ΔF0      .64    .55    .60    .60    .56    .61
ΔESTOI (AV) - ΔF0      .81    .64    .81    .77    .61    .84

Table 4: Pearson's ($\rho_P$) and Spearman's ($\rho_S$) correlation coefficients between PESQ/ESTOI increments and audio/visual feature increments for male speakers (m), female speakers (f), and all the speakers. MA, MS, and F0 indicate mouth aperture, mouth spreading, and fundamental frequency, respectively.

In Table 4 we also report the correlation coefficients for the single modalities. The correlation of visual features' increments with ΔPESQ or ΔESTOI is sometimes higher for AO systems than it is for VO systems. This might seem counter-intuitive, because AO systems do not use visual information. However, correlation does not imply causation (Field, 2013): since visual and acoustic features are correlated (Almajai et al., 2006), it is possible that other acoustic features, which are not considered in this study even though they might be correlated with ΔMA and ΔMS, play a role in the enhancement. Similar considerations can be made for ΔF0, which has a correlation with ΔESTOI for VO systems ($\rho_P = 0.77$, $\rho_S = 0.77$) higher than the one for AO systems ($\rho_P = 0.64$, $\rho_S = 0.60$). By looking at the inter-gender differences, we find that, in general, the correlation coefficients computed for female speakers are higher than the ones computed for male speakers, especially when considering ΔMS.

In general, a performance difference between genders exists when L systems are compared with NL ones, with a gap that is larger for females. This is unlikely to be caused by the small gender imbalance in the training set (23 males and 30 females). Instead, it is reasonable to assume that this result is due to the characteristics of the Lombard speech of female speakers, which shows a large increment of F0, the feature that correlates the most with the estimated speech quality and the estimated speech intelligibility increases, among the ones considered.

4.2. Systems Trained on a Wide SNR Range

The models presented in Section 4.1 have been trained to enhance signals when Lombard effect occurs, i.e. at SNRs between -20 and 5 dB. However, from a practical perspective, SNR-independent systems, capable of enhancing both Lombard and non-Lombard speech, are preferred. There are several ways to achieve this goal. For example, it is possible to train a system (with Lombard speech) that works at low SNRs, and another one (with non-Lombard speech) that works at high SNRs. This approach requires switching between the two systems, which can be problematic, because it involves an online estimation of the SNR. An alternative approach is to train general systems with Lombard speech at low SNRs and non-Lombard speech at high SNRs. We followed this alternative approach, building such systems and studying their strengths and limitations. We also compared them with systems trained only with non-Lombard speech for the whole SNR range, because this is what current state-of-the-art systems do.

The test set was built by mixing additive SSN with Lombard speech at 6 SNRs between -20 and 5 dB, and with non-Lombard speech at 5 SNRs between 10 and 30 dB. For VO-NL^(w), AO-NL^(w), and AV-NL^(w), only non-Lombard speech was used during training, while for VO-L^(w), AO-L^(w), and AV-L^(w), Lombard speech was used with SNR ≤ 5 dB and non-Lombard speech with SNR ≥ 10 dB, to match the speaking style of the test set (Table 2). The results in terms of PESQ and ESTOI are shown in Figure 6.

[Figure 6 image: two panels, PESQ and ESTOI, with SNR (dB) from -20 to 30 on the x-axis; legend: VO-L^(w), VO-NL^(w), AO-L^(w), AO-NL^(w), AV-L^(w), AV-NL^(w), Unproc.]

Figure 6: As Figure 2, but for the systems trained on a wide SNR range.

The relative performance of the systems at SNR ≤ 5 dB is similar to the one observed for the systems trained on a narrow SNR range (Section 4.1): L^(w) systems outperform the respective NL^(w) systems, AV performance is higher than AO and VO performance, and VO is considerably better than AO only in terms of ESTOI at very low SNRs.

When SNR ≥ 10 dB, NL^(w) systems perform better than L^(w) systems in terms of PESQ. The difference is on average (Table 5) larger for VO (0.070) than it is for AO (0.028) and AV (0.018). This can be explained by the fact that it is harder for VO-L^(w) to recognise when non-Lombard speech occurs using only the video of the speaker. However, these performance gaps are smaller than the ones between L^(w) and NL^(w) systems at SNR ≤ 5 dB (0.073 for VO, 0.051 for AO, and 0.101 for AV).

PESQ          VO-L^(w)  VO-NL^(w)  AO-L^(w)  AO-NL^(w)  AV-L^(w)  AV-NL^(w)
-20 to 5 dB   1.153     1.080      1.346     1.295      1.424     1.323
10 to 30 dB   2.348     2.418      3.127     3.155      3.151     3.169

ESTOI         VO-L^(w)  VO-NL^(w)  AO-L^(w)  AO-NL^(w)  AV-L^(w)  AV-NL^(w)
-20 to 5 dB   0.376     0.330      0.442     0.422      0.517     0.483
10 to 30 dB   0.844     0.825      0.927     0.929      0.928     0.930

Table 5: Average scores for the systems trained on a wide SNR range.

Regarding ESTOI at SNR ≥ 10 dB, the difference between AO and AV becomes negligible, with VO systems that perform considerably worse. This is because audio features are more informative than visual ones at high SNRs, making AO-SE systems already good enough to recover speech intelligibility. In addition, the average gaps between NL^(w) and L^(w) are quite small: 0.002 for AO and AV, while for VO it is actually -0.019.

In general, at SNR ≤ 5 dB, the systems that use both Lombard and non-Lombard speech for training perform better than the ones that only use non-Lombard speech. At higher SNRs, their PESQ and ESTOI scores are slightly worse than the ones of the systems trained only with non-Lombard speech. However, this performance gap is small, and seems to be larger for the estimated speech quality than for the estimated speech intelligibility. The way we combined non-Lombard and Lombard speech for training seems to be the best solution for an SNR-independent system, although a small performance loss may occur at high SNRs.

5. Listening Tests

Although it has been shown that visual cues have an impact on speech perception (Sumby and Pollack, 1954; McGurk and MacDonald, 1976), the currently available objective measures used to estimate speech quality and speech intelligibility, e.g. PESQ and ESTOI, only take into account the audio signals. Even when listening tests are performed to evaluate the performance of an SE system, visual stimuli are usually ignored and not presented to the participants (Hussain et al., 2017), despite the fact that visual inputs are typically available during practical deployment of SE systems.

For these reasons, and in an attempt to evaluate the proposed AV enhancement systems in a setting as realistic as possible, we performed two listening tests, one to assess the speech quality and the other to assess the speech intelligibility, where the processed audio signals from the Lombard GRID corpus were accompanied by their corresponding visual stimuli. Both tests were conducted in a silent room, where a MacBookPro11,4 equipped with an external monitor, a sound card (Focusrite Scarlett 2i2) and a set of closed headphones (Beyerdynamic DT770) was used for audio and video playback. The multimedia player (VLC media player 3.0.4) was controlled by the subjects with a graphical user interface (GUI) modified from MUSHRAM (Vincent, 2005). The processed signals used in this test were from the systems trained on the narrow SNR range previously described (Section 4.1). All the audio stimuli were normalised according to the two-pass EBU R128 loudness normalisation procedure (EBU, 2014), as implemented in ffmpeg-normalize (https://github.com/slhck/ffmpeg-normalize), to guarantee that signals of different conditions were perceived as having the same volume. The subjects were allowed to adjust the general loudness to a comfortable level during the training session of each test.

5.1. Speech Quality Test

The quality test was carried out by 13 experienced listeners, who volunteered to be part of the study. The participants were between 26 and 44 years old, and had self-reported normal hearing and normal (or corrected to normal) vision. On average, each participant took approximately 30 minutes to complete the test.

[Figure 7 image: two box-plot panels, MUSHRA (-5 dB SNR) and MUSHRA (5 dB SNR), with scores from 0 to 100 on the y-axis.]

Figure 7: Box plots showing the results of the MUSHRA experiments for the signals at -5 dB SNR (left) and at 5 dB SNR (right). The red horizontal lines and the diamond markers indicate the median and the mean values, respectively. Outliers (identified according to the 1.5 interquartile range rule) are displayed as red crosses. Ref. indicates the reference signals.

Small effect size       Medium effect size      Large effect size
0.11 ≤ |d_C| < 0.28     0.28 ≤ |d_C| < 0.43     0.43 ≤ |d_C| ≤ 1

Table 6: Interpretation of the effect size (Cliff's delta, $d_C$). Adapted from (Vargha and Delaney, 2000).

                    -5 dB SNR            5 dB SNR
Comparison          p          d_C       p          d_C
AO-L - AO-NL        < .0083    .30       .0134      .22
AV-L - AV-NL        < .0083    .32       < .0083    .23
AO-L - AV-L         .0498     -.14       .7476      .02
AO-NL - AV-NL       < .0083   -.21       .8262     -.02
AO-L - Unproc.      .0479      .57       < .0083    .74
AV-L - Unproc.      .0134      .59       < .0083    .79

Table 7: p-values (p) and effect sizes (Cliff's delta, $d_C$) for the MUSHRA experiments. The significance level (0.0083) for the p-values is corrected with the Bonferroni method.
5.1.1. Procedure

The test used the MUlti Stimulus test with Hidden Reference and Anchor (MUSHRA) (ITU, 2003) paradigm to assess the speech quality on a scale from 0 to 100, divided into 5 equal intervals labelled as bad, poor, fair, good, and excellent. No definition of speech quality was provided to the participants. Each subject was presented with 2 sequences of 8 trials each, 4 to evaluate the systems at -5 dB SNR, and 4 to evaluate the systems at 5 dB SNR. Lower SNRs were not considered to ensure that the perceptual quality assessment was not influenced too much by the decrease in intelligibility. One trial consisted of one reference (clean speech signal) and seven other signals to be rated with respect to the reference: 1 hidden reference, 4 systems under test (AO-L, AO-NL, AV-L, AV-NL), 1 unprocessed signal, and 1 hidden anchor (unprocessed signal at -10 dB SNR). The participants were allowed to switch at will between any of the signals inside the same trial. The order of presentation of both the trials and the conditions was randomised, and signals from 4 different randomly chosen speakers were used for each sequence of trials.

Before the actual test, the participants were trained in a special separate session, with the purpose of exposing them to the nature of the impairments and making them familiar with the equipment and the grading system.

5.1.2. Results and Discussion

The average scores assigned by the subjects for each condition are shown in Figure 7 in the form of box plots.

Non-parametric approaches are used to analyse the data (Mendonça and Delikaris-Manias, 2018; Winter et al., 2018), since the assumption of normal distribution of the data is invalid, given the number of participants and their different interpretation of the MUSHRA scale. Specifically, the paired two-sided Wilcoxon signed-rank test (Wilcoxon, 1945) is adopted to determine whether there exists a median difference between the MUSHRA scores obtained for two different conditions. Differences in median are considered significant for $p < \alpha/m = 0.0083$ ($\alpha = 0.05$, $m = 6$), where the significance level is corrected with the Bonferroni method to compensate for multiple hypotheses tests (Field, 2013). The use of p-values as the only analysis strategy has been heavily criticized (Hentschke and Stüttgen, 2011) because statistical significance can be obtained with a big sample size (Sullivan and Feinn, 2012; Moore et al., 2012) even if the magnitude of the effect is negligible (Hentschke and Stüttgen, 2011). For this reason, we complement p-values with a non-parametric measure of the effect size, the Cliff's delta (Cliff, 1993):

$$d_C = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} [x_i > y_j] - \sum_{i=1}^{m} \sum_{j=1}^{n} [x_i < y_j]}{mn}, \qquad (3)$$

where $x_i$ and $y_j$ are the observations of the samples of sizes $m$ and $n$ to be compared and $[P]$ indicates the Iverson bracket, which is 1 if $P$ is true and 0 otherwise. As reported in Table 6, we consider the effect size to be small if $0.11 \le |d_C| < 0.28$, medium if $0.28 \le |d_C| < 0.43$, and large if $|d_C| \ge 0.43$, according to the indication by Vargha and Delaney (2000). The p-values and the effect sizes for the comparisons considered in this study are shown in Table 7.

At SNR = -5 dB, a significant ($p < 0.0083$) medium ($0.28 < |d_C| < 0.43$) difference exists between Lombard and non-Lombard systems for both the audio-only and the audio-visual cases. The increment in quality when using vision with respect to audio-only systems is perceived by the subjects ($|d_C| > 0.11$), but it has only a relatively small effect ($|d_C| < 0.28$). This was expected, since visual cues affect more the intelligibility at low SNRs than quality, as also shown by objective measures (Figure 2). More specifically, for non-Lombard systems, this difference is significant and greater than the one found for Lombard systems, meaning
that vision contributes more when the enhancement of Lombard speech is performed with systems that were not trained with it. We can notice that there is a large ($|d_C| > 0.43$) difference between the unprocessed signals and the version enhanced with Lombard systems. However, this difference is not significant, probably due to the heterogeneous interpretation of the MUSHRA scale by the subjects and their preference of the different natures of the impairment (presence of noise or artefacts caused by the enhancement).

At an SNR of 5 dB a small difference between Lombard and non-Lombard systems is observed, despite being not significant in the audio-only case ($p = 0.0134$). At this noise level, audio-visual systems appear to be indistinguishable ($|d_C| < 0.11$) from the respective audio-only systems. This confirms the intuition that vision does not help in improving the speech quality at high SNRs. Finally, the difference between the unprocessed signals and the respective enhanced versions using Lombard systems is both large ($|d_C| > 0.43$) and significant ($p < 0.0083$), which makes it clear that both AO-L and AV-L improve the speech quality.

5.2. Speech Intelligibility Test

The intelligibility test was carried out by 11 listeners, who volunteered to be part of the study. The participants were between 24 and 65 years old, and had self-reported normal hearing and normal (or corrected to normal) vision. On average, each participant took approximately 45 minutes to complete the test.

5.2.1. Procedure

Each subject was presented with 2 sequences of 80 audio-visual stimuli from the Lombard GRID corpus: 8 speakers × 4 SNRs (-20, -15, -10, and -5 dB) × 5 processing conditions (unprocessed, AO-L, AO-NL, AV-L, AV-NL). The participants were asked to listen to each stimulus only once and, based on what they heard, they had to select the colour and the digit from a list of options and to write the letter (Table 1). The order of presentation of the stimuli was randomised.

Before the actual test, the participants were trained in a special separate session consisting of a sequence of 40 audio-visual stimuli.

5.2.2. Results and Discussion

The mean percentage of correctly identified keywords as a function of the SNR is shown in Figure 8. We can see that among the three fields, the colour is the easiest word to be identified by the participants. In general, the following trends can be observed. At low SNRs the intelligibility of the signals enhanced with AV systems is higher than the intelligibility obtained with AO systems. This difference substantially diminishes when the SNR increases. There is no big performance difference between L and NL systems, but in general AV-L tends to have higher percentage scores than the other systems. AV-L is also the only system that does not decrease the mean intelligibility scores for all the fields if compared to the unprocessed signals.

Table 8 shows Cliff's deltas and p-values, computed with the paired two-sided Wilcoxon signed-rank test, as in the MUSHRA experiments.

p                   SNR
Comparison          -20 dB    -15 dB    -10 dB    -5 dB
AO-L - AO-NL        .3066     .4688     .0430     .2539
AV-L - AV-NL        .0625     .8633     .0742     .1055
AO-L - AV-L         .0010     .0117     .5625     .2344
AO-NL - AV-NL       .0527     .0430     .3359     .2070
AO-L - Unproc.      .0332     .0547     .9004     .1250
AV-L - Unproc.      .1270     .0078     .8828     .8828

d_C                 SNR
Comparison          -20 dB    -15 dB    -10 dB    -5 dB
AO-L - AO-NL        -.08      .06       .31       -.31
AV-L - AV-NL        .32       .01       .39       .28
AO-L - AV-L         -.91      -.35      -.17      -.34
AO-NL - AV-NL       -.32      -.37      -.31      .21
AO-L - Unproc.      -.31      .17       -.09      -.26
AV-L - Unproc.      .18       .46       0         .08

Table 8: p-values (p) and effect sizes (Cliff's delta, $d_C$) for the mean intelligibility scores for all the keywords obtained in the listening tests.

The effect sizes support the observations made from Figure 8. Medium and large differences ($|d_C| > 0.28$) exist between AO and AV systems, especially at low SNRs. While AO-L and AO-NL are indistinguishable ($|d_C| < 0.11$) for SNR < -10 dB, there is a medium ($0.28 \le |d_C| < 0.43$) difference between AV-L and AV-NL, except for -15 dB SNR ($d_C = 0.01$). Moreover, the intelligibility increase of AV-L over the unprocessed signals is perceived by the subjects at SNR ≤ -15 dB ($|d_C| > 0.11$).

Regarding the p-values, if we focus on each SNR separately, the difference between two approaches can be considered significant for $p < 0.0083$ (cf. Section 5.1.2). This condition is met only when we compare AO-L with AV-L at -20 dB SNR and AV-L with the noisy speech at -15 dB SNR.

There are three main sources of variability that most likely prevent the differences from being significant. First, the variation in lipreading ability among individuals is large and does not directly reflect the variation found in auditory speech perception skills (Summerfield, 1992). Secondly, individuals have very different fusion responses to discrepancy in the auditory and visual syllables (Mallick et al., 2015), which in our case might occur due to the artefacts produced in the enhancement process. Finally, the participants were not exposed to the same utterances processed with the different approaches like in MUSHRA. Since the vocabulary set of the Lombard GRID corpus is small and some words are easier to understand because they contain unambiguous visemes, the intelligibility scores are affected not only by the various processing conditions, but also by the different sentences used.
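Two quantities recur in the statistical analysis of both listening tests: the Bonferroni-corrected significance level α/m and Cliff's delta from Equation (3). The sketch below is a minimal illustration rather than the authors' analysis script. The function names and the toy score lists are hypothetical, the p-values are copied from Table 8, and the Wilcoxon signed-rank test itself is not reimplemented.

```python
from itertools import product


def cliffs_delta(x, y):
    """Cliff's delta as in Equation (3).

    Counts the pairs with x_i > y_j, subtracts the pairs with x_i < y_j,
    and divides by the number of pairs (len(x) * len(y)).
    """
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for xi, yj in product(x, y) if xi > yj)
    smaller = sum(1 for xi, yj in product(x, y) if xi < yj)
    return (greater - smaller) / (len(x) * len(y))


def bonferroni_level(alpha=0.05, n_tests=6):
    """Significance level corrected for n_tests comparisons (alpha / m)."""
    return alpha / n_tests


# Toy scores for two conditions (hypothetical values, illustration only).
scores_a = [62, 70, 55, 81]
scores_b = [48, 60, 51, 66]
print("Cliff's delta (toy data):", cliffs_delta(scores_a, scores_b))

# p-values copied from Table 8, ordered by SNR: -20, -15, -10, -5 dB.
snrs_db = (-20, -15, -10, -5)
table8_p = {
    "AO-L - AO-NL": (0.3066, 0.4688, 0.0430, 0.2539),
    "AV-L - AV-NL": (0.0625, 0.8633, 0.0742, 0.1055),
    "AO-L - AV-L": (0.0010, 0.0117, 0.5625, 0.2344),
    "AO-NL - AV-NL": (0.0527, 0.0430, 0.3359, 0.2070),
    "AO-L - Unproc.": (0.0332, 0.0547, 0.9004, 0.1250),
    "AV-L - Unproc.": (0.1270, 0.0078, 0.8828, 0.8828),
}
level = bonferroni_level()  # 0.05 / 6, reported as 0.0083 in the text
significant = [
    (name, snr)
    for name, p_values in table8_p.items()
    for snr, p in zip(snrs_db, p_values)
    if p < level
]
print("Significant comparisons:", significant)
```

With the Table 8 values, only AO-L versus AV-L at -20 dB SNR and AV-L versus the unprocessed signals at -15 dB SNR fall below the corrected level, in agreement with the text above. Note that in Equation (3) the symbol m denotes a sample size, whereas in α/m it denotes the number of comparisons; the code keeps the two apart by naming the latter `n_tests`.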
Figure 8: Percentage of correctly identified words obtained in the listening tests for the colour, the letter, and the digit fields, averaged across 11 subjects. The mean intelligibility scores for all the fields are also reported. (Panels: Colour, Letter, Digit, Mean; x-axis: SNR (dB); y-axis: Intelligibility (%).)

6. Conclusion

In this paper, we presented an extensive analysis of the impact of Lombard effect on audio, visual and audio-visual speech enhancement systems based on deep learning. We conducted several experiments using a database consisting of 54 speakers and showed the general benefit of training a system with Lombard speech.

In more detail, we first trained systems with Lombard or non-Lombard speech and evaluated them on Lombard speech adopting a cross-validation setup. The results showed that systems trained with Lombard speech outperform the respective systems trained with non-Lombard speech in terms of both estimated speech quality and estimated speech intelligibility. We also observed a performance difference across speakers, with an evident gap between genders: the performance difference between the systems trained with Lombard speech and the ones trained with non-Lombard speech is larger for females than it is for males. The analysis that we performed suggests that this difference might be primarily due to the large increment in the fundamental frequency that female speakers exhibit from non-Lombard to Lombard conditions.

With the objective of building more general systems able to deal with a wider SNR range, we then trained systems using Lombard and non-Lombard speech and compared them with systems trained only on non-Lombard speech. As in the narrow SNR case, systems that include Lombard speech perform considerably better than the others at low SNRs. At high SNRs, the estimated speech quality and the estimated speech intelligibility obtained with systems trained only with non-Lombard speech are higher, even though the performance gap is very small for the audio and the audio-visual cases. Combining non-Lombard and Lombard speech for training in the way we did guarantees a good compromise for the enhancement performance across all the SNRs.

We also performed subjective listening tests with audio-visual stimuli, in order to evaluate the systems in a situation closer to the real-world scenario, where the listener can see the face of the talker. For the speech quality test, we found significant differences between Lombard and non-Lombard systems at all the used SNRs for the audio-visual case and only at −5 dB SNR for the audio-only case. Regarding the speech intelligibility test, we observed that on average the scores obtained with the audio-visual system trained with Lombard speech are higher than the other processing conditions. However, we were unable to find significant differences in most of the cases, suggesting that in future works more effort should be put into designing new paradigms for speech intelligibility tests to control the several sources of variability caused by the combination of auditory and visual stimuli.

7. Acknowledgements

This work was supported, in part, by the Oticon Foundation.

References

Abel, A., Hussain, A., 2014. Novel two-stage audiovisual speech filtering in noisy environments. Cognitive Computation 6 (2), 200–217.
Abel, A., Hussain, A., Luo, B., 2014. Cognitively inspired speech processing for multimodal hearing technology. In: Proceedings of CICARE. IEEE, pp. 56–63.
Afouras, T., Chung, J. S., Zisserman, A., 2018. The conversation: Deep audio-visual speech enhancement. In: Proceedings of Interspeech. pp. 3244–3248.
Alghamdi, N., 2017. Visual speech enhancement and its application in speech perception training. Ph.D. thesis, University of Sheffield.
Alghamdi, N., Maddock, S., Marxer, R., Barker, J., Brown, G. J., 2018. A corpus of audio-visual Lombard speech with frontal and profile views. The Journal of the Acoustical Society of America 143 (6), EL523–EL529.
Allen, J., 1977. Short term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 25 (3), 235–238.
Almajai, I., Milner, B., 2011. Visually derived Wiener filters for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing 19 (6), 1642–1651.
Almajai, I., Milner, B., Darch, J., 2006. Analysis of correlation between audio and visual speech features for clean audio feature prediction in noise. In: Proceedings of Interspeech/ICSLP. p. 1634.
Boersma, P., Weenink, D., 2001. Praat: doing phonetics by computer. http://www.fon.hum.uva.nl/praat/, accessed: March 20, 2019.
Cliff, N., 1993. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological bulletin 114 (3), 494.
Cooke, M., Barker, J., Cunningham, S., Shao, X., 2006. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America 120 (5), 2421–2424.
Cooke, M., King, S., Garnier, M., Aubanel, V., 2014. The listening talker: A review of human and algorithmic context-induced modifications of speech. Computer Speech & Language 28 (2), 543–571.
EBU, 2014. EBU recommendation R128 - Loudness normalisation and permitted maximum level of audio signals.
Ephraim, Y., Malah, D., 1984. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing 32 (6), 1109–1121.
Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W. T., Rubinstein, M., 2018. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics 37 (4), 112:1–112:11.
Erber, N. P., 1975. Auditory-visual perception of speech. Journal of Speech and Hearing Disorders 40 (4), 481–492.
Erdogan, H., Hershey, J. R., Watanabe, S., Le Roux, J., 2015. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In: Proceedings of ICASSP. IEEE, pp. 708–712.
Erkelens, J. S., Hendriks, R. C., Heusdens, R., Jensen, J., 2007. Minimum mean-square error estimation of discrete Fourier coefficients with generalized Gamma priors. IEEE Transactions on Audio, Speech, and Language Processing 15 (6), 1741–1752.
Field, A., 2013. Discovering statistics using IBM SPSS statistics. Sage.
Gabbay, A., Shamir, A., Peleg, S., 2018. Visual speech enhancement. In: Proceedings of Interspeech. pp. 1170–1174.
Garnier, M., Bailly, L., Dohen, M., Welby, P., Lœvenbruck, H., 2006. An acoustic and articulatory study of Lombard speech: Global effects on the utterance. In: Proceedings of Interspeech/ICSLP. pp. 2246–2249.
Garnier, M., Henrich, N., 2014. Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise? Computer Speech & Language 28 (2), 580–597.
Garnier, M., Henrich, N., Dubois, D., 2010. Influence of sound immersion and communicative interaction on the Lombard effect. Journal of Speech, Language, and Hearing Research 53 (3), 588–608.
Garnier, M., Ménard, L., Richard, G., 2012. Effect of being seen on the production of visible speech cues. A pilot study on Lombard speech. In: Proceedings of Interspeech/ICSLP. pp. 611–614.
Gerkmann, T., Martin, R., 2009. On the statistics of spectral amplitudes after variance reduction by temporal cepstrum smoothing and cepstral nulling. IEEE Transactions on Signal Processing 57 (11), 4165–4174.
Girin, L., Schwartz, J.-L., Feng, G., 2001. Audio-visual enhancement of speech in noise. The Journal of the Acoustical Society of America 109 (6), 3007–3020.
Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of AISTATS. pp. 249–256.
Grancharov, V., Kleijn, W., 2008. Speech Quality Assessment. Springer Berlin Heidelberg, pp. 83–100.
Griffin, D., Lim, J., 1984. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32 (2), 236–243.
Hansen, J. H., Varadarajan, V., 2009. Analysis and compensation of Lombard speech across noise type and levels with application to in-set/out-of-set speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing 17 (2), 366–378.
Hentschke, H., Stüttgen, M. C., 2011. Computation of measures of effect size for neuroscience data sets. European Journal of Neuroscience 34 (12), 1887–1894.
Heracleous, P., Ishi, C. T., Sato, M., Ishiguro, H., Hagita, N., 2013. Analysis of the visual Lombard effect and automatic recognition experiments. Computer Speech & Language 27 (1), 288–300.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R. R., 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
Hou, J.-C., Wang, S.-S., Lai, Y.-H., Lin, J.-C., Tsao, Y., Chang, H.-W., Wang, H.-M., 2018. Audio-visual speech enhancement based on multimodal deep convolutional neural network. IEEE Transactions on Emerging Topics in Computational Intelligence 2 (2), 117–128.
Hu, Y., Loizou, P. C., 2007. A comparative intelligibility study of single-microphone noise reduction algorithms. The Journal of the Acoustical Society of America 122 (3), 1777–1786.
Hussain, A., Barker, J., Marxer, R., Adeel, A., Whitmer, W., Watt, R., Derleth, P., 2017. Towards multi-modal hearing aid design and evaluation in realistic audio-visual settings: Challenges and opportunities. In: Proceedings of CHAT. pp. 29–34.
Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of ICML. pp. 448–456.
Isola, P., Zhu, J.-Y., Zhou, T., Efros, A. A., 2017. Image-to-image translation with conditional adversarial networks. In: Proceedings of CVPR. pp. 1125–1134.
ITU, 2003. Recommendation ITU-R BS.1534-1: Method for the subjective assessment of intermediate quality level of coding systems.
ITU, 2005. Recommendation P.862.2: Wideband extension to recommendation P.862 for the assessment of wideband telephone networks and speech codecs.
Jensen, J., Hendriks, R. C., 2012. Spectral magnitude minimum mean-square error estimation using binary and continuous gain functions. IEEE Transactions on Audio, Speech, and Language Processing 20 (1), 92–102.
Jensen, J., Taal, C. H., 2016. An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (11), 2009–2022.
Junqua, J.-C., 1993. The Lombard reflex and its role on human listeners and automatic speech recognizers. The Journal of the Acoustical Society of America 93 (1), 510–524.
Kazemi, V., Sullivan, J., 2014. One millisecond face alignment with an ensemble of regression trees. In: Proceedings of CVPR. pp. 1867–1874.
King, D. E., 2009. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research 10, 1755–1758.
Kingma, D. P., Ba, J., 2015. Adam: A method for stochastic optimization. In: Proceedings of ICLR.
Kolbæk, M., Tan, Z.-H., Jensen, J., 2016. Speech enhancement using long short-term memory based recurrent neural networks for noise robust speaker verification. In: Proceedings of SLT. IEEE, pp. 305–311.
Kolbæk, M., Tan, Z.-H., Jensen, J., 2017. Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems. IEEE/ACM Transactions on Audio, Speech and Language Processing 25 (1), 153–167.
Lane, H., Tranel, B., 1971. The Lombard sign and the role of hearing in speech. Journal of Speech and Hearing Research 14 (4), 677–709.
Lim, J. S., Oppenheim, A. V., 1979. Enhancement and bandwidth compression of noisy speech. Proceedings of the IEEE 67 (12), 1586–1604.
Liu, L., Özsu, M. T., 2009. Encyclopedia of database systems. Vol. 6. Springer New York, NY, USA.
Loizou, P. C., 2007. Speech enhancement: Theory and practice. CRC press.
Lombard, E., 1911. Le signe de l'elevation de la voix. Annales des Maladies de L'Oreille et du Larynx 37 (2), 101–
Lu, X., Tsao, Y., Matsuda, S., Hori, C., 2013. Speech enhancement based on deep denoising autoencoder. In: Proceedings of Interspeech. pp. 436–440.
Lu, Y., Cooke, M., 2008. Speech production modifications produced by competing talkers, babble, and stationary noise. The Journal of the Acoustical Society of America 124 (5), 3261–3275.
Lu, Y., Cooke, M., 2009. Speech production modifications produced in the presence of low-pass and high-pass filtered noise. The Journal of the Acoustical Society of America 126 (3), 1495–1499.
Maas, A. L., Hannun, A. Y., Ng, A. Y., 2013. Rectifier nonlinearities improve neural network acoustic models. In: ICML Workshop on Deep Learning for Audio, Speech and Language Processing.
Mallick, D. B., Magnotti, J. F., Beauchamp, M. S., 2015. Variability and stability in the McGurk effect: Contributions of participants, stimuli, time, and response type. Psychonomic Bulletin & Review 22 (5), 1299–1307.
Martin, R., 2005. Speech enhancement based on minimum mean-square error estimation and supergaussian priors. IEEE Transactions on Speech and Audio Processing 13 (5), 845–856.
Martin, R., Breithaupt, C., 2003. Speech enhancement in the DFT domain using Laplacian speech priors. In: Proceedings of IWAENC. Vol. 3. pp. 87–90.
Marxer, R., Barker, J., Alghamdi, N., Maddock, S., 2018. The impact of the Lombard effect on audio and visual speech recognition systems. Speech Communication 100, 58–68.
McGurk, H., MacDonald, J., 1976. Hearing lips and seeing voices. Nature 264 (5588), 746–748.
Mendonça, C., Delikaris-Manias, S., 2018. Statistical tests with mushra data. In: Audio Engineering Society Convention 144. Audio Engineering Society.
Michelsanti, D., Tan, Z.-H., 2017. Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification. In: Proceedings of Interspeech. pp. 2008–2012.
Michelsanti, D., Tan, Z.-H., Sigurdsson, S., Jensen, J., 2019a. Effects of Lombard reflex on the performance of deep-learning-based audio-visual speech enhancement systems. In: Proceedings of ICASSP. pp. 6615–6619.
Michelsanti, D., Tan, Z.-H., Sigurdsson, S., Jensen, J., 2019b. On training targets and objective functions for deep-learning-based audio-visual speech enhancement. In: Proceedings of ICASSP. pp. 8077–8081.
Moore, D. S., McCabe, G. P., Craig, B. A., 2012. Introduction to the Practice of Statistics. WH Freeman New York.
Morrone, G., Pasa, L., Tikhanoff, V., Bergamaschi, S., Fadiga, L., Badino, L., 2019. Face landmark-based speaker-independent audio-visual speech enhancement in multi-talker environments. In: Proceedings of ICASSP. pp. 6900–6904.
Owens, A., Efros, A. A., 2018. Audio-visual scene analysis with self-supervised multisensory features. In: Proceedings of ECCV. pp. 631–648.
Park, S. R., Lee, J., 2017. A fully convolutional neural network for speech enhancement. In: Proceedings of Interspeech. pp. 1993–1997.
Pittman, A. L., Wiley, T. L., 2001. Recognition of speech produced in noise. Journal of Speech, Language, and Hearing Research.
Raphael, L. J., Borden, G. J., Harris, K. S., 2007. Speech science primer: Physiology, acoustics, and perception of speech. Lippincott Williams & Wilkins.
Rix, A. W., Beerends, J. G., Hollier, M. P., Hekstra, A. P., 2001. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In: Proceedings of ICASSP. Vol. 2. IEEE, pp. 749–752.
Sagonas, C., Antonakos, E., Tzimiropoulos, G., Zafeiriou, S., Pantic, M., 2016. 300 faces in-the-wild challenge: Database and results. Image and Vision Computing 47, 3–18.
Sharma, A., 2005. Text book of correlations and regression. Discovery Publishing House.
Sullivan, G. M., Feinn, R., 2012. Using effect size - or why the P value is not enough. Journal of graduate medical education 4 (3), 279–282.
Sumby, W. H., Pollack, I., 1954. Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America 26 (2), 212–215.
Summerfield, Q., 1992. Lipreading and audio-visual speech perception. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 335 (1273), 71–78.
Summers, W. V., Pisoni, D. B., Bernacki, R. H., Pedlow, R. I., Stokes, M. A., 1988. Effects of noise on speech production: Acoustic and perceptual analyses. The Journal of the Acoustical Society of America 84 (3), 917–928.
Tang, L. Y., Hannah, B., Jongman, A., Sereno, J., Wang, Y., Hamarneh, G., 2015. Examining visible articulatory features in clear and plain speech. Speech Communication 75, 1–13.
Vargha, A., Delaney, H. D., 2000. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics 25 (2), 101–132.
Vatikiotis-Bateson, E., Barbosa, A. V., Chow, C. Y., Oberg, M., Tan, J., Yehia, H. C., 2007. Audiovisual Lombard speech: Reconciling production and perception. In: Proceedings of AVSP. p. 41.
Vincent, E., 2005. MUSHRAM: A MATLAB interface for MUSHRA listening tests. http://c4dm.eecs.qmul.ac.uk/downloads/#mushram, accessed: March 20, 2019.
Wang, D., Chen, J., 2018. Supervised speech separation based on deep learning: An overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (10), 1702–1726.
Wang, Y., Narayanan, A., Wang, D., 2014. On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (12), 1849–1858.
Weninger, F., Hershey, J. R., Le Roux, J., Schuller, B., 2014. Discriminatively trained recurrent neural networks for single-channel speech separation. In: Proceedings of GlobalSIP. IEEE, pp. 577–581.
Wilcoxon, F., 1945. Individual comparisons by ranking methods. Biometrics Bulletin 1 (6), 80–83.
Williamson, D. S., Wang, Y., Wang, D., 2016. Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech and Language Processing 24 (3), 483–492.
Winter, F., Wierstorf, H., Hold, C., Krüger, F., Raake, A., Spors, S., 2018. Colouration in local wave field synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (10), 1913–1924.
Xu, Y., Du, J., Dai, L.-R., Lee, C.-H., 2014. An experimental study on speech enhancement based on deep neural networks. IEEE Signal Processing Letters 21 (1), 65–68.
Zollinger, S. A., Brumm, H., 2011. The evolution of the Lombard effect: 100 years of psychoacoustic research. Behaviour 148 (11-13), 1173–1198.

Journal

Electrical Engineering and Systems Science, arXiv (Cornell University)

Published: May 29, 2019
