End-to-end cross-modality retrieval with CCA projections and pairwise ranking loss

Cross-modality retrieval encompasses retrieval tasks where the fetched items are of a different type than the search query, e.g., retrieving pictures relevant to a given text query. The state-of-the-art approach to cross-modality retrieval relies on learning a joint embedding space of the two modalities, where items from either modality are retrieved using nearest-neighbor search. In this work, we introduce a neural network layer based on canonical correlation analysis (CCA) that learns better embedding spaces by analytically computing projections that maximize correlation. In contrast to previous approaches, the CCA layer allows us to combine existing objectives for embedding space learning, such as pairwise ranking losses, with the optimal projections of CCA. We show the effectiveness of our approach for cross-modality retrieval on three different scenarios (text-to-image, audio-sheet-music and zero-shot retrieval), surpassing both Deep CCA and a multi-view network using freely learned projections optimized by a pairwise ranking loss, especially when little training data is available (the code for all three methods is released at: https://github.com/CPJKU/cca_layer).

Keywords Cross-modality retrieval · Canonical correlation analysis · Ranking loss · Neural network · Joint embedding space

International Journal of Multimedia Information Retrieval (2018) 7:117–128

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s13735-018-0151-5) contains supplementary material, which is available to authorized users.

Matthias Dorfer, matthias.dorfer@jku.at
Department of Computational Perception, Johannes Kepler University Linz, 4040 Linz, Austria
The Austrian Research Institute for Artificial Intelligence, 1010 Vienna, Austria

1 Introduction

Cross-modality retrieval is the task of retrieving relevant items of a different modality than the search query (e.g., retrieving an image given a text query). One approach to tackle this problem is to define transformations which embed samples from different modalities into a common vector space. We can then project a query into this embedding space, and retrieve, using nearest-neighbor search, a corresponding candidate projected from another modality.

A particularly successful class of models uses parametric nonlinear transformations (e.g., neural networks) for the embedding projections, optimized via a retrieval-specific objective such as a pairwise ranking loss [15,27]. This loss aims at decreasing the distance (a differentiable function such as Euclidean or cosine distance) between matching items, while increasing it between mismatching ones. Specialized extensions of this loss achieved state-of-the-art results in various domains such as natural language processing [10], image captioning [12], and text-to-image retrieval [29].

In a different approach, Yan and Mikolajczyk [31] propose to learn a joint embedding of text and images using Deep canonical correlation analysis (DCCA) [2]. Instead of a pairwise ranking loss, DCCA directly optimizes the correlation of learned latent representations of the two views. Given the correlated embedding representations of the two views, it is possible to perform retrieval via cosine distance. The promising performance of their approach is also in line with the findings of Costa et al. [23], who state the following two hypotheses regarding the properties of efficient cross-modal retrieval spaces: first, the embedding spaces should account for low-level cross-modal correlations and second, they should enable semantic abstraction. In [31], both properties are met by a deep neural network (learning abstract representations) that is optimized with DCCA, ensuring highly correlated latent representations.

In summary, the optimization of pairwise ranking losses yields embedding spaces that are useful for retrieval, and allows incorporating domain knowledge into the loss function. On the other hand, DCCA is designed to maximize correlation (which has already proven to be useful for cross-modality retrieval [31]) but does not allow the use of loss formulations specialized for the task at hand.

In this paper, we propose a method to combine both approaches in a way that retains their advantages. We develop a Canonical Correlation Analysis Layer (CCAL) that can be inserted into a dual-view neural network to produce a maximally correlated embedding space for its latent representations. We can then apply task-specific loss functions, in particular the pairwise ranking loss, on the output of this layer. To train a network using the CCA layer, we describe how to backpropagate the gradient of this loss function to the dual-view neural network while relying on automatic differentiation tools such as Theano [28] or Tensorflow [1]. In our experiments, we show that our proposed method performs better than DCCA and models using pairwise ranking loss alone, especially when little training data is available.

Figure 1 compares our proposed approach to the alternatives discussed above. DCCA defines an objective optimizing a dual-view neural network such that its two views will be maximally correlated (Fig. 1a). Pairwise ranking losses are loss functions to optimize a dual-view neural network such that its two views are well-suited for nearest-neighbor retrieval in the embedding space (Fig. 1b). In our approach, we boost optimization of a pairwise ranking loss based on cosine distance by placing a special-purpose layer, the CCA projection layer, between a dual-view neural network and the optimization target (Fig. 1c). Our experiments in Sect. 5 will show the effectiveness of this proposal.

Fig. 1 Sketches of cross-modality retrieval networks. The proposed model in (c) unifies (a, b) and takes advantage of both componentwise-correlated CCA projections and a pairwise ranking loss for cross-modality embedding space learning. We emphasize that our proposal in (c) requires backpropagating the ranking loss $\mathcal{L}_{rank}$ through the analytical computation of the optimally correlated CCA embedding projections $A^*$ and $B^*$ (see Eq. 4). We thus need to compute their partial derivatives with respect to the network's hidden representations $x$ and $y$, i.e., $\frac{\partial A^*}{\partial x, y}$ and $\frac{\partial B^*}{\partial x, y}$ (addressed in Sect. 4). a DCCA network maximizes correlation via the Trace Norm Objective (TNO). b Freely learned embedding projections optimized with ranking loss (Learned-$\mathcal{L}_{rank}$). c Canonically correlated projection layer optimized with ranking loss (CCAL-$\mathcal{L}_{rank}$)

2 Canonical correlation analysis

In this section, we review the concepts of CCA, the basis for our methodology. Let $x \in \mathbb{R}^{d_x}$ and $y \in \mathbb{R}^{d_y}$ denote two random column vectors with covariances $\Sigma_{xx}$ and $\Sigma_{yy}$ and cross-covariance $\Sigma_{xy}$. The objective of CCA is to find two matrices $A^* \in \mathbb{R}^{d_x \times k}$ and $B^* \in \mathbb{R}^{d_y \times k}$ composed of $k$ paired column vectors $A_j$ and $B_j$ (with $k \le d_x$ and $k \le d_y$) that project $x$ and $y$ into a common space maximizing their componentwise correlation:

$$(A^*, B^*) = \arg\max_{A,B} \sum_{j=1}^{k} \mathrm{corr}(A_j^\top x, B_j^\top y) \qquad (1)$$

$$= \arg\max_{A,B} \sum_{j=1}^{k} \frac{A_j^\top \Sigma_{xy} B_j}{\sqrt{A_j^\top \Sigma_{xx} A_j}\,\sqrt{B_j^\top \Sigma_{yy} B_j}}. \qquad (2)$$

Since the objective of CCA is invariant to scaling of the projection matrices, we constrain the projected dimensions to have unit variance. Furthermore, CCA seeks subsequently uncorrelated projection vectors, arriving at the equivalent formulation:

$$(A^*, B^*) = \arg\max_{A^\top \Sigma_{xx} A = B^\top \Sigma_{yy} B = I_k} \mathrm{tr}\left(A^\top \Sigma_{xy} B\right). \qquad (3)$$

Let $T = \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2}$, and let $U \,\mathrm{diag}(d)\, V^\top$ be the Singular Value Decomposition (SVD) of $T$ with ordered singular values $d_i \ge d_{i+1}$. As shown in [19], we obtain $A^*$ and $B^*$ from the top $k$ left- and right-singular vectors of $T$:

$$A^* = \Sigma_{xx}^{-1/2} U_{:k} \qquad B^* = \Sigma_{yy}^{-1/2} V_{:k}. \qquad (4)$$

Moreover, the correlation in the projection space is the sum of the top $k$ singular values:

$$\mathrm{corr}(A^{*\top} x, B^{*\top} y) = \sum_{i \le k} d_i. \qquad (5)$$

In practice, the covariances and cross-covariance of $x$ and $y$ are usually not known, but estimated from a training set of $m$ paired vectors, expressed as matrices $X \in \mathbb{R}^{d_x \times m}$, $Y \in \mathbb{R}^{d_y \times m}$ by:

$$\hat{\Sigma}_{xx} = \frac{1}{m-1} \bar{X} \bar{X}^\top + r I \quad \text{and} \quad \hat{\Sigma}_{xy} = \frac{1}{m-1} \bar{X} \bar{Y}^\top. \qquad (6)$$

$\bar{X}$ is the centered version of $X$. $\hat{\Sigma}_{yy}$ is defined analogously to $\hat{\Sigma}_{xx}$. Additionally, we apply a regularization term $r I$ to ensure that the covariance matrices are positive definite. Substituting these estimates for $\Sigma_{xx}$, $\Sigma_{xy}$ and $\Sigma_{yy}$, respectively, we can compute $A^*$ and $B^*$ using Eq. (4).

3 Cross-modality retrieval baselines

In this section, we review the two most related works forming the basis for our approach.

3.1 Deep canonical correlation analysis

Andrew et al. [2] propose an extension of CCA to learn parametric nonlinear transformations of two random vectors, such that their correlation is maximized. Let $a \in \mathbb{R}^{d_a}$ and $b \in \mathbb{R}^{d_b}$ denote two random vectors, and let $x = f(a; \Theta_f)$ and $y = g(b; \Theta_g)$ denote their nonlinear transformations, parameterized by $\Theta_f$ and $\Theta_g$. DCCA optimizes the parameters $\Theta_f$ and $\Theta_g$ to maximize the correlation of the topmost hidden representations $x$ and $y$. For $d_x = d_y = k$, this objective corresponds to Eq. 5, i.e., the sum of all singular values of $T$, also called the trace norm:

$$\mathrm{corr}(A^{*\top} f(a; \Theta_f), B^{*\top} g(b; \Theta_g)) = \|T\|_{tr}. \qquad (7)$$

Footnote: We understand the correlation of two vectors to be defined as $\mathrm{corr}(x, y) = \sum_i \sum_j \mathrm{corr}(x_i, y_j)$.

Andrew et al. [2] show how to compute the gradient of this Trace Norm Objective (TNO) with respect to $x$ and $y$. Assuming $f$ and $g$ are differentiable with respect to $\Theta_f$ and $\Theta_g$ (as is the case for neural networks), this allows optimizing the nonlinear transformations via gradient-based methods.

Yan and Mikolajczyk [31] suggest the following procedure to utilize DCCA for cross-modality retrieval: first, neural networks $f$ and $g$ are trained using the TNO, with $a$ and $b$ representing different views of an entity (e.g. image and text); then, after the training is finished, the CCA projections are computed using Eq. (4), and all retrieval candidates are projected into the embedding space; finally, at test time, queries of either modality are projected into the embedding space, and the best-matching sample from the other modality is found through nearest-neighbor search using the cosine distance. Figure 2 provides a summary of the entire retrieval pipeline. In our experiments, we will refer to this approach as DCCA-2015.

Fig. 2 DCCA retrieval pipeline proposed in [31]. Note that all processing steps below the solid line are performed after network optimization is complete

DCCA is limited by design to use the objective function described in Eq. (7), and only seeks to maximize the correlation in the embedding space. During training, the CCA projection matrices are never computed, nor are the samples projected into the common retrieval space. All the retrieval steps (most importantly, the computation of CCA projections) are performed only once after the networks $f$ and $g$ have been optimized. This restricts potential applications, because we cannot use the projected data as an input to subsequent layers or task-specific objectives. We will show how our approach overcomes this limitation in Sect. 4.

3.2 Pairwise ranking loss

Kiros et al. [15] learn a multi-modal joint embedding space for images and text. They use the cosine of the angle between two corresponding vectors $x$ and $y$ as a scoring function, i.e., $s(x, y) = \cos(x, y)$. Then, they optimize a pairwise ranking loss

$$\mathcal{L}_{rank} = \sum_{x} \sum_{k} \max\{0, \alpha - s(x, y) + s(x, y_k)\} \qquad (8)$$

where $x$ is an embedded sample of the first modality, $y$ is the matching embedded sample of the second modality, and $y_k$ are the contrastive (mismatching) embedded samples of the second modality (in practice, all mismatching samples in the current mini-batch). The hyper-parameter $\alpha$ defines the margin of the loss function. This loss encourages an embedding space where the cosine distance between matching samples is lower than the cosine distance of mismatching samples.

In this setting, the networks $f$ and $g$ have to learn the embedding projections freely from randomly initialized weights. Since the projections are learned from scratch by optimizing a ranking loss, in our experiments, we denote this approach by Learned-$\mathcal{L}_{rank}$. Figure 1b shows a sketch of this paradigm.

4 Learning with canonically correlated embedding projections

In the following, we explain how to bring both concepts, DCCA and pairwise ranking losses, together to enhance cross-modality embedding space learning.

4.1 Motivation

We start by providing an intuition on why we expect this combination to be fruitful: DCCA-2015 maximizes the correlation between the latent representations of two different neural networks via the TNO derived from classic CCA. As correlation and cosine distance are related, we can also use such a network for cross-modality retrieval [31]. Kiros et al. [15], on the other hand, learn a cross-modality retrieval embedding by optimizing an objective customized for the task at hand. The motivation for our approach is that we want to benefit from both: a task-specific retrieval objective, and componentwise optimally correlated embedding projections.

To achieve this, we devise a CCA layer that analytically computes the CCA projections $A^*$ and $B^*$ during training, and projects incoming samples into the embedding space. The projected samples can then be used in subsequent layers, or for computing task-specific losses such as the pairwise ranking loss. Figure 1c illustrates the central idea of our combined approach. Compared to Fig. 1b, we insert an additional linear transformation. However, this transformation is not learned (otherwise it could be merged with the previous layer, which is not followed by a nonlinearity). Instead, it is computed to be the transformation that maximizes componentwise correlation between the two views. $A^*$ and $B^*$ in Fig. 1c are the very projections given by Eq. (4) in Sect. 2.

In theory, optimizing a pairwise ranking loss alone could yield projections equivalent to the ones computed by CCA. In practice, however, we observe that the proposed combination gives much better cross-modality retrieval results (see Sect. 5).

Our design requires backpropagating errors through the analytical computation of the CCA projection matrices. DCCA [2] does not cover this, since projecting the data is not necessary for optimizing the TNO. In the remainder of this section, we discuss how to establish gradient flow (backpropagation) through CCA's optimal projection matrices. In particular, we require the partial derivatives $\frac{\partial A^*}{\partial x, y}$ and $\frac{\partial B^*}{\partial x, y}$ of the projections with respect to their input representations $x$ and $y$. This will allow us to use CCA as a layer within a multi-modality neural network, instead of as a final objective (TNO) for correlation maximization only.

4.2 Gradient of CCA projections

As mentioned above, we can compute the canonical correlation along with the optimal projection matrices from the singular value decomposition $T = \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2} = U \,\mathrm{diag}(d)\, V^\top$. Specifically, we obtain the correlation as $\mathrm{corr}(A^{*\top} x, B^{*\top} y) = \sum_i d_i$, and the projections as $A^* = \Sigma_{xx}^{-1/2} U$ and $B^* = \Sigma_{yy}^{-1/2} V$. For DCCA, it suffices to compute the gradient of the total correlation with respect to $x$ and $y$ in order to backpropagate it through the two networks $f$ and $g$. Using the chain rule, Andrew et al. decompose this into the gradients of the total correlation with respect to $\Sigma_{xx}$, $\Sigma_{xy}$ and $\Sigma_{yy}$, and the gradients of those with respect to $x$ and $y$ [2]. Their derivations of the former make use of the fact that both the gradient of $\sum_i d_i$ with respect to $T$ and the gradient of $\|T\|_{tr}$ (the trace norm objective in Eq. (7)) with respect to $T^\top T$ have a simple form; see Section 7 in [2] for details.

In our case, where we would like to backpropagate errors through the CCA transformations, we instead need the gradients of the projected data $x^* = A^{*\top} x$ and $y^* = B^{*\top} y$ with respect to $x$ and $y$, which requires the partial derivatives $\frac{\partial A^*}{\partial x, y}$ and $\frac{\partial B^*}{\partial x, y}$. We could again decompose this into the gradients with respect to $T$, the gradients of $T$ with respect to $\Sigma_{xx}$, $\Sigma_{xy}$ and $\Sigma_{yy}$, and the gradients of those with respect to $x$ and $y$. However, while the gradients of $U$ and $V$ with respect to $T$ are known [22], they involve solving $O((d_x d_y)^2)$ linear $2 \times 2$ systems. Instead, we reformulate the solution to use two symmetric eigendecompositions $T T^\top = U \,\mathrm{diag}(e)\, U^\top$ and $T^\top T = V \,\mathrm{diag}(e)\, V^\top$ (Equation 270 in [24]). This gives us the same left and right eigenvectors we would obtain from the SVD, along with the squared singular values ($e_i = d_i^2$). The gradients of eigenvectors of symmetric real eigensystems have a simple form [17] and both $T T^\top$ and $T^\top T$ are differentiable with respect to $x$ and $y$.

Table 1 Example images for Flickr30k (top) and IAPR TC-12 (bottom)
Flickr30k:
  A man in a white cowboy hat reclines in front of a window in an airport
  A young man rests on an airport seat with a cowboy hat over his face
  A woman relaxes on a couch, with a white cowboy hat over her head
  A man is sleeping inside on a bench with his hat over his eyes
  A person is sleeping at an airport with a hat on their head
IAPR TC-12:
  A green and brown embankment with brown houses on the right and a light brown sandy beach at the dark blue sea on the left; a dark mountain range behind it and white clouds in a light blue sky in the background

Fig. 3 Sketch of cross-modality retrieval. The blue dots are the embedded candidate samples. The red dot is the embedding of the search query. The larger blue dot highlights the closest candidate selected as the retrieval result (colour figure online)

To summarize: in order to obtain an efficiently computable definition of the gradient for CCA projections, we have reformulated the forward pass (the computation of the CCA transformations). Our formulation using two eigendecompositions translates into a series of computation steps that are differentiable in a graph-based, auto-differentiating math compiler such as Theano [28], which, together with the chain rule, gives an efficient implementation of the CCA layer gradient for training our network. For a detailed description of the CCA layer forward pass, we refer to Algorithm 1 in the "Appendix" of this article. As the technical implementation is not straightforward, we also discuss the crucial steps in the "Appendix".

Footnote: The code of our implementation of the CCA layer is available at https://github.com/CPJKU/cca_layer.
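To make the forward pass described in Sect. 4.2 more concrete, the following is a minimal NumPy sketch, not the authors' Theano implementation from the repository linked above. It estimates the regularized covariances of Eq. (6), obtains U and V from the two symmetric eigendecompositions of TT' and T'T, forms the projections of Eq. (4), and evaluates the ranking loss of Eq. (8) on the projected mini-batch. The function names (`inv_sqrt`, `cca_projections`, `pairwise_ranking_loss`), the values `r=1e-3` and `alpha=0.5`, the toy data, and the sign-alignment step are illustrative assumptions. The sketch covers the forward computation only; in a real network the gradients would come from an automatic differentiation framework.

```python
import numpy as np


def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(w ** -0.5) @ Q.T


def cca_projections(X, Y, k, r=1e-3):
    """Forward pass sketch: X is (d_x, m), Y is (d_y, m), columns are paired.

    r is an assumed regularization strength (Eq. 6); it is not taken from the paper.
    """
    m = X.shape[1]
    mean_x = X.mean(axis=1, keepdims=True)
    mean_y = Y.mean(axis=1, keepdims=True)
    Xc, Yc = X - mean_x, Y - mean_y
    # Regularized covariance estimates, Eq. (6)
    Sxx = Xc @ Xc.T / (m - 1) + r * np.eye(X.shape[0])
    Syy = Yc @ Yc.T / (m - 1) + r * np.eye(Y.shape[0])
    Sxy = Xc @ Yc.T / (m - 1)
    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    T = Sxx_is @ Sxy @ Syy_is
    # Two symmetric eigendecompositions instead of an SVD of T (Sect. 4.2).
    # eigh returns ascending eigenvalues, so reverse and keep the top k.
    e, U = np.linalg.eigh(T @ T.T)
    e, U = e[::-1][:k], U[:, ::-1][:, :k]
    _, V = np.linalg.eigh(T.T @ T)
    V = V[:, ::-1][:, :k]
    # Assumption of this sketch: eigenvectors are defined only up to sign,
    # so flip columns of V until each pair satisfies u_j' T v_j >= 0.
    signs = np.sign(np.sum(U * (T @ V), axis=0))
    signs[signs == 0] = 1.0
    V = V * signs
    A, B = Sxx_is @ U, Syy_is @ V          # Eq. (4)
    d = np.sqrt(np.clip(e, 0.0, None))     # singular values, e_i = d_i^2
    return A, B, d, mean_x, mean_y


def pairwise_ranking_loss(Xs, Ys, alpha=0.5):
    """Eq. (8) with cosine similarity; column i of Xs matches column i of Ys.

    alpha is an assumed margin; all other columns act as contrastive samples.
    """
    Xn = Xs / np.linalg.norm(Xs, axis=0, keepdims=True)
    Yn = Ys / np.linalg.norm(Ys, axis=0, keepdims=True)
    S = Xn.T @ Yn                          # S[i, k] = cos(x_i, y_k)
    pos = np.diag(S)[:, None]              # s(x_i, y_i)
    hinge = np.maximum(0.0, alpha - pos + S)
    np.fill_diagonal(hinge, 0.0)           # skip the matching pair itself
    return hinge.sum()


# Toy data standing in for the two networks' hidden representations
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 200))
Y = rng.standard_normal((4, 5)) @ X + 0.5 * rng.standard_normal((4, 200))

A, B, d, mean_x, mean_y = cca_projections(X, Y, k=3)
Xs, Ys = A.T @ (X - mean_x), B.T @ (Y - mean_y)   # projected views
loss = pairwise_ranking_loss(Xs, Ys)
```

In this sketch the cross-covariance of the projected views is the diagonal matrix diag(d), which is the componentwise correlation structure the CCA layer is meant to provide before the ranking loss is applied.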
Thus, we now have the means to benefit from the optimal we define the MRR (higher is better) as the mean value of CCA projections but still optimize for a task-specific objec- 1/rank over all queries where rank is again the position tive. In particular, we utilize the pairwise ranking loss of of the target in the similarity-ordered list of available candi- Eq. (8) on top of an intermediate CCA embedding projection dates. layer. We denote the proposed retrieval network of Fig. 1c as CCAL-L in our experiments (CCAL refers to CCA rank 5.1 Image-text retrieval Layer). In the first part of our experiments, we consider Flickr30k and IAPR TC-12, two publicly available datasets for image-text 5 Experiments cross-modality retrieval. Flickr30k consists of image-caption pairs, where each image is annotated with five different tex- We evaluate our approach (CCAL-L ) in cross-modality tual descriptions. The train-validation-test split for Flickr30k rank retrieval experiments on two image-to-text and one audio-to- is 28000-1000-1000. In terms of evaluation setup, we fol- sheet-music dataset. Additionally, we provide results on two low Protocol 3 of [31] and concatenate the five available zero-shot text-to-image retrieval scenarios proposed in [25]. captions into one, meaning that only one, but richer text anno- For comparison, we consider the approach of [31](DCCA- tation remains per image. This is done for all three sets of 2015), our own implementation of the TNO (denoted by the split. The second image-text dataset, IAPR TC-12, con- DCCA), as well as the freely learned projection embeddings tains 20000 natural images where only one—but compared (Learned-L ) optimizing the ranking loss of [15]. to Flickr30k more detailed—caption is available for each rank The task for all three datasets is to retrieve the correct image. 
As no predefined train-validation-test split is pro- counterpart when given an instance of the other modality vided, we randomly select 1000 images for validation and as a search query. For retrieval, we use the cosine distance 2000 for testing, and keep the rest for training. [31]alsouse in embedding space for all approaches. First, we embed all 2000 images for testing, but did not explicitly mention hold- candidate samples of the target modality into the retrieval out images for validation. Table 1 shows an example image embedding space. Then, we embed the query element y with along with its corresponding captions or caption for either the second network and select its nearest-neighbor x of the dataset. target modality. Fig. 3 shows a sketch of this retrieval by The input to our networks is a 4096-dimensional image embedding space learning paradigm. feature vector along with a corresponding text vector repre- As evaluation measures, we consider the Recall@k (R@k sentation which has dimensionality 5793 for Flickr30k and in %) as well as the Median Rank (MR) and the Mean 2048 for IAPR TC-12. The image embedding is computed Reciprocal Rank (MRR in %).The R@k rate (higher is from the last hidden layer of a network pretrained on Ima- better) is the ratio of queries which have the correct cor- geNet [7] (layer fc7 of CNN_S by [4]). In terms of text responding counterpart in the first k retrieval results. The pre-processing, we follow [31], tokenizing and lemmatizing MR (lower is better) is the median position of the target the raw captions as the first step. Based on the lemmatized in a similarity-ordered list of available candidates. Finally, captions, we compute l2-normalized TF/IDF-vectors, omit- 123 122 International Journal of Multimedia Information Retrieval (2018) 7:117–128 Table 2 Retrieval results on Method Image-to-text Text-to-image IAPR TC-12. 
“DCCA-2015” is R@1 R@5 R@10 MR MRR R@1 R@5 R@10 MR MRR taken from [31] DCCA-2015 30.2 57.0 – – 42.6 29.5 60.0 – – 41.5 DCCA 31.0 58.7 70.4 3.6 43.9 29.5 58.2 70.5 4.0 42.7 Learned-L 22.3 50.7 63.8 5.2 35.7 21.6 50.1 63.3 5.5 35.1 rank CCAL-L 31.6 61.0 72.2 3.0 45.0 29.6 60.0 72.2 3.6 43.5 rank Table 3 Retrieval results on Method Image-to-text Text-to-image Flickr30k. “DCCA-2015” is R@1 R@5 R@10 MR MRR R@1 R@5 R@10 MR MRR taken from [31] DCCA-2015 27.9 56.9 68.2 4 – 26.8 52.9 66.9 4 – DCCA 31.6 59.2 69.3 3.3 44.2 30.3 58.3 69.2 3.8 43.1 Learned-L 23.7 50.5 63.0 5.3 36.3 23.6 51.0 62.5 5.2 36.5 rank CCAL-L 32.0 59.2 70.4 3.2 44.8 29.9 58.8 70.2 3.7 43.3 rank ting words with an overall occurrence smaller than five for where no results are available in the literature. When look- Flickr30k and three for IAPR TC-12, respectively. The image ing at the performance of CCAL-L we further observe rank representation is processed by a linear dense layer with 128 that it outperforms all other methods, although the differ- units, which will also be the dimensionality k of the result- ence to DCCA is not pronounced for all of the measures. ing retrieval embedding. The text vector is fed through two Comparing CCAL-L with the freely learned projection rank batch-normalized [11] dense layers of 1024 units each and matrices (Learned-L ) we observe a much larger perfor- rank the ELU activation function [6]. As a last layer for the text mance gap. This is interesting, as in principle the learned representation network, we again apply a dense layer with projections could converge to exactly the same solution as 128 linear units. CCAL-L . We take this as a quantitative confirmation that rank For a fair comparison, we keep the structure and number the learning process benefits from CCA’s optimal projection of parameters of all networks in our experiments the same. matrices. The only difference between the networks are the objectives In Table 3, we list our results on the Flickr30k dataset. 
As and the hyper-parameters used for optimization. Optimiza- above, we show the retrieval performances of [31] as a base- tion is performed using Stochastic Gradient Descent (SGD) line along with our results and observe similar behavior as on with the adam update rule [14] (for details please see our IAPR TC-12. Again, we point out the poor performance of “Appendix”). the freely learned projections (Learned-L ) in this exper- rank Table 2 lists our results on IAPR TC-12. Along with our iment. Keeping this observation in mind, we will notice a experiments, we also show the results reported in [31]asa different behavior in the experiments in Sect. 5.2. reference (DCCA-2015). However, a direct comparison to Note that there are various other methods reporting results our results may not be fair: DCCA-2015 uses a different on Flickr30k [13,15,18,27] which partly surpass ours, for ImageNet-pretrained network for the image representation, example by using more elaborate processing of the textual and finetunes this network while we keep it fixed. This is descriptions or more powerful ImageNet models. We omit because our interest is in comparing the methods in a sta- these results as we focus on the comparison of DCCA and ble setting, not in obtaining the best possible results. Our freely learned projections with the proposed CCA projection implementation of the TNO (DCCA) uses the same objec- embedding layer. tive as DCCA-2015, but is trained using the same network architecture as our remaining models and permits a direct 5.2 Audio-sheet-music retrieval comparison. Additionally, we repeat each of the experiments 10 times with different initializations and report the mean for For the second set of experiments, we consider the Not- each of the evaluation measures. tingham piano midi dataset [3]. 
The dataset is a collection When taking a closer look at Table 2, we observe that of midi files split into train, validation and test set already our results achieved by optimizing the TNO (DCCA)sur- used by [8] for experiments on end-to-end score-following pass the results reported in [31]. We already discussed above in sheet-music images. Here, we tackle the problem of audio- that the two versions are not directly comparable. How- sheet-music retrieval, i.e., matching short snippets of music ever, given this result, we consider our implementation of (audio) to corresponding parts in the sheet music (image). DCCA as a valid baseline for our experiments in Sect. 5.2 Figure 4 shows examples of such correspondences. 123 International Journal of Multimedia Information Retrieval (2018) 7:117–128 123 learned embedding projections. On measures such as R@5 or R@10 it achieves similar to or better performance than DCCA. One of the reasons for this could be the fact that there is an order of magnitude more training data available for this task to learn the projection embedding from random Fig. 4 Example of the data considered for audio-sheet-music (image) initialization. Still, our proposed combination of both con- retrieval. Top: short snippets of sheet-music images. Bottom: Spectro- gram excerpts of the corresponding music audio cepts (CCAL-L ) achieves highest retrieval scores. rank 5.3 Performance in small data regime We conduct this experiment for two reasons: First, to show the advantage of the proposed method over different The above results suggest that the benefit of using a CCA pro- domains. Second, the data and application is of high practi- jection layer (CCAL-L ) over a freely learned projection rank cal relevance in the domain of Music Information Retrieval becomes most evident when few training data is available. (MIR). 
A system capable of linking sheet music (images) and To examine this assumption, we repeat the audio-to-sheet- the corresponding music (audio) would be useful in many music experiment of the previous section, but use only 10% content-based musical retrieval scenarios. of the original training data (≈ 27000 samples). We stress In terms of audio preparation, we compute log frequency the fact that the learned embedding projection of Learned- spectrograms with a sample rate of 22.05 kHz, a FFT win- L could converge to exactly the same solution as the rank dow size of 2048, and a computation rate of 31.25 frames CCA projections of CCAL-L . Table 5 summarizes the rank per second. These spectrograms (136 frequency bins) are low data regime results for the three methods. Consistent with then directly fed into the audio part of the cross-modality our hypothesis, we observe a larger gap between Learned- networks. Figure 4 shows a set of audio-to-sheet correspon- L and CCAL-L compared to the one obtained with rank rank dences presented to our network for training. One audio all training data in Table 4. We conclude that a network excerpt comprises 100 frames and the dimension of the sheet might be able to learn suitable embedding projections when image snippet is 40 × 100 pixels. Overall this results in sufficient training data is available. However, when having 270,705 train, 18,046 validation and 16,042 test audio-sheet- fewer training samples, the proposed CCA projection layer music pairs. This is an order of magnitude more training data strongly supports embedding space learning. In addition, we than for the image-to-text datasets of the previous section. also looked into the retrieval performance of Learned-L rank In the experiments in Sect. 5.1, we relied on pretrained and CCAL-L on the training set and observe comparable rank ImageNet features and relatively shallow fully connected performance. 
This indicates that the CCA layer also acts as text-feature processing networks. The model here differs a regularizer and helps to generalize to unseen samples. from this, as it consists of two deep convolutional net- works learned entirely from scratch. Our architecture is a VGG-style [26] network consisting of sequences of 3×3 con- 5.4 Zero-shot image-text retrieval volution stacks followed by 2 × 2 max pooling. To reduce the dimensionality to the desired correlation space dimen- Our last set of experiments focuses on a slightly modified sionality k (in this case 32), we insert as a final building retrieval setting, namely image-text zero-shot retrieval [25]. block a 1 × 1 convolution having k feature maps followed by Given a set of image-text pairs originating from C differ- global average pooling [16] (for further architectural details ent categories the data is split into a class-disjoint training, we again refer to the appendix of this manuscript). validation and test sets having no categorical overlap. This Table 4 lists our result on audio-to-sheet-music retrieval. implies that at test time we aim to retrieve images from tex- As in the experiments on images and text, the proposed CCA tual queries describing categories (semantic concepts) never projection embedding layer trained with pairwise ranking seen before, neither for training, nor for validation. loss outperforms the other models. Recalling the results from Reed et al. [25] collected and provided textual descriptions Sect. 
5.1, we observe an increased performance of the freely for two publicly available datasets, the CUB-200 bird image Table 4 Retrieval results on Method Sheet-to-audio Audio-to-sheet Nottingham dataset R@1 R@5 R@10 MR MRR R@1 R@5 R@10 MR MRR (audio-to-sheet-music retrieval) DCCA 42.0 88.2 93.3 2 62.2 44.6 87.9 93.2 2 63.5 Learned-L 40.7 89.6 95.6 2 61.7 41.4 88.9 95.4 2 61.9 rank CCAL-L 44.1 93.3 97.7 2 65.3 44.5 91.6 96.7 2 64.9 rank 123 124 International Journal of Multimedia Information Retrieval (2018) 7:117–128 Table 5 Retrieval results on Method Sheet-to-audio Audio-to-sheet audio-to-sheet-music retrieval R@1 R@5 R@10 MR MRR R@1 R@5 R@10 MR MRR when using only 10% of the train data DCCA 20.0 53.6 65.4 5 35.3 22.7 54.7 65.8 4 37.3 Learned-L 11.3 35.2 47.6 12 23.0 12.6 35.2 47.2 12 23.7 rank CCAL-L 22.2 59.2 70.7 4 38.8 25.0 59.3 70.9 4 40.4 rank Table 6 Zero-shot retrieval results on cub and flowers Method Flowers Birds Attributes [25] – 50.0 Word2Vec [25] 52.1 33.5 Word CNN [25] 56.3 43.3 Word CNN-RNN [25] 59.6 48.7 Word CNN + CCAL 62.2 52.2 Fig. 5 Example images of CUB-200 birds and Oxford Flowers along with textual descriptions collected by Reed et al. [25] for zero-shot Word CNN-RNN + CCAL 64.0 49.8 retrieval from text For the Birds dataset, as an alternative to the tex- dataset [30] and the Oxford Flowers dataset [21]. According tual descriptions, there are manually created fine-grained to the definition of zero-shot retrieval above, we follow [25] attributes available for each of the images. When relying on and split CUB into 100 train, 50 validation and 50 test cate- these attributes Reed et al. report state-of-the-art results on gories. Flowers is split into 82 train and 20 validation / test the dataset [25] not reached by their text processing neural classes respectively. Figure 5 shows some example images networks. along with their textual descriptions. 
Besides the modified, harder retrieval setting there is a second difference to the text-image retrieval experiments carried out in Sect. 5.1. Instead of using hand engineered textual features (e.g. TF-IDF) or unsupervised textual feature learning (e.g. word2vec [20]) the authors in [25] employ Convolutional Recurrent Neural Networks (CRNN) to learn the latent text representations directly from the raw descriptions. In particular, they feed the descriptions as one-hot-word encodings to the text processing part of their networks. In terms of image representations, they still rely on 1024-dimensional pretrained ImageNet features. The feature learning part and the network architectures used for our experiments follow exactly the descriptions provided in [25]. The sole difference is that we again replace the topmost embedding layer with the proposed CCA projection layer in combination with a pairwise ranking loss.

Table 6 compares the retrieval results of the respective methods on the two zero-shot retrieval datasets. To allow for a direct comparison with the results reported in [25], we follow their evaluation setup and report the Average Precision (AP@50). The AP@50 is the percentage of the top-50 scoring images whose class matches that of the text query, averaged over the 50 test classes. In [25] the best retrieval performance for both datasets (when considering only feature learning) is achieved by having a CRNN directly processing the textual descriptions. What is also interesting is the substantial performance gain with respect to unsupervised word2vec features.

In the bottom part of Table 6, we report the performance of the same architectures optimized using our proposed CCA layer in combination with a pairwise ranking loss. We observe that the CCA layer is able to improve the performance of both models on both datasets. The gain in retrieval performance within a model class is largest for the convolution only (CNN) text processing models (≈ 6 percentage points for the Flowers dataset and ≈ 9 for CUB, see Table 6). For the birds dataset the Word CNN + CCAL even outperforms the models relying on manually encoded attributes by achieving an AP@50 of 52.2.

6 Discussion and conclusion

We have shown how to use the optimal projection matrices of CCA as the weights of an embedding layer within a multi-view neural network. With this CCA layer, it becomes possible to optimize for a specialized loss function (e.g., related to a retrieval task) on top of this, exploiting the correlation properties of a latent space provided by CCA. As this requires to establish gradient flow through CCA, we formulate it to allow easy computation of the partial derivatives $\frac{\partial A^*}{\partial x,y}$ and $\frac{\partial B^*}{\partial x,y}$ of CCA's projection matrices $A^*$ and $B^*$ with respect to the input data x and y. With this formulation, we can incorporate CCA as a building block within multi-modality neural networks that produces maximally correlated projections of its inputs. In our experiments, we use this building block within a cross-modality retrieval setting, optimizing a network to minimize a cosine distance-based pairwise ranking loss of the componentwise-correlated CCA projections. Experimental results show that when using the cosine distance for retrieval (as is common for correlated views), this is superior to optimizing a network for maximally correlated projections (as done in DCCA), or not using CCA at all. This observation holds in our experiments on a variety of different modality pairs as well as two different retrieval scenarios.

When investigating the experimental results in more detail, we find that the correlation-based methods (DCCA, CCAL) consistently outperform the models that learn the embedding projections from scratch. A direct comparison of DCCA with the proposed CCAL-L_rank reveals two learning scenarios where CCAL-L_rank is superior: (1) the low data regime, where we found that the CCA layer acts as a strong regularizer to prevent over-fitting; (2) when learning the entire retrieval representation (network parameterization) from scratch, not relying on pretrained or hand-crafted features (see Sect. 5.2). Our intuition on this is that incorporating the task-specific retrieval objective already during training encourages the networks to learn embedding representations that are beneficial for retrieval at test time. This is the important conceptual difference compared to the Trace Norm Objective (TNO) of DCCA, which does not focus on the retrieval task. However, when using the CCA layer we also inherit one drawback of the pairwise ranking loss, which is the additional hyper-parameter (margin α) that needs to be determined on the validation set.

Finally, we would like to emphasize that our CCA layer is a general network component which could provide a useful basis for further research, e.g., as an intermediate processing step for learning binary cross-modality retrieval representations.

Acknowledgements Open access funding provided by Johannes Kepler University Linz.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix

Implementation details

Backpropagating the errors through the CCA projection matrices is not trivial. The optimal CCA projection matrices are given by $A^* = \Sigma_{xx}^{-1/2} U$ and $B^* = \Sigma_{yy}^{-1/2} V$, where U and V are derived from the singular value decomposition of $T = \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2} = U \,\mathrm{diag}(d)\, V'$ (see Sect. 2). The proposed model needs to backpropagate the errors through the CCA transformations, i.e., it requires the gradients of the projected data $x^* = A^{*\prime} x$ and $y^* = B^{*\prime} y$ wrt. x and y. Applying the chain rule, this further requires the gradients of U and V wrt. T, and the gradients of T, $\Sigma_{xx}^{-1/2}$, $\Sigma_{xy}$ and $\Sigma_{yy}^{-1/2}$ wrt. x and y.

The main technical challenge is that common auto-differentiation tools such as Theano [28] or Tensor Flow [1] do not provide derivatives for the inverse squared root and singular value decomposition of a matrix. To overcome this, we replace the inverse squared root of a matrix by using its Cholesky decomposition as described in [9]. Furthermore, we note that the singular value decomposition is required to obtain the matrices U and V, but in fact those matrices can alternatively be obtained by solving the eigendecomposition of $TT' = U \,\mathrm{diag}(e)\, U'$ and $T'T = V \,\mathrm{diag}(e)\, V'$ [24, Eq. 270]. This yields the same left and right eigenvectors we would obtain from the SVD (except for possibly flipped signs, which are easy to fix), along with the squared singular values ($e_i = d_i^2$). Note that $TT'$ and $T'T$ are symmetric, and that the gradients of eigenvectors of symmetric real eigensystems have a simple form [17, Eq. 7]. Furthermore, $TT'$ and $T'T$ are differentiable wrt. x and y, enabling a sufficiently efficient implementation in a graph-based, auto-differentiating math compiler.

Footnote: Note that this is not relevant for the DCCA model introduced in [2] because it only derives the CCA projections after optimizing the TNO.

Footnote: The code of our implementation of the CCA layer is available at https://github.com/CPJKU/cca_layer.

The following section provides a detailed description of the implementation of the CCA layer.

Forward pass of CCA projection layer

For easier reproducibility, we provide a detailed description of the forward pass of the proposed CCA layer in Algorithm 1. To train the model, we need to propagate the gradient through the CCA layer (backward pass). We rely on auto-differentiation tools (in particular, Theano) implementing the gradient for each individual computation step in the forward pass, and connecting them using the chain rule.

The layer itself takes the latent feature representations (a batch of m paired vectors $X \in \mathbb{R}^{d_x \times m}$ and $Y \in \mathbb{R}^{d_y \times m}$) of the two network pathways f and g as input and projects them with CCA's analytical projection matrices. At train time, the layer uses the optimal projections computed from the current batch. When applying the layer at test time, it uses the statistics and projections remembered from the last training batch (which can of course be recomputed on a larger training batch to get a more stable estimate).

Algorithm 1 Forward Pass of CCA Projection Layer.
 1: Input of layer: $X \in \mathbb{R}^{d_x \times m}$ and $Y \in \mathbb{R}^{d_y \times m}$    ▷ hidden representation of current batch
 2: Returns: $X^*$ and $Y^*$    ▷ CCA projected hidden representation
 3: Parameters of layer: $\mu_x$, $\mu_y$ and $A^*$, $B^*$    ▷ means and CCA projection matrices
 4: if train_time then    ▷ update statistics and CCA projections during training
 5:     $\mu_x \leftarrow \frac{1}{m} \sum_i X_i$    ▷ update $\mu_x$ and $\mu_y$ with means of batch
 6:     $\mu_y \leftarrow \frac{1}{m} \sum_i Y_i$
 7:     $\bar{X} = X - \mu_x$    ▷ mean center data
 8:     $\bar{Y} = Y - \mu_y$
 9:     $\hat{\Sigma}_{xx} = \frac{1}{m-1} \bar{X}\bar{X}' + rI$    ▷ estimate covariances of batch
10:     $\hat{\Sigma}_{yy} = \frac{1}{m-1} \bar{Y}\bar{Y}' + rI$
11:     $\hat{\Sigma}_{xy} = \frac{1}{m-1} \bar{X}\bar{Y}'$
12:     $C_{xx}^{-1} = \mathrm{cholesky}(\hat{\Sigma}_{xx})^{-1}$    ▷ compute inverses of Cholesky factorizations
13:     $C_{yy}^{-1} = \mathrm{cholesky}(\hat{\Sigma}_{yy})^{-1}$
14:     $T = C_{xx}^{-1} \hat{\Sigma}_{xy} (C_{yy}^{-1})'$    ▷ compute matrix T
15:     $e, U = \mathrm{eigen}(TT')$    ▷ compute eigenvectors of $TT'$ and $T'T$
16:     $e, V = \mathrm{eigen}(T'T)$
17:     $A^* \leftarrow (C_{xx}^{-1})' U$    ▷ compute and update CCA projection matrices
18:     $B^* \leftarrow (C_{yy}^{-1})' V$
19:     $A^* \leftarrow A^* \cdot \mathrm{sgn}(\mathrm{diag}(A^{*\prime} \hat{\Sigma}_{xy} B^*))$    ▷ flip signs of projection matrices
20: else    ▷ at test time use statistics estimated during training
21:     $\bar{X} = X - \mu_x$    ▷ mean center test data
22:     $\bar{Y} = Y - \mu_y$
23: end if
24: $X^* = \bar{X}' A^*$    ▷ project latent representations with CCA projections
25: $Y^* = \bar{Y}' B^*$
return $X^*$, $Y^*$

As not all of the computation steps are obvious, we provide further details for the crucial ones. In line 12 and 13, we compute the Cholesky factorization instead of the matrix square root, as the latter has no gradients implemented in Theano. As a consequence, we need to transpose $C_{yy}^{-1}$ when computing T in line 14 [9]. In line 15 and 16, we compute two eigendecompositions instead of one singular value decomposition (which also has no gradients implemented in Theano). In line 19, we flip the signs of the first projection matrix to match the second to only have positive correlations. This property is required for retrieval with cosine distance. Finally, in line 24 and 25, the two views get projected using $A^*$ and $B^*$. At test time, we apply the projections computed and stored during training (line 17).

Investigations on correlation structure

As an additional experiment we investigate the correlation structure of the learned representations for all three paradigms. For that purpose we compute the topmost hidden representation x and y of the audio-sheet-music-pairs and estimate the canonical correlation coefficients $d_i$ of the respective embedding spaces. For the present example, this yields 32 coefficients which is the dimensionality k of our retrieval embedding space. Figure 6 compares the correlation coefficients where 1.0 is the maximum value reachable. The most prominent observation in Fig. 6 is the high correlation coefficients of the representation learned with DCCA. This structure is expected as the TNO focuses solely on correlation maximization. However, when recalling the results of Table 4 we see that this does not necessarily lead to the best retrieval performance. The freely learned embedding Learned-L_rank shows overall the lowest correlation but achieves comparable results to DCCA on this dataset. In terms of overall correlation, CCAL-L_rank is situated in-between the two other approaches. We have seen in all our experiments that combining both concepts in a unified retrieval paradigm yields best retrieval performance over different application domains as well as data regimes. We take this as evidence that componentwise-correlated projections support cosine distance-based embedding space learning.

Fig. 6 Comparison of the 32 correlation coefficients $d_i$ (the dimensionality of the retrieval space is 32) of the topmost hidden representations x and y of the audio-to-sheet-music dataset and the respective optimization paradigm. The maximum correlation possible is 1.0 for each coefficient

Architecture and optimization

In the following, we provide additional details for our experiments carried out in Sect. 5.

Image-text retrieval

We start training with an initial learning rate of either 0.001 (all models on IAPR TC-12 and Flickr30k Learned-L_rank) or 0.002 (Flickr30k DCCA and CCAL-L_rank). In addition, we apply 0.0001 L2 weight decay and set the batch size to 1000 for all models. The parameter α of the ranking loss in Eq. (8) is set to 0.5. After no improvement on the validation set for 50 epochs, we divide the learning rate by 10 and reduce the patience to 10. This learning rate reduction is repeated three times.

Audio-sheet-music retrieval

Table 7 provides details on our audio-sheet-music retrieval architecture. As in the experiments on images and text, we optimize our networks using adam with an initial learning rate of 0.001 and batch size 1000. The refinement strategy is the same, but no weight decay is applied and the margin parameter α of the ranking loss is set to 0.7.

Table 7 Architecture of audio-sheet-music model

Sheet-image 40 × 100        Spectrogram 136 × 100
2 × Conv(3, pad-1)-16       2 × Conv(3, pad-1)-16
BN-ELU + MP(2)              BN-ELU + MP(2)
2 × Conv(3, pad-1)-32       2 × Conv(3, pad-1)-32
BN-ELU + MP(2)              BN-ELU + MP(2)
2 × Conv(3, pad-1)-64       2 × Conv(3, pad-1)-64
BN-ELU + MP(2)              BN-ELU + MP(2)
2 × Conv(3, pad-1)-64       2 × Conv(3, pad-1)-64
BN-ELU + MP(2)              BN-ELU + MP(2)
Conv(1, pad-0)-32-BN        Conv(1, pad-0)-32-BN
Global average pooling      Global average pooling
Respective optimization target

BN batch normalization, ELU exponential linear unit, MP max pooling, Conv(3, pad-1)-16: 3 × 3 convolution, 16 feature maps and padding 1

Zero-shot retrieval

Tables 8 and 9 provide details on the architectures used for our zero-shot retrieval experiments carried out in Sect. 5.4.

Table 8 Architecture of zero-shot retrieval CNN

ImageNet feature 1024       Text VS × 30 × 1
FC(1024)-BN-ELU             1 × Conv(3, pad-same)-256
FC(1024)-BN-ELU             BN-ELU + MP(3, 1)
FC(64)                      2 × Conv(3, pad-valid)-256
                            FC(1024)-BN-ELU
                            FC(64)
Respective optimization target

VS vocabulary size, BN batch normalization, ELU exponential linear unit, MP max pooling, Conv(3, pad-1)-16: 3 × 3 convolution, 16 feature maps and padding 1

Table 9 Architecture of zero-shot retrieval CRNN

ImageNet feature 1024       Text VS × 30 × 1
FC(1024)-BN-ELU             1 × Conv(3, pad-same)-256
FC(1024)-BN-ELU             BN-ELU + MP(3, 1)
FC(64)                      2 × Conv(3, pad-valid)-256
                            GRU-RNN(512)
                            Temporal average pooling
                            FC(64)
Respective optimization target

The general architectures follow Reed et al. [25] but are optimized with a pairwise ranking loss in combination with our proposed CCA layer. The dimensionality of the retrieval space is fixed to 64 and both models are again optimized with adam and a batch size of 1000. The learning rate is set to 0.0007 for the CNN and 0.01 for the CRNN. The margin parameter α of the ranking loss is set to 0.2.
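The learning-rate refinement strategy described under "Image-text retrieval" above (patience of 50 epochs, division by 10, patience reduced to 10, three reductions) can be sketched as a small state machine. The class name `RefinementSchedule` and its interface are hypothetical, and two points are assumptions rather than statements from the text: the validation score is taken to be a higher-is-better measure such as MRR, and training is taken to stop once the patience runs out after the third reduction.

```python
class RefinementSchedule:
    """Patience-based learning-rate refinement (illustrative sketch)."""

    def __init__(self, lr=0.001, patience=50, refine_patience=10,
                 factor=10.0, max_refinements=3):
        self.lr = lr
        self.patience = patience
        self.refine_patience = refine_patience
        self.factor = factor
        self.max_refinements = max_refinements
        self.refinements = 0
        self.best = float("-inf")
        self.wait = 0  # epochs since the last improvement or reduction

    def step(self, val_score):
        """Update after one epoch; return False when training should stop."""
        if val_score > self.best:
            self.best = val_score
            self.wait = 0
            return True
        self.wait += 1
        if self.wait < self.patience:
            return True
        if self.refinements == self.max_refinements:
            # Assumption: stop once all reductions have been used up.
            return False
        # Divide the learning rate and switch to the shorter patience.
        self.lr /= self.factor
        self.patience = self.refine_patience
        self.refinements += 1
        self.wait = 0
        return True
```

In a training loop one would call `schedule.step(validation_mrr)` once per epoch, pass `schedule.lr` to the optimizer, and leave the loop when the call returns `False`.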
In addition, we apply a weight decay of 0.0001 on all trainable parameters of the network for regularization.

Footnote: The initial learning rate and parameter α are determined by grid search on the evaluation measure MRR on the validation set.

References

1. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M et al (2016) Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467
2. Andrew G, Arora R, Bilmes J, Livescu K (2013) Deep canonical correlation analysis. In: Proceedings of the international conference on machine learning, pp 1247–1255
3. Boulanger-Lewandowski N, Bengio Y, Vincent P (2012) Modeling temporal dependencies in high-dimensional sequences: application to polyphonic music generation and transcription. In: Proceedings of the 29th international conference on machine learning (ICML-12), pp 1159–1166
4. Chatfield K, Simonyan K, Vedaldi A, Zisserman A (2014) Return of the devil in the details: delving deep into convolutional nets. In: British machine vision conference
5. Chung J, Gülçehre Ç, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555
6. Clevert D, Unterthiner T, Hochreiter S (2015) Fast and accurate deep network learning by exponential linear units (elus). In: International conference on learning representations (ICLR). arXiv:1511.07289
7. Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: CVPR09
8. Dorfer M, Arzt A, Widmer G (2016) Towards score following in sheet music images. In: Proceedings of the international society for music information retrieval conference (ISMIR)
9. Hardoon DR, Szedmak S, Shawe-Taylor J (2004) Canonical correlation analysis: an overview with application to learning methods. Neural Comput 16(12):2639–2664
10. Hermann KM, Blunsom P (2013) Multilingual distributed representations without word alignment. arXiv preprint arXiv:1312.6173
11. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167
12. Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3128–
13. Karpathy A, Joulin A, Li FFF (2014) Deep fragment embeddings for bidirectional image sentence mapping. In: Advances in neural information processing systems, pp 1889–1897
14. Kingma D, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
15. Kiros R, Salakhutdinov R, Zemel RS (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539
16. Lin M, Chen Q, Yan S (2013) Network in network. CoRR, abs/1312.4400
17. Magnus JR (1985) On differentiating eigenvalues and eigenvectors. Econom Theory 1(2):179–191
18. Mao J, Xu W, Yang Y, Wang J, Yuille AL (2014) Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090
19. Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Probability and mathematical statistics. Academic Press, London
20. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
21. Nilsback M-E, Zisserman A (2008) Automated flower classification over a large number of classes. In: Proceedings of the Indian conference on computer vision, graphics and image processing
22. Papadopoulo T, Lourakis MIA (2000) Estimating the Jacobian of the singular value decomposition: theory and applications. In: Proceedings of the 6th European conference on computer vision (ECCV)
23. Pereira JC, Coviello E, Doyle G, Rasiwasia N, Lanckriet GRG, Levy R, Vasconcelos N (2014) On the role of correlation and abstraction in cross-modal multimedia retrieval. IEEE Trans Pattern Anal Mach Intell 36(3):521–535
24. Petersen KB, Pedersen MS (2012) The matrix cookbook, nov 2012. Version 20121115
25. Reed S, Akata Z, Schiele B, Lee H (2016) Learning deep representations of fine-grained visual descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition
26. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
27. Socher R, Karpathy A, Le QV, Manning CD, Ng AY (2014) Grounded compositional semantics for finding and describing images with sentences. Trans Assoc Comput Linguist 2:207–218
28. Theano Development Team (2016) Theano: a Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016
29. Vendrov I, Kiros R, Fidler S, Urtasun R (2016) Order-embeddings of images and language. CoRR, abs/1511.06361
30. Welinder P, Branson S, Mita T, Wah C, Schroff F, Belongie S, Perona P (2010) Caltech-UCSD Birds 200. Technical report CNS-TR-2010-001, California Institute of Technology
31. Yan F, Mikolajczyk K (2015) Deep correlation for matching images and text. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3441–3450

International Journal of Multimedia Information Retrieval
Springer Journals

End-to-end cross-modality retrieval with CCA projections and pairwise ranking loss



Publisher
Springer Journals
Copyright
Copyright © 2018 by The Author(s)
Subject
Computer Science; Multimedia Information Systems; Information Storage and Retrieval; Information Systems Applications (incl.Internet); Data Mining and Knowledge Discovery; Image Processing and Computer Vision; Database Management
ISSN
2192-6611
eISSN
2192-662X
DOI
10.1007/s13735-018-0151-5

Abstract

Cross-modality retrieval encompasses retrieval tasks where the fetched items are of a different type than the search query, e.g., retrieving pictures relevant to a given text query. The state-of-the-art approach to cross-modality retrieval relies on learning a joint embedding space of the two modalities, where items from either modality are retrieved using nearest-neighbor search. In this work, we introduce a neural network layer based on canonical correlation analysis (CCA) that learns better embedding spaces by analytically computing projections that maximize correlation. In contrast to previous approaches, the CCA layer allows us to combine existing objectives for embedding space learning, such as pairwise ranking losses, with the optimal projections of CCA. We show the effectiveness of our approach for cross-modality retrieval on three different scenarios (text-to-image, audio-sheet-music and zero-shot retrieval), surpassing both Deep CCA and a multi-view network using freely learned projections optimized by a pairwise ranking loss, especially when little training data is available (the code for all three methods is released at: https://github.com/CPJKU/cca_layer).

Keywords Cross-modality retrieval · Canonical correlation analysis · Ranking loss · Neural network · Joint embedding space

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s13735-018-0151-5) contains supplementary material, which is available to authorized users.

Matthias Dorfer, matthias.dorfer@jku.at
Department of Computational Perception, Johannes Kepler University Linz, 4040 Linz, Austria
The Austrian Research Institute for Artificial Intelligence, 1010 Vienna, Austria

1 Introduction

Cross-modality retrieval is the task of retrieving relevant items of a different modality than the search query (e.g., retrieving an image given a text query). One approach to tackle this problem is to define transformations which embed samples from different modalities into a common vector space. We can then project a query into this embedding space, and retrieve, using nearest-neighbor search, a corresponding candidate projected from another modality.

A particularly successful class of models uses parametric nonlinear transformations (e.g., neural networks) for the embedding projections, optimized via a retrieval-specific objective such as a pairwise ranking loss [15,27]. This loss aims at decreasing the distance (a differentiable function such as Euclidean or cosine distance) between matching items, while increasing it between mismatching ones. Specialized extensions of this loss achieved state-of-the-art results in various domains such as natural language processing [10], image captioning [12], and text-to-image retrieval [29].

In a different approach, Yan and Mikolajczyk [31] propose to learn a joint embedding of text and images using Deep canonical correlation analysis (DCCA) [2]. Instead of a pairwise ranking loss, DCCA directly optimizes the correlation of learned latent representations of the two views. Given the correlated embedding representations of the two views, it is possible to perform retrieval via cosine distance. The promising performance of their approach is also in line with the findings of Costa et al. [23] who state the following two hypotheses regarding the properties of efficient cross-modal retrieval spaces: first, the embedding spaces should account for low-level cross-modal correlations and second, they should enable semantic abstraction. In [31], both properties are met by a deep neural network (learning abstract representations) that is optimized with DCCA ensuring highly correlated latent representations.

In summary, the optimization of pairwise ranking losses yields embedding spaces that are useful for retrieval, and allows incorporating domain knowledge into the loss function. On the other hand, DCCA is designed to maximize correlation (which has already proven to be useful for cross-modality retrieval [31]) but does not allow to use loss formulations specialized for the task at hand.

In this paper, we propose a method to combine both approaches in a way that retains their advantages. We develop a Canonical Correlation Analysis Layer (CCAL) that can be inserted into a dual-view neural network to produce a maximally correlated embedding space for its latent representations. We can then apply task-specific loss functions, in particular the pairwise ranking loss, on the output of this layer. To train a network using the CCA layer, we describe how to backpropagate the gradient of this loss function to the dual-view neural network while relying on automatic differentiation tools such as Theano [28] or Tensorflow [1]. In our experiments, we show that our proposed method performs better than DCCA and models using pairwise ranking loss alone, especially when little training data is available.

Figure 1 compares our proposed approach to the alternatives discussed above. DCCA defines an objective optimizing a dual-view neural network such that its two views will be maximally correlated (Fig. 1a). Pairwise ranking losses are loss functions to optimize a dual-view neural network such that its two views are well-suited for nearest-neighbor retrieval in the embedding space (Fig. 1b). In our approach, we boost optimization of a pairwise ranking loss based on cosine distance by placing a special-purpose layer, the CCA projection layer, between a dual-view neural network and the optimization target (Fig. 1c). Our experiments in Sect. 5 will show the effectiveness of this proposal.

Fig. 1 Sketches of cross-modality retrieval networks. The proposed model in (c) unifies (a, b) and takes advantage of both componentwise-correlated CCA projections and a pairwise ranking loss for cross-modality embedding space learning. We emphasize that our proposal in (c) requires to backpropagate the ranking loss L through the analytical computation of the optimally correlated CCA embedding projections $A^*$ and $B^*$ (see Eq. 4). We thus need to compute their partial derivatives with respect to the network's hidden representations x and y, i.e., $\frac{\partial A^*}{\partial x,y}$ and $\frac{\partial B^*}{\partial x,y}$ (addressed in Sect. 4). a DCCA network maximizes correlation via Trace Norm Objective (TNO). b Freely learned embedding projections optimized with ranking loss (Learned-L_rank). c Canonically correlated projection layer optimized with ranking loss (CCAL-L_rank)

2 Canonical correlation analysis

In this section, we review the concepts of CCA, the basis for our methodology. Let $x \in \mathbb{R}^{d_x}$ and $y \in \mathbb{R}^{d_y}$ denote two random column vectors with covariances $\Sigma_{xx}$ and $\Sigma_{yy}$ and cross-covariance $\Sigma_{xy}$. The objective of CCA is to find two matrices $A^* \in \mathbb{R}^{d_x \times k}$ and $B^* \in \mathbb{R}^{d_y \times k}$ composed of k paired column vectors $A_j$ and $B_j$ (with $k \le d_x$ and $k \le d_y$) that project x and y into a common space maximizing their componentwise correlation:

$$(A^*, B^*) = \arg\max_{A,B} \sum_{j=1}^{k} \mathrm{corr}(A_j' x, B_j' y) \tag{1}$$

$$= \arg\max_{A,B} \sum_{j=1}^{k} \frac{A_j' \Sigma_{xy} B_j}{\sqrt{A_j' \Sigma_{xx} A_j \; B_j' \Sigma_{yy} B_j}}. \tag{2}$$

Since the objective of CCA is invariant to scaling of the projection matrices, we constrain the projected dimensions to have unit variance. Furthermore, CCA seeks subsequently uncorrelated projection vectors, arriving at the equivalent formulation:

$$(A^*, B^*) = \arg\max_{A' \Sigma_{xx} A = B' \Sigma_{yy} B = I_k} \mathrm{tr}\left(A' \Sigma_{xy} B\right). \tag{3}$$

Let $T = \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2}$, and let $U \,\mathrm{diag}(d)\, V'$ be the Singular Value Decomposition (SVD) of T with ordered singular values $d_i \ge d_{i+1}$. As shown in [19], we obtain $A^*$ and $B^*$ from the top k left- and right-singular vectors of T:

$$A^* = \Sigma_{xx}^{-1/2} U_{:k} \qquad B^* = \Sigma_{yy}^{-1/2} V_{:k}. \tag{4}$$

Moreover, the correlation in the projection space is the sum of the top k singular values:

$$\mathrm{corr}(A^{*\prime} x, B^{*\prime} y) = \sum_{i \le k} d_i. \tag{5}$$

In practice, the covariances and cross-covariance of x and y are usually not known, but estimated from a training set of m paired vectors, expressed as matrices $X \in \mathbb{R}^{d_x \times m}$, $Y \in \mathbb{R}^{d_y \times m}$ by:

$$\hat{\Sigma}_{xx} = \frac{1}{m-1} \bar{X}\bar{X}' + rI \quad \text{and} \quad \hat{\Sigma}_{xy} = \frac{1}{m-1} \bar{X}\bar{Y}'. \tag{6}$$

$\bar{X}$ is the centered version of X. $\hat{\Sigma}_{yy}$ is defined analogously to $\hat{\Sigma}_{xx}$. Additionally, we apply a regularization parameter $rI$ to ensure that the covariance matrices are positive definite. Substituting these estimates for $\Sigma_{xx}$, $\Sigma_{xy}$ and $\Sigma_{yy}$, respectively, we can compute $A^*$ and $B^*$ using Eq. (4).

3 Cross-modality retrieval baselines

In this section, we review the two most related works forming the basis for our approach.

3.1 Deep canonical correlation analysis

Andrew et al. [2] propose an extension of CCA to learn parametric nonlinear transformations of two random vectors, such that their correlation is maximized. Let $a \in \mathbb{R}^{d_a}$ and $b \in \mathbb{R}^{d_b}$ denote two random vectors, and let $x = f(a; \Theta_f)$ and $y = g(b; \Theta_g)$ denote their nonlinear transformations, parameterized by $\Theta_f$ and $\Theta_g$. DCCA optimizes the parameters $\Theta_f$ and $\Theta_g$ to maximize the correlation of the topmost hidden representations x and y. For $d_x = d_y = k$, this objective corresponds to Eq. 5, i.e., the sum of all singular values of T, also called the trace norm:

$$\mathrm{corr}(A^{*\prime} f(a; \Theta_f), B^{*\prime} g(b; \Theta_g)) = \|T\|_{tr}. \tag{7}$$

Footnote: We understand the correlation of two vectors to be defined as $\mathrm{corr}(x, y) = \sum_i \sum_j \mathrm{corr}(x_i, y_j)$.

Andrew et al. [2] show how to compute the gradient of this Trace Norm Objective (TNO) with respect to x and y. Assuming f and g are differentiable with respect to $\Theta_f$ and $\Theta_g$ (as is the case for neural networks), this allows to optimize the nonlinear transformations via gradient-based methods.

Yan and Mikolajczyk [31] suggest the following procedure to utilize DCCA for cross-modality retrieval: first, neural networks f and g are trained using the TNO, with a and b representing different views of an entity (e.g. image and text); then, after the training is finished, the CCA projections are computed using Eq. (4), and all retrieval candidates are projected into the embedding space; finally, at test time, queries of either modality are projected into the embedding space, and the best-matching sample from the other modality is found through nearest-neighbor search using the cosine distance. Figure 2 provides a summary of the entire retrieval pipeline. In our experiments, we will refer to this approach as DCCA-2015.

Fig. 2 DCCA retrieval pipeline proposed in [31]. Note that all processing steps below the solid line are performed after network optimization is complete

DCCA is limited by design to use the objective function described in Eq. (7), and only seeks to maximize the correlation in the embedding space. During training, the CCA projection matrices are never computed, nor are the samples projected into the common retrieval space. All the retrieval steps, most importantly the computation of CCA projections, are performed only once after the networks f and g have been optimized. This restricts potential applications, because we cannot use the projected data as an input to subsequent layers or task-specific objectives. We will show how our approach overcomes this limitation in Sect. 4.

3.2 Pairwise ranking loss

Kiros et al. [15] learn a multi-modal joint embedding space for images and text. They use the cosine of the angle between two corresponding vectors x and y as a scoring function, i.e., $s(x, y) = \cos(x, y)$. Then, they optimize a pairwise ranking loss

$$L_{rank} = \sum_{x} \sum_{k} \max\{0, \alpha - s(x, y) + s(x, y_k)\} \tag{8}$$

where x is an embedded sample of the first modality, y is the matching embedded sample of the second modality, and $y_k$ are the contrastive (mismatching) embedded samples of the second modality (in practice, all mismatching samples in the current mini-batch). The hyper-parameter α defines the margin of the loss function. This loss encourages an embedding space where the cosine distance between matching samples is lower than the cosine distance of mismatching samples.

In this setting, the networks f and g have to learn the embedding projections freely from randomly initialized weights. Since the projections are learned from scratch by optimizing a ranking loss, in our experiments, we denote this approach by Learned-L_rank. Figure 1b shows a sketch of this paradigm.

4 Learning with canonically correlated embedding projections

In the following, we explain how to bring both concepts (DCCA and pairwise ranking losses) together to enhance cross-modality embedding space learning.

4.1 Motivation

We start by providing an intuition on why we expect this combination to be fruitful: DCCA-2015 maximizes the correlation between the latent representations of two different neural networks via the TNO derived from classic CCA. As correlation and cosine distance are related, we can also use such a network for cross-modality retrieval [31]. Kiros et al. [15], on the other hand, learn a cross-modality retrieval embedding by optimizing an objective customized for the task at hand. The motivation for our approach is that we want to benefit from both: a task-specific retrieval objective, and componentwise optimally correlated embedding projections.

To achieve this, we devise a CCA layer that analytically computes the CCA projections $A^*$ and $B^*$ during training, and projects incoming samples into the embedding space. The [...]

In this section, we discuss how to establish gradient flow (backpropagation) through CCA's optimal projection matrices. In particular, we require the partial derivatives $\frac{\partial A^*}{\partial x,y}$ and $\frac{\partial B^*}{\partial x,y}$ of the projections with respect to their input representations x and y. This will allow us to use CCA as a layer within a multi-modality neural network, instead of as a final objective (TNO) for correlation maximization only.

4.2 Gradient of CCA projections

As mentioned above, we can compute the canonical correlation along with the optimal projection matrices from the singular value decomposition $T = \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2} = U \,\mathrm{diag}(d)\, V'$. Specifically, we obtain the correlation as $\mathrm{corr}(A^{*\prime} x, B^{*\prime} y) = \sum_i d_i$, and the projections as $A^* = \Sigma_{xx}^{-1/2} U$ and $B^* = \Sigma_{yy}^{-1/2} V$. For DCCA, it suffices to compute the gradient of the total correlation wrt. x and y in order to backpropagate it through the two networks f and g. Using the chain rule, Andrew et al. decompose this into the gradients of the total correlation wrt. $\Sigma_{xx}$, $\Sigma_{xy}$ and $\Sigma_{yy}$, and the gradients of those wrt. x and y [2]. Their derivations of the former make use of the fact that both the gradient of $\sum_i d_i$ wrt. T and the gradient of $\|T\|_{tr}$ (the trace norm objective in Eq. (7)) wrt. $T'T$ have a simple form; see Section 7 in [2] for details.

In our case where we would like to backpropagate errors through the CCA transformations, we instead need the gradients of the projected data $x^* = A^{*\prime} x$ and $y^* = B^{*\prime} y$ wrt. x and y, which requires the partial derivatives $\frac{\partial A^*}{\partial x,y}$ and $\frac{\partial B^*}{\partial x,y}$. We could again decompose this into the gradients wrt. T, the gradients of T wrt. $\Sigma_{xx}$, $\Sigma_{xy}$ and $\Sigma_{yy}$ and the gradients of those wrt. x and y. However, while the gradients of U and V wrt. T are known [22], they involve solving $O((d_x d_y)^2)$ linear 2 × 2 systems.
Instead, we reformulate the solution to use projected samples can then be used in subsequent layers, or two symmetric eigendecompositions TT = U diag(e)U for computing task-specific losses such as the pairwise rank- and T T = V diag(e)V (Equation 270 in [24]). This gives ing loss. Figure 1c illustrates the central idea of our combined us the same left and right eigenvectors we would obtain from approach. Compared to Fig. 1b, we insert an additional linear the SVD, along with the squared singular values (e = d ). transformation. However, this transformation is not learned The gradients of eigenvectors of symmetric real eigensys- (otherwise it could be merged with the previous layer, which tems have a simple form [17] and both TT and T T are is not followed by a nonlinearity). Instead, it is computed to differentiable wrt. x and y. be the transformation that maximizes componentwise corre- To summarize: in order to obtain an efficiently computable ∗ ∗ lation between the two views. A and B in Fig. 1care the definition of the gradient for CCA projections, we have very projections given by Eq. (4) in Sect. 2. reformulated the forward pass (the computation of the CCA In theory, optimizing a pairwise ranking loss alone could transformations). Our formulation using two eigendecom- yield projections equivalent to the ones computed by CCA. positions translates into a series of computation steps that In practice, however, we observe that the proposed combi- are differentiable in a graph-based, auto-differentiating math nation gives much better cross-modality retrieval results (see compiler such as Theano [28], which, together with the chain Sect. 5). rule, gives an efficient implementation of the CCA layer gra- Our design requires backpropagating errors through the dient for training our network. For a detailed description of analytical computation of the CCA projection matrices. 
DCCA [2] does not cover this, since projecting the data is The code of our implementation of the CCA layer is available at not necessary for optimizing the TNO. In the remainder of https://github.com/CPJKU/cca_layer. 123 International Journal of Multimedia Information Retrieval (2018) 7:117–128 121 Table 1 Example images for Flickr30k (top) and IAPR TC-12 (bottom) A man in a white cowboy hat reclines in front of a window in an airport A young man rests on an airport seat with a cowboy hat over his face A woman relaxes on a couch , with a white cowboy hat over her head A man is sleeping inside on a bench with his hat over his eyes Apersonissleepingatanairportwithahat Fig. 3 Sketch of cross-modality retrieval. The blue dots are the embed- on their head ded candidate samples. The red dot is the embedding of the search query. The larger blue dot highlights the closest candidate selected as A green and brown embankment with brown the retrieval result (colour figure online) houses on the right and a light brown sandy beach at the dark blue sea on the left; a dark mountain range behind it and white clouds in a light blue sky in the background the CCA layer forward pass, we refer to Algorithm 1 in the “Appendix” of this article. As the technical implementation is not straight-forward, we also discuss the crucial steps in the “Appendix”. Thus, we now have the means to benefit from the optimal we define the MRR (higher is better) as the mean value of CCA projections but still optimize for a task-specific objec- 1/rank over all queries where rank is again the position tive. In particular, we utilize the pairwise ranking loss of of the target in the similarity-ordered list of available candi- Eq. (8) on top of an intermediate CCA embedding projection dates. layer. We denote the proposed retrieval network of Fig. 1c as CCAL-L in our experiments (CCAL refers to CCA rank 5.1 Image-text retrieval Layer). 
In the first part of our experiments, we consider Flickr30k and IAPR TC-12, two publicly available datasets for image-text 5 Experiments cross-modality retrieval. Flickr30k consists of image-caption pairs, where each image is annotated with five different tex- We evaluate our approach (CCAL-L ) in cross-modality tual descriptions. The train-validation-test split for Flickr30k rank retrieval experiments on two image-to-text and one audio-to- is 28000-1000-1000. In terms of evaluation setup, we fol- sheet-music dataset. Additionally, we provide results on two low Protocol 3 of [31] and concatenate the five available zero-shot text-to-image retrieval scenarios proposed in [25]. captions into one, meaning that only one, but richer text anno- For comparison, we consider the approach of [31](DCCA- tation remains per image. This is done for all three sets of 2015), our own implementation of the TNO (denoted by the split. The second image-text dataset, IAPR TC-12, con- DCCA), as well as the freely learned projection embeddings tains 20000 natural images where only one—but compared (Learned-L ) optimizing the ranking loss of [15]. to Flickr30k more detailed—caption is available for each rank The task for all three datasets is to retrieve the correct image. As no predefined train-validation-test split is pro- counterpart when given an instance of the other modality vided, we randomly select 1000 images for validation and as a search query. For retrieval, we use the cosine distance 2000 for testing, and keep the rest for training. [31]alsouse in embedding space for all approaches. First, we embed all 2000 images for testing, but did not explicitly mention hold- candidate samples of the target modality into the retrieval out images for validation. Table 1 shows an example image embedding space. Then, we embed the query element y with along with its corresponding captions or caption for either the second network and select its nearest-neighbor x of the dataset. target modality. 
Fig. 3 shows a sketch of this retrieval by The input to our networks is a 4096-dimensional image embedding space learning paradigm. feature vector along with a corresponding text vector repre- As evaluation measures, we consider the Recall@k (R@k sentation which has dimensionality 5793 for Flickr30k and in %) as well as the Median Rank (MR) and the Mean 2048 for IAPR TC-12. The image embedding is computed Reciprocal Rank (MRR in %).The R@k rate (higher is from the last hidden layer of a network pretrained on Ima- better) is the ratio of queries which have the correct cor- geNet [7] (layer fc7 of CNN_S by [4]). In terms of text responding counterpart in the first k retrieval results. The pre-processing, we follow [31], tokenizing and lemmatizing MR (lower is better) is the median position of the target the raw captions as the first step. Based on the lemmatized in a similarity-ordered list of available candidates. Finally, captions, we compute l2-normalized TF/IDF-vectors, omit- 123 122 International Journal of Multimedia Information Retrieval (2018) 7:117–128 Table 2 Retrieval results on Method Image-to-text Text-to-image IAPR TC-12. “DCCA-2015” is R@1 R@5 R@10 MR MRR R@1 R@5 R@10 MR MRR taken from [31] DCCA-2015 30.2 57.0 – – 42.6 29.5 60.0 – – 41.5 DCCA 31.0 58.7 70.4 3.6 43.9 29.5 58.2 70.5 4.0 42.7 Learned-L 22.3 50.7 63.8 5.2 35.7 21.6 50.1 63.3 5.5 35.1 rank CCAL-L 31.6 61.0 72.2 3.0 45.0 29.6 60.0 72.2 3.6 43.5 rank Table 3 Retrieval results on Method Image-to-text Text-to-image Flickr30k. “DCCA-2015” is R@1 R@5 R@10 MR MRR R@1 R@5 R@10 MR MRR taken from [31] DCCA-2015 27.9 56.9 68.2 4 – 26.8 52.9 66.9 4 – DCCA 31.6 59.2 69.3 3.3 44.2 30.3 58.3 69.2 3.8 43.1 Learned-L 23.7 50.5 63.0 5.3 36.3 23.6 51.0 62.5 5.2 36.5 rank CCAL-L 32.0 59.2 70.4 3.2 44.8 29.9 58.8 70.2 3.7 43.3 rank ting words with an overall occurrence smaller than five for where no results are available in the literature. When look- Flickr30k and three for IAPR TC-12, respectively. 
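The three evaluation measures used in this section (R@k, MR and MRR) can be computed from the rank of the correct counterpart for each query. The sketch below is illustrative only; the function name `retrieval_metrics` and the input format (a list of 1-based ranks) are assumptions, not part of the paper.

```python
import statistics


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Compute R@k (in %), Median Rank and MRR (in %).

    ranks: for each query, the 1-based position of the correct
    counterpart in the similarity-ordered candidate list.
    """
    n = len(ranks)
    # R@k: share of queries whose target appears in the first k results
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    # MR: median position of the target (lower is better)
    metrics["MR"] = statistics.median(ranks)
    # MRR: mean of 1/rank over all queries (higher is better)
    metrics["MRR"] = 100.0 * sum(1.0 / r for r in ranks) / n
    return metrics
```

For example, four queries whose targets are found at ranks 1, 2, 10 and 20 give R@1 = 25%, R@5 = 50% and R@10 = 75%.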
The image ing at the performance of CCAL-L we further observe rank representation is processed by a linear dense layer with 128 that it outperforms all other methods, although the differ- units, which will also be the dimensionality k of the result- ence to DCCA is not pronounced for all of the measures. ing retrieval embedding. The text vector is fed through two Comparing CCAL-L with the freely learned projection rank batch-normalized [11] dense layers of 1024 units each and matrices (Learned-L ) we observe a much larger perfor- rank the ELU activation function [6]. As a last layer for the text mance gap. This is interesting, as in principle the learned representation network, we again apply a dense layer with projections could converge to exactly the same solution as 128 linear units. CCAL-L . We take this as a quantitative confirmation that rank For a fair comparison, we keep the structure and number the learning process benefits from CCA’s optimal projection of parameters of all networks in our experiments the same. matrices. The only difference between the networks are the objectives In Table 3, we list our results on the Flickr30k dataset. As and the hyper-parameters used for optimization. Optimiza- above, we show the retrieval performances of [31] as a base- tion is performed using Stochastic Gradient Descent (SGD) line along with our results and observe similar behavior as on with the adam update rule [14] (for details please see our IAPR TC-12. Again, we point out the poor performance of “Appendix”). the freely learned projections (Learned-L ) in this exper- rank Table 2 lists our results on IAPR TC-12. Along with our iment. Keeping this observation in mind, we will notice a experiments, we also show the results reported in [31]asa different behavior in the experiments in Sect. 5.2. reference (DCCA-2015). 
However, a direct comparison to Note that there are various other methods reporting results our results may not be fair: DCCA-2015 uses a different on Flickr30k [13,15,18,27] which partly surpass ours, for ImageNet-pretrained network for the image representation, example by using more elaborate processing of the textual and finetunes this network while we keep it fixed. This is descriptions or more powerful ImageNet models. We omit because our interest is in comparing the methods in a sta- these results as we focus on the comparison of DCCA and ble setting, not in obtaining the best possible results. Our freely learned projections with the proposed CCA projection implementation of the TNO (DCCA) uses the same objec- embedding layer. tive as DCCA-2015, but is trained using the same network architecture as our remaining models and permits a direct 5.2 Audio-sheet-music retrieval comparison. Additionally, we repeat each of the experiments 10 times with different initializations and report the mean for For the second set of experiments, we consider the Not- each of the evaluation measures. tingham piano midi dataset [3]. The dataset is a collection When taking a closer look at Table 2, we observe that of midi files split into train, validation and test set already our results achieved by optimizing the TNO (DCCA)sur- used by [8] for experiments on end-to-end score-following pass the results reported in [31]. We already discussed above in sheet-music images. Here, we tackle the problem of audio- that the two versions are not directly comparable. How- sheet-music retrieval, i.e., matching short snippets of music ever, given this result, we consider our implementation of (audio) to corresponding parts in the sheet music (image). DCCA as a valid baseline for our experiments in Sect. 5.2 Figure 4 shows examples of such correspondences. 123 International Journal of Multimedia Information Retrieval (2018) 7:117–128 123 learned embedding projections. 
On measures such as R@5 or R@10 it achieves similar to or better performance than DCCA. One of the reasons for this could be the fact that there is an order of magnitude more training data available for this task to learn the projection embedding from random Fig. 4 Example of the data considered for audio-sheet-music (image) initialization. Still, our proposed combination of both con- retrieval. Top: short snippets of sheet-music images. Bottom: Spectro- gram excerpts of the corresponding music audio cepts (CCAL-L ) achieves highest retrieval scores. rank 5.3 Performance in small data regime We conduct this experiment for two reasons: First, to show the advantage of the proposed method over different The above results suggest that the benefit of using a CCA pro- domains. Second, the data and application is of high practi- jection layer (CCAL-L ) over a freely learned projection rank cal relevance in the domain of Music Information Retrieval becomes most evident when few training data is available. (MIR). A system capable of linking sheet music (images) and To examine this assumption, we repeat the audio-to-sheet- the corresponding music (audio) would be useful in many music experiment of the previous section, but use only 10% content-based musical retrieval scenarios. of the original training data (≈ 27000 samples). We stress In terms of audio preparation, we compute log frequency the fact that the learned embedding projection of Learned- spectrograms with a sample rate of 22.05 kHz, a FFT win- L could converge to exactly the same solution as the rank dow size of 2048, and a computation rate of 31.25 frames CCA projections of CCAL-L . Table 5 summarizes the rank per second. These spectrograms (136 frequency bins) are low data regime results for the three methods. Consistent with then directly fed into the audio part of the cross-modality our hypothesis, we observe a larger gap between Learned- networks. 
Figure 4 shows a set of audio-to-sheet correspon- L and CCAL-L compared to the one obtained with rank rank dences presented to our network for training. One audio all training data in Table 4. We conclude that a network excerpt comprises 100 frames and the dimension of the sheet might be able to learn suitable embedding projections when image snippet is 40 × 100 pixels. Overall this results in sufficient training data is available. However, when having 270,705 train, 18,046 validation and 16,042 test audio-sheet- fewer training samples, the proposed CCA projection layer music pairs. This is an order of magnitude more training data strongly supports embedding space learning. In addition, we than for the image-to-text datasets of the previous section. also looked into the retrieval performance of Learned-L rank In the experiments in Sect. 5.1, we relied on pretrained and CCAL-L on the training set and observe comparable rank ImageNet features and relatively shallow fully connected performance. This indicates that the CCA layer also acts as text-feature processing networks. The model here differs a regularizer and helps to generalize to unseen samples. from this, as it consists of two deep convolutional net- works learned entirely from scratch. Our architecture is a VGG-style [26] network consisting of sequences of 3×3 con- 5.4 Zero-shot image-text retrieval volution stacks followed by 2 × 2 max pooling. To reduce the dimensionality to the desired correlation space dimen- Our last set of experiments focuses on a slightly modified sionality k (in this case 32), we insert as a final building retrieval setting, namely image-text zero-shot retrieval [25]. block a 1 × 1 convolution having k feature maps followed by Given a set of image-text pairs originating from C differ- global average pooling [16] (for further architectural details ent categories the data is split into a class-disjoint training, we again refer to the appendix of this manuscript). 
validation and test sets having no categorical overlap. This Table 4 lists our result on audio-to-sheet-music retrieval. implies that at test time we aim to retrieve images from tex- As in the experiments on images and text, the proposed CCA tual queries describing categories (semantic concepts) never projection embedding layer trained with pairwise ranking seen before, neither for training, nor for validation. loss outperforms the other models. Recalling the results from Reed et al. [25] collected and provided textual descriptions Sect. 5.1, we observe an increased performance of the freely for two publicly available datasets, the CUB-200 bird image Table 4 Retrieval results on Method Sheet-to-audio Audio-to-sheet Nottingham dataset R@1 R@5 R@10 MR MRR R@1 R@5 R@10 MR MRR (audio-to-sheet-music retrieval) DCCA 42.0 88.2 93.3 2 62.2 44.6 87.9 93.2 2 63.5 Learned-L 40.7 89.6 95.6 2 61.7 41.4 88.9 95.4 2 61.9 rank CCAL-L 44.1 93.3 97.7 2 65.3 44.5 91.6 96.7 2 64.9 rank 123 124 International Journal of Multimedia Information Retrieval (2018) 7:117–128 Table 5 Retrieval results on Method Sheet-to-audio Audio-to-sheet audio-to-sheet-music retrieval R@1 R@5 R@10 MR MRR R@1 R@5 R@10 MR MRR when using only 10% of the train data DCCA 20.0 53.6 65.4 5 35.3 22.7 54.7 65.8 4 37.3 Learned-L 11.3 35.2 47.6 12 23.0 12.6 35.2 47.2 12 23.7 rank CCAL-L 22.2 59.2 70.7 4 38.8 25.0 59.3 70.9 4 40.4 rank Table 6 Zero-shot retrieval results on cub and flowers Method Flowers Birds Attributes [25] – 50.0 Word2Vec [25] 52.1 33.5 Word CNN [25] 56.3 43.3 Word CNN-RNN [25] 59.6 48.7 Word CNN + CCAL 62.2 52.2 Fig. 5 Example images of CUB-200 birds and Oxford Flowers along with textual descriptions collected by Reed et al. [25] for zero-shot Word CNN-RNN + CCAL 64.0 49.8 retrieval from text For the Birds dataset, as an alternative to the tex- dataset [30] and the Oxford Flowers dataset [21]. 
According tual descriptions, there are manually created fine-grained to the definition of zero-shot retrieval above, we follow [25] attributes available for each of the images. When relying on and split CUB into 100 train, 50 validation and 50 test cate- these attributes Reed et al. report state-of-the-art results on gories. Flowers is split into 82 train and 20 validation / test the dataset [25] not reached by their text processing neural classes respectively. Figure 5 shows some example images networks. along with their textual descriptions. In the bottom part of Table 6, we report the performance Besides the modified, harder retrieval setting there is a sec- of the same architectures optimized using our proposed ond difference to the text-image retrieval experiments carried CCA layer in combination with a pairwise ranking loss. We out in Sect. 5.1. Instead of using hand engineered textual fea- observe that the CCA layer is able to improve the perfor- tures (e.g. TF-IDF) or unsupervised textual feature learning mance of both models on both datasets. The gain in retrieval (e.g. word2vec [20]) the authors in [25] employ Convolu- performance within a model class is largest for the convo- tional Recurrent Neural Networks (CRNN) to learn the latent lution only (CNN) text processing models (≈ 9% points for text representations directly from the raw descriptions. In the Flowers dataset and ≈ 6 for CUB). For the birds dataset particular, they feed the descriptions as one-hot-word encod- the Word CNN + CCAL even outperforms the models relying ings to the text processing part of their networks. In terms on manually encoded attributes by achieving an AP@50 of of image representations, they still rely on 1024-dimensional 52.2. pretrained ImageNet features. The feature learning part and the network architectures used for our experiments follows exactly the descriptions provided in [25]. 
The sole difference is, that we again replace the topmost embedding layer with 6 Discussion and conclusion the proposed CCA projection layer in combination with a pairwise ranking loss. We have shown how to use the optimal projection matri- Table 6 compares the retrieval results of the respective ces of CCA as the weights of an embedding layer within a methods on the two zero-shot retrieval datasets. To allow multi-view neural network. With this CCA layer, it becomes for a direct comparison with the results reported in [25], we possible to optimize for a specialized loss function (e.g., follow their evaluation setup and report the Average Preci- related to a retrieval task) on top of this, exploiting the cor- sion (AP@50). The AP@50 is the percentage of the top-50 relation properties of a latent space provided by CCA. As scoring images whose class matches that of the text query, this requires to establish gradient flow through CCA, we for- averaged over the 50 test classes. In [25] the best retrieval per- mulate it to allow easy computation of the partial derivatives ∗ ∗ ∂A ∂B ∗ ∗ formance for both datasets (when considering only feature and of CCA’s projection matrices A and B with ∂x,y ∂x,y learning) is achieved by having a CRNN directly process- respect to the input data x and y. With this formulation, we can ing the textual descriptions. What is also interesting is the incorporate CCA as a building block within multi-modality substantial performance gain with respect to unsupervised neural networks that produces maximally correlated projec- word2vec features. tions of its inputs. In our experiments, we use this building 123 International Journal of Multimedia Information Retrieval (2018) 7:117–128 125 −1/2 −1/2 block within a cross-modality retrieval setting, optimizing a of T = Σ Σ Σ = U diag(d)V (see Sect. 2). 
The xy xx yy network to minimize a cosine distance-based pairwise rank- proposed model needs to backpropagate the errors through ing loss of the componentwise-correlated CCA projections. the CCA transformations, i.e., it requires the gradients of the ∗ ∗ ∗ ∗ Experimental results show that when using the cosine dis- projected data x = A x and y = B y wrt. x and y. Apply- tance for retrieval (as is common for correlated views), this ing the chain rule, this further requires the gradients of U and −1/2 −1/2 is superior to optimizing a network for maximally correlated V wrt. T, and the gradients of T, Σ , Σ and Σ wrt. xx xy yy projections (as done in DCCA), or not using CCA at all. This x and y. observation holds in our experiments on a variety of different The main technical challenge is that common auto- modality pairs as well as two different retrieval scenarios. differentiation tools such as Theano [28] or Tensor Flow [1] When investigating the experimental results in more do not provide derivatives for the inverse squared root and detail, we find that the correlation-based methods (DCCA, singular value decomposition of a matrix. To overcome this, CCAL) consistently outperform the models that learn the we replace the inverse squared root of a matrix by using its embedding projections from scratch. A direct comparison of Cholesky decomposition as described in [9]. Furthermore, DCCA with the proposed CCAL-L reveals two learn- we note that the singular value decomposition is required rank ing scenarios where CCAL-L is superior: (1) the low to obtain the matrices U and V, but in fact those matrices rank data regime, where we found that the CCA layer acts as a can alternatively be obtained by solving the eigendecompo- strong regularizer to prevent over-fitting; (2) when learning sition of TT = U diag(e)U and T T = V diag(e)V [24, the entire retrieval representation (network parameteriza- Eq. 270]. 
This yields the same left and right eigenvectors tion) from scratch, not relying on pretrained or hand-crafted we would obtain from the SVD (except for possibly flipped features (see Sect. 5.2). Our intuition on this is that incor- signs, which are easy to fix), along with the squared singular porating the task-specific retrieval objective already during values (e = d ). Note that TT and T T are symmetric, and training encourages the networks to learn embedding repre- that the gradients of eigenvectors of symmetric real eigensys- sentations that are beneficial for retrieval at test time. This is tems have a simple form [17, Eq. 7]. Furthermore, TT and the important conceptual difference compared to the Trace T T are differentiable wrt. x and y, enabling a sufficiently effi- Norm Objective (TNO) of DCCA, which does not focus on cient implementation in a graph-based, auto-differentiating the retrieval task. However, when using the CCA layer we math compiler. also inherit one drawback of the pairwise ranking loss, which The following section provides a detailed description of is the additional hyper-parameter (margin α) that needs to be the implementation of the CCA layer. determined on the validation set. Finally, we would like to emphasize that our CCA layer is Forward pass of CCA projection layer a general network component which could provide a useful basis for further research, e.g., as an intermediate processing For easier reproducibility, we provide a detailed descrip- step for learning binary cross-modality retrieval representa- tion of the forward pass of the proposed CCA layer in tions. Algorithm 1. To train the model, we need to propagate the gradient through the CCA layer (backward pass). We rely on Acknowledgements Open access funding provided by Johannes Kepler auto-differentiation tools (in particular, Theano) implement- University Linz. 
ing the gradient for each individual computation step in the Open Access This article is distributed under the terms of the Creative forward pass, and connecting them using the chain rule. Commons Attribution 4.0 International License (http://creativecomm The layer itself takes the latent feature representations (a ons.org/licenses/by/4.0/), which permits unrestricted use, distribution, d ×m d ×m x y and reproduction in any medium, provided you give appropriate credit batch of m paired vectors X ∈ R and Y ∈ R )ofthe to the original author(s) and the source, provide a link to the Creative two network pathways f and g as input and projects them Commons license, and indicate if changes were made. with CCA’s analytical projection matrices. At train time, the layer uses the optimal projections computed from the cur- rent batch. When applying the layer at test time, it uses the statistics and projections remembered from last train- Appendix ing batch (which can of course be recomputed on a larger training batch to get more stable estimate). Implementation details Backpropagating the errors through the CCA projection Note that this is not relevant for the DCCA model introduced in [2] matrices is not trivial. The optimal CCA projection matri- because it only derives the CCA projections after optimizing the TNO. −1/2 −1/2 ∗ ∗ ces are given by A = Σ U and B = Σ V, where xx yy The code of our implementation of the CCA layer is available at U and V are derived from the singular value decomposition https://github.com/CPJKU/cca_layer. 123 126 International Journal of Multimedia Information Retrieval (2018) 7:117–128 Algorithm 1 Forward Pass of CCA Projection Layer. 
1: Input of layer: $X \in \mathbb{R}^{d_x \times m}$ and $Y \in \mathbb{R}^{d_y \times m}$ ▷ hidden representation of current batch
2: Returns: $X^*$ and $Y^*$ ▷ CCA projected hidden representation
3: Parameters of layer: $\mu_x$, $\mu_y$ and $A^*$, $B^*$ ▷ means and CCA projection matrices
4: if train_time then ▷ update statistics and CCA projections during training
5:   $\mu_x \leftarrow \frac{1}{m} \sum_i X_i$ ▷ update $\mu_x$ and $\mu_y$ with means of batch
6:   $\mu_y \leftarrow \frac{1}{m} \sum_i Y_i$
7:   $X = X - \mu_x$ ▷ mean center data
8:   $Y = Y - \mu_y$
9:   $\Sigma_{xx} = \frac{1}{m-1} X X^\top + r I$ ▷ estimate covariances of batch
10:   $\Sigma_{yy} = \frac{1}{m-1} Y Y^\top + r I$
11:   $\Sigma_{xy} = \frac{1}{m-1} X Y^\top$
12:   $C_{xx}^{-1} = \mathrm{cholesky}(\Sigma_{xx})^{-1}$ ▷ compute inverses of Cholesky factorizations
13:   $C_{yy}^{-1} = \mathrm{cholesky}(\Sigma_{yy})^{-1}$
14:   $T = C_{xx}^{-1} \Sigma_{xy} (C_{yy}^{-1})^\top$ ▷ compute matrix T
15:   $e, U = \mathrm{eigen}(T T^\top)$ ▷ compute eigenvectors of $T T^\top$ and $T^\top T$
16:   $e, V = \mathrm{eigen}(T^\top T)$
17:   $A^* \leftarrow C_{xx}^{-1} U$ ▷ compute and update CCA projection matrices
18:   $B^* \leftarrow C_{yy}^{-1} V$
19:   $A^* \leftarrow A^* \cdot \mathrm{sgn}(\mathrm{diag}(A^{*\top} \Sigma_{xy} B^*))$ ▷ flip signs of projection matrices
20: else ▷ at test time use statistics estimated during training
21:   $X = X - \mu_x$ ▷ mean center test data
22:   $Y = Y - \mu_y$
23: end if
24: $X^* = X^\top A^*$ ▷ project latent representations with CCA projections
25: $Y^* = Y^\top B^*$
return $X^*$, $Y^*$

As not all of the computation steps are obvious, we provide further details for the crucial ones. In lines 12 and 13, we compute the Cholesky factorization instead of the matrix square root, as the latter has no gradients implemented in Theano. As a consequence, we need to transpose $C_{yy}^{-1}$ when computing T in line 14 [9]. In lines 15 and 16, we compute two eigendecompositions instead of one singular value decomposition (which also has no gradients implemented in Theano). In line 19, we flip the signs of the first projection matrix to match the second so that all correlations are positive. This property is required for retrieval with cosine distance. Finally, in lines 24 and 25, the two views get projected using $A^*$ and $B^*$. At test time, we apply the projections computed and stored during training (line 17).

Investigations on correlation structure

As an additional experiment we investigate the correlation structure of the learned representations for all three paradigms. For that purpose we compute the topmost hidden representation x and y of the audio-sheet-music pairs and estimate the canonical correlation coefficients d of the respective embedding spaces. For the present example, this yields 32 coefficients, which is the dimensionality k of our retrieval embedding space. Figure 6 compares the correlation coefficients, where 1.0 is the maximum value reachable. The most prominent observation in Fig. 6 is the high correlation coefficients of the representation learned with DCCA. This structure is expected as the TNO focuses solely on correlation maximization. However, when recalling the results of Table 4 we see that this does not necessarily lead to the best retrieval performance. The freely learned embedding Learned-$L_{rank}$ shows overall the lowest correlation but achieves comparable results to DCCA on this dataset. In terms of overall correlation, CCAL-$L_{rank}$ is situated in between the two other approaches. We have seen in all our experiments that combining both concepts in a unified retrieval paradigm yields the best retrieval performance over different application domains as well as data regimes. We take this as evidence that componentwise-correlated projections support cosine distance-based embedding space learning.

Fig. 6 Comparison of the 32 correlation coefficients d (the dimensionality of the retrieval space is 32) of the topmost hidden representations x and y of the audio-to-sheet-music dataset and the respective optimization paradigm. The maximum correlation possible is 1.0 for each coefficient

Architecture and optimization

In the following, we provide additional details for our experiments carried out in Sect. 5.

Image-text retrieval

We start training with an initial learning rate of either 0.001 (all models on IAPR TC-12 and Flickr30k Learned-$L_{rank}$) or 0.002 (Flickr30k DCCA and CCAL-$L_{rank}$). In addition, we apply 0.0001 L2 weight decay and set the batch size to 1000 for all models. The parameter α of the ranking loss in Eq. (8) is set to 0.5. After no improvement on the validation set for 50 epochs, we divide the learning rate by 10 and reduce the patience to 10. This learning rate reduction is repeated three times.

Table 7 Architecture of audio-sheet-music model

Sheet-image 40 × 100 | Spectrogram 136 × 100
2 × Conv(3, pad-1)-16 | 2 × Conv(3, pad-1)-16
BN-ELU + MP(2) | BN-ELU + MP(2)
2 × Conv(3, pad-1)-32 | 2 × Conv(3, pad-1)-32
BN-ELU + MP(2) | BN-ELU + MP(2)
2 × Conv(3, pad-1)-64 | 2 × Conv(3, pad-1)-64
BN-ELU + MP(2) | BN-ELU + MP(2)
2 × Conv(3, pad-1)-64 | 2 × Conv(3, pad-1)-64
BN-ELU + MP(2) | BN-ELU + MP(2)
Conv(1, pad-0)-32-BN | Conv(1, pad-0)-32-BN
Global average pooling | Global average pooling
Respective optimization target

BN batch normalization, ELU exponential linear unit, MP max pooling, Conv(3, pad-1)-16: 3 × 3 convolution, 16 feature maps and padding 1

Table 8 Architecture of zero-shot retrieval CNN

ImageNet feature 1024 | Text VS × 30 × 1
FC(1024)-BN-ELU | 1 × Conv(3, pad-same)-256
FC(1024)-BN-ELU | BN-ELU + MP(3, 1)
FC(64) | 2 × Conv(3, pad-valid)-256
 | FC(1024)-BN-ELU
 | FC(64)
Respective optimization target

VS vocabulary size, BN batch normalization, ELU exponential linear unit, MP max pooling, Conv(3, pad-1)-16: 3 × 3 convolution, 16 feature maps and padding 1

Table 9 Architecture of zero-shot retrieval CRNN
ImageNet feature 1024 | Text VS × 30 × 1
FC(1024)-BN-ELU | 1 × Conv(3, pad-same)-256
FC(1024)-BN-ELU | BN-ELU + MP(3, 1)
FC(64) | 2 × Conv(3, pad-valid)-256
 | GRU-RNN(512)
 | Temporal average pooling
 | FC(64)
Respective optimization target

Audio-sheet-music retrieval

Table 7 provides details on our audio-sheet-music retrieval architecture. As in the experiments on images and text, we optimize our networks using Adam with an initial learning rate of 0.001 and batch size 1000. The refinement strategy is the same, but no weight decay is applied and the margin parameter α of the ranking loss is set to 0.7.

Zero-shot retrieval

Tables 8 and 9 provide details on the architectures used for our zero-shot retrieval experiments carried out in Sect. 5.4. The general architectures follow Reed et al. [25] but are optimized with a pairwise ranking loss in combination with our proposed CCA layer. The dimensionality of the retrieval space is fixed to 64 and both models are again optimized with Adam and a batch size of 1000. The learning rate is set to 0.0007 for the CNN and 0.01 for the CRNN. The margin parameter α of the ranking loss is set to 0.2. In addition, we apply a weight decay of 0.0001 on all trainable parameters of the network for regularization.

The initial learning rate and parameter α are determined by grid search on the evaluation measure MRR on the validation set.

References

1. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M et al (2016) Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467
2. Andrew G, Arora R, Bilmes J, Livescu K (2013) Deep canonical correlation analysis. In: Proceedings of the international conference on machine learning, pp 1247–1255
3. Boulanger-Lewandowski N, Bengio Y, Vincent P (2012) Modeling temporal dependencies in high-dimensional sequences: application to polyphonic music generation and transcription. In: Proceedings of the 29th international conference on machine learning (ICML-12), pp 1159–1166
4. Chatfield K, Simonyan K, Vedaldi A, Zisserman A (2014) Return of the devil in the details: delving deep into convolutional nets. In: British machine vision conference
5. Chung J, Gülçehre Ç, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555
6. Clevert D, Unterthiner T, Hochreiter S (2015) Fast and accurate deep network learning by exponential linear units (elus). In: International conference on learning representations (ICLR). arXiv:1511.07289
7. Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: CVPR09
8. Dorfer M, Arzt A, Widmer G (2016) Towards score following in sheet music images. In: Proceedings of the international society for music information retrieval conference (ISMIR)
9. Hardoon DR, Szedmak S, Shawe-Taylor J (2004) Canonical correlation analysis: an overview with application to learning methods. Neural Comput 16(12):2639–2664
10. Hermann KM, Blunsom P (2013) Multilingual distributed representations without word alignment. arXiv preprint arXiv:1312.6173
11. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167
12. Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3128–
13. Karpathy A, Joulin A, Li FFF (2014) Deep fragment embeddings for bidirectional image sentence mapping. In: Advances in neural information processing systems, pp 1889–1897
14. Kingma D, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
15. Kiros R, Salakhutdinov R, Zemel RS (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539
16. Lin M, Chen Q, Yan S (2013) Network in network. CoRR, abs/1312.4400
17. Magnus JR (1985) On differentiating eigenvalues and eigenvectors. Econom Theory 1(2):179–191
18. Mao J, Xu W, Yang Y, Wang J, Yuille AL (2014) Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090
19. Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Probability and mathematical statistics. Academic Press, London
20. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
21. Nilsback M-E, Zisserman A (2008) Automated flower classification over a large number of classes. In: Proceedings of the Indian conference on computer vision, graphics and image processing
22. Papadopoulo T, Lourakis MIA (2000) Estimating the Jacobian of the singular value decomposition: theory and applications. In: Proceedings of the 6th European conference on computer vision (ECCV)
23. Pereira JC, Coviello E, Doyle G, Rasiwasia N, Lanckriet GRG, Levy R, Vasconcelos N (2014) On the role of correlation and abstraction in cross-modal multimedia retrieval. IEEE Trans Pattern Anal Mach Intell 36(3):521–535
24. Petersen KB, Pedersen MS (2012) The matrix cookbook, nov 2012. Version 20121115
25. Reed S, Akata Z, Schiele B, Lee H (2016) Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition
26. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
27. Socher R, Karpathy A, Le QV, Manning CD, Ng AY (2014) Grounded compositional semantics for finding and describing images with sentences. Trans Assoc Comput Linguist 2:207–218
28. Theano Development Team (2016) Theano: a Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016
29. Vendrov I, Kiros R, Fidler S, Urtasun R (2016) Order-embeddings of images and language. CoRR, abs/1511.06361
30. Welinder P, Branson S, Mita T, Wah C, Schroff F, Belongie S, Perona P (2010) Caltech-UCSD Birds 200. Technical report CNS-TR-2010-001, California Institute of Technology
31. Yan F, Mikolajczyk K (2015) Deep correlation for matching images and text. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3441–3450
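As a supplementary illustration of Algorithm 1, the training-time forward pass can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the released Theano implementation: the function name `cca_layer_forward`, the default regularization `r=1e-3`, and the use of `numpy.linalg.eigh` with the top k = min(d_x, d_y) eigenvectors are choices made here for illustration. The sketch also applies the transposed inverse Cholesky factor when forming the projections, which is what whitening with a lower-triangular factor requires; the transpose marks in the extracted algorithm listing are not recoverable, so this reading is an assumption as well.

```python
import numpy as np


def cca_layer_forward(X, Y, r=1e-3):
    """Training-time forward pass of a CCA projection layer (sketch).

    X has shape (d_x, m) and Y has shape (d_y, m): m paired column vectors.
    r is the covariance regularizer (the default is an assumed value).
    Returns the projected views, each of shape (m, k), and the statistics
    (mu_x, mu_y, A, B) that a layer would store for use at test time.
    """
    m = X.shape[1]

    # Batch means, then mean centering (Algorithm 1, lines 5 to 8).
    mu_x = X.mean(axis=1, keepdims=True)
    mu_y = Y.mean(axis=1, keepdims=True)
    Xc, Yc = X - mu_x, Y - mu_y

    # Regularized covariance estimates of the batch (lines 9 to 11).
    S_xx = Xc @ Xc.T / (m - 1) + r * np.eye(X.shape[0])
    S_yy = Yc @ Yc.T / (m - 1) + r * np.eye(Y.shape[0])
    S_xy = Xc @ Yc.T / (m - 1)

    # Inverses of the lower-triangular Cholesky factors (lines 12 and 13).
    C_xx_inv = np.linalg.inv(np.linalg.cholesky(S_xx))
    C_yy_inv = np.linalg.inv(np.linalg.cholesky(S_yy))

    # Cross-covariance of the whitened views (line 14).
    T = C_xx_inv @ S_xy @ C_yy_inv.T

    # Two symmetric eigendecompositions instead of one SVD (lines 15, 16).
    # eigh returns ascending eigenvalues, so reverse and keep the top k.
    k = min(T.shape)
    _, U = np.linalg.eigh(T @ T.T)
    _, V = np.linalg.eigh(T.T @ T)
    U, V = U[:, ::-1][:, :k], V[:, ::-1][:, :k]

    # Projection matrices (lines 17 and 18); transpose is an assumption.
    A = C_xx_inv.T @ U
    B = C_yy_inv.T @ V

    # Flip column signs of A so every correlation is nonnegative (line 19).
    signs = np.sign(np.diag(A.T @ S_xy @ B))
    signs[signs == 0] = 1.0
    A = A * signs

    # Project both centered views (lines 24 and 25).
    return Xc.T @ A, Yc.T @ B, (mu_x, mu_y, A, B)
```

With inputs of shape (d_x, m) and (d_y, m), the two returned views have shape (m, k). After the sign flip their componentwise correlations are nonnegative, which is the property the appendix names as a requirement for retrieval with cosine distance. At test time, a layer built this way would reuse the stored `mu_x`, `mu_y`, `A` and `B` instead of recomputing them.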

Journal

International Journal of Multimedia Information Retrieval (2018) 7:117–128, Springer Journals

Published: Mar 7, 2018
