

Independence test and canonical correlation analysis based on the alignment between kernel matrices for multivariate functional data

Abstract  In the case of vector data, Gretton et al. (Algorithmic learning theory. Springer, Berlin, pp 63–77, 2005) defined the Hilbert–Schmidt independence criterion, and Cortes et al. (J Mach Learn Res 13:795–828, 2012) subsequently introduced the concept of centered kernel target alignment (KTA). In this paper we generalize these measures of dependence to the case of multivariate functional data. In addition, based on these measures between two kernel matrices (we use the Gaussian kernel), we construct an independence test and nonlinear canonical variables for multivariate functional data. We show that it is enough to work only on the coefficients of a series expansion of the underlying processes. In order to provide a comprehensive comparison, we conducted a set of experiments, testing effectiveness on two real examples and artificial data. Our experiments show that using functional variants of the proposed measures, we obtain much better results in recognizing nonlinear dependence.

Keywords  Multivariate functional data · Functional data analysis · Correlation analysis · Canonical correlation analysis

Tomasz Górecki (tomasz.gorecki@amu.edu.pl), Mirosław Krzyśko (mkrzysko@amu.edu.pl), Waldemar Wołyński (wolynski@amu.edu.pl)
Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Umultowska 87, 61-614 Poznań, Poland
Faculty of Management, President Stanisław Wojciechowski Higher Vocational State School, Nowy Świat 4, 62-800 Kalisz, Poland

1 Introduction

The theory and practice of statistical methods in situations where the available data are functions (instead of real numbers or vectors) is often referred to as Functional Data Analysis (FDA). The term Functional Data Analysis was already used by Ramsay and Dalzell (1991) two decades ago. The subject has become increasingly popular since the end of the 1990s and is now a major research field in statistics (Cuevas 2014). Good access to the large literature in this field is provided by the books by Ramsay and Silverman (2002, 2005), Ferraty and Vieu (2006), and Horváth and Kokoszka (2012). Special issues devoted to FDA topics have been published by different journals, including Statistica Sinica 14(3) (2004), Computational Statistics 22(3) (2007), Computational Statistics and Data Analysis 51(10) (2007), Journal of Multivariate Analysis 101(2) (2010), and Advances in Data Analysis and Classification 8(3) (2014). The range of real-world applications, where the objects can be thought of as functions, is as diverse as speech recognition, spectrometry, meteorology, medicine or client segmentation, to cite just a few (Ferraty and Vieu 2003; James et al. 2009; Martin-Baragan et al. 2014; Devijver 2017).

The uncentered kernel alignment was originally introduced by Cristianini et al. (2001). Gretton et al. (2005) defined the Hilbert–Schmidt Independence Criterion (HSIC) and the empirical HSIC. Centered kernel target alignment (KTA) was introduced by Cortes et al. (2012); this measure is a normalized version of HSIC. Zhang et al. (2011) gave an interesting kernel-based independence test, closely related to the one based on the Hilbert–Schmidt independence criterion proposed by Gretton et al. (2008). Gretton et al. (2005) described a permutation-based kernel independence test. There is a lot of work in the literature on kernel alignment and its applications (a good overview can be found in Wang et al. 2015).
This work is devoted to a generalization of these measures of dependence to the case of multivariate functional data. In addition, based on these measures, we construct an independence test and nonlinear canonical correlation variables for multivariate functional data. These results are based on the assumption that the applied kernel function is Gaussian. Functional HSIC and KTA canonical correlation analysis can be viewed as natural nonlinear extensions of functional canonical correlation analysis (FCCA). Thus, we propose two nonlinear functional CCA extensions that capture nonlinear relationships; moreover, both algorithms are also capable of extracting linear dependency. Additionally, we show that the functional KTA approach is only a normalized variant of the HSIC coefficient also for functional data. Finally, we propose an interpretation of the module weighting functions for functional canonical correlations.

Section 2 provides an overview of centered alignment measures for random vectors. They are defined through such concepts as kernel function alignment, kernel matrix alignment and the Hilbert–Schmidt Independence Criterion (HSIC), the associations between them are shown, and a kernel-based independence test is presented. Functional data can be seen as values of random processes. In our paper, the multivariate random functions $\mathbf{X}$ and $\mathbf{Y}$ have the special representation (8) in finite-dimensional subspaces of the spaces of square integrable functions on the given intervals. Section 3 discusses the concept of alignment for multivariate functional data: the kernel function, the alignment between two kernel functions, the centered kernel alignment (KTA) between two kernel matrices and the empirical Hilbert–Schmidt Independence Criterion (HSIC) are defined, and the HSIC is used as the basis for an independence test for multivariate functional data. In the same section, based on the concept of alignment between kernel matrices, nonlinear canonical variables are constructed; this is a generalization of the results of Chang et al. (2013) for random vectors. In Section 4 we present one artificial and two real examples which confirm the usefulness of the proposed coefficients in detecting nonlinear dependency between groups of variables.

2 An overview of kernel alignment and its applications

We introduce the following notational convention. Throughout this section, $\mathbf{X}$ and $\mathbf{Y}$ are random vectors with domains $\mathbb{R}^p$ and $\mathbb{R}^q$, respectively. Let $P_{\mathbf{X},\mathbf{Y}}$ be a joint probability measure on $(\mathbb{R}^p \times \mathbb{R}^q, \mathcal{B}^p \times \mathcal{B}^q)$ (here $\mathcal{B}^p$ and $\mathcal{B}^q$ are the Borel $\sigma$-algebras on $\mathbb{R}^p$ and $\mathbb{R}^q$, respectively), with associated marginal probability measures $P_{\mathbf{X}}$ and $P_{\mathbf{Y}}$.

Definition 1 (Kernel function, Shawe-Taylor and Cristianini 2004) A kernel is a function $k$ that for all $\mathbf{x}, \mathbf{x}' \in \mathbb{R}^p$ satisfies $k(\mathbf{x}, \mathbf{x}') = \langle \boldsymbol{\varphi}(\mathbf{x}), \boldsymbol{\varphi}(\mathbf{x}') \rangle$, where $\boldsymbol{\varphi}$ is a mapping from $\mathbb{R}^p$ to an inner product feature space $\mathcal{H}$, $\boldsymbol{\varphi} : \mathbf{x} \mapsto \boldsymbol{\varphi}(\mathbf{x}) \in \mathcal{H}$. We call $\boldsymbol{\varphi}$ a feature map. A kernel function can be interpreted as a kind of similarity measure between the vectors $\mathbf{x}$ and $\mathbf{x}'$.

Definition 2 (Gram matrix, Mercer 1909; Riesz 1909; Aronszajn 1950) Given a kernel $k$ and inputs $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^p$, the $n \times n$ matrix $\mathbf{K}$ with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ is called the Gram matrix (kernel matrix) of $k$ with respect to $\mathbf{x}_1, \ldots, \mathbf{x}_n$.
Definition 3 (Positive semi-definite matrix, Hofmann et al. 2008) A real $n \times n$ symmetric matrix $\mathbf{K}$ with entries $K_{ij}$ satisfying
$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j K_{ij} \geq 0$$
for all $c_1, \ldots, c_n \in \mathbb{R}$ is called positive semi-definite.

Definition 4 (Positive semi-definite kernel, Mercer 1909; Hofmann et al. 2008) A function $k : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ which for all $n \in \mathbb{N}$ and all $\mathbf{x}_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, gives rise to a positive semi-definite Gram matrix is called a positive semi-definite kernel.

This raises an interesting question: given a function of two variables $k(\mathbf{x}, \mathbf{x}')$, does there exist a function $\boldsymbol{\varphi}(\mathbf{x})$ such that $k(\mathbf{x}, \mathbf{x}') = \langle \boldsymbol{\varphi}(\mathbf{x}), \boldsymbol{\varphi}(\mathbf{x}') \rangle$? The answer is provided by Mercer's theorem (1909), which says, roughly, that if $k$ is positive semi-definite then such a $\boldsymbol{\varphi}$ exists. Often we will not know $\boldsymbol{\varphi}$, but instead a kernel function $k : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ that encodes the inner product in $\mathcal{H}$.

Popular positive semi-definite kernel functions on $\mathbb{R}^p$ include the polynomial kernel of degree $d > 0$, $k(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^{\top}\mathbf{x}')^d$, the Gaussian kernel $k(\mathbf{x}, \mathbf{x}') = \exp(-\lambda\|\mathbf{x} - \mathbf{x}'\|^2)$, $\lambda > 0$, and the Laplace kernel $k(\mathbf{x}, \mathbf{x}') = \exp(-\lambda\|\mathbf{x} - \mathbf{x}'\|)$, $\lambda > 0$. In this paper we use the Gaussian kernel. We start with the definition of centering and the analysis of its relevant properties.

2.1 Centered kernel functions

A feature mapping $\boldsymbol{\varphi} : \mathbb{R}^p \to \mathcal{H}$ is centered by subtracting from it its expectation, that is, transforming $\boldsymbol{\varphi}(\mathbf{x})$ to $\tilde{\boldsymbol{\varphi}}(\mathbf{x}) = \boldsymbol{\varphi}(\mathbf{x}) - \mathrm{E}_{\mathbf{X}}[\boldsymbol{\varphi}(\mathbf{X})]$, where $\mathrm{E}_{\mathbf{X}}$ denotes the expected value of $\boldsymbol{\varphi}(\mathbf{X})$ when $\mathbf{X}$ is distributed according to $P_{\mathbf{X}}$. Centering a positive semi-definite kernel function $k : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ consists in centering the feature mapping $\boldsymbol{\varphi}$ associated to $k$. Thus, the centered kernel $\tilde{k}$ associated to $k$ is defined by
$$\tilde{k}(\mathbf{x}, \mathbf{x}') = \langle \boldsymbol{\varphi}(\mathbf{x}) - \mathrm{E}_{\mathbf{X}}[\boldsymbol{\varphi}(\mathbf{X})],\, \boldsymbol{\varphi}(\mathbf{x}') - \mathrm{E}_{\mathbf{X}'}[\boldsymbol{\varphi}(\mathbf{X}')] \rangle = k(\mathbf{x}, \mathbf{x}') - \mathrm{E}_{\mathbf{X}}[k(\mathbf{X}, \mathbf{x}')] - \mathrm{E}_{\mathbf{X}'}[k(\mathbf{x}, \mathbf{X}')] + \mathrm{E}_{\mathbf{X},\mathbf{X}'}[k(\mathbf{X}, \mathbf{X}')],$$
assuming the expectations exist. Here the expectation is taken over independent copies $\mathbf{X}, \mathbf{X}'$ distributed according to $P_{\mathbf{X}}$. We see that $\tilde{k}$ is also a positive semi-definite kernel. Note also that for a centered kernel $\tilde{k}$, $\mathrm{E}_{\mathbf{X},\mathbf{X}'}[\tilde{k}(\mathbf{X}, \mathbf{X}')] = 0$; that is, centering the feature mapping implies centering the kernel function.

2.2 Centered kernel matrices

Let $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ be a finite subset of $\mathbb{R}^p$. A feature mapping $\boldsymbol{\varphi}(\mathbf{x}_i)$, $i = 1, \ldots, n$, is centered by subtracting from it its empirical expectation, i.e., leading to $\tilde{\boldsymbol{\varphi}}(\mathbf{x}_i) = \boldsymbol{\varphi}(\mathbf{x}_i) - \bar{\boldsymbol{\varphi}}$, where $\bar{\boldsymbol{\varphi}} = \frac{1}{n}\sum_{i=1}^{n} \boldsymbol{\varphi}(\mathbf{x}_i)$. The kernel matrix $\mathbf{K} = (K_{ij})$ associated to the kernel function $k$ and the set $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ is centered by replacing it with $\tilde{\mathbf{K}} = (\tilde{K}_{ij})$ defined for all $i, j = 1, 2, \ldots, n$ by
$$\tilde{K}_{ij} = K_{ij} - \frac{1}{n}\sum_{i=1}^{n} K_{ij} - \frac{1}{n}\sum_{j=1}^{n} K_{ij} + \frac{1}{n^2}\sum_{i,j=1}^{n} K_{ij}, \qquad (1)$$
where $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, $i, j = 1, \ldots, n$. The centered kernel matrix $\tilde{\mathbf{K}}$ is a positive semi-definite matrix. Also, as with the centered kernel function, $\frac{1}{n^2}\sum_{i,j=1}^{n}\tilde{K}_{ij} = 0$.

Let $\langle \cdot, \cdot \rangle_F$ denote the Frobenius product and $\|\cdot\|_F$ the Frobenius norm defined for all $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{n \times n}$ by
$$\langle \mathbf{A}, \mathbf{B} \rangle_F = \mathrm{tr}(\mathbf{A}^{\top}\mathbf{B}), \qquad \|\mathbf{A}\|_F = \langle \mathbf{A}, \mathbf{A} \rangle_F^{1/2}.$$
Then, for any kernel matrix $\mathbf{K} \in \mathbb{R}^{n \times n}$, the centered kernel matrix $\tilde{\mathbf{K}}$ can be expressed as follows (Schölkopf et al. 1998):
$$\tilde{\mathbf{K}} = \mathbf{H}\mathbf{K}\mathbf{H}, \qquad (2)$$
where $\mathbf{H} = \mathbf{I} - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$, $\mathbf{1} \in \mathbb{R}^{n \times 1}$ denotes the vector with all entries equal to one, and $\mathbf{I}$ is the identity matrix of order $n$. The matrix $\mathbf{H}$ is called the "centering matrix". Since $\mathbf{H}$ is idempotent ($\mathbf{H}^2 = \mathbf{H}$), for any two kernel matrices $\mathbf{K}$ and $\mathbf{L}$, based on the subset $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ of $\mathbb{R}^p$ and the subset $\{\mathbf{y}_1, \ldots, \mathbf{y}_n\}$ of $\mathbb{R}^q$, respectively, we get
$$\langle \tilde{\mathbf{K}}, \mathbf{L} \rangle_F = \langle \mathbf{K}, \tilde{\mathbf{L}} \rangle_F = \langle \tilde{\mathbf{K}}, \tilde{\mathbf{L}} \rangle_F. \qquad (3)$$
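The centering step (1)–(2) is simple to carry out in practice. The following minimal R sketch (ours, not code from the paper) builds a Gaussian kernel matrix and centers it with the centering matrix $\mathbf{H}$; the function names and the bandwidth value are illustrative assumptions.

```r
# Illustrative R sketch (not from the paper): Gaussian kernel matrix and its
# centering as in Eqs. (1)-(2). The bandwidth lambda = 0.5 is arbitrary.
gaussian_kernel_matrix <- function(X, lambda) {
  # X: n x p data matrix; returns K with K_ij = exp(-lambda * ||x_i - x_j||^2)
  D2 <- as.matrix(dist(X))^2
  exp(-lambda * D2)
}

center_kernel_matrix <- function(K) {
  # K_tilde = H K H with H = I - (1/n) 1 1'
  n <- nrow(K)
  H <- diag(n) - matrix(1 / n, n, n)
  H %*% K %*% H
}

set.seed(1)
X  <- matrix(rnorm(20 * 3), 20, 3)     # toy data: n = 20, p = 3
K  <- gaussian_kernel_matrix(X, lambda = 0.5)
Kc <- center_kernel_matrix(K)
mean(Kc)                               # numerically ~0, as noted after Eq. (1)
```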
2.3 Centered kernel alignment

Definition 5 (Kernel function alignment, Cristianini et al. 2001; Cortes et al. 2012) Let $k$ and $l$ be two kernel functions defined over $\mathbb{R}^p \times \mathbb{R}^p$ and $\mathbb{R}^q \times \mathbb{R}^q$, respectively, such that $0 < \mathrm{E}_{\mathbf{X},\mathbf{X}'}[\tilde{k}^2(\mathbf{X}, \mathbf{X}')] < \infty$ and $0 < \mathrm{E}_{\mathbf{Y},\mathbf{Y}'}[\tilde{l}^2(\mathbf{Y}, \mathbf{Y}')] < \infty$, where $\mathbf{X}, \mathbf{X}'$ and $\mathbf{Y}, \mathbf{Y}'$ are independent copies distributed according to $P_{\mathbf{X}}$ and $P_{\mathbf{Y}}$, respectively. Then the alignment between $k$ and $l$ is defined by
$$\rho(k, l) = \frac{\mathrm{E}_{\mathbf{X},\mathbf{X}',\mathbf{Y},\mathbf{Y}'}[\tilde{k}(\mathbf{X}, \mathbf{X}')\,\tilde{l}(\mathbf{Y}, \mathbf{Y}')]}{\sqrt{\mathrm{E}_{\mathbf{X},\mathbf{X}'}[\tilde{k}^2(\mathbf{X}, \mathbf{X}')]\; \mathrm{E}_{\mathbf{Y},\mathbf{Y}'}[\tilde{l}^2(\mathbf{Y}, \mathbf{Y}')]}}.$$
We can define similarly the alignment between two kernel matrices $\mathbf{K}$ and $\mathbf{L}$ based on the finite subsets $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ and $\{\mathbf{y}_1, \ldots, \mathbf{y}_n\}$, respectively.

Definition 6 (Kernel matrix alignment, Cortes et al. 2012) Let $\mathbf{K} \in \mathbb{R}^{n \times n}$ and $\mathbf{L} \in \mathbb{R}^{n \times n}$ be two kernel matrices such that $\|\tilde{\mathbf{K}}\|_F \neq 0$ and $\|\tilde{\mathbf{L}}\|_F \neq 0$. Then, the centered kernel target alignment (KTA) between $\mathbf{K}$ and $\mathbf{L}$ is defined by
$$\hat{\rho}(\mathbf{K}, \mathbf{L}) = \frac{\langle \tilde{\mathbf{K}}, \tilde{\mathbf{L}} \rangle_F}{\|\tilde{\mathbf{K}}\|_F \,\|\tilde{\mathbf{L}}\|_F}. \qquad (4)$$
Here, by the Cauchy–Schwarz inequality, $\hat{\rho}(\mathbf{K}, \mathbf{L}) \in [-1, 1]$, and in fact $\hat{\rho}(\mathbf{K}, \mathbf{L}) \in [0, 1]$ when $\tilde{\mathbf{K}}$ and $\tilde{\mathbf{L}}$ are the kernel matrices of the positive semi-definite kernels $\tilde{k}$ and $\tilde{l}$.

Gretton et al. (2005) defined the Hilbert–Schmidt Independence Criterion (HSIC) as a test statistic to distinguish between the null hypothesis $H_0 : P_{\mathbf{X},\mathbf{Y}} = P_{\mathbf{X}} P_{\mathbf{Y}}$ (equivalently we may write $\mathbf{X} \perp\!\!\!\perp \mathbf{Y}$) and the alternative hypothesis $H_1 : P_{\mathbf{X},\mathbf{Y}} \neq P_{\mathbf{X}} P_{\mathbf{Y}}$.

Definition 7 (Reproducing kernel Hilbert space, Riesz 1909; Mercer 1909; Aronszajn 1950) Consider a Hilbert space $\mathcal{H}$ of functions from $\mathbb{R}^p$ to $\mathbb{R}$. Then $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) if for each $\mathbf{x} \in \mathbb{R}^p$ the Dirac evaluation operator $\delta_{\mathbf{x}} : \mathcal{H} \to \mathbb{R}$, which maps $f \in \mathcal{H}$ to $f(\mathbf{x}) \in \mathbb{R}$, is a bounded linear functional.

Let $\boldsymbol{\varphi} : \mathbb{R}^p \to \mathcal{H}$ be a map such that for all $\mathbf{x}, \mathbf{x}' \in \mathbb{R}^p$ we have $\langle \boldsymbol{\varphi}(\mathbf{x}), \boldsymbol{\varphi}(\mathbf{x}') \rangle_{\mathcal{H}} = k(\mathbf{x}, \mathbf{x}')$, where $k : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ is a unique positive semi-definite kernel. We will require in particular that $\mathcal{H}$ be separable (it must have a complete, countable orthonormal system). We likewise define a second separable RKHS $\mathcal{G}$, with kernel $l(\cdot, \cdot)$ and feature map $\boldsymbol{\psi}$, on the separable space $\mathbb{R}^q$.

We may now define the mean elements $\mu_{\mathbf{X}}$ and $\mu_{\mathbf{Y}}$ with respect to the measures $P_{\mathbf{X}}$ and $P_{\mathbf{Y}}$ as those members of $\mathcal{H}$ and $\mathcal{G}$, respectively, for which
$$\langle \mu_{\mathbf{X}}, f \rangle_{\mathcal{H}} = \mathrm{E}_{\mathbf{X}}[\langle \boldsymbol{\varphi}(\mathbf{X}), f \rangle_{\mathcal{H}}] = \mathrm{E}_{\mathbf{X}}[f(\mathbf{X})], \qquad \langle \mu_{\mathbf{Y}}, g \rangle_{\mathcal{G}} = \mathrm{E}_{\mathbf{Y}}[\langle \boldsymbol{\psi}(\mathbf{Y}), g \rangle_{\mathcal{G}}] = \mathrm{E}_{\mathbf{Y}}[g(\mathbf{Y})],$$
for all functions $f \in \mathcal{H}$, $g \in \mathcal{G}$, where $\boldsymbol{\varphi}$ is the feature map from $\mathbb{R}^p$ to the RKHS $\mathcal{H}$, $\boldsymbol{\psi}$ maps from $\mathbb{R}^q$ to $\mathcal{G}$, and the expectations are assumed to exist. Finally, $\|\mu_{\mathbf{X}}\|^2_{\mathcal{H}}$ can be computed by applying the expectation twice via
$$\|\mu_{\mathbf{X}}\|^2_{\mathcal{H}} = \mathrm{E}_{\mathbf{X},\mathbf{X}'}[\langle \boldsymbol{\varphi}(\mathbf{X}), \boldsymbol{\varphi}(\mathbf{X}') \rangle_{\mathcal{H}}] = \mathrm{E}_{\mathbf{X},\mathbf{X}'}[k(\mathbf{X}, \mathbf{X}')],$$
assuming the expectations exist; the expectation is taken over independent copies $\mathbf{X}, \mathbf{X}'$ distributed according to $P_{\mathbf{X}}$. The means $\mu_{\mathbf{X}}$, $\mu_{\mathbf{Y}}$ exist when the positive semi-definite kernels $k$ and $l$ are bounded.
We are now in a position to define the cross-covariance operator.

Definition 8 (Cross-covariance operator, Gretton et al. 2005) The cross-covariance operator associated with the joint probability measure $P_{\mathbf{X},\mathbf{Y}}$ on $(\mathbb{R}^p \times \mathbb{R}^q, \mathcal{B}^p \times \mathcal{B}^q)$ is the linear operator $C_{\mathbf{X},\mathbf{Y}} : \mathcal{G} \to \mathcal{H}$ defined as
$$C_{\mathbf{X},\mathbf{Y}} = \mathrm{E}_{\mathbf{X},\mathbf{Y}}[\boldsymbol{\varphi}(\mathbf{X}) \otimes \boldsymbol{\psi}(\mathbf{Y})] - \mu_{\mathbf{X}} \otimes \mu_{\mathbf{Y}},$$
where the tensor product operator $f \otimes g : \mathcal{G} \to \mathcal{H}$, $f \in \mathcal{H}$, $g \in \mathcal{G}$, is defined as $(f \otimes g)h = f \langle g, h \rangle_{\mathcal{G}}$ for all $h \in \mathcal{G}$. This is a generalization of the cross-covariance matrix between random vectors. Moreover, by the definition of the Hilbert–Schmidt (HS) norm, we can compute the HS norm of $f \otimes g$ via $\|f \otimes g\|^2_{HS} = \|f\|^2_{\mathcal{H}}\, \|g\|^2_{\mathcal{G}}$.

Definition 9 (Hilbert–Schmidt Independence Criterion, Gretton et al. 2005) The Hilbert–Schmidt Independence Criterion (HSIC) is the squared Hilbert–Schmidt norm (or Frobenius norm) of the cross-covariance operator associated with the probability measure $P_{\mathbf{X},\mathbf{Y}}$ on $(\mathbb{R}^p \times \mathbb{R}^q, \mathcal{B}^p \times \mathcal{B}^q)$:
$$\mathrm{HSIC}(P_{\mathbf{X},\mathbf{Y}}) = \|C_{\mathbf{X},\mathbf{Y}}\|^2_{HS}.$$
To compute it we need to express HSIC in terms of kernel functions (Gretton et al. 2005):
$$\mathrm{HSIC}(P_{\mathbf{X},\mathbf{Y}}) = \mathrm{E}_{\mathbf{X},\mathbf{X}',\mathbf{Y},\mathbf{Y}'}[k(\mathbf{X}, \mathbf{X}')\,l(\mathbf{Y}, \mathbf{Y}')] + \mathrm{E}_{\mathbf{X},\mathbf{X}'}[k(\mathbf{X}, \mathbf{X}')]\; \mathrm{E}_{\mathbf{Y},\mathbf{Y}'}[l(\mathbf{Y}, \mathbf{Y}')] - 2\,\mathrm{E}_{\mathbf{X},\mathbf{Y}}\big[\mathrm{E}_{\mathbf{X}'}[k(\mathbf{X}, \mathbf{X}')]\; \mathrm{E}_{\mathbf{Y}'}[l(\mathbf{Y}, \mathbf{Y}')]\big]. \qquad (5)$$
Here $\mathrm{E}_{\mathbf{X},\mathbf{X}',\mathbf{Y},\mathbf{Y}'}$ denotes the expectation over independent pairs $(\mathbf{X}, \mathbf{Y})$ and $(\mathbf{X}', \mathbf{Y}')$ distributed according to $P_{\mathbf{X},\mathbf{Y}}$. It follows from (5) that the Frobenius norm of $C_{\mathbf{X},\mathbf{Y}}$ exists when the various expectations over the kernels are bounded, which is true as long as the kernels $k$ and $l$ are bounded.

Definition 10 (Empirical HSIC, Gretton et al. 2005) Let $S = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\} \subseteq \mathbb{R}^p \times \mathbb{R}^q$ be a series of $n$ independent observations drawn from $P_{\mathbf{X},\mathbf{Y}}$. An estimator of HSIC, written $\mathrm{HSIC}(S)$, is given by
$$\mathrm{HSIC}(S) = \langle \tilde{\mathbf{K}}, \tilde{\mathbf{L}} \rangle_F, \qquad (6)$$
where $\mathbf{K} = (k(\mathbf{x}_i, \mathbf{x}_j))$, $\mathbf{L} = (l(\mathbf{y}_i, \mathbf{y}_j)) \in \mathbb{R}^{n \times n}$.

Comparing (4) and (6) and using (3), we see that the centered kernel target alignment (KTA) is simply a normalized version of $\mathrm{HSIC}(S)$.
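As an illustration of Definitions 6 and 10, the following hedged R sketch (ours, not the authors' implementation) computes the empirical HSIC and the KTA coefficient for two samples using Gaussian kernels; the helper names and bandwidth values are assumptions.

```r
# Minimal R sketch (ours): empirical HSIC (6) and centered kernel target
# alignment (4) for two samples, Gaussian kernels, assumed bandwidths.
gauss_K  <- function(X, lambda) exp(-lambda * as.matrix(dist(X))^2)
center_K <- function(K) {
  n <- nrow(K)
  H <- diag(n) - matrix(1 / n, n, n)
  H %*% K %*% H
}

hsic_kta <- function(X, Y, lambda_x = 1, lambda_y = 1) {
  Kc <- center_K(gauss_K(X, lambda_x))
  Lc <- center_K(gauss_K(Y, lambda_y))
  hsic <- sum(Kc * Lc)                            # Frobenius product <K~, L~>_F
  kta  <- hsic / (norm(Kc, "F") * norm(Lc, "F"))  # Eq. (4)
  c(HSIC = hsic, KTA = kta)
}

set.seed(2)
X <- matrix(rnorm(50 * 2), 50, 2)
Y <- X^2 + matrix(rnorm(50 * 2, sd = 0.1), 50, 2)  # nonlinear dependence
hsic_kta(X, Y)
```

Because KTA only rescales the Frobenius product by the norms of the two centered matrices, it carries the same dependence information as the empirical HSIC; this is the sense in which KTA is a normalized HSIC.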
In two seminal papers, Székely et al. (2007) and Székely and Rizzo (2009) introduced the distance covariance (dCov) and distance correlation (dCor) as powerful measures of dependence. For column vectors $\mathbf{s} \in \mathbb{R}^p$ and $\mathbf{t} \in \mathbb{R}^q$, denote by $\|\mathbf{s}\|_p$ and $\|\mathbf{t}\|_q$ the standard Euclidean norms on the corresponding spaces. For jointly distributed random vectors $\mathbf{X} \in \mathbb{R}^p$ and $\mathbf{Y} \in \mathbb{R}^q$, let
$$f_{\mathbf{X},\mathbf{Y}}(\mathbf{s}, \mathbf{t}) = \mathrm{E}\{\exp[\,i\langle \mathbf{s}, \mathbf{X} \rangle_p + i\langle \mathbf{t}, \mathbf{Y} \rangle_q\,]\}$$
be the joint characteristic function of $(\mathbf{X}, \mathbf{Y})$, and let $f_{\mathbf{X}}(\mathbf{s}) = f_{\mathbf{X},\mathbf{Y}}(\mathbf{s}, \mathbf{0})$ and $f_{\mathbf{Y}}(\mathbf{t}) = f_{\mathbf{X},\mathbf{Y}}(\mathbf{0}, \mathbf{t})$ be the marginal characteristic functions of $\mathbf{X}$ and $\mathbf{Y}$, where $\mathbf{s} \in \mathbb{R}^p$ and $\mathbf{t} \in \mathbb{R}^q$. The distance covariance between $\mathbf{X}$ and $\mathbf{Y}$ is the nonnegative number $\nu(\mathbf{X}, \mathbf{Y})$ defined by
$$\nu^2(\mathbf{X}, \mathbf{Y}) = \frac{1}{C_p C_q} \int_{\mathbb{R}^{p+q}} \frac{|f_{\mathbf{X},\mathbf{Y}}(\mathbf{s}, \mathbf{t}) - f_{\mathbf{X}}(\mathbf{s})\, f_{\mathbf{Y}}(\mathbf{t})|^2}{\|\mathbf{s}\|_p^{p+1}\, \|\mathbf{t}\|_q^{q+1}}\, d\mathbf{s}\, d\mathbf{t},$$
where $|z|$ denotes the modulus of $z \in \mathbb{C}$ and
$$C_p = \frac{\pi^{(p+1)/2}}{\Gamma\big(\tfrac{1}{2}(p+1)\big)}.$$
The distance correlation between $\mathbf{X}$ and $\mathbf{Y}$ is the nonnegative number defined by
$$R(\mathbf{X}, \mathbf{Y}) = \frac{\nu(\mathbf{X}, \mathbf{Y})}{\sqrt{\nu(\mathbf{X}, \mathbf{X})\, \nu(\mathbf{Y}, \mathbf{Y})}}$$
if both $\nu(\mathbf{X}, \mathbf{X})$ and $\nu(\mathbf{Y}, \mathbf{Y})$ are strictly positive, and defined to be zero otherwise. For distributions with finite first moments, the distance correlation characterizes independence in that $0 \leq R(\mathbf{X}, \mathbf{Y}) \leq 1$ with $R(\mathbf{X}, \mathbf{Y}) = 0$ if and only if $\mathbf{X}$ and $\mathbf{Y}$ are independent. Sejdinovic et al. (2013) demonstrated that distance covariance is an instance of the Hilbert–Schmidt Independence Criterion. Górecki et al. (2016, 2017) showed an extension of the distance covariance and distance correlation coefficients to the functional case.
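For comparison, the sample distance covariance and correlation of Székely et al. (2007) can be computed from double-centered Euclidean distance matrices. The short R sketch below (ours) follows that construction; it is only an illustration of the quantity discussed here, not the functional extension of Górecki et al. (2016, 2017).

```r
# Illustrative R sketch (ours): sample distance correlation via
# double-centered distance matrices (Szekely et al. 2007).
dcor <- function(X, Y) {
  double_center <- function(D) {
    D - outer(rowMeans(D), rep(1, ncol(D))) -
      outer(rep(1, nrow(D)), colMeans(D)) + mean(D)
  }
  A <- double_center(as.matrix(dist(X)))
  B <- double_center(as.matrix(dist(Y)))
  dcov2  <- mean(A * B)                  # squared sample distance covariance
  dvar_x <- mean(A * A)
  dvar_y <- mean(B * B)
  if (dvar_x * dvar_y > 0) sqrt(dcov2 / sqrt(dvar_x * dvar_y)) else 0
}

set.seed(3)
x <- rnorm(100)
y <- x^2 + rnorm(100, sd = 0.1)          # nonlinear, nearly uncorrelated linearly
c(pearson = cor(x, y), dcor = dcor(cbind(x), cbind(y)))
```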
2.4 Kernel-based independence test

Statistical tests of independence have been associated with a broad variety of dependence measures. Classical tests such as Spearman's $\rho$ and Kendall's $\tau$ are widely applied; however, they are not guaranteed to detect all modes of dependence between the random variables. Contingency table-based methods, and in particular the power-divergence family of test statistics (Read and Cressie 1988), are the best known general purpose tests of independence, but they are limited to relatively low dimensions, since they require a partitioning of the space in which the random variable resides. Characteristic function-based tests (Feuerverger 1993; Kankainen 1995) have also been proposed. They are more general than kernel-based tests, although to our knowledge they have been used only to compare univariate random variables.

Now we describe how HSIC can be used as an independence measure and as the basis for an independence test. We begin by demonstrating that the Hilbert–Schmidt norm can be used as a measure of independence, as long as the associated RKHSs are universal. A continuous kernel $k$ on a compact metric space is called universal if the corresponding RKHS $\mathcal{H}$ is dense in the class of continuous functions on the space. Denote by $\mathcal{H}$, $\mathcal{G}$ RKHSs with universal kernels $k$, $l$ on the compact domains $\mathcal{X}$ and $\mathcal{Y}$, respectively. We assume without loss of generality that $\|f\|_{\infty} \leq 1$ and $\|g\|_{\infty} \leq 1$ for all $f \in \mathcal{H}$ and $g \in \mathcal{G}$. Then Gretton et al. (2005) proved that $\|C_{\mathbf{X},\mathbf{Y}}\|_{HS} = 0$ if and only if $\mathbf{X}$ and $\mathbf{Y}$ are independent. Examples of universal kernels are the Gaussian kernel and the Laplacian kernel, while the linear kernel $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^{\top}\mathbf{x}'$ is not universal: the corresponding HSIC tests only linear relationships, and a zero cross-covariance matrix characterizes independence only for multivariate Gaussian distributions. Working with the infinite-dimensional operator with universal kernels allows us to identify any general nonlinear dependence (in the limit) between any pair of vectors, not just Gaussians. We recall that in this paper we use the Gaussian kernel.

We now consider the asymptotic distribution of the statistic (6). We introduce the null hypothesis $H_0 : \mathbf{X} \perp\!\!\!\perp \mathbf{Y}$ ($\mathbf{X}$ is independent of $\mathbf{Y}$, i.e., $P_{\mathbf{X},\mathbf{Y}} = P_{\mathbf{X}} P_{\mathbf{Y}}$). Suppose that we are given the i.i.d. samples $S_{\mathbf{x}} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ and $S_{\mathbf{y}} = \{\mathbf{y}_1, \ldots, \mathbf{y}_n\}$ for $\mathbf{X}$ and $\mathbf{Y}$, respectively. Let $\tilde{\mathbf{K}}$ and $\tilde{\mathbf{L}}$ be the centered kernel matrices associated to the kernel functions and the sets $S_{\mathbf{x}}$ and $S_{\mathbf{y}}$, respectively. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$ be the eigenvalues of the matrix $\tilde{\mathbf{K}}$ and let $\mathbf{v}_1, \ldots, \mathbf{v}_n$ be a set of orthonormal eigenvectors corresponding to these eigenvalues. Let $\lambda^*_1 \geq \lambda^*_2 \geq \cdots \geq \lambda^*_n \geq 0$ be the eigenvalues of the matrix $\tilde{\mathbf{L}}$ and let $\mathbf{v}^*_1, \ldots, \mathbf{v}^*_n$ be a set of orthonormal eigenvectors corresponding to these eigenvalues. Let $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, $\boldsymbol{\Lambda}^* = \mathrm{diag}(\lambda^*_1, \ldots, \lambda^*_n)$, $\mathbf{V} = (\mathbf{v}_1, \ldots, \mathbf{v}_n)$ and $\mathbf{V}^* = (\mathbf{v}^*_1, \ldots, \mathbf{v}^*_n)$. Suppose further that we have the eigenvalue decomposition (EVD) of the centered kernel matrices $\tilde{\mathbf{K}}$ and $\tilde{\mathbf{L}}$, i.e., $\tilde{\mathbf{K}} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{\top}$ and $\tilde{\mathbf{L}} = \mathbf{V}^*\boldsymbol{\Lambda}^*(\mathbf{V}^*)^{\top}$. Let $\boldsymbol{\Psi} = (\boldsymbol{\psi}_1, \ldots, \boldsymbol{\psi}_n) = \mathbf{V}\boldsymbol{\Lambda}^{1/2}$ and $\boldsymbol{\Psi}^* = (\boldsymbol{\psi}^*_1, \ldots, \boldsymbol{\psi}^*_n) = \mathbf{V}^*(\boldsymbol{\Lambda}^*)^{1/2}$, i.e., $\boldsymbol{\psi}_i = \sqrt{\lambda_i}\,\mathbf{v}_i$, $\boldsymbol{\psi}^*_i = \sqrt{\lambda^*_i}\,\mathbf{v}^*_i$, $i = 1, \ldots, n$.

The following result is true (Zhang et al. 2011): under the null hypothesis that $\mathbf{X}$ and $\mathbf{Y}$ are independent, the statistic (6) has the same asymptotic distribution as
$$Z_n = \sum_{i,j=1}^{n} \lambda_{i,n}\, \lambda^*_{j,n}\, Z^2_{ij}, \qquad (7)$$
where the $Z^2_{ij}$ are i.i.d. $\chi^2_1$-distributed variables, $n \to \infty$.

Note that the data-based test statistic HSIC (or its probabilistic counterpart) is sensitive to dependence/independence and can therefore be used as a test statistic. Also important is knowledge of its asymptotic distribution. These facts inspire the following dependence/independence testing procedure. Given the samples $S_{\mathbf{x}}$ and $S_{\mathbf{y}}$, one first calculates the centered kernel matrices $\tilde{\mathbf{K}}$ and $\tilde{\mathbf{L}}$ and their eigenvalues $\lambda_i$ and $\lambda^*_j$, and then evaluates the statistic $\mathrm{HSIC}(S)$ according to (6). Next, the empirical null distribution of $Z_n$ under the null hypothesis can be simulated in the following way: one draws i.i.d. random samples of the $\chi^2_1$-distributed variables $Z^2_{ij}$, and then generates samples of $Z_n$ according to (7). Finally, the p value can be found by locating $\mathrm{HSIC}(S)$ in the simulated null distribution.

A permutation-based test is described in Gretton et al. (2005). In the first step one calculates the test statistic $T$ (HSIC or KTA) for the given data. Next, keeping the order of the first sample, the second sample is randomly permuted a large number of times, and the selected statistic is recomputed each time. This destroys any dependence between the samples, simulating a draw from the product of marginals, and makes the empirical distribution of the permuted statistics behave like the null distribution of the test statistic. For a specified significance level $\alpha$, we calculate the threshold $t_{\alpha}$ in the right tail of the null distribution and reject $H_0$ if $T > t_{\alpha}$. This test was proved to be consistent against any fixed alternative, which means that for any fixed significance level $\alpha$ the power goes to 1 as the sample size tends to infinity.
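A minimal R sketch of the permutation-based test described above (ours, with assumed helper names and bandwidths): the second sample is permuted repeatedly, the statistic is recomputed, and the p value is read off the permutation distribution.

```r
# Hedged R sketch (ours) of the permutation-based independence test.
perm_independence_test <- function(X, Y, stat = c("HSIC", "KTA"),
                                   B = 999, lambda_x = 1, lambda_y = 1) {
  stat <- match.arg(stat)
  gauss_K  <- function(M, lambda) exp(-lambda * as.matrix(dist(M))^2)
  center_K <- function(K) { n <- nrow(K); H <- diag(n) - matrix(1/n, n, n); H %*% K %*% H }
  T_stat <- function(Kc, Lc) {
    if (stat == "HSIC") sum(Kc * Lc)
    else sum(Kc * Lc) / (norm(Kc, "F") * norm(Lc, "F"))
  }
  Kc <- center_K(gauss_K(X, lambda_x))
  Lc <- center_K(gauss_K(Y, lambda_y))
  T0 <- T_stat(Kc, Lc)
  T_perm <- replicate(B, {
    idx <- sample(nrow(Lc))              # permute the second sample only
    T_stat(Kc, Lc[idx, idx])
  })
  p_value <- (1 + sum(T_perm >= T0)) / (B + 1)
  list(statistic = T0, p.value = p_value)
}

set.seed(4)
X <- matrix(rnorm(60), 60, 1)
Y <- matrix(X^2 + rnorm(60, sd = 0.2), 60, 1)
perm_independence_test(X, Y, stat = "KTA", B = 199)$p.value
```

Permuting the rows and columns of the second centered kernel matrix is equivalent to permuting that sample itself, because the centering matrix commutes with permutation matrices.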
2.5 Functional data

In recent years, methods for representing data by functions or curves have received much attention. Such data are known in the literature as functional data (Ramsay and Silverman 2005; Horváth and Kokoszka 2012; Hsing and Eubank 2015). Examples of functional data can be found in various application domains, such as medicine, economics, meteorology and many others. Functional data can be seen as the values of a random process $X(t)$. In practice, the values of the observed random process $X(t)$ are always recorded at discrete times $t_1, \ldots, t_J$, more or less densely spaced in the range of variability of the argument $t$, so we have a time series $\{x(t_1), \ldots, x(t_J)\}$. However, there are many reasons to model these series as elements of a functional space, because functional data have many advantages over other ways of representing time series.

1. They easily cope with the problem of missing observations, an inevitable problem in many areas of research. Unfortunately, most data analysis methods require complete time series. One solution is to delete a time series with missing values from the data, but this can lead to, and generally does lead to, loss of information. Another option is to use one of many statistical methods to predict the missing values, but then the results depend on the interpolation method. In contrast to this type of solution, in the case of functional data the problem of missing observations is solved by expressing the time series in the form of a set of continuous functions.
2. Functional data naturally preserve the structure of observations, i.e. they maintain the time dependence of the observations and take into account the information about each measurement.
3. The moments of observation do not have to be evenly spaced in individual time series.
4. Functional data avoid the curse of dimensionality. When the number of time points is greater than the number of time series considered, most statistical methods will not give satisfactory results due to overparametrization. In the case of functional data this problem can be avoided, because the time series are replaced with a set of continuous functions independent of the number of time points at which observations are measured.

In most of the papers on functional data analysis, objects are characterized by only one feature observed at many time points. In several applications there is a need for statistical methods for objects characterized by many features observed at many time points (double multivariate data). In this case, such data are transformed into multivariate functional data.

Let us assume that $\mathbf{X} = (X_1, \ldots, X_p)^{\top} = \{\mathbf{X}(s), s \in I_1\} \in L_2^p(I_1)$ and $\mathbf{Y} = (Y_1, \ldots, Y_q)^{\top} = \{\mathbf{Y}(t), t \in I_2\} \in L_2^q(I_2)$ are random processes, where $L_2(I)$ is the space of square integrable functions on the interval $I$. We also assume that $\mathrm{E}(\mathbf{X}(s)) = \mathbf{0}$, $s \in I_1$, and $\mathrm{E}(\mathbf{Y}(t)) = \mathbf{0}$, $t \in I_2$.

We will further assume that each component $X_g$ of the random process $\mathbf{X}$ and $Y_h$ of the random process $\mathbf{Y}$ can be represented by a finite number of orthonormal basis functions $\{\varphi_e\}$ and $\{\varphi_f\}$ of the spaces $L_2(I_1)$ and $L_2(I_2)$, respectively:
$$X_g(s) = \sum_{e=0}^{E_g} \alpha_{ge}\, \varphi_e(s), \quad s \in I_1, \; g = 1, 2, \ldots, p,$$
$$Y_h(t) = \sum_{f=0}^{F_h} \beta_{hf}\, \varphi_f(t), \quad t \in I_2, \; h = 1, 2, \ldots, q,$$
where $\alpha_{ge}$ and $\beta_{hf}$ are random coefficients. The degree of smoothness of the processes $X_g$ and $Y_h$ depends on the values $E_g$ and $F_h$, respectively (small values imply more smoothing). The optimum values for $E_g$ and $F_h$ are selected using the Bayesian Information Criterion (BIC) (see Górecki et al. 2018). As basis functions we can use e.g. the Fourier basis system or spline functions.

We introduce the following notation:
$$\boldsymbol{\alpha} = (\alpha_{10}, \ldots, \alpha_{1E_1}, \ldots, \alpha_{p0}, \ldots, \alpha_{pE_p})^{\top}, \qquad \boldsymbol{\beta} = (\beta_{10}, \ldots, \beta_{1F_1}, \ldots, \beta_{q0}, \ldots, \beta_{qF_q})^{\top},$$
$$\boldsymbol{\varphi}_{E_g}(s) = (\varphi_0(s), \ldots, \varphi_{E_g}(s))^{\top}, \quad s \in I_1, \; g = 1, 2, \ldots, p,$$
$$\boldsymbol{\varphi}_{F_h}(t) = (\varphi_0(t), \ldots, \varphi_{F_h}(t))^{\top}, \quad t \in I_2, \; h = 1, 2, \ldots, q,$$
$$\boldsymbol{\Phi}_1(s) = \mathrm{diag}\{\boldsymbol{\varphi}_{E_1}^{\top}(s), \ldots, \boldsymbol{\varphi}_{E_p}^{\top}(s)\}, \qquad \boldsymbol{\Phi}_2(t) = \mathrm{diag}\{\boldsymbol{\varphi}_{F_1}^{\top}(t), \ldots, \boldsymbol{\varphi}_{F_q}^{\top}(t)\}$$
(block-diagonal matrices), where $\boldsymbol{\alpha} \in \mathbb{R}^{K_1+p}$, $\boldsymbol{\beta} \in \mathbb{R}^{K_2+q}$, $\boldsymbol{\Phi}_1 \in \mathbb{R}^{p \times (K_1+p)}$, $\boldsymbol{\Phi}_2 \in \mathbb{R}^{q \times (K_2+q)}$, $K_1 = E_1 + \cdots + E_p$, $K_2 = F_1 + \cdots + F_q$.
Using the above matrix notation, the random processes $\mathbf{X}$ and $\mathbf{Y}$ can be represented as
$$\mathbf{X}(s) = \boldsymbol{\Phi}_1(s)\boldsymbol{\alpha}, \quad s \in I_1, \qquad \mathbf{Y}(t) = \boldsymbol{\Phi}_2(t)\boldsymbol{\beta}, \quad t \in I_2, \qquad (8)$$
where $\mathrm{E}(\boldsymbol{\alpha}) = \mathbf{0}$, $\mathrm{E}(\boldsymbol{\beta}) = \mathbf{0}$. This means that the values of the random processes $\mathbf{X}$ and $\mathbf{Y}$ lie in finite-dimensional subspaces of $L_2^p(I_1)$ and $L_2^q(I_2)$, respectively. We will denote these subspaces by $\mathcal{L}_2^p(I_1)$ and $\mathcal{L}_2^q(I_2)$.

Typically data are recorded at discrete moments in time, and the transformation from discrete to functional data is performed for each realization and each variable separately. Let $x_{gj}$ denote an observed value of the feature $X_g$, $g = 1, 2, \ldots, p$, at the $j$th time point $s_j$, where $j = 1, 2, \ldots, J$. Similarly, let $y_{hj}$ denote an observed value of the feature $Y_h$, $h = 1, 2, \ldots, q$, at the $j$th time point $t_j$, where $j = 1, 2, \ldots, J$. Then our data consist of $pJ$ pairs $(s_j, x_{gj})$ and of $qJ$ pairs $(t_j, y_{hj})$. Let $\mathbf{X}_1, \ldots, \mathbf{X}_n$ and $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$ be independent trajectories of the random processes $\mathbf{X}$ and $\mathbf{Y}$ having the representation (8). The coefficients $\boldsymbol{\alpha}_i$ and $\boldsymbol{\beta}_i$ are estimated by the least squares method; let us denote these estimates by $\mathbf{a}_i$ and $\mathbf{b}_i$, $i = 1, 2, \ldots, n$. As a result, we obtain functional data of the form
$$\mathbf{X}_i(s) = \boldsymbol{\Phi}_1(s)\mathbf{a}_i, \qquad \mathbf{Y}_i(t) = \boldsymbol{\Phi}_2(t)\mathbf{b}_i, \qquad (9)$$
where $s \in I_1$, $t \in I_2$, $\mathbf{a}_i \in \mathbb{R}^{K_1+p}$, $\mathbf{b}_i \in \mathbb{R}^{K_2+q}$, $K_1 = E_1 + \cdots + E_p$, $K_2 = F_1 + \cdots + F_q$, and $i = 1, 2, \ldots, n$.

Górecki and Smaga (2017) described a multivariate analysis of variance (MANOVA) for functional data. In the paper by Górecki et al. (2018), three basic methods of dimension reduction for multivariate functional data are given: principal component analysis, canonical correlation analysis, and discriminant coordinates.
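The least-squares step that produces the coefficient vectors in (9) can be sketched as follows for a single variable of a single trajectory. This is our illustration with an assumed small Fourier basis on [0, 1] and synthetic data; the paper reports using the fda package for its own computations.

```r
# Hedged R sketch (ours): least-squares basis coefficients for one variable
# of one trajectory, assumed Fourier basis on [0, 1].
fourier_basis <- function(s, n_basis) {
  # columns: 1, sin(2*pi*s), cos(2*pi*s), sin(4*pi*s), cos(4*pi*s), ...
  B <- matrix(1, length(s), n_basis)
  for (m in seq_len((n_basis - 1) %/% 2)) {
    B[, 2 * m]     <- sin(2 * pi * m * s)
    B[, 2 * m + 1] <- cos(2 * pi * m * s)
  }
  B
}

set.seed(5)
s_j <- seq(0, 1, length.out = 50)                       # J = 50 time points
x_j <- sin(2 * pi * s_j) + 0.3 * cos(4 * pi * s_j) +    # smooth signal
       rnorm(50, sd = 0.1)                              # observation noise
B      <- fourier_basis(s_j, n_basis = 5)
a_hat  <- solve(crossprod(B), crossprod(B, x_j))        # least-squares coefficients
x_hat  <- B %*% a_hat                                   # fitted functional datum
```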
3 Alignment for multivariate functional data

3.1 The alignment between two kernel functions and two kernel matrices for multivariate functional data

Let $\mathbf{x}(s) \in \mathcal{L}_2^p(I_1)$, $s \in I_1$, where $\mathcal{L}_2^p(I_1)$ is a finite-dimensional space of continuous square-integrable vector functions over the interval $I_1$. Let
$$k^* : \mathcal{L}_2^p(I_1) \times \mathcal{L}_2^p(I_1) \to \mathbb{R}$$
be a kernel function on $\mathcal{L}_2^p(I_1)$. As already mentioned, in this paper we use the Gaussian kernel. For multivariate functional data this kernel has the form
$$k^*(\mathbf{x}(s), \mathbf{x}'(s)) = \exp(-\lambda\, \|\mathbf{x}(s) - \mathbf{x}'(s)\|^2), \quad \lambda > 0.$$
But from (9), and by the orthonormality of the basis functions, we have
$$\|\mathbf{x}(s) - \mathbf{x}'(s)\|^2 = \int_{I_1} (\mathbf{x}(s) - \mathbf{x}'(s))^{\top}(\mathbf{x}(s) - \mathbf{x}'(s))\, ds = \|\mathbf{a} - \mathbf{a}'\|^2.$$
Hence
$$k^*(\mathbf{x}(s), \mathbf{x}'(s)) = k(\mathbf{a}, \mathbf{a}') \quad \text{and} \quad l^*(\mathbf{y}(t), \mathbf{y}'(t)) = l(\mathbf{b}, \mathbf{b}'),$$
where $\mathbf{a}$ and $\mathbf{b}$ are the vectors occurring in the representation (9) of the vector functions $\mathbf{x}(s)$, $s \in I_1$, and $\mathbf{y}(t)$, $t \in I_2$.

For a given subset $\{\mathbf{x}_1(s), \ldots, \mathbf{x}_n(s)\}$ of $\mathcal{L}_2^p(I_1)$ and a given kernel function $k^*$ on $\mathcal{L}_2^p(I_1) \times \mathcal{L}_2^p(I_1)$, the $n \times n$ matrix $\mathbf{K}^*$ with $(i, j)$th element $K^*_{ij} = k^*(\mathbf{x}_i(s), \mathbf{x}_j(s))$, $s \in I_1$, is called the kernel matrix of the kernel function $k^*$ with respect to the set $\{\mathbf{x}_1(s), \ldots, \mathbf{x}_n(s)\}$, $s \in I_1$.

Definition 11 (Kernel function alignment for functional data) Let $k^*$ and $l^*$ be two kernel functions defined over $\mathcal{L}_2^p(I_1) \times \mathcal{L}_2^p(I_1)$ and $\mathcal{L}_2^q(I_2) \times \mathcal{L}_2^q(I_2)$, respectively, such that $0 < \mathrm{E}_{\mathbf{X},\mathbf{X}'}[\tilde{k}^{*2}(\mathbf{X}, \mathbf{X}')] < \infty$ and $0 < \mathrm{E}_{\mathbf{Y},\mathbf{Y}'}[\tilde{l}^{*2}(\mathbf{Y}, \mathbf{Y}')] < \infty$, where $\mathbf{X}, \mathbf{X}'$ and $\mathbf{Y}, \mathbf{Y}'$ are independent copies distributed according to $P_{\mathbf{X}}$ and $P_{\mathbf{Y}}$, respectively. Then the alignment between $k^*$ and $l^*$ is defined by
$$\rho(k^*, l^*) = \frac{\mathrm{E}_{\mathbf{X},\mathbf{X}',\mathbf{Y},\mathbf{Y}'}[\tilde{k}^*(\mathbf{X}, \mathbf{X}')\,\tilde{l}^*(\mathbf{Y}, \mathbf{Y}')]}{\sqrt{\mathrm{E}_{\mathbf{X},\mathbf{X}'}[\tilde{k}^{*2}(\mathbf{X}, \mathbf{X}')]\; \mathrm{E}_{\mathbf{Y},\mathbf{Y}'}[\tilde{l}^{*2}(\mathbf{Y}, \mathbf{Y}')]}}. \qquad (10)$$
We can define similarly the alignment between two kernel matrices $\mathbf{K}^*$ and $\mathbf{L}^*$ based on the subsets $\{\mathbf{x}_1(s), \ldots, \mathbf{x}_n(s)\}$, $s \in I_1$, and $\{\mathbf{y}_1(t), \ldots, \mathbf{y}_n(t)\}$, $t \in I_2$, of $\mathcal{L}_2^p(I_1)$ and $\mathcal{L}_2^q(I_2)$, respectively.

Definition 12 (Kernel matrix alignment for functional data) Let $\mathbf{K}^* \in \mathbb{R}^{n \times n}$ and $\mathbf{L}^* \in \mathbb{R}^{n \times n}$ be two kernel matrices such that $\|\tilde{\mathbf{K}}^*\|_F \neq 0$ and $\|\tilde{\mathbf{L}}^*\|_F \neq 0$. Then, the centered kernel target alignment (KTA) between $\mathbf{K}^*$ and $\mathbf{L}^*$ is defined by
$$\hat{\rho}(\mathbf{K}^*, \mathbf{L}^*) = \frac{\langle \tilde{\mathbf{K}}^*, \tilde{\mathbf{L}}^* \rangle_F}{\|\tilde{\mathbf{K}}^*\|_F\, \|\tilde{\mathbf{L}}^*\|_F}. \qquad (11)$$
If $\mathbf{K}^*$ and $\mathbf{L}^*$ are positive semi-definite matrices, then $\hat{\rho}(\mathbf{K}^*, \mathbf{L}^*) \in [0, 1]$. We have
$$\hat{\rho}(\mathbf{K}^*, \mathbf{L}^*) = \hat{\rho}(\mathbf{K}, \mathbf{L}),$$
where $\mathbf{K}$ is the $n \times n$ matrix with $(i, j)$th element $K_{ij} = k(\mathbf{a}_i, \mathbf{a}_j)$, and analogously $\mathbf{L}$ has elements $L_{ij} = l(\mathbf{b}_i, \mathbf{b}_j)$.

3.2 Kernel-based independence test for multivariate functional data

Definition 13 (Empirical HSIC for functional data) The empirical HSIC for functional data is defined as
$$\mathrm{HSIC}(S^*) = \langle \tilde{\mathbf{K}}^*, \tilde{\mathbf{L}}^* \rangle_F,$$
where $S^* = \{(\mathbf{x}_1(s), \mathbf{y}_1(t)), \ldots, (\mathbf{x}_n(s), \mathbf{y}_n(t))\}$, $s \in I_1$, $t \in I_2$, and $\mathbf{K}^*$ and $\mathbf{L}^*$ are kernel matrices based on the subsets $\{\mathbf{x}_1(s), \ldots, \mathbf{x}_n(s)\}$, $s \in I_1$, and $\{\mathbf{y}_1(t), \ldots, \mathbf{y}_n(t)\}$, $t \in I_2$, of $\mathcal{L}_2^p(I_1)$ and $\mathcal{L}_2^q(I_2)$, respectively.

But $\mathbf{K}^* = \mathbf{K}$, where $\mathbf{K}$ is the $n \times n$ kernel matrix with $(i, j)$th element $K_{ij} = k(\mathbf{a}_i, \mathbf{a}_j)$ and $\mathbf{a}_1, \ldots, \mathbf{a}_n$ are the vectors occurring in the representation (9) of the vector functions $\mathbf{X}_i(s)$, $s \in I_1$. Analogously, $\mathbf{L}^* = \mathbf{L}$, where $\mathbf{L}$ is the $n \times n$ kernel matrix with $(i, j)$th element $L_{ij} = l(\mathbf{b}_i, \mathbf{b}_j)$ and $\mathbf{b}_1, \ldots, \mathbf{b}_n$ are the vectors occurring in the representation (9) of the vector functions $\mathbf{Y}_i(t)$, $t \in I_2$. Hence
$$\mathrm{HSIC}(S^*) = \mathrm{HSIC}(S_v), \quad \text{where } S_v = \{(\mathbf{a}_1, \mathbf{b}_1), \ldots, (\mathbf{a}_n, \mathbf{b}_n)\}.$$
Note also that the null hypothesis $H_0 : \mathbf{X} \perp\!\!\!\perp \mathbf{Y}$ of independence of the random processes $\mathbf{X}$ and $\mathbf{Y}$ is equivalent to the null hypothesis $H_0 : \boldsymbol{\alpha} \perp\!\!\!\perp \boldsymbol{\beta}$ of independence of the random vectors $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ occurring in the representation (8) of the random processes $\mathbf{X}$ and $\mathbf{Y}$. We can therefore use the tests described in Section 2.4, replacing $\mathbf{x}_i$ and $\mathbf{y}_i$ by $\mathbf{a}_i$ and $\mathbf{b}_i$.
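The computational point of Sections 3.1–3.2, that kernels on the smoothed curves can be evaluated directly on their basis coefficients, can be checked numerically. The R sketch below (ours) uses an orthonormal Fourier basis on an equally spaced grid; the grid size, basis size and bandwidth 0.5 are arbitrary assumptions.

```r
# Hedged R verification (ours): for curves expanded in an orthonormal basis,
# the squared L2 distance between two smoothed curves equals the squared
# Euclidean distance between their coefficient vectors, so Gaussian kernels
# can be evaluated on the coefficients only.
s <- seq(0, 1 - 1/200, length.out = 200)                 # equally spaced grid on [0,1)
B <- cbind(1, sqrt(2) * sin(2 * pi * s), sqrt(2) * cos(2 * pi * s),
           sqrt(2) * sin(4 * pi * s), sqrt(2) * cos(4 * pi * s))  # orthonormal Fourier basis
set.seed(7)
a_i <- rnorm(5); a_j <- rnorm(5)                         # two coefficient vectors
x_i <- B %*% a_i; x_j <- B %*% a_j                       # the corresponding curves
l2_dist2   <- mean((x_i - x_j)^2)                        # approx. integral of (x_i - x_j)^2
coef_dist2 <- sum((a_i - a_j)^2)                         # ||a_i - a_j||^2
c(l2_dist2, coef_dist2)                                  # numerically equal
exp(-0.5 * l2_dist2) - exp(-0.5 * coef_dist2)            # Gaussian kernel values agree
```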
3.3 Canonical correlation analysis based on the alignment between kernel matrices for multivariate functional data

In classical canonical correlation analysis (Hotelling 1936) we are interested in the relationship between two random vectors $\mathbf{X}$ and $\mathbf{Y}$. In the functional case we are interested in the relationship between two random functions $\mathbf{X}$ and $\mathbf{Y}$.

Functional canonical variables $U$ and $V$ for the random processes $\mathbf{X}$ and $\mathbf{Y}$ are defined as follows:
$$U = \langle \mathbf{u}, \mathbf{X} \rangle = \int_{I_1} \mathbf{u}^{\top}(s)\mathbf{X}(s)\, ds, \qquad (12)$$
$$V = \langle \mathbf{v}, \mathbf{Y} \rangle = \int_{I_2} \mathbf{v}^{\top}(t)\mathbf{Y}(t)\, dt, \qquad (13)$$
where the vector functions $\mathbf{u}$ and $\mathbf{v}$ are called the vector weight functions and are of the form
$$\mathbf{u}(s) = \boldsymbol{\Phi}_1(s)\mathbf{u}, \qquad \mathbf{v}(t) = \boldsymbol{\Phi}_2(t)\mathbf{v}, \qquad \text{where } \mathbf{u} \in \mathbb{R}^{K_1+p},\; \mathbf{v} \in \mathbb{R}^{K_2+q}. \qquad (14)$$
Classically, the weight functions $\mathbf{u}$ and $\mathbf{v}$ are chosen to maximize the sample correlation coefficient (Górecki et al. 2018):
$$\rho = \frac{\mathrm{Cov}(U, V)}{\sqrt{\mathrm{Var}(U)\, \mathrm{Var}(V)}}. \qquad (15)$$
The sample correlation coefficient between the variables $U$ and $V$ is now replaced by a centered kernel target alignment (KTA) between kernel matrices $\mathbf{K}^*$ and $\mathbf{L}^*$ based on the projected data $\langle \mathbf{u}(s), \mathbf{x}_i(s) \rangle$ and $\langle \mathbf{v}(t), \mathbf{y}_i(t) \rangle$, i.e. their $(i, j)$th entries are
$$K^*_{ij} = k(\langle \mathbf{u}(s), \mathbf{x}_i(s) \rangle, \langle \mathbf{u}(s), \mathbf{x}_j(s) \rangle), \quad s \in I_1, \qquad L^*_{ij} = l(\langle \mathbf{v}(t), \mathbf{y}_i(t) \rangle, \langle \mathbf{v}(t), \mathbf{y}_j(t) \rangle), \quad t \in I_2,$$
respectively, $i, j = 1, \ldots, n$:
$$\hat{\rho}(\mathbf{u}(s), \mathbf{v}(t)) = \frac{\mathrm{tr}(\tilde{\mathbf{K}}^* \tilde{\mathbf{L}}^*)}{\sqrt{\mathrm{tr}(\tilde{\mathbf{K}}^* \tilde{\mathbf{K}}^*)\, \mathrm{tr}(\tilde{\mathbf{L}}^* \tilde{\mathbf{L}}^*)}} \qquad (16)$$
subject to
$$\|\mathbf{u}(s)\| = \|\mathbf{v}(t)\| = 1. \qquad (17)$$
But
$$K^*_{ij} = k\left(\int_{I_1} \mathbf{u}^{\top}(s)\mathbf{x}_i(s)\, ds,\; \int_{I_1} \mathbf{u}^{\top}(s)\mathbf{x}_j(s)\, ds\right) = k(\mathbf{u}^{\top}\mathbf{a}_i, \mathbf{u}^{\top}\mathbf{a}_j) = K^{(\mathbf{u})}_{ij}$$
and
$$L^*_{ij} = l\left(\int_{I_2} \mathbf{v}^{\top}(t)\mathbf{y}_i(t)\, dt,\; \int_{I_2} \mathbf{v}^{\top}(t)\mathbf{y}_j(t)\, dt\right) = l(\mathbf{v}^{\top}\mathbf{b}_i, \mathbf{v}^{\top}\mathbf{b}_j) = L^{(\mathbf{v})}_{ij},$$
where $\mathbf{a}_i$ and $\mathbf{b}_i$ are the vectors occurring in the representation (9) of the vector functions $\mathbf{x}_i(s)$, $s \in I_1$, and $\mathbf{y}_i(t)$, $t \in I_2$, $i = 1, \ldots, n$, and $\mathbf{u} \in \mathbb{R}^{K_1+p}$, $\mathbf{v} \in \mathbb{R}^{K_2+q}$. Thus, choosing the weight functions $\mathbf{u}(s)$ and $\mathbf{v}(t)$ so that the coefficient (16) has a maximum value subject to (17) is equivalent to choosing vectors $\mathbf{u} \in \mathbb{R}^{K_1+p}$ and $\mathbf{v} \in \mathbb{R}^{K_2+q}$ such that the coefficient
$$\hat{\rho}(\mathbf{u}, \mathbf{v}) = \frac{\mathrm{tr}(\tilde{\mathbf{K}}_{\mathbf{u}} \tilde{\mathbf{L}}_{\mathbf{v}})}{\sqrt{\mathrm{tr}(\tilde{\mathbf{K}}_{\mathbf{u}} \tilde{\mathbf{K}}_{\mathbf{u}})\, \mathrm{tr}(\tilde{\mathbf{L}}_{\mathbf{v}} \tilde{\mathbf{L}}_{\mathbf{v}})}} \qquad (18)$$
has a maximum value subject to
$$\|\mathbf{u}\| = \|\mathbf{v}\| = 1, \qquad (19)$$
where $\mathbf{K}_{\mathbf{u}} = (K^{(\mathbf{u})}_{ij})$, $\mathbf{L}_{\mathbf{v}} = (L^{(\mathbf{v})}_{ij})$, $i, j = 1, \ldots, n$.

In order to maximize the coefficient (18) we can use the result of Chang et al. (2013). The authors used a gradient descent algorithm, with a modified gradient to ensure that the unit length constraint is satisfied at each step (Edelman et al. 1998). Optimal step-sizes were found numerically using the Nelder–Mead method. This article employs the Gaussian kernel exclusively, although other kernels are available. The bandwidth parameter $\lambda$ of the Gaussian kernel was chosen using the "median trick" (Song et al. 2010), i.e. based on the median Euclidean distance between all pairs of points.

The coefficients of the projection of the $i$th value $\mathbf{x}_i(s)$ of the random process $\mathbf{X}$ on the $k$th functional canonical variable are equal to
$$U_{ik} = \langle \mathbf{u}_k, \mathbf{x}_i \rangle = \int_{I_1} \mathbf{u}_k^{\top}(s)\mathbf{x}_i(s)\, ds = \mathbf{a}_i^{\top}\mathbf{u}_k,$$
and analogously the coefficients of the projection of the $i$th value $\mathbf{y}_i(t)$ of the random process $\mathbf{Y}$ on the $k$th functional canonical variable are equal to
$$V_{ik} = \mathbf{b}_i^{\top}\mathbf{v}_k,$$
where $i = 1, \ldots, n$, $k = 1, \ldots, \min(\mathrm{rank}(\mathbf{A}), \mathrm{rank}(\mathbf{B}))$, and $\mathbf{A} \in \mathbb{R}^{n \times (K_1+p)}$ and $\mathbf{B} \in \mathbb{R}^{n \times (K_2+q)}$ are the matrices whose $i$th rows are $\mathbf{a}_i^{\top}$ and $\mathbf{b}_i^{\top}$, respectively, and which have column means of zero.

As we mentioned earlier, KTA is a normalized variant of HSIC. Hence we can repeat the above reasoning for the HSIC criterion. However, we should remember that the two approaches are not equivalent, and we can obtain different results.
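A simplified sketch of the optimization of (18)–(19) is given below. Instead of the projected-gradient scheme of Chang et al. (2013) with Nelder–Mead step-size search used in the paper, this illustration (ours) simply runs a general-purpose Nelder–Mead search over $(\mathbf{u}, \mathbf{v})$, enforcing the unit-norm constraints by normalization inside the objective; all names and bandwidths are assumptions, and convergence to the global maximum is not guaranteed.

```r
# Hedged R sketch (ours): maximizing the KTA criterion (18) over unit vectors
# u and v with a generic optimizer, as a stand-in for the algorithm of
# Chang et al. (2013). A and B are the n x (K1+p) and n x (K2+q) coefficient
# matrices; the bandwidths lambda_x, lambda_y are assumed.
kta_proj <- function(scores_x, scores_y, lambda_x, lambda_y) {
  center_K <- function(K) { n <- nrow(K); H <- diag(n) - matrix(1/n, n, n); H %*% K %*% H }
  Kc <- center_K(exp(-lambda_x * as.matrix(dist(scores_x))^2))
  Lc <- center_K(exp(-lambda_y * as.matrix(dist(scores_y))^2))
  sum(Kc * Lc) / (norm(Kc, "F") * norm(Lc, "F"))
}

kta_cca_first <- function(A, B, lambda_x = 1, lambda_y = 1) {
  pA <- ncol(A); pB <- ncol(B)
  obj <- function(theta) {
    u <- theta[1:pA];               u <- u / sqrt(sum(u^2))   # enforce ||u|| = 1
    v <- theta[(pA + 1):(pA + pB)]; v <- v / sqrt(sum(v^2))   # enforce ||v|| = 1
    -kta_proj(A %*% u, B %*% v, lambda_x, lambda_y)           # maximize KTA
  }
  fit <- optim(rnorm(pA + pB), obj, method = "Nelder-Mead",
               control = list(maxit = 2000))
  u <- fit$par[1:pA]; v <- fit$par[(pA + 1):(pA + pB)]
  list(u = u / sqrt(sum(u^2)), v = v / sqrt(sum(v^2)), kta = -fit$value)
}
```

Here `A` and `B` play the role of the coefficient matrices with rows $\mathbf{a}_i^{\top}$ and $\mathbf{b}_i^{\top}$; subsequent canonical pairs would require an additional orthogonality treatment, which we omit.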
4 Experiments

Let us recall some symbols and introduce others:
– KTA: centered kernel target alignment,
– HSIC: Hilbert–Schmidt Independence Criterion,
– FCCA: classical functional canonical correlation analysis (Ramsay and Silverman 2005; Horváth and Kokoszka 2012),
– HSIC.FCCA: functional canonical correlation analysis based on HSIC,
– KTA.FCCA: functional canonical correlation analysis based on KTA.

4.1 Simulation

We generated random processes, together with noise, to test the performance of the introduced measures. The random processes are specified by
$$X_t = \varepsilon_t, \qquad Y_t = 3X_t + \eta_t, \qquad Z_t = X_t^2 + \xi_t,$$
where $\varepsilon_t$, $\eta_t$ and $\xi_t$ are jointly independent random variables from a Gaussian distribution with mean 0 and variance 0.25. We generated processes of length 100, and $N = 10000$ samples were generated for all processes. The objective is to examine how well the functional variants (Fourier basis with 15 basis functions) of the KTA and HSIC measures perform compared to the measures applied to the raw data (artificially generated at discrete time stamps). Here, the raw data are represented as vector data of the generated trajectories, so the raw data are three 10000 by 100 dimensional matrices (one for $X_t$, a second for $Y_t$ and a third for $Z_t$). On the other hand, the functional data are three 10000 by 15 dimensional matrices (coefficients of the Fourier basis). Here, $X_t$ and $Y_t$ are linearly dependent, whereas $X_t$, $Z_t$ and $Y_t$, $Z_t$ are nonlinearly dependent (Fig. 1).
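The simulation design above can be reproduced along the following lines; this R sketch (ours) uses a smaller $N$ than the paper's 10000 for speed, and the Fourier least-squares step is the same assumed helper as before.

```r
# Hedged R sketch (ours) of the simulation design: N trajectories of length
# 100 for X_t, Y_t, Z_t, plus their 15-coefficient Fourier representation.
set.seed(6)
N <- 500; len <- 100                     # the paper uses N = 10000
t_grid <- seq(0, 1, length.out = len)
X <- matrix(rnorm(N * len, sd = 0.5), N, len)            # X_t = eps_t, Var = 0.25
Y <- 3 * X + matrix(rnorm(N * len, sd = 0.5), N, len)    # linear in X_t
Z <- X^2 + matrix(rnorm(N * len, sd = 0.5), N, len)      # nonlinear in X_t

fourier_basis <- function(s, n_basis) {
  B <- matrix(1, length(s), n_basis)
  for (m in seq_len((n_basis - 1) %/% 2)) {
    B[, 2 * m]     <- sin(2 * pi * m * s)
    B[, 2 * m + 1] <- cos(2 * pi * m * s)
  }
  B
}
B <- fourier_basis(t_grid, 15)
to_coef <- function(M) t(solve(crossprod(B), crossprod(B, t(M))))  # N x 15 coefficients
A_X <- to_coef(X); A_Y <- to_coef(Y); A_Z <- to_coef(Z)
dim(A_X)                                  # 500 x 15: the "functional" representation
```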
[Fig. 1 Sample trajectories of X_t, Y_t and Z_t time series for raw (left plot) and functional (right plot) representation]

From Fig. 2 and Table 1 we see that the proposed extension of the HSIC and KTA coefficients to functional data gives larger values of the coefficients than the variants for raw time series. Unfortunately, it is not possible to perform inference based only on the values of the coefficients; we have to apply tests. From Fig. 3 and Table 2 we observe that when we use the functional variants of the proposed measures, we obtain much better results in recognizing nonlinear dependence. The linear dependence between $X_t$ and $Y_t$ was easily recognized by each method (100% of correct decisions, i.e. p values below 5%). The results of functional KTA and HSIC are very similar. The non-functional measures HSIC and KTA give only 7.2% and 6.7% correct decisions (p values below 5%) for the relationships $X_t$, $Z_t$ and $Y_t$, $Z_t$, respectively. On the other hand, the functional variants recognize the dependency (p values below 5%) in 63.3% of cases (both measures) for $X_t$, $Z_t$ and in 47.8% (HSIC) and 63.3% (KTA) of cases for $Y_t$, $Z_t$.

[Fig. 2 Raw and functional HSIC and KTA coefficients for artificial time series]
[Fig. 3 p values from permutation-based tests for raw and functional variants of HSIC and KTA coefficients]

Table 1 Average raw and functional HSIC and KTA coefficients for artificial time series (numbers in brackets are standard deviations)

                   (X_t, Y_t)       (X_t, Z_t)       (Y_t, Z_t)
Raw       HSIC     0.795 (0.015)    0.672 (0.027)    0.825 (0.014)
          KTA      0.758 (0.019)    0.601 (0.028)    0.789 (0.019)
Functional HSIC    0.986 (0.000)    0.984 (0.001)    0.988 (0.000)
          KTA      0.999 (0.000)    0.999 (0.000)    0.999 (0.000)

Table 2 Average p values from permutation-based tests for raw and functional variants of HSIC and KTA coefficients (numbers in brackets are standard deviations)

                   (X_t, Y_t)       (X_t, Z_t)       (Y_t, Z_t)
Raw       HSIC     0.000 (0.000)    0.445 (0.290)    0.458 (0.282)
          KTA      0.000 (0.000)    0.445 (0.290)    0.458 (0.282)
Functional HSIC    0.000 (0.000)    0.077 (0.129)    0.125 (0.169)
          KTA      0.000 (0.000)    0.077 (0.129)    0.077 (0.129)

4.2 Univariate example

As a first real example we used the average daily temperature (in degrees Celsius) for each day of the year and the average daily rainfall (in mm) for each day of the year, rounded to 0.1 mm, at 35 different weather stations in Canada from 1960 to 1994. Each station belongs to one of four climate zones: Arctic (3 stations, blue on the plots), Atlantic (15, red), Continental (12, black) or Pacific (5, green) (Fig. 4). This data set comes from Ramsay and Silverman (2005).

[Fig. 4 Location of Canadian weather stations]

In the first step we smoothed the data. We used the Fourier basis with various values of the smoothing parameter (number of basis functions) from 3 to 15. We can observe the effect of smoothing in Figs. 5 and 6 (for the Fourier basis with 15 basis functions). We decided to use the Fourier basis for two reasons: it has excellent computational properties, especially if the observations are equally spaced, and it is natural for describing periodic data, such as the annual weather cycles. Here, the raw data are two 35 by 100 dimensional matrices (one for temperature and a second for precipitation). On the other hand, the functional data are two 35 by 15 dimensional matrices (coefficients of the Fourier basis).

[Fig. 5 Raw and functional temperature for Canadian weather stations]
[Fig. 6 Raw and functional precipitation for Canadian weather stations]

From the plots we can observe that the level of smoothness seems sufficient. Additionally, we can observe some relationship between average temperature and precipitation: for weather stations with high average temperature we observe relatively larger average precipitation, while for Arctic stations with the lowest average temperatures we observe the smallest average precipitation. So we can expect some relationship between average temperature and average precipitation for the Canadian weather stations. In the next step, we calculated the values of the coefficients described earlier; the values are presented in Fig. 8.

[Fig. 7 Absolute Spearman correlation coefficient for the first set of functional canonical variables: FCCA, HSIC.FCCA, KTA.FCCA]
We observe quite large values of HSIC and KTA, but it is impossible to infer dependency from these values alone. We see that the values of the HSIC and KTA coefficients are stable (both do not depend on the basis size).

To statistically confirm the association between temperature and precipitation we performed a small simulation study, based on the simulation of Chang et al. (2013). Finding a good nonlinear dependency measure is not trivial, and KTA and HSIC are not on the same scale, so, as in Chang et al. (2013), we used the Spearman correlation coefficient. We performed 50 random splits, using 25 samples to identify the models; the Spearman correlation coefficient was then calculated on the remaining 10 samples for each of the 50 splits. As we know, the strongest signal between temperature and precipitation for the Canadian weather stations is nonlinear (Chang et al. 2013). From Fig. 7 we can observe that HSIC.FCCA and KTA.FCCA produced larger absolute Spearman coefficients than FCCA. Such results suggest that HSIC.FCCA and KTA.FCCA can be viewed as natural nonlinear extensions of CCA also in the case of multivariate functional data.

Finally, we performed permutation-based tests for the HSIC and KTA coefficients. The results are presented in Fig. 8. All tests rejected $H_0$ (p values close to 0) for all basis sizes, so we can infer that there is some relationship between average temperature and average precipitation for the Canadian weather stations. Unfortunately, we know nothing about the strength and direction of the dependency; only a visual inspection of the plots suggests that there is a strong and positive relationship.

[Fig. 8 HSIC and KTA coefficients and p values of permutation-based tests for Canadian weather data]

The relative positions of the 35 Canadian weather stations in the system $(\hat{U}_1, \hat{V}_1)$ of functional canonical variables are shown in Fig. 9. It seems that for both coefficients the weather stations group reasonably.

[Fig. 9 Projection of the 35 Canadian weather stations on the plane $(\hat{U}_1, \hat{V}_1)$]

4.3 Multivariate example

The described method was employed here to cluster the twelve groups (pillars) of variables of 38 European countries in the period 2008–2015. The list of countries used in the dependency analysis is contained in Table 3, and Table 4 describes the pillars used in the analysis. For this purpose, use was made of data published by the World Economic Forum (WEF) in its annual reports (http://www.weforum.org). These are comprehensive data, describing exhaustively various socio-economic conditions or spheres of individual states (Górecki et al. 2016). The data were transformed into functional data. Calculations were performed using the Fourier basis. In view of the small number of time periods (J = 7), for each variable the maximum number of basis components was taken to be equal to five. Here, the raw data are twelve matrices (one for each pillar). The dimensions of the matrices differ and depend on the number of variables in the pillar. E.g.,
for the first pillar we have 16 (number of variables) × 7 (number of time points) = 112 columns, hence the dimensionality of the matrix for this pillar is 38 by 112, and similarly for the others. On the other hand, the functional data are twelve matrices with 38 rows and an appropriate number of columns (coefficients of the Fourier basis); the number of columns, e.g. for the first pillar, is 16 (number of variables) × 5 (number of basis elements) = 80.

Table 3 Countries used in the analysis, 2008–2015

1 Albania (AL), 2 Austria (AT), 3 Belgium (BE), 4 Bosnia and Herzegovina (BA), 5 Bulgaria (BG), 6 Croatia (HR), 7 Cyprus (CY), 8 Czech Republic (CZ), 9 Denmark (DK), 10 Estonia (EE), 11 Finland (FI), 12 France (FR), 13 Germany (DE), 14 Greece (GR), 15 Hungary (HU), 16 Iceland (IS), 17 Ireland (IE), 18 Italy (IT), 19 Latvia (LV), 20 Lithuania (LT), 21 Luxembourg (LU), 22 Macedonia FYR (MK), 23 Malta (MT), 24 Montenegro (ME), 25 Netherlands (NL), 26 Norway (NO), 27 Poland (PL), 28 Portugal (PT), 29 Romania (RO), 30 Russian Federation (RU), 31 Serbia (XS), 32 Slovak Republic (SK), 33 Slovenia (SI), 34 Spain (ES), 35 Sweden (SE), 36 Switzerland (CH), 37 Ukraine (UA), 38 United Kingdom (GB)

Table 4 Pillars used in the analysis, 2008–2015 (pillar: number of variables)

G1 Institutions: 16, G2 Infrastructure: 6, G3 Macroeconomic environment: 2, G4 Health and primary education: 7, G5 Higher education and training: 6, G6 Goods market efficiency: 10, G7 Labor market efficiency: 6, G8 Financial market development: 5, G9 Technological readiness: 4, G10 Market size: 4, G11 Business sophistication: 9, G12 Innovation: 5

Tables 5 and 6 contain the values of the functional HSIC and KTA coefficients. As expected, they are all close to one. But high values of these coefficients do not necessarily mean that there is a significant relationship between the two groups of variables. We can expect associations between groups of pillars; however, it is really hard to guess which groups are associated. Similarly to the Canadian weather example, we performed a small simulation study for pillars G5 and G6. From Fig. 10 we can observe that HSIC.FCCA and KTA.FCCA produced larger absolute Spearman coefficients than FCCA. This result suggests that the proposed measures have better characteristics in discovering nonlinear relationships for this example. We performed permutation-based tests for the HSIC and KTA coefficients discussed above. For most of the tests, the p values were close to zero, on the basis of which it can be inferred that there is some significant relationship between the groups (pillars) of variables. Table 7 contains the p values obtained for each test. We have exactly the same p values for both methods.
Table 5 Functional HSIC coefficients

      G1      G2      G3      G4      G5      G6      G7      G8      G9      G10     G11
G2    0.9736
G3    0.9736  0.9737
G4    0.9736  0.9737  0.9737
G5    0.9708  0.9706  0.9706  0.9706
G6    0.9728  0.9727  0.9727  0.9727  0.9753
G7    0.9687  0.9683  0.9683  0.9683  0.9799  0.9780
G8    0.9730  0.9730  0.9730  0.9730  0.9725  0.9740  0.9721
G9    0.9736  0.9737  0.9737  0.9737  0.9706  0.9727  0.9683  0.9730
G10   0.9736  0.9737  0.9737  0.9737  0.9706  0.9727  0.9683  0.9730  0.9737
G11   0.9714  0.9711  0.9711  0.9711  0.9785  0.9755  0.9828  0.9726  0.9711  0.9711
G12   0.9688  0.9683  0.9683  0.9683  0.9778  0.9741  0.9897  0.9715  0.9783  0.9683  0.9830

Table 6 Functional KTA coefficients

      G1      G2      G3      G4      G5      G6      G7      G8      G9      G10     G11
G2    1.0000
G3    1.0000  1.0000
G4    1.0000  1.0000  1.0000
G5    0.9918  0.9916  0.9916  0.9916
G6    0.9980  0.9978  0.9978  0.9978  0.9951
G7    0.9741  0.9736  0.9736  0.9936  0.9801  0.9821
G8    0.9991  0.9990  0.9990  0.9990  0.9933  0.9989  0.9772
G9    1.0000  1.0000  1.0000  1.0000  0.9916  0.9978  0.9736  0.9990
G10   1.0000  1.0000  1.0000  1.0000  0.9916  0.9978  0.9736  0.9990  1.0000
G11   0.9927  0.9924  0.9924  0.9924  0.9947  0.9957  0.9833  0.9936  0.9924  0.9924
G12   0.9793  0.9788  0.9788  0.9788  0.9831  0.9834  0.9794  0.9917  0.9788  0.9788  0.9887

[Fig. 10 Absolute Spearman correlation coefficient for the first set of functional canonical variables for pillars G5 & G6: FCCA, HSIC.FCCA, KTA.FCCA]

Now, we can observe that some groups are independent ($\alpha = 0.05$): G1 & G3, G3 & G6, G3 & G8, G3 & G11, G3 & G12, G4 & G9.

Table 7 Functional HSIC & KTA p values from permutation-based tests (only non-zero values are shown; in the original table, p values greater than the usual 5% significance level are given in bold)

G2:  0.0142
G3:  0.0714, 0.0332
G4:  0.0042, 0.0343
G5:  0.0001, 0.0268
G6:  0.0157, 0.0772
G7:  0.0009, 0.0061
G8:  0.0294, 0.0636
G9:  0.0030, 0.0055, 0.0198, 0.0640, 0.0002, 0.0003, 0.0009, 0.0040
G10: 0.0059, 0.0294, 0.0021, 0.0055
G11: 0.0039, 0.1034, 0.0008
G12: 0.0008, 0.0563, 0.0044

The graphs of the components of the vector weight function for the first functional canonical variables of the processes are shown in Fig. 11. From Fig. 11 (left) it can be seen that the greatest contribution to the structure of the first functional canonical correlation ($U_1$) comes from the "black" process, and this holds for all of the observation years considered. Figure 11 (right) shows that, on specified time intervals, the greatest contribution to the structure of the functional canonical variable $V_1$ comes alternately from the "black" and "red dotted" processes. The total contribution of a particular original process to the structure of a given functional canonical correlation is equal to the area under the module weighting function corresponding to this process. These contributions for the components are given in Table 8.

[Fig. 11 Weight functions for the first functional canonical variables $U_1$ (left) and $V_1$ (right)]

Figure 12 contains the relative positions of the 38 European countries in the system $(\hat{U}_1, \hat{V}_1)$ of functional canonical variables for selected groups of variables. The high correlation of the first two functional canonical variables can be seen in Fig. 12 for the two pillars G5 and G6. For the KTA criterion, the countries with the highest values of the functional canonical variables $\hat{U}_1$ and $\hat{V}_1$ are: Finland (FI), France (FR), Hungary (HU), Greece (GR), Estonia (EE), Germany (DE), Iceland (IS), Czech Republic (CZ) and Denmark (DK).
The countries with the lowest values of the functional canonical variables $\hat{U}_1$ and $\hat{V}_1$ are: Romania (RO), Poland (PL), Norway (NO), Portugal (PT), Netherlands (NL) and Russian Federation (RU). The other countries belong to the intermediate group. During the numerical calculations we used the R software (R Core Team 2018) and the packages fda (Ramsay et al. 2018) and hsicCCA (Chang 2013).

Table 8 Sorted areas under the module weighting functions

First functional canonical variable (G5)
No.   Area    Proportion (in %)
1     5.008   51.74
2     1.724   17.81
3     1.567   16.19
4     0.713    7.36
5     0.351    3.63
6     0.317    3.27

First functional canonical variable (G6)
No.   Area    Proportion (in %)
1     5.187   44.77
2     3.194   27.56
3     1.287   11.11
4     0.580    5.00
5     0.511    4.41
6     0.323    2.79
7     0.206    1.77
8     0.152    1.31
9     0.091    0.78
10    0.057    0.49

[Fig. 12 Selected projections of the 38 European countries on the plane $(\hat{U}_1, \hat{V}_1)$; panels: G4 and G5 (KTA), G4 and G6 (KTA), G5 and G6 (KTA). Regions used for statistical processing purposes by the United Nations Statistics Division: blue square, Northern Europe; cyan square, Western Europe; red square, Eastern Europe; green square, Southern Europe]

5 Conclusions

We proposed an extension of two dependency measures for two sets of variables to multivariate functional data. We propose to use tests to examine the significance of the results, because the values of the proposed coefficients are rather hard to interpret. Additionally, we presented methods of constructing nonlinear canonical variables for multivariate functional data using the HSIC and KTA coefficients. Tested on two real examples, the proposed method has proven useful in investigating the dependency between two sets of variables. The examples confirm the usefulness of our approach in revealing the hidden structure of co-dependence between groups of variables. During the study of the proposed coefficients we found that the size of the basis (smoothing parameter) is rather unimportant: the values (and p values for the tests) do not depend on the basis size. Of course, the performance of the methods needs to be further evaluated on additional real and artificial data sets. Moreover, we can examine the behavior of the coefficients (and tests) for different bases such as B-splines or wavelets (when data are not periodic, the Fourier basis could fail). This constitutes the direction of our future research.

Acknowledgements The authors are grateful to the editor and two anonymous reviewers for giving many insightful and constructive comments and suggestions which led to the improvement of the earlier manuscript.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References

Aronszajn N (1950) Theory of reproducing kernels. Trans Am Math Soc 68:337–404
Chang B (2013) hsicCCA: canonical correlation analysis based on kernel independence measures. R package version 1.0. https://CRAN.R-project.org/package=hsicCCA
Chang B, Kruger U, Kustra R, Zhang J (2013) Canonical correlation analysis based on Hilbert–Schmidt independence criterion and centered kernel target alignment. In: Proceedings of the 30th international conference on machine learning, Atlanta, Georgia. JMLR: W&CP 28(2):316–324
Cortes C, Mohri M, Rostamizadeh A (2012) Algorithms for learning kernels based on centered alignment. J Mach Learn Res 13:795–828
Cristianini N, Shawe-Taylor J, Elisseeff A, Kandola JS (2001) On kernel-target alignment. In: NIPS 2001, pp 367–373
Cuevas A (2014) A partial overview of the theory of statistics with functional data. J Stat Plan Inference 147:1–23
Devijver E (2017) Model-based regression clustering for high-dimensional data: application to functional data. Adv Data Anal Classif 11(2):243–279
Edelman A, Arias TA, Smith S (1998) The geometry of algorithms with orthogonality constraints. SIAM J Matrix Anal Appl 20(2):303–353
Ferraty F, Vieu P (2003) Curves discrimination: a nonparametric functional approach. Comput Stat Data Anal 44(1–2):161–173
Ferraty F, Vieu P (2006) Nonparametric functional data analysis: theory and practice. Springer, Berlin
Feuerverger A (1993) A consistent test for bivariate dependence. Int Stat Rev 61(3):419–433
Górecki T, Krzyśko M, Ratajczak W, Wołyński W (2016) An extension of the classical distance correlation coefficient for multivariate functional data with applications. Stat Transit 17(3):449–466
Górecki T, Krzyśko M, Wołyński W (2017) Correlation analysis for multivariate functional data. In: Palumbo F, Montanari A, Montanari M (eds) Data science. Studies in classification, data analysis, and knowledge organization. Springer, Berlin, pp 243–258
Górecki T, Krzyśko M, Waszak Ł, Wołyński W (2018) Selected statistical methods of data analysis for multivariate functional data. Stat Papers 59:153–182
Górecki T, Smaga Ł (2017) Multivariate analysis of variance for functional data. J Appl Stat 44:2172–2189
Gretton A, Bousquet O, Smola A, Schölkopf B (2005) Measuring statistical dependence with Hilbert–Schmidt norms. In: Jain S, Simon HU, Tomita E (eds) Algorithmic learning theory. Lecture notes in computer science, vol 3734. Springer, Berlin, pp 63–77
Gretton A, Fukumizu K, Teo CH, Song L, Schölkopf B, Smola AJ (2008) A kernel statistical test of independence. In: Platt JC, Koller D, Singer Y, Roweis S (eds) Advances in neural information processing systems. Curran, Red Hook, pp 585–592
Hofmann T, Schölkopf B, Smola AJ (2008) Kernel methods in machine learning. Ann Stat 36(3):1171–1220
Horváth L, Kokoszka P (2012) Inference for functional data with applications. Springer, Berlin
Hotelling H (1936) Relations between two sets of variates. Biometrika 28:321–377
Hsing T, Eubank R (2015) Theoretical foundations of functional data analysis, with an introduction to linear operators. Wiley, Hoboken
James GM, Wang JW, Zhu J (2009) Functional linear regression that's interpretable. Ann Stat 37(5):2083–2108
Kankainen A (1995) Consistent testing of total independence based on the empirical characteristic function. PhD thesis, University of Jyväskylä
Martin-Baragan B, Lillo R, Romo J (2014) Interpretable support vector machines for functional data. Eur J Oper Res 232:146–155
Mercer J (1909) Functions of positive and negative type and their connection with the theory of integral equations. Philos Trans R Soc Lond Ser A 209:415–446
R Core Team (2018) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org/
Ramsay JO, Dalzell CJ (1991) Some tools for functional data analysis (with discussion). J R Stat Soc Ser B 53(3):539–572
Ramsay JO, Silverman BW (2002) Applied functional data analysis. Springer, New York
Ramsay JO, Silverman BW (2005) Functional data analysis, 2nd edn. Springer, Berlin
Ramsay JO, Wickham H, Graves S, Hooker G (2018) fda: functional data analysis. R package version 2.4.8. https://CRAN.R-project.org/package=fda
Read T, Cressie N (1988) Goodness-of-fit statistics for discrete multivariate analysis. Springer, Berlin
Riesz F (1909) Sur les opérations fonctionnelles linéaires. Comptes rendus hebdomadaires des séances de l'Académie des sciences 149:974–977
Sejdinovic D, Sriperumbudur B, Gretton A, Fukumizu K (2013) Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann Stat 41(5):2263–2291
Schölkopf B, Smola AJ, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10:1299–1319
Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press, Cambridge
Song L, Boots B, Siddiqi S, Gordon G, Smola A (2010) Hilbert space embeddings of hidden Markov models. In: Proceedings of the 26th international conference on machine learning (ICML 2010)
Székely GJ, Rizzo ML, Bakirov NK (2007) Measuring and testing dependence by correlation of distances. Ann Stat 35(6):2769–2794
Székely GJ, Rizzo ML (2009) Brownian distance covariance. Ann Appl Stat 3(4):1236–1265
Wang T, Zhao D, Tian S (2015) An overview of kernel alignment and its applications. Artif Intell Rev 43(2):179–192
Zhang K, Peters J, Janzing D, Schölkopf B (2011) Kernel-based conditional independence test and application in causal discovery. In: Cozman FG, Pfeffer A (eds) Proceedings of the 27th conference on uncertainty in artificial intelligence. AUAI Press, Corvallis, pp 804–813

Independence test and canonical correlation analysis based on the alignment between kernel matrices for multivariate functional data


Publisher: Springer Journals
Copyright: © 2018 by The Author(s)
Subject: Computer Science; Artificial Intelligence; Computer Science, general
ISSN: 0269-2821
eISSN: 1573-7462
DOI: 10.1007/s10462-018-9666-7

Abstract

In the case of vector data, Gretton et al. (Algorithmic learning theory. Springer, Berlin, pp 63– 77, 2005) defined Hilbert–Schmidt independence criterion, and next Cortes et al. (J Mach Learn Res 13:795–828, 2012) introduced concept of the centered kernel target alignment (KTA). In this paper we generalize these measures of dependence to the case of multivariate functional data. In addition, based on these measures between two kernel matrices (we use the Gaussian kernel), we constructed independence test and nonlinear canonical variables for multivariate functional data. We show that it is enough to work only on the coefficients of a series expansion of the underlying processes. In order to provide a comprehensive comparison, we conducted a set of experiments, testing effectiveness on two real examples and artificial data. Our experiments show that using functional variants of the proposed measures, we obtain much better results in recognizing nonlinear dependence. Keywords Multivariate functional data · Functional data analysis · Correlation analysis · Canonical correlation analysis 1 Introduction The theory and practice of statistical methods in situations where the available data are functions (instead of real numbers or vectors) is often referred to as Functional Data Analysis (FDA). The term Functional Data Analysis was already used by Ramsay and Dalzell (1991) B Tomasz Górecki tomasz.gorecki@amu.edu.pl Mirosław Krzysk ´ o mkrzysko@amu.edu.pl Waldemar Wołynski ´ wolynski@amu.edu.pl Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Umultowska 87, 61-614 Poznan, ´ Poland Faculty of Management, President Stanisław Wojciechowski Higher Vocational State School, Nowy Swiat 4, 62-800 Kalisz, Poland 123 476 T. Górecki et al. two decades ago. This subject has become increasingly popular from the end of the 1990s and is now a major research field in statistics (Cuevas 2014). Good access to the large literature in this field comes from the books by Ramsay and Silverman (2002, 2005), Ferraty and Vieu (2006), and Horváth and Kokoszka (2012). Special issues devoted to FDA topics have been published by different journals, including Statistica Sinica 14(3) (2004), Computational Statistics 22(3) (2007), Computational Statistics and Data Analysis 51(10) (2007), Journal of Multivariate Analysis 101(2) (2010), Advances in Data Analysis and Classification 8(3) (2014). The range of real world applications, where the objects can be thought of as functions, is as diverse as speech recognition, spectrometry, meteorology, medicine or clients segmentation, to cite just a few (Ferraty and Vieu 2003; James et al. 2009; Martin-Baragan et al. 2014; Devijver 2017). The uncentered kernel alignment originally was introduced by Cristianini et al. (2001). Gretton et al. (2005) defined Hilbert–Schmidt Independence Criterion (HSIC) and the empir- ical HSIC. Centered kernel target alignment (KTA) was introduced by Cortes et al. (2012). This measure is a normalized version of HSIC. Zhang et al. (2011) gave an interesting kernel- based independence test. This independence testing method is closely related to the one based on the Hilbert–Schmidt independence criterion (HSIC) proposed by Gretton et al. (2008). Gretton et al. (2005) described a permutation-based kernel independence test. There is a lot of work in the literature for kernel alignment and its applications (good overview can be found in Wang et al. 2015). 
This work is devoted to a generalization of these measures of dependence to the case of multivariate functional data. In addition, based on these measures, we construct an independence test and nonlinear canonical variables for multivariate functional data. These results rest on the assumption that the applied kernel function is Gaussian. Functional HSIC and KTA canonical correlation analysis can be viewed as natural nonlinear extensions of functional canonical correlation analysis (FCCA). We therefore propose two nonlinear functional CCA extensions that capture nonlinear relationships; moreover, both algorithms are also capable of extracting linear dependency. Additionally, we show that the functional KTA coefficient is simply a normalized variant of the functional HSIC coefficient. Finally, we propose an interpretation of the module weighting functions associated with the functional canonical correlations.

Section 2 provides an overview of centered alignment measures for random vectors: kernel function alignment, kernel matrix alignment and the Hilbert–Schmidt Independence Criterion (HSIC) are defined and the connections between them are shown; Sect. 2.4 presents the resulting kernel-based independence test, and Sect. 2.5 recalls basic facts about functional data. Functional data can be seen as values of random processes; in our paper the multivariate random processes $X$ and $Y$ have the representation (8) in finite-dimensional subspaces of the spaces of square-integrable functions on given intervals. Section 3 discusses the concept of alignment for multivariate functional data: the kernel function, the alignment between two kernel functions, the centered kernel alignment (KTA) between two kernel matrices and the empirical HSIC are defined, and the HSIC is used as the basis for an independence test (Sect. 3.2). In Sect. 3.3, based on the concept of alignment between kernel matrices, nonlinear canonical variables are constructed; this is a generalization of the results of Chang et al. (2013) for random vectors. In Sect. 4 we present one artificial and two real examples which confirm the usefulness of the proposed coefficients in detecting nonlinear dependency between groups of variables.

2 An overview of kernel alignment and its applications

We introduce the following notational convention. Throughout this section, $X$ and $Y$ are random vectors with domains $\mathbb{R}^p$ and $\mathbb{R}^q$, respectively. Let $P_{X,Y}$ be a joint probability measure on $(\mathbb{R}^p \times \mathbb{R}^q, \Sigma_p \times \Sigma_q)$ (here $\Sigma_p$ and $\Sigma_q$ are the Borel $\sigma$-algebras on $\mathbb{R}^p$ and $\mathbb{R}^q$, respectively), with associated marginal probability measures $P_X$ and $P_Y$.

Definition 1 (Kernel function, Shawe-Taylor and Cristianini 2004) A kernel is a function $k$ that for all $x, x' \in \mathbb{R}^p$ satisfies $k(x, x') = \langle \varphi(x), \varphi(x') \rangle$, where $\varphi$ is a mapping from $\mathbb{R}^p$ to an inner product feature space $H$, $\varphi : x \mapsto \varphi(x) \in H$. We call $\varphi$ a feature map. A kernel function can be interpreted as a kind of similarity measure between the vectors $x$ and $x'$.

Definition 2 (Gram matrix, Mercer 1909; Riesz 1909; Aronszajn 1950) Given a kernel $k$ and inputs $x_1, \ldots, x_n \in \mathbb{R}^p$, the $n \times n$ matrix $K$ with entries $K_{ij} = k(x_i, x_j)$ is called the Gram matrix (kernel matrix) of $k$ with respect to $x_1, \ldots, x_n$.

Definition 3 (Positive semi-definite matrix, Hofmann et al.
2008) A real n × n symmetric matrix K K K with entries K satisfying ij n n c c K ≥ 0 i j ij i =1 j =1 for all c ∈ R is called positive semi-definite. Definition 4 (Positive semi-definite kernel, Mercer 1909; Hofmann et al. 2008) A function p p p k : R × R → R which for all n ∈ N, xxx ∈ R , i = 1,..., n gives rise to a positive semi-definite Gram matrix is called a positive semi-definite kernel. This raises an interesting question: given a function of two variables k(xxx , xxx ), does there ϕ x x x ϕ x ϕ x exist a function ϕ ϕ(xx ) such that k(xx , xx ) =ϕ ϕ(xx ), ϕ ϕ(xx ) ? The answer is provided by Mercer’s theorem (1909) which says, roughly, that if k is positive semi-definite then such a ϕ exists. p p Often, we will not known φ φ φ, but a kernel function k : R × R → R that encodes the inner product in H, instead. Popular positive semi-definite kernel functions on R include the polynomial kernel of d   2 degree d > 0, k(xxx , xxx ) = (1 + xxx xxx ) , the Gaussian kernel k(xxx , xxx ) = exp(−λ xxx − xxx ), λ> 0, and the Laplace kernel k(xxx , xxx ) = exp(−λ xxx − xxx ), λ> 0. In this paper we use, the Gaussian kernel. We start with the definition of centering and the analysis of its relevant properties. 123 478 T. Górecki et al. 2.1 Centered kernel functions A feature mapping φ φ : R → H is centered by subtracting from it its expectation, that is transforming φ φ φ(xxx ) to φ φ φ(xxx ) = φ φ φ(xxx ) − E [φ φ φ(X X X )],where E denotes the expected value X X X X X X of φ φ φ(X X X ) when X X X is distributed according to P . Centering a positive semi-definite kernel X X X p p function k : R × R → R consists centering in the feature mapping φ φ φ associated to k. Thus, the the centered kernel k associated to k is defined by k(xxx , xxx ) =φ φ φ(xxx ) − E [φ φ φ(X X X )],φ φ φ(xxx ) − E [φ φ φ(X X X )] X X X X X X = k(xxx , xxx ) − E [k(X X X , xxx )]− E  [k(xxx , X X X )]+ E  [k(X X X , X X X )], X X X X X X X X X ,X X X assuming the expectations exist. Here, the expectation is taken over independent copies X X, X X X distributed according to P .Wesee that, k is also a positive semi-definite kernel. Note X X X ˜ ˜ X X also that for a centered kernel k,E [k(X X , X X )]= 0, that is, centering the feature mapping X X X ,X X X implies centering the kernel function. 2.2 Centered kernel matrices Let {xxx ,..., xxx } be a finite subset of R . A feature mapping φ φ φ(xxx ), i = 1,..., n, is centered 1 n i by subtracting from it its empirical expectation, i.e., leading to φ φ φ(xxx ) = φ φ φ(xxx ) − φ φ φ,where i i φ φ φ = φ φ φ(xxx ). The kernel matrix K K K = (K ) associated to the kernel function k and the i ij n i =1 x x K set {xx ,..., xx } is centered by replacing it with K K = (K ) defined for all i , j = 1, 2,..., n 1 n ij by n n n 1 1 1 K = K − K − K + K , (1) ij ij ij ij ij n n n i =1 j =1 i , j =1 where K = k(xxx , xxx ), i , j = 1,..., n. ij i j The centered kernel matrix K K is a positive semi-definite matrix. Also, as with the kernel 1 n function K = 0. 2 ij i , j Let ·, · denote the Frobenius product and · the Frobenius norm defined for all F F n×n A A A, B B B ∈ R by A A A, B B B = tr(A A A B B B), 1/2 A A A A A = (A A, A A ) . F F n×n Then, for any kernel matrix K K K ∈ R , the centered kernel matrix K K K can be expressed as follows (Schölkopf et al. 1998): K K K = H H HK K KH H H , (2) 1  n×1 where H H H = III − 1 1 1 1 1 1 ,1 1 1 ∈ R denote the vector with all entries equal to one, and III n n n n the identity matrix of order n. The matrix H H H is called “centering matrix”. 
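To make the centering operation concrete, the following short R sketch builds a Gaussian kernel matrix for a toy data set and centers it according to (2); the bandwidth and the simulated data are arbitrary choices made only for illustration and are not taken from the paper.

```r
# Minimal sketch: Gaussian kernel matrix and its centered version K~ = H K H, cf. (2).
gaussian_kernel <- function(X, lambda) {
  D2 <- as.matrix(dist(X))^2           # squared Euclidean distances ||x_i - x_j||^2
  exp(-lambda * D2)
}

center_kernel <- function(K) {
  n <- nrow(K)
  H <- diag(n) - matrix(1 / n, n, n)   # centering matrix H = I - (1/n) 1 1'
  H %*% K %*% H
}

set.seed(1)
X  <- matrix(rnorm(20 * 3), nrow = 20) # 20 observations of a 3-dimensional random vector
K  <- gaussian_kernel(X, lambda = 0.5) # lambda is an arbitrary illustrative bandwidth
Kc <- center_kernel(K)
sum(Kc)                                # approximately zero, as required of a centered kernel matrix
```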
Since H H H is the idempotent matrix (H H H = H H H), then we get for any two kernel matrices K K K p q and L L L based on the subset {xxx ,..., xxx } of R and the subset {yyy ,..., yyy } of R , respectively, 1 n 1 n K K K , L L L =K K K , L L L =K K K , L L L . (3) F F F 123 Independence test and CCA for multivariate functional data 479 2.3 Centered kernel alignment Definition 5 (Kernel function alignment, Cristianini et al. 2001;Cortesetal. 2012)Let k p p q q and l be two kernel functions defined over R × R and R × R , respectively, such that 2  2 ˜ ˜ 0 < E  [k (X X X , X X X )] < ∞ and 0 < E  [l (Y Y Y , Y Y Y )] < ∞,where X X X , X X X and Y Y Y , Y Y Y are X X X ,X X X Y Y Y ,Y Y Y independent copies distributed according to P and P , respectively. Then the alignment X X X Y Y Y between k and l is defined by ˜ ˜ X X Y Y E [k(X X , X X )l(Y Y , Y Y )] X X X ,X X X ,Y Y Y ,Y Y Y ρ(k, l) =  . 2  2 ˜ ˜ E  [k (X X X , X X X )] E  [l (Y Y Y , Y Y Y )] X X X X ,X X Y Y Y ,Y Y Y We can define similarly the alignment between two kernel matrices K K K and L L L based on the finite subset {xxx ,..., xxx } and {yyy ,..., yyy }, respectively. 1 n 1 n n×n n×n K L Definition 6 (Kernel matrix alignment, Cortes et al. 2012)Let K K ∈ R and L L ∈ R be two kernel matrices such that K K K = 0and L L L = 0. Then, the centered kernel target F F K L alignment (KTA) between K K and L L is defined by K K K , L L L ρ( ˆ K K K , L L L) = . (4) K L K K L L F F K L K L Here, by the Cauchy–Schwarz inequality, ρ( ˆ K K , L L) ∈[−1, 1] and in fact ρ( ˆ K K , L L) ∈[0, 1] ˜ ˜ when K K K and L L L are the kernel matrices of the positive semi-definite kernel k and l. Gretton et al. (2005) defined Hilbert–Schmidt Independence Criterion (HSIC) as a test statistic to distinguish between null hypothesis H : P = P P (equivalently we may 0 X X X ,Y Y Y X X X Y Y Y write X X X ⊥ ⊥Y Y Y ) and alternative hypothesis H : P = P P . 1 X X X ,Y Y Y X X X Y Y Y Definition 7 (Reproducing kernel Hilbert space,Riesz 1909; Mercer 1909; Aronszajn 1950) Consider a Hilbert space H of functions from R to R.Then H is a reproducing kernel Hilbert space (RKHS) if for each xx ∈ R , the Dirac evaluation operator δ : H → R,which xxx maps f ∈ H to f (xxx ) ∈ R, is a bounded linear functional. p  p x x φ x φ x x x Let ϕ : R → H be a map such that for all xx , xx ∈ R we have φ φ(xx ), φ φ(xx ) = k(xx , xx ), p p where k : R ×R → R is a unique positive semi-definite kernel. We will require in particular that H be separable (it must have a complete, countable orthonormal system). We likewise define a second separable RKHS G, with kernel l(·, ·) and feature map ψ ψ ψ, on the separable space R . We may now define the mean elements μ and μ with respect to the measures P and X X X Y Y Y X X X P as those members of H and G, respectively, for which Y Y Y μ , f  = E [φ φ φ(X X X ), f  ]= E [ f (X X X )], X H X H X X X X X X X μ , g = E [ψ ψ ψ(Y Y Y ), g ]= E [g(Y Y Y )], Y Y Y G Y Y Y G Y Y Y for all functions f ∈ H, g ∈ G,where φ φ φ is the feature map from R to the RKHS H,and ψ ψ ψ maps from R to G and assuming the expectations exist. Finally, μ can be computed by applying the expectation twice via X X μ = E  [φ φ φ(X X X ), φ φ φ(X X X ) ]= E  [k(X X X , X X X )], X X X H X X X X X X ,X X X X ,X X assuming the expectations exist. The expectation is taken over independent copies X X X , X X X distributed according to P . The means μ , μ exist when positive semi-definite kernels k X X X X X X Y Y Y and l are bounded. 
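Given two centered kernel matrices, the alignment (4) is a one-line computation with the Frobenius inner product. The sketch below reuses gaussian_kernel() and center_kernel() from the previous snippet; the toy samples are again arbitrary and serve only to contrast a dependent and an independent pair.

```r
# Minimal sketch of the centered kernel target alignment (4);
# reuses gaussian_kernel() and center_kernel() defined above.
kta <- function(K, L) {
  Kc <- center_kernel(K)
  Lc <- center_kernel(L)
  sum(Kc * Lc) / (sqrt(sum(Kc * Kc)) * sqrt(sum(Lc * Lc)))  # <Kc,Lc>_F / (||Kc||_F ||Lc||_F)
}

set.seed(2)
X <- matrix(rnorm(50 * 2), nrow = 50)
Y <- X^2 + matrix(rnorm(50 * 2, sd = 0.1), nrow = 50)  # nonlinearly dependent on X
Z <- matrix(rnorm(50 * 2), nrow = 50)                  # independent of X
kta(gaussian_kernel(X, 1), gaussian_kernel(Y, 1))      # noticeably larger ...
kta(gaussian_kernel(X, 1), gaussian_kernel(Z, 1))      # ... than for the independent pair
```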
We are now in a position to define the cross-covariance operator. 123 480 T. Górecki et al. Definition 8 (Cross-covariance operator, Gretton et al. 2005) The cross-covariance operator p q C C C : G → H associated with the joint probability measure P on (R × R ,  × )is X X X ,Y Y Y X X X ,Y Y Y a linear operator C C : G → H defined as X X X ,Y Y Y C X Y C C = E [φ(X X ) ⊗ ψ(Y Y )]− μ ⊗ μ , X X X ,Y Y Y X X X ,Y Y Y X X X Y Y Y for all f ∈ H and g ∈ G, where the tensor product operator f ⊗ g : G → H, f ∈ H, g ∈ G, is defined as ( f ⊗ g)h = f g, h , for all h ∈ G. This is a generalization of the cross-covariance matrix between random vectors. Moreover, by the definition of the Hilbert–Schmidt (HS) norm, we can compute the HS norm of f ⊗ g via 2 2 2 f ⊗ g = f g . HS H G Definition 9 (Hilbert–Schmidt Independence Criterion, Gretton et al. 2005) Hilbert–Schmidt Independence Criterion (HSIC) is the squared Hilbert–Schmidt norm (or Frobenius norm) p q of the cross-covariance operator associated with the probability measure P on (R × R , X X X ,Y Y Y × ): HSIC(P ) = C C C . X X X ,Y Y Y X X X ,Y Y Y To compute it we need to express HSIC in terms of kernel functions (Gretton et al. 2005): HSIC(P ) = E   [k(X X X , X X X )l(Y Y Y , Y Y Y )] X X X ,Y Y Y X X X ,X X X ,Y Y Y ,Y Y Y + E  [k(X X X , X X X )] E  [l(Y Y Y , Y Y Y )] X X X ,X X X Y Y Y ,Y Y Y − 2E [E  [k(X X X , X X X )] E  [l(Y Y Y , Y Y Y )]]. (5) X X X ,Y Y Y X Y X X Y Y Here E   denotes the expectation over independent pairs (X X X , Y Y Y ) and (X X X , Y Y Y ) dis- X X X ,X X X ,Y Y Y ,Y Y Y tributed according to P . X X X ,Y Y Y It follows from (5) that the Frobenius norm of C C C exists when the various expectations X X X ,Y Y Y over the kernels are bounded, which is true as long as the kernels k and l are bounded. Definition 10 (Empirical HSIC, Gretton et al. 2005)Let S ={(xxx , yyy ), ...,(xxx , yyy )}⊆ 1 1 n n p q R × R be a series of n independent observations drawn from P . An estimator of HSIC, X X X ,Y Y Y written HSIC(S), is given by HSIC(S) = K K K , L L L , (6) n×n where K K K = (k(xxx , xxx )), L L L = (l(yyy , yyy )) ∈ R . i j i j Comparing (4)and (6) and using (3), we see that the centered kernel target alignment (KTA) is simply a normalized version of HSIC(S). In two seminar papers on Székely et al. (2007) and Székely and Rizzo (2009) introduced the distance covariance (dCov) and distance correlation (dCor) as powerful measures of dependence. p q s t s t For column vectors ss ∈ R and tt ∈ R ,denoteby ss and tt the standard Euclidean p q norms on the corresponding spaces. For jointly distributed random vectors X X X ∈ R and Y Y ∈ R ,let s t s X t Y f (ss, tt ) = E {exp[i ss, X X  + i tt , Y Y  ]}, X X X ,Y Y Y X X X ,Y Y Y p q 123 Independence test and CCA for multivariate functional data 481 X Y s s 0 t be the joint characteristic function of (X X , Y Y ),and let f (ss) = f (ss, 00) and f (tt ) = X X X X X X ,Y Y Y Y Y Y p q ϕ (0 0 0, ttt ) be the marginal characteristic functions of X X X and Y Y Y,where sss ∈ R and ttt ∈ R . X X X ,Y Y Y X Y X Y The distance covariance between X X and Y Y is the nonnegative number ν(X X , Y Y ) defined by s t s t 1 | f (ss, tt ) − f (ss) f (tt )| X X X ,Y Y Y X X X Y Y Y ν (X X X , Y Y Y ) = dsssdttt , p+1 q+1 C C p+q p q R sss ttt p q and |z| denotes the modulus of z ∈ C and (p+1) C = . 
( (p + 1)) X Y The distance correlation between X X and Y Y is the nonnegative number defined by ν(X X X , Y Y Y ) X Y R(X X , Y Y ) = √ ν(X X X , X X X )ν(Y Y Y , Y Y Y ) if both ν(X X X , X X X ) and ν(Y Y Y , Y Y Y ) are strictly positive, and defined to be zero otherwise. For distributions with finite first moments, the distance correlation characterizes independence in that 0 ≤ R(X X X , Y Y Y ) ≤ 1 with R(X X X , Y Y Y ) = 0 if and only if X X X and Y Y Y are independent. Sejdinovic et al. (2013) demonstrated that distance covariance is an instance of the Hilbert– Schmidt Independence Criterion. Górecki et al. (2016, 2017) showed an extension of the distance covariance and distance correlation coefficients to the functional case. 2.4 Kernel-based independence test Statistical tests of independence have been associated with a broad variety of dependence measures. Classical tests such as Spearman’s ρ and Kendall’s τ are widely applied, however they are not guaranteed to detect all modes of dependence between the random variables. Contingency table-based methods, and in particular the power-divergence family of test statistics (Read and Cressie 1988) are the best known general purpose tests of independence, but are limited to relatively low dimensions, since they require a partitioning of the space in which random variable resides. Characteristic function-based tests (Feuerverger 1993; Kankainen 1995) have also been proposed. They are more general than kernel-based tests, although to our knowledge they have been used only to compare univariate random vari- ables. Now, we describe how HSIC can be used as an independence measure, and as the basis for an independence test. We begin by demonstrating that the Hilbert–Schmidt norm can be used as a measure of independence, as long as the associated RKHSs are universal. A continuous kernel k on a compact metric space is called universal if the corresponding RKHS H is dense in the class of continuous functions of the space. Denote by H, G RKHSs with universal kernels k, l on the compact domains X and Y respectively. We assume without loss of generality that f ≤ 1and g ≤ 1for all ∞ ∞ f ∈ H and g ∈ G. Then Gretton et al. (2005) proved that C C C = 0 if and only if X X X ,Y Y Y HS X X X and Y Y Y are independent. Examples of universal kernels are Gaussian kernel and Laplacian kernel, while the linear kernel k(xxx , xxx ) = xxx xxx is not universal—the corresponding HSIC tests only linear relationships, and a zero cross-covariance matrix characterizes independence only for multivariate Gaussian distributions. Working with the infinite dimensional operator with universal kernels, allows us to identify any general nonlinear dependence (in the limit) between any pair of vectors, not just Gaussians. 123 482 T. Górecki et al. We recall that in this paper we use the Gaussian kernel. We now consider the asymptotic distribution of statistics (6). X Y X Y We introduce the null hypothesis H : X X ⊥ ⊥Y Y (X X is independent of Y Y , i.e., P = P P ). 0 X X X ,Y Y Y X X X Y Y Y Suppose that we are given the i.i.d. samples S ={xxx ,..., xxx } and S ={yyy ,..., yyy } x y xx 1 n yy 1 n X Y K L for X X and Y Y , respectively. Let K K and L L be the centered kernel matrices associated to the kernel function k and the sets S and S , respectively. Let λ ≥ λ ≥ ··· ≥ λ ≥ 0be xxx yyy 1 2 n the eigenvalues of the matrix K K K and let v v v ,...,v v v be a set of orthonormal eigenvectors 1 n corresponding to these eigenvalues. 
Let λ ≥ λ ≥ ··· ≥ λ ≥ 0 be the eigenvalues of 1 2 the matrix L L L and let v v v ,...,v v v be a set of orthonormal eigenvectors corresponding to these 1 n eigenvalues. Let  = diag(λ ,...,λ ),  = diag(λ ,...,λ ), V V V = (v v v ,...,v v v ) and 1 n 1 n V V V = (v v v ,...,v v v ). Suppose further that we have the eigenvalue decomposition (EVD) of 1 n the centered kernel matrices K K K and L L L, i.e., K K K = V V V   V V V and L L L = V V V    (V V V ) . 1/2      1/2 Let    = (   ,...,   ) = V V V    and    = (   ,...,   ) = V V V (   ) , i.e.,    = 1 n i 1 n v  v λ v v ,   = λ v v , i = 1,..., n. i i i i i The following result is true (Zhang et al. 2011): under the null hypothesis that X X X and Y Y Y are independent, the statistic (6) has the same asymptotic distribution as Z = λ λ Z , (7) n i ,n j ,n ij i , j =1 2 2 where Z are i.i.d. χ -distributed variables, n →∞. ij 1 Note that the data-based test statistic HSIC (or its probabilistic counterpart) is sensible to dependence/independence and therefore can be used as a test statistic. Also important is the knowledge of its asymptotic distribution. These facts inspire the following depen- dence/independence testing procedure. Given the sample S and S , one first calculates x y xx yy K L the centered kernel matrices K K and L L and their eigenvalues λ and λ , and then evalu- ates the statistic HSIC(S) according to (6). Next, the empirical null distribution of Z under the null hypothesis can be simulated in the following way: one draws i.i.d. random sam- 2 2 ples from the χ -distributed variables Z , and then generates samples for Z according to 1 ij (7). Finally the p value can be found by locating HSIC(S) in the simulated null distribu- tion. A permutation-based test is described in Gretton et al. (2005). In the first step they propose to calculate the test statistic T (HSIC or KTA) for the given data. Next, keeping the order of the first sample we randomly permute the second sample a large number of times, and recompute the selected statistic each time. This destroy any dependence between samples simulating a draw from the product of marginals, making the empirical distribution of the permuted statistics behave like the null distribution of the test statistic. For a specified significance level α, we calculate threshold t in the right tail of the null distribution. We reject H if α 0 T > t . This test was proved to be consistent against any fixed alternative. It means that for any fixed significance level α, the power goes to 1 as the sample size tends to infinity. 2.5 Functional data In recent years methods for representing data by functions or curves have received much attention. Such data are known in the literature as the functional data (Ramsay and Silverman 2005; Horváth and Kokoszka 2012; Hsing and Eubank 2015). Examples of functional data can be found in various application domains, such as medicine, economics, meteorology and many others. Functional data can be seen as the values of random process X (t ). In practice, the 123 Independence test and CCA for multivariate functional data 483 values of the observed random process X (t ) are always recorded at discrete times t ,..., t , 1 J less frequently or more densely spaced in the range of variability of the argument t.Sowe have a time series {x (t ), ..., y(t )}. However, there are many reasons to model these series 1 J as elements of functional space., because the functional data has many advantages over other ways of representing the time series. 1. 
They easily cope with the problem of missing observations, an inevitable problem in many areas of research. Unfortunately, most data analysis methods require complete time series. One solution is to delete a time series that has missing values from the data, but this can lead to , and generally leads to, loss of information. Another option is to use one of many statistical methods to predict the missing values, but then the results will depend on the interpolation method. In contrast to this type of solutions, in the case of functional data, the problem of missing observations is solved by expressing time series in the form of a set of continuous functions. 2. The functional data naturally preserve the structure of observations, i.e. they maintain the time dependence of the observations and take into account the information about each measurement. 3. The moments of observations do not have to be evenly spaced in individual time series. 4. Functional data avoids the curse of dimensionality. When the number of time points is greater than the number of time series considered, most statistical methods will not give satisfactory results due to overparametrization. In the case of functional data, this problem can be avoided because the time series are replaced with a set of continuous functions independent of the number of time points in which observations are measured. In most of the papers on functional data analysis, objects are characterized by only one feature observed at many time points. In several applications there is a need to use statistical methods for objects characterized by many features observed at many time points (double multivariate data). In this case, such data are transformed into multivariate functional data. Let us assume that X X X = (X ,..., X ) ={X X X (s), s ∈ I }∈ L (I ) and Y Y Y = 1 p 1 1 (Y ,..., Y ) ={Y Y Y (t ), t ∈ I }∈ L (I ) are random processes, where L (I ) is a space 1 q 2 2 2 of square integrable functions on the interval I.Wealsoassume that E(X X X (s)) = 0 0 0, s ∈ I , E(Y Y Y (t )) = 0 0 0, t ∈ I . 1 2 We will further assume that each component X of the random process X X X and Y of the g h random process Y Y can be represented by a finite number of orthonormal basis functions {ϕ } and {ϕ } of space L (I ) and L (I ), respectively: f 2 1 2 2 X (s) = α ϕ (s), s ∈ I , g = 1, 2,..., p, g ge e 1 e=0 Y (t ) = β ϕ (t ), t ∈ I , h = 1, 2,..., q, h hf f 2 f =0 where α and β are the random coefficients. The degree of smoothness of processes X ge hf g and Y depends on the values E and F respectively (small values imply more smoothing). h g h The optimum values for E and F are selected using Bayesian Information Criterion (BIC) g h (see Górecki et al. 2018). As basis functions we can use e.g. the Fourier basis system or spline functions. We introduce the following notation: α α = (α , ..., α , ..., α , ..., α ) , 10 1E p0 pE 1 p 123 484 T. Górecki et al. β β = (β , ..., β , ..., β , ..., β ) , 10 1F q0 qF 1 q ϕ ϕ ϕ (s) = (ϕ (s), ...,ϕ (s)) , s ∈ I , g = 1, 2, ..., p, E 0 E 1 g g ϕ ϕ ϕ (t ) = (ϕ (t), ...,ϕ (t )) , F 0 F h h ⎡ ⎤ ϕ ϕ ϕ (s) 0 0 0 ... 0 0 0 ⎢ ⎥ 0 ϕ 0 00 ϕ ϕ (s) ... 00 ⎢ 2 ⎥ t ∈ I , h = 1, 2, ..., q,   (s) = , 2 1 ⎣ ⎦ ... ... ... ... 0 000 0 0 ... ϕ ϕ ϕ (s) ⎡ ⎤ ϕ ϕ ϕ (t ) 0 0 0 ... 0 0 0 ⎢ ⎥ 0 0 0 ϕ ϕ ϕ (t) ... 0 0 0 ⎢ 2 ⎥ (t ) = , ⎣ ⎦ ... ... ... ... 0 000 0 0 ... ϕ ϕ ϕ (t ) K +p K +q p+(K +p) q+(K +q) 1 2 1 2 α β where α α ∈ R , β β ∈ R ,   ∈ R ,  ∈ R , K = E + ··· + E , 1 2 1 1 p K = F + ··· + F . 
2 1 p X Y Using the above matrix notation the random processes X X and Y Y can be represented as: X  α Y  β X X (s) =   (s)α α, s ∈ I , Y Y (t ) =   (t )β β, t ∈ I , (8) 1 1 2 2 where E(α α α) = 0 00, E(β β β) = 0 00. This means that the values of random processes X X X andY Y Y are in finite dimensional subspaces p q p q of L (I ) and L (I ), respectively. We will denote these subspaces by L (I ) and L (I ). 1 2 1 2 2 2 2 2 Typically data are recorded at discrete moments in time. The process of transformation of discrete data to functional data is performed for each realization and each variable separately. Let x denote an observed value of the feature X , g = 1, 2,... p at the jth time point s , gj g j where j = 1, 2, ..., J . Similarly, let y denote an observed value of feature Y , h = 1, 2,... q hj h at the jth time point t ,where j = 1, 2, ..., J . Then our data consist of pJ pairs of (s , x ) j j gj X X Y Y and of qJ pairs of (t , y ).Let X X ,..., X X and Y Y ,..., Y Y be independent trajectories of j hj 1 n 1 n random processes X X X and Y Y Y having the representation (8). The coefficients α α α and β β β are estimated by the least squares method. Let us denote these i i estimates by a a a and b b b , i = 1, 2,..., n. i i As a result, we obtain functional data of the form: X X X (s) =    (s)a a a , Y Y Y (t ) =    (t )b b b , (9) i 1 i i 2 i K +p K +q 1 2 a b where s ∈ I , t ∈ I , aa ∈ R , bb ∈ R , K = E + ··· + E , K = F + ··· + F , 1 2 i i 1 1 p 2 1 q and i = 1, 2,..., n. Górecki and Smaga (2017) described a multivariate analysis of variance (MANOVA) for functional data. In the paper by Górecki et al. (2018), three basic methods of dimension reduc- tion for multidimensional functional data are given: principal component analysis, canonical correlation analysis, and discriminant coordinates. 3 Alignment for multivariate functional data 3.1 The alignment between two kernel functions and two kernel matrices for multivariate functional data p p Let xxx (s) ∈ L (I ), s ∈ I ,where L (I ) is a finite-dimensional space of continuous square- 1 1 1 2 2 integrable vector functions over interval I . 123 Independence test and CCA for multivariate functional data 485 Let p p k : L (I ) × L (I ) → R 1 1 2 2 be a kernel function on L (I ). As already mentioned, in this paper we use the Gaussian kernel. For the multivariate functional data this kernel has the form: k (xxx (s), xxx (s)) = exp(−λ xxx (s) − xxx (s) ), λ > 0. 1 1 But from (9), and by the orthonormality of the basis functions, we have: xxx (s) − xxx (s) = (xxx (s) − xxx (s)) (xxx (s) − xxx (s))ds = a a a − a a a . Hence k (xxx (s), xxx (s)) = k(a a a, a a a ) and k (yyy(t ), yyy (t )) = k(b b b, b b b ), a b x where aa and bb are vectors occurring in the representation (9) of vector functions xx (s), s ∈ I , yyy(t ), t ∈ I . x x For a given subset {xx (s), ..., xx (s)} of L (I ) and the given kernel function k on 1 n 1 p p L (I ) × L (I ), the matrix K K K of size n × n, which has its (i , j )th element K (s),given 1 1 2 2 ij by K (s) = k (xxx (s), xxx (s)), s ∈ I , is called the kernel matrix of the kernel function k i j 1 ij with respect to the set {xxx (s), ..., xxx (s)}, s ∈ I . 
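The practical consequence of (9) and of the orthonormality of the basis is that every kernel evaluation can be carried out on the basis-coefficient vectors alone. The sketch below illustrates this for a single univariate component: it builds a Fourier design matrix on an equally spaced grid (assuming an odd number of basis functions), estimates the coefficients of two sampled curves by least squares, and evaluates the Gaussian kernel directly from the coefficient vectors; the grid, the basis size and the bandwidth are illustrative choices only.

```r
# Minimal sketch: least-squares Fourier coefficients of sampled curves, cf. (9),
# and the Gaussian kernel evaluated directly on the coefficient vectors.
fourier_design <- function(t, nbasis, period = diff(range(t))) {
  # assumes an odd number of basis functions: constant term plus sine/cosine pairs
  B <- matrix(1 / sqrt(period), nrow = length(t), ncol = nbasis)
  for (j in seq_len((nbasis - 1) / 2)) {
    w <- 2 * pi * j / period
    B[, 2 * j]     <- sqrt(2 / period) * sin(w * t)
    B[, 2 * j + 1] <- sqrt(2 / period) * cos(w * t)
  }
  B
}
ls_coef <- function(y, B) as.vector(solve(crossprod(B), crossprod(B, y)))  # least squares

set.seed(3)
tt <- seq(0, 1, length.out = 100)              # illustrative sampling grid
y1 <- sin(2 * pi * tt) + rnorm(100, sd = 0.1)  # two sampled univariate curves
y2 <- cos(2 * pi * tt) + rnorm(100, sd = 0.1)
B  <- fourier_design(tt, nbasis = 15)
a1 <- ls_coef(y1, B)
a2 <- ls_coef(y2, B)
exp(-0.5 * sum((a1 - a2)^2))   # Gaussian kernel between the two curves, bandwidth 0.5
```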
1 n 1 ˜ ˜ Definition 11 (Kernel function alignment for functional data)Let k and l be two kernel p p q q functions defined over L (I ) × L (I ) and L (I ) × L (I ), respectively, such that 0 < 1 1 2 2 2 2 2 2 2 2 ˜ ˜ E  [k (X X X , X X X )] < ∞ and 0 < E  [l (Y Y Y , Y Y Y )] < ∞,where X X X , X X X and Y Y Y , Y Y Y are X X X ,X X X Y Y Y ,Y Y Y independent copies distributed according to P and P , respectively. Then the alignment X X X Y Y Y ˜ ˜ between k and l is defined by ˜ ˜ E [k (X X X , X X X )l (Y Y Y , Y Y Y )] X X X ,Y Y Y ˜ ˜ ρ(k , l ) =  . (10) 2 2 ˜ ˜ X X Y Y E [k (X X , X X )] E [l (Y Y , Y Y )] X X X Y Y Y K L We can define similarly the alignment between two kernel matrices K K and L L based on p q the subset {xxx (s), ..., xxx (s)}, s ∈ I ,and {yyy (t), ..., yyy (t )}, t ∈ I ,of L (I ) and L (I ), 1 n 1 1 n 2 1 2 2 2 respectively. n×n n×n K L Definition 12 (Kernel matrix alignment for functional data)Let K K ∈ R and L L ∈ R be two kernel matrices such that K K K = 0and L L L = 0. Then, the centered kernel F F target alignment (KTA) between K K K and L L L is defined: K K K , L L L ρ( ˆ K K K , L L L ) = . (11) K K K L L L F F K L K L If K K and L L are positive semi-definite matrices, then ρ( ˆ K K , L L ) ∈[0, 1].Wehave ρ( ˆ K K K , L L L ) =ˆ ρ(K K K , L L L), where K K K is the matrix of size n × n, which has its (i , j )th element K ,given by K = ij ij a a k(aa , aa ). i j 123 486 T. Górecki et al. 3.2 Kernel-based independence test for multivariate functional data Definition 13 (Empirical HSIC for functional data) The empirical HSIC for functional data is defined as HSIC(S ) = K K K , L L L  , where S ={(xxx (s), yyy (t )), . . . , (xxx (s), yyy (t ))}, s ∈ I , t ∈ I , K K K and L L L are kernel 1 1 n n 1 2 matrices based on the subsets {xxx (s), ..., xxx (s)}, s ∈ I ,and {yyy (t), ..., yyy (t )}, t ∈ I of 1 n 1 1 n 2 p q L (I ) and L (I ), respectively. 1 2 2 2 But K K K = K K K,where K K K is the kernel matrix of size n × n, which has its (i , j )th element K ij given by K = k(a a a , a a a ),where a a a ,..., a a a are vectors occurring in the representation (9) ij i j 1 n vector functions X X X (s), s ∈ I . Analogously, L L L = L L L,where L L L is the kernel matrix of size n × n, which has its (i , j )th element L given by L = l(b b b , b b b ),where b b b ,..., b b b are ij ij i j 1 n vectors occurring in the representation (9) vector functions Y Y Y (t ), t ∈ I . Hence HSIC(S ) = HSIC(S ), where S ={(a a a , b b b ), ...,(a a a , b b b )}. v 1 1 n n Note also that the null hypothesis H : X X X ⊥Y Y Y of independence of the random processes X X X and Y Y Y is equivalent to the null hypothesis H : α α α⊥β β β of independence of random vectors α α α and β β β occurring in the representation (8) random processes X X X and Y Y Y . We can therefore use the tests described in Section 2.4, replacing xxx and yyy by a a a and b b b. 3.3 Canonical correlation analysis based on the alignment between kernel matrices for multivariate functional data In classical canonical correlation analysis (Hotelling 1936), we are interested in the relation- ship between two random vectors X X X and Y Y Y . In the functional case we are interested in the relationship between two random functions X X X and Y Y Y . 
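Before constructing the canonical variables, it is worth making the test of Sect. 3.2 concrete. Since the functional statistic coincides with the vector statistic computed on the coefficient matrices, the permutation procedure of Sect. 2.4 can be sketched as below; gaussian_kernel() and center_kernel() are the helpers from the earlier snippets, the normalizing constant of the HSIC estimator does not affect the permutation p value, and the number of permutations and the toy data are arbitrary choices.

```r
# Minimal sketch of the permutation test of Sect. 3.2: the statistic is the empirical
# HSIC computed on the basis-coefficient matrices A and B (rows a_i' and b_i').
hsic <- function(K, L) {
  n <- nrow(K)
  sum(center_kernel(K) * center_kernel(L)) / n^2   # normalization is irrelevant for the test
}

hsic_perm_test <- function(A, B, lambda_a, lambda_b, n_perm = 999) {
  K <- gaussian_kernel(A, lambda_a)
  L <- gaussian_kernel(B, lambda_b)
  stat <- hsic(K, L)
  null_stats <- replicate(n_perm, {
    idx <- sample(nrow(L))          # permuting the second sample permutes rows and columns of L
    hsic(K, L[idx, idx])
  })
  mean(c(null_stats, stat) >= stat) # permutation p value
}

set.seed(4)
A <- matrix(rnorm(40 * 5), nrow = 40)                 # stand-in coefficient matrices
B <- A^2 + matrix(rnorm(40 * 5, sd = 0.2), nrow = 40)
hsic_perm_test(A, B, lambda_a = 0.1, lambda_b = 0.1)  # small p value indicates dependence
```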
Functional canonical variables U and V for random processes X X X and Y Y Y are defined as follows U =u u u, X X X= u u u (s)X X X (s)ds, (12) v Y v Y V =v v, Y Y= v v (t )Y Y (t )dt , (13) u v where the vector functions uu and v v are called the vector weight functions and are of the form K +p K +q 1 2 u u u(s) =    (s)u u u,v v v(t ) =    (t )v v v where u u u ∈ R ,v v v ∈ R . (14) 1 2 Classically the weight functions u u u and v v v are chosen to maximize the sample correlation coefficient (Górecki et al. 2018): Cov(U , V ) ρ = √ . (15) Var(U ) Var(V ) The sample correlation coefficient between the variables U and V is now replaced by a centered kernel target alignment (KTA) between kernel matrices K K K and L L L based on the u x v y projected data uu(s), xx (s) and v v(t ), yy (t ) , i.e. their (i , j )th entry are i H i H 123 Independence test and CCA for multivariate functional data 487 u x u x K = k(uu(s), xx (s) , uu(s), xx (s) ), s ∈ I , i , j i H j H 1 and L = l(v v v(t ), yyy (t ) , v v v(t ), yyy (t ) ), t ∈ I , i , j i H j H 2 respectively, i , j = 1,..., n: K L tr(K K L L) ρ( ˆ u u u(s), v v v(t )) =  (16) tr(K K K K K K ) tr(L L L L L L) subject to u u u(s) = v v v(t ) = 1. (17) But K K K = k u u u (s)xxx (s)ds, u u u (s)xxx (s)ds i , j i j I I 1 1 (u u u) = k(u u u a a a , u u u a a a ) = K K K i j i , j and L L L = l v v v (t )yyy (t )dt , v v v (t )yyy (t )dt i , j i j I I 2 2 (v v v) v b v b L = l(v v bb ,v v bb ) = L L , i j i , j where a a a and b b b are vectors occuring in the representation (9) vectors functions xxx (s), s ∈ I , i i 1 K +p K +q 1 2 yyy(t ), t ∈ I , i = 1,..., n, u u u ∈ R , v v v ∈ R . Thus, the choice of weighting functions u u u(s) and v v v(t ) so that the coefficient (16)has K +p a maximum value subject to (17) is equivalent to the choice of vectors u u u ∈ R and K +q v v ∈ R such that the coefficient tr(K K K L L L ) u v v uu ρ( ˆ u u u,v v v) =  (18) tr(K K K K K K ) tr(L L L L L L ) u v uu v v u u u v v v has a maximum value subject to u u u = v v v = 1, (19) u v (uu) (v v) where K K K = (K ), L L L = (L ), i , j = 1,..., n. u u u v v v i , j i , j In order to maximize the coefficient of (18) we can use the result of Chang et al. (2013). Authors used a gradient descent algorithm, with modified gradient to ensure the unit length constraint is satisfied at each step (Edelman et al. 1998). Optimal step-sizes were found numerically using the Nelder-Mead method. This article employs the Gaussian kernel exclu- sively while other kernels are available. The bandwidth parameter λ of the Gaussian kernel was chosen using the “median trick” (Song et al. 2010), i.e. the median Euclidean distance between all pairs of points. The coefficients of the projection of the ith value xxx (t ) of random process X X X on the kth functional canonical variable are equal to U =u u u , xxx = u u u (s)xxx (s)ds = a a a u u u , ik k i i k k i 123 488 T. Górecki et al. y Y analogously the coefficients of the projection of the ith value yy (t ) of random process Y Y on i t the kth functional canonical variable are equal to V = b b b v v v , ik k n× p n×q where i = 1,..., n, k = 1,..., min(rank(A A A), rank(B B B)),where A A A ∈ R and B B B ∈ R , where the ith rows are a a a and b b b , respectively, which have column means of zero. i i As we mentioned earlier KTA is a normalized variant of HSIC. Hence, we can repeat the above reasoning for HSIC criterion. However, we should remember that both approaches are not equivalent and we can obtain different results. 
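A rough illustration of the optimization problem (18)-(19) is given below. It does not reproduce the projected gradient scheme of Chang et al. (2013) used in the paper; instead, a general-purpose optimizer is applied to the negative alignment with the weight vectors renormalized inside the objective, and the bandwidths are set by a simple version of the median heuristic. The helpers gaussian_kernel() and kta() come from the earlier snippets, and the data are synthetic.

```r
# Rough illustration of the KTA-based canonical weight vectors (18)-(19). This is not the
# projected gradient algorithm of Chang et al. (2013); a generic optimizer is used instead.
median_trick <- function(Z) 1 / median(as.numeric(dist(Z)))^2  # simple bandwidth heuristic

kta_cca_objective <- function(par, A, B, lambda_u, lambda_v) {
  p <- ncol(A); q <- ncol(B)
  u <- par[1:p];             u <- u / sqrt(sum(u^2))   # enforce ||u|| = 1
  v <- par[(p + 1):(p + q)]; v <- v / sqrt(sum(v^2))   # enforce ||v|| = 1
  Ku <- gaussian_kernel(A %*% u, lambda_u)             # (i,j) entry: k(u'a_i, u'a_j)
  Lv <- gaussian_kernel(B %*% v, lambda_v)
  -kta(Ku, Lv)                                         # minimize the negative alignment
}

set.seed(5)
A <- matrix(rnorm(40 * 4), nrow = 40)
B <- cbind(A[, 1]^2, matrix(rnorm(40 * 3), nrow = 40))
fit <- optim(rnorm(ncol(A) + ncol(B)), kta_cca_objective, A = A, B = B,
             lambda_u = median_trick(A), lambda_v = median_trick(B), method = "BFGS")
u_hat <- fit$par[1:ncol(A)]
u_hat <- u_hat / sqrt(sum(u_hat^2))
U1 <- as.vector(A %*% u_hat)    # scores U_i1 = a_i' u_1 of the first canonical variable
```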
4 Experiments Let us recall some and introduce another symbols: – KTA—centered kernel target alignment, – HSIC—Hilbert–Schmidt Independence Criterion, – FCCA—classical functional canonical correlation analysis (Ramsay and Silverman 2005; Horváth and Kokoszka 2012), – HSIC.FCCA—functional canonical correlation analysis based on HSIC, – HSIC.KTA—functional canonical correlation analysis based on KTA. 4.1 Simulation We generated random processes along with some noises to test the performance of the intro- duced measures. Random processes are specified by X = ε , t t Y = 3X + η , t t t Z = X + ξ , t t where ε ,η and ξ are jointly independent random variables from Gaussian distribution t t t with 0 mean and 0.25 variance. We generated processes of length 100. N = 10000 samples are generated for all processes. The objective is to examine how well functional variants (Fourier basis with 15 basis functions) of KTA and HSIC measures perform compared to measures used on raw data (artificially generated at discrete time stamps). Here, raw data are represented as vector data of generated trajectories, so raw data are three 10000 by 100 dimensional matrices(one for X , second for Y and third for Z ). On the other hand, functional t t t data are three 10000 by 15 dimensional matrices (coefficients of Fourier basis). Here, X and Y are linearly dependent, whereas X , Z and Y , Z are nonlinearly dependent (Fig. 1). t t t t t From Fig. 2 and Table 1, we see that the proposed extension of HSIC and KTA coefficients to functional data gives larger values of coefficients than the variants for raw time series. Unfortunately, it is not possible to perform inference based only on the values of coefficients. We have to apply tests. In Fig. 3 andinTable 2, we observe that when we use functional variants of the proposed measures, we obtain much better results in recognizing nonlinear dependence. Linear dependence between X and Y was easily recognized by each method t t (100% of correct decisions—p values below 5%). Results of functional KTA and HSIC are very similar. Non-functional measures HSIC and KTA give only 7.2% and 6.7% correct decisions (p values below 5%) for relationship X , Z and Y , Z , respectively. On the t t t t 123 Independence test and CCA for multivariate functional data 489 0 20406080 100 0 20406080 100 Time Time Fig. 1 Sample trajectories of X , Y and Z time series for raw (left plot) and functional (right plot) represen- t t t tation HSIC KTA 1.0 1.0 0.9 0.9 0.8 0.8 0.7 0.7 0.6 0.6 0.5 Fig. 2 Raw and functional HSIC and KTA coefficients for artificial time series other hand, functional variants recognize dependency (p values below 5%) in 63.3% (both measures) for X , Y and in 47.8% (HSIC), 63.3% (KTA) for Y , Z . t t t t 4.2 Univariate example As a first real example we used average daily temperature (in Celsius degrees) for each day of the year and average daily rainfall (in mm) for each day of the year rounded to 0.1 mm at -1 0 1 2 Fun(X , Y ) t t Raw(X , Y ) t t Fun(X , Z ) t t Raw(X , Z ) t t Fun(Y , Z ) t t Raw(Y , Z ) t t -0.2 0.0 0.2 0.4 0.6 Fun(X , Y ) t t Raw(X , Y ) t t Fun(X , Z ) t t Raw(X , Z ) t t Fun(Y , Z ) t t Raw(Y , Z ) t t 490 T. Górecki et al. 
Table 1 Average raw and (X , Y)(X , Z)(Y , Z ) t t t t t t functional HSIC and KTA coefficients for artificial time Raw series (number in brackets means HSIC 0.795 (0.015) 0.672 (0.027) 0.825 (0.014) standard deviation) KTA 0.758 (0.019) 0.601 (0.028) 0.789 (0.019) Functional HSIC 0.986 (0.000) 0.984 (0.001) 0.988 (0.000) KTA 0.999 (0.000) 0.999 (0.000) 0.999 (0.000) HSIC KTA 1.0 1.0 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0.0 0.0 Fig. 3 p Values from permutation-based tests for raw and functional variants of HSIC and KTA coefficients Table 2 Average p values from (X , Y)(X , Z)(Y , Z ) t t t t t t permutation-based tests for raw and functional variants of HSIC Raw and KTA coefficients (number in HSIC 0.000 (0.000) 0.445 (0.290) 0.458 (0.282) brackets means standard deviation) KTA 0.000 (0.000) 0.445 (0.290) 0.458 (0.282) Functional HSIC 0.000 (0.000) 0.077 (0.129) 0.125 (0.169) KTA 0.000 (0.000) 0.077 (0.129) 0.077 (0.129) 35 different weather stations in Canada from 1960 to 1994. Each station belongs to one of four climate zone: Arctic (3 stations—blue color on plots), Atlantic (15—red color on plots), Continental (12— black color on plots) or Pacific (5—green color on plots) zone (Fig. 4). This data set comes from Ramsay and Silverman (2005). In the first step, we smoothed data. We used the Fourier basis with various values of the smoothing parameter (number of basis functions) from 3 to 15. We can observe the effect Fun(X , Y ) t t Raw(X , Y ) t t Fun(X , Z ) t t Raw(X , Z ) t t Fun(Y , Z ) t t Raw(Y , Z ) t t Fun(X , Y ) t t Raw(X , Y ) t t Fun(X , Z ) t t Raw(X , Z ) t t Fun(Y , Z ) t t Raw(Y , Z ) t t Independence test and CCA for multivariate functional data 491 Climate zone Arctic Atlantic Continental Pacific -125 -100 -75 -50 Longtitude Fig. 4 Location of Canadian weather stations Raw data Functional data 0 100 200 300 0 100 200 300 Day of the year Day of the year Fig. 5 Raw and functional temperature for Canadian weather stations of smoothing in Figs. 5 and 6 (for Fourier basis with 15 basis functions). We decided to use the Fourier basis for two reasons: it has excellent computational properties, especially if the observations are equally spaced, and it is natural for describing periodic data, such as the annual weather cycles. Here, raw data are two 35 by 100 dimensional matrices (one for temperature and second for precipitation). On the other hand, functional data are two 35 by 15 dimensional matrices (coefficients of Fourier basis). From the plots we can observe that the level of smoothness seems big enough. Addi- tionally, we can observe some relationship between average temperature and precipitation. Namely, for weather stations with large average temperature, we observe relatively bigger average precipitation while for Arctic stations with lowest average temperatures we observe Average temperature Latitude -30 -20 -10 0 10 20 Average temperature -30 -20 -10 0 10 20 492 T. Górecki et al. Raw data Functional data 0 100 200 300 0 100 200 300 Day of the year Day of the year Fig. 6 Raw and functional precipitation for Canadian weather stations Fig. 7 Absolute Spearman correlation coefficient for the first set of functional canonical variables FCCA HSIC.FCCA KTA.FCCA the smallest average precipitation. So we can expect some relationship between average temperature and average precipitation for Canadian weather stations. In the next step, we calculated the values of described earlier coefficients, the values of which are presented in Fig. 8. 
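For readers who wish to reproduce this preprocessing step, a hedged sketch is given below. It assumes the CanadianWeather data set distributed with the fda package and its dailyAv array (layer names as assumed in the comments); the resulting 35 by 15 coefficient matrices are exactly the objects on which the functional HSIC and KTA computations operate.

```r
# Hedged sketch of the smoothing step, assuming the CanadianWeather data set and its
# dailyAv array shipped with the fda package (layer names assumed as below).
library(fda)

day   <- 1:365
basis <- create.fourier.basis(rangeval = c(0, 365), nbasis = 15)

temp_fd <- smooth.basis(day, CanadianWeather$dailyAv[, , "Temperature.C"], basis)$fd
prec_fd <- smooth.basis(day, CanadianWeather$dailyAv[, , "Precipitation.mm"], basis)$fd

# Coefficient matrices (35 stations x 15 Fourier coefficients) on which the functional
# HSIC/KTA computations operate.
A <- t(temp_fd$coefs)
B <- t(prec_fd$coefs)
```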
We observe quite big values of HSIC and KTA, but it is impossible to infer dependency from these values. We see that the values of HSIC and KTA coefficients are stable (both do not depend on basis size). To statistically confirm the association between temperature and precipitation we per- formed some simulation study. This study based on Chang et al. (2013) simulation. Finding a good nonlinear dependency measure is not trivial. KTA and HSIC are not on the same scale. As Chang et al. (2013) we used Spearman correlation coefficient. We performed 50 random splits with the inclusion of 25 samples to identify models. The Spearman correlation coefficient was then calculated using the remaining 10 samples for each of 50 splits. As we know the strongest signal between the temperature and precipitation for Canadian weather stations is nonlinear (Chang et al. 2013). From Fig. 7 we can observe that HSIC.FCCA and KTA.FCCA have produced larger absolute Spearman coefficients than FCCA. Such results suggest that HSIC.FCCA & KTA.FCCA can be viewed as natural nonlinear extensions of CCA also in the case of multivariate functional data. Finally, we performed permutation-based tests for HSIC and KTA coefficients. The results are presented in Fig. 8. All tests rejected H (p values close to 0) for all basis sizes, so we can infer that we have some relationship between average temperature and average precipitation for Canadian weather stations. Unfortunately, we know nothing about the strength and direc- tion of the dependency. Only a visual inspection of the plots suggests that there is a strong and positive relationship. Average precipitation 0 5 10 15 Absolute Spearman correlation 0.2 0.4 0.6 0.8 1.0 Average precipitation 02468 10 12 Independence test and CCA for multivariate functional data 493 HSIC coefficient HSIC test p-values KTA coefficient KTA test p-values 0.007 0.007 0.997 0.006 0.006 0.981 0.971 0.971 0.971 0.971 0.971 0.971 0.969 0.003 0.002 0.002 3579 11 13 15 3579 11 13 15 Basis size Basis size Fig. 8 HSIC and KTA coefficients and p values of permutation-based tests for Canadian weather data HSIC KTA -10 -5 0 5 10 -20 0 20 40 60 X X ˆ ˆ Fig. 9 Projection of the 35 Canadian weather stations on the plane (U , V ) 1 1 ˆ ˆ The relative positions of the 35 Canadian weather stations in the system (U , V ) of 1 1 functional canonical variables are shown in Fig. 9. It seems that for both coefficients the weather stations group reasonably. 4.3 Multivariate example The described method was employed here to cluster the twelve groups (pillars) of variables of 38 European countries in the period 2008-2015. The list of countries used in the dependency analysis is contained in Table 3.Table 4 describes the pillars used in the analysis. For this purpose, use was made of data published by the World Economic Forum (WEF) in its annual reports (http://www.weforum.org). Those are comprehensive data, describing exhaustively various socio-economic conditions or spheres of individual states (Górecki et al. 2016). The data were transformed into functional data. Calculations were performed using the Fourier basis. In view of a small number of time periods (J = 7), for each variable the maximum number of basis components was taken to be equal to five. Here, raw data are twelve matrices (one for each pillar). Dimensions of matrices are different and depend on the number of variables in the pillar. Eg. 
for the first pillar we have 16 (number of variables) * 7 (number of time points) = 112 columns, hence dimensionality of the matrix for this pillar is 38 by 112. Similarly for the others. On the other hand, functional data are twelve matrices with 38 rows and appropriate number of columns (coefficients of Fourier basis). Number of columns for functional data eg. for the first pillar we calculate as 16 (number of variables) * 5 (number of basis elements) = 80. 0.90 0.95 1.00 1.05 1.10 -80 -60 -40 -20 0 20 40 0.000 0.002 0.004 0.006 0.008 0.010 -150 -100 -50 0 50 100 494 T. Górecki et al. Table 3 Countries used in analysis, 2008–2015 1 Albania (AL) 14 Greece (GR) 27 Poland (PL) 2 Austria (AT) 15 Hungary (HU) 28 Portugal (PT) 3 Belgium (BE) 16 Iceland (IS) 29 Romania (RO) 4 Bosnia and Herzegovina (BA) 17 Ireland (IE) 30 Russian Federation (RU) 5 Bulgaria (BG) 18 Italy (IT) 31 Serbia (XS) 6 Croatia (HR) 19 Latvia (LV) 32 Slovak Republic (SK) 7 Cyprus (CY) 20 Lithuania(LT) 33 Slovenia (SI) 8 Czech Republic (CZ) 21 Luxembourg (LU) 34 Spain (ES) 9 Denmark (DK) 22 Macedonia FYR (MK) 35 Sweden (SE) 10 Estonia (EE) 23 Malta (MT) 36 Switzerland (CH) 11 Finland (FI) 24 Montenegro (ME) 37 Ukraine (UA) 12 France (FR) 25 Netherlands (NL) 38 United Kingdom (GB) 13 Germany (DE) 26 Norway (NO) Table 4 Pillars used in analysis, Pillar Number of variables 2008–2015 G1 Institutions 16 G2 Infrastructure 6 G3 Macroeconomic environment 2 G4 Health and primary education 7 G5 Higher education and training 6 G6 Goods market efficiency 10 G7 Labor market efficiency 6 G8 Financial market development 5 G9 Technological readiness 4 G10 Market size 4 G11 Business sophistication 9 G12 Innovation 5 Tables 5 and 6 contain the values of functional HSIC and KTA coefficients. As expected, they are all close to one. But high values of these coefficients do not necessarily mean that there is a significant relationship between the two groups of variables. We can expect association between groups of pillars. However, it is really hard to guess what groups are associated. Similarly to the Canadian weather example we performed small simulation study for pillars G5 and G6. From Fig. 10 we can observe that HSIC.FCCA and KTA.FCCA have produced larger absolute Spearman coefficients than FCCA. This result suggest that proposed measures have better characteristic in discovering nonlinear relationship for this example. We performed permutation-based tests for the HSIC and KTA coefficients discussed above. For most of tests, p values were close to zero, on the basis of which it can be inferred that there is some significant relationship between the groups (pillars) of variables. Table 7 contains the p values obtained for each test. We have exactly the same p values for both methods. 
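The pairwise analysis summarized in Table 7 can be organized as a double loop over the groups of variables. In the sketch below, pillars is a hypothetical name for a list holding the twelve coefficient matrices (38 countries each); hsic_perm_test() and median_trick() are the helpers from the earlier snippets.

```r
# Minimal sketch of the pairwise testing behind Table 7; `pillars` is a hypothetical list
# of twelve coefficient matrices (38 countries each), one per group of variables.
pairwise_pvalues <- function(pillars, n_perm = 999) {
  g <- length(pillars)
  P <- matrix(NA_real_, g, g, dimnames = list(names(pillars), names(pillars)))
  for (i in 1:(g - 1)) {
    for (j in (i + 1):g) {
      P[i, j] <- hsic_perm_test(pillars[[i]], pillars[[j]],
                                lambda_a = median_trick(pillars[[i]]),
                                lambda_b = median_trick(pillars[[j]]),
                                n_perm = n_perm)
    }
  }
  P   # the upper triangle holds the permutation p values
}
```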
Table 5 Functional HSIC coefficients (rows and columns index the pillars G1–G12)
       1       2       3       4       5       6       7       8       9       10      11
2    0.9736
3    0.9736  0.9737
4    0.9736  0.9737  0.9737
5    0.9708  0.9706  0.9706  0.9706
6    0.9728  0.9727  0.9727  0.9727  0.9753
7    0.9687  0.9683  0.9683  0.9683  0.9799  0.9780
8    0.9730  0.9730  0.9730  0.9730  0.9725  0.9740  0.9721
9    0.9736  0.9737  0.9737  0.9737  0.9706  0.9727  0.9683  0.9730
10   0.9736  0.9737  0.9737  0.9737  0.9706  0.9727  0.9683  0.9730  0.9737
11   0.9714  0.9711  0.9711  0.9711  0.9785  0.9755  0.9828  0.9726  0.9711  0.9711
12   0.9688  0.9683  0.9683  0.9683  0.9778  0.9741  0.9897  0.9715  0.9783  0.9683  0.9830

Table 6 Functional KTA coefficients (rows and columns index the pillars G1–G12)
       1       2       3       4       5       6       7       8       9       10      11
2    1.0000
3    1.0000  1.0000
4    1.0000  1.0000  1.0000
5    0.9918  0.9916  0.9916  0.9916
6    0.9980  0.9978  0.9978  0.9978  0.9951
7    0.9741  0.9736  0.9736  0.9936  0.9801  0.9821
8    0.9991  0.9990  0.9990  0.9990  0.9933  0.9989  0.9772
9    1.0000  1.0000  1.0000  1.0000  0.9916  0.9978  0.9736  0.9990
10   1.0000  1.0000  1.0000  1.0000  0.9916  0.9978  0.9736  0.9990  1.0000
11   0.9927  0.9924  0.9924  0.9924  0.9947  0.9957  0.9833  0.9936  0.9924  0.9924
12   0.9793  0.9788  0.9788  0.9788  0.9831  0.9834  0.9794  0.9917  0.9788  0.9788  0.9887

Table 7 Functional HSIC & KTA p values of the permutation-based tests (only non-zero p values are listed, grouped by the row pillar): G2: 0.0142; G3: 0.0714, 0.0332; G4: 0.0042, 0.0343; G5: 0.0001, 0.0268; G6: 0.0157, 0.0772; G7: 0.0009, 0.0061; G8: 0.0294, 0.0636; G9: 0.0030, 0.0055, 0.0198, 0.0640, 0.0002, 0.0003, 0.0009, 0.0040; G10: 0.0059, 0.0294, 0.0021, 0.0055; G11: 0.0039, 0.1034, 0.0008; G12: 0.0008, 0.0563, 0.0044. P values greater than the usual 5% level of significance correspond to the independent pairs of groups identified below.

Fig. 10 Absolute Spearman correlation coefficient for the first set of functional canonical variables for pillars G5 & G6 (methods: FCCA, HSIC.FCCA, KTA.FCCA)

Now we can observe that some groups are independent (α = 0.05): G1 & G3, G3 & G6, G3 & G8, G3 & G11, G3 & G12, G4 & G9.

The graphs of the components of the vector weight functions for the first functional canonical variables of the processes are shown in Fig. 11.

Fig. 11 Weight functions for the first functional canonical variables U1 (left) and V1 (right)

From Fig. 11 (left) it can be seen that the greatest contribution to the structure of the first functional canonical variable (U1) comes from the "black" process, and this holds for all of the observation years considered. Figure 11 (right) shows that, in specific time intervals, the greatest contribution to the structure of the first functional canonical variable (V1) comes alternately from the "black" and "red dotted" processes. The total contribution of a particular original process to the structure of a given functional canonical variable is equal to the area under the modulus of the weight function corresponding to this process. These contributions for the components are given in Table 8.

Figure 12 contains the relative positions of the 38 European countries in the system (Û1, V̂1) of functional canonical variables for selected groups of variables. The high correlation of the first pair of functional canonical variables can be seen in Fig. 12 for pillars G5 and G6. For the KTA criterion, the countries with the highest values of the functional canonical variables Û1 and V̂1 are: Finland (FI), France (FR), Hungary (HU), Greece (GR), Estonia (EE), Germany (DE), Iceland (IS), Czech Republic (CZ) and Denmark (DK).
The countries with the lowest values of the functional canonical variables Û1 and V̂1 are: Romania (RO), Poland (PL), Norway (NO), Portugal (PT), Netherlands (NL) and Russian Federation (RU). The other countries belong to an intermediate group. For the numerical calculations we used the R software (R Core Team 2018) and the packages fda (Ramsay et al. 2018) and hsicCCA (Chang 2013).

Table 8 Sorted areas under the modulus weight functions
First functional canonical variable (G5)
No.   Area    Proportion (in %)
1     5.008   51.74
2     1.724   17.81
3     1.567   16.19
4     0.713   7.36
5     0.351   3.63
6     0.317   3.27
First functional canonical variable (G6)
No.   Area    Proportion (in %)
1     5.187   44.77
2     3.194   27.56
3     1.287   11.11
4     0.580   5.00
5     0.511   4.41
6     0.323   2.79
7     0.206   1.77
8     0.152   1.31
9     0.091   0.78
10    0.057   0.49

Fig. 12 Selected projections of the 38 European countries on the plane (Û1, V̂1) (panels: G4 and G5 (KTA), G4 and G6 (KTA), G5 and G6 (KTA)). Regions used for statistical processing purposes by the United Nations Statistics Division: blue square, Northern Europe; cyan square, Western Europe; red square, Eastern Europe; green square, Southern Europe. (Color figure online)

5 Conclusions

We proposed an extension of two dependency measures between two sets of variables to multivariate functional data. Because the values of the proposed coefficients are rather hard to interpret, we proposed tests to examine the significance of the results. Additionally, we presented methods of constructing nonlinear canonical variables for multivariate functional data using the HSIC and KTA coefficients. Tested on two real examples, the proposed method has proven useful in investigating the dependency between two sets of variables. The examples confirm the usefulness of our approach in revealing the hidden structure of co-dependence between groups of variables. During the study of the proposed coefficients we discovered that the size of the basis (the smoothing parameter) is rather unimportant: the values of the coefficients (and the p values of the tests) do not depend on the basis size. Of course, the performance of the methods needs to be further evaluated on additional real and artificial data sets. Moreover, we can examine the behavior of the coefficients (and tests) for different bases such as B-splines or wavelets (when the data are not periodic, the Fourier basis could fail). This constitutes the direction of our future research.

Acknowledgements The authors are grateful to the editor and two anonymous reviewers for their many insightful and constructive comments and suggestions, which led to the improvement of the earlier manuscript.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References

Aronszajn N (1950) Theory of reproducing kernels. Trans Am Math Soc 68:337–404
Chang B (2013) hsicCCA: canonical correlation analysis based on kernel independence measures. R package version 1.0. https://CRAN.R-project.org/package=hsicCCA
Chang B, Kruger U, Kustra R, Zhang J (2013) Canonical correlation analysis based on Hilbert–Schmidt independence criterion and centered kernel target alignment. In: Proceedings of the 30th international conference on machine learning, Atlanta, Georgia. JMLR: W&CP 28(2), pp 316–324
Cortes C, Mohri M, Rostamizadeh A (2012) Algorithms for learning kernels based on centered alignment. J Mach Learn Res 13:795–828
Cristianini N, Shawe-Taylor J, Elisseeff A, Kandola JS (2001) On kernel-target alignment. In: NIPS 2001, pp 367–373
Cuevas A (2014) A partial overview of the theory of statistics with functional data. J Stat Plan Inference 147:1–23
Devijver E (2017) Model-based regression clustering for high-dimensional data: application to functional data. Adv Data Anal Classif 11(2):243–279
Edelman A, Arias TA, Smith S (1998) The geometry of algorithms with orthogonality constraints. SIAM J Matrix Anal Appl 20(2):303–353
Ferraty F, Vieu P (2003) Curves discrimination: a nonparametric functional approach. Comput Stat Data Anal 44(1–2):161–173
Ferraty F, Vieu P (2006) Nonparametric functional data analysis: theory and practice. Springer, Berlin
Feuerverger A (1993) A consistent test for bivariate dependence. Int Stat Rev 61(3):419–433
Górecki T, Krzyśko M, Ratajczak W, Wołyński W (2016) An extension of the classical distance correlation coefficient for multivariate functional data with applications. Stat Transit 17(3):449–466
Górecki T, Krzyśko M, Wołyński W (2017) Correlation analysis for multivariate functional data. In: Palumbo F, Montanari A, Montanari M (eds) Data science. Studies in classification, data analysis, and knowledge organization. Springer, Berlin, pp 243–258
Górecki T, Krzyśko M, Waszak Ł, Wołyński W (2018) Selected statistical methods of data analysis for multivariate functional data. Stat Papers 59:153–182
Górecki T, Smaga Ł (2017) Multivariate analysis of variance for functional data. J Appl Stat 44:2172–2189
Gretton A, Bousquet O, Smola A, Schölkopf B (2005) Measuring statistical dependence with Hilbert–Schmidt norms. In: Jain S, Simon HU, Tomita E (eds) Algorithmic learning theory. Lecture notes in computer science, vol 3734. Springer, Berlin, pp 63–77
Gretton A, Fukumizu K, Teo CH, Song L, Schölkopf B, Smola AJ (2008) A kernel statistical test of independence. In: Platt JC, Koller D, Singer Y, Roweis S (eds) Advances in neural information processing systems. Curran, Red Hook, pp 585–592
Hofmann T, Schölkopf B, Smola AJ (2008) Kernel methods in machine learning. Ann Stat 36(3):1171–1220
Horváth L, Kokoszka P (2012) Inference for functional data with applications. Springer, Berlin
Hotelling H (1936) Relations between two sets of variates. Biometrika 28:321–377
Hsing T, Eubank R (2015) Theoretical foundations of functional data analysis, with an introduction to linear operators. Wiley, Hoboken
James GM, Wang JW, Zhu J (2009) Functional linear regression that's interpretable. Ann Stat 37(5):2083–2108
Kankainen A (1995) Consistent testing of total independence based on the empirical characteristic function. Ph.D. thesis, University of Jyväskylä
Martin-Baragan B, Lillo R, Romo J (2014) Interpretable support vector machines for functional data. Eur J Oper Res 232:146–155
Mercer J (1909) Functions of positive and negative type and their connection with the theory of integral equations. Philos Trans R Soc Lond Ser A 209:415–446
R Core Team (2018) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Ramsay JO, Dalzell CJ (1991) Some tools for functional data analysis (with discussion). J R Stat Soc Ser B 53(3):539–572
Ramsay JO, Silverman BW (2002) Applied functional data analysis. Springer, New York
Ramsay JO, Silverman BW (2005) Functional data analysis, 2nd edn. Springer, Berlin
Ramsay JO, Wickham H, Graves S, Hooker G (2018) fda: functional data analysis. R package version 2.4.8. https://CRAN.R-project.org/package=fda
Read T, Cressie N (1988) Goodness-of-fit statistics for discrete multivariate analysis. Springer, Berlin
Riesz F (1909) Sur les opérations fonctionnelles linéaires. Comptes rendus hebdomadaires des séances de l'Académie des sciences 149:974–977
Sejdinovic D, Sriperumbudur B, Gretton A, Fukumizu K (2013) Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann Stat 41(5):2263–2291
Schölkopf B, Smola AJ, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10:1299–1319
Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press, Cambridge
Song L, Boots B, Siddiqi S, Gordon G, Smola A (2010) Hilbert space embeddings of hidden Markov models. In: Proceedings of the 26th international conference on machine learning (ICML 2010)
Székely GJ, Rizzo ML, Bakirov NK (2007) Measuring and testing dependence by correlation of distances. Ann Stat 35(6):2769–2794
Székely GJ, Rizzo ML (2009) Brownian distance covariance. Ann Appl Stat 3(4):1236–1265
Wang T, Zhao D, Tian S (2015) An overview of kernel alignment and its applications. Artif Intell Rev 43(2):179–192
Zhang K, Peters J, Janzing D, Schölkopf B (2011) Kernel-based conditional independence test and application in causal discovery. In: Cozman FG, Pfeffer A (eds) Proceedings of the 27th conference on uncertainty in artificial intelligence. AUAI Press, Corvallis, pp 804–813
