Fast Estimation of L1-Regularized Linear Models in the Mass-Univariate Setting

In certain modeling approaches, activation analyses of task-based fMRI data can involve a relatively large number of predictors. For example, in the encoding model approach, complex stimuli are represented in a high-dimensional feature space, resulting in design matrices with many predictors. Similarly, single-trial models and finite impulse response models may also encompass a large number of predictors. In settings where only a few of those predictors are expected to be informative, a sparse model fit can be obtained via L1-regularization. However, estimating L1-regularized models requires an iterative fitting procedure, which considerably increases computation time compared to estimating unregularized or L2-regularized models, and complicates the application of L1-regularization on whole-brain data and large sample sizes. Here we provide several functions for estimating L1-regularized models that are optimized for the mass-univariate analysis approach. The package includes a parallel implementation of the coordinate descent algorithm for CPU-only systems and two implementations of the alternating direction method of multipliers algorithm requiring a GPU device. While the core algorithms are implemented in C++/CUDA, data input/output and parameter settings can be conveniently handled via Matlab. The CPU-based implementation is highly memory-efficient and provides considerable speed-up compared to the standard implementation not optimized for the mass-univariate approach. Further acceleration can be achieved on systems equipped with a CUDA-enabled GPU. Using the fastest GPU-based implementation, computation time for whole-brain estimates can be reduced from 9 h to 5 min in an exemplary data setting. Overall, the provided package facilitates the use of L1-regularization for fMRI activation analyses and enables an efficient employment of L1-regularization on whole-brain data and large sample sizes.

Keywords: fMRI, L1-regularization, Lasso, Sparsity, Encoding model, GPU

* Holger Mohr, holger.mohr@tu-dresden.de
Department of Psychology, Technische Universität Dresden, 01062 Dresden, Germany

Neuroinform (2021) 19:385–392

Electronic supplementary material: The online version of this article (https://doi.org/10.1007/s12021-020-09489-1) contains supplementary material, which is available to authorized users.

Introduction

Over the last two decades, various fMRI activation analysis approaches have been established that involve a relatively large number of predictors. For example, single-trial models can be employed to obtain activation estimates for individual experimental trials (Mumford et al. 2012). The single-trial estimates can be subsequently used as input for further analyses such as multivariate pattern analyses or connectivity analyses (Mumford et al. 2014; Rissman et al. 2004). As single-trial models require a separate predictor for each experimental trial, the resulting design matrices typically encompass a relatively large number of predictors.

The finite impulse response (FIR) model is another example of an fMRI activation analysis approach involving many predictors (Ollinger et al. 2001). In the FIR approach, the fMRI signal is modeled by a set of impulse response predictors, instead of a predefined hemodynamic response shape. This modeling approach can capture activation dynamics deviating from the canonical response shape, but typically involves a large number of predictors (growing proportionally with the size of the FIR basis set). Both in single-trial and FIR models, the large number of predictors can reduce the robustness of the beta estimates.

Another fMRI modeling approach typically involving a large number of predictors is the so-called encoding model approach (Naselaris et al. 2011; van Gerven 2017). In this approach, stimuli are represented in a high-dimensional feature space instead of being assigned to a low-dimensional set of categories. For example, instead of assigning visual stimuli to object categories such as houses, trees, etc., the stimuli are represented by a high-dimensional vector of feature weights. Such features may comprise a set of Gabor wavelets (Kay et al. 2008), may be extracted from deep neural networks (Güclü and van Gerven 2015) or may simply consist of pixel values (Schoenmakers et al. 2013). The encoding model approach has also been employed outside the visual domain, for example to characterize semantic representations of words (Huth et al. 2016). In this study, the feature space was defined as a basic dictionary of English words, and the model was fitted on whole-brain data. The high-dimensional representation of stimuli used in the encoding model approach typically translates into design matrices encompassing more predictors than time points. In this setting, a unique model fit can only be obtained by adding a regularization term to the model.

Model regularization adds constraints to the model fitting procedure on top of minimizing the error term. As the number of predictors approaches the number of collected time points (i.e. the length of the fMRI time series), model regularization becomes increasingly relevant, and for saturated models, regularization is indispensable. In the context of fMRI activation analyses, model regularization can improve the robustness of single-trial and FIR models, and is crucial for the encoding model approach.

For linear models, the two most common types of regularization are L1-regularization and L2-regularization (also known as lasso and ridge regression, Tibshirani 1996; Hoerl and Kennard 1970). While L1-regularization puts a threshold on the sum of absolute values of the beta estimates, L2-regularization bounds the sum of squared beta-values. These two types of regularization can result in fundamentally different beta estimates: L2-regularized models return nonzero beta-values for all predictors, whereas L1-regularized models return a sparse model fit, that is, most of the beta-values are set to zero and only a few predictors are included in the model fit. Whether to employ L1- or L2-regularization depends on a-priori assumptions on the data at hand. L2-regularization assumes that most of the predictors have an impact on the fMRI signal. In contrast, L1-regularization is based on the assumption that the fMRI signal can be modeled by a small fraction of the included predictors.

While both types of regularization have been employed in fMRI studies (Huth et al. 2016; Nishimoto et al. 2011), L2-regularization seems to occur more frequently in the neuroimaging literature than L1-regularization. This might be partly explained by the fact that fitting L1-regularized models is considerably more expensive, in terms of computation time, than fitting L2-regularized or unregularized models. While L2-regularized and unregularized models can be estimated using closed-form solutions, L1-regularization requires an iterative fitting procedure. Thus, estimating L1-regularized models instead of L2- or unregularized models substantially increases the running time of fMRI analyses. For certain types of analyses, for example whole-brain analyses on large samples, it is virtually infeasible to employ L1-regularization.

Here, we aim to facilitate the estimation of L1-regularized models on fMRI data. In the following, we present a package of functions for estimating L1-regularized models that are optimized for the mass-univariate approach. We describe the implementation of the functions, how to set their parameters, and provide benchmark results for two exemplary data settings.

Methods

In the following, we assume an fMRI data matrix $Y$ of size $n \times v$, with $n$ being the number of time points and $v$ being the number of time series (e.g. the number of voxels). Moreover, we have a design matrix $X$ of size $n \times p$, with $p$ being the number of predictors. The beta-values are stored in a matrix $B$ of size $p \times v$. The intercept of the model is of size $n \times 1$ and is denoted $I$, and the intercept's beta-values are stored in $B^0$ of size $1 \times v$. Matrix columns are indexed by $j$. We assume that the columns of the design matrix $X$ are z-scored, i.e. $\sum_{i=1}^{n} X_{ij} = 0$ and $\frac{1}{n}\sum_{i=1}^{n} X_{ij}^2 = 1$ for all $j \in \{1, \dots, p\}$. For a given regularization parameter $\lambda \ge 0$, the L1-regularized model fit $\hat{B}, \hat{B}^0$ is expected to minimize, for each $j \in \{1, \dots, v\}$, the following objective function:

$$\hat{B}_j, \hat{B}^0_j = \underset{B_j,\, B^0_j}{\arg\min} \; \frac{1}{2n} \left\lVert Y_j - X B_j - I B^0_j \right\rVert_2^2 + \lambda \left\lVert B_j \right\rVert_1$$

In contrast to L2-regularized or unregularized models, L1-regularized model fits cannot be computed using a closed-form solution. Instead, an iterative procedure is required to find $\hat{B}$. In the following sections, we describe two different algorithms for fitting L1-regularized models, coordinate descent (Friedman et al. 2010) and alternating direction method of multipliers (ADMM, Boyd et al. 2010), and how we optimized and implemented these algorithms for mass-univariate analyses.

The CPU-Based Implementation: lasso_mex

In the CPU-based implementation, the model fit is computed using the coordinate descent algorithm proposed by Friedman et al. (2010). In each step of the iterative fitting procedure, the beta-value of a single predictor is updated while the remaining beta-values are fixed.
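As a rough illustration of this single-coordinate update, the following C++ sketch fits one time series by cyclic coordinate descent on the objective given in the Methods section. It is not the package's code: the function names (soft_threshold, fit_lasso_cd), the row-major storage, and the choice to pass in the inner products G = X'X/n and c = X'(y - mean(y))/n are assumptions made for this example. With z-scored columns the diagonal of G equals 1, so each update reduces to a soft-thresholding step, and the unpenalized intercept is simply the mean of y.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0).
double soft_threshold(double z, double t) {
    if (z > t)  return z - t;
    if (z < -t) return z + t;
    return 0.0;
}

// Cyclic coordinate descent for a single time series (hypothetical helper).
//   G : p x p matrix X'X / n, row-major; columns of X are assumed z-scored,
//       so that every diagonal entry of G equals 1
//   c : p-vector X'(y - mean(y)) / n
//   b : p-vector of coefficients; the incoming values are the starting point
// Returns the number of sweeps over all predictors that were performed.
int fit_lasso_cd(const std::vector<double>& G, const std::vector<double>& c,
                 double lambda, std::vector<double>& b,
                 double tol = 1e-3, int n_iter_max = 100000) {
    const std::size_t p = c.size();
    assert(G.size() == p * p && b.size() == p);
    for (int sweep = 1; sweep <= n_iter_max; ++sweep) {
        double max_change = 0.0;
        for (std::size_t j = 0; j < p; ++j) {
            // Inner product of predictor j with the partial residual, built
            // from precomputed inner products only (no residual vector).
            double rho = c[j];
            for (std::size_t k = 0; k < p; ++k) {
                if (k != j) rho -= G[j * p + k] * b[k];
            }
            const double b_new = soft_threshold(rho, lambda);
            max_change = std::fmax(max_change, std::fabs(b_new - b[j]));
            b[j] = b_new;  // all other coefficients stay fixed in this step
        }
        if (max_change < tol) return sweep;
    }
    return n_iter_max;
}
```

Because b is read as the starting point, calling fit_lasso_cd again with a smaller lambda continues from the previous solution instead of from zero.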
As suggested by Friedman et al. (2010), the beta-values are computed using covariance updates, i.e. without explicitly computing residual values. Friedman et al. (2010) suggest computing covariances between predictors dynamically as required during the estimation process. However, in the mass-univariate setting, the same design matrix is used to model a large number of fMRI time series. Thus, instead of computing covariances between predictors dynamically for each voxel, we precompute the full covariance matrix of the design matrix before starting the coordinate descent, thereby avoiding redundant computations of predictor covariances across voxels.

Aside from precomputing the covariance matrix, our implementation includes features such as warm starts and active sets described in Friedman et al. (2010). The idea behind active sets is to iterate only through predictors whose beta-values were set to nonzero values at an earlier stage. Once convergence among the beta-values included in the active set is achieved, the algorithm iterates through all beta-values to check whether additional predictors have to be included. Using active sets can considerably speed up the estimation procedure; moreover, beta-values can be stored in sparse format, thereby reducing memory usage.

The estimation procedure for a given λ parameter can be considerably accelerated by properly initializing the beta-values, instead of starting with all beta-values set to zero. Such an initialization can be obtained from a prior estimate using a larger λ parameter. Generally, computation times increase as λ becomes smaller, due to the larger number of nonzero beta-values. Thus, successively fitting models along a decreasing sequence of lambda parameters using warm starts is typically faster than starting from scratch for each lambda value (Friedman et al. 2010).

To further accelerate the estimation procedure, the CPU-based implementation distributes computations among multiple CPU cores using a parallel for-loop over the voxels, exploiting the fact that models are fitted independently across voxels in the mass-univariate analysis approach. To this end, the algorithm was implemented in C++ using OpenMP for parallelization. The performance improvement achieved by this parallelization step depends on the number of available CPU cores.

How to Use lasso_mex

While the underlying coordinate descent algorithm is implemented in C++, the lasso_mex function can be conveniently called from Matlab via the mex API. The function takes a design matrix X, a matrix Y containing fMRI time series and a sequence of lambda-parameters lambda_seq as input. Technical parameters can be optionally specified using an options structure, otherwise default values are used. The columns of the design matrix X must be z-scored, and X must not contain an intercept column. The function returns beta-values in sparse format, with b_values containing the actual beta values, b_indexes containing the indexes of the values, and N_nz containing the number of nonzero beta-values for each voxel. Unregularized intercept coefficients are returned in b0.

    [b_values, b_indexes, N_nz, b0] = lasso_mex(X, Y, lambda_seq);

The function lasso_mex is written in Matlab and calls, after some sanity checks and precomputations, the MEX function lasso_mex_cpp.c, which is written in C++ and runs the coordinate descent algorithm.

The resulting beta-values in sparse format can be converted to full format using the convert_betas_sparse_to_full function:

    b_full = convert_betas_sparse_to_full(b_values, b_indexes, N_nz, size(X, 2));

The sequence of lambda-parameters lambda_seq should be decreasing in order to benefit from warm starts as described above. Lambda-values are typically defined on a log-scale, e.g. $\lambda \in \{2^{k}, 2^{k-1}, \dots, 2^{k-n}\}$. The lambda parameter determines the degree of model regularization. For lambda-values larger than a certain data-dependent threshold, all beta-values are set to zero, and the model fit only consists of the intercept. The critical threshold can be computed using the calculate_lambda_start function.

    lambda_start = calculate_lambda_start(X, Y);

For lambda-values larger than lambda_start, all beta-values will be zero. Thus, the first value of the lambda sequence is typically set to lambda_start or to a value from a predefined discrete scale that is close to lambda_start. How to set the smallest value of the lambda sequence is less clear and involves a trade-off between computation time and the desired degree of model saturation. Generally, estimating weakly regularized models requires longer computation times than estimating strongly regularized models. As the number of nonzero beta-values approaches the number of time points of the fMRI time series, model estimation can become time-consuming and unstable in overparameterized settings (i.e. when p > n). Thus, the smallest value of the lambda sequence is typically chosen to achieve a certain degree of model saturation while keeping computation time within reasonable bounds.

Optionally, technical parameters of the lasso_mex function can be set via the options structure. The default values are:

    options.n_iter_max = 1e5;
    options.tol_value = 1e-3;
    options.buffer_factor = 3;
    options.cpu_load_factor = 1;

    [b_values, b_indexes, N_nz, b0] = lasso_mex(X, Y, lambda_seq, options);

The n_iter_max parameter defines an upper bound for the number of iterations performed by the coordinate descent algorithm. This parameter can be set to a larger value if the default value results in the error message "Max. iter. Reached, no convergence!". However, reaching the maximum number of iterations can also indicate that model regularization is too weak and more stringent regularization is required.

The tol_value parameter determines the precision of the estimated beta-values. The coordinate descent algorithm stops when $\max \lvert B_j^{\text{new}} - B_j^{\text{old}} \rvert <$ tol_value. If higher than default precision is required, the tol_value parameter can be set to a smaller value, which will typically increase computation time. Vice versa, low-precision estimates can be obtained by setting tol_value to a larger value, which might accelerate the estimation procedure.

To minimize memory requirements, beta-values are stored in sparse format. The maximum number of nonzero beta-values per voxel is determined by the buffer_factor parameter. Internally, the buffer_factor is multiplied by n (the number of rows of the design matrix X) in order to compute how much memory is preallocated for the beta-values. In well-defined settings, i.e. if p ≤ n, the buffer_factor parameter can be set to 1. In overparameterized settings (i.e. if p > n), if the default value results in the error message "N_nz over maximum, larger buffer is required!", the buffer_factor parameter should be set to a larger value. How the buffer_factor impacts memory demands is detailed in the Supplementary Material section on memory usage.

The parameter cpu_load_factor determines the degree of CPU utilization. For a cpu_load_factor of 1 (default value), all CPU cores are engaged, whereas for a cpu_load_factor of 0, only a single core is occupied. To distribute computations among all but one core, the cpu_load_factor can be set to 0.99.

The GPU-Based Implementations: lasso_mexcuda and lasso_gpu

The two GPU-based implementations are based on the alternating direction method of multipliers (ADMM) algorithm described in Boyd et al. (2010). In contrast to the coordinate descent algorithm, the ADMM algorithm does not sequentially cycle through the beta coefficients; instead, all beta coefficients are updated simultaneously via matrix multiplication. While the ADMM procedure is less memory-efficient than coordinate descent, it can be accelerated on the GPU by parallelizing matrix multiplications and other steps.

The first version (lasso_mexcuda) can be called from Matlab and does not require any Matlab toolboxes. After some sanity checks and precomputations, the mex functions ADMMcublasOverMex.c or ADMMcublasUnderMex.c are called (depending on whether the design matrix is overparameterized or not), which are written in C++. These functions then call ADMMcublasOver.cu or ADMMcublasUnder.cu, which contain CUDA code calling functions from the cuBLAS library to run the ADMM algorithm on the GPU. The second version (lasso_gpu) is implemented directly in Matlab using gpuArray and thus depends on the Parallel Computing Toolbox. For both versions, a CUDA-enabled GPU device is required (see also Table 1).

Both lasso_mexcuda and lasso_gpu use warm starts along the supplied lambda sequence, and the same considerations on the choice of the lambda sequence as discussed above in the lasso_mex section also apply to the lasso_mexcuda and lasso_gpu functions.

How to Use lasso_mexcuda and lasso_gpu

The functions take a design matrix X, a matrix Y containing fMRI time series and a sequence of lambda-parameters lambda_seq as input. Technical parameters can be optionally specified using an options structure, otherwise default values are used. The columns of the design matrix X must be z-scored, and X must not contain an intercept column. Beta-values are returned in full format in B, and unregularized intercept coefficients are returned in B0. As lasso_mexcuda and lasso_gpu are optimized for the GPU, B and B0 are returned in single-precision format.

    [B, B0] = lasso_mexcuda(X, Y, lambda_seq);

    [B, B0] = lasso_gpu(X, Y, lambda_seq);

Optionally, the following technical parameters can be specified (set to default values here):

    options.n_iter_max = 1e5;
    options.tol_value = 1e-3;
    options.buffer_size = 8192;

    [B, B0] = lasso_mexcuda(X, Y, lambda_seq, options);

    [B, B0] = lasso_gpu(X, Y, lambda_seq, options);
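To make the contrast with coordinate descent concrete, here is a minimal single-voxel ADMM sketch in C++, using the standard lasso splitting from Boyd et al. It is only an illustration under stated assumptions, not the package's CUDA/cuBLAS code: the names (invert_spd, admm_lasso), the fixed penalty parameter rho, the stopping rule based on primal and dual residuals, and the dense Gauss-Jordan inverse are all choices made for this example. What it shows is that, once M = (X'X/n + rho*I)^(-1) has been computed, every iteration updates all coefficients at once through a matrix-vector product followed by element-wise soft-thresholding.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

double soft_threshold(double z, double t) {
    if (z > t)  return z - t;
    if (z < -t) return z + t;
    return 0.0;
}

// Inverse of a small symmetric positive definite p x p matrix (row-major)
// by Gauss-Jordan elimination without pivoting (pivots stay positive for SPD input).
Vec invert_spd(Vec A, std::size_t p) {
    assert(A.size() == p * p);
    Vec inv(p * p, 0.0);
    for (std::size_t i = 0; i < p; ++i) inv[i * p + i] = 1.0;
    for (std::size_t col = 0; col < p; ++col) {
        const double pivot = A[col * p + col];
        assert(pivot > 0.0);
        for (std::size_t k = 0; k < p; ++k) {
            A[col * p + k] /= pivot;
            inv[col * p + k] /= pivot;
        }
        for (std::size_t row = 0; row < p; ++row) {
            if (row == col) continue;
            const double f = A[row * p + col];
            for (std::size_t k = 0; k < p; ++k) {
                A[row * p + k] -= f * A[col * p + k];
                inv[row * p + k] -= f * inv[col * p + k];
            }
        }
    }
    return inv;
}

// ADMM for one time series (hypothetical helper).
//   G : p x p matrix X'X / n (row-major), c : p-vector X'(y - mean(y)) / n
// Returns the sparse iterate z.
Vec admm_lasso(const Vec& G, const Vec& c, double lambda,
               double rho = 1.0, double tol = 1e-3, int n_iter_max = 100000) {
    const std::size_t p = c.size();
    assert(G.size() == p * p && rho > 0.0);
    Vec A = G;
    for (std::size_t j = 0; j < p; ++j) A[j * p + j] += rho;
    const Vec M = invert_spd(A, p);  // computed once, reused in every iteration
    Vec b(p, 0.0), z(p, 0.0), u(p, 0.0), rhs(p, 0.0);
    for (int it = 0; it < n_iter_max; ++it) {
        // b-update: b = M * (c + rho * (z - u)), all coefficients at once.
        for (std::size_t j = 0; j < p; ++j) rhs[j] = c[j] + rho * (z[j] - u[j]);
        for (std::size_t j = 0; j < p; ++j) {
            double s = 0.0;
            for (std::size_t k = 0; k < p; ++k) s += M[j * p + k] * rhs[k];
            b[j] = s;
        }
        // z-update (soft-thresholding) and scaled dual update.
        double primal = 0.0, dual = 0.0;
        for (std::size_t j = 0; j < p; ++j) {
            const double z_new = soft_threshold(b[j] + u[j], lambda / rho);
            dual = std::fmax(dual, std::fabs(z_new - z[j]));
            primal = std::fmax(primal, std::fabs(b[j] - z_new));
            z[j] = z_new;
            u[j] += b[j] - z_new;
        }
        if (std::fmax(primal, dual) < tol) break;
    }
    return z;
}
```

In a mass-univariate setting the same M would be shared by all voxels, so the b-update for a batch of voxels becomes a single matrix-matrix product, which is the kind of operation a GPU handles well.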
Table 1 Overview of hardware and software requirements of the different functions for estimating L1-regularized linear models. The lasso function is part of the Statistics and Machine Learning Toolbox and only included here for benchmarking purposes (see main text). The lasso_mex, lasso_mexcuda and lasso_gpu functions introduced here are specifically optimized for the mass-univariate analysis approach and depend on different hardware and software configurations

Function        Hardware requirements     Software requirements                 Algorithm            Implementation
lasso           CPU                       Matlab, Statistics and ML Toolbox     Coordinate descent   Matlab
lasso_mex       CPU                       Matlab                                Coordinate descent   C++, OpenMP
lasso_mexcuda   CPU + CUDA-enabled GPU    Matlab, CUDA Toolkit                  ADMM                 C++, cuBLAS
lasso_gpu       CPU + CUDA-enabled GPU    Matlab, Parallel Computing Toolbox    ADMM                 Matlab

The n_iter_max parameter defines an upper bound for the number of iterations performed by the ADMM algorithm. This parameter can be set to a larger value if the default value results in the error message "Max. iter. Reached, no convergence!". However, reaching the maximum number of iterations can also indicate that model regularization is too weak and more stringent regularization is required.

The tol_value parameter determines the precision of the estimated beta-values. If higher than default precision is required, the tol_value parameter can be set to a smaller value, which will typically increase computation time. Vice versa, low-precision estimates can be obtained by setting tol_value to a larger value, which might accelerate the estimation procedure.

The buffer_size determines how many voxels are simultaneously processed on the GPU. Setting this parameter to a larger value might accelerate the estimation procedure, provided that the GPU has sufficient memory resources. If buffer_size exceeds the memory capacity of the GPU, Matlab terminates the function call with an error message. Memory requirements can be estimated based on the heuristics provided in the Supplementary Material section on memory usage.

Benchmarking

The three functions lasso_mex, lasso_mexcuda and lasso_gpu were benchmarked in two data settings to demonstrate the efficiency of the implementations. In both benchmarks, X and Y data were randomly drawn from the normal distribution. Benchmark A corresponds to an overparameterized model, representing for example an encoding model with a large feature space. The number of time points was set to n = 300, which corresponds to a 10 min scanner run for a TR of 2 s. The number of predictors was set to p = 5000, i.e. p ≫ n. The model was estimated on v = 65536 voxels, approximately corresponding to whole-brain data for an isotropic 3 mm resolution. The lambda sequence was set to $\{2^{-2}, 2^{-3}, \dots, 2^{-6}\}$. Benchmark B corresponds to a well-defined setting, e.g. a single-trial or FIR model. The number of time points and voxels in setting B were identical to setting A, but the number of predictors was set to p = 200, i.e. p < n. The models were estimated for a single lambda-value $\lambda = 2^{-4}$ to allow for a comparison of the computation times with L2-regularized and unregularized model fits.

The three functions lasso_mex, lasso_mexcuda and lasso_gpu were compared to the lasso function that is part of Matlab's Statistics and Machine Learning Toolbox. The lasso function was repeatedly called to fit models for all voxels using a for-loop, taking a single time series as input in each call.

The benchmarks were run on a workstation equipped with a dual Intel Xeon E5-2665 CPU (16 cores overall), 32 GB memory and an Nvidia Quadro P2000 GPU. The functions were benchmarked on Windows 10 64-bit and Matlab R2019b, as well as Ubuntu 18.04 LTS and Matlab R2018b. On Windows, lasso_mex was compiled using mex and the MSVC compiler of Visual Studio 2017, and lasso_mexcuda using CUDA Toolkit 10.1.243. On Linux, lasso_mex was compiled using mex and gcc 6.5, and lasso_mexcuda using CUDA Toolkit 9.1.85.

Moreover, to assess the impact of the GPU hardware, we compared two different GPU devices using benchmark A, Nvidia's Quadro P2000 (1024 cores, 5 GB memory) and Tesla V100 SXM2 (5120 cores, 16 GB memory). To this end, the functions lasso_mexcuda and lasso_gpu were run on a p3.2xlarge instance on Amazon Web Services (AWS) using a Matlab Amazon Machine Image (AMI).

Code and Software Requirements

The presented software package including compiled binaries and source code is available at https://git.io/JvUpi. Data input/output is handled via Matlab for all functions of the package. The functions have been tested using Matlab R2019b and R2018b, but should work with other versions as well. Moreover, the GPU-accelerated functions require either the CUDA toolkit (lasso_mexcuda) or Matlab's Parallel Computing Toolbox (lasso_gpu). Note that each Matlab version requires a specific version of the CUDA toolkit, as listed here: https://mathworks.com/help/parallel-computing/gpu-support-by-release.html. GPU devices should have compute capability >= 3.0.

Results

On both benchmarks A and B, the lasso_mex function was considerably faster than the standard lasso function using a single CPU core, showing that the coordinate descent algorithm underlying lasso_mex is efficiently implemented (see Fig. 1). Setting the lasso_mex function to distribute computations across multiple CPU cores led to further reductions of computation time. More details on how computation time varied as a function of CPU cores are given in Supplementary Fig. 1. The parallel implementation of the ADMM algorithm on the GPU (lasso_mexcuda, lasso_gpu) provided further acceleration for benchmark A, with lasso_gpu being considerably faster than lasso_mexcuda. In absolute numbers, a reduction of computation time from approximately 9 h to 5 min could be achieved on Windows using the lasso_gpu function, see Supplementary Table 1. As shown in benchmark B, L2-regularized (ridge regression) or unregularized (ordinary least squares, OLS) model estimation remains faster than accelerated L1-regularization.

The comparison of two different GPU devices using benchmark A on Linux revealed that the larger device (Tesla V100) achieves a 4.5x speed-up over the smaller card (Quadro P2000), see Fig. 2. This acceleration approximately corresponds to the ratio of available CUDA cores of 5120:1024, see Supplementary Fig. 2 for more details. In absolute numbers, computation time could be further reduced to 1 min using the lasso_gpu function on the V100 device, see Supplementary Table 2.

Fig. 1 Benchmark results for the three lasso functions lasso_mex, lasso_mexcuda and lasso_gpu. (a): Benchmark for an overparameterized setting with n = 300 time points and p = 5000 predictors, corresponding for example to an encoding model with a large feature space. The lasso_mex function required considerably less running time for whole-brain estimates than the standard lasso function (Matlab, orange) on a single CPU core (C++, light blue). Distributing computations across multiple CPU cores further reduced the running time of the lasso_mex function (OpenMP, dark blue). The lasso_mexcuda function, which runs the ADMM algorithm on a GPU using the cuBLAS library, further accelerated the estimation procedure (cuBLAS, green). The lasso_gpu function, which runs the ADMM algorithm on the GPU using the Parallel Computing Toolbox, provided the highest speed-up factor for benchmark A (gpuArray, green/orange). (b): Benchmark for a well-defined setting with fewer predictors than time points (n = 300, p = 200), corresponding for example to a single-trial or FIR model. Again, on a single CPU core, the lasso_mex function fitted L1-regularized models more efficiently than the standard lasso function. Further speed-up could be achieved by distributing computations across multiple CPU cores. The GPU-based implementations (lasso_mexcuda, lasso_gpu) were not as fast as the multicore CPU version on this benchmark, as the GPU was not fully occupied in this small-scale setting. Unregularized (OLS, yellow) or L2-regularized (ridge, gray) model estimation using closed-form solutions remains faster than accelerated L1-regularization. Absolute computation times are given in Supplementary Table 1

Fig. 2 Benchmark results for two different GPU devices. Nvidia's Tesla V100 GPU device yielded a speed-up of approximately 4.5x over the Quadro P2000 card, which roughly corresponds to the ratio of 5120:1024 cores. Absolute computation times are given in Supplementary Table 2

Discussion

We introduced a package of functions for estimating L1-regularized models in the mass-univariate fMRI analysis approach. The presented functions significantly accelerate the model estimation procedure and thereby facilitate the use of L1-regularization in the mass-univariate approach and enable its application on whole-brain data and large samples. While the presented benchmarks were performed on data corresponding to a single fMRI scanning session, efficiency gains scale up linearly for multiple sessions per subject and multiple subjects per sample. For example, for a sample of 20 subjects and 5 scanning sessions per subject, a computation time reduction from 9 h to 5 min per scanning session translates into a reduction from 37 days to 8 h for the whole sample.

This speed-up makes it practically feasible to employ L1-regularization in the context of the encoding model approach. While the majority of studies using the encoding model approach have focused on visual processing (van Gerven 2017), Huth et al. (2016) have shown that an extension to other domains such as whole-brain semantic representations is possible and promising. Data-driven generation of predictive features, for example via deep neural networks (Güclü and van Gerven 2015; Kell et al. 2018; Mohr et al. 2019), typically results in large feature spaces and therefore requires efficient model estimation procedures. Given the rapid advancement of machine learning in a range of domains that are also relevant for neuroimaging (e.g. geometric representations of objects (Eslami et al. 2018) or spatial navigation (Banino et al. 2018)), we expect a proliferation of the encoding model approach for predicting fMRI signals via machine-learning generated features. Estimating such large-scale encoding models using L1-regularization can be efficiently performed by the functions presented here.

Moreover, the presented functions facilitate the use of L1-regularization for modeling approaches such as single-trial and FIR models (Mumford et al. 2012; Ollinger et al. 2001). To our knowledge, the impact of L1-regularization has not yet been systematically evaluated for these types of models. Using the functions presented here, future studies can efficiently evaluate how L1-regularization impacts the robustness and predictiveness of single-trial and FIR models in comparison to L2-regularization or unregularized model estimation.

In the following, we discuss some limitations of L1-regularization in general and the specific implementations presented here. It is important to note that the beta estimates returned by L1-regularized models are typically sparse and thus not normally distributed. Depending on how the beta estimates are used in subsequent analysis steps, a modification of some of these steps might be required. For example, for aggregating data, median values can be used instead of arithmetic means to preserve sparsity. Moreover, when using L1-regularization for example for estimating single-trial models, both the spatial activation patterns and the beta-series are typically sparse. To compute correlations between such sparse spatial patterns or beta-series, rank-based correlation measures should be used instead of Pearson correlations. In contrast, using sparse beta estimates for predicting fMRI signals in the context of the encoding model approach would typically not require specific adjustments. For example, to compute model predictions of an L1-regularized model, the design matrix of a given test dataset can simply be multiplied with the sparse beta estimates obtained on a training dataset. With respect to decoding techniques such as multivariate pattern analysis (MVPA), it should be noted that although it would be technically possible to use the presented functions for estimating decoding models (by filling the design matrix with fMRI data), no acceleration can be expected in this case, as decoding models are typically estimated for a single outcome variable only. Since the presented functions are optimized for the mass-univariate approach, estimation procedures are only accelerated in settings involving many outcome variables.

In conclusion, the functions introduced here significantly accelerate L1-regularization in the mass-univariate setting and make it practically feasible to estimate L1-regularized models on whole-brain data and large samples.

Information Sharing Statement The presented software package including compiled binaries and source code is available at https://git.io/JvUpi.

Acknowledgments This work was supported by the German Research Foundation (DFG), grant CRC 940, project Z2. We thank the Center for Information Services and High Performance Computing (ZIH) at TU Dresden for generously providing computing resources.

Availability of Data and Material Not applicable.

Code Availability The presented software package including compiled binaries and source code is available at https://git.io/JvUpi.

Funding Open Access funding provided by Projekt DEAL. This work was supported by the German Research Foundation (DFG), grant CRC 940, project Z2.

Compliance with Ethical Standards

Conflicts of Interest/Competing Interests The authors declare that they have no conflict of interest.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Eslami, S. M. A., Jimenez Rezende, D., Besse, F., Viola, F., Morcos, A. S., Garnelo, M., Ruderman, A., Rusu, A. A., Danihelka, I., Gregor, K., Reichert, D. P., Buesing, L., Weber, T., Vinyals, O., Rosenbaum, D., Rabinowitz, N., King, H., Hillier, C., Botvinick, M., Wierstra, D., Kavukcuoglu, K., & Hassabis, D. (2018). Neural scene representation and rendering. Science, 360, 1204–1210.

Friedman, J. H., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. J Stat Softw, 33, 1–22.

Güclü, U., & van Gerven, M. A. J. (2015). Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. J Neurosci, 35, 10005–10014.

Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 42, 80–86.

Huth, A. G., de Heer, W. A., Griffiths, T. L., Theunissen, F. E., & Gallant, J. L. (2016). Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 532, 453–458.

Kay, K. N., Naselaris, T., Prenger, R. J., & Gallant, J. L. (2008). Identifying natural images from human brain activity. Nature, 452, 352–355.

Kell, A. J. E., Yamins, D. L. K., Shook, E. N., Norman-Haignere, S. V., & McDermott, J. H. (2018). A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron, 98, 630–644.e16.

Mohr, H., Cichy, R. M., & Ruge, H. (2019). Deep neural networks can predict human behavior in arcade games. Proceedings of the 2019 conference on cognitive computational neuroscience, Berlin, Germany. DOI: https://doi.org/10.32470/CCN.2019.1043-0

Mumford, J. A., Turner, B. O., Ashby, F. G., & Poldrack, R. A. (2012). Deconvolving BOLD activation in event-related designs for multivoxel pattern classification analyses. NeuroImage, 59, 2636–

Mumford, J. A., Davis, T., & Poldrack, R. A. (2014). The impact of study design on pattern estimation for single-trial multivariate pattern analysis. NeuroImage, 103, 130–138.

Naselaris, T., Kay, K. N., Nishimoto, S., & Gallant, J. L. (2011). Encoding and decoding in fMRI. NeuroImage, 56, 400–410.

Nishimoto, S., Vu, A. T., Naselaris, T., Benjamini, Y., Yu, B., & Gallant, J. L. (2011).
Reconstructing visual experiences from brain activity evoked by natural movies. Curr Biol, 21,1641–1646. Ollinger, J. M., Shulman, G. L., & Corbetta, M. (2001). Separating pro- cesses within a trial in event-related functional MRI. NeuroImage, 13,210–217. References Rissman, J., Gazzaley, A., & D’Esposito, M. (2004). Measuring func- tional connectivity during distinct stages of a cognitive task. NeuroImage, 23,752–763. Banino, A., Barry, C., Uria, B., Blundell, C., Lillicrap, T., Mirowski, P., Schoenmakers, S., Barth, M., Heskes, T., & van Gerven, M. (2013). Pritzel, A., Chadwick, M. J., Degris, T., Modayil, J., Wayne, G., Linear reconstruction of perceived images from human brain activ- Soyer, H., Viola, F., Zhang, B., Goroshin, R., Rabinowitz, N., ity. NeuroImage, 83,951–961. Pascanu, R., Beattie, C., Petersen, S., Sadik, A., Gaffney, S., King, Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J H., Kavukcuoglu, K., Hassabis, D., Hadsell, R., & Kumaran, D. R Stat Soc Ser B Methodol, 58,267–288. (2018). Vector-based navigation using grid-like representations in van Gerven, M. A. J. (2017). A primer on encoding models in sensory artificial agents. Nature, 557,429–433. neuroscience. J Math Psychol, 76,172–183. Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2010). Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends® Mach. Learn., 3, Publisher’sNote Springer Nature remains neutral with regard to juris- 1–122. dictional claims in published maps and institutional affiliations. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Neuroinformatics Springer Journals

Fast Estimation of L1-Regularized Linear Models in the Mass-Univariate Setting

Neuroinformatics, Volume 19 (3) – Sep 15, 2020

Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2020
ISSN
1539-2791
eISSN
1559-0089
DOI
10.1007/s12021-020-09489-1

Abstract

In certain modeling approaches, activation analyses of task-based fMRI data can involve a relatively large number of predictors. For example, in the encoding model approach, complex stimuli are represented in a high-dimensional feature space, resulting in design matrices with many predictors. Similarly, single-trial models and finite impulse response models may also encompass a large number of predictors. In settings where only few of those predictors are expected to be informative, a sparse model fit can be obtained via L1-regularization. However, estimating L1-regularized models requires an iterative fitting procedure, which considerably increases computation time compared to estimating unregularized or L2-regularized models, and complicates the application of L1-regularization on whole-brain data and large sample sizes. Here we provide several functions for estimating L1-regularized models that are optimized for the mass-univariate analysis approach. The package includes a parallel implementation of the coordinate descent algorithm for CPU-only systems and two implementations of the alternating direction method of multipliers algorithm requiring a GPU device. While the core algorithms are implemented in C++/CUDA, data input/output and parameter settings can be conveniently handled via Matlab. The CPU-based implementation is highly memory-efficient and provides considerable speed-up compared to the standard implementation not optimized for the mass-univariate approach. Further acceleration can be achieved on systems equipped with a CUDA-enabled GPU. Using the fastest GPU-based implementation, computation time for whole-brain estimates can be reduced from 9 h to 5 min in an exemplary data setting. Overall, the provided package facilitates the use of L1-regularization for fMRI activation analyses and enables an efficient employment of L1-regularization on whole-brain data and large sample sizes.
Keywords: fMRI, L1-regularization, Lasso, Sparsity, Encoding model, GPU

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s12021-020-09489-1) contains supplementary material, which is available to authorized users.

* Holger Mohr, holger.mohr@tu-dresden.de
Department of Psychology, Technische Universität Dresden, 01062 Dresden, Germany

Introduction

Over the last two decades, various fMRI activation analysis approaches have been established that involve a relatively large number of predictors. For example, single-trial models can be employed to obtain activation estimates for individual experimental trials (Mumford et al. 2012). The single-trial estimates can be subsequently used as input for further analyses such as multivariate pattern analyses or connectivity analyses (Mumford et al. 2014; Rissman et al. 2004). As single-trial models require a separate predictor for each experimental trial, the resulting design matrices typically encompass a relatively large number of predictors.

The finite impulse response (FIR) model is another example for an fMRI activation analysis approach involving many predictors (Ollinger et al. 2001). In the FIR approach, the fMRI signal is modeled by a set of impulse response predictors, instead of a predefined hemodynamic response shape. This modeling approach can capture activation dynamics deviating from the canonical response shape, but typically involves a large number of predictors (growing proportionally with the size of the FIR basis set). Both in single-trial and FIR models, the large number of predictors can reduce the robustness of the beta estimates.

Another fMRI modeling approach typically involving a large number of predictors is the so-called encoding model approach (Naselaris et al. 2011; van Gerven 2017). In this approach, stimuli are represented in a high-dimensional feature space instead of being assigned to a low-dimensional set of categories. For example, instead of assigning visual stimuli to object categories such as houses, trees, etc., the stimuli are represented by a high-dimensional vector of feature weights. Such features may comprise a set of Gabor wavelets (Kay et al. 2008), may be extracted from deep neural networks (Güclü and van Gerven 2015) or may simply consist of pixel values (Schoenmakers et al. 2013). The encoding model approach has also been employed outside the visual domain, for example to characterize semantic representations of words (Huth et al. 2016). In this study, the feature space was defined as a basic dictionary of English words, and the model was fitted on whole-brain data. The high-dimensional representation of stimuli used in the encoding model approach typically translates into design matrices encompassing more predictors than time points. In this setting, a unique model fit can only be obtained by adding a regularization term to the model.

Model regularization adds additional constraints to the model fitting procedure on top of minimizing the error term. As the number of predictors approaches the number of collected time points (i.e. the length of the fMRI time series), model regularization becomes increasingly relevant, and for saturated models, regularization is indispensable. In the context of fMRI activation analyses, model regularization can improve the robustness of single-trial and FIR models, and is crucial for the encoding model approach.

For linear models, the two most common types of regularization are L1-regularization and L2-regularization (also known as lasso and ridge regression, Tibshirani 1996; Hoerl and Kennard 1970). While L1-regularization puts a threshold on the sum of absolute values of the beta estimates, L2-regularization bounds the sum of squared beta-values. These two types of regularization can result in fundamentally different beta estimates: L2-regularized models return nonzero beta-values for all predictors, whereas L1-regularized models return a sparse model fit, that is, most of the beta-values are set to zero and only a few predictors are included in the model fit. Whether to employ L1- or L2-regularization depends on a-priori assumptions on the data at hand. L2-regularization assumes that most of the predictors have an impact on the fMRI signal. In contrast, L1-regularization is based on the assumption that the fMRI signal can be modeled by a small fraction of the included predictors.

While both types of regularization have been employed in fMRI studies (Huth et al. 2016; Nishimoto et al. 2011), L2-regularization seems to occur more frequently in the neuroimaging literature than L1-regularization. This might be partly explained by the fact that fitting L1-regularized models is considerably more expensive, in terms of computation time, than fitting L2-regularized or unregularized models. While L2-regularized and unregularized models can be estimated using closed-form solutions, L1-regularization requires an iterative fitting procedure. Thus, estimating L1-regularized models instead of L2- or unregularized models substantially increases the running time of fMRI analyses. For certain types of analyses, for example whole-brain analyses on large samples, it is virtually infeasible to employ L1-regularization.

Here, we aim to facilitate the estimation of L1-regularized models on fMRI data. In the following, we present a package of functions for estimating L1-regularized models that are optimized for the mass-univariate approach. We describe the implementation of the functions, how to set their parameters, and provide benchmark results for two exemplary data settings.

Methods

In the following, we assume to have an fMRI data matrix Y of size n × v, with n being the number of time points and v being the number of time series (e.g. the number of voxels). Moreover, we have a design matrix X of size n × p, with p being the number of predictors. The beta-values are stored in a matrix B of size p × v. The intercept of the model is of size n × 1 and is denoted I, and the intercept's beta-values are stored in $B_0$ of size 1 × v. Matrix columns are indexed by j. We assume that the columns of the design matrix X are z-scored, i.e. $\sum_{i=1}^{n} X_{ij} = 0$ and $\frac{1}{n}\sum_{i=1}^{n} X_{ij}^2 = 1$ for all j = {1, ⋯, p}. For a given regularization parameter λ ≥ 0, the L1-regularized model fit $(\hat{B}, \hat{B}_0)$ is expected to minimize, for each j = {1, ⋯, v}, the following objective function:

$$(\hat{B}_j, \hat{B}_{0,j}) = \underset{B_j,\, B_{0,j}}{\operatorname{argmin}} \; \frac{1}{2n} \left\lVert Y_j - X B_j - I B_{0,j} \right\rVert_2^2 + \lambda \left\lVert B_j \right\rVert_1$$

In contrast to L2-regularized or unregularized models, L1-regularized model fits cannot be computed using a closed-form solution. Instead, an iterative procedure is required to find $\hat{B}$. In the following sections, we describe two different algorithms for fitting L1-regularized models, coordinate descent (Friedman et al. 2010) and alternating direction method of multipliers (ADMM, Boyd et al. 2010), and how we optimized and implemented these algorithms for mass-univariate analyses.

The CPU-Based Implementation: lasso_mex

In the CPU-based implementation, the model fit is computed using the coordinate descent algorithm proposed by Friedman et al. 2010. In each step of the iterative fitting procedure, the beta-value of a single predictor is updated while the remaining beta-values are fixed. As suggested by Friedman et al.
2010, the beta-values are computed using covariance updates, i.e. without explicitly computing residual values.

Friedman et al. 2010 suggest to compute covariances between predictors dynamically as required during the estimation process. However, in the mass-univariate setting, the same design matrix is used to model a large number of fMRI time series. Thus, instead of computing covariances between predictors dynamically for each voxel, we precompute the full covariance matrix of the design matrix before starting the coordinate descent, thereby avoiding redundant computations of predictor covariances across voxels.

Aside from precomputing the covariance matrix, our implementation includes features such as warm starts and active sets described in Friedman et al. 2010. The idea behind active sets is to iterate only through predictors whose beta-values were set to nonzero values at an earlier stage. Once convergence among the beta-values included in the active set is achieved, the algorithm iterates through all beta-values to check whether additional predictors have to be included. Using active sets can considerably speed up the estimation procedure, and moreover beta-values can be stored in sparse format, thereby reducing memory usage.

The estimation procedure for a given λ parameter can be considerably accelerated by properly initializing the beta-values, instead of starting with all beta-values set to zero. Such an initialization can be obtained from a prior estimate using a larger λ parameter. Generally, computation times increase as λ becomes smaller, due to the larger number of nonzero beta-values. Thus, successively fitting models along a decreasing sequence of lambda parameters using warm starts is typically faster than starting from scratch for each lambda value (Friedman et al. 2010).

To further accelerate the estimation procedure, the CPU-based implementation distributes computations among multiple CPU cores using a parallel for-loop over the voxels, exploiting the fact that models are fitted independently across voxels in the mass-univariate analysis approach. To this end, the algorithm was implemented in C++ using OpenMP for parallelization. The performance improvement achieved by this parallelization step depends on the number of available CPU cores.

How to Use lasso_mex

While the underlying coordinate descent algorithm is implemented in C++, the lasso_mex function can be conveniently called from Matlab via the mex API. The function takes a design matrix X, a matrix Y containing fMRI time series and a sequence of lambda-parameters lambda_seq as input. Technical parameters can be optionally specified using an options structure, otherwise default values are used. The columns of the design matrix X must be z-scored, and X must not contain an intercept column. The function returns beta-values in sparse format, with b_values containing the actual beta values, b_indexes containing the indexes of the values, and N_nz containing the number of nonzero beta-values for each voxel. Unregularized intercept coefficients are returned in b0.

[b_values, b_indexes, N_nz, b0] = lasso_mex(X, Y, lambda_seq);

The function lasso_mex is written in Matlab and calls, after some sanity checks and precomputations, the MEX function lasso_mex_cpp.c, which is written in C++ and runs the coordinate descent algorithm.

The resulting beta-values in sparse format can be converted to full format using the convert_betas_sparse_to_full function:

b_full = convert_betas_sparse_to_full(b_values, b_indexes, N_nz, size(X, 2));

The sequence of lambda-parameters lambda_seq should be decreasing in order to benefit from warm starts as described above. Lambda-values are typically defined on a log-scale, e.g. $\lambda \in \{2^{k}, 2^{k-1}, \dots, 2^{k-n}\}$. The lambda parameter determines the degree of model regularization. For lambda-values larger than a certain data-dependent threshold, all beta-values are set to zero, and the model fit only consists of the intercept. The critical threshold can be computed using the calculate_lambda_start function.

lambda_start = calculate_lambda_start(X, Y);

For lambda-values larger than lambda_start, all beta-values will be zero. Thus, the first value of the lambda sequence is typically set to lambda_start or to a value from a predefined discrete scale that is close to lambda_start. How to set the smallest value of the lambda-sequence is less clear and involves a trade-off between computation time and the desired degree of model saturation. Generally, estimating weakly regularized models requires longer computation times than estimating strongly regularized models. As the number of nonzero beta-values approaches the number of time points of the fMRI time series, model estimation can become time-consuming and unstable in overparameterized settings (i.e. when p > n). Thus, the smallest value of the lambda sequence is typically chosen to achieve a certain degree of model saturation while keeping computation time within reasonable bounds.

Optionally, technical parameters of the lasso_mex function can be set via the options structure. The default values are:

options.n_iter_max = 1e5;
options.tol_value = 1e-3;
options.buffer_factor = 3;
options.cpu_load_factor = 1;

[b_values, b_indexes, N_nz, b0] = lasso_mex(X, Y, lambda_seq, options);

The n_iter_max parameter defines an upper bound for the number of iterations performed by the coordinate descent algorithm. This parameter can be set to a larger value if the default value results in the error message "Max. iter. Reached, no convergence!". However, reaching the maximum number of iterations can also indicate that model regularization is too weak and more stringent regularization is required.

The tol_value parameter determines the precision of the estimated beta-values. The coordinate descent algorithm stops when $\max \lvert B_j^{\text{new}} - B_j^{\text{old}} \rvert <$ tol_value. If higher than default precision is required, the tol_value parameter can be set to a smaller value, which will typically increase computation time. Vice versa, low-precision estimates can be obtained by setting tol_value to a larger value, which might accelerate the estimation procedure.

To minimize memory requirements, beta-values are stored in sparse format. The maximum number of nonzero beta-values per voxel is determined by the buffer_factor parameter. Internally, the buffer_factor is multiplied by n (the number of rows of the design matrix X) in order to compute how much memory is preallocated for the beta-values. In well-defined settings, i.e. if p ≤ n, the buffer_factor parameter can be set to 1. In overparameterized settings (i.e. if p > n), if the default value results in the error message "N_nz over maximum, larger buffer is required!", the buffer_factor parameter should be set to a larger value. How the buffer_factor impacts memory demands is detailed in the Supplementary Material section on memory usage.

The parameter cpu_load_factor determines the degree of CPU utilization. For a cpu_load_factor of 1 (default value), all CPU cores are engaged, whereas for a cpu_load_factor of 0, only a single core is occupied. To distribute computations among all but one core, the cpu_load_factor can be set to 0.99.

The GPU-Based Implementations: lasso_mexcuda and lasso_gpu

The two GPU-based implementations are based on the alternating direction method of multipliers (ADMM) algorithm described in Boyd et al. 2010. In contrast to the coordinate descent algorithm, the ADMM algorithm does not sequentially cycle through the beta coefficients but instead all beta coefficients are updated simultaneously via matrix multiplication. While the ADMM procedure is less memory-efficient than coordinate descent, it can be accelerated on the GPU by parallelizing matrix multiplications and other steps.

The first version (lasso_mexcuda) can be called from Matlab and does not require any Matlab toolboxes. After some sanity checks and precomputations, the mex functions ADMMcublasOverMex.c or ADMMcublasUnderMex.c are called (depending on whether the design matrix is overparameterized or not), which are written in C++. These functions then call ADMMcublasOver.cu or ADMMcublasUnder.cu, which contain CUDA code calling functions from the cuBLAS library to run the ADMM algorithm on the GPU. The second version (lasso_gpu) is implemented directly in Matlab using gpuArray and thus depends on the Parallel Computing Toolbox. For both versions, a CUDA-enabled GPU device is required (see also Table 1).

Both lasso_mexcuda and lasso_gpu use warm starts along the supplied lambda sequence, and the same considerations on the choice of the lambda sequence as discussed above in the lasso_mex section also apply to the lasso_mexcuda and lasso_gpu functions.

How to Use lasso_mexcuda and lasso_gpu

The functions take a design matrix X, a matrix Y containing fMRI time series and a sequence of lambda-parameters lambda_seq as input. Technical parameters can be optionally specified using an options structure, otherwise default values are used. The columns of the design matrix X must be z-scored, and X must not contain an intercept column. Beta-values are returned in full format in B, and unregularized intercept coefficients are returned in B0. As lasso_mexcuda and lasso_gpu are optimized for the GPU, B and B0 are returned in single-precision format.

[B, B0] = lasso_mexcuda(X, Y, lambda_seq);
[B, B0] = lasso_gpu(X, Y, lambda_seq);

Optionally, the following technical parameters can be specified (set to default values here):

options.n_iter_max = 1e5;
options.tol_value = 1e-3;
options.buffer_size = 8192;

[B, B0] = lasso_mexcuda(X, Y, lambda_seq, options);
[B, B0] = lasso_gpu(X, Y, lambda_seq, options);
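The coordinate descent procedure described for lasso_mex (a covariance matrix precomputed once, covariance updates without explicit residuals, warm starts along a decreasing lambda sequence, and the max-change stopping rule) can be sketched for a single voxel as follows. This is an illustrative sketch in plain Python, not the package's C++/OpenMP code: all function names (`soft_threshold`, `zscore_columns`, `gram_matrix`, `lambda_start`, `lasso_path`) are hypothetical, active sets and sparse storage are omitted, and the threshold formula in `lambda_start` is the standard one for the objective stated in the Methods section, which is only assumed to match what calculate_lambda_start computes.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def zscore_columns(X):
    """Return a copy of X (list of rows) whose columns have mean 0 and mean square 1."""
    n, p = len(X), len(X[0])
    Z = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [X[i][j] for i in range(n)]
        mean = sum(col) / n
        sd = (sum((x - mean) ** 2 for x in col) / n) ** 0.5
        for i in range(n):
            Z[i][j] = (col[i] - mean) / sd
    return Z


def gram_matrix(X):
    """Precompute G = X'X / n; in the mass-univariate setting it is shared by all voxels."""
    n, p = len(X), len(X[0])
    return [[sum(X[i][j] * X[i][k] for i in range(n)) / n for k in range(p)]
            for j in range(p)]


def lambda_start(X, y):
    """Assumed threshold: max_j |x_j'(y - mean(y))| / n; at or above it all betas are zero."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    return max(abs(sum(X[i][j] * (y[i] - y_mean) for i in range(n))) / n
               for j in range(p))


def lasso_path(X, y, lambda_seq, tol=1e-3, n_iter_max=100000):
    """Fit one voxel along a decreasing lambda sequence using warm starts."""
    n, p = len(X), len(X[0])
    G = gram_matrix(X)
    y_mean = sum(y) / n  # unregularized intercept, valid because X is centered
    c = [sum(X[i][j] * (y[i] - y_mean) for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    path = []
    for lam in lambda_seq:
        for _ in range(n_iter_max):
            max_change = 0.0
            for j in range(p):
                # Covariance update: x_j'(partial residual)/n without forming residuals.
                rho = c[j] - sum(G[j][k] * beta[k] for k in range(p) if k != j)
                new = soft_threshold(rho, lam) / G[j][j]
                max_change = max(max_change, abs(new - beta[j]))
                beta[j] = new
            if max_change < tol:  # stop when the largest beta change is below tol
                break
        path.append(list(beta))  # beta is reused as the warm start for the next lambda
    return y_mean, path
```

For a mass-univariate fit, `gram_matrix(X)` would be computed once and reused while looping (or parallelizing) over the columns of Y; the sketch recomputes it per call only to stay short.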
Table 1 Overview over hardware and software requirements of the different functions for estimating L1-regularized linear models. The lasso function is part of the Statistics and Machine Learning Toolbox and only included here for benchmarking purposes (see main text). The lasso_mex, lasso_mexcuda and lasso_gpu functions introduced here are specifically optimized for the mass-univariate analysis approach and depend on different hardware and software configurations.

Function | Hardware requirements | Software requirements | Algorithm | Implementation
lasso | CPU | Matlab, Statistics and ML Toolbox | Coordinate descent | Matlab
lasso_mex | CPU | Matlab | Coordinate descent | C++, OpenMP
lasso_mexcuda | CPU + CUDA-enabled GPU | Matlab, CUDA Toolkit | ADMM | C++, cuBLAS
lasso_gpu | CPU + CUDA-enabled GPU | Matlab, Parallel Computing Toolbox | ADMM | Matlab

The n_iter_max parameter defines an upper bound for the number of iterations performed by the ADMM algorithm. This parameter can be set to a larger value if the default value results in the error message "Max. iter. Reached, no convergence!". However, reaching the maximum number of iterations can also indicate that model regularization is too weak and more stringent regularization is required.

The tol_value parameter determines the precision of the estimated beta-values. If higher than default precision is required, the tol_value parameter can be set to a smaller value, which will typically increase computation time. Vice versa, low-precision estimates can be obtained by setting tol_value to a larger value, which might accelerate the estimation procedure.

The buffer_size determines how many voxels are simultaneously processed on the GPU. Setting this parameter to a larger value might accelerate the estimation procedure, provided that the GPU has sufficient memory resources. If buffer_size exceeds the memory capacity of the GPU, Matlab terminates the function call with an error message. Memory requirements can be estimated based on the heuristics provided in the Supplementary Material section on memory usage.

Benchmarking

The three functions lasso_mex, lasso_mexcuda and lasso_gpu were benchmarked in two data settings to demonstrate the efficiency of the implementations. In both benchmarks, X and Y data were randomly drawn from the normal distribution.

Benchmark A corresponds to an overparameterized model, representing for example an encoding model with a large feature space. The number of time points was set to n = 300, which corresponds to a 10 min scanner run for a TR of 2 s. The number of predictors was set to p = 5000, i.e. p ≫ n. The model was estimated on v = 65536 voxels, approximately corresponding to whole-brain data for an isotropic 3 mm resolution. The lambda sequence was set to $\{2^{-2}, 2^{-3}, \dots, 2^{-6}\}$.

Benchmark B corresponds to a well-defined setting, e.g. a single-trial or FIR model. The number of time points and voxels in setting B were identical to setting A, but the number of predictors was set to p = 200, i.e. p < n. The models were estimated for a single lambda-value $\lambda = 2^{-4}$ to allow for a comparison of the computation times with L2-regularized and unregularized model fits.

The three functions lasso_mex, lasso_mexcuda and lasso_gpu were compared to the lasso function that is part of Matlab's Statistics and Machine Learning Toolbox. The lasso function was repeatedly called to fit models for all voxels using a for-loop, taking a single time series as input in each call.

The benchmarks were run on a workstation equipped with a dual Intel Xeon E5-2665 CPU (16 cores overall), 32 GB memory and an Nvidia Quadro P2000 GPU. The functions were benchmarked on Windows 10 64-bit and Matlab R2019b, as well as Ubuntu 18.04 LTS and Matlab R2018b. On Windows, lasso_mex was compiled using mex and the MSVC compiler of Visual Studio 2017, and lasso_mexcuda using CUDA Toolkit 10.1.243. On Linux, lasso_mex was compiled using mex and gcc 6.5, and lasso_mexcuda using CUDA Toolkit 9.1.85.

Moreover, to assess the impact of the GPU hardware, we compared two different GPU devices using benchmark A, Nvidia's Quadro P2000 (1024 cores, 5 GB memory) and Tesla V100 SXM2 (5120 cores, 16 GB memory). To this end, the functions lasso_mexcuda and lasso_gpu were run on a p3.2xlarge instance on Amazon Web Services (AWS) using a Matlab Amazon Machine Image (AMI).

Code and Software Requirements

The presented software package including compiled binaries and source code is available at https://git.io/JvUpi. Data input/output is handled via Matlab for all functions of the package. The functions have been tested using Matlab R2019b and R2018b, but should work with other versions as well. Moreover, the GPU-accelerated functions require either the CUDA toolkit (lasso_mexcuda) or Matlab's Parallel Computing Toolbox (lasso_gpu). Note that each Matlab version requires a specific version of the CUDA toolkit, as listed here: https://mathworks.com/help/parallel-computing/gpu-support-by-release.html. GPU devices should have compute capability >= 3.0.

Results

On both benchmarks A and B, the lasso_mex function was considerably faster than the standard lasso function using a single CPU core, showing that the coordinate descent algorithm underlying lasso_mex is efficiently implemented (see Fig. 1). Setting the lasso_mex function to distribute computations across multiple CPU cores led to further reductions of computation time. More details on how computation time varied as a function of CPU cores are given in Supplementary Fig. 1. The parallel implementation of the ADMM algorithm on the GPU (lasso_mexcuda, lasso_gpu) provided further acceleration for benchmark A, with lasso_gpu being considerably faster than lasso_mexcuda. In absolute numbers, a reduction of computation time from approximately 9 h to 5 min could be achieved on Windows using the lasso_gpu function, see Supplementary Table 1. As shown in benchmark B, L2-regularized (ridge regression) or unregularized (ordinary least squares, OLS) model estimation remains faster than accelerated L1-regularization.

The comparison of two different GPU devices using benchmark A on Linux revealed that the larger device (Tesla V100) achieves a 4.5x speed-up over the smaller card (Quadro P2000), see Fig. 2. This acceleration approximately corresponds to the ratio of available CUDA cores of 5120:1024, see Supplementary Fig. 2 for more details. In absolute numbers, computation time could be further reduced to 1 min using the lasso_gpu function on the V100 device, see Supplementary Table 2.

Fig. 1 Benchmark results for the three lasso functions lasso_mex, lasso_mexcuda and lasso_gpu. (a): Benchmark for an overparameterized setting with n = 300 time points and p = 5000 predictors, corresponding for example to an encoding model with a large feature space. The lasso_mex function required considerably less running time for whole-brain estimates than the standard lasso function (Matlab, orange) on a single CPU core (C++, light blue). Distributing computations across multiple CPU cores further reduced the running time of the lasso_mex function (OpenMP, dark blue). The lasso_mexcuda function, which runs the ADMM algorithm on a GPU using the cuBLAS library, further accelerated the estimation procedure (cuBLAS, green). The lasso_gpu function, which runs the ADMM algorithm on the GPU using the Parallel Computing Toolbox, provided the highest speed-up factor for benchmark A (gpuArray, green/orange). (b): Benchmark for a well-defined setting with fewer predictors than time points (n = 300, p = 200), corresponding for example to a single-trial or FIR model. Again, on a single CPU core, the lasso_mex function fitted L1-regularized models more efficiently than the standard lasso function. Further speed-up could be achieved by distributing computations across multiple CPU cores. The GPU-based implementations (lasso_mexcuda, lasso_gpu) performed not as fast as the multicore CPU version on this benchmark, as the GPU was not fully occupied in this small-scale setting. Unregularized (OLS, yellow) or L2-regularized (ridge, gray) model estimation using closed-form solutions remains faster than accelerated L1-regularization. Absolute computation times are given in Supplementary Table 1.

Fig. 2 Benchmark results for two different GPU devices. Nvidia’s Tesla V100 GPU device yielded a speed-up of approximately 4.5x over the Quadro P2000 card, which roughly corresponds to the ratio of 5120:1024 cores. Absolute computation times are given in Supplementary Table 2.

Discussion

We introduced a package of functions for estimating L1-regularized models in the mass-univariate fMRI analysis approach. The presented functions significantly accelerate the model estimation procedure and thereby facilitate the use of L1-regularization in the mass-univariate approach and enable its application on whole-brain data and large samples. While the presented benchmarks were performed on data corresponding to a single fMRI scanning session, efficiency gains scale up linearly for multiple sessions per subject and multiple subjects per sample. For example, for a sample of 20 subjects and 5 scanning sessions per subject, a computation time reduction from 9 h to 5 min per scanning session translates into a reduction from 37 days to 8 h for the whole sample.

This speed-up makes it practically feasible to employ L1-regularization in the context of the encoding model approach. While the majority of studies using the encoding model approach have been focused on visual processing (van Gerven 2017), Huth et al. 2016 have shown that an extension to other domains such as whole-brain semantic representations is possible and promising. Data-driven generation of predictive features, for example via deep neural networks (Güclü and van Gerven 2015; Kell et al. 2018; Mohr et al. 2019), typically results in large feature spaces and therefore requires efficient model estimation procedures. Given the rapid advancement of machine learning in a range of domains that are also relevant for neuroimaging (e.g. geometric representations of objects (Eslami et al. 2018) or spatial navigation (Banino et al. 2018)), we expect a proliferation of the encoding model approach for predicting fMRI signals via machine-learning generated features. Estimating such large-scale encoding models using L1-regularization can be efficiently performed by the functions presented here.

Moreover, the presented functions facilitate the use of L1-regularization for modeling approaches such as single-trial and FIR models (Mumford et al. 2012; Ollinger et al. 2001). To our knowledge, the impact of L1-regularization has not yet been systematically evaluated for these types of models. Using the functions presented here, future studies can efficiently evaluate how L1-regularization impacts the robustness and predictiveness of single-trial and FIR models in comparison to L2-regularization or unregularized model estimation.

In the following, we discuss some limitations of L1-regularization in general and the specific implementations presented here. It is important to note that the beta estimates returned by L1-regularized models are typically sparse and thus not normally distributed. Depending on how the beta estimates are used in subsequent analysis steps, a modification of some of these steps might be required. For example, for aggregating data, median values can be used instead of arithmetic means to preserve sparsity. Moreover, when using L1-regularization for example for estimating single-trial models, both the spatial activation patterns and the beta-series are typically sparse. To compute correlations between such sparse spatial patterns or beta-series, rank-based correlation measures should be used instead of Pearson correlations. In contrast, using sparse beta estimates for predicting fMRI signals in the context of the encoding model approach would typically not require specific adjustments. For example, to compute model predictions of an L1-regularized model, the design ma-
2019), typically trix of a given test dataset can simply be multiplied with the results in large feature spaces and therefore requires efficient sparse beta estimates obtained on a training dataset. With re- model estimation procedures. Given the rapid advancement of spect to decoding techniques such as multivariate pattern anal- machine learning in a range of domains that are also relevant ysis (MVPA), it should be noted that although it would be for neuroimaging (e.g. geometric representations of objects technically possible to use the presented functions for estimat- (Eslami et al. 2018) or spatial navigation (Banino et al. ing decoding models (by filling the design matrix with fMRI 2018)), we expect a proliferation of the encoding model ap- data), no acceleration can be expected in this case, as decoding proach for predicting fMRI signals via machine-learning gen- models are typically estimated for a single outcome variable erated features. Estimating such large-scale encoding models only. Since the presented functions are optimized for the using L1-regularization can be efficiently performed by the mass-univariate approach, estimation procedures are only ac- functions presented here. celerated in settings involving many outcome variables. Moreover, the presented functions facilitate the use of L1- In conclusion, the functions introduced here significantly regularization for modeling approaches such as single-trial accelerate L1-regularization in the mass-univariate setting and and FIR models (Mumford et al. 2012; Ollinger et al. 2001). make it practically feasible to estimate L1-regularized models To our knowledge, the impact of L1-regularization has not yet on whole-brain data and large samples. 392 Neuroinform (2021) 19:385–392 Eslami, S. M. A., Jimenez Rezende, D., Besse, F., Viola, F., Morcos, A. Information Sharing Statement The presented software pack- S., Garnelo, M., Ruderman, A., Rusu, A. 
A., Danihelka, I., Gregor, age including compiled binaries and source code is available K., Reichert, D. P., Buesing, L., Weber, T., Vinyals, O., at https://git.io/JvUpi. Rosenbaum, D., Rabinowitz, N., King, H., Hillier, C., Botvinick, M., Wierstra, D., Kavukcuoglu, K., & Hassabis, D. (2018). Neural scene representation and rendering. Science, 360,1204–1210. Acknowledgments This work was supported by the German Research Friedman, J. H., Hastie, T., & Tibshirani, R. (2010). Regularization paths Foundation (DFG), grant CRC 940, project Z2. We thank the Center for for generalized linear models via coordinate descent. JStat Softw, Information Services and High Performance Computing (ZIH) at TU 33,1–22. Dresden for generously providing computing resources. Güclü, U., & van Gerven, M. A. J. (2015). Deep neural networks reveal a gradient in the complexity of neural representations across the ven- Availability of Data and Material Not applicable. tral stream. J Neurosci, 35, 10005–10014. Code Availability The presented software package including compiled Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estima- binaries and source code is available at https://git.io/JvUpi. tion for nonorthogonal problems. Technometrics, 42,80–86. Huth, A. G., de Heer, W. A., Griffiths, T. L., Theunissen, F. E., & Gallant, J. L. (2016). Natural speech reveals the semantic maps that tile Funding Open Access funding provided by Projekt DEAL. This work human cerebral cortex. Nature, 532,453–458. was supported by the German Research Foundation (DFG), grant CRC Kay, K. N., Naselaris, T., Prenger, R. J., & Gallant, J. L. (2008). 940, project Z2. We thank the Center for Information Services and High Identifying natural images from human brain activity. Nature, 452, Performance Computing (ZIH) at TU Dresden for generously providing 352–355. computing resources. Kell, A. J. E., Yamins, D. L. K., Shook, E. N., Norman-Haignere, S. V., & McDermott, J. H. (2018). 
A task-optimized neural network rep- Compliance with Ethical Standards licates human auditory behavior, predicts brain responses, and re- veals a cortical processing hierarchy. Neuron, 98,630–644 e16. Mohr, H., Cichy, R, M. & Ruge, H. (2019). Deep neural networks can Conflicts of Interest/Competing Interests The authors declare that they predict human behavior in arcade games. Proceedings of the 2019 have no conflict of interest. conference on cognitive computational neuroscience, Berlin, Open Access This article is licensed under a Creative Commons Germany. DOI: https://doi.org/10.32470/CCN.2019.1043-0 Attribution 4.0 International License, which permits use, sharing, Mumford, J. A., Turner, B. O., Ashby, F. G., & Poldrack, R. A. (2012). adaptation, distribution and reproduction in any medium or format, as Deconvolving BOLD activation in event-related designs for long as you give appropriate credit to the original author(s) and the multivoxel pattern classification analyses. NeuroImage, 59, 2636– source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article Mumford, J. A., Davis, T., & Poldrack, R. A. (2014). The impact of study are included in the article's Creative Commons licence, unless indicated design on pattern estimation for single-trial multivariate pattern anal- otherwise in a credit line to the material. If material is not included in the ysis. NeuroImage, 103,130–138. article's Creative Commons licence and your intended use is not Naselaris, T., Kay, K. N., Nishimoto, S., & Gallant, J. L. (2011). permitted by statutory regulation or exceeds the permitted use, you will Encoding and decoding in fMRI. NeuroImage, 56,400–410. need to obtain permission directly from the copyright holder. To view a Nishimoto, S., Vu, A. T., Naselaris, T., Benjamini, Y., Yu, B., & Gallant, copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. J. L. (2011). 
Reconstructing visual experiences from brain activity evoked by natural movies. Curr Biol, 21,1641–1646. Ollinger, J. M., Shulman, G. L., & Corbetta, M. (2001). Separating pro- cesses within a trial in event-related functional MRI. NeuroImage, 13,210–217. References Rissman, J., Gazzaley, A., & D’Esposito, M. (2004). Measuring func- tional connectivity during distinct stages of a cognitive task. NeuroImage, 23,752–763. Banino, A., Barry, C., Uria, B., Blundell, C., Lillicrap, T., Mirowski, P., Schoenmakers, S., Barth, M., Heskes, T., & van Gerven, M. (2013). Pritzel, A., Chadwick, M. J., Degris, T., Modayil, J., Wayne, G., Linear reconstruction of perceived images from human brain activ- Soyer, H., Viola, F., Zhang, B., Goroshin, R., Rabinowitz, N., ity. NeuroImage, 83,951–961. Pascanu, R., Beattie, C., Petersen, S., Sadik, A., Gaffney, S., King, Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J H., Kavukcuoglu, K., Hassabis, D., Hadsell, R., & Kumaran, D. R Stat Soc Ser B Methodol, 58,267–288. (2018). Vector-based navigation using grid-like representations in van Gerven, M. A. J. (2017). A primer on encoding models in sensory artificial agents. Nature, 557,429–433. neuroscience. J Math Psychol, 76,172–183. Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2010). Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends® Mach. Learn., 3, Publisher’sNote Springer Nature remains neutral with regard to juris- 1–122. dictional claims in published maps and institutional affiliations.
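
Illustrative note on aggregating sparse estimates. The Discussion recommends median values instead of arithmetic means when aggregating L1-regularized beta estimates, in order to preserve sparsity. The following is a minimal sketch of why that holds, not part of the published package: it is written in Python rather than Matlab, the beta values are made up, and the names `betas_by_session`, `aggregate` and `count_zeros` are hypothetical helpers introduced only for this illustration. It assumes the aggregation runs across scanning sessions, separately for each voxel.

```python
from statistics import mean, median

# Hypothetical beta estimates for one predictor at four voxels (columns)
# from five scanning sessions (rows). The values are invented for
# illustration; exact zeros mimic coefficients removed by the L1 penalty.
betas_by_session = [
    [0.0, 0.0, 0.9, 0.0],
    [0.0, 0.3, 1.1, 0.0],
    [0.0, 0.0, 0.8, 0.0],
    [0.2, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.2, 0.0],
]


def aggregate(rows, reducer):
    """Apply `reducer` across sessions, separately for each voxel."""
    return [reducer(column) for column in zip(*rows)]


def count_zeros(values):
    """Count entries that are exactly zero."""
    return sum(1 for v in values if v == 0.0)


mean_betas = aggregate(betas_by_session, mean)
median_betas = aggregate(betas_by_session, median)

print("mean:  ", mean_betas, "exact zeros:", count_zeros(mean_betas))
print("median:", median_betas, "exact zeros:", count_zeros(median_betas))
```

With these invented values, a single nonzero session is enough to make the mean of a voxel nonzero, so only the voxel that is zero in every session stays at zero. The median stays at zero whenever more than half of the sessions are zero, so three of the four voxels remain exactly zero.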

Journal

Neuroinformatics (Springer Journals)

Published: Sep 15, 2020

Keywords: fMRI; L1-regularization; Lasso; Sparsity; Encoding model; GPU

References