Improving Covariance Matrices Derived from Tiny Training Datasets for the Classification of Event-Related Potentials with Linear Discriminant Analysis

Abstract  Electroencephalogram (EEG) data used in the domain of brain–computer interfaces typically has a subpar signal-to-noise ratio, and data acquisition is expensive. An effective and commonly used classifier to discriminate event-related potentials is the linear discriminant analysis which, however, requires an estimate of the feature distribution. While this information is provided by the feature covariance matrix, its large number of free parameters calls for regularization approaches like Ledoit–Wolf shrinkage. Assuming that the noise of event-related potential recordings is not time-locked, we propose to decouple the time component from the covariance matrix of event-related potential data in order to further improve the estimates of the covariance matrix for linear discriminant analysis. We compare three regularized variants thereof and a feature representation based on Riemannian geometry against our proposed novel linear discriminant analysis with time-decoupled covariance estimates. Extensive evaluations on 14 electroencephalogram datasets reveal that the novel approach increases the classification performance by up to four percentage points for small training datasets, and gracefully converges to the performance of standard shrinkage-regularized LDA for large training datasets. Given these results, practitioners in this field should consider using our proposed time-decoupled covariance estimation when they apply linear discriminant analysis to classify event-related potentials, especially when few training data points are available.

Keywords  Event-related potentials · Robust classification · Learning from small datasets · Noise transfer learning · Brain–computer interface · Covariance matrix enhancement

Correspondence: Michael Tangermann, michael.tangermann@donders.ru.nl

Funding note: This work was (partly) supported by the BrainLinks-BrainTools Cluster of Excellence funded by the German Research Foundation (DFG, grant number EXC 1086) and the project SuitAble (DFG, grant number 387670982). The authors would also like to acknowledge support by the state of Baden-Württemberg, Germany, through bwHPC and the German Research Foundation (DFG, INST 39/963-1 FUGG).

Introduction

A brain–computer interface (BCI) allows a subject to control a computer program using his or her brain signals. These are often recorded via the electroencephalogram (EEG), as it is non-invasive, requires relatively inexpensive equipment and could be used by a large part of the population (Wolpaw et al. 2002). Unfortunately, the signal-to-noise ratio of the signals recorded by the EEG electrodes on the scalp is bad, as many factors (e.g. volume conduction or the long distance between sensor and brain tissue) impede the recording quality (Srinivasan 2012). To realize control via BCIs, machine learning techniques are key to decode the brain signals in real time.

In addition to the bad signal-to-noise ratio, the machine learning problem is aggravated by the oftentimes small amount of training data available in BCI experiments. Existing approaches to deal with small datasets, such as transfer learning between subjects or sessions, have limited success if brain signals differ greatly between subjects and even between sessions of the same subject (Jayaram et al. 2016). Also, many BCI paradigms work optimally only if their experimental parameters are tuned to each subject individually (Höhne and Tangermann 2012; Sugi et al. 2018; Allison and Pineda 2006). As many different experimental parameters need to be tested to find the optimal ones, the possibility to work with very small datasets would be a great benefit here.
Neuroinform (2021) 19:461–476

The mentioned challenges explain the popularity of relatively simple classifiers in the BCI domain, which can make efficient use of the training data (Blankertz et al. 2011). In contrast, in domains such as image recognition, the massive amounts of data available enable the employment of more sophisticated methods such as neural networks (Russakovsky et al. 2015).

In this work, we are concerned with the classification of event-related potentials (ERPs) recorded using EEG. ERPs can be evoked by presenting visual, auditory or haptic stimuli to a subject (Sellers et al. 2012; Schreuder et al. 2010; Rutkowski and Mori 2015). Time-locked to the stimulus presentation, the ERP can be measured in the EEG signal. Due to the small ERP amplitude and the high amplitude of the EEG's background activity, visualizations like in Fig. 1 require repeated stimulus presentations and averaging of the resulting ERP epochs. In this figure, two ERPs are shown: one for a specific stimulus the subject has attended (the so-called target ERP), and another one for a single or even multiple other stimuli the subject has ignored (non-target ERP). The difference between target and non-target ERP voltages is the basis for classifying in real time which stimulus a subject attends to. For a productive use of a BCI, however, it is infeasible to average such a large number of epochs before a classification output can be obtained. Therefore, machine learning is used to make classification possible on short recordings.

Many BCI systems make use of a linear discriminant analysis (LDA) (see e.g. Bishop 2006) to classify whether a stimulus was attended or not. The LDA makes use of ERP voltage features and the corresponding covariance (Blankertz et al. 2011). In the domain of ERPs, the shrinkage-regularized LDA still belongs to the state-of-the-art methods (Lotte et al. 2007; Lotte et al. 2018). For ideal ERP data, the assumptions of LDA would even be fulfilled, making LDA the optimal classification approach. However, in practice there are non-stationarities, outliers and artifacts which violate the LDA's assumptions. Recently, Riemannian methods have found their way into BCI. For ERP classification they show promising performance gains on some datasets (Barachant et al. 2010; Barachant and Congedo 2014).

When using LDA to classify ERP signals, most formulations require an estimate of the covariance matrix. This matrix has 0.5 · (D + 1) · D free parameters, with D being the dimensionality of the feature vector. Using an EEG cap with 31 channels and only five time intervals to derive voltage features (as indicated in Fig. 1), the feature dimensionality of D = 155 results in 12090 free parameters of the covariance matrix, which need to be estimated during the LDA training. Usually, the number of data points (epochs) in BCI problems is rather small, leading to sub-optimal estimates of the covariance matrix. If the number of data points happens to be smaller than D, the covariance matrix does not even have full rank and cannot be inverted.

For one of the example datasets in Blankertz et al. (2011), the authors found this to be especially true if "the number of training samples [is] low (750) compared to the dimensionality of the features (385)". For comparison, in our benchmark the smallest training dataset has 72 training samples while the features have 310 dimensions. Farquhar and Hill (2013) show that classification performance increases with more training data, as they expected. However, they also note that "minimizing the number of training samples required to achieve acceptable performance is critically important to practical BCI performance". Lotte et al. (2018) suggest that typical BCI systems could be trained with as few as 20–100 trials per class.
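The parameter-counting argument above is easy to verify numerically. The following minimal sketch is our own illustration (random data stands in for real EEG features); it shows both the number of free parameters and the rank deficiency that occurs when fewer epochs than feature dimensions are available:

```python
import numpy as np

D = 31 * 5          # 31 channels x 5 time-interval means = 155 features
n_epochs = 90       # fewer epochs than feature dimensions

# A symmetric D x D covariance matrix has 0.5 * (D + 1) * D free parameters.
n_free = D * (D + 1) // 2
print(n_free)  # 12090

# With n_epochs < D, the sample covariance cannot have full rank.
rng = np.random.default_rng(0)
X = rng.standard_normal((D, n_epochs))          # simulated feature matrix
X_centered = X - X.mean(axis=1, keepdims=True)  # remove the mean
cov = X_centered @ X_centered.T / (n_epochs - 1)

rank = np.linalg.matrix_rank(cov)
print(rank)  # 89 = n_epochs - 1, far below D = 155, so cov is not invertible
```

Any shrinkage-type regularization (discussed below) restores invertibility by mixing such a rank-deficient estimate with a full-rank diagonal matrix.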
In our benchmark, in the smallest dataset we use 12 target and 60 non-target training samples to train the classifiers. All of the above authors recommend shrinking the covariance matrix used in LDA towards the (scaled) unit sphere, especially when very few training data are available. The required amount of regularization can be determined analytically, as proposed by Ledoit and Wolf (2004), or by using cross-validation.

There are alternative approaches for dealing with few training data. A first step is usually to perform strict data cleaning, such that the quality of the data is improved, for example using spatial filtering approaches (Winkler et al. 2014; Foodeh et al. 2016). Aside from regularization to deal with the dimensionality issue, another straightforward way is to reduce the dimensionality altogether, e.g. by selecting channel subsets (see e.g. Lal et al. 2004; Sannelli et al. 2010; Feess et al. 2013) or by using preprocessing methods that find a lower-dimensional representation, e.g. xDAWN (Rivet et al. 2009) or kernel principal component analysis (kPCA) (Schölkopf et al. 1997). Other approaches employ transfer learning, which re-uses data from previous sessions or even from different subjects to improve the classification performance (Jayaram et al. 2016). However, in this work we focus on developing a method that can be applied without sophisticated preprocessing or additional data. This both increases the employability of our method and facilitates its application to a large number of different ERP datasets.

Fig. 1  Example of the mean ERP responses obtained from a single subject during an auditory oddball paradigm with a stimulus onset asynchrony (SOA) of 193 ms. For this plot, 300 target and 1500 non-target epochs were averaged. Prior to averaging, each epoch had been corrected for baseline shifts relative to the interval [-0.2, 0.0] seconds. Five gray shaded areas between 0.1 and 0.5 seconds post stimulus onset indicate which time intervals are typically used to derive features for classification.

In this work, we aim at improving the regularized covariance matrix even further by making use of the observation that the noise in ERP data can be mostly attributed to task-unrelated background brain activity (Blankertz et al. 2011), which therefore is not time-locked to the stimulus.

To show the efficacy of our method, named time-decoupled covariance estimation, we carefully evaluate its performance on datasets recorded by our lab as well as on public ERP datasets, most of which are available in MOABB (Mother of All BCI Benchmarks) (Jayaram and Barachant 2018).

Methods

We first describe our benchmarking approach. This includes the datasets we used, the general classification and validation procedure, as well as the preprocessing of the EEG data. Afterwards we present the classification methods we use for comparisons. Finally, in "Time-decoupled Covariance Matrices" we detail our proposed new method with time-decoupled covariance estimation.

Benchmark

To compare competing classification approaches, we evaluated their obtained performances on fourteen datasets, which have been derived from twelve ERP data sources (see Table 1) using MOABB (Jayaram and Barachant 2018). We used all (at the time of writing) ERP datasets available in MOABB and added complementing datasets from our lab, i.e. two additional visual speller datasets and ERP data from auditory paradigms with tone and word stimuli.

For analysis purposes we have logically split two of the data sources into two datasets each: the original EPFL dataset (EPFL) was split into data obtained from healthy subjects and from patients, while the brain invaders dataset (BI) was split into subjects with one session and subjects with eight sessions. Note that for the TONE patient and the WORD patient datasets a publication is still pending; however, the paradigms used were a tone oddball and a word oddball, similar to the paradigm described in Musso et al. (2016). Some datasets share subjects: the subjects who took part in the word paradigm in WORD healthy also took part in the tone oddball paradigm in TONE healthy. The same is true for WORD patient and TONE patient, except that in this case four additional subjects are contained in the latter dataset.

Table 1  Overview of the ERP datasets evaluated and their characteristics

  Dataset          Subjects  Sessions  Channels  Paradigm                    Access  Reference
  EPFL healthy*    4         4         32        visual, 6-choices           public  Hoffmann et al. (2008)
  EPFL patient*    4         4         32        visual, 6-choices           public  Hoffmann et al. (2008)
  BNCI healthy 1   10        1         16        visual, speller             public  Aricò et al. (2014)
  BNCI healthy 2   10        1         8         visual, speller             public  Guger et al. (2009)
  BNCI patient     8         1         8         visual, speller             public  Riccio et al. (2013)
  BI a+            7         8         16        visual, speller-like        public  Van Veen et al. (2019)
  BI b+            17        1         16        visual, speller-like        public  Van Veen et al. (2019)
  SPOT             13        1         31        auditory, tones, oddball    public  Sosulski and Tangermann (2019)
  TONE healthy     20        1         63        auditory, tones, oddball    closed  Musso et al. (2016)
  WORD healthy     20        1         63        auditory, words, 6-choices  closed  Musso et al. (2016)
  TONE patient     14        11–25     31        auditory, tones, oddball    closed  pending
  WORD patient     10        11–25     31        auditory, words, 6-choices  closed  pending
  SPELLER LLP      12        1         31        visual, speller             closed  Hübner et al. (2017)
  SPELLER MIX      12        1         31        visual, speller             closed  Hübner et al. (2018)

Datasets listed as closed access have been recorded in our own lab but cannot be fully published, as the subjects' consent had not been obtained for this purpose.
Datasets marked with an asterisk and a plus sign each have been derived by splitting a larger data source (see main text).

For in-depth explanations of the datasets please refer to the corresponding references in Table 1. In total we evaluate on ERP data of 131 subjects (390 sessions in total). Some datasets contained multiple sessions of each subject. We decided to condense them into a single AUC value per subject and classification approach. For this reason, AUC values obtained over multiple sessions of a subject were averaged before reporting.

We are primarily interested in the classification performance on very small datasets. However, some of the datasets available to us consist of many epochs per session of each subject. To investigate training concepts for small datasets, we have split most datasets into virtual data subsets. These splits were done at logical points in the paradigm. For example, in the SPOT dataset, where a subject performed 60 to 80 auditory oddball runs with 90 stimuli each and an approximately eight-second pause between runs, we could split this data up into virtual, non-overlapping subsets consisting of only 90 epochs, i.e. using each oddball run individually for a cross-validation loop. For this dataset, we therefore obtained between 60 and 80 virtual data subsets. For each novel subset, the classification performance was estimated in an individual cross-validation loop. In order to obtain a single classification performance value per session of a subject, we averaged the performances obtained from the virtual data subsets. Overall, the virtual data subsets have sizes between 90 and 4200 epochs.

Evaluation Procedures

We used stratified 5-fold cross-validation within each virtual data subset in order to derive the classification performance, expressed by the area under the receiver operating characteristic curve (AUC). Table 2 indicates for each dataset how many epochs per session of each subject were available in total, and additionally how many epochs were used within a virtual data subset.

Table 2  Average number of virtual data subsets (VDS) available for a session of a subject in each dataset, and the number of epochs within the virtual data subsets

  Dataset         # of VDS  # of Epochs per VDS
  EPFL healthy*   1         832
  EPFL patient*   1         833
  BNCI healthy 1  18        96
  BNCI healthy 2  1         2520
  BNCI patient    1         4200
  BI a            1         480
  BI b            1         480
  SPOT            61        90
  TONE healthy    2         300
  WORD healthy    18        540
  TONE patient    2         300
  WORD patient    54        152
  SPELLER LLP     54        204
  SPELLER MIX     60        204

Statistical significance (at α = 0.05) between the classification methods was determined using a paired Wilcoxon signed-rank test (Wilcoxon et al. 1970) on the differences between the average performances of each dataset, i.e. 14 values. We compared our proposed method with its underlying base classification, and our proposed method with the best other classification method. Correction for multiple testing was done using the Holm–Bonferroni method (Holm 1979).

Using the code supplied in our repository (https://github.com/jsosulski/time-decoupled-lda), the benchmark results can be reproduced.

Data Preprocessing

Before obtaining the actual features which can be used by the classifiers, all EEG datasets are preprocessed using a forward and a backward pass of a Butterworth bandpass filter in the range of 0.5 Hz to 16 Hz. Afterwards, the data was downsampled to 100 Hz and windowed to 0 s to 1 s relative to each stimulus onset to represent the corresponding data epoch.

The common preprocessing step of baseline correction for ERP analysis causes the standard deviation in each channel to differ between time intervals (see Fig. 2), i.e. it causes heteroscedasticity. As our proposed method assumes homoscedasticity, we ran the benchmark both with and without baseline correction to determine the impact of this step.

Fig. 2  Top row: averaged event-related potential of all epochs of subject 1 of the BNCI healthy 1 dataset in response to visual stimuli, with two standard errors of the mean, (a) without and (b) with baseline correction as indicated by the horizontal gray box. Bottom row: corresponding pooled standard deviation of the mean-free ERPs, (c) without and (d) with baseline correction. As expected, the standard deviation is clearly reduced in the baseline interval. However, there is a distortion outside the baseline interval which leads to different standard deviations in the feature intervals (gray vertical boxes).

A common step to reduce the influence of artifacts is to exclude epochs that exceed a min-max criterion, or to reject channels that show abnormal variance. However, we kept all epochs and channels in all datasets, as picking the right criterion for each dataset can lead to subjective results. Using one common criterion for all datasets can also be detrimental, as the datasets recorded using visual paradigms tend to have large amplitudes compared to the auditory ones. Therefore, we consider the ability to cope with artifacts as another challenge for the evaluated classification methods.

A common preprocessing step for LDA-based classifiers is to average the ERP responses in certain time intervals (cf. gray shaded areas in Fig. 1) to reduce the number of feature dimensions. When maximized subject-specific performance is desired, these time intervals could be determined automatically (see e.g. Bashashati et al. 2016). However, for comparability in our benchmark, we evaluated four sets of fixed time intervals (depending on the paradigm of the dataset, see Table 3) for all datasets and subjects for obtaining the features.
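The preprocessing chain just described (zero-phase Butterworth band-pass, downsampling to 100 Hz, epoching, interval-mean features) can be sketched as follows. This is our own simplified illustration on simulated data, not the authors' benchmark code: the raw sampling rate of 1000 Hz, the filter order, and the synthetic signal are placeholder assumptions; the interval boundaries are the five-interval set from Table 3:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 1000.0  # assumed raw sampling rate (placeholder)
# 4th-order Butterworth band-pass; filtfilt applies it forward and backward
bb, ab = butter(4, [0.5, 16.0], btype="bandpass", fs=fs)

rng = np.random.default_rng(1)
eeg = rng.standard_normal((31, int(10 * fs)))   # 31 channels, 10 s of fake EEG
eeg_filt = filtfilt(bb, ab, eeg, axis=1)

# Downsample to 100 Hz (safe here: the band-pass already removed >16 Hz)
step = int(fs // 100)
eeg_100 = eeg_filt[:, ::step]

# Epoch: 0 s to 1 s relative to a stimulus onset (here: onset at sample 200)
onset = 200
epoch = eeg_100[:, onset:onset + 100]           # 31 channels x 100 samples

# Interval means in the five standard windows (Table 3), given in seconds
bounds = [0.10, 0.17, 0.23, 0.30, 0.41, 0.50]
feats = [epoch[:, int(round(lo * 100)):int(round(hi * 100))].mean(axis=1)
         for lo, hi in zip(bounds[:-1], bounds[1:])]
x = np.concatenate(feats)                        # stacked feature vector
print(x.shape)  # (155,) = 31 channels x 5 intervals
```

Each epoch thus yields one 155-dimensional voltage-feature vector, matching the example dimensionality used throughout the text.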
This averaging in time intervals is not necessary for the Riemannian method (cf. "Classification Methods"), as it uses xDAWN preprocessing (Rivet et al. 2009), which extracts the ERP components from the whole epoch (0 s to 1 s) and inherently reduces dimensionality by using a few obtained xDAWN components instead of the whole EEG channel set.

Classification Methods

We employed three major types of classifiers. The first comprises three versions of the linear discriminant analysis, with each version using a different calculation method for the covariance matrix. All three versions directly use the EEG voltage features derived from sensor space. The number of voltage features per channel was treated as a hyperparameter. One way to calculate the covariance matrix for LDA is to estimate one matrix for each class and average these matrices into a common matrix. We call this LDA approach LDA c-covs. The implementation for this method was taken from the scikit-learn toolbox, version v0.21.3 (Pedregosa et al. 2011). Alternatively, one can subtract the class-wise means from the data, then pool the data of both classes and calculate one common covariance matrix from this pooled data (cf. (4) to (6) in "Feature Extraction and Covariance Calculation"). We refer to this approach as LDA p-cov. This was a custom implementation that can be found in our repository. The third version is the newly proposed LDA with a time-decoupled covariance estimation, named LDA imp. p-cov, which is detailed in "Time-decoupled Covariance Matrices".

Table 3  Time interval boundaries used for the temporal sample-wise averaging for the different paradigms

  Visual and tone paradigms
    2 intervals:   {0.10, 0.18, 0.28}
    5 intervals:   {0.10, 0.17, 0.23, 0.30, 0.41, 0.50}
    10 intervals:  {0.10, 0.14, 0.17, 0.20, 0.23, 0.27, 0.30, 0.35, 0.41, 0.45, 0.50}
    40 intervals:  {0.10, 0.11, ..., 0.49, 0.50}

  Word paradigms
    2 intervals:   {0.40, 0.56}, {0.65, 0.91}
    5 intervals:   {0.18, 0.26, 0.40, 0.56, 0.68, 0.91}
    10 intervals:  {0.18, 0.23, 0.29, 0.40, 0.48, 0.56, 0.61, 0.68, 0.75, 0.82, 0.91}
    73 intervals:  {0.18, 0.19, ..., 0.90, 0.91}

All boundaries are given in seconds relative to stimulus onset.

As we are interested in settings with tiny datasets, we often faced the situation that the feature dimensionality exceeds the number of training samples. Therefore, the second major classifier type uses a dimensionality reduction step, performed initially on the voltage features using a linear-kernel PCA (Schölkopf et al. 1997). This results in a smaller number of component features, which were then classified by an LDA c-covs approach. The number of components to use was treated as a hyperparameter. In each cross-validation split, the kPCA components were calculated on the training fold and applied to both training and test folds. For brevity, we refer to this classifier type as kPCA.

The third classifier type represents each epoch as a covariance matrix and performs operations in the space of covariance matrices using Riemannian geometry. First proposed for BCI by Barachant and colleagues for motor imagery data (Barachant et al. 2010), extensions for ERP processing have been proposed. We followed the ERP analysis pipeline of Kolkhorst et al. (2018): xDAWN is used as a spatial filter preprocessing step (Rivet et al. 2009), the feature representation is extended by target (and non-target) templates prior to calculating the covariance matrix per epoch, and classification is performed in a tangent space representation using logistic regression.
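The kPCA classifier type described above can be approximated with off-the-shelf scikit-learn components. The sketch below is our own illustration on synthetic data (the component count, fold count and random features are placeholders, not the benchmark's tuned values); wrapping both steps in one pipeline ensures the kPCA projection is fit on the training folds only, as required:

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
n_epochs, n_features = 90, 155
X = rng.standard_normal((n_epochs, n_features))   # fake voltage features
y = np.r_[np.zeros(45, dtype=int), np.ones(45, dtype=int)]  # fake labels

# Linear-kernel PCA reduces dimensionality before LDA; the LDA here uses
# scikit-learn's built-in (Ledoit-Wolf style) shrinkage for robustness.
clf = make_pipeline(
    KernelPCA(n_components=20, kernel="linear"),
    LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto"),
)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(scores.mean())  # chance-level AUC (around 0.5) on random data
```

With real ERP features instead of random numbers, the same pipeline yields the AUC values reported in the benchmark tables; only the hyperparameter grid around it differs.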
Note that in each cross-validation split, the xDAWN components were determined on the training fold and applied to both training and test folds. For the Riemannian method, we varied the number of xDAWN components between one and six and treated this choice as a hyperparameter. Additionally, we varied whether only the target class, or both target and non-target class templates were used in the covariance representation. This pipeline will be referred to as Riemann.

For all classifier types, hyperparameters were evaluated using values from a predetermined grid. In the case of the LDA types and kPCA, the boundaries for the considered time interval features are given by Table 3. For instance, the boundaries {0.10, 0.18, 0.28} describe two time intervals, the first being [0.10, 0.18) and the second one [0.18, 0.28). All evaluated hyperparameters can be found in Table 4. For the kPCA and the Riemann methods, all possible hyperparameter combinations are evaluated. To avoid overfitting, we report the single parameter set which obtained the best average performance across all datasets and subjects.

Table 4  All evaluated hyperparameters for each type of classifier

  Type     Hyperparameter   Values
  LDA      Time intervals   {2, 5, 10, all}
  kPCA     Time intervals   {2, 5, 10, all}
  kPCA     kPCA comps.      {10, 20, ..., 90, all}
  Riemann  xDAWN comps.     {1, 2, 3, 4, 5, 6}
  Riemann  Template class   {both, target}

A value of 'all' time intervals means that every EEG sample in the ERP interval is taken individually; the exact time points differ between the paradigms (cf. Table 3).

Feature Extraction and Covariance Calculation

This section details the typical process of obtaining amplitude-based features and the LDA weights, as described by Blankertz et al. (2011) for ERP-based BCIs. We describe this procedure in detail, as we build upon parts of it in the next section for our proposed method.

The number of epochs per training dataset, the number of available EEG channels and the number of time intervals (or, when using kPCA preprocessing, the number of kPCA components) varied between datasets. However, for readability we will simplify the notation in this section and the next by providing the formulae for an example dataset with 31 channels, 5 time intervals per channel and 90 training epochs. Note, however, that the method can be applied to any number of channels, epochs or time intervals larger than one.

We use the notation x^{c_i}_{T_j} for the scalar value representing the voltage in the i-th channel c_i during the j-th time interval T_j of one epoch. This yields the stacked feature vector

  x := (x^{c_1}_{T_1}, x^{c_2}_{T_1}, \dots, x^{c_{31}}_{T_1}, x^{c_1}_{T_2}, \dots, x^{c_{31}}_{T_5})^\top,   (1)

which contains the relevant voltage features of a single epoch, with x ∈ R^{155}. Stacking the feature vectors x_i of all 90 available epochs of a single trial, we obtain the data matrix

  X := [x_1, x_2, \dots, x_{90}],   (2)

with X ∈ R^{155×90} and x_i belonging to the i-th epoch. Similarly, the class labels of all 90 epochs are contained in the vector

  y := (y_1, y_2, \dots, y_{90}),  y_i ∈ {0, 1},   (3)

with an entry of 1 indicating a target stimulus and 0 a non-target for the i-th epoch.

Before calculating the covariance matrix Σ, we must make X mean-free. As we have two different classes in our data, target and non-target, we need the class-wise means

  M_i := \begin{cases} \mu_1 & \text{if } y_i = 1 \\ \mu_0 & \text{if } y_i = 0 \end{cases},   (4)

where M_i describes the i-th column of the matrix M, with μ_1 and μ_0 containing the average target / non-target ERP voltages (in these 90 epochs), respectively. Now we can calculate the class-wise mean-free feature matrix

  \bar{X} := X − M,   (5)

and finally obtain the sample covariance matrix

  Σ := \frac{1}{N−1} \bar{X}\bar{X}^\top,   (6)

with Σ ∈ R^{155×155}. Given that in this example we consider only 90 epochs, Σ is rank-deficient and therefore not invertible. In addition, Σ badly approximates the true underlying covariance matrix, due to a systematic bias of overestimating large and underestimating small eigenvalues when too few data points are available (Blankertz et al. 2011).
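Equations (4) to (6) translate directly into NumPy. The following sketch is our own illustration with simulated data (dimensions follow the example in the text; the class split of 15 target / 75 non-target epochs is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
n_channels, n_intervals, n_epochs = 31, 5, 90
D = n_channels * n_intervals                     # 155 feature dimensions

X = rng.standard_normal((D, n_epochs))           # Eq. (2): one column per epoch
y = np.r_[np.ones(15, dtype=int), np.zeros(75, dtype=int)]  # Eq. (3): 1 = target

# Eq. (4): matrix M holds, in each column, the mean of that epoch's class
mu1 = X[:, y == 1].mean(axis=1)
mu0 = X[:, y == 0].mean(axis=1)
M = np.where(y == 1, mu1[:, None], mu0[:, None])

X_bar = X - M                                    # Eq. (5): class-wise mean-free
sigma = X_bar @ X_bar.T / (n_epochs - 1)         # Eq. (6): pooled covariance

print(sigma.shape)  # (155, 155)
```

Because both class means are removed before pooling, the class difference does not leak into the covariance estimate; this is exactly the pooled estimate the LDA p-cov variant uses.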
A practical method to obtain an invertible covariance matrix and counter the abovementioned bias is to regularize the covariance matrix toward the main diagonal:

  \nu := \mathrm{diag}(\Sigma)   (7)
  \hat{\Sigma} := (1 - \gamma)\,\Sigma + \gamma\,\bar{\nu}\,I   (8)

The sample covariance matrix Σ is regularized towards a diagonal matrix whose entries correspond to the average \bar{\nu} of the diagonal values \nu of Σ. The regularization strength γ was obtained using the Ledoit–Wolf lemma (Ledoit and Wolf 2004).

Time-decoupled Covariance Matrices

Our proposed method builds on the general LDA p-cov pipeline from the previous section, but improves the covariance matrix by a better estimation of the spatial noise structure. This is made possible by time-decoupling the noise estimation.

For the purpose of classification using LDA, two common domain-specific assumptions about the noise in ERP data work well in practice (Blankertz et al. 2011). The first (A1) states that the noise on the ERP features is normally distributed and has zero mean, which is reasonable to assume when using a high-pass filter on the measured signal and acknowledging that the EEG background noise is the result of many spatio-temporally overlapping brain sources. The second assumption (A2) is that the noise is unrelated to the current user task (i.e. either attending or ignoring a stimulus) or, going one step further, to whether a stimulus has been played recently or not. On the level of a single epoch, this means that within a single EEG channel the noise should be homoscedastic, i.e. the same for the five extracted voltage features per channel. We saw before that this is approximately fulfilled when no baseline correction is performed on the epochs (cf. Fig. 2). For the most common noise sources, such as technical noise and background EEG activity, this assumption seems reasonable.

In the conventional estimation of the covariance matrix, the channel-wise noise within a time interval is estimated for each time interval individually. However, when there is no difference in the channel-wise noise between the time intervals (A2), it seems reasonable to estimate one common channel-wise covariance matrix that is decoupled from the different time intervals. We propose to obtain a robust estimation of the between-channel covariance matrix in order to enhance the covariance matrix needed for the calculation of the LDA weight vectors and bias. We thereby build on the process described in "Feature Extraction and Covariance Calculation".

The dimensionality of the feature vector x results in a covariance matrix Σ with a particular structure. In order to describe submatrices of Σ, we use the notation Σ_{i:j,n:m}, which indicates the submatrix obtained by taking the i-th up to the j-th row and the n-th up to the m-th column of Σ. For example, using the feature vector definition of Eq. 1, the matrix Σ_{1:31,1:31} describes the covariance between all 31 channels within the first time interval T_1. The covariance between all channels across time intervals T_1 and T_2 is contained in Σ_{1:31,32:62} (and in Σ_{32:62,1:31} the other way around).

If A2 is true, the covariance between channels (given time intervals of the same size) should look similar within each time interval, i.e.,

  \Sigma_{1:31,1:31} \approx \Sigma_{32:62,32:62} \approx \dots \approx \Sigma_{125:155,125:155}.   (9)

These five blocks, which describe the covariance between channels separately for the five time intervals, will be called B_1, B_2, ..., B_5 in the following.

An example of a covariance matrix obtained from ERP data is shown as a heat map in Fig. 3. The within-time-interval blocks on the main diagonal (depicted with a green border) show a similar structure, but vary slightly in their average intensities. The latter is caused by a different number of temporal samples being averaged in each time interval.

Fig. 3  Covariance matrix of the ERP features depicted in the ERP plot in Fig. 1, except that no baseline correction was applied (see Fig. 2). There are five distinct blocks (indicated by green borders), each containing the covariance between EEG channels within one time interval.
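The conventional shrinkage of Eqs. (7) and (8) can be sketched as follows. This is our own illustration on random data; the analytic γ is taken from scikit-learn's Ledoit–Wolf estimator as a convenience (the original derivation is in Ledoit and Wolf 2004, and scikit-learn normalizes the covariance slightly differently, by N instead of N − 1):

```python
import numpy as np
from sklearn.covariance import ledoit_wolf_shrinkage

rng = np.random.default_rng(4)
D, n_epochs = 155, 90
X = rng.standard_normal((n_epochs, D))        # rows = epochs (sklearn layout)

Xc = X - X.mean(axis=0)                       # mean-free features
sigma = Xc.T @ Xc / (n_epochs - 1)            # sample covariance, rank-deficient

gamma = ledoit_wolf_shrinkage(X)              # analytic shrinkage strength
nu_bar = np.trace(sigma) / D                  # average of diag(sigma), cf. Eq. (7)
sigma_hat = (1 - gamma) * sigma + gamma * nu_bar * np.eye(D)   # Eq. (8)

# The regularized matrix is invertible even though n_epochs < D.
print(np.linalg.matrix_rank(sigma_hat))  # 155 (full rank)
```

Since every eigenvalue of the shrunk matrix is at least γ·\bar{\nu} > 0, the LDA weight computation can invert it safely.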
We propose the idea to obtain a robust estimation of the between-channel covariance matrix. If both A1 and A2 are true, the between-channel covariance matrices $B_m$ should (aside from noise) be equal, provided that the number of samples that are averaged in each of the time intervals $T_m$ is identical. In this case, we can calculate the between-channel covariance regardless of a specific time interval in the epoch. Let

$x^C := (x_{c_1}, x_{c_2}, \dots, x_{c_{31}})^T, \qquad (10)$

with $x^C \in \mathbb{R}^{31}$, represent the features of a single time interval only. For our example, we obtain five of these vectors $x^C$ per epoch, one for each of the five different time intervals. Stacking these vectors, the feature matrix can be re-arranged to

$X^C := [x^C_1, x^C_2, \dots, x^C_{450}], \qquad (11)$

with $X^C \in \mathbb{R}^{31 \times (90 \cdot 5)}$. Compared to Eq. 2, we now have a much larger number of samples to estimate this smaller between-channel covariance matrix $\hat{\Sigma}^C$. The calculation of $\hat{\Sigma}^C$ can be performed as described in Eqs. 5 and 6. Empirically we found, however, that shrinkage regularization should be avoided (except when $D > N$), as it negatively affects classification performance. Note that if the widths of the feature time intervals $T_1, T_2, \dots, T_5$ differ, the data should be scaled to a common variance prior to creating $X^C$. This can be accomplished by considering the number of samples averaged in each time interval $T_m$, leading to a scaling factor of $\sqrt{|T_m|}$ for the $m$-th time interval.

After obtaining an estimate for the between-channel covariance matrix $\hat{\Sigma}^C$, we use it to replace the blocks $B_m$ of the whole covariance matrix $\hat{\Sigma}$, however, only after rescaling $\hat{\Sigma}^C$ to match the determinant of $B_m$, i.e.

$\hat{\Sigma}^C_m := \sqrt[31]{\frac{\det B_m}{\det \hat{\Sigma}^C}} \; \hat{\Sigma}^C. \qquad (12)$

This rescaling ensures $\det B_m = \det \hat{\Sigma}^C_m$. Intuitively, the rescaling has the effect that the overall spread of the data distribution described by the covariance matrix $\hat{\Sigma}^C_m$ remains equal to the overall spread of the data distribution described by $B_m$.

Results

We first report the average performance of the tested classification methods on all datasets and then the influence of training dataset size on performance differences between the methods. Finally, subject-wise results are shown for some datasets.

Optimal Hyperparameters and Grand Average Performance

Searching through the hyperparameter space, we found that on average ten time intervals were optimal for all LDA-based approaches. For kPCA preprocessing, 70 components performed best across the datasets. The Riemannian-based classifier obtained the best performance when using only the target class as a template and using five xDAWN components (see Table 5).

The grand average results using these hyperparameters are shown as black markers in Fig. 4. Colored markers indicate the average AUC values across subjects separately for each dataset. The proposed new LDA method (LDA imp. p-cov) using the time-decoupled pooled covariance matrix outperformed the corresponding LDA method (LDA p-cov) with a standard shrinkage-regularized pooled covariance matrix by about 4 AUC percentage points (p = 0.003). The kPCA is supposed to handle large feature dimensionalities rather well. As its performances, however, are very close to those of the LDA p-cov, we assume that the improvement of our proposed novel method is not merely caused by a better handling of high-dimensional data. The AUC improvement of our novel approach is still around 2 percentage points on average (p = 0.036) when comparing it with the runner-up, the Riemannian method.

We observed strong discontinuities in the raw data of the EPFL datasets, expressed by sudden step-wise voltage offsets in the data. This seems to cause serious problems for the LDA-based methods, whereas the Riemannian method copes better with these discontinuities.
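As a rough NumPy sketch of the block replacement in Eqs. 10 to 12, assuming an interval-major feature layout and a plain sample covariance in place of the shrinkage estimators of Eqs. 5 and 6 (function and variable names are our own, not from the paper's code):

```python
import numpy as np

def time_decoupled_covariance(X, n_channels, n_intervals):
    """Sketch of the block replacement in Eqs. 10-12.

    X has shape (n_epochs, n_channels * n_intervals); this sketch assumes
    the features are ordered interval by interval, i.e. the first
    n_channels columns belong to the first time interval.
    """
    n_epochs = X.shape[0]
    Xc = X - X.mean(axis=0)

    # Conventional estimate of the full spatio-temporal covariance matrix
    # (a plain sample covariance here, instead of the shrinkage estimator).
    sigma = Xc.T @ Xc / (n_epochs - 1)

    # Eqs. 10/11: stack the per-interval channel vectors of all epochs,
    # yielding n_epochs * n_intervals samples of dimension n_channels.
    stacked = Xc.reshape(n_epochs, n_intervals, n_channels).reshape(-1, n_channels)
    sigma_c = stacked.T @ stacked / (stacked.shape[0] - 1)

    # Eq. 12: replace each diagonal block B_m by a rescaled sigma_c; the
    # n_channels-th root keeps det(B_m) unchanged, as det(a*A) = a^C det(A).
    sigma_new = sigma.copy()
    for m in range(n_intervals):
        block = slice(m * n_channels, (m + 1) * n_channels)
        B_m = sigma[block, block]
        scale = (np.linalg.det(B_m) / np.linalg.det(sigma_c)) ** (1.0 / n_channels)
        sigma_new[block, block] = scale * sigma_c
    return sigma_new

# Toy usage with 4 channels and 3 time intervals.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 12))
S = time_decoupled_covariance(X, n_channels=4, n_intervals=3)
```

Because every diagonal block is estimated from all intervals jointly, the number of samples behind each block grows by a factor equal to the number of intervals, which is exactly where the robustness for tiny training datasets comes from.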
After the rescaling, we can substitute the blocks $B_m$ with $\hat{\Sigma}^C_m$ and obtain a new covariance matrix $\dot{\Sigma}$, which will be used for the calculation of the classifier weights in linear discriminant analysis:

$w := \dot{\Sigma}^{-1}(\mu_1 - \mu_0) \qquad (13)$

$b := -\tfrac{1}{2}\, w^T (\mu_0 + \mu_1). \qquad (14)$

This new covariance matrix $\dot{\Sigma}$ can be understood as a covariance matrix in which the blocks on the main diagonal, i.e. the between-channel covariance, have been time-decoupled. Hereinafter we refer to the LDA that uses this way of time-decoupling to improve the pooled data covariance matrix as LDA imp. p-cov.

Table 5: Optimal hyperparameters that produced the best performance on average across all datasets for each classification method

Method         | Hyperparameter   | Value
LDA imp. p-cov | Time intervals   | 10
LDA p-cov      | Time intervals   | 10
LDA c-covs     | Time intervals   | 10
kPCA           | Time intervals   | 10
kPCA           | kPCA components  | 70
Riemann        | Template class   | Target
Riemann        | xDAWN components | 5

Applied to the non-EPFL datasets, the Riemann method does not show a clear advantage over the LDA methods (see Fig. 4).

The effect of baseline correction on the classification performance of the different classifiers is provided in Table 6. We found that the performance of the LDA classifiers tends to decrease when using a baseline interval of -0.2 s to 0 s compared to using no baseline correction at all. A possible explanation for this is the effect baseline correction has on the features' standard deviations, as shown in Fig. 2. As applying baseline correction violates assumption A2 (cf. "Time-decoupled Covariance Matrices"), the performance decay of LDA imp. p-cov is largest among all methods. The Riemannian classification method is the only one that benefits marginally (0.003 AUC points) from baseline corrections. Note that we used a high-pass filter with a threshold of 0.5 Hz; in the case of even lower thresholds, the influence of baseline correction may have to be re-evaluated.

Table 6: Comparison of the grand average AUC performances across all datasets and subjects of the evaluated classification methods. This table shows the detrimental effects of baseline correction on LDA classification performance.

Method         | No baseline correction | With baseline correction
LDA imp. p-cov | 0.858                  | 0.818
LDA p-cov      | 0.819                  | 0.811
LDA c-covs     | 0.815                  | 0.808
Riemann        | 0.836                  | 0.839
kPCA           | 0.826                  | 0.815

Fig. 4: Performances for all datasets individually (polygon markers) and averaged ('X' markers), both after averaging across subjects. Black error bars indicate two standard errors of the mean.

Influence of Training Dataset Size

In the top plot in Fig. 5a, the performance difference between the proposed LDA imp. p-cov and the runner-up Riemann method is shown for each subject and dataset. The proposed method outperforms the Riemann method especially when the amount of training data is small, but it stays marginally superior also for most larger datasets. The EPFL datasets deviate from this observation, which could be attributed to the Riemann method's ability to cope well with artifacts, as these datasets contain strong discontinuities in the epoched signals. The reported advantage of Riemannian methods on ERP data (cf. Kolkhorst et al. 2018; Barachant and Congedo 2014) may be more pronounced on larger training datasets. These, however, were rare in our benchmark (median: 300 epochs, inter-quartile range: 204 to 540 epochs).

We observe that the Riemann method performs particularly badly on the relatively large WORD healthy dataset. In this dataset, the informative ERP features tend to have larger latencies than in datasets using less complex stimuli.

Interestingly, kPCA increases the average performance of the LDA with class-wise covariance matrices, although the effect is not very large. Figure 4 reveals that kPCA improves performance greatly for the EPFL datasets compared to LDA c-covs, but for most other datasets its performance decreases slightly. This indicates kPCA's ability to deal well with the discontinuities in the EPFL datasets, whereas the lower performances on the remaining datasets indicate that kPCA's hyperparameters do not generalize well over all datasets.

For three subjects, the performance is more than 5 AUC percentage points worse when using the LDA imp. p-cov method. They belong to the datasets SPELLER MIX, SPELLER LLP and TONE healthy. Closer investigation revealed that for some virtual data subsets of these subjects, the eigenvalues of the covariance matrix were no longer all positive after replacing the diagonal blocks, causing the poor average performance when employed in the LDA.

Figure 5b shows how the LDA imp. p-cov method compares to the regular LDA p-cov method. Here, the same trend with respect to virtual data subset sizes can be observed. Interestingly, our proposed method seems to handle the discontinuities present in the EPFL datasets much better, leading to large performance differences between these two methods. Compared to the Riemann method, we can now see that the performances are nearly equal for the two largest datasets positioned on the right end of the horizontal axis.

Fig. 5: (a) Separately for each subject and dataset, the mean AUC difference between the LDA using the time-decoupled pooled covariance and the Riemannian method. (b) Corresponding AUC differences of the new method in comparison with an LDA using pooled covariance matrices. In this overview, the Riemannian methods use the overall optimal hyperparameters (five xDAWN components, only target class templates) for each dataset, and the LDAs use the overall optimal number of ten time intervals for each dataset. From left to right, the datasets are ordered by the average number of epochs in a virtual data subset (mean and standard deviation are provided in brackets). Colors encode the number of EEG channels in a dataset.
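A minimal sketch of the classifier computation in Eqs. 13 and 14, assuming a two-class problem and a generic covariance estimate (the -1/2 placement of the bias follows the textbook LDA threshold between the class means; all names are illustrative, not the paper's implementation):

```python
import numpy as np

def lda_weights(X0, X1, cov):
    """Compute the LDA projection w and bias b from class-wise feature
    matrices (rows are epochs) and a given covariance estimate."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    w = np.linalg.solve(cov, mu1 - mu0)   # Eq. 13: w = cov^{-1} (mu1 - mu0)
    b = -0.5 * w @ (mu0 + mu1)            # Eq. 14: threshold between the means
    return w, b

# Toy usage: two Gaussian classes in five dimensions.
rng = np.random.default_rng(2)
X0 = rng.normal(0.0, 1.0, size=(300, 5))   # non-target class
X1 = rng.normal(1.0, 1.0, size=(300, 5))   # target class
pooled = np.cov(np.vstack([X0 - X0.mean(0), X1 - X1.mean(0)]).T)
w, b = lda_weights(X0, X1, pooled)

# Scores of the class means: positive for class 1, negative for class 0.
score0 = w @ X0.mean(axis=0) + b
score1 = w @ X1.mean(axis=0) + b
```

The time-decoupling only changes which covariance estimate is handed to this computation; the decision rule itself stays a standard linear classifier.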
The impact of having few training data points depends on multiple factors, such as paradigm, signal-to-noise ratio and dimensionality. To better quantify this impact, we additionally evaluated how the performance difference between our proposed method and the baseline LDA p-cov method develops depending on the amount of training data. For three datasets, we trained both classifiers using an increasing number of training samples per VDS, from 100 up to the largest amount that was available for all subjects in a dataset. In order to obtain standard error estimates, we calculated the cross-validated AUC for each VDS size 20 times on different within-class permutations. As shown in Fig. 6, for few training samples LDA imp. p-cov provides a better average performance for each dataset. As expected, this improvement is reduced when more and more training data is used. Note that for the BNCI patient dataset (Fig. 6b), the mean drops below 0, while the median stays close to 0. This difference is caused by only one subject, who has bad performance using the LDA imp. p-cov with 1000 to 3000 epochs for the cross-validation.

Fig. 6: Impact of increasing training samples on the performance difference between LDA imp. p-cov and LDA p-cov for the datasets (a) BI b, (b) BNCI patient and (c) BNCI healthy 2. Values above 0 indicate a better performance of our proposed LDA imp. p-cov method. With growing dataset sizes, both methods converge to the same performance.

Our proposed method estimates a more reliable version of the between-channel covariance matrix. To determine how the number of channels impacts our proposed method, we evaluated on the TONE healthy dataset (as it offered the largest number of channels) with both an increasing amount of data per VDS as well as with artificially reduced channel subsets. The results in Fig. 7 show that the performance improvement remains relatively stable when using 63 or 31 channels. However, especially when using very small channel sets of only four channels, the performance improvement obtained by our new approach is lower across all training dataset sizes, when compared to using the full set of channels.

Fig. 7: Interaction between the amount of training data and the number of channels. Each curve represents a different number of channels and depicts the performance difference between LDA imp. p-cov and LDA p-cov for the TONE healthy dataset for a varying number of training samples per VDS. For this purpose, the full channel set of 63 channels was reduced to smaller, approximately equidistant sets. Each curve provides the median of 20 permutations. Values above 0 indicate a better performance of our proposed LDA imp. p-cov method.

Subject-wise Results for Selected Datasets

In Fig. 8, absolute AUC performances of each subject are provided for three selected datasets, separately for the five classification approaches. The SPOT dataset on top (a) has the smallest virtual data subsets of 90 epochs each. We can see that our proposed method outperforms all other methods for every individual subject. Additionally, the ranking between subjects is very stable between the five approaches, and specifically between the LDA-based methods.

In Fig. 8b, the SPELLER LLP dataset with 204 epochs per virtual data subset is shown. A few outlier subjects are observed with markedly decreased performances. The left-most dark triangle in the LDA imp. p-cov method corresponds to the data of one of the subjects which shows the numerical issues described previously. However, these numerical issues do not apply to all poorly performing subjects. The second worst subject in the LDA imp. p-cov method (left-most dark square), for example, shows performance gains with the novel method compared to the other methods.

Figure 8c shows the BI b dataset with 480 epochs per virtual data subset. In this dataset, the overall performance of the LDA imp. p-cov method is slightly better than that of the runner-up LDA p-cov. Additionally, the performance of most subjects is very similar between the two methods, and only some subjects show a noticeable performance gain using the time-decoupled matrices. In the EEG data, we did not find any immediate indicator, such as artifacts or heavy noise, that explains why this could be the case.

Discussion

In this work we considered mostly the traditional two-class ERP oddball paradigms. As our proposed method improves the covariance matrix, it could also be applied to multi-class methods that require a covariance matrix estimation, e.g. multi-class LDA, given that the assumptions we made in "Time-decoupled Covariance Matrices" are fulfilled. Additionally, there are other BCI paradigms using different kinds of signals. For error-related potentials (Dal Seno et al. 2010) and slow cortical potentials (Krauledat et al. 2004), our method should be applicable without any additional adaptations, except for choosing the relevant time intervals. However, this needs to be confirmed in future work. The transfer of our proposed method to oscillatory signals, such as steady state evoked potentials and event-related de-/synchronization in motor imagery, is not as straightforward. For these signals, the feature vector usually contains only spatial data from one single time interval. However, if features from multiple time intervals are used, our method should be applicable, given that our assumptions are fulfilled.

In this work, we only evaluated the classification of brain signals. In theory, our approach should be applicable to regression approaches. However, these have to be covariance-based and use a spatio-temporal covariance matrix.
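The rare failure mode reported above, where replacing the diagonal blocks leaves the covariance matrix without strictly positive eigenvalues, can be detected with a simple eigenvalue check. The following safeguard is a hypothetical fallback (diagonal loading), not part of the published method:

```python
import numpy as np

def ensure_positive_definite(sigma, eps=1e-10):
    """Check whether a symmetric covariance matrix is safely invertible for
    LDA and, if not, apply diagonal loading as one possible fallback.
    This safeguard is illustrative and not part of the published method."""
    smallest = np.linalg.eigvalsh(sigma).min()
    if smallest > eps:
        return sigma
    # Shift the spectrum just enough to make all eigenvalues positive.
    return sigma + (eps - smallest) * np.eye(sigma.shape[0])

# Toy usage: an indefinite symmetric matrix gets repaired.
A = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues 3 and -1
A_pd = ensure_positive_definite(A)
```

Running such a check after the block replacement would at least flag the affected virtual data subsets before the ill-conditioned matrix degrades the LDA solution.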
While such approaches exist, they typically use only a spatial covariance matrix (Dähne et al. 2014) or use features in different frequency bands (Fatemi and Daliri 2020), which can violate the assumption of feature homoscedasticity over time.

For three subjects who were identified as negative performance outliers, we found that numerical instability can be caused by the proposed diagonal block replacement. Unfortunately, so far we found no indicators in the EEG signals, e.g. artifacts or heavy noise, which could predict if this rare problem will occur. In future work, we aim to determine the cause of the numerical instability from the data and to investigate how established approaches, e.g. regularization of the covariance matrix, can be implemented to obtain a well-conditioned matrix after the replacement operation.

Note that so far we only inspected the impact of the time-decoupling of the covariance matrix on LDA performance. While we can observe improved performances, it still is unclear whether this new covariance matrix is closer to the true underlying covariance matrix. In future work, we plan to run simulation studies to evaluate which matrix estimation technique, i.e. time-decoupling, shrinkage or the sample covariance matrix, is closer to the actual data-generating covariance matrix.

We observed a performance benefit not only in these small datasets, but in most tested datasets of varying dimensionality and SNR level. Thus we are optimistic that the applicability of our proposed novel method is not restricted to the domain of BCI. Instead, it could be valuable to apply it also to other data. Generally, any data which also has a spatio-temporal structure, and in which the spatial noise can be assumed to be constant with respect to time, could profit from the proposed approach. Specific candidates are MEG and multi-electrode EMG recordings. Another possible application could be spatially distributed sensor networks that make use of identically constructed sensors.

We also found that using a linear kernel principal component analysis does not improve the performance of an LDA classifier for most datasets. This indicates that large dimensionality is not a primary issue in these datasets.

In ERP paradigms, the concrete ratio of target and non-target stimuli within chronological virtual data subsets depends on the used stimulus sequence. As in this work we focused on method development, we chose to use a stratified cross-validation scheme rather than a chronological cross-validation scheme, which typically is preferred in the BCI domain. While the latter would have been closer to the final application, it could not guarantee that all folds have the same class ratios. This would have been a disadvantage for the comparison of methods, as it would introduce another challenge into the benchmark, i.e. how well a method handles differing class ratios in training and validation. In fact, for datasets recorded in our lab in which the generated stimulus sequence guarantees stratified folds, we observed that the relative performance differences between the methods were nearly identical (data not shown), but the overall performance across methods dropped by up to 5 AUC percentage points, depending on the dataset.

Our proposed novel time-decoupling of the covariance has merit when using it to enhance the feature covariance matrix of an LDA classifier. We think our approach improves the usability of ERP-based BCIs due to several arguments, for which we found clear evidence in our extensive evaluation on multiple datasets.

First, our approach offers a simple, yet effective, way to improve the classification performance for very small datasets, a problem identified by multiple authors in the field of BCI, who emphasized the need for decoding algorithms to be able to handle training with few data points.

Second, the new method allows to shorten the required calibration time while still keeping the same classification performance, a quality that improves usability specifically for patient studies.

Third, our approach yields classification performances well above chance level even when using tiny (72 epochs) amounts of training data. This characteristic can make experimental parameter optimization feasible, as the classification performance can be estimated reliably even on very short EEG recordings.

Fourth, with increasing training data, the classification performance of our method converges to the performance of the regular LDA. Therefore, there is no harm done in using our approach even when abundant training data is available, or when it is unclear whether the size of the training dataset is in the right range for profiting from the time-decoupled covariances.

Due to the aforementioned arguments, we would recommend BCI practitioners to use the proposed time-decoupled covariances for LDA as a first-shot method, as it shows clear benefits over an ordinary shrinkage-regularized LDA for most ERP-classification scenarios.

Fig. 8: Results for the (a) SPOT, (b) SPELLER LLP and (c) BI b datasets. Subjects can be matched by marker type and color brightness. Within one classification method strip, subjects are ordered from left to right with respect to their individual scores obtained by the LDA imp. p-cov method. The cross marker indicates the mean AUC, with black bars marking two standard errors of the mean.

Conclusion

Using domain knowledge and exploiting the specific structure of the feature vectors in ERP classification paradigms, we propose a new way to estimate a covariance matrix that outperforms a shrinkage-based covariance matrix, especially on small datasets. Our results could enable fast-adapting BCIs that require short calibration times. A possible application for our method is the tuning of stimulation parameters to an individual subject. Here, long recordings are not feasible, and the information content of short recordings should be maximized to determine the optimal parameters.

Information Sharing Statement

Results and most figures for the public datasets we used can be reproduced using the code available at https://github.com/jsosulski/time-decoupled-lda. The detailed instructions make it easy to obtain the results, especially when using the same system we used, i.e. Ubuntu 18.04 and Python 3.6.9. The proposed improved classifier makes use of the widely used sklearn API and can be used as a drop-in replacement for other sklearn classifiers. The classifier is also available in the aforementioned repository. Public datasets used in this work are automatically downloaded using the provided code.

Acknowledgements The authors are thankful for the discussion with Klaus-Robert Müller about this work.

Author Contributions Study design: JS, MT. Literature review: JS, JPK, MT. Implementation: JS, JPK. Data analysis: JS, JPK, MT. Wrote the manuscript: JS, MT. Reviewed the manuscript: JS, JPK, MT.

Funding Open Access funding enabled and organized by Projekt DEAL. This work was (partly) supported by the BrainLinks-BrainTools Cluster of Excellence funded by the German Research Foundation (DFG, grant number EXC 1086) and the project SuitAble (DFG, grant number 387670982). The authors would also like to acknowledge support by the state of Baden-Württemberg, Germany, through bwHPC and the German Research Foundation (DFG, INST 39/963-1 FUGG).

Data Availability All datasets used in this work are listed in Table 1. We used both publicly available data and restricted-access data that was recorded in our lab; the subjects' permission we collected did not include publishing raw EEG data. (See information sharing statement.)

Code Availability The code that was used to produce the benchmark (on the public datasets) can be found in the above-mentioned repository. The classifier implementation is also available there. As it follows the popular sklearn API, it can be used as a drop-in replacement in many Python pipelines. (See information sharing statement.)

Compliance with Ethical Standards

Conflict of interests The authors declare they have no conflicts of interest.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Allison, B.Z., & Pineda, J.A. (2006). Effects of SOA and flash pattern manipulations on ERPs, performance, and preference: implications for a BCI system. International Journal of Psychophysiology, 59(2), 127–140.

Aricò, P., Aloise, F., Schettini, F., Salinari, S., Mattia, D., Cincotti, F. (2014). Influence of P300 latency jitter on event related potential-based brain–computer interface performance. Journal of Neural Engineering, 11(3), 035008.

Barachant, A., & Congedo, M. (2014). A plug&play P300 BCI using information geometry. arXiv:1409.0107.

Barachant, A., Bonnet, S., Congedo, M., Jutten, C. (2010). Riemannian geometry applied to BCI classification. In International Conference on Latent Variable Analysis and Signal Separation (pp. 629–636): Springer.

Bashashati, H., Ward, R.K., Bashashati, A. (2016). User-customized brain computer interfaces using Bayesian optimization. Journal of Neural Engineering, 13(2), 026001.

Bishop, C.M. (2006). Linear models for classification. In Pattern Recognition and Machine Learning, chap. 4 (pp. 179–220): Springer.

Blankertz, B., Lemm, S., Treder, M., Haufe, S., Müller, K.R. (2011). Single-trial analysis and classification of ERP components—a tutorial. NeuroImage, 56(2), 814–825.

Dähne, S., Meinecke, F.C., Haufe, S., Höhne, J., Tangermann, M., Müller, K.R., Nikulin, V.V. (2014). SPoC: a novel framework for relating the amplitude of neuronal oscillations to behaviorally relevant parameters. NeuroImage, 86, 111–122.

Dal Seno, B., Matteucci, M., Mainardi, L. (2010). Online detection of P300 and error potentials in a BCI speller. Computational Intelligence and Neuroscience, 2010.

Farquhar, J., & Hill, N.J. (2013). Interactions between pre-processing and classification methods for event-related-potential classification. Neuroinformatics, 11(2), 175–192.

Fatemi, M., & Daliri, M.R. (2020). Nonlinear sparse partial least squares: an investigation of the effect of nonlinearity and sparsity on the decoding of intracranial data. Journal of Neural Engineering, 17(1), 016055.

Feess, D., Krell, M.M., Metzen, J.H. (2013). Comparison of sensor selection mechanisms for an ERP-based brain-computer interface. PloS One, 8(7), e67543.

Foodeh, R., Khorasani, A., Shalchyan, V., Daliri, M.R. (2016). Minimum noise estimate filter: a novel automated artifacts removal method for field potentials. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 25(8), 1143–1152.

Guger, C., Daban, S., Sellers, E., Holzner, C., Krausz, G., Carabalona, R., Gramatica, F., Edlinger, G. (2009). How many people are able to control a P300-based brain–computer interface (BCI)? Neuroscience Letters, 462(1), 94–98.

Hoffmann, U., Vesin, J.M., Ebrahimi, T., Diserens, K. (2008). An efficient P300-based brain–computer interface for disabled subjects. Journal of Neuroscience Methods, 167(1), 115–125.

Höhne, J., & Tangermann, M. (2012). How stimulation speed affects event-related potentials and BCI performance. In 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (pp. 1802–1805). https://doi.org/10.1109/EMBC.2012.6346300.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 65–70.

Hübner, D., Verhoeven, T., Schmid, K., Müller, K.R., Tangermann, M., Kindermans, P.J. (2017). Learning from label proportions in brain-computer interfaces: online unsupervised learning with guarantees. PloS One, 12(4).

Hübner, D., Verhoeven, T., Müller, K.R., Kindermans, P.J., Tangermann, M. (2018). Unsupervised learning for brain-computer interfaces based on event-related potentials: review and online comparison. IEEE Computational Intelligence Magazine, 13(2), 66–77.

Jayaram, V., & Barachant, A. (2018). MOABB: trustworthy algorithm benchmarking for BCIs. Journal of Neural Engineering, 15(6), 066011.

Jayaram, V., Alamgir, M., Altun, Y., Schölkopf, B., Grosse-Wentrup, M. (2016). Transfer learning in brain-computer interfaces. IEEE Computational Intelligence Magazine, 11(1), 20–31.

Kolkhorst, H., Tangermann, M., Burgard, W. (2018). Guess what I attend: interface-free object selection using brain signals. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 7111–7116): IEEE.

Krauledat, M., Dornhege, G., Blankertz, B., Losch, F., Curio, G., Müller, K.R. (2004). Improving speed and accuracy of brain-computer interfaces using readiness potential features. In The 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (Vol. 2, pp. 4511–4515): IEEE.

Lal, T.N., Schröder, M., Hinterberger, T., Weston, J., Bogdan, M., Birbaumer, N., Schölkopf, B. (2004). Support vector channel selection in BCI. IEEE Transactions on Biomedical Engineering, 51(6), 1003–1010.

Ledoit, O., & Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2), 365–411.

Lotte, F., Congedo, M., Lécuyer, A., Lamarche, F., Arnaldi, B. (2007). A review of classification algorithms for EEG-based brain–computer interfaces. Journal of Neural Engineering, 4(2), R1.

Lotte, F., Bougrain, L., Cichocki, A., Clerc, M., Congedo, M., Rakotomamonjy, A., Yger, F. (2018). A review of classification algorithms for EEG-based brain–computer interfaces: a 10 year update. Journal of Neural Engineering, 15(3), 031005.

Musso, M., Bambadian, A., Denzer, S., Umarova, R., Hübner, D., Tangermann, M. (2016). A novel BCI based rehabilitation approach for aphasia rehabilitation. In Proceedings of the 6th International Brain-Computer Interface Meeting (p. 104).

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Riccio, A., Simione, L., Schettini, F., Pizzimenti, A., Inghilleri, M., Olivetti Belardinelli, M., Mattia, D., Cincotti, F. (2013). Attention and P300-based BCI performance in people with amyotrophic lateral sclerosis. Frontiers in Human Neuroscience, 7, 732.

Rivet, B., Souloumiac, A., Attina, V., Gibert, G. (2009). xDAWN algorithm to enhance evoked potentials: application to brain–computer interface. IEEE Transactions on Biomedical Engineering, 56(8), 2035–2043.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.

Rutkowski, T.M., & Mori, H. (2015). Tactile and bone-conduction auditory brain computer interface for vision and hearing impaired users. Journal of Neuroscience Methods, 244, 45–51.

Sannelli, C., Dickhaus, T., Halder, S., Hammer, E.M., Müller, K.R., Blankertz, B. (2010). On optimal channel configurations for SMR-based brain–computer interfaces. Brain Topography, 23(2), 186–193.

Schölkopf, B., Smola, A., Müller, K.R. (1997). Kernel principal component analysis. In International Conference on Artificial Neural Networks (pp. 583–588): Springer.

Schreuder, M., Blankertz, B., Tangermann, M. (2010). A new auditory multi-class brain-computer interface paradigm: spatial hearing as an informative cue. PLOS One, 5(4), 1–14. https://doi.org/10.1371/journal.pone.0009813.

Sellers, E.W., Arbel, Y., Donchin, E. (2012). BCIs that use P300 event-related potentials. In Brain-Computer Interfaces: Principles and Practice (p. 215): Oxford University Press.

Sosulski, J., & Tangermann, M. (2019). Spatial filters for auditory evoked potentials transfer between different experimental conditions. In Proceedings of the 8th Graz Brain-Computer Interface Conference (pp. 273–278).

Srinivasan, R. (2012). Acquiring brain signals from outside the brain. In Brain-Computer Interfaces: Principles and Practice, chap. 6 (pp. 105–122).

Sugi, M., Hagimoto, Y., Nambu, I., Gonzalez, A., Takei, Y., Yano, S., Hokari, H., Wada, Y. (2018). Improving the performance of an auditory brain-computer interface using virtual sound sources by shortening stimulus onset asynchrony. Frontiers in Neuroscience, 12, 108.

Van Veen, G., Barachant, A., Andreev, A., Cattan, G., Rodrigues, P.C., Congedo, M. (2019). Building Brain Invaders: EEG data of an experimental validation. arXiv:1905.05182.

Wilcoxon, F., Katti, S., Wilcox, R.A. (1970). Critical values and probability levels for the Wilcoxon rank sum test and the Wilcoxon signed rank test. Selected Tables in Mathematical Statistics, 1, 171–259.

Winkler, I., Brandl, S., Horn, F., Waldburger, E., Allefeld, C., Tangermann, M. (2014). Robust artifactual independent component classification for BCI practitioners. Journal of Neural Engineering, 11(3), 035013.

Wolpaw, J.R., Birbaumer, N., McFarland, D.J., Pfurtscheller, G., Vaughan, T.M. (2002). Brain–computer interfaces for communication and control. Clinical Neurophysiology, 113(6), 767–791. https://doi.org/10.1016/S1388-2457(02)00057-3.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Affiliations

Jan Sosulski (1) · Jan-Philipp Kemmer (2) · Michael Tangermann (1,3,4)

Jan Sosulski: jan.sosulski@blbt.uni-freiburg.de
Jan-Philipp Kemmer: jan-philipp.kemmer@venus.uni-freiburg.de

1 Brain State Decoding Lab, Cluster of Excellence BrainLinks-BrainTools, Department of Computer Science, University of Freiburg, Freiburg, Germany
2 University of Freiburg, Freiburg, Germany
3 Autonomous Intelligent Systems Lab, Department of Computer Science, University of Freiburg, Freiburg, Germany
4 Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands

Improving Covariance Matrices Derived from Tiny Training Datasets for the Classification of Event-Related Potentials with Linear Discriminant Analysis

Publisher: Springer Journals
Copyright: © The Author(s) 2020
ISSN: 1539-2791
eISSN: 1559-0089
DOI: 10.1007/s12021-020-09501-8

Abstract

Electroencephalogram data used in the domain of brain–computer interfaces typically has a subpar signal-to-noise ratio, and data acquisition is expensive. An effective and commonly used classifier to discriminate event-related potentials is the linear discriminant analysis which, however, requires an estimate of the feature distribution. While this information is provided by the feature covariance matrix, its large number of free parameters calls for regularization approaches like Ledoit–Wolf shrinkage. Assuming that the noise of event-related potential recordings is not time-locked, we propose to decouple the time component from the covariance matrix of event-related potential data in order to further improve the estimates of the covariance matrix for linear discriminant analysis. We compare three regularized variants thereof and a feature representation based on Riemannian geometry against our proposed novel linear discriminant analysis with time-decoupled covariance estimates. Extensive evaluations on 14 electroencephalogram datasets reveal that the novel approach increases the classification performance by up to four percentage points for small training datasets, and gracefully converges to the performance of standard shrinkage-regularized LDA for large training datasets. Given these results, practitioners in this field should consider using our proposed time-decoupled covariance estimation when they apply linear discriminant analysis to classify event-related potentials, especially when few training data points are available.

Keywords: Event-related potentials · Robust classification · Learning from small datasets · Noise transfer learning · Brain–computer interface · Covariance matrix enhancement

(This work was partly supported by the BrainLinks-BrainTools Cluster of Excellence funded by the German Research Foundation (DFG, grant number EXC 1086) and the project SuitAble (DFG, grant number 387670982). The authors would also like to acknowledge support by the state of Baden-Württemberg, Germany, through bwHPC and the German Research Foundation (DFG, INST 39/963-1 FUGG). Corresponding author: Michael Tangermann, michael.tangermann@donders.ru.nl. Extended author information is available on the last page of the article.)

Introduction

A brain–computer interface (BCI) allows a subject to control a computer program using his or her brain signals, which are often recorded via the electroencephalogram (EEG), as it is non-invasive, requires relatively inexpensive equipment and could be used by a large part of the population (Wolpaw et al. 2002). Unfortunately, the signal-to-noise ratio of the signals recorded by the EEG electrodes on the scalp is bad, as many factors—e.g. volume conduction of the brain or the long distance between sensor and brain tissue—impede the recording quality (Srinivasan 2012). To realize control via BCIs, machine learning techniques are key to decode the brain signals in real-time. In addition to the bad signal-to-noise ratio, the machine learning problem is aggravated by the oftentimes small amount of training data available in BCI experiments.

Existing approaches to deal with small datasets, such as transfer learning between subjects or sessions, have limited success if brain signals differ greatly between subjects and even between sessions of the same subject (Jayaram et al. 2016). Also, many BCI paradigms work optimally only if their experimental parameters are tuned to each subject individually (Höhne and Tangermann 2012; Sugi et al. 2018; Allison and Pineda 2006). As many different experimental parameters need to be tested to find the optimal ones, the possibility to work with very small datasets would be a great benefit here.

The mentioned challenges explain the popularity of relatively simple classifiers in the BCI domain, which can make efficient use of the training data (Blankertz et al. 2011). In contrast, in domains such as image recognition, the massive amounts of data available enable the employment of more sophisticated methods such as neural networks (Russakovsky et al. 2015).

In this work, we are concerned with the classification of event-related potentials (ERPs) recorded using EEG data. These ERPs can be evoked by presenting visual, auditory or haptic stimuli to a subject (Sellers et al. 2012; Schreuder et al. 2010; Rutkowski and Mori 2015). Time-locked to the stimulus presentation, the ERP can be measured in the EEG signal. Due to the small ERP amplitude and the high amplitude of the EEG's background activity, visualizations like in Fig. 1 require repeated stimulus presentations and averaging of the resulting ERP epochs. In this figure, two ERPs are shown, one for a specific stimulus the subject has attended (the so-called target ERP), and another one for a single or even multiple other stimuli the subject has ignored (non-target ERP). The difference between target and non-target ERP voltages is the basis for classifying which stimulus a subject attends to in real-time. For a productive use of a BCI, however, it is infeasible to average such a large number of epochs before a classification output can be obtained. Therefore, machine learning is used to make classification possible on short recordings.

Many BCI systems make use of a linear discriminant analysis (LDA) (see e.g. Bishop 2006) to classify whether a stimulus was attended or not. The LDA makes use of ERP voltage features and the corresponding covariance (Blankertz et al. 2011). In the domain of ERPs, the shrinkage-regularized LDA still belongs to the state-of-the-art methods (Lotte et al. 2007; Lotte et al. 2018). For ideal ERP data, the assumptions of LDA would even be fulfilled, making LDA the optimal classification approach. However, in practice there are non-stationarities, outliers and artifacts which violate the LDA's assumptions. Recently, Riemannian methods have found their way into BCI. For ERP classification they show promising performance gains on some datasets (Barachant et al. 2010; Barachant and Congedo 2014).

When using LDA to classify ERP signals, most formulations require an estimate of the covariance matrix. This matrix has 0.5 · (D + 1) · D free parameters, with D being the dimensionality of the feature vector. Using an EEG cap with 31 channels and only five time intervals per channel to derive voltage features (as indicated in Fig. 1), the feature dimensionality of D = 155 results in 12090 free parameters of the covariance matrix, which need to be estimated during the LDA training. Usually, the amount of data points (epochs) in BCI problems is rather small, leading to sub-optimal estimates of the covariance matrix. If the number of data points happens to be smaller than D, the covariance matrix does not even have full rank and cannot be inverted.

In one of the example datasets in Blankertz et al. (2011), they found this to be especially true if "the number of training samples [is] low (750) compared to the dimensionality of the features (385)". For comparison, in our benchmark the smallest training dataset has 72 training samples while the features have 310 dimensions. Farquhar and Hill (2013) show that classification performance increases with more training data, as they expected. However, they also note that "minimizing the number of training samples required to achieve acceptable performance is critically important to practical BCI performance". Lotte et al. (2018) suggest that typical BCI systems could be trained with as few as 20-100 trials per class.
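The quadratic growth of free covariance parameters described above is easy to verify; a small sketch (the function name is ours, the 31-channel/five-interval numbers are the example from the text):

```python
def covariance_free_parameters(n_channels: int, n_time_intervals: int) -> int:
    """Free parameters of a symmetric D x D covariance matrix:
    D diagonal entries plus D * (D - 1) / 2 off-diagonal entries,
    i.e. 0.5 * (D + 1) * D."""
    d = n_channels * n_time_intervals  # feature dimensionality D
    return (d + 1) * d // 2

# The example from the text: 31 channels x 5 time intervals -> D = 155.
print(covariance_free_parameters(31, 5))  # -> 12090
```

With ten intervals (D = 310, the benchmark's largest feature dimensionality mentioned above), the count already exceeds 48000 parameters.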
In our benchmark, in the smallest dataset we use 12 target and 60 non-target training samples to train the classifiers. All authors recommend shrinking the covariance matrix used in LDA towards the (scaled) unit sphere, especially when very few training data are available. The required amount of regularization can be determined analytically as proposed by Ledoit and Wolf (2004) or by using cross-validation.

There are alternative approaches for dealing with few training data. A first step is usually to perform strict data cleaning, such that the quality of the data is improved, for example using spatial filtering approaches (Winkler et al. 2014; Foodeh et al. 2016). Aside from regularization to deal with the dimensionality issue, another straightforward way is to reduce dimensionality altogether, e.g. by selecting channel subsets (see e.g. Lal et al. 2004; Sannelli et al. 2010; Feess et al. 2013) or by using preprocessing methods that find a lower-dimensional representation, e.g. xDAWN (Rivet et al. 2009) or kernel principal component analysis (kPCA) (Schölkopf et al. 1997). Other approaches employ transfer learning, which re-uses data from previous sessions or even different subjects to improve the classification performance (Jayaram et al. 2016). However, in this work we focus on developing a method that can be applied without sophisticated preprocessing or additional data. This both increases the employability of our method and facilitates its application to a large number of different ERP datasets.

In this work, we aim at improving the regularized covariance matrix even further by making use of the observation that the noise in ERP data can mostly be attributed to task-unrelated background brain activity (Blankertz et al. 2011), which therefore is not time-locked to the stimulus. To show the efficacy of our method, named time-decoupled covariance estimation, we carefully evaluate its performance on datasets recorded by our lab as well as on public ERP datasets, of which most are available in MOABB (Mother of All BCI Benchmarks) (Jayaram and Barachant 2018).

Fig. 1 Example of the mean ERP responses obtained from a single subject during an auditory oddball paradigm with a stimulus onset asynchrony (SOA) of 193 ms. For this plot, 300 target and 1500 non-target epochs were averaged. Prior to averaging, each epoch had been corrected for baseline shifts relative to the interval [-0.2, 0.0] seconds. Five gray shaded areas between 0.1 and 0.5 seconds post stimulus onset indicate which time intervals are typically used to derive features for classification.

Methods

We first describe our benchmarking approach. This includes the datasets we used, the general classification and validation procedure, as well as the preprocessing of the EEG data. Afterwards we present the classification methods we use for comparisons. Finally, in "Time-decoupled Covariance Matrices" we detail our proposed new method with time-decoupled covariance estimation.

Benchmark

To compare competing classification approaches, we evaluated their obtained performances on fourteen datasets, which have been derived from twelve ERP data sources (see Table 1) using MOABB (Mother of All BCI Benchmarks) (Jayaram and Barachant 2018). We used all ERP datasets available in MOABB (at the time of writing) and added complementing datasets from our lab, i.e. two additional visual speller datasets and ERP data from auditory paradigms with tone and word stimuli.

For analysis purposes we have logically split two of the data sources into two datasets each: the original EPFL dataset (EPFL) was split into data obtained from healthy subjects and patients, while the Brain Invaders dataset (BI) was split into subjects with one session and subjects with eight sessions. Note that for the TONE patient and the WORD patient datasets, a publication is still pending; however, the paradigms that were used are a tone oddball and a word oddball, similar to the paradigm described in Musso et al. (2016).
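As a brief illustration of the analytic shrinkage recalled above (Ledoit and Wolf 2004), the estimator is available off the shelf in scikit-learn, the toolbox used elsewhere in this benchmark. A minimal sketch on synthetic data with the benchmark's worst-case shape (72 samples, 310 features); the data here is random and purely illustrative:

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)
X = rng.standard_normal((72, 310))  # fewer samples than feature dimensions

# The plain sample covariance is rank-deficient here (rank <= 71 < 310)
# and hence not invertible ...
sample_cov = np.cov(X, rowvar=False)

# ... while the Ledoit-Wolf estimate shrinks toward a scaled identity
# with an analytically determined strength gamma, and is invertible.
lw = LedoitWolf().fit(X)
print(lw.shrinkage_)                          # gamma, a value in (0, 1]
print(np.linalg.matrix_rank(lw.covariance_))  # 310 (full rank)
```

Because a positive fraction of a scaled identity is mixed in, the shrunk estimate is positive definite regardless of how few epochs are available.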
Some datasets share subjects: the subjects who took part in the word paradigm in WORD healthy also took part in the tone oddball paradigm in TONE healthy. The same is true for WORD patient and TONE patient, except that in this case four additional subjects are contained in the latter dataset. For in-depth explanations of the datasets please refer to the corresponding references in Table 1. In total we evaluate on ERP data of 131 subjects (390 sessions in total). Some datasets contained multiple sessions of each subject. We decided to condense them into a single AUC value per subject and classification approach. For this reason, AUC values obtained over multiple sessions of a subject were averaged before reporting.

Table 1 Overview of the ERP datasets evaluated and their characteristics

| Dataset | Subjects | Sessions | Channels | Paradigm | Access | Reference |
|---|---|---|---|---|---|---|
| EPFL healthy* | 4 | 4 | 32 | visual, 6-choices | public | Hoffmann et al. (2008) |
| EPFL patient* | 4 | 4 | 32 | visual, 6-choices | public | Hoffmann et al. (2008) |
| BNCI healthy 1 | 10 | 1 | 16 | visual, speller | public | Aricò et al. (2014) |
| BNCI healthy 2 | 10 | 1 | 8 | visual, speller | public | Guger et al. (2009) |
| BNCI patient | 8 | 1 | 8 | visual, speller | public | Riccio et al. (2013) |
| BI a | 7 | 8 | 16 | visual, speller-like | public | Van Veen et al. (2019) |
| BI b | 17 | 1 | 16 | visual, speller-like | public | Van Veen et al. (2019) |
| SPOT | 13 | 1 | 31 | auditory, tones, oddball | public | Sosulski and Tangermann (2019) |
| TONE healthy | 20 | 1 | 63 | auditory, tones, oddball | closed | Musso et al. (2016) |
| WORD healthy | 20 | 1 | 63 | auditory, words, 6-choices | closed | Musso et al. (2016) |
| TONE patient | 14 | 11–25 | 31 | auditory, tones, oddball | closed | pending |
| WORD patient | 10 | 11–25 | 31 | auditory, words, 6-choice | closed | pending |
| SPELLER LLP | 12 | 1 | 31 | visual speller | closed | Hübner et al. (2017) |
| SPELLER MIX | 12 | 1 | 31 | visual speller | closed | Hübner et al. (2018) |

Datasets listed as closed access have been recorded in our own lab but cannot be fully published, as subjects' consent had not been obtained for this purpose. Datasets marked with an asterisk and a plus sign each have been derived by splitting a larger data source (see main text).

We are primarily interested in the classification performance on very small datasets. However, some of the datasets available to us consist of many epochs per session of each subject. To investigate training concepts for small datasets, we have split most datasets into virtual data subsets. These splits were done at logical points in the paradigm. For example, in the SPOT dataset, where a subject performed 60 to 80 auditory oddball runs with 90 stimuli each and an approximately eight second pause between runs, we could split this data up into virtual, non-overlapping subsets consisting of only 90 epochs, i.e. using each oddball run individually for a cross-validation loop. For this dataset, we therefore obtained between 60 and 80 virtual data subsets. For each novel subset, the classification performance was estimated in an individual cross-validation loop. In order to obtain a single classification performance value per session of a subject, we averaged the performances obtained from the virtual data subsets. Overall, the virtual data subsets have sizes between 90 and 4200 epochs.

Evaluation Procedures

We used stratified 5-fold cross-validation within each virtual data subset in order to derive the classification performance, expressed by the area under the receiver operating characteristic curve (AUC). Table 2 indicates for each dataset how many epochs per session of each subject were available in total, and additionally, how many epochs were used within a virtual data subset.

Statistical significance (at α = 0.05) between the classification methods was determined using a paired Wilcoxon signed rank test (Wilcoxon et al. 1970) on the differences between the average performance of each dataset, i.e. 14 values. We compared our proposed method with its underlying base classification, and our proposed method with the best other classification method. Correction for multiple testing was done using the Holm–Bonferroni method (Holm 1979).

Using the code supplied in our repository (https://github.com/jsosulski/time-decoupled-lda), the benchmark results can be reproduced.

Table 2 Average number of virtual data subsets (VDS) available for a session of a subject in each dataset; the second column indicates the number of epochs within the virtual data subsets

| Dataset | # of VDS | # of Epochs per VDS |
|---|---|---|
| EPFL healthy* | 1 | 832 |
| EPFL patient* | 1 | 833 |
| BNCI healthy 1 | 18 | 96 |
| BNCI healthy 2 | 1 | 2520 |
| BNCI patient | 1 | 4200 |
| BI a | 1 | 480 |
| BI b | 1 | 480 |
| SPOT | 61 | 90 |
| TONE healthy | 2 | 300 |
| WORD healthy | 18 | 540 |
| TONE patient | 2 | 300 |
| WORD patient | 54 | 152 |
| SPELLER LLP | 54 | 204 |
| SPELLER MIX | 60 | 204 |

Data Preprocessing

Before obtaining the actual features which can be used by the classifiers, all EEG datasets are preprocessed using a forward and a backward pass of a Butterworth bandpass filter in the range of 0.5 Hz to 16 Hz. Afterwards, the data was downsampled to 100 Hz and windowed to 0 s to 1 s relative to each stimulus onset to represent the corresponding data epoch.

The common preprocessing step of baseline correction for ERP analysis causes the standard deviation in each channel to not be equal between time intervals (see Fig. 2), i.e. it causes heteroscedasticity. As our proposed method assumes homoscedasticity, we run the benchmark both with and without baseline correction to determine the impact of this.

A common step to reduce the influence of artifacts is to exclude epochs that exceed a min-max criterion or to reject channels that show abnormal variance. However, we kept all epochs and channels in all datasets, as picking the right criterion for each dataset can lead to subjective results. Using one common criterion for all datasets can also be detrimental, as the datasets recorded using visual paradigms tend to have large amplitudes compared to the auditory ones. Therefore, we consider the ability to cope with artifacts as another challenge for the evaluated classification methods.

A common preprocessing step for LDA-based classifiers is to average the ERP responses in certain time intervals (cf. gray shaded areas in Fig. 1) to reduce the number of feature dimensions. When subject-specific maximized performance is desired, these time intervals could be determined automatically (see e.g. Bashashati et al. 2016).
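The filtering and epoching steps described at the start of this subsection might look as follows in code (a sketch, not the benchmark implementation; the filter order of 4 is our assumption, as the text only specifies the pass band and the forward-backward application):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(eeg, fs, stimulus_onsets_s, fs_new=100):
    """Band-pass filter (0.5-16 Hz, zero phase via a forward and a
    backward pass), downsample to `fs_new` Hz and cut epochs of 0-1 s
    relative to each stimulus onset. `eeg`: (n_channels, n_samples)."""
    b, a = butter(N=4, Wn=[0.5, 16.0], btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, eeg, axis=-1)  # forward + backward pass
    step = int(fs // fs_new)
    downsampled = filtered[:, ::step]  # plain decimation; the 16 Hz cutoff
                                       # already acts as the anti-alias filter
    return np.stack([
        downsampled[:, int(t * fs_new): int(t * fs_new) + fs_new]
        for t in stimulus_onsets_s
    ])  # shape: (n_epochs, n_channels, fs_new)
```

Since the new Nyquist frequency (50 Hz) lies well above the 16 Hz cutoff, simply dropping samples after the band-pass is unproblematic here.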
comprises three versions of the linear discriminant analysis, This averaging in time intervals is not necessary for with each version using a different calculation method the Riemannian method (cf. “Classification Methods”), as for the covariance matrix. All three versions directly use it uses xDAWN-preprocessing (Rivet et al. 2009), which the EEG voltage features derived from sensor space. The extracts the ERP components in the whole epoch (0 s to number of voltage features per channel was treated as a 1 s) and inherently reduces dimensionality by using few hyperparameter. One version to calculate the covariance obtained xDAWN components instead of the whole EEG matrix for LDA is to estimate one matrix for each class channel set. and average these matrices into a common matrix. We call this LDA approach LDA c-covs. The implementation for this method was taken from the scikit-learn toolbox version v0.21.3 (Pedregosa et al. 2011). Alternatively, Table 3 Used time interval boundaries for the temporal sample-wise one can subtract the class-wise means from the data, averaging for the different paradigms and then pool the data of both classes and calculate # of Intervals Interval boundaries [s] one common covariance matrix from this pooled data (cf. (4)to(6)in“Feature Extraction and Covariance Visual and tone paradigms Calculation”). We refer to this approach as LDA p-cov. 2 {0.10, 0.18, 0.28} This was a custom implementation that can be found in 5 {0.10, 0.17, 0.23, 0.30, 0.41, 0.50} our repository. The third version is the newly proposed 10 {0.10, 0.14, 0.17, 0.20, 0.23, 0.27, LDA with a time-decoupled covariance estimation, named 0.30, 0.35, 0.41, 0.45, 0.50} LDA imp. p-cov, which is detailed in “Time-decoupled 40 {0.10, 0.11,..., 0.49, 0.50} Covariance Matrices”. As we are interested in settings with tiny datasets, we Word paradigms often faced the situation that the feature dimensionality 2 {0.40, 0.56}, {0.65, 0.91} exceeds the number of training samples. 
Therefore, the 5 {0.18, 0.26, 0.40, 0.56, 0.68, 0.91} second major classifier type uses a dimensionality reduction 10 {0.18, 0.23, 0.29, 0.40, 0.48, 0.56, step, which was performed initially on the voltage features 0.61, 0.68, 0.75, 0.82, 0.91} using a linear kernel PCA (Scholkopf et al. 1997). This 73 {0.18, 0.19,..., 0.90, 0.91} results in a smaller number of component features, which 466 Neuroinform (2021) 19:461–476 were furthermore classified by a LDA c-covs approach. obtained the best average performance across all datasets The number of components to use was treated as a and subjects. hyperparameter. In each cross-validation split, the kPCA components were calculated on the training fold and applied Feature Extraction and Covariance Calculation to both training and test folds. For brevity, we refer to this classifier type as kPCA. This section details the typical process of obtaining The third classifier type makes use of a specific space amplitude-based features and the LDA weights as detailed to represent each epoch as a covariance matrix and perform by Blankertz et al. (2011) for ERP-based BCIs. We describe operations in this space of covariance matrices using this procedure very detailed, as we build upon parts of it in Riemannian geometry. First proposed for BCI by Barachant the next section for our proposed method. and colleagues for motor imagery data (Barachant et al. The number of epochs per training dataset, the number 2010), extensions for ERP processing have been proposed. of available EEG channels and the number of time intervals We followed the ERP analysis pipeline of Kolkhorst (or, when using kPCA preprocessing, the number of et al. (2018), making use of xDAWN as a spatial filter kPCA components) varied between datasets. However, for preprocessing step (Rivet et al. 
2009), extending the feature readability we will simplify the notation in this section representation by target (and non-target) templates prior and the next, by providing the formulae for an example to calculating the covariance matrix per epoch, and a dataset with 31 channels, 5 time intervals per channel and classification thereof in a tangent space representation using 90 training epochs. Note however, that the method can be logistic regression. Note that in each cross-validation split, applied to any number of channels, epochs or time intervals the xDAWN components were determined on the training larger than one. fold and applied to both training and test folds. For the We use the notation x for the scalar value representing Riemannian method, we varied the number of xDAWN the voltage in the i-th channel c during the j-th time interval components between one and six and treated this choice as T of one epoch. This yields the stacked feature vector a hyperparameter. Additionally we varied whether the target c c c c c 1 2 31 1 31 T x := (x ,x ,...,x ,x ,...,x ) , (1) T T T T T 1 1 1 2 5 class only, or both target and non-target class templates were used in the covariance representation. This pipeline will be which contains the relevant voltage features of a single referred to as Riemann. epoch, with x ∈ R . Stacking the feature vectors x of For all classifier types, hyperparameters were evaluated all 90 available epochs of a single trial, we obtain the data using values from a predetermined grid. In the case of matrix the LDA types and kPCA, the boundaries for the time X =[x , x ,..., x ], (2) 1 2 90 interval features considered are given by Table 3.For 155×90 with X ∈ R and x belonging to the i-th epoch. instance, the boundaries {0.10, 0.18, 0.28} describe two i Similarly, the class labels of all 90 epochs are contained time intervals, with the first being [0.10, 0.18) and the in the vector second one [0.18, 0.28). 
All evaluated hyperparameters can be found in Table 4. y := (y ,y ,...,y ) y ∈{0, 1}, (3) 1 2 90 i For the kPCA and the Riemann methods, all possible with an entry of 1 indicating a target stimulus and 0 a hyperparameter combinations are evaluated. To avoid non-target of the i-th epoch. overfitting, we report the single parameter set which Before calculating the covariance matrix Σ,wemust make X mean-free. As we have two different classes in our data, target and non-target, we need the class-wise means Table 4 All evaluated hyperparameters for each type of classifier μ if y = 1 M := , (4) Type Hyperparameter Values μ if y = 0 LDA Time intervals {2, 5, 10, all} where M describes the i-th column of the matrix M, with μ and μ containing the average target / non-target ERP kPCA Time intervals {2, 5, 10, all} 1 0 voltages (in these 90 epochs), respectively. Now we can kPCA comps. {10, 20,..., 90, all} calculate the class-wise mean-free feature matrix Riemann xDAWN comps. {1, 2, 3, 4, 5, 6} X := X − M, (5) Template class {both, target} and finally obtain the sample covariance matrix A value of ‘all’ time intervals means that every EEG sample in the ERP interval is taken individually, the exact time points differ between Σ := XX , (6) the paradigms (cf. Table 3) N − 1 Neuroinform (2021) 19:461–476 155×155 with Σ ∈ R . Given that in this example we consider in order to enhance the covariance matrix needed for the using only 90 epochs, Σ is linearly dependent and therefore calculation of the LDA weight vectors and bias. We thereby not invertible. In addition, Σ badly approximates the true build on the process described in “Feature Extraction and underlying covariance matrix Σ due to a systematic bias of Covariance Calculation”. overestimating large and underestimating small eigenvalues The dimensionality of the feature vector x results in a when too few datapoints are available (Blankertz et al. covariance matrix Σ with a particular structure. In order 2011). 
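Eqs. (4) to (6) translate directly into code; a numpy sketch of the pooled covariance used by LDA p-cov (illustrative, not the repository implementation):

```python
import numpy as np

def pooled_covariance(X, y):
    """X: features of shape (D, N); y: labels in {0, 1} of length N.
    Subtract the class-wise mean from each epoch (Eqs. 4-5), then
    compute one common sample covariance from the pooled, centered
    data of both classes (Eq. 6)."""
    M = np.empty_like(X)
    for label in (0, 1):
        M[:, y == label] = X[:, y == label].mean(axis=1, keepdims=True)
    X_tilde = X - M                       # Eq. (5)
    n = X.shape[1]
    return X_tilde @ X_tilde.T / (n - 1)  # Eq. (6)
```

Because both classes are centered before pooling, the class-mean difference that carries the ERP signal does not inflate this noise covariance estimate.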
A practical method to obtain an invertible covariance to describe submatrices of Σ we use the notation Σ i:j,n:m matrix and counter the abovementioned bias is to regularize which indicates the submatrix obtained by using the i-th up the covariance matrix toward the main diagonal: to the j-th row and the n-thuptothe m-th column from Σ. For example, using the feature vector definition as described ν := diag(Σ) (7) in Eq. 1, the matrix Σ would describe the covariance 1:31,1:31 ˆ between all 31 channels within the first time interval T .The Σ := (1 − γ)Σ + γ νI ¯ (8) 1 covariance between all channels and between time intervals The sample covariance matrix Σ is regularized towards T and T is contained in Σ (and Σ the 1 2 1:31,32:62 32:62,1:31 a diagonal matrix where diagonal entries correspond other way around). to the average ν ¯ of the diagonal values of Σ.The If A2 is true, the covariance between channels (given time regularization strength γ was obtained using the Ledoit– intervals of the same size) should look similar within each Wolf lemma (Ledoit and Wolf 2004). time interval, i.e., Σ  Σ  ...  Σ (9) 1:31,1:31 32:62,32:62 125:155,125:155 Time-decoupled Covariance Matrices These five different blocks, which describe the covariance Our proposed method builds on the general LDA p-cov between channels separately for the five time intervals will pipeline from the previous section, but improves the be called B ,B ,...,B in the following. 1 2 5 covariance matrix by a better estimation of the spatial noise An example of a covariance matrix obtained from ERP structure. This is made possible by time-decoupling of the data is given by Fig. 3 as a heat map. The within-time noise estimation. interval blocks on the main diagonal (depicted with a green For the purpose of classification using LDA, two border) show a similar structure, but slightly vary in the common domain-specific assumptions about the noise in average intensities. 
The latter is caused by a different ERP data work well in practice (Blankertz et al. 2011): number of temporal samples averaged in each time interval. The first (A1) states that the noise on the ERP features However, if both A1 and A2 are true, the within-time interval is normally distributed and has zero-mean, which is reasonable to assume when using a high-pass filter on the measured signal and acknowledging that the EEG background noise is the result of many spatio-temporally overlapping brain sources. The second assumption (A2)is that the noise is unrelated to the current user task (i.e. either attend or ignore a stimulus) or—going one step further—if a stimulus has been played recently or not. On the level of a single epoch, this means that within a single EEG channel the noise should be homoscedastic, i.e. the same for the five extracted voltage features per channel. We saw before, that this is approximately fulfilled when no baseline correction is performed on the epochs (cf. Fig. 2). For the most common noise sources, such as technical noise and background EEG activity this assumption seems reasonable. In the conventional estimation of the covariance matrix, the channel-wise noise within a time interval is estimated for each time interval individually. However, when there is no difference in the channel-wise noise between the time Fig. 3 Covariance matrix of the ERP features depicted in the ERP plot intervals (A2), it seems reasonable to estimate one common in Fig. 1, except that no baseline correction was applied (see Fig. 2). channel-wise covariance matrix that is decoupled from the There are five distinct blocks (indicated by green borders), each different time intervals. 
We propose the idea, to obtain a containing the covariance between EEG channels within one time robust estimation of the between-channel covariance matrix interval 468 Neuroinform (2021) 19:461–476 channel covariance matrices B should (aside from noise) Results be equal, if the number of samples that are averaged in each of the time intervals T are identical. In this case, we We first report the average performance of the tested can calculate the within channel covariance regardless of a classification methods on all datasets and then the influence specific time interval in the epoch. Let of training dataset size on performance differences between the methods. Finally, subject-wise results are shown for C c c c T 1 2 31 x := (x ,x ,...,x ) , (10) some datasets. C 31 with x ∈ R represent the features of a single time Optimal Hyperparameters and Grand Average interval only. For our example, we obtain five of these Performance vectors x per epoch, one for each of the five different time intervals. Stacking these vectors, the feature matrix can be Searching through the hyperparameter space, we found that re-arranged to on average ten time intervals were optimal for all LDA- C C C C X :=[x , x ,..., x ], (11) 1 2 450 based approaches. For kPCA preprocessing, 70 components 31×(90·5) performed best across the datasets. The Riemannian-based with X ∈ R . Compared to Eq. 2,wenow classifier obtained the best performance when using only have a much larger number of samples to estimate ˆ the target class as a template and using five xDAWN this smaller between-channel covariance matrix Σ .The components (see Table 5). calculation of Σ can be performed as described in Eqs. 5 The grand average results using these hyperparameters and 6. Empirically we found, however, that shrinkage are shown as black markers in Fig. 4. 
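The re-arrangement and block substitution developed in the remainder of this section can be condensed into a short numpy sketch (simplified, not the authors' implementation; it assumes equally sized time intervals and features stacked channel-wise per interval as in Eq. 1):

```python
import numpy as np

def time_decoupled_covariance(X_tilde, n_channels, n_intervals):
    """X_tilde: class-wise mean-free features, shape (D, N) with
    D = n_channels * n_intervals. Estimates one common between-channel
    covariance from all intervals of all epochs and writes it back into
    the within-interval blocks B_m of the pooled covariance, rescaled
    so that each substituted block keeps the original block's
    determinant."""
    D, N = X_tilde.shape
    sigma = X_tilde @ X_tilde.T / (N - 1)
    # Every time interval of every epoch becomes one sample for the
    # smaller (n_channels x n_channels) between-channel covariance.
    per_interval = X_tilde.reshape(n_intervals, n_channels, N)
    X_c = np.concatenate(list(per_interval), axis=1)  # (n_channels, N * n_intervals)
    sigma_c = X_c @ X_c.T / (X_c.shape[1] - 1)
    out = sigma.copy()
    for m in range(n_intervals):
        block = slice(m * n_channels, (m + 1) * n_channels)
        b_m = sigma[block, block]
        # Determinant-matching rescaling of the common channel covariance.
        scale = (np.linalg.det(b_m) / np.linalg.det(sigma_c)) ** (1.0 / n_channels)
        out[block, block] = scale * sigma_c
    return out
```

Note that only the within-interval blocks on the main diagonal are replaced; the between-interval blocks remain those of the conventional pooled estimate.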
Colored markers regularization should be avoided (except when D> N)as indicate the average AUC values across subjects sepa- it negatively affects classification performance. Note, that if rately for each dataset. The proposed new LDA method the width of the feature time intervals T ,T ,...,T differs, 1 2 5 (LDA imp. p-cov) using the time-decoupled pooled covari- the data should be scaled to a common variance prior to ance matrix outperformed the corresponding LDA method creating X . This can be accomplished by considering the (LDA p-cov) with a standard shrinkage-regularized pooled number of samples in a time interval T averaged per time covariance matrix by about 4 % points AUC (p = 0.003). interval, leading to a scaling factor of |T | for the m-th The kPCA is supposed to handle large feature dimensionali- time interval. ties rather well. As its performances, however, are very close After obtaining an estimate for the between-channel to those of the LDA p-cov, we assume that the improvement covariance matrix Σ , we use it to replace the blocks B of ˆ of our proposed novel method is not merely caused by a the whole covariance matrix Σ, however, only after having better handling of high-dimensional data. rescaled Σ to match the determinant of B ,i.e. The AUC improvement of our novel approach is still det B m around 2 % AUC points on average (p = 0.036) when C C ˆ ˆ Σ := Σ . (12) ˆ C comparing the proposed novel approach with the runner up, det Σ the Riemannian method. This rescaling ensures det B = det Σ . Intuitively, the We observed strong discontinuities in the raw data of rescaling has the effect, that the overall spread of the data the EPFL datasets, expressed by sudden step-wise voltage distribution described by the covariance matrix Σ remains offsets in the data. This seems to cause serious problems for equal to the overall spread of the data distribution described the LDA-based methods, whereas the Riemannian method by B . 
After the rescaling, we can substitute B with Σ and obtain a new covariance matrix Σ which will be used for the Table 5 Optimal hyperparameters that produced the best performance calculation of the classifier weights in linear discriminant on average across all datasets for each classification method analysis: Method Hyperparameter Value −1 w := Σ (μ − μ ) (13) 0 1 LDA imp. p-cov Time intervals 10 b := w(μ + μ ). (14) 0 1 LDA p-cov Time intervals 10 ˙ LDA c-covs Time intervals 10 This new covariance matrix Σ can be understood as kPCA Time intervals 10 a covariance matrix in which the blocks on the main diagonal, i.e. the between-channel covariance, has been kPCA components 70 time-decoupled. Hereinafter we refer to the LDA that uses Riemann Template class Target this way of time-decoupling to improve the pooled data xDAWN components 5 covariance matrix as LDA imp. p-cov. Neuroinform (2021) 19:461–476 Table 6 Comparison of the grand average AUC performances across all datasets and subjects of the evaluated classification methods Baseline correction: Method No Yes LDA imp. p-cov 0.858 0.818 LDA p-cov 0.819 0.811 LDA c-covs 0.815 0.808 Riemann 0.836 0.839 kPCA 0.826 0.815 This table shows the detrimental effects of baseline corrections on LDA classification performance a threshold of 0.5 Hz. In the case of even lower thresh- olds, the influence of baseline correction may have to be re-evaluated. Fig. 4 Performances for all datasets individually (polygon markers) and averaged (‘X’ markers), both after averaging across subjects. Black error bars indicate two standard errors of the mean Influence of Training Dataset Size In the top plot in Fig. 5a the performance difference between copes better with these discontinuities. Applied to the non- the proposed LDA imp. p-cov and the runner-up Riemann EPFL datasets, the Riemann method does not show a is shown for each subject and dataset. The proposed method clear advantage over the LDA methods (see Fig. 4). 
The outperforms the Riemann method especially when the reported advantage of Riemannian methods on ERP data amount of training data is small but it stays marginally (cf. Kolkhorst et al. 2018; Barachant and Congedo 2014) superior also for most larger datasets. The EPFL datasets may be more pronounced on larger training datasets. These, deviate from this observation, which could be attributed to however, were rare in our benchmark (median: 300 epochs, the Riemann method’s ability to cope well with artifacts, as inter-quartile range: 204 to 540 epochs). these datasets contain strong discontinuities in the epoched Interestingly, kPCA increases the average performance of signals. the LDA with class-wise covariance matrices. However, the We observe, that the Riemann method performs particu- effect is not very large. Figure 4 reveals that kPCA improves larly bad on the relatively large WORD healthy dataset. In performance greatly for the EPFL datasets compared to this dataset, the informative ERP features tend to have larger LDA c-covs, but for most other datasets its performance latencies than in datasets using less complex stimuli. decreases slightly. This indicates kPCA’s ability to deal For three subjects the performance is more than 5 % well with the discontinuities in the EPFL datasets. Lower points AUC worse when using the LDA imp. p-cov method. performances on the remaining datasets indicate that They belong to the datasets SPELLER MIX, SPELLER LLP kPCA’s hyperparameters do not generalize well over all and TONE healthy. Closer investigation revealed that for datasets. some virtual data subsets in these subjects the eigenvalues The effect of baseline correction on the classification per- of the covariance matrix were no longer all positive after formance of the different classifiers is provided by Table 6. replacing the diagonal blocks, causing the poor average We found that the performance of LDA classifiers tends performance when employed in the LDA. 
to decrease when using a baseline interval of -0.2 s to Figure 5b shows how the LDA imp. p-cov method 0 s compared to using no baseline correction at all. A compares to the regular LDA p-cov method. Here, the possible explanation for this is the effect baseline cor- same trend with respect to virtual data subset sizes can rection has on the feature’s standard deviations as shown be observed. Interestingly, our proposed method seems to in Fig. 2. As applying baseline correction violates assump- handle the discontinuities present in the EPFL datasets tion A2 (cf. “Time-decoupled Covariance Matrices”) the much better, leading to large performance differences performance decay of LDA imp. p-cov is largest among between these two methods. Compared to the Riemann all methods. The Riemannian classification method is the method, we can now see that the performances are nearly only one that benefits marginally (0.003 AUC points) from equal for the two largest datasets positioned on the right end baseline corrections. Note that we used a high-pass with of the horizontal axis. 470 Neuroinform (2021) 19:461–476 Fig. 5 a Separately for each subject and dataset, the mean AUC dif- only target class templates) for each dataset and the LDAs use the ference between the LDA using the time-decoupled pooled covariance overall optimal number of ten time intervals for each dataset. From the and the Riemannian method is provided. b Corresponding AUC dif- left to right the datasets are ordered by the average number of epochs ferences of the new method in comparison with an LDA using pooled in a virtual data subset (mean and standard deviation are provided in covariance matrices (bottom). In this overview, the Riemannian meth- brackets). 
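To make the construction of Eqs. 10–14 concrete, the following NumPy sketch builds the time-decoupled covariance matrix on synthetic data. This is an illustration under our own assumptions, not the authors' reference implementation (which is available in their repository): the variable names, the interval-major feature layout, and the small fixed shrinkage applied to the full covariance (needed here because 90 epochs cannot invert a 155-dimensional sample covariance) are ours. The determinant-matching exponent 1/n_ch follows from det(αA) = α^d · det(A) for a d×d matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ch, n_iv, n_ep = 31, 5, 90           # channels, time intervals, epochs
d = n_ch * n_iv                        # full feature dimensionality (155)

# Synthetic ERP-like features, one row per epoch, interval-major layout:
# [interval 1: 31 channels | interval 2: 31 channels | ...]
X = rng.standard_normal((n_ep, d))
y = rng.integers(0, 2, n_ep)           # 0 = non-target, 1 = target
mu0, mu1 = X[y == 0].mean(0), X[y == 1].mean(0)

# Class-mean-free data and conventional pooled covariance, with a small
# fixed shrinkage so the 155x155 matrix is invertible despite n_ep < d.
Xc = np.where((y == 1)[:, None], X - mu1, X - mu0)
S = Xc.T @ Xc / (n_ep - 1)
gamma = 0.2
S = (1 - gamma) * S + gamma * (np.trace(S) / d) * np.eye(d)

# Eqs. 10/11: every time interval contributes one 31-dimensional sample,
# so the between-channel covariance is estimated from 5x more samples.
XC = Xc.reshape(n_ep * n_iv, n_ch)
S_C = XC.T @ XC / (XC.shape[0] - 1)

# Eq. 12: rescale S_C to match the determinant of each diagonal block B_m,
# then substitute it for that block (the time-decoupling step).
S_dot = S.copy()
_, logdet_C = np.linalg.slogdet(S_C)
for m in range(n_iv):
    blk = slice(m * n_ch, (m + 1) * n_ch)
    _, logdet_B = np.linalg.slogdet(S[blk, blk])
    scale = np.exp((logdet_B - logdet_C) / n_ch)   # enforces det equality
    S_dot[blk, blk] = scale * S_C

# Eqs. 13/14: LDA weights and bias from the time-decoupled covariance.
w = np.linalg.solve(S_dot, mu1 - mu0)
b = -0.5 * w @ (mu0 + mu1)
scores = X @ w + b                     # > 0 means "classified as target"
```

The determinants are matched via `slogdet` to stay numerically stable. Note that, as reported above for three subjects, the matrix resulting from the block substitution is not guaranteed to stay positive definite, so in practice it should be checked before use.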
The impact of having few training data depends on multiple factors, such as paradigm, signal-to-noise ratio and dimensionality. To better quantify this impact, we additionally evaluated how the performance difference between our proposed method and the baseline LDA p-cov method develops depending on the amount of training data. For three datasets, we trained both classifiers using an increasing number of training samples per VDS, from 100 up to the largest amount that was available for all subjects in a dataset. In order to obtain standard error estimates, we calculated the cross-validated AUC for each VDS size 20 times on different within-class permutations. As shown in Fig. 6, for few training samples LDA imp. p-cov provides a better average performance for each dataset. As expected, this improvement is reduced when more and more training data is used. Note that for the BNCI patient dataset (Fig. 6b), the mean drops below 0, while the median stays close to 0. This difference is caused by only one subject who has bad performance using the LDA imp. p-cov with 1000 to 3000 epochs for the cross-validation.

Our proposed method estimates a more reliable version of the between-channel covariance matrix. To determine how the number of channels impacts our proposed method, we evaluated on the TONE healthy dataset (as it offered the largest number of channels), both with an increasing amount of data per VDS and with artificially reduced channel subsets.
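The learning-curve protocol described above (cross-validated AUC over increasing training-set sizes, repeated on shuffled within-class permutations) can be sketched as follows. Since the time-decoupled classifier is not part of scikit-learn, this sketch uses scikit-learn's shrinkage-regularized LDA as a stand-in; the synthetic data, subset sizes and split counts are illustrative, not the study's exact values.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(7)
X = rng.standard_normal((2000, 155))
y = np.array([0, 1] * 1000)
X[y == 1, :10] += 0.3                  # weak class difference -> AUC above chance

curve = {}
for n_train in [100, 200, 400, 800]:   # increasing training samples per subset
    # 20 stratified permutations per size, keeping class ratios fixed
    splits = StratifiedShuffleSplit(n_splits=20, train_size=n_train,
                                    random_state=0)
    aucs = []
    for tr, te in splits.split(X, y):
        clf = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
        clf.fit(X[tr], y[tr])
        aucs.append(roc_auc_score(y[te], clf.decision_function(X[te])))
    # mean AUC and its standard error across the 20 permutations
    curve[n_train] = (float(np.mean(aucs)),
                      float(np.std(aucs) / np.sqrt(len(aucs))))
```

Plotting the mean and two standard errors per training-set size reproduces the kind of learning curve shown in Fig. 6; comparing two such curves (one per classifier) gives the performance-difference curves discussed in the text.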
Fig. 6 Impact of increasing training samples on the performance difference between LDA imp. p-cov and LDA p-cov for the datasets a BI b, b BNCI patient and c BNCI healthy 2. Values above 0 indicate a better performance of our proposed LDA imp. p-cov method. With growing dataset sizes, both methods converge to the same performance.

Fig. 7 Interaction between amount of training data and number of channels. Each curve represents a different number of channels and depicts the performance difference between LDA imp. p-cov and LDA p-cov for the TONE healthy dataset for varying numbers of training samples per VDS. For this purpose, the full channel set of 63 channels was reduced to smaller, approximately equidistant sets. Each curve provides the median of 20 permutations. Values above 0 indicate a better performance of our proposed LDA imp. p-cov method.

The results in Fig. 7 show that the performance improvement remains relatively stable when using 63 or 31 channels. However, especially when using very small channel sets of only four channels, the performance improvement obtained by our new approach is lower across all training dataset sizes, when compared to using the full set of channels.

Subject-wise Results for Selected Datasets

In Fig. 8, absolute AUC performances of each subject are provided for three selected datasets, separately for the five classification approaches. The SPOT dataset on top (a) has the smallest virtual data subsets of 90 epochs each. We can see that our proposed method outperforms all other methods for every individual subject. Additionally, the ranking between subjects is very stable between the five approaches, and specifically between the LDA-based methods.

In Fig. 8b, the SPELLER LLP dataset with 204 epochs per virtual data subset is shown. A few outlier subjects are observed with markedly decreased performances. The left-most dark triangle in the LDA imp. p-cov method corresponds to the data of one of the subjects which shows the numerical issues described previously. However, these numerical issues do not apply to all poorly performing subjects. The second worst subject in the LDA imp. p-cov method (left-most dark square), for example, shows performance gains with the novel method compared to the other methods.

Figure 8c shows the BI b dataset with 480 epochs per virtual data subset. In this dataset, the overall performance of the LDA imp. p-cov method is slightly better than that of the runner-up LDA p-cov. Additionally, the performance of most subjects is very similar between the two methods, and only some subjects show a noticeable performance gain using the time-decoupled matrices. In the EEG data, we did not find any immediate indicator, such as artifacts or heavy noise, that would explain why that could be the case.

Discussion

In this work we considered mostly the traditional two-class ERP oddball paradigms. As our proposed method improves the covariance matrix, it could also be applied to multi-class methods that require a covariance matrix estimation, e.g. multi-class LDA, given that the assumptions we made in "Time-decoupled Covariance Matrices" are fulfilled. Additionally, there are other BCI paradigms using different kinds of signals. For error-related potentials (Dal Seno et al. 2010) and slow cortical potentials (Krauledat et al. 2004), our method should be applicable without any additional adaptations, except choosing the relevant time intervals. However, this needs to be confirmed in future work. The transfer of our proposed method to oscillatory signals, such as steady-state evoked potentials and event-related de-/synchronization in motor imagery, is not as straightforward. For these signals, usually the feature vector contains only spatial data from one single time interval. However, if features from multiple time intervals are used, our method should be applicable, given that our assumptions are fulfilled.

In this work, we only evaluated the classification of brain signals. In theory, our approach should be applicable to regression approaches. However, these have to be covariance-based and use a spatio-temporal covariance matrix.
While such approaches exist, they typically use only a spatial covariance matrix (Dähne et al. 2014), or use features in different frequency bands (Fatemi and Daliri 2020), which can violate the assumption of feature homoscedasticity over time.

For three subjects who were identified as negative performance outliers, we found that numerical instability can be caused by the proposed diagonal block replacement. Unfortunately, so far we found no indicators in the EEG signals, e.g. artifacts or heavy noise, which could predict whether this rare problem will occur. In future work, we aim to determine the cause of the numerical instability from the data, and how established approaches, e.g. regularization of the covariance matrix, can be implemented to obtain a well-conditioned matrix after the replacement operation.

Note that so far we only inspected the impact of the time-decoupling of the covariance matrix on LDA performance. While we can observe improved performances, it still is unclear whether this new covariance matrix is closer to the true underlying covariance matrix. In future work, we plan to run simulation studies to evaluate which matrix estimation technique, i.e. time-decoupling, shrinkage or the sample covariance matrix, is closest to the actual data-generating covariance matrix.

We observed a performance benefit not only on these small datasets, but on most tested datasets of varying dimensionality and SNR level. Thus we are optimistic that the applicability of our proposed novel method is not restricted to the domain of BCI. Instead, it could be valuable to apply it also to other data. Generally, any data which also has a spatio-temporal structure, and in which the spatial noise can be assumed to be constant with respect to time, could profit from the proposed approach. Specific candidates are MEG and multi-electrode EMG recordings. Another possible application could be spatially distributed sensor networks that make use of identically constructed sensors.

We also found that using a linear kernel principal component analysis does not improve the performance of an LDA classifier for most datasets. This indicates that large dimensionality is not a primary issue in these datasets.

In ERP paradigms, the concrete ratio of target and non-target stimuli within chronological virtual data subsets depends on the used stimulus sequence. As in this work we focused on method development, we chose to use a stratified cross-validation scheme rather than a chronological cross-validation scheme, which typically is preferred in the BCI domain. While the latter would have been closer to the final application, it could not guarantee that all folds have the same class ratios. This would have been a disadvantage for the comparison of methods, as it would introduce another challenge into the benchmark, i.e. how well a method handles differing class ratios between training and validation. In fact, for datasets recorded in our lab, in which the generated stimulus sequence guarantees stratified folds, we observed that the relative performance differences between the methods were nearly identical (data not shown), but the overall performance across methods dropped by up to five AUC percentage points, depending on the dataset.

Our proposed novel time-decoupling of the covariance has merit when using it to enhance the feature covariance matrix of an LDA classifier. We think our approach improves the usability of ERP-based BCIs due to several arguments, for which we found clear evidence in our extensive evaluation on multiple datasets.

First, our approach offers a simple, yet effective, way to improve the classification performance for very small datasets, a problem identified by multiple authors in the field of BCI, who emphasized the need for decoding algorithms to be able to handle training with few data points.

Second, the new method allows to shorten the required calibration time while still keeping the same classification performance, a quality that improves usability specifically for patient studies.

Third, our approach yields classification performances well above chance level even when using tiny (72 epochs) amounts of training data. This characteristic can make experimental parameter optimization feasible, as the classification performance can be estimated reliably even on very short EEG recordings.

Fourth, with increasing training data, the classification performance of our method converges to the performance of the regular LDA. Therefore, there is no harm done using our approach even when abundant training data is available, or when it is unclear whether the size of the training dataset is in the right range for profiting from the time-decoupled covariances.

Due to the aforementioned arguments, we would recommend BCI practitioners to use the proposed time-decoupled covariances for LDA as a first-shot method, as it shows clear benefits over an ordinary shrinkage-regularized LDA for most ERP-classification scenarios.

Fig. 8 Results for the a SPOT, the b SPELLER LLP and the c BI b dataset. Subjects can be matched by marker type and color brightness. Within one classification method strip, subjects are ordered from left to right with respect to their individual scores obtained by the LDA imp. p-cov method. The cross marker indicates the mean AUC, with black bars marking two standard errors of the mean.

Conclusion

Using domain knowledge and exploiting the specific structure of the feature vectors in ERP classification paradigms, we propose a new way to estimate a covariance matrix that outperforms a shrinkage-based covariance matrix, especially on small datasets. Our results could enable fast-adapting BCIs that require short calibration times. A possible application for our method is the tuning of stimulation parameters to an individual subject. Here, long recordings are not feasible, and the information content of short recordings should be maximized to determine the optimal parameters.

Information Sharing Statement

Results and most figures for the public datasets we used can be reproduced using the code available at https://github.com/jsosulski/time-decoupled-lda. The detailed instructions make it easy to obtain the results, especially when using the same system we used, i.e. Ubuntu 18.04 and Python 3.6.9. The proposed improved classifier makes use of the widely used sklearn API and can be used as a drop-in replacement for other sklearn classifiers. The classifier is also available in the aforementioned repository. Public datasets used in this work are automatically downloaded using the provided code.

Acknowledgements The authors are thankful for the discussion with Klaus-Robert Müller about this work.

Author Contributions Study design: JS, MT. Literature review: JS, JPK, MT. Implementation: JS, JPK. Data analysis: JS, JPK, MT. Wrote the manuscript: JS, MT. Reviewed the manuscript: JS, JPK, MT.

Funding Open Access funding enabled and organized by Projekt DEAL. This work was (partly) supported by the BrainLinks-BrainTools Cluster of Excellence funded by the German Research Foundation (DFG, grant number EXC 1086) and the project SuitAble (DFG, grant number 387670982). The authors would also like to acknowledge support by the state of Baden-Württemberg, Germany, through bwHPC and the German Research Foundation (DFG, INST 39/963-1 FUGG).

Data Availability All datasets used in this work are listed in Table 1. We used both publicly available data and restricted-access data that was recorded in our lab; the subjects' permission we collected did not include publishing raw EEG data. (See Information Sharing Statement.)

Code Availability The code that was used to produce the benchmark (on the public datasets) can be found in the above-mentioned repository. The classifier implementation is also available there. As it follows the popular sklearn API, it can be used as a drop-in replacement in many Python pipelines. (See Information Sharing Statement.)

Compliance with Ethical Standards

Conflict of Interests The authors declare they have no conflicts of interest.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Allison, B.Z., & Pineda, J.A. (2006). Effects of SOA and flash pattern manipulations on ERPs, performance, and preference: implications for a BCI system. International Journal of Psychophysiology, 59(2), 127–140.
Aricò, P., Aloise, F., Schettini, F., Salinari, S., Mattia, D., Cincotti, F. (2014). Influence of P300 latency jitter on event related potential-based brain–computer interface performance. Journal of Neural Engineering, 11(3), 035008.
Barachant, A., & Congedo, M. (2014). A plug&play P300 BCI using information geometry. arXiv:1409.0107.
Barachant, A., Bonnet, S., Congedo, M., Jutten, C. (2010). Riemannian geometry applied to BCI classification. In International conference on latent variable analysis and signal separation (pp. 629–636). Springer.
Bashashati, H., Ward, R.K., Bashashati, A. (2016). User-customized brain computer interfaces using Bayesian optimization. Journal of Neural Engineering, 13(2), 026001.
Bishop, C.M. (2006). Linear models for classification. In Pattern recognition and machine learning, chap. 4 (pp. 179–220). Springer.
Blankertz, B., Lemm, S., Treder, M., Haufe, S., Müller, K.R. (2011). Single-trial analysis and classification of ERP components—a tutorial. NeuroImage, 56(2), 814–825.
Dähne, S., Meinecke, F.C., Haufe, S., Höhne, J., Tangermann, M., Müller, K.R., Nikulin, V.V. (2014). SPoC: a novel framework for relating the amplitude of neuronal oscillations to behaviorally relevant parameters. NeuroImage, 86, 111–122.
Dal Seno, B., Matteucci, M., Mainardi, L. (2010). Online detection of P300 and error potentials in a BCI speller. Computational Intelligence and Neuroscience, 2010.
Farquhar, J., & Hill, N.J. (2013). Interactions between pre-processing and classification methods for event-related-potential classification. Neuroinformatics, 11(2), 175–192.
Fatemi, M., & Daliri, M.R. (2020). Nonlinear sparse partial least squares: an investigation of the effect of nonlinearity and sparsity on the decoding of intracranial data. Journal of Neural Engineering, 17(1), 016055.
Feess, D., Krell, M.M., Metzen, J.H. (2013). Comparison of sensor selection mechanisms for an ERP-based brain-computer interface. PLOS One, 8(7), e67543.
Foodeh, R., Khorasani, A., Shalchyan, V., Daliri, M.R. (2016). Minimum noise estimate filter: a novel automated artifacts removal method for field potentials. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 25(8), 1143–1152.
Guger, C., Daban, S., Sellers, E., Holzner, C., Krausz, G., Carabalona, R., Gramatica, F., Edlinger, G. (2009). How many people are able to control a P300-based brain–computer interface (BCI)? Neuroscience Letters, 462(1), 94–98.
Hoffmann, U., Vesin, J.M., Ebrahimi, T., Diserens, K. (2008). An efficient P300-based brain–computer interface for disabled subjects. Journal of Neuroscience Methods, 167(1), 115–125.
Höhne, J., & Tangermann, M. (2012). How stimulation speed affects event-related potentials and BCI performance. In 2012 Annual international conference of the IEEE engineering in medicine and biology society (pp. 1802–1805). https://doi.org/10.1109/EMBC.2012.6346300.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.
Hübner, D., Verhoeven, T., Schmid, K., Müller, K.R., Tangermann, M., Kindermans, P.J. (2017). Learning from label proportions in brain-computer interfaces: online unsupervised learning with guarantees. PLOS One, 12(4).
Hübner, D., Verhoeven, T., Müller, K.R., Kindermans, P.J., Tangermann, M. (2018). Unsupervised learning for brain-computer interfaces based on event-related potentials: review and online comparison. IEEE Computational Intelligence Magazine, 13(2), 66–77.
Jayaram, V., & Barachant, A. (2018). MOABB: trustworthy algorithm benchmarking for BCIs. Journal of Neural Engineering, 15(6), 066011.
Jayaram, V., Alamgir, M., Altun, Y., Schölkopf, B., Grosse-Wentrup, M. (2016). Transfer learning in brain-computer interfaces. IEEE Computational Intelligence Magazine, 11(1), 20–31.
Kolkhorst, H., Tangermann, M., Burgard, W. (2018). Guess what I attend: interface-free object selection using brain signals. In IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 7111–7116). IEEE.
Krauledat, M., Dornhege, G., Blankertz, B., Losch, F., Curio, G., Müller, K.R. (2004). Improving speed and accuracy of brain-computer interfaces using readiness potential features. In The 26th annual international conference of the IEEE engineering in medicine and biology society (Vol. 2, pp. 4511–4515). IEEE.
Lal, T.N., Schröder, M., Hinterberger, T., Weston, J., Bogdan, M., Birbaumer, N., Schölkopf, B. (2004). Support vector channel selection in BCI. IEEE Transactions on Biomedical Engineering, 51(6), 1003–1010.
Ledoit, O., & Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2), 365–411.
Lotte, F., Congedo, M., Lécuyer, A., Lamarche, F., Arnaldi, B. (2007). A review of classification algorithms for EEG-based brain–computer interfaces. Journal of Neural Engineering, 4(2), R1.
Lotte, F., Bougrain, L., Cichocki, A., Clerc, M., Congedo, M., Rakotomamonjy, A., Yger, F. (2018). A review of classification algorithms for EEG-based brain–computer interfaces: a 10 year update. Journal of Neural Engineering, 15(3), 031005.
Musso, M., Bambadian, A., Denzer, S., Umarova, R., Hübner, D., Tangermann, M. (2016). A novel BCI based rehabilitation approach for aphasia rehabilitation. In Proceedings of the 6th international brain-computer interface meeting (p. 104).
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Riccio, A., Simione, L., Schettini, F., Pizzimenti, A., Inghilleri, M., Olivetti Belardinelli, M., Mattia, D., Cincotti, F. (2013). Attention and P300-based BCI performance in people with amyotrophic lateral sclerosis. Frontiers in Human Neuroscience, 7, 732.
Rivet, B., Souloumiac, A., Attina, V., Gibert, G. (2009). xDAWN algorithm to enhance evoked potentials: application to brain–computer interface. IEEE Transactions on Biomedical Engineering, 56(8), 2035–2043.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Rutkowski, T.M., & Mori, H. (2015). Tactile and bone-conduction auditory brain computer interface for vision and hearing impaired users. Journal of Neuroscience Methods, 244, 45–51.
Sannelli, C., Dickhaus, T., Halder, S., Hammer, E.M., Müller, K.R., Blankertz, B. (2010). On optimal channel configurations for SMR-based brain–computer interfaces. Brain Topography, 23(2), 186–193.
Schölkopf, B., Smola, A., Müller, K.R. (1997). Kernel principal component analysis. In International conference on artificial neural networks (pp. 583–588). Springer.
Schreuder, M., Blankertz, B., Tangermann, M. (2010). A new auditory multi-class brain-computer interface paradigm: spatial hearing as an informative cue. PLOS One, 5(4), 1–14. https://doi.org/10.1371/journal.pone.0009813.
Sellers, E.W., Arbel, Y., Donchin, E. (2012). BCIs that use P300 event-related potentials. In Brain-computer interfaces: principles and practice (p. 215). Oxford University Press.
Sosulski, J., & Tangermann, M. (2019). Spatial filters for auditory evoked potentials transfer between different experimental conditions. In Proceedings of the 8th Graz brain-computer interface conference (pp. 273–278).
Srinivasan, R. (2012). Acquiring brain signals from outside the brain. In Brain-computer interfaces: principles and practice, chap. 6 (pp. 105–122).
Sugi, M., Hagimoto, Y., Nambu, I., Gonzalez, A., Takei, Y., Yano, S., Hokari, H., Wada, Y. (2018). Improving the performance of an auditory brain-computer interface using virtual sound sources by shortening stimulus onset asynchrony. Frontiers in Neuroscience, 12, 108.
Van Veen, G., Barachant, A., Andreev, A., Cattan, G., Rodrigues, P.C., Congedo, M. (2019). Building Brain Invaders: EEG data of an experimental validation. arXiv:1905.05182.
Wilcoxon, F., Katti, S., Wilcox, R.A. (1970). Critical values and probability levels for the Wilcoxon rank sum test and the Wilcoxon signed rank test. Selected Tables in Mathematical Statistics, 1, 171–259.
Winkler, I., Brandl, S., Horn, F., Waldburger, E., Allefeld, C., Tangermann, M. (2014). Robust artifactual independent component classification for BCI practitioners. Journal of Neural Engineering, 11(3), 035013.
Wolpaw, J.R., Birbaumer, N., McFarland, D.J., Pfurtscheller, G., Vaughan, T.M. (2002). Brain–computer interfaces for communication and control. Clinical Neurophysiology, 113(6), 767–791. https://doi.org/10.1016/S1388-2457(02)00057-3.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Affiliations

Jan Sosulski (1) · Jan-Philipp Kemmer (2) · Michael Tangermann (1,3,4)
Jan Sosulski: jan.sosulski@blbt.uni-freiburg.de
Jan-Philipp Kemmer: jan-philipp.kemmer@venus.uni-freiburg.de
(1) Brain State Decoding Lab, Cluster of Excellence BrainLinks-BrainTools, Department of Computer Science, University of Freiburg, Freiburg, Germany
(2) University of Freiburg, Freiburg, Germany
(3) Autonomous Intelligent Systems Lab, Department of Computer Science, University of Freiburg, Freiburg, Germany
(4) Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands

Journal: Neuroinformatics 19, 461–476 (Springer Journals). Published: Dec 14, 2020.