

Real time anomaly detection and categorisation

The ability to quickly and accurately detect anomalous structure within data sequences is an inference challenge of growing importance. This work extends recently proposed post-hoc (offline) anomaly detection methodology to the sequential setting. The resultant procedure is capable of real-time analysis and categorisation between baseline and two forms of anomalous structure: point and collective anomalies. Various theoretical properties of the procedure are derived. These, together with an extensive simulation study, highlight that the average run length to false alarm and the average detection delay of the proposed online algorithm are very close to those of the offline version. Experiments on simulated and real data are provided to demonstrate the benefits of the proposed method.

Keywords Anomaly detection · SCAPA · Streaming data · Real time

Idris A. Eckley (i.eckley@lancaster.ac.uk)
STOR-i Centre for Doctoral Training, Lancaster University, LA1 4YF Lancaster, UK
Department of Mathematics and Statistics, Lancaster University, LA1 4YF Lancaster, UK
Statistics and Computing (2022) 32:55

1 Introduction

The detection of anomalies in time series has received considerable attention in both the statistics (Chen and Liu 1993) and machine learning (Chandola et al. 2009) literature. This is no surprise given the broad range of applications, from fraud detection (Ferdousi and Maeda 2006) to fault detection (Theissler 2017; Zhao et al. 2018), that this area lends itself to. In recent years, the proliferation of sensors within the internet of things (IoT) has led to the emergence of real time detection of anomalies in streaming (high frequency) data as an important new challenge.

Anomalies can be classified in a number of different ways (Chandola et al. 2009). In this work, following the definitions of Fisch et al. (2022a), we distinguish between point and collective anomalies. Point anomalies, also known as outliers, global anomalies or contextual anomalies (Chandola et al. 2009), are single observations that are anomalous with regards to their local or global data context. Conversely, collective anomalies, also known as abnormal regions (Bardwell and Fearnhead 2017), or epidemic changepoints (Bruce and Jennie 1985), are sequences of contiguous observations which are not necessarily anomalous when compared to either their local or the global data context, but together form an anomalous pattern. Figure 1 provides several examples. In this paper, the terms collective anomaly and epidemic changepoint are used interchangeably.

Fig. 1 Time series containing collective and point anomalies, panels (a), (b) and (c). Typical data shown in grey, anomalous segments in red and point anomalies shown in blue

The epidemic changepoint model assumes that data follows some baseline, or typical, distribution everywhere except for some anomalous time windows during which it follows another distribution. The detection of epidemic changes in mean was first studied by Bruce and Jennie (1985) with applications to epidemiology. Since then, research in this area has been driven by various applications including the detection of copy number variants in DNA (Bardwell and Fearnhead 2017; Jeng et al. 2013; Olshen et al. 2004) and the analysis of brain imaging data (Aston and Kirch 2012; Stoehr et al. 2019). In particular, much of the pertinent literature has concentrated on the epidemic change in mean setting. See Yao (1993) and Gut and Steinebach (2005) for details.

More recently, the detection of joint epidemic changes in mean and variance as well as point anomalies was considered by Fisch et al. (2022a). In parallel, there has also been some work on detecting anomalies within the online setting. Gut and Steinebach (2005) consider the problem of detecting epidemic changepoints sequentially while Wang et al. (2011) and Ahmad et al. (2017) propose methods for the online detection of point anomalies. Other recent contributions include the MOSUM work by Kirch and collaborators (see, e.g., Eichinger and Kirch (2018)) that permits the online detection of changepoints, and is thus able to identify collective anomalies. Various outlier-robust Kalman filter approaches have also been proposed in recent years. See for example Ting et al. (2007), Agamennoni et al. (2011), Ruckdeschel et al. (2014), Chang (2014) and references therein. Fearnhead and Rigaill (2019) have introduced a computationally efficient changepoint detection approach that is robust to the presence of anomalies. Similarly the OSTAD package, developed by Iturria et al. (2020), can be used online to identify contextual anomalies.

The main contribution of this paper is to extend the offline Collective And Point Anomaly (CAPA) algorithm of Fisch et al. (2022a) to the online setting to detect both collective and point anomalies in streaming data, formalising early heuristic ideas appearing in Bezahaf et al. (2019). We call this algorithm Sequential-CAPA (SCAPA). We explore various practical aspects of working in this online setting, including (i) computational and storage costs, (ii) the typical (baseline) parameter estimation and (iii) penalty selection. In addition, we provide various theoretical and empirical guarantees on the resulting cost and accuracy trade-offs that arise from our focus on the online setting.

The article is organised as follows. In Sect. 2 we introduce the literature on offline detection of anomalous time series regions, particularly focusing on the recently proposed CAPA approach. Sect. 3 proceeds to extend this methodology to the online setting, introducing the Sequential Collective and Point Anomaly (SCAPA) algorithm. Theoretical properties of the proposed methodology are investigated in Sect. 4. Further results, together with a set of simulation studies, are given in Sect. 5, indicating how these can be used to inform practitioners on how to select the hyper-parameters of SCAPA. Finally, we apply SCAPA to the monitoring of a sensor in a publicly available, industrial machine-level data set in Sect. 6. All proofs can be found in the supplementary material.

2 Background

CAPA, introduced by Fisch et al. (2022a), seeks to jointly detect and distinguish between point and collective anomalies within an offline, univariate time series setting. The heart of the approach is founded upon an epidemic changepoint model. To this end, consider a stochastic process $x_t \sim D(\theta(t))$, drawn from some distribution, $D$, indexed by a set of model parameters, $\theta(t)$. Collective anomalies can then be modelled as epidemic changes of the set of parameters $\theta(t)$, i.e., time windows in which $\theta(t)$ deviates from the typical, and potentially unknown, set of parameters $\theta_0$. Formally,

$$\theta(t) = \begin{cases} \theta_1 & s_1 < t \le e_1, \\ \;\vdots & \\ \theta_K & s_K < t \le e_K, \\ \theta_0 & \text{otherwise.} \end{cases}$$

Here $K$ denotes the number of collective anomalies, while $s_i$, $e_i$, and $\theta_i$ correspond to the start point, end point and the unknown parameter(s) of the $i$th collective anomaly respectively.

The number and locations of collective anomalies are estimated by choosing $K$, $(s_1, e_1), \ldots, (s_K, e_K)$, and $\theta_0$ such that they minimise the penalised cost

$$\sum_{t \notin \cup_i [s_i+1,\, e_i]} C(x_t, \theta_0) + \sum_{j=1}^{K} \left[ \min_{\theta_j} \left( \sum_{t=s_j+1}^{e_j} C(x_t, \theta_j) \right) + \beta_C \right]. \tag{2.1}$$

$C(\cdot, \cdot)$ is a cost function, e.g., twice the negative log-likelihood, and $\beta_C$ is a penalty term for introducing a collective anomaly, which seeks to prevent overfitting. A minimum segment length, $l$, can be imposed by adding the constraint $e_k - s_k \ge l$ for $k = 1, 2, \ldots, K$, if collective anomalies of interest are assumed to be of length at least $l \ge 1$.
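As a rough illustration of how the penalised cost (2.1) trades fit against the penalty $\beta_C$, the sketch below evaluates it for a fixed candidate segmentation. It is a minimal sketch under stated assumptions, not the CAPA implementation: the cost is taken to be twice the negative Gaussian log-likelihood with constants dropped, the typical parameters are treated as known rather than optimised over, and the function names `typical_cost`, `segment_cost` and `penalised_cost` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def typical_cost(x: float, mu0: float, sigma0: float) -> float:
    """C(x_t, theta_0): twice the negative Gaussian log-likelihood, 2*pi constant dropped."""
    return math.log(sigma0 ** 2) + ((x - mu0) / sigma0) ** 2


def segment_cost(xs: Sequence[float]) -> float:
    """Minimum over theta_j of the summed cost on one segment (Gaussian MLE plugged in).

    Assumption: the segment has non-zero sample variance.
    """
    n = len(xs)
    mean = sum(xs) / n
    var = sum((v - mean) ** 2 for v in xs) / n
    return n * (math.log(var) + 1.0)


def penalised_cost(
    x: Sequence[float],
    segments: List[Tuple[int, int]],
    mu0: float,
    sigma0: float,
    beta_c: float,
    min_len: int = 2,
) -> float:
    """Penalised cost of a candidate segmentation.

    Each (s, e) marks a collective anomaly covering observations s+1, ..., e
    (1-based), which is the Python slice x[s:e].
    """
    anomalous = set()
    total = 0.0
    for s, e in segments:
        if e - s < min_len:
            raise ValueError("segment shorter than the minimum length")
        idx = set(range(s, e))
        if anomalous & idx:
            raise ValueError("collective anomalies must not overlap")
        anomalous |= idx
        total += segment_cost(x[s:e]) + beta_c
    # Observations outside every anomalous window are charged the typical cost.
    total += sum(
        typical_cost(v, mu0, sigma0) for i, v in enumerate(x) if i not in anomalous
    )
    return total


# Toy series: a short run with raised mean in the middle of near-zero data.
x = [0.0, 0.1, -0.1, 5.0, 6.0, 5.0, 6.0, 0.0, 0.1]
no_anomaly = penalised_cost(x, [], mu0=0.0, sigma0=1.0, beta_c=4.0)
one_anomaly = penalised_cost(x, [(3, 7)], mu0=0.0, sigma0=1.0, beta_c=4.0)
print(no_anomaly, one_anomaly)
```

On this toy series, declaring observations 4 to 7 a collective anomaly lowers the penalised cost despite paying $\beta_C$ once, which is the comparison the minimisation performs over all admissible segmentations.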
Minimising the cost function (2.1) exactly by solving a dynamic programme like the PELT method (Killick et al. 2012) is not possible. This is because the parameter of the typical distribution, $\theta_0$, is shared across segments, and introduces dependence. Fisch et al. (2022a) suggest removing this dependence in $\theta_0$ by obtaining a robust estimate $\hat\theta_0$ over the whole data and then minimising

$$\sum_{t \notin \cup_i [s_i+1,\, e_i]} C(x_t, \hat\theta_0) + \sum_{j=1}^{K} \left[ \min_{\theta_j} \left( \sum_{t=s_j+1}^{e_j} C(x_t, \theta_j) \right) + \beta_C \right], \tag{2.2}$$

as an approximation to (2.1) over just the number and location of collective anomalies. The main focus of Fisch et al. (2022a) was on the case where anomalies are characterised by an atypical mean and/or variance. In this case, the authors suggest minimising

$$\sum_{t \notin \cup_i [s_i+1,\, e_i]} \left[ \log(\hat\sigma_0^2) + \left( \frac{x_t - \hat\mu_0}{\hat\sigma_0} \right)^2 \right] + \sum_{j=1}^{K} \left[ (e_j - s_j) \left( \log\left( \frac{\sum_{t=s_j+1}^{e_j} (x_t - \bar{x}_{(s_j+1):e_j})^2}{e_j - s_j} \right) + 1 \right) + \beta_C \right],$$

subject to a minimum segment length $l$ of at least 2. The above expression arises from setting the cost function to twice the negative log-likelihood of the Gaussian. The robust estimates for mean and variance, $\hat\mu_0$ and $\hat\sigma_0^2$, can be obtained from the median and the inter-quartile range.

The main weakness of the above penalised cost is that point anomalies will be fitted as collective anomalies in a segment of length $l$. To remedy this, point anomalies are modelled as epidemic changes of length one in variance (only). The set of point anomalies is denoted as $O$. To infer both collective and point anomalies we minimise

$$\sum_{t \notin \cup_i [s_i+1,\, e_i] \cup O} \left[ \log(\hat\sigma_0^2) + \left( \frac{x_t - \hat\mu_0}{\hat\sigma_0} \right)^2 \right] + \sum_{t \in O} \left[ \log\left( (x_t - \hat\mu_0)^2 \right) + 1 + \beta_O \right] + \sum_{j=1}^{K} \left[ (e_j - s_j) \left( \log\left( \frac{\sum_{t=s_j+1}^{e_j} (x_t - \bar{x}_{(s_j+1):e_j})^2}{e_j - s_j} \right) + 1 \right) + \beta_C \right], \tag{2.3}$$

with respect to $K$, $(s_1, e_1), \ldots, (s_K, e_K)$, and $O$, subject to the constraint $e_k - s_k \ge l \ge 2$ for $k = 1, 2, \ldots, K$. Here, $\beta_O$ corresponds to a penalty for a point anomaly.

The CAPA algorithm then minimises the cost in (2.3) by solving the dynamic programme

$$C(t) = \min \Bigg[ C(t-1) + \log(\hat\sigma_0^2) + \left( \frac{x_t - \hat\mu_0}{\hat\sigma_0} \right)^2, \;\; C(t-1) + \log\left( (x_t - \hat\mu_0)^2 \right) + 1 + \beta_O, \;\; \min_{0 \le k \le t-l} \Bigg( C(k) + (t-k) \left[ \log\left( \frac{\sum_{i=k+1}^{t} (x_i - \bar{x}_{(k+1):t})^2}{t-k} \right) + 1 \right] + \beta_C \Bigg) \Bigg],$$

taking $C(0) = 0$. The upper limit $k \le t - l$ on the segment start corresponds to the constraint $e_k - s_k \ge l$.

In practice, as is common in many time series settings, some form of pre-processing of the series may be required to ensure it is of a suitable form for the CAPA framework. For example, some form of deseasonalisation may be appropriate.

3 Sequential CAPA

We now introduce our Sequential CAPA procedure. In extending CAPA to the online setting three main challenges arise. Specifically, any approach developed should be mindful of the following: (i) the computational and storage costs of the dynamic programme increase with time; (ii) the typical (baseline) parameters have to be learned online; and (iii) penalty selection. We address each of these three challenges in turn, proposing solutions in the following sections, prior to formally introducing the SCAPA algorithm in Sect. 3.3.

3.1 Increasing Computational and Storage Cost

As noted in Sect. 2, CAPA infers collective and point anomalies by solving a set of dynamic programme recursions. However, both the computational cost of each recursion and the storage cost increase linearly in the total number of observations. This is unsuitable for the online setting in which both storage and computational resources are finite.

In practice, this problem can be surmounted by imposing a maximum length $m$ for collective anomalies. This can be achieved by adding the set of constraints

$$e_i - s_i \le m \quad \forall i = 1, 2, \ldots, K \tag{3.1}$$

to the optimisation problem in equation (2.3). The resulting problem can then be solved using the following dynamic programme

$$C(t) = \min \Bigg[ C(t-1) + \log(\hat\sigma_0^2) + \left( \frac{x_t - \hat\mu_0}{\hat\sigma_0} \right)^2, \;\; C(t-1) + \log\left( (x_t - \hat\mu_0)^2 \right) + 1 + \beta_O, \;\; \min_{t-m \le k \le t-l} \Bigg( C(k) + (t-k) \left[ \log\left( \frac{\sum_{i=k+1}^{t} (x_i - \bar{x}_{(k+1):t})^2}{t-k} \right) + 1 \right] + \beta_C \Bigg) \Bigg].$$

As a consequence of restriction (3.1), each recursion only requires a finite number of calculations. Moreover, only a finite number of the optimal costs, $C(t)$, need to be stored in memory. The practical implications of this additional constraint are likely to be limited. Within this setting collective anomalies encompassing fewer than $m$ observations will be detected as before. However, an anomaly encompassing more than $m$ observations will be fitted as a succession of collective anomalies, each of length less than $m$, provided that its signal strength (cf. Sect. 5.1 for a definition) is large enough. As one might anticipate, long anomalous segments with low signal strength are no longer detectable as a result of the approximation.

3.2 Sequential estimation of the typical parameters

As described in Sect. 2, the dynamic programme used by CAPA requires robust estimates of the set of typical parameters $\theta_0 = (\mu_0, \sigma_0)$. Fisch et al. (2022a) estimate $\mu_0$ and $\sigma_0$ on the full data using the median and inter-quartile range respectively. In an online setting, however, these quantiles have to be learnt as the data is observed.

A range of methods have been proposed that aim to estimate the cumulative distribution function (CDF) of the data sequentially and use it to estimate quantiles. For example, Tierney (1983) proposed a method based on techniques from Stochastic Approximation (SA) to estimate the $\alpha$th quantile $x_{(\alpha)}$ of an unknown distribution function. Moreover, Tierney (1983) also established that, in the i.i.d. setting, the resulting sequential estimates $\hat{x}_{(\alpha),n} \to x_{(\alpha)}$ almost surely as the number of observations $n \to \infty$. Under the same assumptions, Tierney (1983) also showed that $\sqrt{n}\,(\hat{x}_{(\alpha),n} - x_{(\alpha)})$ converges in distribution to a Normal distribution. These consistency results are important for an online implementation of CAPA, as Fisch et al. (2022a) showed that the consistency of CAPA requires the robustly estimated mean and variance to be within $O\left(\sqrt{\log(n)/n}\right)$ of the true typical mean and variance.

The memory required to obtain the SA-estimate is finite and small. Moreover, the standard errors of the SA-estimate and sample quantiles are close even for relatively small sample sizes, as can be seen from Fig. 2. Further, we note that these estimates tend to be considerably more accurate than those of other commonly used methods such as the quantile filter (Justusson 1981) and the $P^2$-algorithm (Jain and Chlamtac 1985). This is due to the fact that the quantile filter is not consistent, and that the $P^2$-algorithm is not robust with respect to outliers, thus losing a critical property of quantile estimators.

Fig. 2 a) Example time series with collective and point anomalies as well as the b) median and c) IQR estimated sequentially over time using different methods: the quantile filter by Justusson (1981) (Filter), the $P^2$-method of Jain and Chlamtac (1985) (P squared) and the Stochastic Approximation based method by Tierney (1983) (SA)

Pseudo-code for the SA-based method is given in Algorithm 1. Using a burn-in period to stabilise the quantile estimates is recommended, as even the exact order statistics take some time to initially converge. SA-based methods can also be used to calculate other important statistics in an online fashion. For example, Sharia (2010) applied SA-techniques to learn auto-regressive parameters sequentially. Such estimators can be used to inflate the penalties used to account for deviations from the i.i.d. assumptions. This is discussed in more detail in Sect. 5.

3.3 Penalty selection

We now turn to the important question of penalty selection. In the offline setting, penalties are typically chosen to control false positives under the null hypothesis. For example, Fisch et al. (2022a) suggested using penalties

$$\beta_C(a, \lambda) = 2\left(1 + \lambda + \sqrt{2\lambda}\right)\,[\ldots\,(a-1)\,\ldots], \qquad \beta_O(\lambda) = 2\lambda, \tag{3.2}$$

indexed by a single parameter, $\lambda$, for CAPA when considering the change in mean and variance setting. Here, the penalty for collective anomalies, $\beta_C$, depends on the length $a$ of the putative collective anomaly. The motivation for these penalties is to ensure that the estimates for the number of collective anomalies and the set of point anomalies, $\hat{K}$ and $\hat{O}$, satisfy

$$\Pr(\hat{K} = 0, \hat{O} = \emptyset) \ge 1 - C_1 n e^{-\lambda} - C_2 (n e^{-\lambda})^2, \tag{3.3}$$

under the null hypothesis that no point or collective anomaly is present in the data. Consequently, setting $\lambda = \log(n)$ asymptotically controls the number of false positives of a time series of length $n$.

In the online setting, however, the concept of the length of a time series does not exist. Consequently, fixed constants are used for the penalties instead. This means that, unless the errors are bounded, false positives will be observed eventually. In common with Lorden (1971) and Pollak (1985), we suggest choosing $\lambda$ to be as small as possible, to maximise power against anomalies, whilst maintaining the average run length (ARL), the average time between false positives, at an acceptable level. Practical guidance on the choice of $\lambda$ can be taken from Proposition 1, which provides an asymptotic result for the relationship between the log-ARL and $\lambda$, under a certain model form. This relationship is empirically verified for other models using simulations in Sect. 5.

Sequential Collective And Point Anomaly

Given the above solutions to the three identified challenges, we are able to extend CAPA to an online setting. We call the resultant approach Sequential Collective And Point Anomaly (SCAPA). The basic steps of the algorithm are as follows: when an observation comes in, it is used to update the sequential estimates of the typical parameters. The observation is then standardised using the typical mean and variance $(\mu_0, \sigma_0)$, before being passed to the finite horizon dynamic programme. Detailed pseudocode can be found in Algorithm 2 of the supplementary material.

The sequential nature of SCAPA's analysis is displayed in Fig. 3 across three plots, each representing the output of the analysis at different time points. Note how a collective anomaly is detected, initially, as a sequence of point anomalies until the number of observations equals the minimum segment length.

Fig. 3 The evolution in the detection of a collective anomaly with a minimum segment length of $l = 5$. The times shown are a) $t = 100$ just prior to the anomalous observations, b) $t = 104$ where the observations $x_{101:104}$ have been labelled as point anomalies and c) $t = 105$ where the observations $x_{101:105}$ have been labelled as a collective anomaly

4 Theory

We now turn to consider the theoretical properties of SCAPA. In particular, we investigate the average run length (ARL) and the average detection delay (ADD). Here, the ARL corresponds to the expected number of baseline datapoints SCAPA processes before detecting a false positive. Conversely, the ADD corresponds to the expected number of observations between the onset of a collective anomaly and the time at which a collective anomaly is first detected. We will place a particular emphasis on the effects of the maximum segment length, $m$, on the ADD, as the results following from that analysis provide practical guidance on how to choose $m$. Proofs of all propositions presented below may be found in the Appendix.

For simplicity of exposition, we will restrict our attention to the change in mean setting, in which the penalised cost is

$$\sum_{t \notin \cup_i [s_i+1,\, e_i] \cup O} \left( \frac{x_t - \mu_0}{\sigma_0} \right)^2 + \sum_{t \in O} \left[ 0 + \beta_O \right] + \sum_{j=1}^{K} \left[ \sum_{t=s_j+1}^{e_j} \left( \frac{x_t - \bar{x}_{(s_j+1):e_j}}{\sigma_0} \right)^2 + \beta_C \right].$$

Recall the penalty selection approach outlined in Sect. 3.3. In this setting, the ARL of SCAPA can be related to the penalty constant, $\lambda$, via the following result:

Proposition 1 Assume we observe a data sequence with typical mean, $\mu_0$, and typical variance, $\sigma_0^2$, both known. Then the ARL of SCAPA on i.i.d. $N(\mu_0, \sigma_0^2)$-distributed observations $x_1, x_2, \ldots$ satisfies

$$\log(\mathrm{ARL}) \sim \lambda/2$$

as $\lambda \to \infty$.

As a consequence of the above, the probability of false alarm is proportional to $\exp(-\lambda/2)$. As discussed in the previous section, this can be used to inform the choice of penalty in practice if an acceptable probability of false alarm is given. For comparison, we also provide additional simulation results for the log-average run length in a selection of Laplace and t-distributed settings (see Fig. 4).

Fig. 4 Simulation results for the log-average run length in a selection of (a) t-distributed and (b) Laplace distributed noise settings for different values of $\lambda$

We now turn to investigate the effects of the maximum segment length, $m$, on the ADD. To simplify the exposition of these results, we assume that the collective anomaly begins at time $\tau = 0$. Formally, consider the series

$$x_1, x_2, \ldots \overset{\text{i.i.d.}}{\sim} N(\mu, 1) \tag{4.1}$$

and assume that the typical mean, $\mu_0$, is equal to 0 and known. For a maximum segment length $m$, we then define $\mathrm{ADD}_m$ to be the ADD of SCAPA with a maximum segment length $m$. Additionally, we define $\mathrm{ADD}_\infty$ to be the ADD of SCAPA without maximum segment length. The following proposition shows that imposing a maximum segment length does not affect the ADD, provided that the maximum segment length increases at a rate faster than the penalty.

Proposition 2 Let $x_1, x_2, \ldots$ follow the distribution specified in (4.1). Moreover, let the known baseline mean and variance be 0 and 1 respectively. Then, if $m > [\ldots]\,(1 + \epsilon)$ for some $\epsilon > 0$,

$$\mathrm{ADD}_m = \mathrm{ADD}_\infty + o(1)$$

as $\lambda \to \infty$.

Given Proposition 2, it is natural to consider what happens in the converse setting, i.e. what happens if the maximum segment length increases at a slower rate than the penalty.

Proposition 3 Let $x_1, x_2, \ldots$ follow the distribution specified in (4.1). Moreover, let the known typical mean and variance be 0 and 1 respectively. Then, if $1 \le m < \lambda^{1-\epsilon}$ for some $\epsilon > 0$,

$$\log(\mathrm{ADD}_m) \sim \lambda/2$$

as $\lambda \to \infty$.

In other words, the log-ADD has the same exponential rate as the log-ARL on non-anomalous data.

As previously discussed, limits on the number of possible interventions often determine a tolerable probability of false alarm in practice. Proposition 1 therefore provides a mechanism to determine a suitable penalty constant $\lambda$. Further, Propositions 2 and 3 can be used to help inform an appropriate choice of maximum segment length, $m$. Specifically, $m$ should be at least of magnitude $[\ldots]$, where $\mu$ is the smallest change in mean of interest, to ensure power.

5 Simulation study

We now turn to examine the performance of SCAPA in various simulated settings. We start by considering the case where a single collective anomaly is present to evaluate SCAPA via its ARL and ADD performance in Sect. 5.1. The effect of auto-correlation is also examined. This is followed by a comparison with CAPA on time series containing multiple anomalies in Sect. 5.2.

5.1 A single anomaly

Prior to describing our first simulation scenario, we begin by noting that the ARL and ADD are functions of $\beta_C$ and $\beta_O$. Further, as we have seen in equation (3.2), these are a function of a single parameter $\lambda$. The aim of our simulation study, therefore, is to inform the choice of $\lambda$ that gives a suitable ARL/ADD trade-off. In particular, ceteris paribus, a weaker change gives rise to a larger delay than a stronger change. In other words, we must control for the strength of change when investigating the ADD. To do so, we take the definition of signal strength from Fisch et al. (2022a).

For a collective anomaly with mean $\mu$ and variance $\sigma^2$ the strength, $\Delta$, of a change is defined as

$$\Delta = \log\left( 1 + \tfrac{1}{2}\Delta_\sigma^2 + \tfrac{1}{4}\Delta_\mu^2 \right), \qquad \Delta_\mu^2 = \frac{(\mu_0 - \mu)^2}{[\ldots]}, \qquad \Delta_\sigma^2 = [\ldots] + [\ldots] - 2.$$

Here, $\mu_0$ and $\sigma_0$ are the parameters of the typical distribution, while $\Delta_\mu$ and $\Delta_\sigma$ denote the strengths of the change in mean and variance respectively.

To simplify the simulations, we assume that the standard deviation remains unaffected by collective anomalies, i.e. $\sigma = \sigma_0$. Without loss of generality, we then set $\mu_0 = 0$ and $\sigma_0 = 1$. Consequently, the strength of the change only depends on the mean, $\mu$, of the collective anomaly and is given by

$$\Delta = \log\left( 1 + \frac{\mu^2}{4} \right). \tag{5.1}$$

We investigate a number of differing strengths $\Delta = \{0.05, 0.1, 0.2\}$ corresponding to mean changes of $\mu = \{0.45, 0.65, 0.94\}$. For instance, $\mu = 0.65$ gives $\log(1 + 0.65^2/4) \approx 0.10$.

In all the simulations reported below we set the minimum segment length to be $l = 2$, the maximum segment length to be $m = 1000$ and used a burn-in period of $n_0 = 1000$ time points. To estimate the ARL, data from the typical regime was simulated and SCAPA was run until the first anomaly was (erroneously) detected. To estimate the ADD, $n_0$ observations were simulated from the typical regime followed by simulated observations from a distribution with an altered mean. We ran SCAPA on this data and calculated the detection delay as the number of observations after $n_0$ at which the anomaly was detected.

5.1.1 Case 1: i.i.d. Gaussian errors

For our initial simulations, we simulated from the assumed model with standard Gaussian errors. Figure 5 depicts the log-ARL over a range of values for the penalties (3.2), indexed by the parameter $\lambda$, along with a bootstrapped 95% confidence interval. Similarly, Fig. 6 shows the relationship between $\lambda$ and the ADD over a range of different values for the mean change of the collective anomaly.

Fig. 5 The solid line shows the log-ARL for SCAPA as a function of $\lambda$. The grey shaded region is a pointwise 95% bootstrapped confidence interval. Results shown from 500 replications

Fig. 6 The lines show the ADD for SCAPA as a function of $\lambda$ for different strengths of collective anomaly ($\Delta$ = 0.05, 0.1 and 0.2). The grey shaded regions are pointwise 95% bootstrapped confidence intervals. Results shown from 500 replications

Note that the log-ARL increases linearly with $\lambda$. This structure is reminiscent of the theoretical exponential relationship between $\lambda$ and the ARL derived by Cao and Xie (2017), even though these results were derived for known pre- and post-change behaviour.

Similarly, the ADD increases linearly in $\lambda$, as can be seen in Fig. 6. This is consistent with Proposition 2.

5.1.2 Case 2: temporal dependence

Whilst the i.i.d. data setting is appealing theoretically, many observed time series are not independent (in time). Instead many data sequences display serial auto-correlation. To assess the robustness of SCAPA to temporal dependence we simulated an AR(1) error process as the typical distribution, $x_t$, with standard normal errors $\epsilon_t$,

$$x_t = \phi x_{t-1} + \epsilon_t.$$

This process was simulated for a range of values of $\phi$. As can be seen in Fig. 7, the presence of auto-correlation in the residuals leads to the spurious detection of collective anomalies at a higher rate than for independent residuals. This is due to the fact that the cost functions of Sect. 2 assumed i.i.d. data. However, Bardwell et al. (2019) gave some empirical evidence that changepoints could be recovered even when auto-correlation is present by applying a correction or inflation factor to the penalty. This factor is the sum of the auto-correlation function for the residuals from $-\infty$ to $\infty$. This is equal to the long run variance, $(1 + \phi)/(1 - \phi)$, for the AR(1) model. A similar correction exists for MA processes.

Fig. 7 The lines show the log-ARL for SCAPA as a function of $\lambda$ where the simulated time series are AR(1) processes with differing lag-1 auto-correlation ($\phi$ = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.85). The two penalties, $\beta_C(\lambda)$ and $\beta_O(\lambda)$, are the same as in the i.i.d. case (Fig. 5). The grey shaded regions are pointwise 95% bootstrapped confidence intervals. Results shown from 500 replications

Fig. 8 The lines show the ADD for SCAPA as a function of $\lambda$ for different strengths of collective anomaly a) $\Delta$ = 0.05, b) $\Delta$ = 0.1 and c) $\Delta$ = 0.2. In each case the simulated residuals are AR(1) processes with differing lag-1 auto-correlation ($\phi$ = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.85). The two penalties, $\beta_C(\lambda)$ and $\beta_O(\lambda)$, are the same as in the i.i.d. case (Fig. 6). The grey shaded regions are pointwise 95% bootstrapped confidence intervals. Results shown from 500 replications

We repeated the simulations using this correction. The results in Fig. 9 show that the log-ARL of SCAPA with appropriately inflated penalties is almost identical to that of the i.i.d. case. On the other hand, the ADD now depends on the auto-correlation due to the inflated penalty (see Fig. 10).

Fig. 9 The lines show the log-ARL for SCAPA as a function of $\lambda$ where the simulated time series are AR(1) processes with differing lag-1 auto-correlation ($\phi$ = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.85). The two penalties, $\beta_C(\lambda)$ and $\beta_O(\lambda)$, are inflated by a function of $\hat\phi$, the estimate obtained from the burn-in period using a robust M-estimator. The grey shaded regions are pointwise 95% bootstrapped confidence intervals. Results shown from 500 replications

Fig. 10 The lines show the ADD for SCAPA as a function of $\lambda$ for different strengths of collective anomaly a) $\Delta$ = 0.05, b) $\Delta$ = 0.1 and c) $\Delta$ = 0.2. In each case the simulated residuals are AR(1) processes with differing lag-1 auto-correlation ($\phi$ = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.85). The two penalties, $\beta_C(\lambda)$ and $\beta_O(\lambda)$, are inflated by a function of $\hat\phi$, the estimate obtained from the burn-in period using a robust M-estimator. The grey shaded regions are pointwise 95% bootstrapped confidence intervals. Results shown from 500 replications

Performing this correction requires knowledge of the AR(1) parameter $\phi$. In these simulations, the estimate $\hat\phi$ of $\phi$ was obtained from the burn-in period using a robust M-estimator for the lag-1 autocorrelation of Rocke (1996) from the R package robust (Wang et al. 2017).

5.2 Multiple anomalies

A natural comparison to make when investigating an online method is to compare its performance to its offline counterpart. We therefore compare SCAPA and CAPA for the detection of multiple anomalies using ROC curves in this section. In addition to this, we also compare SCAPA to a robust changepoint detection algorithm proposed by Fearnhead and Rigaill (2019), which we refer to as RPOP. We also compare to the MOSUM implementation provided by Meier et al. (2021), an online though not robust changepoint detection algorithm.

To this end, we simulated time series with a total length of 10,000 observations with a number of point and collective anomalies. The lengths of stay for the typical state and for collective anomalies were sampled from a NB(5, 0.01) distribution and a NB(5, 0.03) distribution respectively. Observations in the typical state were sampled from an $N(0, 1)$ distribution, while observations from the $k$th collective anomaly were sampled from an $N(\mu_k, \sigma_k^2)$ distribution, where $\mu_1, \ldots, \mu_K \sim N(0, 2^2)$ and $\sigma_1, \ldots, \sigma_K \sim \Gamma(1, 1)$. Point anomalies occurred in the typical state independently with probability $p = 0.01$ and were drawn from a t-distribution with 2 degrees of freedom.

The ROC curve resulting from this simulation can be found in Fig. 11, alongside an example time series in Fig. 12 shown segmented by both CAPA and SCAPA. As expected, CAPA, which has access to the whole data, outperforms SCAPA. However, the gap in performance accuracy is small, in particular for typical choices of $\lambda$ as described in Sect. 4.

Fig. 11 ROC curves for CAPA, SCAPA, RPOP, and MOSUM from 100 replications. A red dot indicates the behaviour under default parameters. Graphic curtailed at maximum 20 false positives for ease of viewing

Fig. 12 A comparison of a) CAPA to b) SCAPA on an example time series. Segments in red show inferred collective anomalies. Dashed lines below the x-axis show the position of the true collective anomalies in the data

5.3 CUSUM comparison

A natural comparison that can be made to assess SCAPA's simulated performance is with the widely used online changepoint detection method CUSUM (Page 1954). Both methods use a test statistic based on the log-likelihood ratio and can be configured with known typical mean and variance. The difference between the two methods is that in SCAPA, collective and point anomalies are detected jointly with separate penalties whereas CUSUM is not designed to be robust to point anomalies. In our simulations, the CUSUM approach is implemented by setting the penalty for point anomalies in SCAPA to an arbitrarily large value ($\beta_O = 10^{[\ldots]}$) so that no point anomalies are detected.

The data was simulated in a similar way to that of Sect. 5.2 with the difference being that 20% of points in the typical state were point anomalies, simulated from a t-distribution with degrees of freedom $\nu \in \{2, 5, 10\}$. The ROC curve resulting from this simulation can be found in Fig. 13, alongside an example time series in Fig. 14 shown segmented by both methods. Between them, these plots highlight SCAPA's ability to distinguish explicitly between point anomalies and collective anomalies, whereas an approach such as CUSUM suffers. Note in particular that, as expected, CUSUM gives a higher number of false positives than SCAPA when point anomalies are from distributions with heavier tails.

Fig. 13 ROC curves (number of true positives against number of false positives) over 100 replications for a) CUSUM and b) SCAPA. Point anomalies were generated from a t-distribution with varying degrees of freedom ($\nu$ = 2, 5 or 10 respectively)

Fig. 14 A comparison of a) CUSUM to b) SCAPA on an example time series. Segments in red show inferred collective anomalies. Dashed lines below the x-axis show the position of the true collective anomalies in the data

6 Machine temperature data

The Numenta Anomaly Benchmark (NAB) (Lavin and Ahmad 2015; Ahmad et al. 2017) provides a number of data sets that can be used to compare different anomaly detection approaches. The data can be obtained from https://github.com/numenta/NAB.

One example consists of heat sensor data from an internal component of a large industrial machine. The data is displayed in Fig. 15. There are $n = 22{,}695$ observations spanning 2nd December 2013 - 19th February 2014 sampled every five minutes.

Fig. 15 Machine temperature data

Fig. 16 Machine temperature data. The burn-in period is shaded in blue. Anomalies detected by SCAPA and CAPA are shaded in red and green respectively, and brown when they overlap. Dashed vertical lines show the hand labelled anomalies given by an engineer working on the machine

Table 1 Labelled anomalies from the NAB obtained from https://github.com/numenta/NAB/blob/master/labels/combined_windows.json along with the time at which each was detected (final column)

Anomaly | Start time       | End time         | Given reason                | Detection time
1       | 17:50 15/12/2013 | 17:00 17/12/2013 | Planned shutdown            | 16:50 16/12/2013
2       | 14:20 27/01/2014 | 13:30 29/01/2014 | Onset of problem            | 21:25 28/01/2014
3       | 14:55 07/02/2014 | 14:05 09/02/2014 | Catastrophic system failure | 3:15 08/02/2014

Table 2 Collective anomalies found using CAPA

Start time              | End time
11:30:00 GMT 08/12/2013 | 23:05:00 GMT 10/12/2013
23:35:00 GMT 15/12/2013 | 18:40:00 GMT 16/12/2013
11:25:00 GMT 27/01/2014 | 13:50:00 GMT 31/01/2014
09:20:00 GMT 07/02/2014 | 12:05:00 GMT 09/02/2014

[...] be decreased; however, as noted elsewhere in this paper, this would increase the frequency of false alarms.
Lavin and Ahmad (2015) use an initial, or burn-in, period Supplementary Information The online version contains supplemen- to allow their algorithms to learn about the data. In line with tary material available at https://doi.org/10.1007/s11222-022-10112- their approach, we set the burn in period to be the first 15% of the data (2nd December 2013 until the 14th December 2013, Acknowledgements This work was supported by EPSRC grant num- as shown by the blue shaded area in Fig. 16). We used the burn bers EP/N031938/1 (StatScale), EP/R004935/1 (NG-CDI) and EP/L015 in to obtain a robust M-estimator for the lag-1 autocorrelation 692/1 (STOR-i). Fisch also gratefully acknowledges EPSRC (EP/S515 of the observations after standardisation by the sequential 127/1) and British Telecommunications plc (BT) for providing finan- mean and variance estimate. Using the robust M estimator cial support for his PhD via an Industrial CASE award. Finally, the authors thank David Yearling, Trevor Burbridge, Stephen Cassidy, and of Rocke (1996) from the R package robust (Wang et al. Kjeld Jensen in BT Research for helpful discussions while this work 2017), we obtained an autocorrelation estimate φ = 0.974. was being undertaken. In line with the approach taken in Sect. 5.1.2, we therefore set the penalties to: Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adap- tation, distribution and reproduction in any medium or format, as ˆ ˆ 1 + φ 1 + φ long as you give appropriate credit to the original author(s) and the β = 2 × × log(n), β = 2 × × log(n). C O ˆ ˆ source, provide a link to the Creative Commons licence, and indi- 1 − φ 1 − φ cate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, Figure 16 shows the three anomalies SCAPA detected unless indicated otherwise in a credit line to the material. If material shaded in red. 
These corresponded to a set of hand labelled is not included in the article’s Creative Commons licence and your anomalous regions given by an engineer working on the intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copy- machine shown by the dashed vertical lines. The positions right holder. To view a copy of this licence, visit http://creativecomm of these are given in Table 1. It should be noted that the data ons.org/licenses/by/4.0/. labels in the NAB consist of anomalous periods, rather than points. However, all approaches applied to the data as part of the NAB only return points of anomalous behaviour, high- lighting SCAPA’s potential to provide new insights into the References data. The detection of the more subtle second anomaly in a Agamennoni, G., Nieto, J.I., Nebot, E.M.: An Outlier-robust Kalman timely fashion is important as this was claimed in the NAB Filter. In 2011 IEEE International Conference on Robotics and Automation, pages 1551–1558. IEEE (2011) literature to be the cause of the catastrophic system failure Ahmad, S., Lavin, A., Purdy, S., Agha, Z.: Unsupervised Real-time (third anomaly). We can see that the time at which SCAPA Anomaly Detection for Streaming Data. Neurocomputing, 262, first detected it in Table 1. If users of the system deemed 134 – 147. Online Real-Time Learning Strategies for Data Streams this to be too long of a delay the penalties used above could (2017) 123 Statistics and Computing (2022) 32 :55 Page 15 of 15 55 Aston, J.A.D., Kirch, C.: Evaluating Stationarity Via Change-point Lavin, A., Ahmad, S.: Evaluating Real-time Anomaly Detection Alternatives with Applications to Fmri Data. Ann. Appl. Stat. 6(4), Algorithms – the Numenta Anomaly Benchmark. 
IEEE 14th 1906–1948 (2012) International Conference on Machine Learning and Applications Bardwell, L., Fearnhead, P.: Bayesian Detection of Abnormal Segments (ICMLA), 38–44 (2015) in Multiple Time Series. Bayesian Anal. 12(1), 193–218 (2017) Lorden, G.: Procedures for Reacting to a Change in Distribution. Ann. Bardwell, L., Fearnhead, P., Eckley, I.A., Smith, S., Spott, M.: Most Math. Statist. 42(6), 1897–1908 (1971) Recent Changepoint Detection in Panel Data. Technometrics Meier, A., Kirch, C., Cho, H.: MOSUM: A Package for Moving Sums 61(1), 88–98 (2019) in Change-point Analysis. J. Stat. Softw. 97(8), 1–42 (2021) Bezahaf, M., Hernandez, M.P., Bardwell, L., Davies, E., Broadbent, Olshen, A.B., Venkatraman, E.S., Lucito, R., Wigler, M.: Circular M., King, D., Hutchison, D.: Self-generated Intent-based System. Binary Segmentation for the Analysis of Array-based dna Copy In 2019 10th International Conference on Networks of the Future Number Data. Biostatistics 5(4), 557–572 (2004) (NoF), 138–140 (2019) Page, E.S.: Continuous Inspection Schemes. Biometrika 41(1/2), 100– Bruce, L., Jennie, K.: The Cusum Test of Homogeneity with an Appli- 115 (1954) cation in Spontaneous Abortion Epidemiology. Stat. Med. 4(4), Pollak, M.: Optimal Detection of a Change in Distribution. Ann. Statist. 469–488 (1985) 13(1), 206–227 (1985) Cao, Y., Xie, Y.: Robust Sequential Change-point Detection by Convex Rocke, D.M.: Robustness Properties of S-estimators of Multivariate Optimization. 2017 IEEE International Symposium on Informa- Location and Shape in High Dimension. Ann. Stat. 24(3), 1327– tion Theory (ISIT), 1287–1291 (2017) 1345 (1996) Chandola, V., Banerjee, A., Kumar, V.: Anomaly Detection: A Survey. Ruckdeschel, P., Spangl, B., Pupashenko, D.: Robust Kalman Tracking ACM Comput. Surv. 41(3), 15:1-15:58 (2009) and Smoothing with Propagating and Non-propagating Outliers. Chang, G.: Robust Kalman Filtering Based on Mahalanobis Distance Stat. Pap. 
55(1), 93–123 (2014) as Outlier Judging Criterion. J. Geodesy 88(4), 391–401 (2014) Sharia, T.: Efficient On-line Estimation of Autoregressive Parameters. Chen, C., Liu, L.-M.: Joint Estimation of Model Parameters and Out- Math. Methods Statist. 19(2), 163–186 (2010) lier Effects in Time Series. J. Am. Stat. Assoc. 88(421), 284–297 Stoehr, C., Aston, J.A.D., Kirch, C.: Detecting Changes in the Covari- (1993) ance Structure of Functional Time Series with Application to fMRI Eichinger, B., Kirch, C.: A MOSUM Procedure for the Estimation of Data. arXiv e-prints, arXiv:1903.00288 (2019) Multiple Random Change Points. Bernoulli 24(1), 526–564 (2018) Theissler, A.: Detecting Known and Unknown Faults in Automotive Fearnhead, P., Rigaill, G.: Changepoint Detection in the Presence of Systems Using Ensemble-based Anomaly Detection. Knowl.- Outliers. J. Am. Stat. Assoc. 114(525), 169–183 (2019) Based Syst. 123, 163–173 (2017) Ferdousi, Z., Maeda, A.: Unsupervised Outlier Detection in Time Series Tierney, L.: A Space-efficient Recursive Procedure for Estimating a Data. 22nd International Conference on Data Engineering Work- Quantile of an Unknown Distribution. SIAM J. Sci. Stat. Comput. shops (ICDEW’06), x121–x121 (2006) 4(4), 706–711 (1983) Fisch, A.T.M., Eckley, I.A., Fearnhead, P.: A Linear Time Method for Ting, J.-A., Theodorou, E., Schaal, S.: Learning An Outlier-robust the Detection of Point and Collective Anomalies. Stat. Anal. Data Kalman Filter. In European Conference on Machine Learning, Min. (2022a). https://doi.org/10.1002/sam.11586 748–756. Springer (2007) Fisch, A.T.M., Eckley, I.A., Fearnhead, P.: Subset Multivariate Collec- Wang, C., Viswanathan, K., Choudur, L., Talwar, V., Satterfield, W., tive and Point Anomaly Detection. J. Comput. Graph. Stat. 31(2), Schwan, K.: Statistical Techniques for Online Anomaly Detection 574–585 (2022b) in Data Centers. 
In 12th IFIP/IEEE International Symposium on Gut, A., Steinebach, J.: A Two-step Sequential Procedure for Detecting Integrated Network Management (IM 2011) and Workshops, 385– an Epidemic Change. Extremes 8(4), 311–326 (2005) 392 (2011) Iturria, A., Carrasco, J., Charramendieta, S., Conde, A., Herrera, F.: Wang, J., Zamar, R., Marazzi, A., Yohai, V., Salibian-Barrera, M., otsad: A Package for Online Time-series Anomaly Detectors. Neu- Maronna, R., Zivot, E., Rocke, D., Martin, D., Maechler, M., rocomputing 374, 49–53 (2020) Konis., K.: robust: Port of the S+ “Robust Library”. R package Jain, R., Chlamtac, I.: The p Algorithm for Dynamic Calculation of version 0.4-18 (2017) Quantiles and Histograms without Storing Observations. Com- Yao, Q.: Tests for Change-points with Epidemic Alternatives. mun. ACM 28(10), 1076–1085 (1985) Biometrika 80(1), 179–191 (1993) Jeng, X.J., Cai, T.T., Li, H.: Simultaneous Discovery of Rare and Com- Zhao, H., Liu, H., Hu, W., Yan, X.: Anomaly Detection and Fault mon Segment Variants. Biometrika 100(1), 157–172 (2013) Analysis of Wind Turbine Components Based on Deep Learning Justusson, B.I.: Median Filtering: Statistical Properties, 161–196. Network. Renewable Energy 127, 825–834 (2018) Springer, Berlin Heidelberg, Berlin, Heidelberg (1981) Killick, R., Fearnhead, P., Eckley, I.A.: Optimal Detection of Change- Publisher’s Note Springer Nature remains neutral with regard to juris- points with a Linear Computational Cost. J. Am. Stat. Assoc. dictional claims in published maps and institutional affiliations. 107(500), 1590–1598 (2012) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Statistics and Computing Springer Journals
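
The penalty inflation used in Sects. 5.1.2 and 6 can be illustrated numerically. What follows is a minimal Python sketch under stated assumptions, not the authors' implementation. All function names are hypothetical. The typical process is assumed to be AR(1), so that the sum of the auto-correlation function is (1 + φ)/(1 − φ), and the lag-1 autocorrelation is estimated with a plain moment estimator, used here only as a stand-in for the robust M-estimator of Rocke (1996) that the paper takes from the R package robust.

```python
import math
import random


def ar1_inflation_factor(phi):
    """Sum of the AR(1) autocorrelation function over all lags: (1 + phi) / (1 - phi)."""
    if not -1.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between -1 and 1")
    return (1.0 + phi) / (1.0 - phi)


def truncated_acf_sum(phi, max_lag):
    """Numerical check of the closed form: sum of phi**|k| for |k| <= max_lag."""
    return 1.0 + 2.0 * sum(phi ** k for k in range(1, max_lag + 1))


def inflated_penalties(phi_hat, n):
    """Penalties (beta_C, beta_O) = 2 * inflation factor * log(n), as stated in Sect. 6."""
    beta = 2.0 * ar1_inflation_factor(phi_hat) * math.log(n)
    return beta, beta


def simulate_ar1(phi, n, seed=0):
    """Simulate x_t = phi * x_{t-1} + eps_t with standard normal eps_t, started at 0."""
    rng = random.Random(seed)
    series, previous = [], 0.0
    for _ in range(n):
        previous = phi * previous + rng.gauss(0.0, 1.0)
        series.append(previous)
    return series


def lag1_autocorrelation(series):
    """Plain moment estimate of the lag-1 autocorrelation (assumption: NOT robust)."""
    mean = sum(series) / len(series)
    centred = [value - mean for value in series]
    numerator = sum(a * b for a, b in zip(centred[:-1], centred[1:]))
    denominator = sum(value * value for value in centred)
    return numerator / denominator


if __name__ == "__main__":
    # Estimate phi on simulated AR(1) data, then inflate the penalties.
    phi_hat = lag1_autocorrelation(simulate_ar1(phi=0.5, n=20000))
    print("estimated phi:", round(phi_hat, 3))
    # Values reported for the machine temperature data in Sect. 6.
    print("inflation factor:", ar1_inflation_factor(0.974))
    print("penalties (beta_C, beta_O):", inflated_penalties(0.974, 22695))
```

With the estimate of 0.974 reported in Sect. 6, the inflation factor is 1.974/0.026, roughly 76, so both penalties are about 76 times larger than their i.i.d. counterparts. This is consistent with the observation in Sect. 5.1.2 that the ADD depends on the auto-correlation once the penalty is inflated.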

Real time anomaly detection and categorisation


References (39)

Publisher: Springer Journals
Copyright: © The Author(s) 2022
ISSN: 0960-3174
eISSN: 1573-1375
DOI: 10.1007/s11222-022-10112-3

Abstract

The ability to quickly and accurately detect anomalous structure within data sequences is an inference challenge of growing importance. This work extends recently proposed post-hoc (offline) anomaly detection methodology to the sequential setting. The resultant procedure is capable of real-time analysis and categorisation between baseline and two forms of anomalous structure: point and collective anomalies. Various theoretical properties of the procedure are derived. These, together with an extensive simulation study, highlight that the average run length to false alarm and the average detection delay of the proposed online algorithm are very close to that of the offline version. Experiments on simulated and real data are provided to demonstrate the benefits of the proposed method. Keywords Anomaly detection · SCAPA · Streaming data · Real time 1 Introduction which are not necessarily anomalous when compared to either their local or the global data context, but together form The detection of anomalies in time series has received con- an anomalous pattern. Figure 1 provides several examples. In siderable attention in both the statistics (Chen and Liu 1993) this paper, collective anomalies and epidemic changepoints and machine learning (Chandola et al. 2009) literature. This are used interchangeably. is no surprise given the broad range of applications, from The epidemic changepoint model assumes that data fol- fraud detection (Ferdousi and Maeda 2006) to fault detec- lows some baseline, or typical distribution, everywhere tion (Theissler 2017; Zhao et al. 2018), that this area lends except for some anomalous time windows during which it fol- itself to. In recent years, the proliferation of sensors within lows another distribution. 
The detection of epidemic changes the internet of things (IoT) has led to the emergence of real in mean was first studied by Bruce and Jennie (1985) with time detection of anomalies in streaming (high frequency) applications to epidemiology. Since then, research in this data as an important new challenge. area has been driven by various applications including the Anomalies can be classified in a number of different ways detection of copy number variants in DNA (Bardwell and (Chandola et al. 2009). In this work, following the defini- Fearnhead 2017; Jeng et al. 2013; Olshen et al. 2004) and tions of Fisch et al. (2022a), we distinguish between point the analysis of brain imaging data (Aston and Kirch 2012; and collective anomalies. Point anomalies, also known as Stoehr et al. 2019). In particular, much of the pertinent lit- outliers, global anomalies or contextual anomalies (Chan- erature has concentrated on the epidemic change in mean dola et al. 2009), are single observations that are anomalous setting. See Yao (1993) and Gut and Steinebach (2005)for with regards to their local or global data context. Conversely, details. collective anomalies, also known as abnormal regions (Bard- More recently, the detection of joint epidemic changes well and Fearnhead 2017), or epidemic changepoints (Bruce in mean and variance as well as point anomalies was con- and Jennie 1985), are sequences of contiguous observations sidered by Fisch et al. (2022a). In parallel, there has also been some work on detecting anomalies within the online B Idris A. Eckley setting. Gut and Steinebach (2005) consider the problem of i.eckley@lancaster.ac.uk detecting epidemic changepoints sequentially while Wang et al. (2011) and Ahmad et al. (2017) propose methods STOR-i Centre for Doctoral Training, Lancaster University, LA1 4YF Lancaster, UK for the online detection of point anomalies. 
Other recent contributions include the MOSUM work by Kirch and col- Department of Mathematics and Statistics, Lancaster University, LA1 4YF Lancaster, UK 123 55 Page 2 of 15 Statistics and Computing (2022) 32 :55 (a) (b) (c) Fig. 1 Time series containing collective and point anomalies. Typical data shown in grey, anomalous segments in red and point anomalies shown in blue laborators (see, e.g., Eichinger and Kirch (2018)) that permits 2 Background the online detection of changepoints, and is thus able to iden- tify collective anomalies. Various outlier-robust Kalman filter CAPA, introduced by Fisch et al. (2022a), seeks to jointly approaches have also been proposed in recent years. See for detect and distinguish between point and collective anoma- example Ting et al. (2007), Agamennoni et al. (2011), Ruck- lies within an offline, univariate time series setting. The deschel et al. (2014), Chang (2014) and references therein. heart of the approach is founded upon an epidemic change- Fearnhead and Rigaill (2019) have introduced a computation- point model. To this end, consider a stochastic process ally efficient changepoint detection approach that is robust x ∼ D(θ (t )), drawn from some distribution, D, indexed to the presence of anomalies. Similarly the OSTAD pack- by a set of model parameters, θ(t ). Collective anomalies can age, developed by Iturria et al. (2020), can be used online to then be modelled as epidemic changes of the set of param- identify contextual anomalies. eters θ(t ). I.e., time windows in which θ(t ) deviates from The main contribution of this paper is to extend the offline the typical, and potentially unknown, set of parameters θ . Collective And Point Anomaly (CAPA) algorithm of Fisch Formally, et al. (2022a) to the online setting to detect both collective and θ s < t ≤ e point anomalies in streaming data, formalising early heuris- ⎪ 1 1 1 tic ideas appearing in Bezahaf et al. (2019). We call this θ(t ) = algorithm Sequential-CAPA (SCAPA). 
We explore various θ s < t ≤ e ⎪ K K K practical aspects of working in this online setting, including θ otherwise. (i) computational and storage costs, (ii) the typical (baseline) parameter estimation and (iii) penalty selection. In addition, Here K denotes the number of collective anomalies, while we provide various theoretical and empirical guarantees to s , e , and θ correspond to the start point, end point and the i i i the resulting costs and accuracy trade-offs that arise from our unknown parameter(s) of the ith collective anomaly respec- focus on the online setting. tively. The article is organised as follows. In Sect. 2 we introduce The number and locations of collective anomalies are esti- the literature on offline detection of anomalous time series mated by choosing K ,(s , e ),...,(s , e ), and θ such that 1 1 k K 0 regions, particularly focusing on the recently proposed CAPA they minimise the penalised cost approach. Sect. 3 proceeds to extend this methodology to ⎡ ⎛ ⎞ ⎤ the online setting, introducing the Sequential Collective and Point Anomaly (SCAPA) algorithm. Theoretical properties ⎣ ⎝ ⎠ ⎦ C(x ,θ )+ min C(x ,θ ) +β . t 0 t j C of the proposed methodology are investigated in Sect. 4. Fur- θ t ∈∪[ / s +1,e ] j =1 t =s +1 i i j ther results, together with a set of simulation studies is given (2.1) in Sect. 5, indicating how these can be used to inform prac- titioners on how to select the hyper-parameters of SCAPA. C(·, ·) is a cost function, e.g., twice the negative log- Finally, we apply SCAPA to the monitoring of a sensor on a likelihood, and β is a penalty term for introducing a publically available, industrial machine-level data in Sect. 6. collective anomaly, which seeks to prevent overfitting. A All proofs can be found in the supplementary material. minimum segment length, l, can be imposed by adding the constraint e − s ≥ l for k = 1, 2,..., K , if collective k k anomalies of interest are assumed to be of length at least l ≥ 1. 
123 Statistics and Computing (2022) 32 :55 Page 3 of 15 55 Minimising the cost function (2.1) exactly by solving a with respect to K ,(s , e ), ...,(s , e ), and O, subject to 1 1 k K dynamic programme like the PELT method (Killick et al. the constraint e − s ≥ l ≥ 2for k = 1, 2,..., K . Here, k k 2012) is not possible. This is because the parameter of the β corresponds to a penalty for a point anomaly. typical distribution, θ , is shared across segments, and intro- The CAPA algorithm then minimises the cost in (2.3)by duces dependence. Fisch et al. (2022a) suggest removing this solving the dynamic programme dependence in θ by obtaining a robust estimate θ over the 0 0 whole data and then minimising x −ˆ μ t 0 C (t ) = min C (t − 1) + , C (t − 1) + log ⎡ ⎛ ⎞ ⎤ σ ˆ ˆ ⎣ ⎝ ⎠ ⎦ 2 C(x , θ )+ min C(x ,θ ) +β , t 0 t j C × x −ˆ μ + 1 + β , min C (k) + (t − k) t 0 O 0≤k<t −l t ∈∪[ / s +1,e ] j =1 t =s +1 i i j (2.2) 2 (x −¯x ) i (k+1):t i =k+1 log + 1 + β , (t − k) as an approximation to (2.1) over just the number and loca- tion of collective anomalies. The main focus of Fisch et al. taking C (0) = 0. (2022a) was on the case where anomalies are characterised In practice, as is common in many time series settings, by an atypical mean and or variance. In this case, the authors some form of pre-processing of the series may be required to suggest minimising ensure it is of a suitable form for the CAPA framework. For example, some form of deseasonalisation may be appropri- x −ˆ μ t 0 log(σ ˆ ) + ate. σ ˆ t ∈∪[ / s +1,e ] i i ⎡ ⎛ ⎞ (x −¯x ) t (s +1):e j j t =s +1 ⎣ ⎝ ⎠ 3 Sequential CAPA + (e − s ) log j j (e − s ) j j j =1 We now introduce our Sequential CAPA procedure. In extending CAPA to the online setting three main challenges +1 + β , arise. Specifically, any approach developed should be mind- ful of the following: (i) that the computational and storage cost of the dynamic programme increase with time; (ii) subject to a minimum segment length l of at least 2. 
The above the typical (baseline) parameters have to be learned online expression arises from setting the cost function to twice the and (iii) penalty selection. We address each of these three negative log-likelihood of the Gaussian. The robust estimates challenges in turn, proposing solutions in the following sec- for mean and variance, μ ˆ and σ ˆ , can be obtained from the 0 0 tions, prior to formally introducing the SCAPA algorithm in median and the inter-quartile range. Sect. 3.3. The main weakness of the above penalised cost is that point anomalies will be fitted as collective anomalies in a 3.1 Increasing Computational and Storage Cost segment of length l. To remedy this, point anomalies are mod- elled as epidemic changes of length one in variance (only). As noted in Sect. 2, CAPA infers collective and point anoma- The set of point anomalies is denoted as O. To infer both lies by solving a set of dynamic programme recursions. collective and point anomalies we minimise However both the computational cost of each recursion, and the storage cost, increase linearly in the total number of obser- x −ˆ μ t 0 vations. This is unsuitable for the online setting in which both log(σ ˆ ) + σ ˆ storage and computational resources are finite. t ∈∪[ / s +1,e ]∪O i i In practice, this problem can be surmounted by imposing + log((x −ˆ μ ) ) + 1 + β t 0 O a maximum length m for collective anomalies. This can be t ∈O ⎡ ⎛ ⎞ achieved by adding the set of constraints (x −¯x ) t (s +1):e j j t =s +1 ⎣ ⎝ ⎠ + (e − s ) log j j e − s ≤ m ∀i = 1, 2,..., K (3.1) i i (e − s ) j j j =1 to the optimisation problem in equation (2.3). 
The resulting + 1 +β , (2.3) problem can then be solved using the following dynamic programme 123 55 Page 4 of 15 Statistics and Computing (2022) 32 :55 these estimates tend to be considerably more accurate than x −ˆ μ t 0 C (t ) = min C (t − 1) + , C (t − 1) those of other commonly used methods such as the quan- σ ˆ tile filter (Justusson 1981) and the p -algorithm (Jain and + log x −ˆ μ + 1 + β , Chlamtac 1985). This is due to the fact that the quantile filter t 0 O is not consistent, and that the p -algorithm is not robust with min (C (k) + (t − k) respect to outliers, thus losing a critical property of quantile t −m≤k<t −l estimators. (x −¯x ) i (k+1):t i =k+1 Pseudo-code for the SA-based method is given in Algo- log + 1 +β . (t − k) rithm 1. Using a burn in period to stabilise the quantile estimates is recommended, as even the exact order statistics As a consequence of restriction (3.1), each recursion only take some time to initially converge. SA-based methods can requires a finite number of calculations. Moreover, only a also be used to calculate other important statistics in an online finite number of the optimal costs, C (t ), need to be stored fashion. For example, Sharia (2010) applied SA-techniques in memory. The practical implications of this additional con- to learn auto-regressive parameters sequentially. Such esti- straint are likely to be limited. Within this setting collective mators can be used to inflate the penalties used to account anomalies encompassing fewer than m observations will be for deviations from the i.i.d. assumptions. This is discussed detected as before. However, for those scenarios where an in more detail in Sect. 5. anomaly encompasses more than m observations, these will be fitted as a succession of collective anomalies each of length 3.3 Penalty selection less than m, provided that their signal strength (cf Sect. 5.1 for a definition) is large enough. 
As one might anticipate, We now turn to the important question of penalty selection. within this setting long anomalous segments with low signal In the offline setting, penalties are typically chosen to control strength would not be detectable any more as a result of the false positives under the null hypothesis. For example, Fisch approximation. et al. (2022a) suggested using penalties 3.2 Sequential estimation of the typical parameters β (a,λ) = 2 1 + λ + 2λ ,β (λ) = 2λ, (3.2) C O a − 1 As described in Sect. 2, the dynamic programme used by indexed by a single parameter, λ, for CAPA when consi- CAPA requires robust estimates of the set of typical param- dering the change in mean and variance setting. Here, the eters θ = (μ ,σ ). Fisch et al. (2022a) estimate μ and 0 0 0 0 penalty for collective anomalies, β , depends on the length a σ on the full data using the median and inter-quartile range of the putative collective anomaly. The motivation for these respectively. In an online setting, however, these quantiles penalties is to ensure that the estimates for the number of have to be learnt as the data is observed. collective anomalies and the set of point anomalies, K and A range of methods have been proposed that aim to esti- O, satisfy mate the cumulative distribution function (CDF) of the data sequentially and use it to estimate quantiles. For example, −λ −λ 2 ˆ ˆ Pr(K = 0, O =∅) ≥ 1 − C ne − C (ne ) , (3.3) 1 2 Tierney (1983) proposed a method based on techniques from Stochastic Approximation (SA) to estimate the αth quantile under the null hypothesis that no point or collective anomaly x of an unknown distribution function. Moreover, Tierney (α) is present in the data. Consequently, setting λ = log(n) (1983) also established that, in the i.i.d. setting, the resulting asymptotically controls the number of false positives of a sequential estimates xˆ → x almost surely as the num- (α),n (α) time series of length n. ber of observations n →∞. 
Under the same assumptions, In the online setting, however, the concept of the length they also showed that n(xˆ − x ) converges in distri- (α),n (α) of a time series does not exist. Consequently, fixed constants bution to a Normal distribution. These consistency results are used for the penalties instead. This means that, unless the are important for an online implementation of CAPA, as errors are bounded, false positives will be observed eventu- Fisch et al. (2022a) showed that the consistency of CAPA ally. In common with Lorden (1971) and Pollak (1985), we requires the robustly estimated mean and variance to be suggest choosing λ to be as small as possible, to maximise log(n) within O of the true typical mean and variance. power against anomalies, whilst maintaining the average run The memory required to obtain the SA-estimate is finite length (ARL), the average time between false positives, at an and small. Moreover, the standard errors of the SA-estimate acceptable level. Practical guidance on the choice of λ can and sample quantiles are close even for relatively small sam- be taken from Proposition 1, which provides an asymptotic ple sizes, as can be seen from Fig. 2. Further, we note that result for the relationship between the log-ARL and λ, under 123 Statistics and Computing (2022) 32 :55 Page 5 of 15 55 Fig. 2 a) Example time series with collective and point anomaliesaswellasthe b) median and c) IQR estimated sequentially over time using different methods: The quantile filter by Justusson (1981) (Filter), the p -method of Jain and Chlamtac (1985)(P squared) and the Stochastic Approximation based method by Tierney (1983)(SA) 0 1000 2000 3000 4000 5000 Time (a) Example time series Filter Exact P squared SA 0 1000 2000 3000 4000 5000 Time (b) Sequentially estimated median Filter Exact P squared SA 0 1000 2000 3000 4000 5000 Time (c) Sequentially estimated IQR a certain model form. 
This relationship is empirically verified for other models using simulations in Sect. 5.

Sequential Collective And Point Anomaly

Given the above solutions to the three identified challenges, we are able to extend CAPA to an online setting. We call the resultant approach Sequential Collective And Point Anomaly (SCAPA). The basic steps of the algorithm are as follows: when an observation comes in, it is used to update the sequential estimates of the typical parameters. The observation is then standardised using the typical mean and variance $(\mu_0, \sigma_0^2)$, before being passed to the finite horizon dynamic programme. Detailed pseudocode can be found in Algorithm 2 of the supplementary material.

The sequential nature of SCAPA's analysis is displayed in Fig. 3 across three plots, each representing the output of the analysis at different time points. Note how a collective anomaly is detected, initially, as a sequence of point anomalies until the number of observations equals the minimum segment length.

Fig. 3 The evolution in the detection of a collective anomaly with a minimum segment length of l = 5. The times shown are a) t = 100, just prior to the anomalous observations, b) t = 104, where the observations $x_{101:104}$ have been labelled as point anomalies, and c) t = 105, where the observations $x_{101:105}$ have been labelled as a collective anomaly.

4 Theory

We now turn to consider the theoretical properties of SCAPA. In particular, we investigate the average run length (ARL) and the average detection delay (ADD). Here, the ARL corresponds to the expected number of baseline datapoints SCAPA processes before detecting a false positive. Conversely, the ADD corresponds to the expected number of observations between the onset of a collective anomaly and the time at which a collective anomaly is first detected. We will place a particular emphasis on the effects of the maximum segment length, m, on the ADD, as the results following from that analysis provide practical guidance on how to choose m. Proofs of all propositions presented below may be found in the Appendix.

For simplicity of exposition, we will restrict our attention to the change in mean setting, in which the penalised cost is

$$\sum_{t \notin \cup_i [s_i+1,\, e_i] \cup O} \left(\frac{x_t - \mu_0}{\sigma_0}\right)^2 + \sum_{t \in O} \left[0 + \beta_O\right] + \sum_{j=1}^{K} \left[\sum_{t=s_j+1}^{e_j} \left(\frac{x_t - \bar{x}_{(s_j+1):e_j}}{\sigma_0}\right)^2 + \beta_C\right].$$

Recalling the penalty selection approach outlined in Sect. 3.3, the ARL of SCAPA in this setting can be related to the penalty constant, λ, via the following result:

Proposition 1 Assume we observe a data sequence with typical mean, $\mu_0$, and typical variance, $\sigma_0^2$, both known. Then the ARL of SCAPA on i.i.d. $N(\mu_0, \sigma_0^2)$-distributed observations $x_1, x_2, \ldots$ satisfies

$$\log(\mathrm{ARL}) \sim \lambda/2$$

as λ → ∞.

As a consequence of the above, the probability of false alarm is proportional to exp(−λ/2). As discussed in the previous section, this can be used to inform the choice of penalty in practice if an acceptable probability of false alarm is given. For comparison, we also provide additional simulation results for the log-average run length in a selection of Laplace and t-distributed settings (see Fig. 4).

Fig. 4 Simulation results for the log-average run length in a selection of (a) t-distributed and (b) Laplace distributed noise settings for different values of λ. The legend values are ν = 3, 5, 10, 15 in (a) and s = 1, 2, 4, 10 in (b).

We now turn to investigate the effects of the maximum segment length, m, on the ADD. To simplify the exposition of these results, we assume that the collective anomaly begins at time τ = 0. Formally, consider the series

$$x_1, x_2, \ldots \overset{\text{i.i.d.}}{\sim} N(\mu, 1) \qquad (4.1)$$

and assume that the typical mean, $\mu_0$, is equal to 0 and known. For a maximum segment length m, we then define $\mathrm{ADD}_m$ to be the ADD of SCAPA with a maximum segment length m. Additionally, we define $\mathrm{ADD}_\infty$ to be the ADD of SCAPA without maximum segment length. The following proposition shows that imposing a maximum segment length does not affect the ADD, provided that the maximum segment length increases at a rate faster than the penalty.

Proposition 2 Let $x_1, x_2, \ldots$ follow the distribution specified in (4.1). Moreover, let the known baseline mean and variance be 0 and 1 respectively. Then, if $m > \frac{\lambda}{\mu^2}(1 + \epsilon)$ for some $\epsilon > 0$,

$$\mathrm{ADD}_m = \mathrm{ADD}_\infty + o(1)$$

as λ → ∞.

Given Proposition 2, it is natural to consider what happens in the converse setting, that is, when the maximum segment length increases at a slower rate than the penalty.

Proposition 3 Let $x_1, x_2, \ldots$ follow the distribution specified in (4.1). Moreover, let the known typical mean and variance be 0 and 1 respectively. Then, if $1 \le m < \lambda^{1-\epsilon}$ for some $\epsilon > 0$,

$$\log(\mathrm{ADD}_m) \sim \lambda/2$$

as λ → ∞.

In other words, the log-ADD has the same exponential rate as the log-ARL on non-anomalous data.

As previously discussed, limits on the number of possible interventions often determine a tolerable probability of false alarm in practice. Proposition 1 therefore provides a mechanism to determine a suitable penalty constant λ. Further, Propositions 2 and 3 can be used to help inform an appropriate choice of maximum segment length, m. Specifically, m should be at least of magnitude $\lambda/\mu^2$, where μ is the smallest change in mean of interest, to ensure power.

5 Simulation study

We now turn to examine the performance of SCAPA in various simulated settings. We start by considering the case where a single collective anomaly is present to evaluate SCAPA via its ARL and ADD performance in Sect. 5.1. The effect of auto-correlation is also examined. This is followed by a comparison with CAPA on time series containing multiple anomalies in Sect. 5.2.

5.1 A single anomaly

Prior to describing our first simulation scenario, we begin by noting that the ARL and ADD are functions of $\beta_C$ and $\beta_O$. Further, as we have seen in equation (3.2), these are a function of a single parameter λ. The aim of our simulation study, therefore, is to inform the choice of λ that gives a suitable ARL/ADD trade off. In particular, ceteris paribus, a weaker change gives rise to a larger delay than a stronger change. In other words, we must control for the strength of change when investigating the ADD. To do so, we take the definition of signal strength from Fisch et al. (2022a). For a collective anomaly with mean μ and variance $\sigma^2$, the strength, Δ, of a change is defined through the quantities $\Delta_\mu$ and $\Delta_\sigma$, which denote the strengths of the change in mean and variance respectively, relative to the parameters $\mu_0$ and $\sigma_0$ of the typical distribution.

To simplify the simulations, we assume that the standard deviation remains unaffected by collective anomalies, i.e. $\sigma = \sigma_0$. Without loss of generality, we then set $\mu_0 = 0$ and $\sigma_0 = 1$. Consequently, the strength of the change only depends on the mean, μ, of the collective anomaly and is given by

$$\Delta = \log\left(1 + \frac{\mu^2}{4}\right). \qquad (5.1)$$

We investigate a number of differing strengths Δ = {0.05, 0.1, 0.2}, corresponding to mean changes of μ = {0.45, 0.65, 0.94}.

In all the simulations reported below we set the minimum segment length to be l = 2, the maximum segment length to be m = 1000 and used a burn-in period of $n_0 = 1000$ time points. To estimate the ARL, data from the typical regime was simulated and SCAPA was run until the first anomaly was (erroneously) detected. To estimate the ADD, $n_0$ observations were simulated from the typical regime, followed by simulated observations from a distribution with an altered mean. We ran SCAPA on this data and calculated the detection delay as the number of observations after $n_0$ at which the anomaly was detected.

Fig. 5 The solid line shows the log-ARL for SCAPA as a function of λ. The grey shaded region is a pointwise 95% bootstrapped confidence interval. Results shown from 500 replications.

Fig. 6 The lines show the ADD for SCAPA as a function of λ for different strengths of collective anomaly (Δ = 0.05, 0.1 and 0.2). The grey shaded regions are pointwise 95% bootstrapped confidence intervals. Results shown from 500 replications.

5.1.1 Case 1: IID gaussian errors

For our initial simulations, we simulated from the assumed model with standard Gaussian errors. Figure 5 depicts the log-ARL over a range of values for the penalties (3.2), indexed by the parameter λ, along with a bootstrapped 95% confidence interval. Similarly, Fig. 6 shows the relationship between λ and the ADD over a range of different values for the mean change of the collective anomaly.

Note that the log-ARL increases linearly with λ. This structure is reminiscent of the theoretical exponential relationship between λ and the ARL derived by Cao and Xie (2017), even though these results were derived for known pre and post change behaviour. Similarly, the ADD increases linearly in λ, as can be seen in Fig. 6. This is consistent with Proposition 2.

Fig. 7 The lines show the log-ARL for SCAPA as a function of λ where the simulated time series are AR(1) processes with differing lag-1 auto-correlation (φ = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.85). The two penalties, $\beta_C(\lambda)$ and $\beta_O(\lambda)$, are the same as in the i.i.d. case (Fig. 5). The grey shaded regions are pointwise 95% bootstrapped confidence intervals. Results shown from 500 replications.

Fig. 8 The lines show the ADD for SCAPA as a function of λ for different strengths of collective anomaly a) Δ = 0.05, b) Δ = 0.1 and c) Δ = 0.2. In each case the simulated residuals are AR(1) processes with differing lag-1 auto-correlation (φ = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.85). The two penalties, $\beta_C(\lambda)$ and $\beta_O(\lambda)$, are the same as in the i.i.d. case (Fig. 6). The grey shaded regions are pointwise 95% bootstrapped confidence intervals. Results shown from 500 replications.

Fig. 9 The lines show the log-ARL for SCAPA as a function of λ where the simulated time series are AR(1) processes with differing lag-1 auto-correlation (φ = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.85). The two penalties, $\beta_C(\lambda)$ and $\beta_O(\lambda)$, are inflated by a function of $\hat{\phi}$, the estimate obtained from the burn-in period using a robust M-estimator. The grey shaded regions are pointwise 95% bootstrapped confidence intervals. Results shown from 500 replications.

Fig. 10 The lines show the ADD for SCAPA as a function of λ for different strengths of collective anomaly a) Δ = 0.05, b) Δ = 0.1 and c) Δ = 0.2. In each case the simulated residuals are AR(1) processes with differing lag-1 auto-correlation (φ = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.85). The two penalties, $\beta_C(\lambda)$ and $\beta_O(\lambda)$, are inflated by a function of $\hat{\phi}$, the estimate obtained from the burn-in period using a robust M-estimator. The grey shaded regions are pointwise 95% bootstrapped confidence intervals. Results shown from 500 replications.
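Several of the closed-form relationships used in Sects. 4 to 6 are easy to check numerically. The following is a minimal Python sketch under stated assumptions: the function names are illustrative and not part of any SCAPA implementation, logarithms are taken to be natural, and the rule for λ simply inverts the asymptotic relation log(ARL) ∼ λ/2 of Proposition 1, so it should be read as a first-order heuristic rather than an exact calibration.

```python
import math


def penalty_constant_for_arl(target_arl):
    """Heuristic lambda for a target average run length.

    Inverts the asymptotic relation log(ARL) ~ lambda / 2; this is only a
    first-order guide, since the relation holds as lambda grows.
    """
    return 2.0 * math.log(target_arl)


def strength_for_mean_change(mu):
    """Signal strength (5.1) for a mean change mu, with mu0 = 0 and sigma0 = 1."""
    return math.log(1.0 + mu ** 2 / 4.0)


def mean_change_for_strength(delta):
    """Inverse of (5.1): the mean change that gives signal strength delta."""
    return 2.0 * math.sqrt(math.exp(delta) - 1.0)


def ar1_inflation(phi):
    """Penalty inflation factor (1 + phi) / (1 - phi) for AR(1) residuals."""
    if not -1.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between -1 and 1")
    return (1.0 + phi) / (1.0 - phi)


if __name__ == "__main__":
    for delta in (0.05, 0.1, 0.2):
        print(delta, round(mean_change_for_strength(delta), 2))
    print("lambda for ARL of 10^4:", penalty_constant_for_arl(1e4))
    print("inflation at phi = 0.974:", ar1_inflation(0.974))
```

Rounded to two decimals, `mean_change_for_strength` reproduces the mean changes 0.45, 0.65 and 0.94 quoted in Sect. 5.1 for Δ = 0.05, 0.1 and 0.2. The inflation factor is the one applied to the penalties under temporal dependence in Sect. 5.1.2; for the autocorrelation estimate of 0.974 reported in Sect. 6 it is roughly 75.9, which shows how strongly high autocorrelation raises the penalties.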
5.1.2 Case 2: temporal dependence

Whilst the i.i.d. data setting is appealing theoretically, many observed time series are not independent (in time). Instead many data sequences display serial auto-correlation. To assess the robustness of SCAPA to temporal dependence we simulated an AR(1) error process as the typical distribution, $x_t$, with standard normal errors $\epsilon_t$,

$$x_t = \phi x_{t-1} + \epsilon_t.$$

This process was simulated for a range of values of φ. As can be seen in Fig. 7, the presence of auto-correlation in the residuals leads to the spurious detection of collective anomalies at a higher rate than for independent residuals. This is due to the fact that the cost functions of Sect. 2 assumed i.i.d. data. However, Bardwell et al. (2019) gave some empirical evidence that changepoints could be recovered even when auto-correlation is present by applying a correction or inflation factor to the penalty. This factor is the sum of the auto-correlation function for the residuals from −∞ to ∞. This is equal to the long run variance, $(1 + \phi)/(1 - \phi)$, for the AR(1) model. A similar correction exists for MA processes.

We repeated the simulations using this correction. The results in Fig. 9 show that the log-ARL of SCAPA with appropriately inflated penalties is almost identical to that of the i.i.d. case. On the other hand, the ADD now depends on the auto-correlation due to the inflated penalty (see Fig. 10).

Performing this correction requires knowledge of the AR(1) parameter φ. In these simulations, the estimate $\hat{\phi}$ of φ was obtained from the burn-in period using the robust M-estimator for the lag-1 autocorrelation of Rocke (1996) from the R package robust (Wang et al. 2017).

5.2 Multiple anomalies

A natural comparison to make when investigating an online method is to compare its performance to its offline counterpart. We therefore compare SCAPA and CAPA for the detection of multiple anomalies using ROC curves in this section. In addition to this, we also compare SCAPA to a robust changepoint detection algorithm proposed by Fearnhead and Rigaill (2019), which we refer to as RPOP. We also compare to the MOSUM implementation provided by Meier et al. (2021), an online though not robust changepoint detection algorithm.

To this end, we simulated time series with a total length of 10,000 observations with a number of point and collective anomalies. The lengths of stay for the typical state and for collective anomalies were sampled from a NB(5, 0.01) distribution and a NB(5, 0.03) distribution respectively. Observations in the typical state were sampled from an $N(0, 1)$ distribution, while observations from the kth collective anomaly were sampled from an $N(\mu_k, \sigma_k^2)$ distribution, where $\mu_1, \ldots, \mu_K \sim N(0, 2^2)$ and $\sigma_1, \ldots, \sigma_K \sim \Gamma(1, 1)$. Point anomalies occurred in the typical state independently with probability p = 0.01 and were drawn from a t-distribution with 2 degrees of freedom.

The ROC curve resulting from this simulation can be found in Fig. 11, alongside an example time series in Fig. 12 shown segmented by both CAPA and SCAPA. As expected, CAPA, which has access to the whole data, outperforms SCAPA. However, the gap in performance accuracy is small, in particular for typical choices of λ as described in Sect. 4.

Fig. 11 ROC curves for CAPA, SCAPA, RPOP, and MOSUM from 100 replications. A red dot indicates the behaviour under default parameters. Graphic curtailed at maximum 20 false positives for ease of viewing.

Fig. 12 A comparison of a) CAPA to b) SCAPA on an example time series. Segments in red show inferred collective anomalies. Dashed lines below the x-axis show the position of the true collective anomalies in the data.

5.3 CUSUM comparison

A natural comparison that can be made to assess SCAPA's simulated performance is with the widely used online change point detection method CUSUM (Page 1954). Both methods use a test statistic based on the log-likelihood ratio and can be configured with known typical mean and variance. The difference between the two methods is that in SCAPA, collective and point anomalies are detected jointly with separate penalties, whereas CUSUM is not designed to be robust to point anomalies. In our simulations, the CUSUM approach is implemented by setting the penalty for point anomalies in SCAPA, $\beta_O$, to an arbitrarily large value (a large power of 10) so that no point anomalies are detected.

The data was simulated in a similar way to that of Sect. 5.2, with the difference being that 20% of points in the typical state were point anomalies, simulated from a t-distribution with degrees of freedom ν ∈ {2, 5, 10}. The ROC curve resulting from this simulation can be found in Fig. 13, alongside an example time series in Fig. 14 shown segmented by both methods. Between them, these plots highlight SCAPA's ability to distinguish explicitly between point anomalies and collective anomalies, whereas an approach such as CUSUM suffers. Note in particular that, as expected, CUSUM gives a higher number of false positives than SCAPA when point anomalies are from distributions with heavier tails.

Fig. 13 ROC curves over 100 replications for a) CUSUM and b) SCAPA. Point anomalies were generated from a t-distribution with varying degrees of freedom (ν = 2, 5 or 10 respectively).

Fig. 14 A comparison of a) CUSUM to b) SCAPA on an example time series. Segments in red show inferred collective anomalies. Dashed lines below the x-axis show the position of the true collective anomalies in the data.

6 Machine temperature data

The Numenta Anomaly Benchmark (NAB) (Lavin and Ahmad 2015; Ahmad et al. 2017) provides a number of data sets that can be used to compare different anomaly detection approaches. The data can be obtained from https://github.com/numenta/NAB.

One example consists of heat sensor data from an internal component of a large industrial machine. The data is displayed in Fig. 15. There are n = 22,695 observations spanning 2nd December 2013 - 19th February 2014 sampled every five minutes.

Fig. 15 Machine temperature data.

Lavin and Ahmad (2015) use an initial, or burn-in, period to allow their algorithms to learn about the data. In line with their approach, we set the burn-in period to be the first 15% of the data (2nd December 2013 until the 14th December 2013, as shown by the blue shaded area in Fig. 16). We used the burn-in to obtain a robust M-estimator for the lag-1 autocorrelation of the observations after standardisation by the sequential mean and variance estimate. Using the robust M-estimator of Rocke (1996) from the R package robust (Wang et al. 2017), we obtained an autocorrelation estimate $\hat{\phi} = 0.974$. In line with the approach taken in Sect. 5.1.2, we therefore set the penalties to:

$$\beta_C = 2 \times \frac{1 + \hat{\phi}}{1 - \hat{\phi}} \times \log(n), \qquad \beta_O = 2 \times \frac{1 + \hat{\phi}}{1 - \hat{\phi}} \times \log(n).$$

Figure 16 shows the three anomalies SCAPA detected shaded in red. These corresponded to a set of hand labelled anomalous regions given by an engineer working on the machine, shown by the dashed vertical lines. The positions of these are given in Table 1. It should be noted that the data labels in the NAB consist of anomalous periods, rather than points. However, all approaches applied to the data as part of the NAB only return points of anomalous behaviour, highlighting SCAPA's potential to provide new insights into the data.

Fig. 16 Machine temperature data. The burn-in period is shaded in blue. Anomalies detected by SCAPA and CAPA are shaded in red and green respectively, and brown when they overlap. Dashed vertical lines show the hand labelled anomalies given by an engineer working on the machine.

Table 1 Labelled anomalies from the NAB, obtained from https://github.com/numenta/NAB/blob/master/labels/combined_windows.json, along with the time at which each was detected (final column)

Anomaly   Start time          End time            Given reason                  Detection time
1         17:50 15/12/2013    17:00 17/12/2013    Planned shutdown              16:50 16/12/2013
2         14:20 27/01/2014    13:30 29/01/2014    Onset of problem              21:25 28/01/2014
3         14:55 07/02/2014    14:05 09/02/2014    Catastrophic system failure   3:15 08/02/2014

Table 2 Collective anomalies found using CAPA

Start time                 End time
11:30:00 GMT 08/12/2013    23:05:00 GMT 10/12/2013
23:35:00 GMT 15/12/2013    18:40:00 GMT 16/12/2013
11:25:00 GMT 27/01/2014    13:50:00 GMT 31/01/2014
09:20:00 GMT 07/02/2014    12:05:00 GMT 09/02/2014

The detection of the more subtle second anomaly in a timely fashion is important, as this was claimed in the NAB literature to be the cause of the catastrophic system failure (third anomaly). The time at which SCAPA first detected it is given in Table 1. If users of the system deemed this to be too long a delay, the penalties used above could be decreased; however, as noted elsewhere in this paper, this would increase the frequency of false alarms.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s11222-022-10112-

Acknowledgements This work was supported by EPSRC grant numbers EP/N031938/1 (StatScale), EP/R004935/1 (NG-CDI) and EP/L015692/1 (STOR-i). Fisch also gratefully acknowledges EPSRC (EP/S515127/1) and British Telecommunications plc (BT) for providing financial support for his PhD via an Industrial CASE award. Finally, the authors thank David Yearling, Trevor Burbridge, Stephen Cassidy, and Kjeld Jensen in BT Research for helpful discussions while this work was being undertaken.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Agamennoni, G., Nieto, J.I., Nebot, E.M.: An Outlier-robust Kalman Filter. In 2011 IEEE International Conference on Robotics and Automation, 1551–1558. IEEE (2011)
Ahmad, S., Lavin, A., Purdy, S., Agha, Z.: Unsupervised Real-time Anomaly Detection for Streaming Data. Neurocomputing 262, 134–147. Online Real-Time Learning Strategies for Data Streams (2017)
Aston, J.A.D., Kirch, C.: Evaluating Stationarity Via Change-point Alternatives with Applications to fMRI Data. Ann. Appl. Stat. 6(4), 1906–1948 (2012)
Bardwell, L., Fearnhead, P.: Bayesian Detection of Abnormal Segments in Multiple Time Series. Bayesian Anal. 12(1), 193–218 (2017)
Bardwell, L., Fearnhead, P., Eckley, I.A., Smith, S., Spott, M.: Most Recent Changepoint Detection in Panel Data. Technometrics 61(1), 88–98 (2019)
Bezahaf, M., Hernandez, M.P., Bardwell, L., Davies, E., Broadbent, M., King, D., Hutchison, D.: Self-generated Intent-based System. In 2019 10th International Conference on Networks of the Future (NoF), 138–140 (2019)
Bruce, L., Jennie, K.: The Cusum Test of Homogeneity with an Application in Spontaneous Abortion Epidemiology. Stat. Med. 4(4), 469–488 (1985)
Cao, Y., Xie, Y.: Robust Sequential Change-point Detection by Convex Optimization. 2017 IEEE International Symposium on Information Theory (ISIT), 1287–1291 (2017)
Chandola, V., Banerjee, A., Kumar, V.: Anomaly Detection: A Survey. ACM Comput. Surv. 41(3), 15:1–15:58 (2009)
Chang, G.: Robust Kalman Filtering Based on Mahalanobis Distance as Outlier Judging Criterion. J. Geodesy 88(4), 391–401 (2014)
Chen, C., Liu, L.-M.: Joint Estimation of Model Parameters and Outlier Effects in Time Series. J. Am. Stat. Assoc. 88(421), 284–297 (1993)
Eichinger, B., Kirch, C.: A MOSUM Procedure for the Estimation of Multiple Random Change Points. Bernoulli 24(1), 526–564 (2018)
Fearnhead, P., Rigaill, G.: Changepoint Detection in the Presence of Outliers. J. Am. Stat. Assoc. 114(525), 169–183 (2019)
Ferdousi, Z., Maeda, A.: Unsupervised Outlier Detection in Time Series Data. 22nd International Conference on Data Engineering Workshops (ICDEW'06), x121–x121 (2006)
Fisch, A.T.M., Eckley, I.A., Fearnhead, P.: A Linear Time Method for the Detection of Point and Collective Anomalies. Stat. Anal. Data Min. (2022a). https://doi.org/10.1002/sam.11586
Fisch, A.T.M., Eckley, I.A., Fearnhead, P.: Subset Multivariate Collective and Point Anomaly Detection. J. Comput. Graph. Stat. 31(2), 574–585 (2022b)
Gut, A., Steinebach, J.: A Two-step Sequential Procedure for Detecting an Epidemic Change. Extremes 8(4), 311–326 (2005)
Iturria, A., Carrasco, J., Charramendieta, S., Conde, A., Herrera, F.: otsad: A Package for Online Time-series Anomaly Detectors. Neurocomputing 374, 49–53 (2020)
Jain, R., Chlamtac, I.: The P² Algorithm for Dynamic Calculation of Quantiles and Histograms without Storing Observations. Commun. ACM 28(10), 1076–1085 (1985)
Jeng, X.J., Cai, T.T., Li, H.: Simultaneous Discovery of Rare and Common Segment Variants. Biometrika 100(1), 157–172 (2013)
Justusson, B.I.: Median Filtering: Statistical Properties, 161–196. Springer, Berlin, Heidelberg (1981)
Killick, R., Fearnhead, P., Eckley, I.A.: Optimal Detection of Changepoints with a Linear Computational Cost. J. Am. Stat. Assoc. 107(500), 1590–1598 (2012)
Lavin, A., Ahmad, S.: Evaluating Real-time Anomaly Detection Algorithms – the Numenta Anomaly Benchmark. IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 38–44 (2015)
Lorden, G.: Procedures for Reacting to a Change in Distribution. Ann. Math. Statist. 42(6), 1897–1908 (1971)
Meier, A., Kirch, C., Cho, H.: MOSUM: A Package for Moving Sums in Change-point Analysis. J. Stat. Softw. 97(8), 1–42 (2021)
Olshen, A.B., Venkatraman, E.S., Lucito, R., Wigler, M.: Circular Binary Segmentation for the Analysis of Array-based DNA Copy Number Data. Biostatistics 5(4), 557–572 (2004)
Page, E.S.: Continuous Inspection Schemes. Biometrika 41(1/2), 100–115 (1954)
Pollak, M.: Optimal Detection of a Change in Distribution. Ann. Statist. 13(1), 206–227 (1985)
Rocke, D.M.: Robustness Properties of S-estimators of Multivariate Location and Shape in High Dimension. Ann. Stat. 24(3), 1327–1345 (1996)
Ruckdeschel, P., Spangl, B., Pupashenko, D.: Robust Kalman Tracking and Smoothing with Propagating and Non-propagating Outliers. Stat. Pap. 55(1), 93–123 (2014)
Sharia, T.: Efficient On-line Estimation of Autoregressive Parameters. Math. Methods Statist. 19(2), 163–186 (2010)
Stoehr, C., Aston, J.A.D., Kirch, C.: Detecting Changes in the Covariance Structure of Functional Time Series with Application to fMRI Data. arXiv e-prints, arXiv:1903.00288 (2019)
Theissler, A.: Detecting Known and Unknown Faults in Automotive Systems Using Ensemble-based Anomaly Detection. Knowl.-Based Syst. 123, 163–173 (2017)
Tierney, L.: A Space-efficient Recursive Procedure for Estimating a Quantile of an Unknown Distribution. SIAM J. Sci. Stat. Comput. 4(4), 706–711 (1983)
Ting, J.-A., Theodorou, E., Schaal, S.: Learning An Outlier-robust Kalman Filter. In European Conference on Machine Learning, 748–756. Springer (2007)
Wang, C., Viswanathan, K., Choudur, L., Talwar, V., Satterfield, W., Schwan, K.: Statistical Techniques for Online Anomaly Detection in Data Centers. In 12th IFIP/IEEE International Symposium on Integrated Network Management (IM 2011) and Workshops, 385–392 (2011)
Wang, J., Zamar, R., Marazzi, A., Yohai, V., Salibian-Barrera, M., Maronna, R., Zivot, E., Rocke, D., Martin, D., Maechler, M., Konis, K.: robust: Port of the S+ "Robust Library". R package version 0.4-18 (2017)
Yao, Q.: Tests for Change-points with Epidemic Alternatives. Biometrika 80(1), 179–191 (1993)
Zhao, H., Liu, H., Hu, W., Yan, X.: Anomaly Detection and Fault Analysis of Wind Turbine Components Based on Deep Learning Network. Renewable Energy 127, 825–834 (2018)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Journal

Statistics and Computing, Springer Journals

Published: Aug 1, 2022

Keywords: Anomaly detection; SCAPA; Streaming data; Real time
