We present a novel notion of outlier, called the Concentration Free Outlier Factor, or CFOF. As a main contribution, we formalize the notion of concentration of outlier scores and theoretically prove that CFOF does not concentrate in the Euclidean space for any arbitrarily large dimensionality. To the best of our knowledge, no other data analysis measure related to the Euclidean distance has been supported by theoretical evidence that it is immune to the concentration effect. We determine the closed form of the distribution of CFOF scores in arbitrarily large dimensionalities and show that the CFOF score of a point depends on its squared norm standard score and on the kurtosis of the data distribution, thus providing a clear and statistically founded characterization of this notion. Moreover, we leverage this closed form to provide evidence that the definition does not suffer from the hubness problem affecting other measures in high dimensions. We prove that the number of CFOF outliers coming from each cluster is proportional to cluster size and kurtosis, a property that we call semi-locality. We leverage these theoretical findings to shed light on properties of well-known outlier scores. Indeed, we determine that semi-locality characterizes existing reverse nearest neighbor-based outlier definitions, thus clarifying the exact nature of their observed local behavior. We also formally prove that classical distance-based and density-based outliers concentrate both for bounded and unbounded sample sizes and for fixed and variable values of the neighborhood parameter. We introduce the fast-CFOF algorithm for detecting outliers in large high-dimensional datasets. The algorithm has linear cost, supports multi-resolution analysis, and is embarrassingly parallel. Experiments highlight that the technique is able to efficiently process huge datasets, to deal even with large values of the neighborhood parameter, to avoid concentration, and to obtain excellent accuracy.
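The abstract does not state the CFOF definition, so the following is only a minimal sketch under an assumed reading: the score of a point is taken to be the smallest neighborhood width k, normalized by the dataset size n, at which the point appears among the k nearest neighbors of at least a fraction ρ of the data (each point counts as its own first neighbor). The function name `cfof_scores`, the parameter `rho`, and the tie handling are illustrative choices. This is a quadratic brute-force computation, not the linear-cost fast-CFOF algorithm mentioned above.

```python
import math


def cfof_scores(points, rho):
    """Brute-force CFOF-style scores under the assumed definition.

    points: list of equal-length coordinate tuples.
    rho: fraction in (0, 1] of the dataset that must count a point
         among its k nearest neighbors.
    Returns one score in (0, 1] per point; larger means more outlying.
    """
    n = len(points)
    # ranks_of[i] collects the rank of point i in every point's
    # neighbor ordering (rank 1 is the nearest, normally the point itself).
    ranks_of = [[] for _ in range(n)]
    for j in range(n):
        # Sort all points by Euclidean distance from point j.
        # Ties are broken by index because Python's sort is stable.
        order = sorted(range(n), key=lambda i: math.dist(points[j], points[i]))
        for rank, i in enumerate(order, start=1):
            ranks_of[i].append(rank)

    # Point i has a reverse k-neighborhood of size >= rho * n as soon as
    # k reaches the ceil(rho * n)-th smallest of its ranks.
    needed = max(1, math.ceil(rho * n))
    return [sorted(ranks)[needed - 1] / n for ranks in ranks_of]


if __name__ == "__main__":
    # Hypothetical toy data: a tight cluster of four points and one far point.
    data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (50.0, 50.0)]
    for point, score in zip(data, cfof_scores(data, rho=0.5)):
        print(point, score)
```

In this toy setting the far point only enters the neighborhoods of the cluster points once k covers the whole dataset, so under the assumed definition it receives the largest score.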
ACM Transactions on Knowledge Discovery from Data (TKDD) – Association for Computing Machinery
Published: Jan 27, 2020
Keywords: Outlier detection