CFOF

Publisher: Association for Computing Machinery
Copyright: © 2020 ACM
ISSN: 1556-4681
eISSN: 1556-472X
DOI: 10.1145/3362158

Abstract

We present a novel notion of outlier, called the Concentration Free Outlier Factor, or CFOF. As a main contribution, we formalize the notion of concentration of outlier scores and theoretically prove that CFOF does not concentrate in the Euclidean space for any arbitrarily large dimensionality. To the best of our knowledge, no other data analysis measure related to the Euclidean distance has been supported by theoretical evidence of immunity to the concentration effect. We determine the closed form of the distribution of CFOF scores in arbitrarily large dimensionalities and show that the CFOF score of a point depends on its squared norm standard score and on the kurtosis of the data distribution, thus providing a clear and statistically founded characterization of this notion. Moreover, we leverage this closed form to provide evidence that the definition does not suffer from the hubness problem affecting other measures in high dimensions. We prove that the number of CFOF outliers coming from each cluster is proportional to the cluster size and kurtosis, a property that we call semi-locality. We leverage these theoretical findings to shed light on properties of well-known outlier scores. In particular, we determine that semi-locality characterizes existing reverse nearest neighbor-based outlier definitions, thus clarifying the exact nature of their observed local behavior. We also formally prove that classical distance-based and density-based outliers concentrate both for bounded and unbounded sample sizes and for fixed and variable values of the neighborhood parameter. Finally, we introduce the fast-CFOF algorithm for detecting outliers in large high-dimensional datasets. The algorithm has linear cost, supports multi-resolution analysis, and is embarrassingly parallel. Experiments show that the technique efficiently processes huge datasets, handles even large values of the neighborhood parameter, avoids concentration, and achieves excellent accuracy.
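To make the reverse nearest neighbor-based reading of the abstract concrete, the following is a minimal brute-force sketch, not the paper's fast-CFOF algorithm: it assumes the score of a point is the smallest fraction k/n at which that point appears among the k nearest neighbors of at least a fraction rho of the dataset. The function name cfof_scores, the parameter rho, and the choice to let each point rank itself first are illustrative assumptions, not the paper's API, and the quadratic cost is far from the linear-time, sampling-friendly algorithm described above.

```python
# Hedged sketch of a CFOF-style score via brute-force reverse-kNN ranks.
# Assumption: score(j) = min{ k/n : at least ceil(rho*n) points have j
# among their k nearest neighbors }.

import numpy as np


def cfof_scores(X: np.ndarray, rho: float = 0.01) -> np.ndarray:
    """Return one score in (0, 1] per row of X; larger means more outlying."""
    n = X.shape[0]

    # Pairwise squared Euclidean distances: O(n^2 d) time, O(n^2) memory,
    # acceptable only for small illustrative datasets.
    sq_norms = (X ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)

    # rank[i, j] = 1-based position of point j in the neighbor ordering of
    # point i (each point ranks itself first, at distance zero).
    order = np.argsort(d2, axis=1)
    rank = np.empty_like(order)
    rank[np.arange(n)[:, None], order] = np.arange(1, n + 1)

    threshold = max(1, int(np.ceil(rho * n)))
    scores = np.empty(n)
    for j in range(n):
        # Sorted k-values at which each point "adopts" j as a neighbor;
        # the threshold-th smallest is the minimal k reaching rho*n adopters.
        scores[j] = np.sort(rank[:, j])[threshold - 1] / n
    return scores
```

On a toy dataset, for example cfof_scores(np.random.randn(500, 100), rho=0.05), the handful of points with the largest scores would be reported as outliers; sweeping several values of rho presumably corresponds to the multi-resolution analysis mentioned in the abstract.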

Journal: ACM Transactions on Knowledge Discovery from Data (TKDD), Association for Computing Machinery

Published: Jan 27, 2020

Keywords: Outlier detection
