Data cleaning issues in class imbalanced datasets: instance selection and missing values imputation for one-class classifiers

Publisher: Emerald Publishing
Copyright: © Emerald Publishing Limited
ISSN: 2514-9288
DOI: 10.1108/dta-01-2021-0027

Abstract

Purpose
Class imbalance learning, which arises in many domain problem datasets, is an important research topic in data mining and machine learning. One-class classification techniques, which aim to identify anomalies (the minority class) against the normal data (the majority class), are one representative solution for class imbalanced datasets. Since one-class classifiers are trained using only normal data to create a decision boundary for later anomaly detection, the quality of the training set, i.e. the majority class, is a key factor affecting the performance of one-class classifiers.

Design/methodology/approach
This paper focuses on two data cleaning, or preprocessing, methods for class imbalanced datasets. The first examines whether performing instance selection to remove noisy data from the majority class can improve the performance of one-class classifiers. The second combines instance selection with missing value imputation, where the latter handles incomplete datasets that contain missing values.

Findings
The experiments cover 44 class imbalanced datasets, three instance selection algorithms (IB3, DROP3 and the genetic algorithm, GA), the CART decision tree for missing value imputation and three one-class classifiers (OCSVM, isolation forest and LOF). The results show that, if the instance selection algorithm is carefully chosen, this step can improve the quality of the training data and allow one-class classifiers to outperform the baselines trained without instance selection. Moreover, when class imbalanced datasets contain missing values, combining missing value imputation and instance selection, regardless of which step is performed first, maintains data quality similar to that of datasets without missing values.

Originality/value
The novelty of this paper is to investigate the effect of instance selection on the performance of one-class classifiers, which has not been studied before. Moreover, this study is the first to consider the scenario in which the training set used to train one-class classifiers contains missing values. In this case, performing missing value imputation and instance selection in different orders is compared.
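To make the preprocessing pipeline described in the abstract concrete, the following is a minimal sketch in Python with scikit-learn. It is not the authors' implementation: the CART-based imputation is approximated with IterativeImputer wrapped around a DecisionTreeRegressor, the instance selection step is a simple distance-based noise filter standing in for IB3/DROP3/GA, and the synthetic data, quantile threshold and OCSVM hyperparameters are illustrative assumptions.

```python
"""Sketch of: impute missing values (CART-style), filter the majority class
(a crude proxy for instance selection), then train a one-class classifier."""
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score


def impute_with_cart(X):
    """Impute missing values using a CART regressor inside IterativeImputer."""
    imputer = IterativeImputer(estimator=DecisionTreeRegressor(max_depth=5),
                               random_state=0)
    return imputer.fit_transform(X)


def select_instances(X_majority, k=5, keep_ratio=0.9):
    """Drop the majority-class points with the largest mean distance to their
    k nearest neighbours -- a noise-filtering stand-in for IB3/DROP3/GA."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_majority)
    dist, _ = nn.kneighbors(X_majority)
    scores = dist[:, 1:].mean(axis=1)           # exclude self-distance
    threshold = np.quantile(scores, keep_ratio)
    return X_majority[scores <= threshold]


# --- illustrative usage with synthetic data --------------------------------
rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(500, 8))      # majority / normal class
X_anomaly = rng.normal(4, 1, size=(25, 8))      # minority / anomalies

# simulate ~5% missing values in the training (normal-only) data
mask = rng.random(X_normal.shape) < 0.05
X_train = X_normal.copy()
X_train[mask] = np.nan

X_train = impute_with_cart(X_train)             # step 1: imputation
X_train = select_instances(X_train)             # step 2: instance selection

clf = OneClassSVM(gamma="scale", nu=0.05).fit(X_train)
X_test = np.vstack([X_normal[:100], X_anomaly])
y_test = np.r_[np.ones(100), -np.ones(len(X_anomaly))]
print("AUC:", roc_auc_score(y_test, clf.decision_function(X_test)))
```

Swapping OneClassSVM for IsolationForest or LocalOutlierFactor(novelty=True), or reversing the order of the imputation and instance selection calls, corresponds to the other configurations the paper compares.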

Journal

Data Technologies and Applications, Emerald Publishing

Published: Oct 11, 2021

Keywords: Data mining; One-class classifiers; Class imbalance; Machine learning; Instance selection; Missing value imputation
