A novel semi-supervised self-training method based on resampling for Twitter fake account identification

Ziming Zeng; Tingting Li; Shouqiang Sun; Jingjing Sun; Jie Yin

doi:10.1108/dta-07-2021-0196

Publisher: Emerald Publishing
Copyright: © Emerald Publishing Limited
ISSN: 2514-9288
DOI: 10.1108/dta-07-2021-0196

Abstract

Purpose

Twitter fake accounts are bot accounts created by third-party organizations to influence public opinion, spread commercial propaganda or impersonate others. Identifying bot accounts effectively helps the public judge disseminated information accurately. In practice, however, manually labeling Twitter accounts is expensive and inefficient, and the labeled data are usually class-imbalanced. To address these problems, the authors propose a novel framework.

Design/methodology/approach

In the proposed framework, the authors introduce semi-supervised self-training and apply it to a real Twitter account data set from Kaggle. Specifically, the authors first train a classifier on the initial small set of labeled account data, then use the trained classifier to label the large-scale unlabeled account data automatically. Next, high-confidence instances are iteratively selected from the unlabeled data to expand the labeled data. Finally, an expanded Twitter account training set is obtained. Notably, a resampling technique is integrated into the self-training process, so that the classes are balanced at the initial stage of the self-training iteration.

Findings

The proposed framework effectively improves labeling efficiency and reduces the influence of class imbalance. It shows excellent identification results on six different base classifiers, especially when the initial set of labeled Twitter accounts is small.

Originality/value

This paper provides novel insights into identifying Twitter fake accounts. First, the authors pioneer the use of a self-training method to label Twitter accounts automatically in a semi-supervised setting. Second, the resampling technique is integrated into the self-training process to effectively reduce the influence of class imbalance on identification performance.
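The self-training procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the k-nearest-neighbor vote used as the base classifier, random oversampling as the resampling technique, the confidence threshold, the two-feature toy accounts and all function names are assumptions chosen only for illustration. The abstract does not say which resampler or which six base classifiers were used, and rebalancing at the start of every round is one possible reading of "the initial stage of the self-training iteration".

```python
import math
import random


def oversample(X, y, rng):
    """Random oversampling (assumed resampler): duplicate minority rows until classes match."""
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            Xb.append(row)
            yb.append(label)
    return Xb, yb


def knn_proba(X, y, x, k=3):
    """Stand-in base classifier: class vote fractions among the k nearest labeled rows."""
    nearest = sorted(zip(X, y), key=lambda pair: math.dist(pair[0], x))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return {label: count / len(nearest) for label, count in votes.items()}


def self_train(X_lab, y_lab, X_unl, threshold=0.9, max_iter=10, seed=0):
    """Grow the labeled set with high-confidence pseudo-labels from the unlabeled pool."""
    rng = random.Random(seed)
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unl)
    for _ in range(max_iter):
        # Balance the classes before the classifier is used in this round.
        Xb, yb = oversample(X_lab, y_lab, rng)
        remaining, added = [], 0
        for x in pool:
            proba = knn_proba(Xb, yb, x)
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:
                # Confident prediction: move the instance into the labeled set.
                X_lab.append(x)
                y_lab.append(label)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0 or not pool:
            break
    return X_lab, y_lab


# Toy data (hypothetical features): four labeled humans, two labeled bots.
X_lab = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (5.0, 5.0), (5.2, 4.9)]
y_lab = ["human"] * 4 + ["bot"] * 2
X_unl = [(0.1, 0.1), (4.9, 5.1), (5.1, 5.2), (0.2, 0.3)]

X_out, y_out = self_train(X_lab, y_lab, X_unl)
```

In this toy run every unlabeled account lies close to one of the two clusters, so all four receive a pseudo-label in the first round. With real data, instances below the threshold stay in the pool and are reconsidered in later rounds with the enlarged labeled set.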

Journal

Data Technologies and Applications – Emerald Publishing

Published: Jun 22, 2022

Keywords: Bot accounts; Class imbalance data; Semi-supervised learning; Self-training method; Resampling technique
