A novel pre-processing workflow for popularity prediction in social media

Yıldırım, Hüseyin Buğra
Twitter users interact with each other continuously through posts and reactions such as likes and retweets. Most tweets receive little reaction, and only a few receive a prominent response, so reaction counts follow a heavily right-skewed distribution. Furthermore, some tweets show unexpected response levels that standard features cannot capture and that often depend on extraordinary circumstances, such as being the first to report an event or triggering a mass reaction. The heavily skewed distribution of social media datasets and the gap between expected and observed reactions are the two main factors that distort model predictions. This thesis first addresses the concepts of outliers and uncertainty in reaction counts in social media datasets. A method for identifying social media outliers is proposed, and the adverse effects of outliers on modeling are presented. Finally, a SMOTE-based data augmentation method is presented, in which the data is discretized and synthetic instances are generated predominantly from the clusters with fewer instances. The results show that models built with outlier removal and data augmentation achieve slightly better prediction performance than models built without them. This research has practical implications for studies that aim to predict the popularity of tweets.
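The augmentation idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the decade-based bin edges, the function names `discretize` and `smote_by_bin`, the neighbour count `k`, and the toy data are all assumptions made for illustration, and the interpolation is plain SMOTE-style interpolation written with the Python standard library only.

```python
import math
import random

def discretize(count, edges=(1, 10, 100, 1000)):
    """Return the index of the bin a reaction count falls into (assumed decade bins)."""
    return sum(count >= e for e in edges)

def smote_by_bin(rows, k=3, seed=0):
    """rows: list of (feature_tuple, reaction_count).
    Returns the original rows plus synthetic rows for the smaller bins."""
    rng = random.Random(seed)
    bins = {}
    for feats, count in rows:
        bins.setdefault(discretize(count), []).append((feats, count))
    # Grow every bin up to the size of the largest one.
    target = max(len(members) for members in bins.values())
    synthetic = []
    for members in bins.values():
        if len(members) < 2:
            continue  # interpolation needs at least two points
        for _ in range(target - len(members)):
            base = rng.choice(members)
            # k nearest neighbours of the base point within the same bin
            others = [m for m in members if m is not base]
            others.sort(key=lambda m: math.dist(base[0], m[0]))
            neighbour = rng.choice(others[:k])
            t = rng.random()  # interpolation weight in [0, 1)
            feats = tuple(a + t * (b - a) for a, b in zip(base[0], neighbour[0]))
            count = base[1] + t * (neighbour[1] - base[1])
            synthetic.append((feats, count))
    return rows + synthetic

# Hypothetical toy data: (features, reaction count)
rows = [
    ((0.1, 1.0), 2), ((0.2, 1.5), 3), ((0.3, 0.9), 5), ((0.25, 1.1), 7),
    ((0.9, 4.0), 150), ((1.1, 4.5), 400),
]
augmented = smote_by_bin(rows)
```

Because each synthetic instance is a convex combination of two instances from the same bin, its reaction count stays inside that bin, so the sparse high-reaction region gains instances without distorting the dense low-reaction region.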
Citation Formats
H. B. Yıldırım, “A novel pre-processing workflow for popularity prediction in social media,” M.S. - Master of Science, Middle East Technical University, 2021.