Abstract:
Online Social Networks (OSNs) have gained immense traction in society. Social media has reshaped our social world and plays a pivotal role in shaping our personal and professional goals. While it provides invaluable information to millions of individuals daily, it has also become one of the most popular venues for spam campaigns. In this paper, we design an algorithm and build a system for recognizing spam campaigns, with an emphasis on the phone numbers that appear in posts, in light of the malicious activity that is degrading our online experience. The research focuses on data gathered by monitoring three social networking channels: Tumblr, Twitter, and Flickr, and it analyzes spam posts accumulated over four months. We collected over 18 million spam posts and used regular expressions to clean the data and identify the posts containing phone numbers. Next, we used Latent Dirichlet Allocation (LDA), a Bayesian topic model, to detect the category of each post. We further apply bag-of-words and tf-idf representations to this data and use cosine similarity as the similarity measure.
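To illustrate the filtering and similarity steps named above, the following is a minimal sketch, not the implementation used in this work. It relies on several assumptions: the phone-number pattern `PHONE_RE` is a hypothetical, loose match for ten-digit North American style numbers; the helper names (`has_phone`, `tokenize`, `tfidf_vectors`, `cosine`) are illustrative; and the tf-idf weighting uses a smoothed idf variant chosen for convenience. The LDA step is omitted.

```python
import math
import re
from collections import Counter

# Hypothetical pattern (an assumption, not the paper's): optional +1 prefix,
# then 3-3-4 digits with optional spaces, dots, hyphens, or parentheses.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)


def has_phone(post):
    """Return True if the post contains something that looks like a phone number."""
    return PHONE_RE.search(post) is not None


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (digits are dropped)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(posts):
    """Build sparse tf-idf vectors (dicts) from bag-of-words counts."""
    bags = [Counter(tokenize(p)) for p in posts]
    n_docs = len(bags)
    # Document frequency: iterating a Counter yields each distinct term once.
    df = Counter(term for bag in bags for term in bag)
    # Smoothed idf (an assumed variant): log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: count * idf[t] for t, count in bag.items()} for bag in bags]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up example posts, for illustration only.
posts = [
    "Call 555-123-4567 now for cheap loans",
    "Cheap loans available, call (555) 123-4567 today",
    "Lovely sunset at the beach this evening",
    "Tech support hotline +1 800 555 0199",
]
phone_posts = [p for p in posts if has_phone(p)]
vectors = tfidf_vectors(phone_posts)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

In this sketch, posts that share vocabulary score higher under cosine similarity than posts with disjoint vocabulary, which is the property a campaign-grouping step would rely on. A production pattern would need to handle international formats and obfuscated digits, which this example does not attempt.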