Cost Based Heterogeneous Learning Framework for Real Time Spam Detection in Social Networks

Abstract:

With the widespread use of social networks, spam messages targeting them have become a major issue. Spam detection methods can be broadly divided into expert-based and machine learning-based methods. When experts participate in spam detection, the detection accuracy is fairly high; however, this approach is time-consuming and expensive. Conversely, machine learning methods have the advantage of automation, but their accuracy is relatively low. This paper proposes a spam-detection framework that combines the advantages of both methods. To reduce the workload of the experts, all messages are first analyzed by a primary machine learning filter; those determined to be normal are allowed through, whereas suspicious messages are flagged. The flagged messages are subsequently analyzed by an expert to enhance the overall system accuracy. In the filtering process, cost-based machine learning is used to prevent the fatal error of misidentifying a spam message as a normal message. In addition, to keep pace with continuously evolving spam trends, a module that periodically updates the training dataset with the expert-diagnosis results is incorporated into the framework. Experiments conducted on an imbalanced dataset of spam and normal tweets, in a ratio similar to that observed in practice, indicate that the proposed framework has a spam-detection rate of almost 92.8%, which is higher than that of the conventional machine learning technique. Furthermore, the proposed framework delivered stable, high performance even in an environment where social network messages changed continuously, unlike the conventional technique, which exhibited large performance deviations.
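
The two-stage flow described above (machine learning filter first, expert review for flagged messages, expert labels fed back into the training data) can be illustrated with a minimal sketch. The abstract does not specify the cost-based learner, so the sketch assumes a simple minimum-expected-cost threshold rule in its place; `CostBasedRouter`, `toy_score`, `toy_expert`, and the cost values 10 and 1 are hypothetical names and numbers chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CostBasedRouter:
    """Hypothetical two-stage router: ML filter first, expert for flagged messages."""
    score: Callable[[str], float]   # estimated P(spam | message), assumed to lie in [0, 1]
    expert: Callable[[str], bool]   # True if the expert judges the message to be spam
    cost_missed_spam: float = 10.0  # assumed cost of letting spam through as normal
    cost_review: float = 1.0        # assumed cost of sending a normal message to the expert
    new_training_data: List[Tuple[str, bool]] = field(default_factory=list)

    @property
    def threshold(self) -> float:
        # Flag when p * cost_missed_spam > (1 - p) * cost_review,
        # which rearranges to p > cost_review / (cost_review + cost_missed_spam).
        return self.cost_review / (self.cost_review + self.cost_missed_spam)

    def handle(self, message: str) -> str:
        p = self.score(message)
        if p <= self.threshold:
            return "pass"  # judged normal by the filter, no expert involved
        is_spam = self.expert(message)  # flagged, so the expert decides
        # Keep the expert label so the filter can be retrained periodically.
        self.new_training_data.append((message, is_spam))
        return "spam" if is_spam else "pass"


SPAM_WORDS = {"free", "prize", "win"}


def toy_score(message: str) -> float:
    """Toy stand-in for a trained classifier: fraction of words that look spammy."""
    words = message.lower().split()
    if not words:
        return 0.0
    return sum(w in SPAM_WORDS for w in words) / len(words)


def toy_expert(message: str) -> bool:
    """Toy stand-in for a human expert."""
    return "prize" in message.lower()


router = CostBasedRouter(score=toy_score, expert=toy_expert)
for msg in ["lunch at noon", "free coffee in the lobby", "win free prize now"]:
    print(msg, "->", router.handle(msg))
```

Because missing spam is assumed to cost ten times more than an unnecessary review, the threshold falls to 1/11, so even mildly suspicious messages are routed to the expert. This trades extra expert workload for fewer spam messages passed as normal, which is the asymmetry the abstract describes.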