A Novel Preprocessing Technique for Toxic Comment Classification

Abstract:

Online abuse and harassment are a growing threat in the cyber community. Many platforms have devised policies to tackle this problem, but these policies require prior identification of inappropriate and offensive content. Furthermore, such content exhibits several aspects of negativity: a single comment can express disgust, disbelief, and threat at the same time. This indicates that the negativity or toxicity in a comment can have multiple facets. Hence, the challenge is to identify exactly which facets a comment exhibits so that appropriate policies can be formulated and applied to penalize the offender. This study uses two approaches to identify the underlying toxicities in comments. The first trains a separate classifier for each facet of toxicity. The second treats the task as a multi-label classification problem. Several machine learning methods, including logistic regression, Naïve Bayes, and decision tree classification, are employed. The dataset is taken from Kaggle, and 10-fold cross-validation is used to assess the robustness of the models. The study introduces a novel preprocessing scheme that transforms the multi-label classification problem into a multi-class classification problem. The preprocessing strategy shows a significant improvement in accuracy when applied to simple classification models, which encourages its use with more sophisticated models as well. Experimental results show that logistic regression performs better than the other classifiers in both the binary and the multi-class classification settings. This suggests that the preprocessing could also be useful for neural classification models.
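To make the idea of turning a multi-label problem into a multi-class one concrete, the following is a minimal sketch of one common transformation, in which every distinct combination of facet labels is treated as a single class. The abstract does not specify the exact scheme used in the study, so this is an illustration under that assumption and not the authors' method. The facet names and the helper functions `to_multiclass` and `to_multilabel` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical facet names, used only for illustration.
FACETS = ("toxic", "obscene", "threat")

Combo = Tuple[int, ...]


def to_multiclass(label_rows: Sequence[Sequence[int]]) -> Tuple[List[int], Dict[Combo, int]]:
    """Map each distinct 0/1 label combination to a single class id.

    Assumption: one class per observed label combination. The study's
    actual preprocessing scheme may differ.
    """
    combo_to_class: Dict[Combo, int] = {}
    classes: List[int] = []
    for row in label_rows:
        combo = tuple(row)
        # Assign the next unused class id the first time a combination appears.
        if combo not in combo_to_class:
            combo_to_class[combo] = len(combo_to_class)
        classes.append(combo_to_class[combo])
    return classes, combo_to_class


def to_multilabel(class_id: int, combo_to_class: Dict[Combo, int]) -> Combo:
    """Recover the original label combination from a predicted class id."""
    inverse = {cls: combo for combo, cls in combo_to_class.items()}
    return inverse[class_id]


# Four toy comments, each labelled on the three hypothetical facets.
rows = [(0, 0, 0), (1, 1, 0), (1, 0, 1), (1, 1, 0)]
classes, mapping = to_multiclass(rows)
print(classes)
print(dict(zip(FACETS, to_multilabel(classes[1], mapping))))
```

After this step, any single-output multi-class classifier can be trained on `classes`, and a predicted class id can be mapped back to its facet labels with `to_multilabel`. A known trade-off of this kind of transformation is that the number of classes grows with the number of distinct label combinations observed in the data.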