SEHC: A Benchmark Setup to Identify Online Hate Speech in English

Abstract:

In the digital age, speech and information can be disseminated online anonymously, with little regard for repercussions. Social media platforms pose a unique problem for regulators because of the speed and volume of the material posted and the lack of editorial supervision. Existing datasets for hate speech and offensive language identification lack diversity in their content. In this article, we create a multi-domain hate speech corpus (MHC) of English tweets that includes hate speech against religion, nationality, ethnicity, and gender, and that covers diverse domains such as current affairs, politics, terrorism, technology, natural disasters, and human and drug trafficking. Each instance in our dataset is manually annotated as hate or non-hate. We use existing state-of-the-art models and present a stacked-ensemble-based hate speech classifier (SEHC) to identify hate speech in Twitter data. Our results indicate that the proposed method can serve as a strong baseline for future studies using this dataset.