Mulki et al. A Levantine Twitter Dataset for Hate Speech and Abusive Language (L-HSAB)

Dataset of hate speech and abusive language in Levantine Arabic. The tweets were streamed from Twitter between March 2018 and February 2019 using relevant terms and from specified user accounts that have 100k followers. The samples were annotated by three academics (one male and two female) who are native speakers of Levantine Arabic. Inter-annotator agreement was measured pairwise: the percentage agreement was 78.43%, 87.24%, and 78.77%, and Cohen's Kappa was 0.599, 0.758, and 0.594. Krippendorff's Alpha was 76.5%. The agreement is distributed as follows: 3/3 agreement (4,222), 2/3 agreement (1,624), conflict (154).
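As a rough illustration of how pairwise agreement figures of this kind can be computed, here is a minimal Python sketch. The toy label lists and the helper names `percent_agreement` and `cohen_kappa` are assumptions made for illustration; they are not taken from the dataset or from the authors' code. Only the three agreement counts at the end come from the description above.

```python
from collections import Counter
from itertools import combinations


def percent_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's own label distribution.
    p_e = sum(count_a[lab] * count_b[lab] for lab in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy annotations (NOT real L-HSAB data).
annotators = {
    "A1": ["hate", "abusive", "normal", "normal", "abusive", "normal"],
    "A2": ["hate", "abusive", "normal", "abusive", "abusive", "normal"],
    "A3": ["hate", "normal", "normal", "normal", "abusive", "hate"],
}

for (name1, labels1), (name2, labels2) in combinations(annotators.items(), 2):
    print(name1, name2,
          round(percent_agreement(labels1, labels2), 4),
          round(cohen_kappa(labels1, labels2), 4))

# Agreement counts as stated in the description.
counts = {"3/3 agreement": 4222, "2/3 agreement": 1624, "conflict": 154}
total = sum(counts.values())
shares = {name: value / total for name, value in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The three counts sum to 6,000 annotated tweets, and the 3/3 and 2/3 cases together give 5,846, which matches the total number of instances listed in the metadata. This suggests, though the description does not state it, that the 154 conflict cases were left out of the released dataset.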

Data and Resources

Additional Info

Field Value
Authors Mulki, H., Haddad, H., Bechikh, C. and Alshabani, H.
Author contact email Mulki, H., Haddad, H., Bechikh, C. and Alshabani, H.
Publication / paper reference Mulki, H., Haddad, H., Bechikh, C. and Alshabani, H., 2019. L-HSAB: A Levantine Twitter Dataset for Hate Speech and Abusive Language. In: Proceedings of the Third Workshop on Abusive Language Online. Florence, Italy: Association for Computational Linguistics, pp.111-118.
Publication / paper link https://www.aclweb.org/anthology/W19-3512.pdf
Dataset about page https://github.com/Hala-Mulki/L-HSAB-First-Arabic-Levantine-HateSpeech-Dataset
Language(s) covered Arabic
Source data platform(s) Twitter
Annotation schema description Ternary (Hate, Abusive, Normal)
Phenomena annotated group-directed + person-directed hate speech and abusive language
Level of instances Single comment / post
Data statement link N/A
Total number of instances in dataset 5,846
Proportion of positive/abusive instances 0.38
Submitter Philine Zeinert
Submitter Email phze@itu.dk
State active