Mubarak et al. Abuse in Arabic Social Media Dataset

Dataset 1 contains offensive Arabic tweets sampled in March 2014 using obscene keywords and hashtags associated with pornographic pages (the word list is available as a .txt file). Dataset 2 contains deleted comments from an Arabic social media platform (Al Jazeera). Each instance was labelled by three annotators on CrowdFlower, with agreement of 85% for Dataset 1 and 87% for Dataset 2.

Additional Info

Field Value
Authors Mubarak, H., Darwish, K. and Magdy, W.
Author contact email Mubarak, H., Darwish, K. and Magdy, W.
Publication / paper reference Mubarak, H., Darwish, K. and Magdy, W., 2017. Abusive Language Detection on Arabic Social Media. In: Proceedings of the First Workshop on Abusive Language Online. Vancouver, Canada: Association for Computational Linguistics, pp.52-56.
Publication / paper link https://www.aclweb.org/anthology/W17-3008
Dataset about page https://alt.qcri.org/~hmubarak/offensive/
Language(s) covered Arabic
Source data platform(s) Twitter, Al Jazeera
Annotation schema description Ternary (Obscene, Offensive but not obscene, Clean)
Phenomena annotated Incivility
Level of instances Single comment / post
Data statement link
Total number of instances in dataset 1,100; 32,000
Proportion of positive/abusive instances 0.59; 0.81
Submitter Laila Sprejer
Submitter Email sprejerlaila@gmail.com
State active
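
The totals and proportions listed above can be combined to estimate how many abusive instances each dataset holds. The following is a minimal sketch, assuming the two values in each field are listed in the same order (Dataset 1 from Twitter first, Dataset 2 from Al Jazeera second) and that the "positive/abusive" proportion covers every instance not labelled Clean. The dictionary keys and the function name `estimate_abusive_count` are illustrative and do not come from the dataset itself.

```python
# Figures copied from the metadata fields above. The pairing of each value
# with a dataset is an assumption based on the order in which they are listed.
DATASETS = {
    "Dataset 1 (Twitter)": {"total": 1100, "abusive_share": 0.59},
    "Dataset 2 (Al Jazeera)": {"total": 32000, "abusive_share": 0.81},
}


def estimate_abusive_count(total: int, abusive_share: float) -> int:
    """Return the approximate number of abusive instances, rounded to an integer."""
    return round(total * abusive_share)


if __name__ == "__main__":
    for name, stats in DATASETS.items():
        abusive = estimate_abusive_count(stats["total"], stats["abusive_share"])
        clean = stats["total"] - abusive
        print(f"{name}: ~{abusive} abusive, ~{clean} clean of {stats['total']}")
```

Because the published totals and proportions are rounded, the resulting counts are approximations rather than exact label tallies.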