Ibrohim and Budi Multi-label Hate Speech and Abusive Language Detection in Indonesian from Twitter

Dataset of hate speech and abusive language sampled from Twitter using keywords and key phrases. The dataset covers posts from March 2018 to September 2018 and also integrates posts from three previously created datasets. The data was annotated by crowd-sourcing in two phases ((1) hate/abuse/both/none; (2) target, categories, level) by a total of 30 workers, with 3 annotations per tweet. Annotators were selected according to the following criteria: native speaker of Indonesian, aged 20-30, experienced Twitter user, and not a member of a political party. 14 men and 16 women took part in the annotation, with varied jobs, ethnicities and religions. Trained linguists created the gold standard questions. For the first task (hate/abuse/both/none), only samples with full agreement were kept. For the second annotation task, which was carried out by the best annotators from the first round, samples with majority agreement (2/3) were kept.
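
The two agreement rules described above (full agreement in the first phase, 2-of-3 majority in the second) can be illustrated with a minimal sketch. This is not the authors' code: the function names, tweet IDs and label strings are hypothetical and chosen only for illustration.

```python
from collections import Counter


def full_agreement(labels):
    """Return the shared label if every annotator chose it, else None."""
    return labels[0] if len(set(labels)) == 1 else None


def majority_agreement(labels, min_votes=2):
    """Return the most frequent label if it has at least min_votes, else None."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else None


# Hypothetical phase-1 annotations: three labels per tweet.
phase1 = {
    "t1": ["hate", "hate", "hate"],
    "t2": ["abuse", "none", "abuse"],
    "t3": ["none", "none", "none"],
}

# Phase 1 keeps only tweets on which all three annotators agree.
kept_phase1 = {
    tweet_id: full_agreement(labels)
    for tweet_id, labels in phase1.items()
    if full_agreement(labels) is not None
}

# Phase 2 keeps a label when at least two of three annotators agree.
target_label = majority_agreement(["individual", "group", "individual"])
no_majority = majority_agreement(["weak", "moderate", "strong"])
```

In this sketch, tweet "t2" is dropped in the first phase because only two of its three annotators agree, whereas the same 2-of-3 split would be enough to keep a label in the second phase.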

Additional Info

Field Value
Paper Authors Okky Ibrohim, M., Budi, I.
Author contact email Okky Ibrohim, M., Budi, I.
Publication / paper reference Okky Ibrohim, M. and Budi, I., 2019. Multi-label Hate Speech and Abusive Language Detection in Indonesian Twitter. In: Proceedings of the Third Workshop on Abusive Language Online. Florence, Italy: Association for Computational Linguistics, pp.46-57.
Publication / paper link https://www.aclweb.org/anthology/W19-3506.pdf
Publication Year 2019
Dataset about page https://github.com/okkyibrohim/id-multi-label-hate-speech-and-abusive-language-detection
Language(s) covered Indonesian
Source data platform(s) Twitter
Phenomena annotated hate speech and abusive language
Level of instances Single comment / post
Data statement link N/A
Total number of instances in dataset 13169
Proportion of positive/abusive instances 0.42
Submitter Philine Zeinert
Submitter Email phze@itu.dk
State active