Fortuna et al. A Hierarchically-Labeled Portuguese Hate Speech Dataset From Twitter

This dataset contains Portuguese hate speech sampled from Twitter and labeled with 81 categories. It is manually annotated for hate speech using a hierarchical structure of classes, following a multiclass, multilabel approach. 5,668 messages from 1,156 distinct users were collected on Twitter and annotated for hate speech. The data were filtered by hate-related keywords and profiles. The tweets were annotated by experts, with each tweet labeled by 3 annotators drawn from a team of 18, and the majority vote determined the assigned label. The overall Fleiss' kappa is 0.17. For the classes, one researcher annotated the tweets and a second researcher then checked 500 of them, giving a Cohen's kappa of K = 0.72.
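
As an illustration of the aggregation and agreement measures mentioned above, the following is a minimal sketch of majority voting and Fleiss' kappa for items that each receive the same number of annotations. The function names and the toy annotations are hypothetical and are not taken from the dataset or the authors' code; the source does not describe the actual implementation.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label among one item's annotations.

    With 3 annotators and a binary label there can be no tie; tie handling
    for other setups is not specified in the source and is not covered here.
    """
    return Counter(labels).most_common(1)[0][0]


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa, assuming every item has the same number of ratings."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j]: number of annotators who assigned category j to item i
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Observed agreement per item, then averaged over items
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Expected agreement from the overall category proportions
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_exp = sum(p * p for p in p_cats)
    # Undefined (division by zero) if all ratings fall in a single category
    return (p_bar - p_exp) / (1 - p_exp)


# Toy annotations, invented for illustration only (3 annotators per item)
toy = [
    ["hate", "hate", "not"],
    ["not", "not", "not"],
    ["hate", "not", "not"],
    ["hate", "hate", "hate"],
]
labels = [majority_vote(row) for row in toy]
kappa = fleiss_kappa(toy, ["hate", "not"])
print(labels, round(kappa, 3))
```

Cohen's kappa, used for the two-researcher check, follows the same observed-versus-expected agreement idea but is defined for exactly two raters.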

Additional Info

Field Value
Authors Fortuna, P., Rocha da Silva, J., Soler-Company, J., Warner, L. and Nunes, S.
Author contact email Fortuna, P., Rocha da Silva, J., Soler-Company, J., Warner, L. and Nunes, S.
Publication / paper reference Fortuna, P., Rocha da Silva, J., Soler-Company, J., Warner, L. and Nunes, S., 2019. A Hierarchically-Labeled Portuguese Hate Speech Dataset. In: Proceedings of the Third Workshop on Abusive Language Online. Florence, Italy: Association for Computational Linguistics, pp.94-104.
Publication / paper link https://www.aclweb.org/anthology/W19-3510.pdf
Dataset about page https://rdm.inesctec.pt/dataset/cs-2017-008
Language(s) covered Portuguese
Source data platform(s) Twitter
Annotation schema description Binary (Hate, Not); multi-level (81 categories, identified inductively; categories have different granularities, and content can be assigned to multiple categories at once)
Phenomena annotated group-directed, multiple identities, inductively categorized
Level of instances Single comment / post
Data statement link N/A
Total number of instances in dataset 3,059
Proportion of positive/abusive instances 0.32
Submitter Philine Zeinert
Submitter Email phze@itu.dk
State active