Kurrek et al. Towards a Comprehensive Taxonomy and Large-Scale Annotated Corpus for Online Slur Usage from Reddit

The dataset covers discrimination related to sexuality, ethnicity, and gender. Comments were sampled from several subreddits between October 2007 and September 2019 by matching three slurs ("f*ggot", "n*gger", "tr*nny"); they were collected to study slur usage in both derogatory and non-derogatory contexts.

The data was annotated by a diverse cohort of university members (diversity was assessed by field and year of study, age, sexuality, ethnicity, and gender), with Shannon equitability indices of 0.90, 0.92, and 0.87 for sexuality, ethnicity, and gender, respectively. Annotators were trained in a 2-day session, and the final annotations were done in pairs formed to maximize the diversity of the annotators' profiles and reshuffled after every 1,000 annotations. Overall Cohen's kappa is 0.60, with 78.6% agreement. Disagreements between the two annotators were resolved by the authors.
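
For readers unfamiliar with the two statistics quoted above, the following is a minimal Python sketch of how a Shannon equitability index and Cohen's kappa are commonly computed. It is not the authors' code: the function names are illustrative, the example counts and labels are hypothetical toy values rather than figures from the corpus, and the equitability function assumes the number of categories is the number of groups actually observed.

```python
import math
from collections import Counter


def shannon_equitability(counts):
    """Shannon entropy of the group proportions divided by its maximum, ln(S).

    `counts` holds the number of annotators per group (hypothetical input).
    S is taken to be the number of non-empty groups (an assumption).
    """
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    if len(props) < 2:
        return 0.0  # a single group has no diversity to measure
    entropy = -sum(p * math.log(p) for p in props)
    return entropy / math.log(len(props))


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Toy inputs for illustration only; not taken from the dataset.
    print(shannon_equitability([4, 3, 3]))
    print(cohens_kappa(["DEG", "NDG", "DEG", "HOM"], ["DEG", "DEG", "DEG", "HOM"]))
```

An equitability of 1.0 means every group is equally represented, so the reported values of 0.87 to 0.92 indicate a fairly even spread of annotator profiles. Kappa discounts the raw agreement rate by the agreement expected from chance, which is why it is lower than the 78.6% figure.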

Additional Info

Field: Value
Authors: Kurrek, J., Saleem, H. M., & Ruths, D.
Author contact email: Kurrek, J., Saleem, H. M., & Ruths, D.
Publication / paper reference: Kurrek, J., Saleem, H. M., & Ruths, D. (2020, November). Towards a Comprehensive Taxonomy and Large-Scale Annotated Corpus for Online Slur Usage. In Proceedings of the Fourth Workshop on Online Abuse and Harms (pp. 138-149).
Publication / paper link: https://www.aclweb.org/anthology/2020.alw-1.17.pdf
Dataset about page: https://github.com/networkdynamics/slur-corpus
Language(s) covered: English
Source data platform(s): Reddit
Annotation schema description: Multi-label (Derogatory, Non Derogatory Non Appropriative, Homonym, Appropriative, Noise)
Phenomena annotated: Slur usage in the context of sexuality, ethnicity, and gender
Level of instances: Single comment / post
Data statement link: N/A
Total number of instances in dataset: 40,000
Proportion of positive/abusive instances: 0.52 (20,531 Derogatory)
Submitter: Philine Zeinert
Submitter Email: phze@itu.dk
State: active