About hatespeechdata.com

This site catalogues datasets annotated for hate speech, online abuse, and offensive language. They may be useful, for example, for training a natural language processing system to detect such language.

If you are new to CKAN, you can refer to the help documentation here.


The site is maintained by Leon Derczynski, Florence Enock and Hannah Kirk.

Adding a resource

To submit your dataset, please register as a user and upload it. Accompanying data statements are preferred for all corpora. You can find a markdown template for a data statement here. Datasets undergo an approval process before being made live on the site. You can find the CKAN user guide here.


If you use these resources, please cite (and read!) our paper:

Vidgen B, Derczynski L (2020) Directions in Abusive Language Training Data, a Systematic Review: Garbage In, Garbage Out. PLoS ONE 15(12): e0243300. https://doi.org/10.1371/journal.pone.0243300.

And if you would like to find other resources for researching online hate, visit The Alan Turing Institute's Online Hate Research Hub or read The Alan Turing Institute's Reading List on Online Hate and Abuse Research.

If you're looking for a good paper on online hate training datasets (beyond our paper, of course!) then have a look at 'Resources and benchmark corpora for hate speech detection: a systematic review' by Poletto et al. in Language Resources and Evaluation.