About hatespeechdata.com

This site catalogues datasets annotated for hate speech, online abuse, and offensive language. They may be useful, for example, for training a natural language processing system to detect such language.


The site is maintained by Leon Derczynski and Bertie Vidgen.

Adding a resource

To submit your dataset, please register as a user and upload it. Accompanying data statements are preferred for all corpora. You can find a Markdown template for a data statement here. Datasets undergo an approval process before being made live on the site.


If you use these resources, please cite (and read!) our paper:

Vidgen B, Derczynski L (2020) Directions in Abusive Language Training Data, a Systematic Review: Garbage In, Garbage Out. PLoS ONE 15(12): e0243300. https://doi.org/10.1371/journal.pone.0243300.

And if you would like to find other resources for researching online hate, visit The Alan Turing Institute's Online Hate Research Hub or read The Alan Turing Institute's Reading List on Online Hate and Abuse Research.

If you're looking for a good paper on online hate training datasets (beyond our paper, of course!) then have a look at 'Resources and benchmark corpora for hate speech detection: a systematic review' by Poletto et al. in Language Resources and Evaluation.