-
ViHSD - Vietnamese Hate Speech Detection on Social Media Texts
A large-scale dataset for Vietnamese hate speech detection on social media texts. The dataset was crawled from Facebook and YouTube and manually annotated by humans. -
Kurrek et al. Towards a Comprehensive Taxonomy and Large-Scale Annotated Corp...
The dataset addresses discrimination across sexuality, ethnicity, and gender. Posts are sampled by slur ("f*ggot", "n*gger", "tr*nny") from several subreddits from October... -
Wulczyn et al. Personal Attacks on Wikipedia Dataset
Dataset of hateful Wikipedia comments. The data was sampled by combining random sampling with oversampling of banned comments. Annotation was crowdsourced, and each comment was... -
Founta et al. Hate and Abusive Speech on Twitter
Dataset of tweets collected from 30th March 2017 to 9th April 2017 with a boosted random sampling technique, by using text analysis and preliminary crowdsourcing rounds to... -
Multi-lingual of Dirty, Naughty, Obscene, and Otherwise Bad Words from Shutte...
The repo contains a list of words that Shutterstock uses to filter results from our autocomplete server and recommendation engine. Can be installed in an npm project by: npm... -
YouTube Blacklist Words
The YouTube Blacklist Words List includes a list of unacceptable words, inappropriate words, a list of swear words, offensive words, curse words, insulting words, all cuss words,... -
WordPress Comment Blacklist Words
WordPress Comment Blacklist Words, WordPress Comment Moderation, and WordPress Comment Spam. The WordPress Comment Blacklist Words/Phrases include a list of swears, unacceptable... -
Davidson et al. Crowd-sourced Hate Speech On Twitter Dataset
Dataset of hateful tweets sampled from Twitter using keywords. Labelled via CrowdFlower, with 3+ people annotating each tweet. The majority decision was taken, with 92% annotator agreement. -
Fernando Hate Speech Dataset in Sinhalese from Twitter
Datasets contain racism and sexism in Sinhalese from Twitter. The data was sampled using pre-identified keywords from surveys and experts. The data was annotated by experts... -
Caselli et al. Implicit/Explicit Expansion on OLID
This dataset expands OLID/OffensEval (the Offensive Language Identification Dataset; Zampieri et al., 2019a) by adding the explicitness of the message. The OLID data was... -
Breitfeller et al. Microaggressions Dataset
Dataset of self-reported microaggressions from microaggressions.com. 2,934 posts were collected, targeting gender (1,314 posts), race (1,278 posts), sexuality (461 posts),... -
Alfina et al. Hate Speech Detection in the Indonesian Language from Twitter
Dataset of hate speech in Indonesian, including hatred for religion, race, ethnicity, and gender. Posts from Twitter are sampled using hashtags relevant to contentious political... -
Mathur et al. Hinglish Sexism on Twitter Dataset
Dataset of Hinglish sexist Tweets sampled by crawling popular hashtags and well-known people. Tweets were labelled by experts, with an average Cohen's kappa of 0.83. -
Ljubešić et al. Slovene Moderated News Comments
Dataset of moderated news comments from Slovene RTV MCC. Comments were labelled by expert annotators based on the type of inappropriate content. Note that this data is encrypted. -
Ljubešić et al. Croatian Moderated News Comments
Dataset of moderated news comments from Croatian 24sata. Comments were labelled by expert annotators based on the type of inappropriate content. Note that this data is encrypted. -
Fortuna et al. A Hierarchically-Labeled Portuguese Hate Speech Dataset From Tw...
Dataset contains hate speech in Portuguese sampled from Twitter with 81 categories. The dataset is manually annotated for Hate Speech using a hierarchical structure of classes.... -
Mubarak et al. Abuse in Arabic Social Media Dataset
Dataset 1 includes offensive Arabic tweets sampled in March 2014 using obscene keywords and hashtags used for pornographic pages (available as a .txt file word list). Dataset 2... -
Ibrohim and Budi Abuse in Indonesian Twitter Dataset
Dataset of abusive tweets sampled with offensive terms. Tweets were annotated by 20 volunteer annotators and labelled by at least 3 people each. Only tweets with 100% annotators... -
Ibrohim and Budi Multi-label Hate Speech and Abusive Language Detection in In...
Dataset of hate speech and abusive language sampled from Twitter by using keywords and keyphrases. The dataset includes posts from March 2018 until September 2018 and integrated... -
Alakrot et al. Dataset Construction for the Detection of Anti-Social Behaviou...
Datasets contain offensive comments from YouTube. The data was sampled from 2015 to 2017 and collected in July 2017. Channels with controversial comments about celebrities were... -
Bretschneider and Peters Cyberbullying on WoW and LoL Forum Dataset
Dataset collected from the World of Warcraft (dataset 1) and League of Legends (dataset 2) forums. 20 topics were selected for each dataset based on offensive terms from... -
Mulki et al. A Levantine Twitter Dataset for Hate Speech and Abusive Language...
Dataset of hate speech and abusive language. It was streamed from Twitter from March 2018 to February 2019 using relevant terms and from specified users, who have 100k... -
Bretschneider and Peters Prejudice on Facebook Dataset
Dataset of Facebook posts and comments published in response to them from the Facebook pages “Pegida” (dataset 1), “Ich bin Patriot, aber kein Nazi” (“I’m a patriot, not a... -
Albadi et al. Arabic Religious Hate on Twitter
Dataset of Arabic religious hate tweets sampled using neutral religious names as keywords. Annotation was crowdsourced using CrowdFlower, with a minimum of 3 annotations per... -
Jha and Mamidi Sexism on Twitter Dataset
Dataset of sexist tweets sampled using benevolent sexist key phrases, from which 712 tweets were manually selected by the authors and validated by three non-activist...