HASOC 2019: Hate Speech and Offensive Content Identification in Indo-European Languages

Hate speech dataset for Hindi, German and English. The three datasets were sampled from Twitter and Facebook by topics, hashtags, other keywords and user timelines (last posts). Inter-annotator agreement (Kappa coefficient) on subtask A is 0.36 for English, 0.59 for Hindi and 0.43 for German. Inter-rater agreement percentages are given for all three subtasks (e.g. subtask A: English: 72%, Hindi: 83%, German: 96%).
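The gap between the Kappa values and the percentage agreements above is easier to see with a small worked example. The sketch below assumes Cohen's kappa for two annotators (the entry does not say which Kappa variant was used) and uses made-up labels, not HASOC data; the function names are illustrative.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label
    # if each labels independently according to their own label frequencies.
    p_e = sum(counts_a[label] * counts_b[label]
              for label in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels (illustrative only): an imbalanced binary task.
annotator_1 = ["NOT"] * 8 + ["HOF"] * 2
annotator_2 = ["NOT"] * 7 + ["HOF", "NOT", "HOF"]

raw = percent_agreement(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"raw agreement: {raw:.2f}, kappa: {kappa:.3f}")
```

With imbalanced labels, a raw agreement of 80% in this toy case corresponds to a kappa of only about 0.375, because much of the agreement is expected by chance.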

Additional Info

Field Value
Authors Mandl, T., Modha, S., Majumder, P., Patel, D., Dave, M., Mandlia, C. and Patel, A., 2019
Author contact email Mandl, T., Modha, S., Majumder, P., Patel, D., Dave, M., Mandlia, C. and Patel, A., 2019
Publication / paper reference Mandl, T., Modha, S., Majumder, P., Patel, D., Dave, M., Mandlia, C. and Patel, A., 2019. Overview of the HASOC track at FIRE 2019. In: Proceedings of the 11th Forum for Information Retrieval Evaluation.
Publication / paper link https://dl.acm.org/doi/10.1145/3368567.3368584
Dataset about page https://hasocfire.github.io/hasoc/2019/dataset.html
Language(s) covered English, Hindi, German
Source data platform(s) Twitter, Facebook
Annotation schema description hierarchical (A: hate/offensive or neither; B: hate speech, offensive or profane; C: targeted or untargeted)
Phenomena annotated group-directed and person-directed hate-offensive
Level of instances Single comment / post
Data statement link N/A
Total number of instances in dataset English: 7005, Hindi: 5983, German: 4669
Proportion of positive/abusive instances English: 0.36, Hindi: 0.51, German: 0.24
Submitter Philine Zeinert
Submitter Email phze@itu.dk
State active
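The instance totals and positive proportions listed above can be combined into approximate per-language counts. The sketch below does only that arithmetic; because the proportions are rounded to two decimals, the results are estimates, not official figures from the dataset. The variable and function names are illustrative.

```python
# Values copied from the "Total number of instances" and
# "Proportion of positive/abusive instances" fields of this entry.
totals = {"English": 7005, "Hindi": 5983, "German": 4669}
positive_share = {"English": 0.36, "Hindi": 0.51, "German": 0.24}


def estimate_positive_counts(totals, shares):
    """Approximate positive-instance counts; shares are rounded, so counts are estimates."""
    return {lang: round(totals[lang] * shares[lang]) for lang in totals}


estimated = estimate_positive_counts(totals, positive_share)
for lang, count in estimated.items():
    print(f"{lang}: about {count} of {totals[lang]} instances are positive")
```

An estimate like this is mainly useful for judging class imbalance before training, for example noting that the German portion has far fewer positive instances than the Hindi one.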