| Authors | Year | Paper | Type of task | Dataset source | Annotators | Granularity | Extractive / Abstractive | Size | Link dataset | Instructions | Collection aim | # workers |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zaidan et al. | 2007 | Using “Annotator Rationales” to Improve Machine Learning for Text Categorization | Sentiment classification | Reviews (IMDB) | Authors | Snippets | E | 1,800 with rationales + 200 without | link | To justify why a review is positive, highlight the most important words and phrases that would tell someone to see the movie | ML learning | |
| Yano et al. | 2010 | Shedding (a Thousand Points of) Light on Biased Language | Bias classification | American political blog posts | Crowdsourcing (MTurk) | Snippets | E | 1k sentences | | Workers are asked to check the box to indicate the region which “gives away” the bias | Task insight | 5? |
| Abedin et al. | 2011 | Learning Cause Identifiers from Annotator Rationales | Identify cause of aviation incident | Aviation safety reports | Author and student worker | Snippets | E | 1,233 docs with rationales + 1,000 unlabeled | | Annotators are asked to “do their best to mark enough rationales to provide convincing support for the class of interest”, but are not expected to “go out of their way to mark everything” | Reduce required data size, ML learning | 2 |
| McDonnell et al. | 2016 | Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments | Webpage relevance | Webpages | Crowdsourcing (MTurk) | Sentences | E + A | 10,000+ | link | | Improve data quality, transparency, verification | |
| Chhatwal et al. | 2018 | Explainable Text Classification in Legal Document Review: A Case Study of Explainable Predictive Coding | Determine responsiveness of legal documents | Legal documents | Legal domain experts | Snippets | E + A | 688,294 documents (email, Microsoft Office documents, PDFs, and other text-based documents); rationales only for responsive documents | - | Justification of coding a document as responsive | ML learning | |
| Carton et al. | 2018 | Extractive Adversarial Networks: High-Recall Explanations for Identifying Personal Attacks in Social Media Posts | Personal attacks in comments | Wikipedia revision comments | Students | Snippets | E | 1,089 | | To highlight sections of comments that they considered to constitute personal attacks | Explanation verification | 40 |
| Bao et al. | 2018 | Deriving Machine Attention from Human Rationales | Sentiment classification | Reviews (BeerAdvocate, TripAdvisor) | Students | Snippets | E | 200 | link | To highlight rationales that are short and coherent, yet sufficient for supporting the label | ML learning | 5 |
| Ramirez et al. | 2019 | Understanding the Impact of Text Highlighting in Crowdsourcing Tasks | Topic classification | Reviews (Amazon), Systematic Literature Review (SLR) | Crowdsourcing (Figure Eight) | Snippets + Sentences | E + A | 400, 135 + 150 | link | Explain your decision. Tell us why you think the paper is relevant / If you were to select one or more sentences most useful for your decisions, which ones would you select? | Enrich data / improve classification by humans | 449, 424 + 464 = 1,337 |
| Wang et al. | 2019 | Learning from Explanations with Neural Execution Tree | Relation extraction + sentiment classification | Reviews (restaurant, laptop) | Crowdsourcing (MTurk) | Sentences | A | 170 (TACRED), 203 (SemEval), 40 (Laptop), 45 (Restaurant) | | | Augment labels | |
| Sharma et al. | 2021 | A Computational Approach to Understanding Empathy Expressed in Text-Based Mental Health Support | Empathy expression | TalkLife, Reddit | Crowdsourcing (Upwork) | Snippets | E | approx. 10k | link | Along with the categorical annotations, crowdworkers were also asked to highlight portions of the response post that formed the rationale behind their annotation | Gold rationales | 8 |
| Sen et al. | 2020 | Human Attention Maps for Text Classification: Do Humans and Neural Networks Focus on the Same Words? | Sentiment classification | Yelp | Crowdsourcing (MTurk) | Snippets | E | 5,000 reviews, 3 annotators per review | link | Participants are asked to complete two tasks: 1) identify the sentiment of the review as positive, negative, or neither, and 2) highlight (ALL) the words that are indicative of the chosen sentiment | Gold rationales | 3 |
| Kanchinadam et al. | 2020 | Rationale-based Human-in-the-Loop via Supervised Attention | Sentiment classification | Reviews (IMDB) | Crowdsourcing (MTurk) | Snippets | E | 22k | link | | ML learning | |
| Kutlu et al. | 2020 | Annotator Rationales for Labeling Tasks in Crowdsourcing | Webpage relevance | Webpages | Crowdsourcing (MTurk) | Sentences | E + A | 10,000+ | | Please copy and paste 2-3 sentences of text from the web page which you believe support your decision / Please describe in words why you agree or disagree with Tom’s decision | Improve data quality, transparency, and verification | |
| Socher et al. | 2013 | Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank | Sentiment classification | Sentences from movie reviews (Rotten Tomatoes) | Crowdsourcing (MTurk) | Snippets + Sentences | E | 215,154 phrases | | - | ML learning | |
| Sap et al. | 2020 | Social Bias Frames: Reasoning about Social and Power Implications of Language | Implications in text classification | Twitter, Reddit, Gab, Stormfront | Crowdsourcing (MTurk) | Sentences | A | 34k implications | link | What aspect …/… of this group is referenced or implied in this post? Use simple phrases | Gold rationales, ML learning | |
| Vidgen et al. | 2021 | Introducing CAD: the Contextual Abuse Dataset | Abusive content | Reddit | Experts/trained annotators | Snippets | E | approx. 25k | link | For each entry they highlighted the part of the text which contains the abuse | Gold rationales | 2 |
| Mohseni et al. | 2021 | Quantitative Evaluation of Machine Learning Explanations: A Human-Grounded Benchmark | Sentiment and topic classification | 20 Newsgroups, Reviews (IMDB) | Crowdsourcing (MTurk) | Snippets | E | 100 IMDB, 100 20 Newsgroups | link | “Select words and phrases which explain the positive (or negative) sentiment of the movie review” (the label is given) | Gold rationales | 200 |
| Arous et al. | 2021 | MARTA: Leveraging Human Rationales for Explainable Text Classification | Topic relevance classification | Wikipedia articles | Crowdsourcing (MTurk) | Snippets | E | 1,413 | link | Workers were asked to annotate the article and provide a snippet from the text as a justification | ML learning, explainability | 58 |
| Chalkidis et al. | 2021 | Paragraph-level Rationale Extraction through Regularization: A Case Study on European Court of Human Rights Cases | Alleged violation prediction | European Court of Human Rights cases | Legal domain experts | Paragraphs | E | 11k silver, 50 gold rationales | link | To identify the relevant facts (paragraphs) | Gold rationales | 1 (for gold rationales) |
| Hayati et al. | 2021 | Does BERT Learn as Humans Perceive? Understanding Linguistic Styles through Lexica | Style classification | Stanford Politeness, Stanford Sentiment Treebank, a tweet dataset for offensiveness, a dataset for emotion classification | Crowdsourcing (Prolific) | Words | E | 500 texts | link | Each worker was asked what styles they perceive each of the texts to exhibit. If they think the text has certain styles, workers then highlight the words in the text which they believe make them think the text has those styles | Feature importance verification | 622 |
| Mathew et al. | 2021 | HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection | Hate speech detection | Twitter and Gab | Crowdsourcing (MTurk) | Snippets | E | 9,055 Twitter + 11,093 Gab | link | If the text is considered hate speech or offensive by the majority of the annotators, we further ask the annotators to annotate parts of the text, i.e. words or phrases that could be a potential reason for the given annotation | Explainability and reduced bias for ML models | 253 |
| Malik et al. | 2021 | ILDC for CJPE: Indian Legal Documents Corpus for Court Judgment Prediction and Explanation | Predict outcome of legal cases | Indian Supreme Court cases | Legal domain experts | Sentences | E | 56 | link | 1) Predict the judgement, and 2) mark the sentences that they think are explanations for the judgement | Gold rationales | 5 |
| Wang et al. | 2022 | Ranking-Constrained Learning with Rationales for Text Classification | Topic classification | AIvsCR, arXiv (scientific articles) | Authors | Snippets | E | 394 annotated with rationales + 2k docs | | To justify why … is …, highlight the most important words and phrases, but not all | Enrich data with rationales | 2 |
| Guzman et al. | 2022 | RaFoLa: A Rationale-Annotated Corpus for Detecting Indicators of Forced Labour | Forced labour indicators | News articles | Two annotators (30+ with master’s degree) | Snippets | E | 989 | link | …we are asking you to identify the risks of forced labour in news articles… tag what phrases or sentences led you to decide the presence of that indicator… and highlighting the phrases/sentences that support your decision | Enrich data with rationales | 2 |
| Jayaram et al. | 2021 | Human Rationales as Attribution Priors for Explainable Stance Detection | Stance detection (pro/con) | VAST: comments from The New York Times | Crowdsourcing (MTurk) | Words | E | 775 | link | Workers are asked to (1) classify the stance of an argument with respect to a topic and (2) select the k most important words in the argument (for each example, we provide an acceptable range of values for k). A word is considered to be important if masking it would make (1) more difficult | Improve ML model in data-scarce setting | 3 |
| Meldo et al. | 2020 | The Natural Language Explanation Algorithms for the Lung Cancer Computer-Aided Diagnosis System | Lung cancer image classification | LUNA16 (lung CT scans) | Medical domain experts (doctors) | Sentences | A | 240 | - | - | Gold rationales | |
| Zini et al. | 2022 | On the Evaluation of the Plausibility and Faithfulness of Sentiment Analysis Explanations | Sentiment classification | Reviews (Rotten Tomatoes) | Data scientists | Words | E | 1,973 rationale sentences | link | | Gold rationales | 10 |
| Srivastava et al. | 2020 | Robustness to Spurious Correlations via Human Annotations | Medical diagnosis, handwriting, police domain | Multiple | Crowdsourcing (MTurk) | Sentences | A | 3 datasets | link | What transformation do you think happened to the image? / What factors do you think led to the individual being stopped and [arrested/not arrested]? | Reduce spurious correlations | 3 per annotation |
| Lu et al. | 2022 | A Rationale-Centric Framework for Human-in-the-loop Machine Learning | Sentiment classification | Reviews (IMDB) | Crowdsourcing | Snippets | E | 5,073 rationales in 855 movie reviews | link | | Eliminating the effect of spurious patterns by leveraging human knowledge | |
| Beckh et al. | 2024 | Limitations of Feature Attribution in Long Text Classification of Standards | Assessing the AI readiness of standards and specifications | Technical documents | Human experts | Snippets | E | 1,000 documents | | Annotators were instructed to find and annotate evidence that is enough to justify a label | Task insight, gold rationales | 1 |
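
Across the entries above, most corpora share a common record shape: a text, a class label, and either extractive rationales (highlighted spans at word, snippet, sentence, or paragraph granularity, marked "E") or abstractive rationales (free-text justifications, marked "A"). The sketch below illustrates that shape in Python; all field and class names are hypothetical and do not correspond to the schema of any dataset listed here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical record layout for a rationale-annotated example;
# illustrative only, not the format of any cited dataset.


@dataclass
class RationaleExample:
    text: str                                    # the document, review, or post being labeled
    label: str                                   # e.g. "positive", "hate speech", "relevant"
    extractive_spans: List[Tuple[int, int]] = field(default_factory=list)
    # (start, end) character offsets of highlighted snippets ("E" rows above)
    abstractive_rationales: List[str] = field(default_factory=list)
    # free-text justifications ("A" rows above)
    annotator_id: Optional[str] = None           # useful when several workers label each item

    def extractive_texts(self) -> List[str]:
        """Return the highlighted snippets as strings."""
        return [self.text[start:end] for start, end in self.extractive_spans]


# Usage sketch: one worker assigns a label, highlights a snippet,
# and adds a short abstractive justification.
example = RationaleExample(
    text="The acting was superb, though the plot dragged in places.",
    label="positive",
    extractive_spans=[(0, 22)],
    abstractive_rationales=["The reviewer praises the acting."],
    annotator_id="worker_01",
)
print(example.extractive_texts())  # ['The acting was superb,']
```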