| Authors | Year | Paper | Type of task | Dataset source | Annotators | Granularity | Extractive (E) / Abstractive (A) | Size | Dataset link | Instructions | Collection aim | # workers |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zaidan et al. | 2007 | Using “Annotator Rationales” to Improve Machine Learning for Text Categorization | Sentiment classification | Reviews (IMDB) | Authors | Snippets | E | 1,800 with rationales + 200 without | link | To justify why a review is positive, highlight the most important words and phrases that would tell someone to see the movie | ML learning | |
| Yano et al. | 2010 | Shedding (a Thousand Points of) Light on Biased Language | Bias classification | American political blog posts | Crowdsourcing (MTurk) | Snippets | E | 1k sentences | | Workers are asked to check the box to indicate the region that “gives away” the bias | Task insight | 5? |
| Abedin et al. | 2011 | Learning Cause Identifiers from Annotator Rationales | Identify causes of aviation incidents | Aviation safety reports | Author and a student worker | Snippets | E | 1,233 docs with rationales + 1,000 unlabeled | | Annotators are asked to “do their best to mark enough rationales to provide convincing support for the class of interest”, but are not expected to “go out of their way to mark everything” | Reduce required data size, ML learning | 2 |
| McDonnell et al. | 2016 | Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments | Webpage relevance | Webpages | Crowdsourcing (MTurk) | Sentences | E + A | 10,000+ | link | | Improve data quality, transparency, verification | |
| Chhatwal et al. | 2018 | Explainable Text Classification in Legal Document Review: A Case Study of Explainable Predictive Coding | Determine responsiveness of legal documents | Legal documents | Legal domain experts | Snippets | E + A | 688,294 documents (email, Microsoft Office documents, PDFs, and other text-based documents); rationales only for responsive documents | | Justification of coding a document as responsive | ML learning | |
| Carton et al. | 2018 | Extractive Adversarial Networks: High-Recall Explanations for Identifying Personal Attacks in Social Media Posts | Personal attacks in comments | Wikipedia revision comments | Students | Snippets | E | 1,089 | | To highlight sections of comments that they considered to constitute personal attacks | Explanation verification | 40 |
| Bao et al. | 2018 | Deriving Machine Attention from Human Rationales | Sentiment classification | Reviews (BeerAdvocate, Tripadvisor) | Students | Snippets | E | 200 | link | To highlight rationales that are short and coherent, yet sufficient for supporting the label | ML learning | 5 |
| Ramirez et al. | 2019 | Understanding the Impact of Text Highlighting in Crowdsourcing Tasks | Topic classification | Reviews (Amazon), Systematic Literature Review (SLR) | Crowdsourcing (Figure Eight) | Snippets + Sentences | E + A | 400, 135 + 150 | link | Explain your decision. Tell us why you think the paper is relevant / If you were to select one or more sentences most useful for your decision, which ones would you select? | Enrich data / improve classification by humans | 449, 424 + 464 = 1,337 |
| Wang et al. | 2019 | Learning from Explanations with Neural Execution Tree | Relation extraction + sentiment classification | Reviews (restaurant, laptop) | Crowdsourcing (MTurk) | Sentences | A | 170 (TACRED), 203 (SemEval), 40 (Laptop), 45 (Restaurant) | | | Augment labels | |
| Sharma et al. | 2021 | A Computational Approach to Understanding Empathy Expressed in Text-Based Mental Health Support | Empathy expression | TalkLife, Reddit | Crowdsourcing (Upwork) | Snippets | E | approx. 10k | link | Along with the categorical annotations, crowdworkers were also asked to highlight portions of the response post that formed the rationale behind their annotation | Gold rationales | 8 |
| Sen et al. | 2020 | Human Attention Maps for Text Classification: Do Humans and Neural Networks Focus on the Same Words? | Sentiment classification | Yelp | Crowdsourcing (MTurk) | Snippets | E | 5,000 reviews, 3 annotators per review | link | Participants are asked to complete two tasks: 1) identify the sentiment of the review as positive, negative, or neither, and 2) highlight ALL the words that are indicative of the chosen sentiment | Gold rationales | 3 |
| Kanchinadam et al. | 2020 | Rationale-based Human-in-the-Loop via Supervised Attention | Sentiment classification | Reviews (IMDB) | Crowdsourcing (MTurk) | Snippets | E | 22k | link | | ML learning | |
| Kutlu et al. | 2020 | Annotator Rationales for Labeling Tasks in Crowdsourcing | Webpage relevance | Webpages | Crowdsourcing (MTurk) | Sentences | E + A | 10,000+ | | Please copy and paste 2–3 sentences from the web page which you believe support your decision / Please describe in words why you agree or disagree with Tom’s decision | Improve data quality, transparency, and verification | |
| Socher et al. | 2013 | Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank | Sentiment classification | Sentences from movie reviews (Rotten Tomatoes) | Crowdsourcing (MTurk) | Snippets + Sentences | E | 215,154 phrases | | | ML learning | |
| Sap et al. | 2020 | Social Bias Frames: Reasoning about Social and Power Implications of Language | Implication classification | Twitter, Reddit, Gab, Stormfront | Crowdsourcing (MTurk) | Sentences | A | 34k implications | link | What aspect … of this group is referenced or implied in this post? Use simple phrases | Gold rationales, ML learning | |
| Vidgen et al. | 2021 | Introducing CAD: the Contextual Abuse Dataset | Abusive content | Reddit | Experts/trained annotators | Snippets | E | approx. 25k | link | For each entry they highlighted the part of the text which contains the abuse | Gold rationales | 2 |
| Mohseni et al. | 2021 | Quantitative Evaluation of Machine Learning Explanations: A Human-Grounded Benchmark | Sentiment and topic classification | 20 Newsgroups, Reviews (IMDB) | Crowdsourcing (MTurk) | Snippets | E | 100 IMDB, 100 20 Newsgroups | link | “Select words and phrases which explain the positive (or negative) sentiment of the movie review” (the label is given) | Gold rationales | 200 |
| Arous et al. | 2021 | MARTA: Leveraging Human Rationales for Explainable Text Classification | Topic relevance classification | Wikipedia articles | Crowdsourcing (MTurk) | Snippets | E | 1,413 | link | Workers were asked to annotate the article and provide a snippet from the text as a justification | ML learning, explainability | 58 |
| Chalkidis et al. | 2021 | Paragraph-level Rationale Extraction through Regularization: A Case Study on European Court of Human Rights Cases | Alleged violation prediction | European Court of Human Rights cases | Legal domain experts | Paragraphs | E | 11k silver, 50 gold rationales | link | To identify the relevant facts (paragraphs) | Gold rationales | 1 (for gold rationales) |
| Hayati et al. | 2021 | Does BERT Learn as Humans Perceive? Understanding Linguistic Styles through Lexica | Style classification | Stanford Politeness, Stanford Sentiment Treebank, a tweet dataset for offensiveness, a dataset for emotion classification | Crowdsourcing (Prolific) | Words | E | 500 texts | link | Each worker was asked what styles they perceive each text to exhibit; if they think the text has certain styles, workers then highlight the words which they believe make them think the text has those styles | Feature importance verification | 622 |
| Mathew et al. | 2021 | HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection | Hate speech detection | Twitter and Gab | Crowdsourcing (MTurk) | Snippets | E | 9,055 Twitter + 11,093 Gab | link | If the text is considered hate speech or offensive by a majority of the annotators, we further ask the annotators to annotate the parts of the text (words or phrases) that could be a potential reason for the given annotation | Explainability and bias reduction for ML models | 253 |
| Malik et al. | 2021 | ILDC for CJPE: Indian Legal Documents Corpus for Court Judgment Prediction and Explanation | Predict outcomes of legal cases | Indian Supreme Court cases | Legal domain experts | Sentences | E | 56 | link | 1) predict the judgement, and 2) mark the sentences that they think explain the judgement | Gold rationales | 5 |
| Wang et al. | 2022 | Ranking-Constrained Learning with Rationales for Text Classification | Topic classification | AIvsCR, arXiv (scientific articles) | Authors | Snippets | E | 394 annotated with rationales + 2k docs | | To justify why … is …, highlight the most important words and phrases, but not all | Enrich data with rationales | 2 |
| Guzman et al. | 2022 | RaFoLa: A Rationale-Annotated Corpus for Detecting Indicators of Forced Labour | Forced labour indicator detection | News articles | Two annotators (30+, with a master’s degree) | Snippets | E | 989 | link | … we are asking you to identify the risks of forced labour in news articles … tag what phrases or sentences led you to decide the presence of that indicator … highlighting the phrases/sentences that support your decision | Enrich data with rationales | 2 |
| Jayaram et al. | 2021 | Human Rationales as Attribution Priors for Explainable Stance Detection | Stance detection (pro/con) | VAST (comments from The New York Times) | Crowdsourcing (MTurk) | Words | E | 775 | link | Workers are asked to (1) classify the stance of an argument with respect to a topic and (2) select the k most important words in the argument (for each example, an acceptable range of values for k is provided); a word is considered important if masking it would make (1) more difficult | Improve ML models in data-scarce settings | 3 |
| Meldo et al. | 2020 | The Natural Language Explanation Algorithms for the Lung Cancer Computer-Aided Diagnosis System | Lung cancer image classification | LUNA16 (lung CT scans) | Medical domain experts (doctors) | Sentences | A | 240 | | | Gold rationales | |
| Zini et al. | 2022 | On the Evaluation of the Plausibility and Faithfulness of Sentiment Analysis Explanations | Sentiment classification | Reviews (Rotten Tomatoes) | Data scientists | Words | E | 1,973 rationale sentences | link | | Gold rationales | 10 |
| Srivastava et al. | 2020 | Robustness to Spurious Correlations via Human Annotations | Medical diagnosis, handwriting recognition, police stops | Multiple | Crowdsourcing (MTurk) | Sentences | A | 3 datasets | link | What transformation do you think happened to the image? / What factors do you think led to the individual being stopped and [arrested/not arrested]? | Reduce spurious correlations | 3 per annotation |
| Lu et al. | 2022 | A Rationale-Centric Framework for Human-in-the-loop Machine Learning | Sentiment classification | Reviews (IMDB) | Crowdsourcing | Snippets | E | 5,073 rationales in 855 movie reviews | link | | Eliminate the effect of spurious patterns by leveraging human knowledge | |
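Most corpora in the table store extractive (E) rationales as highlighted spans over the source text and abstractive (A) rationales as free-text justifications. As a minimal sketch of how such a record is commonly handled, assuming a hypothetical schema (the field names below do not come from any of the datasets listed), the snippet converts annotator-highlighted character spans into the binary token mask that rationale-supervised classifiers typically consume:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class RationaleExample:
    """Hypothetical record layout for one rationale-annotated example."""
    text: str
    label: str
    # Extractive (E): character spans the annotator highlighted in `text`.
    extractive_spans: List[Tuple[int, int]] = field(default_factory=list)
    # Abstractive (A): free-text justification written by the annotator.
    abstractive_rationale: str = ""

def token_rationale_mask(ex: RationaleExample):
    """Turn highlighted character spans into a 0/1 mask over whitespace tokens."""
    tokens, mask, pos = [], [], 0
    for tok in ex.text.split():
        start = ex.text.index(tok, pos)  # locate token in the original string
        end = start + len(tok)
        pos = end
        tokens.append(tok)
        # A token counts as rationale if it overlaps any highlighted span.
        mask.append(int(any(s < end and start < e for s, e in ex.extractive_spans)))
    return tokens, mask

ex = RationaleExample(
    text="The acting was superb but the plot dragged badly.",
    label="negative",
    extractive_spans=[(30, 49)],  # highlights "plot dragged badly."
)
print(token_rationale_mask(ex))
# (['The', 'acting', 'was', 'superb', 'but', 'the', 'plot', 'dragged', 'badly.'],
#  [0, 0, 0, 0, 0, 0, 1, 1, 1])
```

The word- and snippet-level datasets above map onto such a mask directly; sentence- and paragraph-level ones (e.g., Chalkidis et al.) use the same idea with sentence or paragraph offsets instead of token offsets.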