Evaluation Software Errors in the EMNLP 2023 Proceedings

Widespread Errors in NLG Evaluation

If you’ve ever written a paper involving natural language generation, there’s a good chance you evaluated your model using the Rouge metric. Somewhere between 10% and 20% of recent NLP papers use Rouge, either as the main metric (e.g., summarization) or as part of a panel of metrics (e.g., caption generation). This makes Rouge one of the most frequently used metrics in NLP research today. Consequently, the validity of thousands of NLP results depends on Rouge scores being reproducible, comparable, and correct.

Yet among the thousands of NLP papers that evaluate using Rouge, only a fraction cite a specific Rouge software package (33%) or configuration parameters (5%). The choice of Rouge package and parameters can dramatically affect Rouge scores, so when papers omit these details, the reported scores become difficult to compare and reproduce.

Furthermore, many nonstandard Rouge packages have serious implementation errors that result in incorrect Rouge scores. The validity of these incorrect scores is unknown: many nonstandard packages differ meaningfully from the standard ROUGE-1.5.5 implementation, yet unlike ROUGE-1.5.5, they have never been validated against human judgement. Prior work suggests that several thousand papers evaluate using incorrect Rouge packages.
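To make the effect of configuration concrete, here is a minimal sketch of a toy unigram-overlap $F_1$ score computed with and without stemming. It is not ROUGE-1.5.5 or any published package: the `tokenize`, `crude_stem`, and `rouge1_f1` helpers are hypothetical, the example sentences are made up, and the suffix-stripping rule is only a stand-in for Porter stemming.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a simplification, not ROUGE-1.5.5's tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token):
    """Hypothetical stand-in for Porter stemming: strip one trailing 's' from longer tokens."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def rouge1_f1(reference, candidate, stem=False):
    """Toy unigram-overlap F1 between a reference and a candidate string."""
    ref = tokenize(reference)
    cand = tokenize(candidate)
    if stem:
        ref = [crude_stem(t) for t in ref]
        cand = [crude_stem(t) for t in cand]
    # Clipped unigram matches: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "The cats sat on the mats"
candidate = "the cat sat on the mat"

plain = rouge1_f1(reference, candidate)               # no stemming
stemmed = rouge1_f1(reference, candidate, stem=True)  # with the toy stemmer
print(f"without stemming: {plain:.2f}, with stemming: {stemmed:.2f}")
```

On this pair of sentences the same candidate scores about 0.67 without stemming and 1.00 with it, so two papers reporting "Rouge-1" for identical outputs could differ by a wide margin purely because of an unreported parameter.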

Recommendations for Reviewers

Model evaluation is a critical component of empirical NLP research. For several years, the Responsible NLP Research Checklist has asked authors to cite software and parameters for important software packages and specifically identifies Rouge as an example. However, approximately 10% of EMNLP 2023 papers failed to follow these suggestions. Because Rouge evaluation errors and discrepancies can affect the core findings of a paper, these suggestions need to be upgraded to strict requirements for acceptance:

Reviewers should strongly recommend rejection for papers reporting or computing Rouge scores without providing an in-text Rouge software package citation.
Example of Rouge package citation:
We compute Rouge-1, Rouge-2, and Rouge-L $F_1$ scores using the nonstandard Rouge Python package “rouge-score” developed by Google Research. [1]

[1] Package downloaded from: https://pypi.org/project/rouge-score/0.1.2/
Reviewers should strongly recommend rejection for papers using a Rouge package other than the standard ROUGE-1.5.5 implementation, unless the paper contains an appendix section discussing the limitations of the package and the rationale for using it.
Example of nonstandard Rouge package discussion:
We compute Rouge-L scores using the nonstandard MS COCO Rouge package. [1] This implementation differs from the standard ROUGE-1.5.5 implementation of Rouge in several ways, including incorrectly tokenizing sentences and computing nonstandard recall-biased $F_{1.2}$ scores. A major limitation of this Rouge implementation is that it has never been validated against human judgement for any task, including the task featured in this paper. We evaluate using this package to maintain comparability with prior work.

[1] COCO caption toolkit: https://github.com/tylin/coco-caption/tree/3a9afb2
Reviewers should strongly recommend rejection for papers reporting or computing Rouge scores without Rouge configuration parameters, such as stemming.
Example of Rouge configuration parameters:
We compute Rouge $F_1$ scores using the standard ROUGE-1.5.5 implementation and use the “ROUGE-1.5.5 -n 2 -m” command line parameters. The “-m” parameter enables Porter stemming, which is frequently used by prior work on this dataset. All other Rouge parameters, such as bootstrapping, inherit from the ROUGE-1.5.5 defaults.
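The recall-biased $F_{1.2}$ score mentioned in the nonstandard package example above can be compared with the usual $F_1$ using the general F-measure. The sketch below is illustrative only: the `f_beta` helper is hypothetical, and the precision and recall values are made up rather than taken from any paper or package.

```python
def f_beta(precision, recall, beta=1.0):
    """General F-measure; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (recall + b2 * precision)


# Made-up precision and recall for a single candidate summary.
precision, recall = 0.5, 0.8

f1 = f_beta(precision, recall, beta=1.0)   # balanced F1
f12 = f_beta(precision, recall, beta=1.2)  # recall-biased F1.2
print(f"F1 = {f1:.4f}, F1.2 = {f12:.4f}")
```

For these values the script prints `F1 = 0.6154, F1.2 = 0.6421`. The two numbers come from the same precision and recall, which is why a paper needs to state which F-measure its package reports.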

Evaluation Software Errors at EMNLP 2023

(Completed: December 8–12, 2023)

This review examines all papers computing Rouge scores across the entire EMNLP 2023 Main Proceedings, Findings, and all other EMNLP 2023 collocated events. The main finding of the review is that nearly all Rouge scores are either irreproducible or incorrect.

The results below are built automatically from the dataset file, available to download above. For more details, read the timeline, methods, and limitations sections below.

215 total papers with Rouge scores at EMNLP 2023
102 papers in EMNLP 2023 Main Proceedings
95 papers in Findings of EMNLP 2023
18 papers in EMNLP 2023 Collocated Events
10 papers with correct Rouge scores
127 papers with unknown Rouge scores
78 papers with incorrect Rouge scores
8 packages cited with Rouge implementation errors
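The summary counts above can be cross-checked with a few lines of arithmetic. The sketch below hard-codes the reported figures rather than reading the dataset file (whose format is not described here), and the dictionary keys are labels chosen for illustration.

```python
# Counts copied from the summary figures in this review.
total_papers = 215
by_venue = {"main": 102, "findings": 95, "collocated": 18}
by_status = {"correct": 10, "unknown": 127, "incorrect": 78}

# Both breakdowns should account for every reviewed paper.
assert sum(by_venue.values()) == total_papers
assert sum(by_status.values()) == total_papers

# Share of papers in each status, as a percentage rounded to one decimal.
shares = {status: round(100 * n / total_papers, 1) for status, n in by_status.items()}
for status, pct in shares.items():
    print(f"{status}: {pct}%")
```

Both breakdowns sum to 215, and only about 4.7% of the reviewed papers fall into the correct category, with roughly 59.1% unknown and 36.3% incorrect.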

Correct Rouge Scores — EMNLP 2023 Papers
These papers appear to compute correct Rouge scores based on paper and code release review. Correct Rouge scores are computed by (1) the benchmark ROUGE-1.5.5 package, (2) an alternative Rouge package with Rouge scores identical to ROUGE-1.5.5, or (3) an alternative Rouge package specifically designed and used for non-English or multilingual evaluation.

EMNLP 2023 Main Proceedings
Better Quality Pre-training Data and T5 Models for African Languages
Authors: Akintunde Oladipo, Mofetoluwa Adeyemi, Orevaoghene Ahia, Abraham Owodunni, Odunayo Ogundepo, David Adelani, Jimmy Lin
Package: Multilingual Rouge
GEMINI: Controlling The Sentence-Level Summary Style in Abstractive Text Summarization
Authors: Guangsheng Bao, Zebin Ou, Yue Zhang
Package: Standard ROUGE-1.5.5 (custom wrapper)
DisCo: Distilled Student Models Co-training for Semi-supervised Text Mining
Authors: Weifeng Jiang, Qianren Mao, Chenghua Lin, Jianxin Li, Ting Deng, Weiyi Yang, Zheng Wang
Package: Standard ROUGE-1.5.5
MILDSum: A Novel Benchmark Dataset for Multilingual Summarization of Indian Legal Case Judgments
Authors: Debtanu Datta, Shubham Soni, Rajdeep Mukherjee, Saptarshi Ghosh
Package: Multilingual Rouge
Findings of the ACL: EMNLP 2023
Understanding Translationese in Cross-Lingual Summarization
Authors: Jiaan Wang, Fandong Meng, Yunlong Liang, Tingyi Zhang, Jiarong Xu, Zhixu Li, Jie Zhou
Package: Multilingual Rouge
Towards a Unified Framework for Reference Retrieval and Related Work Generation
Authors: Zhengliang Shi, Shen Gao, Zhen Zhang, Xiuying Chen, Zhumin Chen, Pengjie Ren, Zhaochun Ren
Package: Standard ROUGE-1.5.5
Hierarchical Catalogue Generation for Literature Review: A Benchmark
Authors: Kun Zhu, Xiaocheng Feng, Xiachong Feng, Yingsheng Wu, Bing Qin
Package: Standard ROUGE-1.5.5
mLongT5: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences
Authors: David Uthus, Santiago Ontanon, Joshua Ainslie, Mandy Guo
Package: Multilingual Rouge
D²TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization
Authors: Yunlong Liang, Fandong Meng, Jiaan Wang, Jinan Xu, Yufeng Chen, Jie Zhou
Package: Multilingual Rouge
4th New Frontiers in Summarization Workshop
Zero-Shot Cross-Lingual Summarization via Large Language Models
Authors: Jiaan Wang, Yunlong Liang, Fandong Meng, Beiqi Zou, Zhixu Li, Jianfeng Qu, Jie Zhou
Package: Multilingual Rouge

Unknown Rouge Scores — EMNLP 2023 Papers
At the time of review (see review timeline below), these papers are missing Rouge software package citations. Because each Rouge package computes different Rouge scores, it is unclear whether Rouge scores reported in these papers are correct or comparable with prior work.

Proceedings of CoNLL 2023
Enhancing Code-mixed Text Generation Using Synthetic Data Filtering in Neural Machine Translation
Authors: Dama Sravani, Radhika Mamidi
Package: Unknown
Towards Better Evaluation of Instruction-Following: A Case-Study in Summarization
Authors: Ondrej Skopek, Rahul Aralikatte, Sian Gooding, Victor Carbune
Package: Unknown
JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models
Authors: Yuiga Wada, Kanta Kaneda, Komei Sugiura
Package: Unknown
MuLER: Detailed and Scalable Reference-based Evaluation
Authors: Taelin Karidi, Leshem Choshen, Gal Patel, Omri Abend
Package: Unknown
EMNLP 2023 Main Proceedings
Let’s Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought
Authors: Vaishnavi Himakunthala, Andy Ouyang, Daniel Rose, Ryan He, Alex Mei, Yujie Lu, Chinmay Sonar, Michael Saxon, William Wang
Package: Unknown
GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP
Authors: Md Tawkat Islam Khondaker, Abdul Waheed, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed
Package: Unknown
Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models
Authors: Gangwoo Kim, Sungdong Kim, Byeongguk Jeon, Joonsuk Park, Jaewoo Kang
Notes: Abstract Mentions Rouge Scores
Package: Unknown
Is ChatGPT a General-Purpose Natural Language Processing Task Solver?
Authors: Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, Diyi Yang
Package: Unknown
Location-Aware Visual Question Generation with Lightweight Models
Authors: Nicholas Suwono, Justin Chen, Tun Hung, Ting-Hao Huang, I-Bin Liao, Yung-Hui Li, Lun-Wei Ku, Shao-Hua Sun
Notes: Abstract Mentions Rouge Scores
Package: Unknown
A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding
Authors: Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan Plummer, Kate Saenko, Jianmo Ni, Mandy Guo
Package: Unknown
OpenAsp: A Benchmark for Multi-document Open Aspect-based Summarization
Authors: Shmuel Amar, Liat Schiff, Ori Ernst, Asi Shefer, Ori Shapira, Ido Dagan
Package: Unknown
TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models
Authors: Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, Idan Szpektor
Package: Unknown
Promoting Topic Coherence and Inter-Document Consorts in Multi-Document Summarization via Simplicial Complex and Sheaf Graph
Authors: Yash Atri, Arun Iyer, Tanmoy Chakraborty, Vikram Goyal
Package: Unknown
TempTabQA: Temporal Question Answering for Semi-Structured Tables
Authors: Vivek Gupta, Pranshu Kandoi, Mahek Vora, Shuo Zhang, Yujie He, Ridho Reinanda, Vivek Srikumar
Package: Unknown
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Authors: Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu
Notes: Abstract Mentions Rouge Scores
Package: Unknown
Learning Retrieval Augmentation for Personalized Dialogue Generation
Authors: Qiushi Huang, Shuai Fu, Xubo Liu, Wenwu Wang, Tom Ko, Yu Zhang, Lilian Tang
Package: Unknown
Indicative Summarization of Long Discussions
Authors: Shahbaz Syed, Dominik Schwabe, Khalid Khatib, Martin Potthast
Package: Unknown
Evaluating Large Language Models on Controlled Generation Tasks
Authors: Jiao Sun, Yufei Tian, Wangchunshu Zhou, Nan Xu, Qian Hu, Rahul Gupta, John Wieting, Nanyun Peng, Xuezhe Ma
Package: Unknown
CS2W: A Chinese Spoken-to-Written Style Conversion Dataset with Multiple Conversion Types
Authors: Zishan Guo, Linhao Yu, Minghui Xu, Renren Jin, Deyi Xiong
Package: Unknown
Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation
Authors: Da Yin, Xiao Liu, Fan Yin, Ming Zhong, Hritik Bansal, Jiawei Han, Kai-Wei Chang
Package: Unknown
Prompting Large Language Models with Chain-of-Thought for Few-Shot Knowledge Base Question Generation
Authors: Yuanyuan Liang, Jianing Wang, Hanlun Zhu, Lei Wang, Weining Qian, Yunshi Lan
Notes: Abstract Mentions Rouge Scores
Package: Unknown
Interactive Text Generation
Authors: Felix Faltings, Michel Galley, Kianté Brantley, Baolin Peng, Weixin Cai, Yizhe Zhang, Jianfeng Gao, Bill Dolan
Package: Unknown
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Authors: Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, Sumit Sanghai
Package: Unknown
QUDeval: The Evaluation of Questions Under Discussion Discourse Parsing
Authors: Yating Wu, Ritika Mangla, Greg Durrett, Junyi Li
Package: Unknown
EntSUMv2: Dataset, Models and Evaluation for More Abstractive Entity-Centric Summarization
Authors: Dhruv Mehra, Lingjue Xie, Ella Hofmann-Coyle, Mayank Kulkarni, Daniel Preotiuc-Pietro
Package: Unknown
MediaHG: Rethinking Eye-catchy Features in Social Media Headline Generation
Authors: Boning Zhang, Yang Yang
Notes: Abstract Mentions Rouge Scores
Package: Unknown
Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction
Authors: Ji Qi, Chuchun Zhang, Xiaozhi Wang, Kaisheng Zeng, Jifan Yu, Jinxin Liu, Lei Hou, Juanzi Li, Xu Bin
Notes: Received Paper Award
Package: Unknown
Answering Questions by Meta-Reasoning over Multiple Chains of Thought
Authors: Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, Jonathan Berant
Package: Unknown
Abstractive Open Information Extraction
Authors: Kevin Pei, Ishan Jindal, Kevin Chang
Package: Unknown
ReTAG: Reasoning Aware Table to Analytic Text Generation
Authors: Deepanway Ghosal, Preksha Nema, Aravindan Raghuveer
Package: Unknown
Evaluation of African American Language Bias in Natural Language Generation
Authors: Nicholas Deas, Jessica Grieser, Shana Kleiner, Desmond Patton, Elsbeth Turcan, Kathleen McKeown
Package: Unknown
Beat LLMs at Their Own Game: Zero-Shot LLM-Generated Text Detection via Querying ChatGPT
Authors: Biru Zhu, Lifan Yuan, Ganqu Cui, Yangyi Chen, Chong Fu, Bingxiang He, Yangdong Deng, Zhiyuan Liu, Maosong Sun, Ming Gu
Package: Unknown
PreWoMe: Exploiting Presuppositions as Working Memory for Long Form Question Answering
Authors: Wookje Han, Jinsol Park, Kyungjae Lee
Package: Unknown
DecipherPref: Analyzing Influential Factors in Human Preference Judgments via GPT-4
Authors: Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Hassan Foroosh, Fei Liu
Package: Unknown
Gender Biases in Automatic Evaluation Metrics for Image Captioning
Authors: Haoyi Qiu, Zi-Yi Dou, Tianlu Wang, Asli Celikyilmaz, Nanyun Peng
Package: Unknown
SOUL: Towards Sentiment and Opinion Understanding of Language
Authors: Yue Deng, Wenxuan Zhang, Sinno Pan, Lidong Bing
Package: Unknown
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation
Authors: Zexue He, Yu Wang, An Yan, Yao Liu, Eric Chang, Amilcare Gentili, Julian McAuley, Chun-Nan Hsu
Package: Unknown
ORCHID: A Chinese Debate Corpus for Target-Independent Stance Detection and Argumentative Dialogue Summarization
Authors: Xiutian Zhao, Ke Wang, Wei Peng
Package: Unknown
SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation
Authors: Elizabeth Clark, Shruti Rijhwani, Sebastian Gehrmann, Joshua Maynez, Roee Aharoni, Vitaly Nikolaev, Thibault Sellam, Aditya Siddhant, Dipanjan Das, Ankur Parikh
Package: Unknown
A Video Is Worth 4096 Tokens: Verbalize Story Videos To Understand Them In Zero Shot
Authors: Aanisha Bhattacharyya, Yaman Singla, Balaji Krishnamurthy, Rajiv Shah, Changyou Chen
Package: Unknown
Active Learning for Natural Language Generation
Authors: Yotam Perlitz, Ariel Gera, Michal Shmueli-Scheuer, Dafna Sheinwald, Noam Slonim, Liat Ein-Dor
Package: Unknown
Reducing Sequence Length by Predicting Edit Spans with Large Language Models
Authors: Masahiro Kaneko, Naoaki Okazaki
Package: Unknown
Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory
Authors: Ziang Xiao, Susu Zhang, Vivian Lai, Q. Vera Liao
Package: Unknown
We Are What We Repeatedly Do: Inducing and Deploying Habitual Schemas in Persona-Based Responses
Authors: Benjamin Kane, Lenhart Schubert
Package: Unknown
Countering Misinformation via Emotional Response Generation
Authors: Daniel Russo, Shane Kaszefski-Yaschuk, Jacopo Staiano, Marco Guerini
Package: Unknown
Models See Hallucinations: Evaluating the Factuality in Video Captioning
Authors: Hui Liu, Xiaojun Wan
Package: Unknown
Select, Prompt, Filter: Distilling Large Language Models for Summarizing Conversations
Authors: Minh-Quang Pham, Sathish Indurthi, Shamil Chollampatt, Marco Turchi
Notes: Abstract Mentions Rouge Scores
Package: Unknown
Impressions: Visual Semiotics and Aesthetic Impact Understanding
Authors: Julia Kruk, Caleb Ziems, Diyi Yang
Package: Unknown
AutoTrial: Prompting Language Models for Clinical Trial Design
Authors: Zifeng Wang, Cao Xiao, Jimeng Sun
Package: Unknown
Multi-Source Multi-Type Knowledge Exploration and Exploitation for Dialogue Generation
Authors: Xuanfan Ni, Hongliang Dai, Zhaochun Ren, Piji Li
Package: Unknown
Context Compression for Auto-regressive Transformers with Sentinel Tokens
Authors: Siyu Ren, Qi Jia, Kenny Zhu
Package: Unknown
Reconstruct Before Summarize: An Efficient Two-Step Framework for Condensing and Summarizing Meeting Transcripts
Authors: Haochen Tan, Han Wu, Wei Shao, Xinyun Zhang, Mingjie Zhan, Zhaohui Hou, Ding Liang, Linqi Song
Package: Unknown
MaNtLE: Model-agnostic Natural Language Explainer
Authors: Rakesh Menon, Kerem Zaman, Shashank Srivastava
Package: Unknown
PTP: Boosting Stability and Performance of Prompt Tuning with Perturbation-Based Regularizer
Authors: Lichang Chen, Jiuhai Chen, Heng Huang, Minhao Cheng
Package: Unknown
CLAIR: Evaluating Image Captions with Large Language Models
Authors: David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Darrell, John Canny
Package: Unknown
q2d: Turning Questions into Dialogs to Teach Models How to Search
Authors: Yonatan Bitton, Shlomi Cohen-Ganor, Ido Hakimi, Yoad Lewenberg, Roee Aharoni, Enav Weinreb
Package: Unknown
You Told Me That Joke Twice: A Systematic Investigation of Transferability and Robustness of Humor Detection Models
Authors: Alexander Baranov, Vladimir Kniazhevsky, Pavel Braslavski
Package: Unknown
IEKG: A Commonsense Knowledge Graph for Idiomatic Expressions
Authors: Ziheng Zeng, Kellen Cheng, Srihari Nanniyur, Jianing Zhou, Suma Bhat
Package: Unknown
Exploring the Boundaries of GPT-4 in Radiology
Authors: Qianchu Liu, Stephanie Hyland, Shruthi Bannur, Kenza Bouzid, Daniel Castro, Maria Wetscherek, Robert Tinn, Harshita Sharma, Fernando Pérez-García, Anton Schwaighofer, Pranav Rajpurkar, Sameer Khanna, Hoifung Poon, Naoto Usuyama, Anja Thieme, Aditya Nori, Matthew Lungren, Ozan Oktay, Javier Alvarez-Valle
Package: Unknown
Self-Ensemble of N-best Generation Hypotheses by Lexically Constrained Decoding
Authors: Ryota Miyano, Tomoyuki Kajiwara, Yuki Arase
Package: Unknown
R2H: Building Multimodal Navigation Helpers that Respond to Help Requests
Authors: Yue Fan, Jing Gu, Kaizhi Zheng, Xin Wang
Package: Unknown
Unveiling the Essence of Poetry: Introducing a Comprehensive Dataset and Benchmark for Poem Summarization
Authors: Ridwan Mahbub, Ifrad Khan, Samiha Anuva, Md Shahriar, Md Tahmid Rahman Laskar, Sabbir Ahmed
Package: Unknown
Prompting with Pseudo-Code Instructions
Authors: Mayank Mishra, Prince Kumar, Riyaz Bhat, Rudra Murthy, Danish Contractor, Srikanth Tamilselvam
Notes: Abstract Mentions Rouge Scores
Package: Unknown
Reinforcement Replaces Supervision: Query focused Summarization using Deep Reinforcement Learning
Authors: Swaroop Nath, Pushpak Bhattacharyya, Harshad Khadilkar
Notes: Abstract Mentions Rouge Scores
Package: Unknown
Graph vs. Sequence: An Empirical Study on Knowledge Forms for Knowledge-Grounded Dialogue
Authors: Yizhe Yang, Heyan Huang, Yuhang Liu, Yang Gao
Package: Unknown
Exploring Distributional Shifts in Large Language Models for Code Analysis
Authors: Shushan Arakelyan, Rocktim Das, Yi Mao, Xiang Ren
Package: Unknown
Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation
Authors: Yixin Liu, Alexander Fabbri, Yilun Zhao, Pengfei Liu, Shafiq Joty, Chien-Sheng Wu, Caiming Xiong, Dragomir Radev
Package: Unknown
ALCAP: Alignment-Augmented Music Captioner
Authors: Zihao He, Weituo Hao, Wei-Tsung Lu, Changyou Chen, Kristina Lerman, Xuchen Song
Package: Unknown
Speech Recognition and Meaning Interpretation: Towards Disambiguation of Structurally Ambiguous Spoken Utterances in Indonesian
Authors: Ruhiyah Widiaputri, Ayu Purwarianti, Dessi Lestari, Kurniawati Azizah, Dipta Tanaya, Sakriani Sakti
Package: Unknown
Findings of the ACL: EMNLP 2023
Multi Document Summarization Evaluation in the Presence of Damaging Content
Authors: Avshalom Manevich, David Carmel, Nachshon Cohen, Elad Kravi, Ori Shapira
Package: Unknown
Follow-on Question Suggestion via Voice Hints for Voice Assistants
Authors: Besnik Fetahu, Pedro Faustini, Anjie Fang, Giuseppe Castellucci, Oleg Rokhlenko, Shervin Malmasi
Package: Unknown
Dolphin: A Challenging and Diverse Benchmark for Arabic NLG
Authors: El Moatez Billah Nagoudi, AbdelRahim Elmadany, Ahmed El-Shangiti, Muhammad Abdul-Mageed
Package: Unknown
TaTA: A Multilingual Table-to-Text Dataset for African Languages
Authors: Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Jan Botha, Michael Chavinda, Ankur Parikh, Clara Rivera
Package: Unknown
Towards Mitigating LLM Hallucination via Self Reflection
Authors: Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, Pascale Fung
Package: Unknown
ExplainCPE: A Free-text Explanation Benchmark of Chinese Pharmacist Examination
Authors: Dongfang Li, Jindi Yu, Baotian Hu, Zhenran Xu, Min Zhang
Package: Unknown
Multi-Granularity Information Interaction Framework for Incomplete Utterance Rewriting
Authors: Haowei Du, Dinghao Zhang, Chen Li, Yang Li, Dongyan Zhao
Package: Unknown
Accuracy is not enough: Evaluating Personalization in Summarizers
Authors: Rahul Vansh, Darsh Rank, Sourish Dasgupta, Tanmoy Chakraborty
Notes: Abstract Mentions Rouge Scores
Package: Unknown
MaXM: Towards Multilingual Visual Question Answering
Authors: Soravit Changpinyo, Linting Xue, Michal Yarom, Ashish Thapliyal, Idan Szpektor, Julien Amelot, Xi Chen, Radu Soricut
Package: Unknown
Understanding HTML with Large Language Models
Authors: Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, Aleksandra Faust
Package: Unknown
Can you Summarize my learnings? Towards Perspective-based Educational Dialogue Summarization
Authors: Raghav Jain, Tulika Saha, Jhagrut Lalwani, Sriparna Saha
Package: Unknown
Towards Informative Open-ended Text Generation with Dynamic Knowledge Triples
Authors: Zixuan Ren, Yang Zhao, Chengqing Zong
Package: Unknown
Ask Language Model to Clean Your Noisy Translation Data
Authors: Quinten Bolding, Baohao Liao, Brandon Denis, Jun Luo, Christof Monz
Package: Unknown
Multi-User MultiWOZ: Task-Oriented Dialogues among Multiple Users
Authors: Yohan Jo, Xinyan Zhao, Arijit Biswas, Nikoletta Basiou, Vincent Auvray, Nikolaos Malandrakis, Angeliki Metallinou, Alexandros Potamianos
Package: Unknown
Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization
Authors: Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You, Lidong Bing
Package: Unknown
Mind the Gap: Automated Corpus Creation for Enthymeme Detection and Reconstruction in Learner Arguments
Authors: Maja Stahl, Nick Düsterhus, Mei-Hua Chen, Henning Wachsmuth
Package: Unknown
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
Authors: Dung Nguyen, Le Nam, Anh Dau, Anh Nguyen, Khanh Nghiem, Jin Guo, Nghi Bui
Package: Unknown
INVITE: a Testbed of Automatically Generated Invalid Questions to Evaluate Large Language Models for Hallucinations
Authors: Anil Ramakrishna, Rahul Gupta, Jens Lehmann, Morteza Ziyadi
Package: Unknown
Enhancing Conversational Search: Large Language Model-Aided Informative Query Rewriting
Authors: Fanghua Ye, Meng Fang, Shenghui Li, Emine Yilmaz
Package: Unknown
Leveraging Structured Information for Explainable Multi-hop Question Answering and Reasoning
Authors: Ruosen Li, Xinya Du
Package: Unknown
TRIP: Accelerating Document-level Multilingual Pre-training via Triangular Document-level Pre-training on Parallel Data Triplets
Authors: Hongyuan Lu, Haoyang Huang, Shuming Ma, Dongdong Zhang, Wai Lam, Zhaochuan Gao, Anthony Aue, Arul Menezes, Furu Wei
Notes: Abstract Mentions Rouge Scores
Package: Unknown
Beyond Candidates : Adaptive Dialogue Agent Utilizing Persona and Knowledge
Authors: Jungwoo Lim, Myunghoon Kang, Jinsung Kim, Jeongwook Kim, Yuna Hur, Heuiseok Lim
Package: Unknown
ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding
Authors: Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, Omer Levy
Package: Unknown
Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model
Authors: Yinghan Long, Sayeed Chowdhury, Kaushik Roy
Notes: Abstract Mentions Rouge Scores
Package: Unknown
Large Language Models Meet Harry Potter: A Dataset for Aligning Dialogue Agents with Characters
Authors: Nuo Chen, Yan Wang, Haiyun Jiang, Deng Cai, Yuhan Li, Ziyang Chen, Longyue Wang, Jia Li
Package: Unknown
Citance-Contextualized Summarization of Scientific Papers
Authors: Shahbaz Syed, Ahmad Hakimi, Khalid Al-Khatib, Martin Potthast
Package: Unknown
A Rewriting Approach for Gender Inclusivity in Portuguese
Authors: Leonor Veloso, Luisa Coheur, Rui Ribeiro
Package: Unknown
LEGO: A Multi-agent Collaborative Framework with Role-playing and Iterative Feedback for Causality Explanation Generation
Authors: Zhitao He, Pengfei Cao, Yubo Chen, Kang Liu, Ruopeng Li, Mengshu Sun, Jun Zhao
Package: Unknown
CITB: A Benchmark for Continual Instruction Tuning
Authors: Zihan Zhang, Meng Fang, Ling Chen, Mohammad-Reza Namazi-Rad
Package: Unknown
Large Language Models as Source Planner for Personalized Knowledge-grounded Dialogues
Authors: Hongru Wang, Minda Hu, Yang Deng, Rui Wang, Fei Mi, Weichao Wang, Yasheng Wang, Wai-Chung Kwan, Irwin King, Kam-Fai Wong
Package: Unknown
LLM aided semi-supervision for efficient Extractive Dialog Summarization
Authors: Nishant Mishra, Gaurav Sahu, Iacer Calixto, Ameen Abu-Hanna, Issam Laradji
Notes: Abstract Mentions Rouge Scores
Package: Unknown
Exploring In-Context Learning for Knowledge Grounded Dialog Generation
Authors: Qinyu Chen, Wenhao Wu, Sujian Li
Notes: Abstract Mentions Rouge Scores
Package: Unknown
InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector through Instruction Tuning
Authors: Zhexin Zhang, Jiale Cheng, Hao Sun, Jiawen Deng, Minlie Huang
Package: Unknown
SummIt: Iterative Text Summarization via ChatGPT
Authors: Haopeng Zhang, Xiao Liu, Jiawei Zhang
Package: Unknown
HuatuoGPT, Towards Taming Language Model to Be a Doctor
Authors: Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Guiming Chen, Jianquan Li, Xiangbo Wu, Zhang Zhiyi, Qingying Xiao, Xiang Wan, Benyou Wang, Haizhou Li
Package: Unknown
Diffusion Language Model with Query-Document Relevance for Query-Focused Summarization
Authors: Shaoyao Huang, Luozheng Qin, Ziqiang Cao
Notes: Abstract Mentions Rouge Scores
Package: Unknown
TokenDrop + BucketSampler: Towards Efficient Padding-free Fine-tuning of Language Models
Authors: Amrit Nagarajan, Anand Raghunathan
Package: Unknown
Using In-Context Learning to Improve Dialogue Safety
Authors: Nicholas Meade, Spandana Gella, Devamanyu Hazarika, Prakhar Gupta, Di Jin, Siva Reddy, Yang Liu, Dilek Hakkani-Tur
Package: Unknown
Improving Consistency for Text Summarization with Energy Functions
Authors: Qi Zeng, Qingyu Yin, Zheng Li, Yifan Gao, Sreyashi Nag, Zhengyang Wang, Bing Yin, Heng Ji, Chao Zhang
Package: Unknown
PR-MCS: Perturbation Robust Metric for MultiLingual Image Captioning
Authors: Yongil Kim, Yerin Hwang, Hyeongu Yun, Seunghyun Yoon, Trung Bui, Kyomin Jung
Package: Unknown
LLMs – the Good, the Bad or the Indispensable?: A Use Case on Legal Statute Prediction and Legal Judgment Prediction on Indian Court Cases
Authors: Shaurya Vats, Atharva Zope, Somsubhra De, Anurag Sharma, Upal Bhattacharya, Shubham Nigam, Shouvik Guha, Koustav Rudra, Kripabandhu Ghosh
Package: Unknown
ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning
Authors: Viet Lai, Nghia Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, Thien Nguyen
Package: Unknown
PARROT: Zero-Shot Narrative Reading Comprehension via Parallel Reading
Authors: Chao Zhao, Anvesh Vijjini, Snigdha Chaturvedi
Package: Unknown
Synthesize, if you do not have: Effective Synthetic Dataset Creation Strategies for Self-Supervised Opinion Summarization in E-commerce
Authors: Tejpalsingh Siledar, Suman Banerjee, Amey Patil, Sudhanshu Singh, Muthusamy Chelliah, Nikesh Garera, Pushpak Bhattacharyya
Package: Unknown
InheritSumm: A General, Versatile and Compact Summarizer by Distilling from GPT
Authors: Yichong Xu, Ruochen Xu, Dan Iter, Yang Liu, Shuohang Wang, Chenguang Zhu, Michael Zeng
Package: Unknown
DialGuide: Aligning Dialogue Model Behavior with Developer Guidelines
Authors: Prakhar Gupta, Yang Liu, Di Jin, Behnam Hedayatnia, Spandana Gella, Sijia Liu, Patrick Lange, Julia Hirschberg, Dilek Hakkani-Tur
Package: Unknown
Evaluating Subjective Cognitive Appraisals of Emotions from Large Language Models
Authors: Hongli Zhan, Desmond Ong, Junyi Li
Package: Unknown
1-PAGER: One Pass Answer Generation and Evidence Retrieval
Authors: Palak Jain, Livio Soares, Tom Kwiatkowski
Package: Unknown
LMGQS: A Large-scale Dataset for Query-focused Summarization
Authors: Ruochen Xu, Song Wang, Yang Liu, Shuohang Wang, Yichong Xu, Dan Iter, Pengcheng He, Chenguang Zhu, Michael Zeng
Package: Unknown
Extrapolating Multilingual Understanding Models as Multilingual Generators
Authors: Bohong Wu, Fei Yuan, Hai Zhao, Lei Li, Jingjing Xu
Notes: Abstract Mentions Rouge Scores
Package: Unknown
3rd Workshop on Multi-lingual Representation Learning
Generating Continuations in Multilingual Idiomatic Contexts
Authors: Rhitabrat Pokharel, Ameeta Agrawal
Package: Unknown
4th New Frontiers in Summarization Workshop
Is ChatGPT a Good NLG Evaluator? A Preliminary Study
Authors: Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, Jie Zhou
Package: Unknown
In-context Learning of Large Language Models for Controlled Dialogue Summarization: A Holistic Benchmark and Empirical Analysis
Authors: Yuting Tang, Ratish Puduppully, Zhengyuan Liu, Nancy Chen
Package: Unknown
From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting
Authors: Griffin Adams, Alex Fabbri, Faisal Ladhak, Eric Lehman, Noémie Elhadad
Package: Unknown
Generating Extractive and Abstractive Summaries in Parallel from Scientific Articles Incorporating Citing Statements
Authors: Sudipta Singha Roy, Robert E. Mercer
Package: Unknown
Analyzing Multi-Sentence Aggregation in Abstractive Summarization via the Shapley Value
Authors: Jingyi He, Meng Cao, Jackie Chi Kit Cheung
Package: Unknown
Natural Legal Language Processing Workshop 2023
Questions about Contracts: Prompt Templates for Structured Answer Generation
Authors: Adam Roegiest, Radha Chitta, Jonathan Donnelly, Maya Lash, Alexandra Vtyurina, Francois Longtin
Package: Unknown
3rd Workshop for NLP Open Source Software
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
Authors: Dung Nguyen Manh, Nam Le Hai, Anh T. V. Dau, Anh Minh Nguyen, Khanh Nghiem, Jin Guo, Nghi D. Q. Bui
Package: Unknown

Incorrect Rouge Scores — EMNLP 2023 Papers
These papers or their code releases reference Rouge software packages that compute incorrect Rouge scores because of implementation errors. Incorrect Rouge scores are scores that differ from those of the official ROUGE-1.5.5 reference implementation. See the packages section for more detail.

Proceedings of CoNLL 2023
ArchBERT: Bi-Modal Understanding of Neural Architectures and Natural Languages
Authors: Mohammad Akbari, Saeed Ranjbar Alvar, Behnam Kamranian, Amin Banitalebi-Dehkordi, Yong Zhang
Package: LA/torchmetrics
EMNLP 2023 Main Proceedings
Reading Books is Great, But Not if You Are Driving! Visually Grounded Reasoning about Defeasible Commonsense Norms
Authors: Seungju Han, Junhyeok Kim, Jack Hessel, Liwei Jiang, Jiwan Chung, Yejin Son, Yejin Choi, Youngjae Yu
Package: GL/rougescore
BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations
Authors: Qizhi Pei, Wei Zhang, Jinhua Zhu, Kehan Wu, Kaiyuan Gao, Lijun Wu, Yingce Xia, Rui Yan
Package: GL/rougescore
MemeCap: A Dataset for Captioning and Interpreting Memes
Authors: EunJeong Hwang, Vered Shwartz
Package: MS/rouge
Fast and Accurate Factual Inconsistency Detection Over Long Documents
Authors: Barrett Lattimer, Patrick Chen, Xinyuan Zhang, Yi Yang
Package: GL/rougescore, PT/rouge
Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks
Authors: Po-Nien Kung, Fan Yin, Di Wu, Kai-Wei Chang, Nanyun Peng
Package: GL/rougescore
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
Authors: Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li
Package: PT/rouge
Lion: Adversarial Distillation of Proprietary Large Language Models
Authors: Yuxin Jiang, Chunkit Chan, Mingyang Chen, Wei Wang
Package: GL/rougescore
Lost in Translation, Found in Spans: Identifying Claims in Multilingual Social Media
Authors: Shubham Mittal, Megha Sundriyal, Preslav Nakov
Package: PT/rouge
Investigating Efficiently Extending Transformers for Long Input Summarization
Authors: Jason Phang, Yao Zhao, Peter Liu
Package: GL/rougescore
mRedditSum: A Multimodal Abstractive Summarization Dataset of Reddit Threads with Images
Authors: Keighley Overbay, Jaewoo Ahn, Fatemeh Pesaran zadeh, Joonsuk Park, Gunhee Kim
Package: PT/rouge
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
Authors: Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun
Package: GL/rougescore
Towards Interpretable Mental Health Analysis with Large Language Models
Authors: Kailai Yang, Shaoxiong Ji, Tianlin Zhang, Qianqian Xie, Ziyan Kuang, Sophia Ananiadou
Package: GL/rougescore
Modeling Empathic Similarity in Personal Narratives
Authors: Jocelyn Shen, Maarten Sap, Pedro Colon-Hernandez, Hae Park, Cynthia Breazeal
Package: PT/rouge
Enabling Large Language Models to Generate Text with Citations
Authors: Tianyu Gao, Howard Yen, Jiatong Yu, Danqi Chen
Package: GL/rougescore
A Systematic Study of Performance Disparities in Multilingual Task-Oriented Dialogue Systems
Authors: Songbo Hu, Han Zhou, Moy Yuan, Milan Gritta, Guchun Zhang, Ignacio Iacobacci, Anna Korhonen, Ivan Vulić
Package: GL/rougescore
CiteBench: A Benchmark for Scientific Citation Text Generation
Authors: Martin Funkquist, Ilia Kuznetsov, Yufang Hou, Iryna Gurevych
Package: GL/rougescore
Instructive Dialogue Summarization with Query Aggregations
Authors: Bin Wang, Zhengyuan Liu, Nancy Chen
Package: DI/pyrouge, GL/rougescore
Enhancing Biomedical Lay Summarisation with External Knowledge Graphs
Authors: Tomas Goldsack, Zhihao Zhang, Chen Tang, Carolina Scarton, Chenghua Lin
Package: GL/rougescore
Background Summarization of Event Timelines
Authors: Adithya Pratapa, Kevin Small, Markus Dreyer
Notes: Received Paper Award
Package: GL/rougescore
trlX: A Framework for Large Scale Reinforcement Learning from Human Feedback
Authors: Alexander Havrilla, Maksym Zhuravinskyi, Duy Phung, Aman Tiwari, Jonathan Tow, Stella Biderman, Quentin Anthony, Louis Castricato
Package: GL/rougescore
Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks
Authors: Andrea Sottana, Bin Liang, Kai Zou, Zheng Yuan
Package: GL/rougescore
Detecting and Mitigating Hallucinations in Multilingual Summarisation
Authors: Yifu Qiu, Yftah Ziser, Anna Korhonen, Edoardo Ponti, Shay Cohen
Package: GL/rougescore, PT/rouge
Revisiting Instruction Fine-tuned Model Evaluation to Guide Industrial Applications
Authors: Manuel Faysse, Gautier Viaud, Céline Hudelot, Pierre Colombo
Package: GL/rougescore
Instruct and Extract: Instruction Tuning for On-Demand Information Extraction
Authors: Yizhu Jiao, Ming Zhong, Sha Li, Ruining Zhao, Siru Ouyang, Heng Ji, Jiawei Han
Package: GL/rougescore
ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness
Authors: Archiki Prasad, Swarnadeep Saha, Xiang Zhou, Mohit Bansal
Package: GL/rougescore
Contrastive Learning for Inference in Dialogue
Authors: Etsuko Ishii, Yan Xu, Bryan Wilie, Ziwei Ji, Holy Lovenia, Willy Chung, Pascale Fung
Package: MS/rouge
Paraphrase Types for Generation and Detection
Authors: Jan Wahle, Bela Gipp, Terry Ruas
Package: PT/rouge
Hallucination Mitigation in Natural Language Generation from Large-Scale Open-Domain Knowledge Graphs
Authors: Xiao Shi, Zhengyuan Zhu, Zeyu Zhang, Chengkai Li
Package: GL/rougescore
Multilingual Large Language Models Are Not (Yet) Code-Switchers
Authors: Ruochen Zhang, Samuel Cahyawijaya, Jan Christian Blaise Cruz, Genta Winata, Alham Aji
Package: GL/rougescore
KCTS: Knowledge-Constrained Tree Search Decoding with Token-Level Hallucination Detection
Authors: Sehyun Choi, Tianqing Fang, Zhaowei Wang, Yangqiu Song
Package: GL/rougescore
CP-BCS: Binary Code Summarization Guided by Control Flow Graph and Pseudo Code
Authors: Tong Ye, Lingfei Wu, Tengfei Ma, Xuhong Zhang, Yangkai Du, Peiyu Liu, Shouling Ji, Wenhai Wang
Package: MS/rouge
Length Does Matter: Summary Length can Bias Summarization Metrics
Authors: Xiaobo Guo, Soroush Vosoughi
Package: BZ/pyrouge
Argue with Me Tersely: Towards Sentence-Level Counter-Argument Generation
Authors: Jiayu Lin, Rong Ye, Meng Han, Qi Zhang, Ruofei Lai, Xinyu Zhang, Zhao Cao, Xuanjing Huang, Zhongyu Wei
Package: PT/rouge
Findings of the ACL: EMNLP 2023
DocAsRef: An Empirical Study on Repurposing Reference-based Summary Quality Metrics as Reference-free Metrics
Authors: Forrest Bao, Ruixuan Tu, Ge Luo, Yinfei Yang, Hebi Li, Minghui Qiu, Youbiao He, Cen Chen
Package: GL/rougescore
Execution-Based Evaluation for Open-Domain Code Generation
Authors: Zhiruo Wang, Shuyan Zhou, Daniel Fried, Graham Neubig
Package: GL/rougescore
Improving the Robustness of Summarization Models by Detecting and Removing Input Noise
Authors: Kundan Krishna, Yao Zhao, Jie Ren, Balaji Lakshminarayanan, Jiaming Luo, Mohammad Saleh, Peter Liu
Notes: Abstract Mentions Rouge Scores
Package: GL/rougescore
Extractive Summarization via ChatGPT for Faithful Summary Generation
Authors: Haopeng Zhang, Xiao Liu, Jiawei Zhang
Notes: Abstract Mentions Rouge Scores
Package: GL/rougescore
InstructExcel: A Benchmark for Natural Language Instruction in Excel
Authors: Justin Payan, Swaroop Mishra, Mukul Singh, Carina Negreanu, Christian Poelitz, Chitta Baral, Subhro Roy, Rasika Chakravarthy, Benjamin Van Durme, Elnaz Nouri
Package: GL/rougescore
Multi-step Jailbreaking Privacy Attacks on ChatGPT
Authors: Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, Yangqiu Song
Package: GL/rougescore
FREDSum: A Dialogue Summarization Corpus for French Political Debates
Authors: Virgile Rennard, Guokan Shang, Damien Grari, Julie Hunter, Michalis Vazirgiannis
Package: GL/rougescore
Frugal Prompting for Dialog Models
Authors: Bishal Santra, Sakya Basak, Abhinandan De, Manish Gupta, Pawan Goyal
Package: GL/rougescore
The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation
Authors: Mutian He, Philip Garner
Package: GL/rougescore
Is ChatGPT a Good Multi-Party Conversation Solver?
Authors: Chao-Hong Tan, Jia-Chen Gu, Zhen-Hua Ling
Package: GL/rougescore
Bipartite Graph Pre-training for Unsupervised Extractive Summarization with Graph Convolutional Auto-Encoders
Authors: Qianren Mao, Shaobo Zhao, Jiarui Li, Xiaolei Gu, Shizhu He, Bo Li, Jianxin Li
Package: BZ/pyrouge
Adapting Pretrained Text-to-Text Models for Long Text Sequences
Authors: Wenhan Xiong, Anchit Gupta, Shubham Toshniwal, Yashar Mehdad, Scott Yih
Package: PT/files2rouge
Large-Scale and Multi-Perspective Opinion Summarization with Diverse Review Subsets
Authors: Han Jiang, Rui Wang, Zhihua Wei, Yu Li, Xinpeng Wang
Package: PT/files2rouge
Topic-Informed Dialogue Summarization using Topic Distribution and Prompt-based Modeling
Authors: Jaeah You, Youngjoong Ko
Notes: Abstract Mentions Rouge Scores
Package: DI/pyrouge
A Hierarchical Encoding-Decoding Scheme for Abstractive Multi-document Summarization
Authors: Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You, Lidong Bing
Notes: Abstract Mentions Rouge Scores
Package: GL/rougescore
NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models
Authors: Jongwoo Ko, Seungjoon Park, Yujin Kim, Sumyeong Ahn, Du-Seong Chang, Euijai Ahn, Se-Young Yun
Package: GL/rougescore
Inverse Reinforcement Learning for Text Summarization
Authors: Yu Fu, Deyi Xiong, Yue Dong
Notes: Abstract Mentions Rouge Scores
Package: GL/rougescore
From Chaos to Clarity: Claim Normalization to Empower Fact-Checking
Authors: Megha Sundriyal, Tanmoy Chakraborty, Preslav Nakov
Package: DI/pyrouge
Lexical Repetitions Lead to Rote Learning: Unveiling the Impact of Lexical Overlap in Train and Test Reference Summaries
Authors: Prafulla Choubey, Alexander Fabbri, Caiming Xiong, Chien-Sheng Wu
Notes: Abstract Mentions Rouge Scores
Package: GL/rougescore
Open Domain Multi-document Summarization: A Comprehensive Study of Model Brittleness under Retrieval
Authors: John Giorgi, Luca Soldaini, Bo Wang, Gary Bader, Kyle Lo, Lucy Wang, Arman Cohan
Package: GL/rougescore
USB: A Unified Summarization Benchmark Across Tasks and Domains
Authors: Kundan Krishna, Prakhar Gupta, Sanjana Ramprasad, Byron Wallace, Jeffrey Bigham, Zachary Lipton
Package: GL/rougescore
Domain Adaptation for Conversational Query Production with the RAG Model Feedback
Authors: Ante Wang, Linfeng Song, Ge Xu, Jinsong Su
Package: GL/rougescore
DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models
Authors: Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, Lingpeng Kong
Package: LA/torchmetrics
PivotFEC: Enhancing Few-shot Factual Error Correction with a Pivot Task Approach using Large Language Models
Authors: Xingwei He, A-Long Jin, Jun Ma, Yuan Yuan, Siu Yiu
Package: GL/rougescore
Can Large Language Models Fix Data Annotation Errors? An Empirical Study Using Debatepedia for Query-Focused Text Summarization
Authors: Md Tahmid Rahman Laskar, Mizanur Rahman, Israt Jahan, Enamul Hoque, Jimmy Huang
Package: GL/rougescore
Prompting and Evaluating Large Language Models for Proactive Dialogues: Clarification, Target-guided, and Non-collaboration
Authors: Yang Deng, Lizi Liao, Liang Chen, Hongru Wang, Wenqiang Lei, Tat-Seng Chua
Package: BZ/pyrouge, MS/rouge
Natural Response Generation for Chinese Reading Comprehension
Authors: Nuo Chen, Hongguang Li, Yinan Bao, Baoyuan Wang, Jia Li
Package: Custom reimplementation of Rouge
Is ChatGPT a Good Causal Reasoner? A Comprehensive Evaluation
Authors: Jinglong Gao, Xiao Ding, Bing Qin, Ting Liu
Package: PT/rouge
Enhancing Accessible Communication: from European Portuguese to Portuguese Sign Language
Authors: Catarina Sousa, Luisa Coheur, Mara Moita
Package: GL/rougescore
HEAR: Hearing Enhanced Audio Response for Video-grounded Dialogue
Authors: Sunjae Yoon, Dahyun Kim, Eunseop Yoon, Hee Yoon, Junyeong Kim, Chang Yoo
Package: MS/rouge
Ensemble-Instruct: Instruction Tuning Data Generation with a Heterogeneous Mixture of LMs
Authors: Young-Suk Lee, Md Sultan, Yousef El-Kurdi, Tahira Naseem, Asim Munawar, Radu Florian, Salim Roukos, Ramón Astudillo
Package: GL/rougescore
Don’t Add, don’t Miss: Effective Content Preserving Generation from Pre-Selected Text Spans
Authors: Aviv Slobodkin, Avi Caciularu, Eran Hirsch, Ido Dagan
Notes: Abstract Mentions Rouge Scores
Package: GL/rougescore
COMET-M: Reasoning about Multiple Events in Complex Sentences
Authors: Sahithya Ravi, Raymond Ng, Vered Shwartz
Package: GL/rougescore, MS/rouge
Cross-modality Data Augmentation for End-to-End Sign Language Translation
Authors: Jinhui Ye, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Hui Xiong
Package: MS/rouge
InstOptima: Evolutionary Multi-objective Instruction Optimization via Large Language Model-based Instruction Operators
Authors: Heng Yang, Ke Li
Package: PT/rouge
Re-Examining Summarization Evaluation across Multiple Quality Criteria
Authors: Ori Ernst, Ori Shapira, Ido Dagan, Ran Levy
Package: BZ/pyrouge
NarrativeXL: a Large-scale Dataset for Long-Term Memory Models
Authors: Arsenii Moskvichev, Ky-Vinh Mai
Package: GL/rougescore
PIVOINE: Instruction Tuning for Open-world Entity Profiling
Authors: Keming Lu, Xiaoman Pan, Kaiqiang Song, Hongming Zhang, Dong Yu, Jianshu Chen
Package: ND/easyrouge, PT/rouge
Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension
Authors: Nuo Chen, Hongguang Li, Junqing He, Yinan Bao, Xinshi Lin, Qi Yang, Jianfeng Liu, Ruyi Gan, Jiaxing Zhang, Baoyuan Wang, Jia Li
Package: Custom reimplementation of Rouge
Mitigating Intrinsic Named Entity-Related Hallucinations of Abstractive Text Summarization
Authors: Jianbin Shen, Junyu Xuan, Christy Liang
Package: GL/rougescore
3rd Workshop on Multi-lingual Representation Learning
Findings of the 1st Shared Task on Multi-lingual Multi-task Information Retrieval at MRL 2023
Authors: Francesco Tinner, David Ifeoluwa Adelani, Chris Emezue, Mammad Hajili, Omer Goldman, Muhammad Farid Adilazuarda, Muhammad Dehan Al Kautsar, Aziza Mirsaidova, Müge Kural, Dylan Massey, Chiamaka Chukwuneke, Chinedu Mbonu, Damilola Oluwaseun Oloyede, Kayode Olaleye, Jonathan Atala, Benjamin A. Ajibade, Saksham Bassi, Rahul Aralikatte, Najoung Kim, Duygu Ataman
Package: GL/rougescore
4th New Frontiers in Summarization Workshop
Extract, Select and Rewrite: A Modular Sentence Summarization Method
Authors: Shuo Guan, Vishakh Padmakumar
Package: PT/rouge
Improving Multi-Stage Long Document Summarization with Enhanced Coarse Summarizer
Authors: Jinhyeong Lim, Hyun-Je Song
Package: BZ/pyrouge
3rd Workshop for NLP Open Source Software
nanoT5: Fast & Simple Pre-training and Fine-tuning of T5 Models with Limited Resources
Authors: Piotr Nawrot
Package: GL/rougescore

Incorrect Rouge Packages — Cited at EMNLP 2023
These packages have implementation or configuration errors that result in incorrect Rouge scores. These errors were first identified in the ACL 2023 Rogue Scores paper by comparing their output scores to ROUGE-1.5.5 under various evaluation conditions.

Package With Errors: GL/rougescore
Incorrect implementation of Porter stemming. Incorrect default implementation of Rouge-L. Bootstrapping introduces random noise into scores (minor issue). Distributed by both Google Research (GL/rougescore) and Hugging Face (HF/evaluate).
Package With Errors: PT/rouge
Implementation errors in both Rouge-N and Rouge-L algorithms. Not capable of performing stemming or bootstrapping.
Package With Errors: PT/files2rouge
Incorrectly tokenizes sentences using the period character (“.”), ignoring existing tokenization. Bootstrapping introduces random noise into scores (minor issue).
Package With Errors: DI/pyrouge
Unclear implementation errors cause incorrect Rouge scores for approximately 4% of model outputs during testing. Not capable of performing bootstrapping.
Package With Errors: MS/rouge
Accidentally computes recall-biased Rouge F-scores using $ \beta=1.2 $. (Rouge F-scores are almost universally computed with $ \beta=1.0 $.) Performs incorrect sentence tokenization. Not capable of performing stemming or bootstrapping.
Package With Errors: BZ/pyrouge
Contains a single line of code that silently enables stemming, even when the user attempts to disable stemming. Bootstrapping introduces random noise into scores (minor issue). Distributed and reused by several other packages, including YL/summeval.
Package With Errors: ND/easyrouge
Omits many major components of Rouge scores: “Preprocessing like stopword removal, stemming and tokenization is left to the client.”
Package With Errors: LA/torchmetrics
This custom reimplementation of Rouge has not been evaluated for correctness. It appears to be based on the incorrect GL/rougescore implementation, including replicating the incorrect default Rouge-L behavior.
Custom Reimplementations
Some papers link to code that contains custom ad hoc reimplementations or wrappers of Rouge not evaluated in Rogue Scores. The correctness of these custom implementations is determined by static analysis during review of the code release.
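To make the MS/rouge entry above concrete, the following minimal sketch shows how the choice of $ \beta $ changes a Rouge F-score. It assumes the standard weighted F-measure form used for Rouge, $ F_\beta = (1+\beta^2)PR / (R + \beta^2 P) $. The function name `f_beta` and the precision and recall values are illustrative assumptions, not taken from any package or paper listed here.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted F-measure: (1 + beta^2) * P * R / (R + beta^2 * P)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when there is no overlap at all
    b2 = beta * beta
    return (1 + b2) * precision * recall / (recall + b2 * precision)


# Hypothetical precision/recall for a single summary (illustration only).
precision, recall = 0.40, 0.60

balanced = f_beta(precision, recall, beta=1.0)       # the usual setting
recall_biased = f_beta(precision, recall, beta=1.2)  # the setting described above

print(f"beta=1.0: {balanced:.4f}")
print(f"beta=1.2: {recall_biased:.4f}")
```

With $ \beta > 1 $ the score moves toward recall, so a system whose recall exceeds its precision receives a higher F-score than it would under $ \beta=1.0 $, and the two numbers are not directly comparable.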

Timeline — Paper and Code Review
December 7, 2023 — Public release of the complete EMNLP 2023 proceedings.
December 8, 2023 — Review of EMNLP 2023 Main Proceedings papers and code releases.
December 10, 2023 — Review of EMNLP 2023 Findings papers and code releases.
December 12, 2023 — Review of papers and code releases for all remaining EMNLP 2023 collocated events including demonstrations, tutorials, industry track, and workshops.

Methods — Paper and Code Review
The review includes all EMNLP 2023 papers that compute Rouge scores. Papers and citation information are downloaded from the ACL Anthology.
A preliminary identification of Rouge papers is conducted automatically by searching for “rouge” across all full-text paper PDFs and excluding papers that do not match.
Matching papers are reviewed manually to identify if they compute Rouge scores. This includes Rouge scores computed but not reported, such as during model training. Papers not computing Rouge scores are excluded from the review. Remaining papers are included in the review.
Remaining papers are first searched for in-text paper citations of Rouge packages. Papers with in-text package citations are labeled accordingly and the review of the paper concludes.
Papers without in-text Rouge package citations are searched for in-text code release links. Papers without code links are labeled as “unknown package” and the review of the paper concludes.
Paper code releases are searched for references to Rouge, including README documents, repository issues and pull requests, standard code files, shell scripts, and package management files such as requirements.txt or environment.yml.
Papers with code referencing a Rouge package are labeled accordingly. Papers whose code does not reference a Rouge package are labeled as “unknown package.” The review of the paper concludes.
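The labeling procedure above can be summarized as a small decision function. This is only a sketch of the flow as described: the `Paper` record, its field names, and `label_paper` are hypothetical, and the fields marked as manual stand in for judgements made by a human reviewer rather than anything computed automatically.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Case-insensitive preliminary search term, as described in the review.
ROUGE_PATTERN = re.compile(r"rouge", re.IGNORECASE)


@dataclass
class Paper:
    """Hypothetical record of what a reviewer knows about one paper."""
    full_text: str
    computes_rouge: bool = False  # manual: does the paper compute Rouge at all?
    cited_packages: List[str] = field(default_factory=list)  # manual: in-text citations
    has_code_link: bool = False   # manual: does the paper link to a code release?
    code_packages: List[str] = field(default_factory=list)   # manual: found in code


def label_paper(paper: Paper) -> Optional[List[str]]:
    """Return None if the paper is excluded, otherwise its package labels."""
    if not ROUGE_PATTERN.search(paper.full_text):
        return None                  # no match in the automated search
    if not paper.computes_rouge:
        return None                  # mentions Rouge but does not compute it
    if paper.cited_packages:
        return paper.cited_packages  # in-text citation ends the review
    if not paper.has_code_link:
        return ["Unknown"]           # no citation and no code link
    if paper.code_packages:
        return paper.code_packages   # package identified from the code release
    return ["Unknown"]               # code release has no Rouge reference


example = Paper(full_text="We report ROUGE-L.", computes_rouge=True,
                has_code_link=True, code_packages=["GL/rougescore"])
print(label_paper(example))
```

Note that the in-text citation check comes before the code check, so a paper with a cited package is never relabeled from its code release in this sketch.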

Challenges and Limitations — Paper and Code Review
Parameter Differences. This review only examines use of Rouge software packages. It does not examine parameter differences, which can also lead to substantial differences in scores.
Automated Search. A preliminary case-insensitive search for “rouge” is conducted for all papers. Only matching papers receive a full manual paper and code review. Papers that evaluate with Rouge without explicitly naming the metric, and papers that reference Rouge only inside non-searchable images, may be excluded from the review.
Human Annotation. Manual paper review is used to identify Rouge computation and packages. Despite best efforts, all human annotation has the potential to introduce labeling errors.
Code Availability. Because in-text package citations are rare, most identifications of Rouge packages are made through code releases. At the time of review, many papers link to code releases that contain no code. It is possible that many papers currently labeled “unknown” will eventually link to code that contains an identifiable Rouge package.
Non-Evaluation Metrics. Some papers use Rouge for reasons other than evaluation, such as feature generation or for internal training validation. This review does not make any distinction between evaluation and non-evaluation Rouge.
Assumed Correctness. The review assumes all papers that use ROUGE-1.5.5 directly (rather than using a wrapper or reimplementation) report correct Rouge scores. However, many of these papers may run ROUGE-1.5.5 via custom ad hoc wrapper code that (like many wrapper packages) is also implemented incorrectly and introduces scoring errors.
Package Inference. Many code releases lack explicit dependency specifications, which makes identifying the exact Rouge package challenging. In these cases, function signatures are used to identify the most likely Rouge package.
Multiple Packages. When a code release contains multiple Rouge packages, an attempt is made to identify which packages are used to compute Rouge scores reported in the paper. If this is unclear, all Rouge packages appearing in the code release are listed.
External Materials. Only main paper text, appendices, and code linked in papers are reviewed. External materials such as websites, slides, videos, or code with no link appearing in papers are not examined as part of this review.

Analyses.org