Analyses.org

Evaluation Software Errors in the EMNLP 2023 Proceedings

The Rouge evaluation metric is used by 1 in every 10 papers at EMNLP 2023. However, this review finds that nearly all Rouge scores reported in EMNLP 2023 papers are either irreproducible or incorrect.

Rogue Scores Project Page
Learn how evaluation errors affect 2,000+ papers.

Download the Dataset [JSON]
Dataset of annotated papers used to generate this page.

Widespread Errors in NLG Evaluation

If you’ve ever written a paper involving natural language generation, there’s a good chance you evaluated your model using the Rouge metric. Somewhere between 10% and 20% of recent NLP papers use Rouge, either as the main metric (e.g., summarization) or as part of a panel of metrics (e.g., caption generation). This makes Rouge one of the most frequently used metrics in NLP research today. Consequently, the validity of thousands of NLP results depends on Rouge scores being reproducible, comparable, and correct.

Yet, among the thousands of NLP papers that evaluate using Rouge, only a fraction cite specific Rouge software packages (33%) or configuration parameters (5%). The choice of Rouge package and parameters can dramatically affect Rouge scores. When papers omit these details, reported Rouge scores become difficult to compare and reproduce.
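To make the effect of a single configuration parameter concrete, here is a minimal sketch of a unigram-overlap F1 score in the spirit of Rouge-1, computed with and without stemming. It is not the ROUGE-1.5.5 implementation: the function names, the whitespace tokenizer, and the crude suffix-stripping "stemmer" are all assumptions made for illustration only.

```python
from collections import Counter


def toy_stem(token: str) -> str:
    """Crude suffix stripper standing in for a real stemmer (illustrative assumption)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def unigram_f1(reference: str, candidate: str, stem: bool = False) -> float:
    """Unigram-overlap F1, similar in spirit to Rouge-1 (not ROUGE-1.5.5 itself)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if stem:
        ref_tokens = [toy_stem(t) for t in ref_tokens]
        cand_tokens = [toy_stem(t) for t in cand_tokens]
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "the cats were chasing mice"
candidate = "the cat chased mice"

score_plain = unigram_f1(reference, candidate, stem=False)
score_stemmed = unigram_f1(reference, candidate, stem=True)
print(f"without stemming: {score_plain:.3f}")
print(f"with stemming:    {score_stemmed:.3f}")
```

On this toy sentence pair the score doubles, from about 0.44 to about 0.89, purely because of the stemming switch. Real packages differ in many more ways (tokenization, sentence splitting, bootstrapping), so a reported score is hard to interpret without knowing the package and its settings.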

Furthermore, many nonstandard Rouge packages have serious implementation errors that result in incorrect Rouge scores. The validity of these incorrect scores is unknown: many nonstandard packages differ meaningfully from the standard ROUGE-1.5.5 implementation, yet unlike ROUGE-1.5.5, they have never been validated against human judgement. Prior work suggests that several thousand papers evaluate using incorrect Rouge packages.

Recommendations for Reviewers

Model evaluation is a critical component of empirical NLP research. For several years, the Responsible NLP Research Checklist has asked authors to cite software and parameters for important software packages and specifically identifies Rouge as an example. However, approximately 10% of EMNLP 2023 papers failed to follow these suggestions. Because Rouge evaluation errors and discrepancies can affect the core findings of a paper, these suggestions need to be upgraded to strict requirements for acceptance:

  1. Reviewers should strongly recommend rejection for papers that report or compute Rouge scores without an in-text citation of the Rouge software package.

  2. Reviewers should strongly recommend rejection for papers that use a Rouge package other than the standard ROUGE-1.5.5 implementation, unless the paper contains an appendix section discussing the limitations of the package and the rationale for using it.

  3. Reviewers should strongly recommend rejection for papers that report or compute Rouge scores without stating Rouge configuration parameters, such as stemming.

Evaluation Software Errors at EMNLP 2023

(Completed: December 8–12, 2023)

This review examines all papers computing Rouge scores across the entire EMNLP 2023 Main Proceedings, Findings, and all other EMNLP 2023 collocated events. The main finding of the review is that nearly all Rouge scores are either irreproducible or incorrect.

The results below are built automatically from the dataset file, available to download above. For more details, read the timeline, methods, and limitations sections below.
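As a rough illustration of how listings like these could be generated from a dataset file, the sketch below groups a few made-up records by category and venue. The field names (`title`, `venue`, `package`), the sample records, and the rule that any other named package counts as incorrect are assumptions for illustration. They are not the actual schema of the downloadable dataset or the review's exact procedure.

```python
import json
from collections import defaultdict

# Hypothetical records: field names and titles are assumptions, not the
# real schema or contents of the downloadable dataset file.
SAMPLE_JSON = """
[
  {"title": "Paper A", "venue": "EMNLP 2023 Main Proceedings", "package": "Standard ROUGE-1.5.5"},
  {"title": "Paper B", "venue": "EMNLP 2023 Main Proceedings", "package": "Unknown"},
  {"title": "Paper C", "venue": "Findings of the ACL: EMNLP 2023", "package": "Multilingual Rouge"},
  {"title": "Paper D", "venue": "Findings of the ACL: EMNLP 2023", "package": "GL/rougescore, PT/rouge"}
]
"""


def categorize(package: str) -> str:
    """Map a package label to a category (simplified, assumed rule)."""
    if package == "Unknown":
        return "unknown"
    if package.startswith("Standard ROUGE-1.5.5") or package == "Multilingual Rouge":
        return "correct"
    # Simplification: every other named package is treated as incorrect here.
    return "incorrect"


def group_by_category(records):
    """Return {category: {venue: [titles]}} for a list of record dicts."""
    grouped = defaultdict(lambda: defaultdict(list))
    for record in records:
        category = categorize(record["package"])
        grouped[category][record["venue"]].append(record["title"])
    return grouped


grouped = group_by_category(json.loads(SAMPLE_JSON))
for category, venues in grouped.items():
    print(category)
    for venue, titles in venues.items():
        print(f"  {venue}: {len(titles)} paper(s)")
```

Keeping the categorization rule in one small function makes it easy to audit and to adjust if a package label is later reclassified.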


Correct Rouge Scores — EMNLP 2023 Papers

These papers appear to compute correct Rouge scores based on paper and code release review. Correct Rouge scores are computed by (1) the benchmark ROUGE-1.5.5 package, (2) an alternative Rouge package with Rouge scores identical to ROUGE-1.5.5, or (3) an alternative Rouge package specifically designed and used for non-English or multilingual evaluation.

  1. EMNLP 2023 Main Proceedings
  2. Better Quality Pre-training Data and T5 Models for African Languages
    Authors: Akintunde Oladipo, Mofetoluwa Adeyemi, Orevaoghene Ahia, Abraham Owodunni, Odunayo Ogundepo, David Adelani, Jimmy Lin
    Package: Multilingual Rouge
  3. GEMINI: Controlling The Sentence-Level Summary Style in Abstractive Text Summarization
    Authors: Guangsheng Bao, Zebin Ou, Yue Zhang
    Package: Standard ROUGE-1.5.5 (custom wrapper)
  4. DisCo: Distilled Student Models Co-training for Semi-supervised Text Mining
    Authors: Weifeng Jiang, Qianren Mao, Chenghua Lin, Jianxin Li, Ting Deng, Weiyi Yang, Zheng Wang
    Package: Standard ROUGE-1.5.5
  5. MILDSum: A Novel Benchmark Dataset for Multilingual Summarization of Indian Legal Case Judgments
    Authors: Debtanu Datta, Shubham Soni, Rajdeep Mukherjee, Saptarshi Ghosh
    Package: Multilingual Rouge
  6. Findings of the ACL: EMNLP 2023
  7. Understanding Translationese in Cross-Lingual Summarization
    Authors: Jiaan Wang, Fandong Meng, Yunlong Liang, Tingyi Zhang, Jiarong Xu, Zhixu Li, Jie Zhou
    Package: Multilingual Rouge
  8. Towards a Unified Framework for Reference Retrieval and Related Work Generation
    Authors: Zhengliang Shi, Shen Gao, Zhen Zhang, Xiuying Chen, Zhumin Chen, Pengjie Ren, Zhaochun Ren
    Package: Standard ROUGE-1.5.5
  9. Hierarchical Catalogue Generation for Literature Review: A Benchmark
    Authors: Kun Zhu, Xiaocheng Feng, Xiachong Feng, Yingsheng Wu, Bing Qin
    Package: Standard ROUGE-1.5.5
  10. mLongT5: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences
    Authors: David Uthus, Santiago Ontanon, Joshua Ainslie, Mandy Guo
    Package: Multilingual Rouge
  11. D2TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization
    Authors: Yunlong Liang, Fandong Meng, Jiaan Wang, Jinan Xu, Yufeng Chen, Jie Zhou
    Package: Multilingual Rouge
  12. 4th New Frontiers in Summarization Workshop
  13. Zero-Shot Cross-Lingual Summarization via Large Language Models
    Authors: Jiaan Wang, Yunlong Liang, Fandong Meng, Beiqi Zou, Zhixu Li, Jianfeng Qu, Jie Zhou
    Package: Multilingual Rouge

Unknown Rouge Scores — EMNLP 2023 Papers

At the time of review (see review timeline below), these papers are missing Rouge software package citations. Because each Rouge package computes different Rouge scores, it is unclear whether Rouge scores reported in these papers are correct or comparable with prior work.

  1. Proceedings of CoNLL 2023
  2. Enhancing Code-mixed Text Generation Using Synthetic Data Filtering in Neural Machine Translation
    Authors: Dama Sravani, Radhika Mamidi
    Package: Unknown
  3. Towards Better Evaluation of Instruction-Following: A Case-Study in Summarization
    Authors: Ondrej Skopek, Rahul Aralikatte, Sian Gooding, Victor Carbune
    Package: Unknown
  4. JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models
    Authors: Yuiga Wada, Kanta Kaneda, Komei Sugiura
    Package: Unknown
  5. MuLER: Detailed and Scalable Reference-based Evaluation
    Authors: Taelin Karidi, Leshem Choshen, Gal Patel, Omri Abend
    Package: Unknown
  6. EMNLP 2023 Main Proceedings
  7. Let’s Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought
    Authors: Vaishnavi Himakunthala, Andy Ouyang, Daniel Rose, Ryan He, Alex Mei, Yujie Lu, Chinmay Sonar, Michael Saxon, William Wang
    Package: Unknown
  8. GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP
    Authors: Md Tawkat Islam Khondaker, Abdul Waheed, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed
    Package: Unknown
  9. Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models
    Authors: Gangwoo Kim, Sungdong Kim, Byeongguk Jeon, Joonsuk Park, Jaewoo Kang
    Notes: Abstract Mentions Rouge Scores
    Package: Unknown
  10. Is ChatGPT a General-Purpose Natural Language Processing Task Solver?
    Authors: Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, Diyi Yang
    Package: Unknown
  11. Location-Aware Visual Question Generation with Lightweight Models
    Authors: Nicholas Suwono, Justin Chen, Tun Hung, Ting-Hao Huang, I-Bin Liao, Yung-Hui Li, Lun-Wei Ku, Shao-Hua Sun
    Notes: Abstract Mentions Rouge Scores
    Package: Unknown
  12. A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding
    Authors: Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan Plummer, Kate Saenko, Jianmo Ni, Mandy Guo
    Package: Unknown
  13. OpenAsp: A Benchmark for Multi-document Open Aspect-based Summarization
    Authors: Shmuel Amar, Liat Schiff, Ori Ernst, Asi Shefer, Ori Shapira, Ido Dagan
    Package: Unknown
  14. TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models
    Authors: Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, Idan Szpektor
    Package: Unknown
  15. Promoting Topic Coherence and Inter-Document Consorts in Multi-Document Summarization via Simplicial Complex and Sheaf Graph
    Authors: Yash Atri, Arun Iyer, Tanmoy Chakraborty, Vikram Goyal
    Package: Unknown
  16. TempTabQA: Temporal Question Answering for Semi-Structured Tables
    Authors: Vivek Gupta, Pranshu Kandoi, Mahek Vora, Shuo Zhang, Yujie He, Ridho Reinanda, Vivek Srikumar
    Package: Unknown
  17. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
    Authors: Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu
    Notes: Abstract Mentions Rouge Scores
    Package: Unknown
  18. Learning Retrieval Augmentation for Personalized Dialogue Generation
    Authors: Qiushi Huang, Shuai Fu, Xubo Liu, Wenwu Wang, Tom Ko, Yu Zhang, Lilian Tang
    Package: Unknown
  19. Indicative Summarization of Long Discussions
    Authors: Shahbaz Syed, Dominik Schwabe, Khalid Khatib, Martin Potthast
    Package: Unknown
  20. Evaluating Large Language Models on Controlled Generation Tasks
    Authors: Jiao Sun, Yufei Tian, Wangchunshu Zhou, Nan Xu, Qian Hu, Rahul Gupta, John Wieting, Nanyun Peng, Xuezhe Ma
    Package: Unknown
  21. CS2W: A Chinese Spoken-to-Written Style Conversion Dataset with Multiple Conversion Types
    Authors: Zishan Guo, Linhao Yu, Minghui Xu, Renren Jin, Deyi Xiong
    Package: Unknown
  22. Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation
    Authors: Da Yin, Xiao Liu, Fan Yin, Ming Zhong, Hritik Bansal, Jiawei Han, Kai-Wei Chang
    Package: Unknown
  23. Prompting Large Language Models with Chain-of-Thought for Few-Shot Knowledge Base Question Generation
    Authors: Yuanyuan Liang, Jianing Wang, Hanlun Zhu, Lei Wang, Weining Qian, Yunshi Lan
    Notes: Abstract Mentions Rouge Scores
    Package: Unknown
  24. Interactive Text Generation
    Authors: Felix Faltings, Michel Galley, Kianté Brantley, Baolin Peng, Weixin Cai, Yizhe Zhang, Jianfeng Gao, Bill Dolan
    Package: Unknown
  25. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
    Authors: Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, Sumit Sanghai
    Package: Unknown
  26. QUDeval: The Evaluation of Questions Under Discussion Discourse Parsing
    Authors: Yating Wu, Ritika Mangla, Greg Durrett, Junyi Li
    Package: Unknown
  27. EntSUMv2: Dataset, Models and Evaluation for More Abstractive Entity-Centric Summarization
    Authors: Dhruv Mehra, Lingjue Xie, Ella Hofmann-Coyle, Mayank Kulkarni, Daniel Preotiuc-Pietro
    Package: Unknown
  28. MediaHG: Rethinking Eye-catchy Features in Social Media Headline Generation
    Authors: Boning Zhang, Yang Yang
    Notes: Abstract Mentions Rouge Scores
    Package: Unknown
  29. Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction
    Authors: Ji Qi, Chuchun Zhang, Xiaozhi Wang, Kaisheng Zeng, Jifan Yu, Jinxin Liu, Lei Hou, Juanzi Li, Xu Bin
    Notes: Received Paper Award
    Package: Unknown
  30. Answering Questions by Meta-Reasoning over Multiple Chains of Thought
    Authors: Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, Jonathan Berant
    Package: Unknown
  31. Abstractive Open Information Extraction
    Authors: Kevin Pei, Ishan Jindal, Kevin Chang
    Package: Unknown
  32. ReTAG: Reasoning Aware Table to Analytic Text Generation
    Authors: Deepanway Ghosal, Preksha Nema, Aravindan Raghuveer
    Package: Unknown
  33. Evaluation of African American Language Bias in Natural Language Generation
    Authors: Nicholas Deas, Jessica Grieser, Shana Kleiner, Desmond Patton, Elsbeth Turcan, Kathleen McKeown
    Package: Unknown
  34. Beat LLMs at Their Own Game: Zero-Shot LLM-Generated Text Detection via Querying ChatGPT
    Authors: Biru Zhu, Lifan Yuan, Ganqu Cui, Yangyi Chen, Chong Fu, Bingxiang He, Yangdong Deng, Zhiyuan Liu, Maosong Sun, Ming Gu
    Package: Unknown
  35. PreWoMe: Exploiting Presuppositions as Working Memory for Long Form Question Answering
    Authors: Wookje Han, Jinsol Park, Kyungjae Lee
    Package: Unknown
  36. DecipherPref: Analyzing Influential Factors in Human Preference Judgments via GPT-4
    Authors: Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Hassan Foroosh, Fei Liu
    Package: Unknown
  37. Gender Biases in Automatic Evaluation Metrics for Image Captioning
    Authors: Haoyi Qiu, Zi-Yi Dou, Tianlu Wang, Asli Celikyilmaz, Nanyun Peng
    Package: Unknown
  38. SOUL: Towards Sentiment and Opinion Understanding of Language
    Authors: Yue Deng, Wenxuan Zhang, Sinno Pan, Lidong Bing
    Package: Unknown
  39. MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation
    Authors: Zexue He, Yu Wang, An Yan, Yao Liu, Eric Chang, Amilcare Gentili, Julian McAuley, Chun-Nan Hsu
    Package: Unknown
  40. ORCHID: A Chinese Debate Corpus for Target-Independent Stance Detection and Argumentative Dialogue Summarization
    Authors: Xiutian Zhao, Ke Wang, Wei Peng
    Package: Unknown
  41. SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation
    Authors: Elizabeth Clark, Shruti Rijhwani, Sebastian Gehrmann, Joshua Maynez, Roee Aharoni, Vitaly Nikolaev, Thibault Sellam, Aditya Siddhant, Dipanjan Das, Ankur Parikh
    Package: Unknown
  42. A Video Is Worth 4096 Tokens: Verbalize Story Videos To Understand Them In Zero Shot
    Authors: Aanisha Bhattacharyya, Yaman Singla, Balaji Krishnamurthy, Rajiv Shah, Changyou Chen
    Package: Unknown
  43. Active Learning for Natural Language Generation
    Authors: Yotam Perlitz, Ariel Gera, Michal Shmueli-Scheuer, Dafna Sheinwald, Noam Slonim, Liat Ein-Dor
    Package: Unknown
  44. Reducing Sequence Length by Predicting Edit Spans with Large Language Models
    Authors: Masahiro Kaneko, Naoaki Okazaki
    Package: Unknown
  45. Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory
    Authors: Ziang Xiao, Susu Zhang, Vivian Lai, Q. Vera Liao
    Package: Unknown
  46. We Are What We Repeatedly Do: Inducing and Deploying Habitual Schemas in Persona-Based Responses
    Authors: Benjamin Kane, Lenhart Schubert
    Package: Unknown
  47. Countering Misinformation via Emotional Response Generation
    Authors: Daniel Russo, Shane Kaszefski-Yaschuk, Jacopo Staiano, Marco Guerini
    Package: Unknown
  48. Models See Hallucinations: Evaluating the Factuality in Video Captioning
    Authors: Hui Liu, Xiaojun Wan
    Package: Unknown
  49. Select, Prompt, Filter: Distilling Large Language Models for Summarizing Conversations
    Authors: Minh-Quang Pham, Sathish Indurthi, Shamil Chollampatt, Marco Turchi
    Notes: Abstract Mentions Rouge Scores
    Package: Unknown
  50. Impressions: Visual Semiotics and Aesthetic Impact Understanding
    Authors: Julia Kruk, Caleb Ziems, Diyi Yang
    Package: Unknown
  51. AutoTrial: Prompting Language Models for Clinical Trial Design
    Authors: Zifeng Wang, Cao Xiao, Jimeng Sun
    Package: Unknown
  52. Multi-Source Multi-Type Knowledge Exploration and Exploitation for Dialogue Generation
    Authors: Xuanfan Ni, Hongliang Dai, Zhaochun Ren, Piji Li
    Package: Unknown
  53. Context Compression for Auto-regressive Transformers with Sentinel Tokens
    Authors: Siyu Ren, Qi Jia, Kenny Zhu
    Package: Unknown
  54. Reconstruct Before Summarize: An Efficient Two-Step Framework for Condensing and Summarizing Meeting Transcripts
    Authors: Haochen Tan, Han Wu, Wei Shao, Xinyun Zhang, Mingjie Zhan, Zhaohui Hou, Ding Liang, Linqi Song
    Package: Unknown
  55. MaNtLE: Model-agnostic Natural Language Explainer
    Authors: Rakesh Menon, Kerem Zaman, Shashank Srivastava
    Package: Unknown
  56. PTP: Boosting Stability and Performance of Prompt Tuning with Perturbation-Based Regularizer
    Authors: Lichang Chen, Jiuhai Chen, Heng Huang, Minhao Cheng
    Package: Unknown
  57. CLAIR: Evaluating Image Captions with Large Language Models
    Authors: David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Darrell, John Canny
    Package: Unknown
  58. q2d: Turning Questions into Dialogs to Teach Models How to Search
    Authors: Yonatan Bitton, Shlomi Cohen-Ganor, Ido Hakimi, Yoad Lewenberg, Roee Aharoni, Enav Weinreb
    Package: Unknown
  59. You Told Me That Joke Twice: A Systematic Investigation of Transferability and Robustness of Humor Detection Models
    Authors: Alexander Baranov, Vladimir Kniazhevsky, Pavel Braslavski
    Package: Unknown
  60. IEKG: A Commonsense Knowledge Graph for Idiomatic Expressions
    Authors: Ziheng Zeng, Kellen Cheng, Srihari Nanniyur, Jianing Zhou, Suma Bhat
    Package: Unknown
  61. Exploring the Boundaries of GPT-4 in Radiology
    Authors: Qianchu Liu, Stephanie Hyland, Shruthi Bannur, Kenza Bouzid, Daniel Castro, Maria Wetscherek, Robert Tinn, Harshita Sharma, Fernando Pérez-García, Anton Schwaighofer, Pranav Rajpurkar, Sameer Khanna, Hoifung Poon, Naoto Usuyama, Anja Thieme, Aditya Nori, Matthew Lungren, Ozan Oktay, Javier Alvarez-Valle
    Package: Unknown
  62. Self-Ensemble of N-best Generation Hypotheses by Lexically Constrained Decoding
    Authors: Ryota Miyano, Tomoyuki Kajiwara, Yuki Arase
    Package: Unknown
  63. R2H: Building Multimodal Navigation Helpers that Respond to Help Requests
    Authors: Yue Fan, Jing Gu, Kaizhi Zheng, Xin Wang
    Package: Unknown
  64. Unveiling the Essence of Poetry: Introducing a Comprehensive Dataset and Benchmark for Poem Summarization
    Authors: Ridwan Mahbub, Ifrad Khan, Samiha Anuva, Md Shahriar, Md Tahmid Rahman Laskar, Sabbir Ahmed
    Package: Unknown
  65. Prompting with Pseudo-Code Instructions
    Authors: Mayank Mishra, Prince Kumar, Riyaz Bhat, Rudra Murthy, Danish Contractor, Srikanth Tamilselvam
    Notes: Abstract Mentions Rouge Scores
    Package: Unknown
  66. Reinforcement Replaces Supervision: Query focused Summarization using Deep Reinforcement Learning
    Authors: Swaroop Nath, Pushpak Bhattacharyya, Harshad Khadilkar
    Notes: Abstract Mentions Rouge Scores
    Package: Unknown
  67. Graph vs. Sequence: An Empirical Study on Knowledge Forms for Knowledge-Grounded Dialogue
    Authors: Yizhe Yang, Heyan Huang, Yuhang Liu, Yang Gao
    Package: Unknown
  68. Exploring Distributional Shifts in Large Language Models for Code Analysis
    Authors: Shushan Arakelyan, Rocktim Das, Yi Mao, Xiang Ren
    Package: Unknown
  69. Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation
    Authors: Yixin Liu, Alexander Fabbri, Yilun Zhao, Pengfei Liu, Shafiq Joty, Chien-Sheng Wu, Caiming Xiong, Dragomir Radev
    Package: Unknown
  70. ALCAP: Alignment-Augmented Music Captioner
    Authors: Zihao He, Weituo Hao, Wei-Tsung Lu, Changyou Chen, Kristina Lerman, Xuchen Song
    Package: Unknown
  71. Speech Recognition and Meaning Interpretation: Towards Disambiguation of Structurally Ambiguous Spoken Utterances in Indonesian
    Authors: Ruhiyah Widiaputri, Ayu Purwarianti, Dessi Lestari, Kurniawati Azizah, Dipta Tanaya, Sakriani Sakti
    Package: Unknown
  72. Findings of the ACL: EMNLP 2023
  73. Multi Document Summarization Evaluation in the Presence of Damaging Content
    Authors: Avshalom Manevich, David Carmel, Nachshon Cohen, Elad Kravi, Ori Shapira
    Package: Unknown
  74. Follow-on Question Suggestion via Voice Hints for Voice Assistants
    Authors: Besnik Fetahu, Pedro Faustini, Anjie Fang, Giuseppe Castellucci, Oleg Rokhlenko, Shervin Malmasi
    Package: Unknown
  75. Dolphin: A Challenging and Diverse Benchmark for Arabic NLG
    Authors: El Moatez Billah Nagoudi, AbdelRahim Elmadany, Ahmed El-Shangiti, Muhammad Abdul-Mageed
    Package: Unknown
  76. TaTA: A Multilingual Table-to-Text Dataset for African Languages
    Authors: Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Jan Botha, Michael Chavinda, Ankur Parikh, Clara Rivera
    Package: Unknown
  77. Towards Mitigating LLM Hallucination via Self Reflection
    Authors: Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, Pascale Fung
    Package: Unknown
  78. ExplainCPE: A Free-text Explanation Benchmark of Chinese Pharmacist Examination
    Authors: Dongfang Li, Jindi Yu, Baotian Hu, Zhenran Xu, Min Zhang
    Package: Unknown
  79. Multi-Granularity Information Interaction Framework for Incomplete Utterance Rewriting
    Authors: Haowei Du, Dinghao Zhang, Chen Li, Yang Li, Dongyan Zhao
    Package: Unknown
  80. Accuracy is not enough: Evaluating Personalization in Summarizers
    Authors: Rahul Vansh, Darsh Rank, Sourish Dasgupta, Tanmoy Chakraborty
    Notes: Abstract Mentions Rouge Scores
    Package: Unknown
  81. MaXM: Towards Multilingual Visual Question Answering
    Authors: Soravit Changpinyo, Linting Xue, Michal Yarom, Ashish Thapliyal, Idan Szpektor, Julien Amelot, Xi Chen, Radu Soricut
    Package: Unknown
  82. Understanding HTML with Large Language Models
    Authors: Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, Aleksandra Faust
    Package: Unknown
  83. Can you Summarize my learnings? Towards Perspective-based Educational Dialogue Summarization
    Authors: Raghav Jain, Tulika Saha, Jhagrut Lalwani, Sriparna Saha
    Package: Unknown
  84. Towards Informative Open-ended Text Generation with Dynamic Knowledge Triples
    Authors: Zixuan Ren, Yang Zhao, Chengqing Zong
    Package: Unknown
  85. Ask Language Model to Clean Your Noisy Translation Data
    Authors: Quinten Bolding, Baohao Liao, Brandon Denis, Jun Luo, Christof Monz
    Package: Unknown
  86. Multi-User MultiWOZ: Task-Oriented Dialogues among Multiple Users
    Authors: Yohan Jo, Xinyan Zhao, Arijit Biswas, Nikoletta Basiou, Vincent Auvray, Nikolaos Malandrakis, Angeliki Metallinou, Alexandros Potamianos
    Package: Unknown
  87. Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization
    Authors: Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You, Lidong Bing
    Package: Unknown
  88. Mind the Gap: Automated Corpus Creation for Enthymeme Detection and Reconstruction in Learner Arguments
    Authors: Maja Stahl, Nick Düsterhus, Mei-Hua Chen, Henning Wachsmuth
    Package: Unknown
  89. The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
    Authors: Dung Nguyen, Le Nam, Anh Dau, Anh Nguyen, Khanh Nghiem, Jin Guo, Nghi Bui
    Package: Unknown
  90. INVITE: a Testbed of Automatically Generated Invalid Questions to Evaluate Large Language Models for Hallucinations
    Authors: Anil Ramakrishna, Rahul Gupta, Jens Lehmann, Morteza Ziyadi
    Package: Unknown
  91. Enhancing Conversational Search: Large Language Model-Aided Informative Query Rewriting
    Authors: Fanghua Ye, Meng Fang, Shenghui Li, Emine Yilmaz
    Package: Unknown
  92. Leveraging Structured Information for Explainable Multi-hop Question Answering and Reasoning
    Authors: Ruosen Li, Xinya Du
    Package: Unknown
  93. TRIP: Accelerating Document-level Multilingual Pre-training via Triangular Document-level Pre-training on Parallel Data Triplets
    Authors: Hongyuan Lu, Haoyang Huang, Shuming Ma, Dongdong Zhang, Wai Lam, Zhaochuan Gao, Anthony Aue, Arul Menezes, Furu Wei
    Notes: Abstract Mentions Rouge Scores
    Package: Unknown
  94. Beyond Candidates : Adaptive Dialogue Agent Utilizing Persona and Knowledge
    Authors: Jungwoo Lim, Myunghoon Kang, Jinsung Kim, Jeongwook Kim, Yuna Hur, Heuiseok Lim
    Package: Unknown
  95. ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding
    Authors: Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, Omer Levy
    Package: Unknown
  96. Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model
    Authors: Yinghan Long, Sayeed Chowdhury, Kaushik Roy
    Notes: Abstract Mentions Rouge Scores
    Package: Unknown
  97. Large Language Models Meet Harry Potter: A Dataset for Aligning Dialogue Agents with Characters
    Authors: Nuo Chen, Yan Wang, Haiyun Jiang, Deng Cai, Yuhan Li, Ziyang Chen, Longyue Wang, Jia Li
    Package: Unknown
  98. Citance-Contextualized Summarization of Scientific Papers
    Authors: Shahbaz Syed, Ahmad Hakimi, Khalid Al-Khatib, Martin Potthast
    Package: Unknown
  99. A Rewriting Approach for Gender Inclusivity in Portuguese
    Authors: Leonor Veloso, Luisa Coheur, Rui Ribeiro
    Package: Unknown
  100. LEGO: A Multi-agent Collaborative Framework with Role-playing and Iterative Feedback for Causality Explanation Generation
    Authors: Zhitao He, Pengfei Cao, Yubo Chen, Kang Liu, Ruopeng Li, Mengshu Sun, Jun Zhao
    Package: Unknown
  101. CITB: A Benchmark for Continual Instruction Tuning
    Authors: Zihan Zhang, Meng Fang, Ling Chen, Mohammad-Reza Namazi-Rad
    Package: Unknown
  102. Large Language Models as Source Planner for Personalized Knowledge-grounded Dialogues
    Authors: Hongru Wang, Minda Hu, Yang Deng, Rui Wang, Fei Mi, Weichao Wang, Yasheng Wang, Wai-Chung Kwan, Irwin King, Kam-Fai Wong
    Package: Unknown
  103. LLM aided semi-supervision for efficient Extractive Dialog Summarization
    Authors: Nishant Mishra, Gaurav Sahu, Iacer Calixto, Ameen Abu-Hanna, Issam Laradji
    Notes: Abstract Mentions Rouge Scores
    Package: Unknown
  104. Exploring In-Context Learning for Knowledge Grounded Dialog Generation
    Authors: Qinyu Chen, Wenhao Wu, Sujian Li
    Notes: Abstract Mentions Rouge Scores
    Package: Unknown
  105. InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector through Instruction Tuning
    Authors: Zhexin Zhang, Jiale Cheng, Hao Sun, Jiawen Deng, Minlie Huang
    Package: Unknown
  106. SummIt: Iterative Text Summarization via ChatGPT
    Authors: Haopeng Zhang, Xiao Liu, Jiawei Zhang
    Package: Unknown
  107. HuatuoGPT, Towards Taming Language Model to Be a Doctor
    Authors: Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Guiming Chen, Jianquan Li, Xiangbo Wu, Zhang Zhiyi, Qingying Xiao, Xiang Wan, Benyou Wang, Haizhou Li
    Package: Unknown
  108. Diffusion Language Model with Query-Document Relevance for Query-Focused Summarization
    Authors: Shaoyao Huang, Luozheng Qin, Ziqiang Cao
    Notes: Abstract Mentions Rouge Scores
    Package: Unknown
  109. TokenDrop + BucketSampler: Towards Efficient Padding-free Fine-tuning of Language Models
    Authors: Amrit Nagarajan, Anand Raghunathan
    Package: Unknown
  110. Using In-Context Learning to Improve Dialogue Safety
    Authors: Nicholas Meade, Spandana Gella, Devamanyu Hazarika, Prakhar Gupta, Di Jin, Siva Reddy, Yang Liu, Dilek Hakkani-Tur
    Package: Unknown
  111. Improving Consistency for Text Summarization with Energy Functions
    Authors: Qi Zeng, Qingyu Yin, Zheng Li, Yifan Gao, Sreyashi Nag, Zhengyang Wang, Bing Yin, Heng Ji, Chao Zhang
    Package: Unknown
  112. PR-MCS: Perturbation Robust Metric for MultiLingual Image Captioning
    Authors: Yongil Kim, Yerin Hwang, Hyeongu Yun, Seunghyun Yoon, Trung Bui, Kyomin Jung
    Package: Unknown
  113. LLMs – the Good, the Bad or the Indispensable?: A Use Case on Legal Statute Prediction and Legal Judgment Prediction on Indian Court Cases
    Authors: Shaurya Vats, Atharva Zope, Somsubhra De, Anurag Sharma, Upal Bhattacharya, Shubham Nigam, Shouvik Guha, Koustav Rudra, Kripabandhu Ghosh
    Package: Unknown
  114. ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning
    Authors: Viet Lai, Nghia Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, Thien Nguyen
    Package: Unknown
  115. PARROT: Zero-Shot Narrative Reading Comprehension via Parallel Reading
    Authors: Chao Zhao, Anvesh Vijjini, Snigdha Chaturvedi
    Package: Unknown
  116. Synthesize, if you do not have: Effective Synthetic Dataset Creation Strategies for Self-Supervised Opinion Summarization in E-commerce
    Authors: Tejpalsingh Siledar, Suman Banerjee, Amey Patil, Sudhanshu Singh, Muthusamy Chelliah, Nikesh Garera, Pushpak Bhattacharyya
    Package: Unknown
  117. InheritSumm: A General, Versatile and Compact Summarizer by Distilling from GPT
    Authors: Yichong Xu, Ruochen Xu, Dan Iter, Yang Liu, Shuohang Wang, Chenguang Zhu, Michael Zeng
    Package: Unknown
  118. DialGuide: Aligning Dialogue Model Behavior with Developer Guidelines
    Authors: Prakhar Gupta, Yang Liu, Di Jin, Behnam Hedayatnia, Spandana Gella, Sijia Liu, Patrick Lange, Julia Hirschberg, Dilek Hakkani-Tur
    Package: Unknown
  119. Evaluating Subjective Cognitive Appraisals of Emotions from Large Language Models
    Authors: Hongli Zhan, Desmond Ong, Junyi Li
    Package: Unknown
  120. 1-PAGER: One Pass Answer Generation and Evidence Retrieval
    Authors: Palak Jain, Livio Soares, Tom Kwiatkowski
    Package: Unknown
  121. LMGQS: A Large-scale Dataset for Query-focused Summarization
    Authors: Ruochen Xu, Song Wang, Yang Liu, Shuohang Wang, Yichong Xu, Dan Iter, Pengcheng He, Chenguang Zhu, Michael Zeng
    Package: Unknown
  122. Extrapolating Multilingual Understanding Models as Multilingual Generators
    Authors: Bohong Wu, Fei Yuan, Hai Zhao, Lei Li, Jingjing Xu
    Notes: Abstract Mentions Rouge Scores
    Package: Unknown
  123. 3rd Workshop on Multi-lingual Representation Learning
  124. Generating Continuations in Multilingual Idiomatic Contexts
    Authors: Rhitabrat Pokharel, Ameeta Agrawal
    Package: Unknown
  125. 4th New Frontiers in Summarization Workshop
  126. Is ChatGPT a Good NLG Evaluator? A Preliminary Study
    Authors: Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, Jie Zhou
    Package: Unknown
  127. In-context Learning of Large Language Models for Controlled Dialogue Summarization: A Holistic Benchmark and Empirical Analysis
    Authors: Yuting Tang, Ratish Puduppully, Zhengyuan Liu, Nancy Chen
    Package: Unknown
  128. From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting
    Authors: Griffin Adams, Alex Fabbri, Faisal Ladhak, Eric Lehman, Noémie Elhadad
    Package: Unknown
  129. Generating Extractive and Abstractive Summaries in Parallel from Scientific Articles Incorporating Citing Statements
    Authors: Sudipta Singha Roy, Robert E. Mercer
    Package: Unknown
  130. Analyzing Multi-Sentence Aggregation in Abstractive Summarization via the Shapley Value
    Authors: Jingyi He, Meng Cao, Jackie Chi Kit Cheung
    Package: Unknown
  131. Natural Legal Language Processing Workshop 2023
  132. Questions about Contracts: Prompt Templates for Structured Answer Generation
    Authors: Adam Roegiest, Radha Chitta, Jonathan Donnelly, Maya Lash, Alexandra Vtyurina, Francois Longtin
    Package: Unknown
  133. 3rd Workshop for NLP Open Source Software
  134. The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
    Authors: Dung Nguyen Manh, Nam Le Hai, Anh T. V. Dau, Anh Minh Nguyen, Khanh Nghiem, Jin Guo, Nghi D. Q. Bui
    Package: Unknown

Incorrect Rouge Scores — EMNLP 2023 Papers

These papers or their code releases reference Rouge software packages that compute incorrect Rouge scores because of implementation errors. Incorrect Rouge scores differ from those of the official ROUGE-1.5.5 reference implementation. See the packages section for more detail.

  1. Proceedings of CoNLL 2023
  2. ArchBERT: Bi-Modal Understanding of Neural Architectures and Natural Languages
    Authors: Mohammad Akbari, Saeed Ranjbar Alvar, Behnam Kamranian, Amin Banitalebi-Dehkordi, Yong Zhang
    Package: LA/torchmetrics
  3. EMNLP 2023 Main Proceedings
  4. Reading Books is Great, But Not if You Are Driving! Visually Grounded Reasoning about Defeasible Commonsense Norms
    Authors: Seungju Han, Junhyeok Kim, Jack Hessel, Liwei Jiang, Jiwan Chung, Yejin Son, Yejin Choi, Youngjae Yu
    Package: GL/rougescore
  5. BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations
    Authors: Qizhi Pei, Wei Zhang, Jinhua Zhu, Kehan Wu, Kaiyuan Gao, Lijun Wu, Yingce Xia, Rui Yan
    Package: GL/rougescore
  6. MemeCap: A Dataset for Captioning and Interpreting Memes
    Authors: EunJeong Hwang, Vered Shwartz
    Package: MS/rouge
  7. Fast and Accurate Factual Inconsistency Detection Over Long Documents
    Authors: Barrett Lattimer, Patrick Chen, Xinyuan Zhang, Yi Yang
    Package: GL/rougescore, PT/rouge
  8. Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks
    Authors: Po-Nien Kung, Fan Yin, Di Wu, Kai-Wei Chang, Nanyun Peng
    Package: GL/rougescore
  9. API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
    Authors: Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li
    Package: PT/rouge
  10. Lion: Adversarial Distillation of Proprietary Large Language Models
    Authors: Yuxin Jiang, Chunkit Chan, Mingyang Chen, Wei Wang
    Package: GL/rougescore
  11. Lost in Translation, Found in Spans: Identifying Claims in Multilingual Social Media
    Authors: Shubham Mittal, Megha Sundriyal, Preslav Nakov
    Package: PT/rouge
  12. Investigating Efficiently Extending Transformers for Long Input Summarization
    Authors: Jason Phang, Yao Zhao, Peter Liu
    Package: GL/rougescore
  13. mRedditSum: A Multimodal Abstractive Summarization Dataset of Reddit Threads with Images
    Authors: Keighley Overbay, Jaewoo Ahn, Fatemeh Pesaran zadeh, Joonsuk Park, Gunhee Kim
    Package: PT/rouge
  14. Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
    Authors: Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun
    Package: GL/rougescore
  15. Towards Interpretable Mental Health Analysis with Large Language Models
    Authors: Kailai Yang, Shaoxiong Ji, Tianlin Zhang, Qianqian Xie, Ziyan Kuang, Sophia Ananiadou
    Package: GL/rougescore
  16. Modeling Empathic Similarity in Personal Narratives
    Authors: Jocelyn Shen, Maarten Sap, Pedro Colon-Hernandez, Hae Park, Cynthia Breazeal
    Package: PT/rouge
  17. Enabling Large Language Models to Generate Text with Citations
    Authors: Tianyu Gao, Howard Yen, Jiatong Yu, Danqi Chen
    Package: GL/rougescore
  18. A Systematic Study of Performance Disparities in Multilingual Task-Oriented Dialogue Systems
    Authors: Songbo Hu, Han Zhou, Moy Yuan, Milan Gritta, Guchun Zhang, Ignacio Iacobacci, Anna Korhonen, Ivan Vulić
    Package: GL/rougescore
  19. CiteBench: A Benchmark for Scientific Citation Text Generation
    Authors: Martin Funkquist, Ilia Kuznetsov, Yufang Hou, Iryna Gurevych
    Package: GL/rougescore
  20. Instructive Dialogue Summarization with Query Aggregations
    Authors: Bin Wang, Zhengyuan Liu, Nancy Chen
    Package: DI/pyrouge, GL/rougescore
  21. Enhancing Biomedical Lay Summarisation with External Knowledge Graphs
    Authors: Tomas Goldsack, Zhihao Zhang, Chen Tang, Carolina Scarton, Chenghua Lin
    Package: GL/rougescore
  22. Background Summarization of Event Timelines
    Authors: Adithya Pratapa, Kevin Small, Markus Dreyer
    Notes: Received Paper Award
    Package: GL/rougescore
  23. trlX: A Framework for Large Scale Reinforcement Learning from Human Feedback
    Authors: Alexander Havrilla, Maksym Zhuravinskyi, Duy Phung, Aman Tiwari, Jonathan Tow, Stella Biderman, Quentin Anthony, Louis Castricato
    Package: GL/rougescore
  24. Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks
    Authors: Andrea Sottana, Bin Liang, Kai Zou, Zheng Yuan
    Package: GL/rougescore
  25. Detecting and Mitigating Hallucinations in Multilingual Summarisation
    Authors: Yifu Qiu, Yftah Ziser, Anna Korhonen, Edoardo Ponti, Shay Cohen
    Package: GL/rougescore, PT/rouge
  26. Revisiting Instruction Fine-tuned Model Evaluation to Guide Industrial Applications
    Authors: Manuel Faysse, Gautier Viaud, Céline Hudelot, Pierre Colombo
    Package: GL/rougescore
  27. Instruct and Extract: Instruction Tuning for On-Demand Information Extraction
    Authors: Yizhu Jiao, Ming Zhong, Sha Li, Ruining Zhao, Siru Ouyang, Heng Ji, Jiawei Han
    Package: GL/rougescore
  28. ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness
    Authors: Archiki Prasad, Swarnadeep Saha, Xiang Zhou, Mohit Bansal
    Package: GL/rougescore
  29. Contrastive Learning for Inference in Dialogue
    Authors: Etsuko Ishii, Yan Xu, Bryan Wilie, Ziwei Ji, Holy Lovenia, Willy Chung, Pascale Fung
    Package: MS/rouge
  30. Paraphrase Types for Generation and Detection
    Authors: Jan Wahle, Bela Gipp, Terry Ruas
    Package: PT/rouge
  31. Hallucination Mitigation in Natural Language Generation from Large-Scale Open-Domain Knowledge Graphs
    Authors: Xiao Shi, Zhengyuan Zhu, Zeyu Zhang, Chengkai Li
    Package: GL/rougescore
  32. Multilingual Large Language Models Are Not (Yet) Code-Switchers
    Authors: Ruochen Zhang, Samuel Cahyawijaya, Jan Christian Blaise Cruz, Genta Winata, Alham Aji
    Package: GL/rougescore
  33. KCTS: Knowledge-Constrained Tree Search Decoding with Token-Level Hallucination Detection
    Authors: Sehyun Choi, Tianqing Fang, Zhaowei Wang, Yangqiu Song
    Package: GL/rougescore
  34. CP-BCS: Binary Code Summarization Guided by Control Flow Graph and Pseudo Code
    Authors: Tong Ye, Lingfei Wu, Tengfei Ma, Xuhong Zhang, Yangkai Du, Peiyu Liu, Shouling Ji, Wenhai Wang
    Package: MS/rouge
  35. Length Does Matter: Summary Length can Bias Summarization Metrics
    Authors: Xiaobo Guo, Soroush Vosoughi
    Package: BZ/pyrouge
  36. Argue with Me Tersely: Towards Sentence-Level Counter-Argument Generation
    Authors: Jiayu Lin, Rong Ye, Meng Han, Qi Zhang, Ruofei Lai, Xinyu Zhang, Zhao Cao, Xuanjing Huang, Zhongyu Wei
    Package: PT/rouge
  37. Findings of the ACL: EMNLP 2023
  38. DocAsRef: An Empirical Study on Repurposing Reference-based Summary Quality Metrics as Reference-free Metrics
    Authors: Forrest Bao, Ruixuan Tu, Ge Luo, Yinfei Yang, Hebi Li, Minghui Qiu, Youbiao He, Cen Chen
    Package: GL/rougescore
  39. Execution-Based Evaluation for Open-Domain Code Generation
    Authors: Zhiruo Wang, Shuyan Zhou, Daniel Fried, Graham Neubig
    Package: GL/rougescore
  40. Improving the Robustness of Summarization Models by Detecting and Removing Input Noise
    Authors: Kundan Krishna, Yao Zhao, Jie Ren, Balaji Lakshminarayanan, Jiaming Luo, Mohammad Saleh, Peter Liu
    Notes: Abstract Mentions Rouge Scores
    Package: GL/rougescore
  41. Extractive Summarization via ChatGPT for Faithful Summary Generation
    Authors: Haopeng Zhang, Xiao Liu, Jiawei Zhang
    Notes: Abstract Mentions Rouge Scores
    Package: GL/rougescore
  42. InstructExcel: A Benchmark for Natural Language Instruction in Excel
    Authors: Justin Payan, Swaroop Mishra, Mukul Singh, Carina Negreanu, Christian Poelitz, Chitta Baral, Subhro Roy, Rasika Chakravarthy, Benjamin Van Durme, Elnaz Nouri
    Package: GL/rougescore
  43. Multi-step Jailbreaking Privacy Attacks on ChatGPT
    Authors: Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, Yangqiu Song
    Package: GL/rougescore
  44. FREDSum: A Dialogue Summarization Corpus for French Political Debates
    Authors: Virgile Rennard, Guokan Shang, Damien Grari, Julie Hunter, Michalis Vazirgiannis
    Package: GL/rougescore
  45. Frugal Prompting for Dialog Models
    Authors: Bishal Santra, Sakya Basak, Abhinandan De, Manish Gupta, Pawan Goyal
    Package: GL/rougescore
  46. The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation
    Authors: Mutian He, Philip Garner
    Package: GL/rougescore
  47. Is ChatGPT a Good Multi-Party Conversation Solver?
    Authors: Chao-Hong Tan, Jia-Chen Gu, Zhen-Hua Ling
    Package: GL/rougescore
  48. Bipartite Graph Pre-training for Unsupervised Extractive Summarization with Graph Convolutional Auto-Encoders
    Authors: Qianren Mao, Shaobo Zhao, Jiarui Li, Xiaolei Gu, Shizhu He, Bo Li, Jianxin Li
    Package: BZ/pyrouge
  49. Adapting Pretrained Text-to-Text Models for Long Text Sequences
    Authors: Wenhan Xiong, Anchit Gupta, Shubham Toshniwal, Yashar Mehdad, Scott Yih
    Package: PT/files2rouge
  50. Large-Scale and Multi-Perspective Opinion Summarization with Diverse Review Subsets
    Authors: Han Jiang, Rui Wang, Zhihua Wei, Yu Li, Xinpeng Wang
    Package: PT/files2rouge
  51. Topic-Informed Dialogue Summarization using Topic Distribution and Prompt-based Modeling
    Authors: Jaeah You, Youngjoong Ko
    Notes: Abstract Mentions Rouge Scores
    Package: DI/pyrouge
  52. A Hierarchical Encoding-Decoding Scheme for Abstractive Multi-document Summarization
    Authors: Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You, Lidong Bing
    Notes: Abstract Mentions Rouge Scores
    Package: GL/rougescore
  53. NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models
    Authors: Jongwoo Ko, Seungjoon Park, Yujin Kim, Sumyeong Ahn, Du-Seong Chang, Euijai Ahn, Se-Young Yun
    Package: GL/rougescore
  54. Inverse Reinforcement Learning for Text Summarization
    Authors: Yu Fu, Deyi Xiong, Yue Dong
    Notes: Abstract Mentions Rouge Scores
    Package: GL/rougescore
  55. From Chaos to Clarity: Claim Normalization to Empower Fact-Checking
    Authors: Megha Sundriyal, Tanmoy Chakraborty, Preslav Nakov
    Package: DI/pyrouge
  56. Lexical Repetitions Lead to Rote Learning: Unveiling the Impact of Lexical Overlap in Train and Test Reference Summaries
    Authors: Prafulla Choubey, Alexander Fabbri, Caiming Xiong, Chien-Sheng Wu
    Notes: Abstract Mentions Rouge Scores
    Package: GL/rougescore
  57. Open Domain Multi-document Summarization: A Comprehensive Study of Model Brittleness under Retrieval
    Authors: John Giorgi, Luca Soldaini, Bo Wang, Gary Bader, Kyle Lo, Lucy Wang, Arman Cohan
    Package: GL/rougescore
  58. USB: A Unified Summarization Benchmark Across Tasks and Domains
    Authors: Kundan Krishna, Prakhar Gupta, Sanjana Ramprasad, Byron Wallace, Jeffrey Bigham, Zachary Lipton
    Package: GL/rougescore
  59. Domain Adaptation for Conversational Query Production with the RAG Model Feedback
    Authors: Ante Wang, Linfeng Song, Ge Xu, Jinsong Su
    Package: GL/rougescore
  60. DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models
    Authors: Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, Lingpeng Kong
    Package: LA/torchmetrics
  61. PivotFEC: Enhancing Few-shot Factual Error Correction with a Pivot Task Approach using Large Language Models
    Authors: Xingwei He, A-Long Jin, Jun Ma, Yuan Yuan, Siu Yiu
    Package: GL/rougescore
  62. Can Large Language Models Fix Data Annotation Errors? An Empirical Study Using Debatepedia for Query-Focused Text Summarization
    Authors: Md Tahmid Rahman Laskar, Mizanur Rahman, Israt Jahan, Enamul Hoque, Jimmy Huang
    Package: GL/rougescore
  63. Prompting and Evaluating Large Language Models for Proactive Dialogues: Clarification, Target-guided, and Non-collaboration
    Authors: Yang Deng, Lizi Liao, Liang Chen, Hongru Wang, Wenqiang Lei, Tat-Seng Chua
    Package: BZ/pyrouge, MS/rouge
  64. Natural Response Generation for Chinese Reading Comprehension
    Authors: Nuo Chen, Hongguang Li, Yinan Bao, Baoyuan Wang, Jia Li
    Package: Custom reimplementation of Rouge
  65. Is ChatGPT a Good Causal Reasoner? A Comprehensive Evaluation
    Authors: Jinglong Gao, Xiao Ding, Bing Qin, Ting Liu
    Package: PT/rouge
  66. Enhancing Accessible Communication: from European Portuguese to Portuguese Sign Language
    Authors: Catarina Sousa, Luisa Coheur, Mara Moita
    Package: GL/rougescore
  67. HEAR: Hearing Enhanced Audio Response for Video-grounded Dialogue
    Authors: Sunjae Yoon, Dahyun Kim, Eunseop Yoon, Hee Yoon, Junyeong Kim, Chang Yoo
    Package: MS/rouge
  68. Ensemble-Instruct: Instruction Tuning Data Generation with a Heterogeneous Mixture of LMs
    Authors: Young-Suk Lee, Md Sultan, Yousef El-Kurdi, Tahira Naseem, Asim Munawar, Radu Florian, Salim Roukos, Ramón Astudillo
    Package: GL/rougescore
  69. Don’t Add, don’t Miss: Effective Content Preserving Generation from Pre-Selected Text Spans
    Authors: Aviv Slobodkin, Avi Caciularu, Eran Hirsch, Ido Dagan
    Notes: Abstract Mentions Rouge Scores
    Package: GL/rougescore
  70. COMET-M: Reasoning about Multiple Events in Complex Sentences
    Authors: Sahithya Ravi, Raymond Ng, Vered Shwartz
    Package: GL/rougescore, MS/rouge
  71. Cross-modality Data Augmentation for End-to-End Sign Language Translation
    Authors: Jinhui Ye, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Hui Xiong
    Package: MS/rouge
  72. InstOptima: Evolutionary Multi-objective Instruction Optimization via Large Language Model-based Instruction Operators
    Authors: Heng Yang, Ke Li
    Package: PT/rouge
  73. Re-Examining Summarization Evaluation across Multiple Quality Criteria
    Authors: Ori Ernst, Ori Shapira, Ido Dagan, Ran Levy
    Package: BZ/pyrouge
  74. NarrativeXL: a Large-scale Dataset for Long-Term Memory Models
    Authors: Arsenii Moskvichev, Ky-Vinh Mai
    Package: GL/rougescore
  75. PIVOINE: Instruction Tuning for Open-world Entity Profiling
    Authors: Keming Lu, Xiaoman Pan, Kaiqiang Song, Hongming Zhang, Dong Yu, Jianshu Chen
    Package: ND/easyrouge, PT/rouge
  76. Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension
    Authors: Nuo Chen, Hongguang Li, Junqing He, Yinan Bao, Xinshi Lin, Qi Yang, Jianfeng Liu, Ruyi Gan, Jiaxing Zhang, Baoyuan Wang, Jia Li
    Package: Custom reimplementation of Rouge
  77. Mitigating Intrinsic Named Entity-Related Hallucinations of Abstractive Text Summarization
    Authors: Jianbin Shen, Junyu Xuan, Christy Liang
    Package: GL/rougescore
  78. 3rd Workshop on Multi-lingual Representation Learning
  79. Findings of the 1st Shared Task on Multi-lingual Multi-task Information Retrieval at MRL 2023
    Authors: Francesco Tinner, David Ifeoluwa Adelani, Chris Emezue, Mammad Hajili, Omer Goldman, Muhammad Farid Adilazuarda, Muhammad Dehan Al Kautsar, Aziza Mirsaidova, Müge Kural, Dylan Massey, Chiamaka Chukwuneke, Chinedu Mbonu, Damilola Oluwaseun Oloyede, Kayode Olaleye, Jonathan Atala, Benjamin A. Ajibade, Saksham Bassi, Rahul Aralikatte, Najoung Kim, Duygu Ataman
    Package: GL/rougescore
  80. 4th New Frontiers in Summarization Workshop
  81. Extract, Select and Rewrite: A Modular Sentence Summarization Method
    Authors: Shuo Guan, Vishakh Padmakumar
    Package: PT/rouge
  82. Improving Multi-Stage Long Document Summarization with Enhanced Coarse Summarizer
    Authors: Jinhyeong Lim, Hyun-Je Song
    Package: BZ/pyrouge
  83. 3rd Workshop for NLP Open Source Software
  84. nanoT5: Fast & Simple Pre-training and Fine-tuning of T5 Models with Limited Resources
    Authors: Piotr Nawrot
    Package: GL/rougescore

Incorrect Rouge Packages — Cited at EMNLP 2023

These packages have implementation or configuration errors that result in incorrect Rouge scores. These errors were first identified in the ACL 2023 Rogue Scores paper by comparing each package’s output scores to those of ROUGE-1.5.5 under various evaluation conditions.

  1. Package With Errors: GL/rougescore
    Incorrect implementation of Porter stemming. Incorrect default implementation of Rouge-L. Bootstrapping introduces random noise into scores (minor issue). Distributed by both Google Research (GL/rougescore) and Hugging Face (HF/evaluate).
  2. Package With Errors: PT/rouge
    Implementation errors in both Rouge-N and Rouge-L algorithms. Not capable of performing stemming or bootstrapping.
  3. Package With Errors: PT/files2rouge
    Incorrectly tokenizes sentences using the period character (“.”), ignoring existing tokenization. Bootstrapping introduces random noise into scores (minor issue).
  4. Package With Errors: DI/pyrouge
    Unclear implementation errors cause incorrect Rouge scores for approximately 4% of model outputs during testing. Not capable of performing bootstrapping.
  5. Package With Errors: MS/rouge
    Accidentally computes recall-biased Rouge F-scores using $\beta = 1.2$. (Rouge F-scores are almost universally computed with $\beta = 1.0$.) Performs incorrect sentence tokenization. Not capable of performing stemming or bootstrapping.
  6. Package With Errors: BZ/pyrouge
    Contains a single line of code that silently enables stemming, even when the user attempts to disable it. Bootstrapping introduces random noise into scores (minor issue). Distributed and reused by several other packages, including YL/summeval.
  7. Package With Errors: ND/easyrouge
    Omits many major components of Rouge scores: “Preprocessing like stopword removal, stemming and tokenization is left to the client.”
  8. Package With Errors: LA/torchmetrics
    This custom reimplementation of Rouge has not been evaluated for correctness. It appears to be based on the incorrect GL/rougescore implementation, including replicating the incorrect default Rouge-L behavior.
  9. Custom Reimplementations
    Some papers link to code that contains custom ad hoc reimplementations or wrappers of Rouge not evaluated in Rogue Scores. The correctness of these custom implementations is determined by static analysis during review of the code release.
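
To illustrate why the $\beta$ setting noted for MS/rouge matters, the following minimal sketch computes the standard F-beta score from precision and recall. The function name `f_beta` and the precision and recall values are hypothetical, chosen only for illustration; they are not taken from any Rouge package or paper.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta).

    beta > 1 weights recall more heavily than precision.
    """
    denom = recall + beta ** 2 * precision
    if denom == 0.0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Hypothetical precision/recall values, for illustration only.
precision, recall = 0.30, 0.50

balanced = f_beta(precision, recall, beta=1.0)
recall_biased = f_beta(precision, recall, beta=1.2)

print(f"beta=1.0: {balanced:.4f}")       # beta=1.0: 0.3750
print(f"beta=1.2: {recall_biased:.4f}")  # beta=1.2: 0.3927
```

In this example recall exceeds precision, so the recall-biased setting yields a higher F-score for the same output. When precision and recall are equal, both settings give the same value, so the size of the discrepancy depends on the system being evaluated.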

Methods — Paper and Code Review

  1. The review includes all EMNLP 2023 papers that compute Rouge scores. Papers and citation information are downloaded from the ACL Anthology.
  2. A preliminary identification of Rouge papers is conducted automatically by searching “rouge” across all full-text paper PDFs and excluding papers that do not match.
  3. Matching papers are reviewed manually to determine whether they compute Rouge scores. This includes Rouge scores computed but not reported, such as during model training. Papers not computing Rouge scores are excluded from the review. Remaining papers are included in the review.
  4. Remaining papers are first searched for in-text citations of Rouge packages. Papers with in-text package citations are labeled accordingly and the review of the paper concludes.
  5. Papers without in-text Rouge package citations are searched for links to code releases. Papers without code links are labeled as “unknown package” and the review of the paper concludes.
  6. Paper code releases are searched for references to Rouge, including README documents, repository issues and pull requests, standard code files, shell scripts, and package management files such as requirements.txt or environment.yml.
  7. Papers with code referencing a Rouge package are labeled accordingly. Papers whose code does not reference a Rouge package are labeled as “unknown package.” The review of the paper concludes.
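
The labeling procedure in steps 2 to 7 can be sketched as a small decision function. This is an illustrative sketch, not the review's actual tooling: the `Paper` record, its field names, and `label_paper` are hypothetical, the manual judgements are represented as pre-filled fields, and the case-insensitive match on “rouge” is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: the keyword search in step 2 is case-insensitive.
ROUGE_PATTERN = re.compile(r"rouge", re.IGNORECASE)


@dataclass
class Paper:
    """Hypothetical record holding the outcome of each review step."""
    full_text: str
    computes_rouge: bool = False          # step 3: manual judgement
    cited_package: Optional[str] = None   # step 4: package cited in the paper text
    has_code_link: bool = False           # step 5: paper links to a code release
    code_package: Optional[str] = None    # steps 6-7: package referenced in the code


def label_paper(paper: Paper) -> Optional[str]:
    """Return a package label, "unknown package", or None if the paper is excluded."""
    # Step 2: automatic keyword filter over the full text.
    if not ROUGE_PATTERN.search(paper.full_text):
        return None
    # Step 3: exclude papers that mention Rouge but do not compute it.
    if not paper.computes_rouge:
        return None
    # Step 4: an in-text package citation ends the review.
    if paper.cited_package:
        return paper.cited_package
    # Step 5: without a code link there is nothing further to search.
    if not paper.has_code_link:
        return "unknown package"
    # Steps 6-7: fall back to whatever the code release references.
    return paper.code_package or "unknown package"


example = Paper(
    full_text="We report ROUGE-L on the test set.",
    computes_rouge=True,
    has_code_link=True,
    code_package="GL/rougescore",
)
print(label_paper(example))  # GL/rougescore
```

Because in-text citations are checked before code releases, a paper that cites one package but ships code using another would be labeled by its citation under this sketch.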

Challenges and Limitations — Paper and Code Review