Widespread Errors in NLG Evaluation
If you’ve ever written a paper involving natural language generation, there’s a good chance you evaluated your model using the Rouge metric.
Yet, among the thousands of NLP papers evaluate using
Furthermore, many nonstandard ROUGE-1.5.5
implementation, yet unlike ROUGE-1.5.5, they have never been validated against human judgement. Prior work suggests that several thousand papers evaluate using incorrect Rouge packages.
Recommendations for Reviewers
Model evaluation is a critical component of empirical NLP research. For several years, the Responsible NLP Research Checklist has asked authors to cite software and parameters for important software packages and specifically identifies
- Reviewers should strongly recommend rejection for papers reporting or computing Rouge scores without providing an in-text Rouge software package citation.
- Reviewers should strongly recommend rejection for papers using a Rouge package other than the standard ROUGE-1.5.5 implementation, unless the paper contains an appendix section discussing the limitations of the package and the rationale for using it.
- Reviewers should strongly recommend rejection for papers reporting or computing Rouge scores without Rouge configuration parameters, such as stemming.
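To illustrate why a configuration parameter such as stemming needs to be reported, here is a minimal sketch of a unigram-overlap F1 score computed with and without stemming. It is a toy stand-in, not ROUGE-1.5.5 or any of the packages reviewed here: the names `toy_stem` and `rouge1_f1`, the whitespace tokenizer, and the suffix-stripping rule are all assumptions made for illustration.

```python
from collections import Counter


def toy_stem(token: str) -> str:
    """Strip a few common English suffixes (a stand-in, not the Porter stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def rouge1_f1(reference: str, candidate: str, stem: bool = False) -> float:
    """Unigram-overlap F1 between two lowercased, whitespace-tokenized strings."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if stem:
        ref_tokens = [toy_stem(t) for t in ref_tokens]
        cand_tokens = [toy_stem(t) for t in cand_tokens]
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "the cats were sleeping on the mats"
candidate = "the cat sleeps on the mat"

unstemmed = rouge1_f1(reference, candidate, stem=False)
stemmed = rouge1_f1(reference, candidate, stem=True)
print(f"without stemming: {unstemmed:.3f}")
print(f"with stemming:    {stemmed:.3f}")
```

On this sentence pair the same summary scores roughly twice as high once stemming is switched on, so a score reported without its configuration cannot be compared with one computed under a different setting.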
Evaluation Software Errors at EMNLP 2023
This review examines all papers computing
The results below are built automatically from the dataset file, available to download above. For more details, read the timeline, methods, and limitations sections below.
Correct Rouge Scores: EMNLP 2023 Papers
These papers appear to compute correct Rouge scores based on paper and code release review. Correct Rouge scores are computed by (1) the benchmark ROUGE-1.5.5 package, (2) an alternative Rouge package with Rouge scores identical to ROUGE-1.5.5, or (3) an alternative Rouge package specifically designed and used for non-English or multilingual evaluation.
EMNLP 2023 Main Proceedings
- Better Quality Pre-training Data and T5 Models for African Languages
  Authors: Akintunde Oladipo, Mofetoluwa Adeyemi, Orevaoghene Ahia, Abraham Owodunni, Odunayo Ogundepo, David Adelani, Jimmy Lin
  Package: Multilingual Rouge
- GEMINI: Controlling The Sentence-Level Summary Style in Abstractive Text Summarization
  Authors: Guangsheng Bao, Zebin Ou, Yue Zhang
  Package: Standard ROUGE-1.5.5 (custom wrapper)
- DisCo: Distilled Student Models Co-training for Semi-supervised Text Mining
  Authors: Weifeng Jiang, Qianren Mao, Chenghua Lin, Jianxin Li, Ting Deng, Weiyi Yang, Zheng Wang
  Package: Standard ROUGE-1.5.5
- MILDSum: A Novel Benchmark Dataset for Multilingual Summarization of Indian Legal Case Judgments
  Authors: Debtanu Datta, Shubham Soni, Rajdeep Mukherjee, Saptarshi Ghosh
  Package: Multilingual Rouge
Findings of the ACL: EMNLP 2023
- Understanding Translationese in Cross-Lingual Summarization
  Authors: Jiaan Wang, Fandong Meng, Yunlong Liang, Tingyi Zhang, Jiarong Xu, Zhixu Li, Jie Zhou
  Package: Multilingual Rouge
- Towards a Unified Framework for Reference Retrieval and Related Work Generation
  Authors: Zhengliang Shi, Shen Gao, Zhen Zhang, Xiuying Chen, Zhumin Chen, Pengjie Ren, Zhaochun Ren
  Package: Standard ROUGE-1.5.5
- Hierarchical Catalogue Generation for Literature Review: A Benchmark
  Authors: Kun Zhu, Xiaocheng Feng, Xiachong Feng, Yingsheng Wu, Bing Qin
  Package: Standard ROUGE-1.5.5
- mLongT5: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences
  Authors: David Uthus, Santiago Ontanon, Joshua Ainslie, Mandy Guo
  Package: Multilingual Rouge
- D2TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization
  Authors: Yunlong Liang, Fandong Meng, Jiaan Wang, Jinan Xu, Yufeng Chen, Jie Zhou
  Package: Multilingual Rouge
4th New Frontiers in Summarization Workshop
- Zero-Shot Cross-Lingual Summarization via Large Language Models
  Authors: Jiaan Wang, Yunlong Liang, Fandong Meng, Beiqi Zou, Zhixu Li, Jianfeng Qu, Jie Zhou
  Package: Multilingual Rouge
Unknown Rouge Scores: EMNLP 2023 Papers
At the time of review (see review timeline below), these papers are missing Rouge software package citations. Because each Rouge package computes different Rouge scores, it is unclear whether Rouge scores reported in these papers are correct or comparable with prior work.
Proceedings of CoNLL 2023
- Enhancing Code-mixed Text Generation Using Synthetic Data Filtering in Neural Machine Translation
  Authors: Dama Sravani, Radhika Mamidi
  Package: Unknown
- Towards Better Evaluation of Instruction-Following: A Case-Study in Summarization
  Authors: Ondrej Skopek, Rahul Aralikatte, Sian Gooding, Victor Carbune
  Package: Unknown
- JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models
  Authors: Yuiga Wada, Kanta Kaneda, Komei Sugiura
  Package: Unknown
- MuLER: Detailed and Scalable Reference-based Evaluation
  Authors: Taelin Karidi, Leshem Choshen, Gal Patel, Omri Abend
  Package: Unknown
EMNLP 2023 Main Proceedings
- Let’s Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought
  Authors: Vaishnavi Himakunthala, Andy Ouyang, Daniel Rose, Ryan He, Alex Mei, Yujie Lu, Chinmay Sonar, Michael Saxon, William Wang
  Package: Unknown
- GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP
  Authors: Md Tawkat Islam Khondaker, Abdul Waheed, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed
  Package: Unknown
- Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models
  Authors: Gangwoo Kim, Sungdong Kim, Byeongguk Jeon, Joonsuk Park, Jaewoo Kang
  Notes: Abstract Mentions Rouge Scores
  Package: Unknown
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver?
  Authors: Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, Diyi Yang
  Package: Unknown
- Location-Aware Visual Question Generation with Lightweight Models
  Authors: Nicholas Suwono, Justin Chen, Tun Hung, Ting-Hao Huang, I-Bin Liao, Yung-Hui Li, Lun-Wei Ku, Shao-Hua Sun
  Notes: Abstract Mentions Rouge Scores
  Package: Unknown
- A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding
  Authors: Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan Plummer, Kate Saenko, Jianmo Ni, Mandy Guo
  Package: Unknown
- OpenAsp: A Benchmark for Multi-document Open Aspect-based Summarization
  Authors: Shmuel Amar, Liat Schiff, Ori Ernst, Asi Shefer, Ori Shapira, Ido Dagan
  Package: Unknown
- TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models
  Authors: Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, Idan Szpektor
  Package: Unknown
- Promoting Topic Coherence and Inter-Document Consorts in Multi-Document Summarization via Simplicial Complex and Sheaf Graph
  Authors: Yash Atri, Arun Iyer, Tanmoy Chakraborty, Vikram Goyal
  Package: Unknown
- TempTabQA: Temporal Question Answering for Semi-Structured Tables
  Authors: Vivek Gupta, Pranshu Kandoi, Mahek Vora, Shuo Zhang, Yujie He, Ridho Reinanda, Vivek Srikumar
  Package: Unknown
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
  Authors: Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu
  Notes: Abstract Mentions Rouge Scores
  Package: Unknown
- Learning Retrieval Augmentation for Personalized Dialogue Generation
  Authors: Qiushi Huang, Shuai Fu, Xubo Liu, Wenwu Wang, Tom Ko, Yu Zhang, Lilian Tang
  Package: Unknown
- Indicative Summarization of Long Discussions
  Authors: Shahbaz Syed, Dominik Schwabe, Khalid Khatib, Martin Potthast
  Package: Unknown
- Evaluating Large Language Models on Controlled Generation Tasks
  Authors: Jiao Sun, Yufei Tian, Wangchunshu Zhou, Nan Xu, Qian Hu, Rahul Gupta, John Wieting, Nanyun Peng, Xuezhe Ma
  Package: Unknown
- CS2W: A Chinese Spoken-to-Written Style Conversion Dataset with Multiple Conversion Types
  Authors: Zishan Guo, Linhao Yu, Minghui Xu, Renren Jin, Deyi Xiong
  Package: Unknown
- Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation
  Authors: Da Yin, Xiao Liu, Fan Yin, Ming Zhong, Hritik Bansal, Jiawei Han, Kai-Wei Chang
  Package: Unknown
- Prompting Large Language Models with Chain-of-Thought for Few-Shot Knowledge Base Question Generation
  Authors: Yuanyuan Liang, Jianing Wang, Hanlun Zhu, Lei Wang, Weining Qian, Yunshi Lan
  Notes: Abstract Mentions Rouge Scores
  Package: Unknown
- Interactive Text Generation
  Authors: Felix Faltings, Michel Galley, Kianté Brantley, Baolin Peng, Weixin Cai, Yizhe Zhang, Jianfeng Gao, Bill Dolan
  Package: Unknown
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  Authors: Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, Sumit Sanghai
  Package: Unknown
- QUDeval: The Evaluation of Questions Under Discussion Discourse Parsing
  Authors: Yating Wu, Ritika Mangla, Greg Durrett, Junyi Li
  Package: Unknown
- EntSUMv2: Dataset, Models and Evaluation for More Abstractive Entity-Centric Summarization
  Authors: Dhruv Mehra, Lingjue Xie, Ella Hofmann-Coyle, Mayank Kulkarni, Daniel Preotiuc-Pietro
  Package: Unknown
- MediaHG: Rethinking Eye-catchy Features in Social Media Headline Generation
  Authors: Boning Zhang, Yang Yang
  Notes: Abstract Mentions Rouge Scores
  Package: Unknown
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction
  Authors: Ji Qi, Chuchun Zhang, Xiaozhi Wang, Kaisheng Zeng, Jifan Yu, Jinxin Liu, Lei Hou, Juanzi Li, Xu Bin
  Notes: Received Paper Award
  Package: Unknown
- Answering Questions by Meta-Reasoning over Multiple Chains of Thought
  Authors: Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, Jonathan Berant
  Package: Unknown
- Abstractive Open Information Extraction
  Authors: Kevin Pei, Ishan Jindal, Kevin Chang
  Package: Unknown
- ReTAG: Reasoning Aware Table to Analytic Text Generation
  Authors: Deepanway Ghosal, Preksha Nema, Aravindan Raghuveer
  Package: Unknown
- Evaluation of African American Language Bias in Natural Language Generation
  Authors: Nicholas Deas, Jessica Grieser, Shana Kleiner, Desmond Patton, Elsbeth Turcan, Kathleen McKeown
  Package: Unknown
- Beat LLMs at Their Own Game: Zero-Shot LLM-Generated Text Detection via Querying ChatGPT
  Authors: Biru Zhu, Lifan Yuan, Ganqu Cui, Yangyi Chen, Chong Fu, Bingxiang He, Yangdong Deng, Zhiyuan Liu, Maosong Sun, Ming Gu
  Package: Unknown
- PreWoMe: Exploiting Presuppositions as Working Memory for Long Form Question Answering
  Authors: Wookje Han, Jinsol Park, Kyungjae Lee
  Package: Unknown
- DecipherPref: Analyzing Influential Factors in Human Preference Judgments via GPT-4
  Authors: Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Hassan Foroosh, Fei Liu
  Package: Unknown
- Gender Biases in Automatic Evaluation Metrics for Image Captioning
  Authors: Haoyi Qiu, Zi-Yi Dou, Tianlu Wang, Asli Celikyilmaz, Nanyun Peng
  Package: Unknown
- SOUL: Towards Sentiment and Opinion Understanding of Language
  Authors: Yue Deng, Wenxuan Zhang, Sinno Pan, Lidong Bing
  Package: Unknown
- MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation
  Authors: Zexue He, Yu Wang, An Yan, Yao Liu, Eric Chang, Amilcare Gentili, Julian McAuley, Chun-Nan Hsu
  Package: Unknown
- ORCHID: A Chinese Debate Corpus for Target-Independent Stance Detection and Argumentative Dialogue Summarization
  Authors: Xiutian Zhao, Ke Wang, Wei Peng
  Package: Unknown
- SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation
  Authors: Elizabeth Clark, Shruti Rijhwani, Sebastian Gehrmann, Joshua Maynez, Roee Aharoni, Vitaly Nikolaev, Thibault Sellam, Aditya Siddhant, Dipanjan Das, Ankur Parikh
  Package: Unknown
- A Video Is Worth 4096 Tokens: Verbalize Story Videos To Understand Them In Zero Shot
  Authors: Aanisha Bhattacharyya, Yaman Singla, Balaji Krishnamurthy, Rajiv Shah, Changyou Chen
  Package: Unknown
- Active Learning for Natural Language Generation
  Authors: Yotam Perlitz, Ariel Gera, Michal Shmueli-Scheuer, Dafna Sheinwald, Noam Slonim, Liat Ein-Dor
  Package: Unknown
- Reducing Sequence Length by Predicting Edit Spans with Large Language Models
  Authors: Masahiro Kaneko, Naoaki Okazaki
  Package: Unknown
- Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory
  Authors: Ziang Xiao, Susu Zhang, Vivian Lai, Q. Vera Liao
  Package: Unknown
- We Are What We Repeatedly Do: Inducing and Deploying Habitual Schemas in Persona-Based Responses
  Authors: Benjamin Kane, Lenhart Schubert
  Package: Unknown
- Countering Misinformation via Emotional Response Generation
  Authors: Daniel Russo, Shane Kaszefski-Yaschuk, Jacopo Staiano, Marco Guerini
  Package: Unknown
- Models See Hallucinations: Evaluating the Factuality in Video Captioning
  Authors: Hui Liu, Xiaojun Wan
  Package: Unknown
- Select, Prompt, Filter: Distilling Large Language Models for Summarizing Conversations
  Authors: Minh-Quang Pham, Sathish Indurthi, Shamil Chollampatt, Marco Turchi
  Notes: Abstract Mentions Rouge Scores
  Package: Unknown
- Impressions: Visual Semiotics and Aesthetic Impact Understanding
  Authors: Julia Kruk, Caleb Ziems, Diyi Yang
  Package: Unknown
- AutoTrial: Prompting Language Models for Clinical Trial Design
  Authors: Zifeng Wang, Cao Xiao, Jimeng Sun
  Package: Unknown
- Multi-Source Multi-Type Knowledge Exploration and Exploitation for Dialogue Generation
  Authors: Xuanfan Ni, Hongliang Dai, Zhaochun Ren, Piji Li
  Package: Unknown
- Context Compression for Auto-regressive Transformers with Sentinel Tokens
  Authors: Siyu Ren, Qi Jia, Kenny Zhu
  Package: Unknown
- Reconstruct Before Summarize: An Efficient Two-Step Framework for Condensing and Summarizing Meeting Transcripts
  Authors: Haochen Tan, Han Wu, Wei Shao, Xinyun Zhang, Mingjie Zhan, Zhaohui Hou, Ding Liang, Linqi Song
  Package: Unknown
- MaNtLE: Model-agnostic Natural Language Explainer
  Authors: Rakesh Menon, Kerem Zaman, Shashank Srivastava
  Package: Unknown
- PTP: Boosting Stability and Performance of Prompt Tuning with Perturbation-Based Regularizer
  Authors: Lichang Chen, Jiuhai Chen, Heng Huang, Minhao Cheng
  Package: Unknown
- CLAIR: Evaluating Image Captions with Large Language Models
  Authors: David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Darrell, John Canny
  Package: Unknown
- q2d: Turning Questions into Dialogs to Teach Models How to Search
  Authors: Yonatan Bitton, Shlomi Cohen-Ganor, Ido Hakimi, Yoad Lewenberg, Roee Aharoni, Enav Weinreb
  Package: Unknown
- You Told Me That Joke Twice: A Systematic Investigation of Transferability and Robustness of Humor Detection Models
  Authors: Alexander Baranov, Vladimir Kniazhevsky, Pavel Braslavski
  Package: Unknown
- IEKG: A Commonsense Knowledge Graph for Idiomatic Expressions
  Authors: Ziheng Zeng, Kellen Cheng, Srihari Nanniyur, Jianing Zhou, Suma Bhat
  Package: Unknown
- Exploring the Boundaries of GPT-4 in Radiology
  Authors: Qianchu Liu, Stephanie Hyland, Shruthi Bannur, Kenza Bouzid, Daniel Castro, Maria Wetscherek, Robert Tinn, Harshita Sharma, Fernando Pérez-García, Anton Schwaighofer, Pranav Rajpurkar, Sameer Khanna, Hoifung Poon, Naoto Usuyama, Anja Thieme, Aditya Nori, Matthew Lungren, Ozan Oktay, Javier Alvarez-Valle
  Package: Unknown
- Self-Ensemble of N-best Generation Hypotheses by Lexically Constrained Decoding
  Authors: Ryota Miyano, Tomoyuki Kajiwara, Yuki Arase
  Package: Unknown
- R2H: Building Multimodal Navigation Helpers that Respond to Help Requests
  Authors: Yue Fan, Jing Gu, Kaizhi Zheng, Xin Wang
  Package: Unknown
- Unveiling the Essence of Poetry: Introducing a Comprehensive Dataset and Benchmark for Poem Summarization
  Authors: Ridwan Mahbub, Ifrad Khan, Samiha Anuva, Md Shahriar, Md Tahmid Rahman Laskar, Sabbir Ahmed
  Package: Unknown
- Prompting with Pseudo-Code Instructions
  Authors: Mayank Mishra, Prince Kumar, Riyaz Bhat, Rudra Murthy, Danish Contractor, Srikanth Tamilselvam
  Notes: Abstract Mentions Rouge Scores
  Package: Unknown
- Reinforcement Replaces Supervision: Query focused Summarization using Deep Reinforcement Learning
  Authors: Swaroop Nath, Pushpak Bhattacharyya, Harshad Khadilkar
  Notes: Abstract Mentions Rouge Scores
  Package: Unknown
- Graph vs. Sequence: An Empirical Study on Knowledge Forms for Knowledge-Grounded Dialogue
  Authors: Yizhe Yang, Heyan Huang, Yuhang Liu, Yang Gao
  Package: Unknown
- Exploring Distributional Shifts in Large Language Models for Code Analysis
  Authors: Shushan Arakelyan, Rocktim Das, Yi Mao, Xiang Ren
  Package: Unknown
- Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation
  Authors: Yixin Liu, Alexander Fabbri, Yilun Zhao, Pengfei Liu, Shafiq Joty, Chien-Sheng Wu, Caiming Xiong, Dragomir Radev
  Package: Unknown
- ALCAP: Alignment-Augmented Music Captioner
  Authors: Zihao He, Weituo Hao, Wei-Tsung Lu, Changyou Chen, Kristina Lerman, Xuchen Song
  Package: Unknown
- Speech Recognition and Meaning Interpretation: Towards Disambiguation of Structurally Ambiguous Spoken Utterances in Indonesian
  Authors: Ruhiyah Widiaputri, Ayu Purwarianti, Dessi Lestari, Kurniawati Azizah, Dipta Tanaya, Sakriani Sakti
  Package: Unknown
Findings of the ACL: EMNLP 2023
- Multi Document Summarization Evaluation in the Presence of Damaging Content
  Authors: Avshalom Manevich, David Carmel, Nachshon Cohen, Elad Kravi, Ori Shapira
  Package: Unknown
- Follow-on Question Suggestion via Voice Hints for Voice Assistants
  Authors: Besnik Fetahu, Pedro Faustini, Anjie Fang, Giuseppe Castellucci, Oleg Rokhlenko, Shervin Malmasi
  Package: Unknown
- Dolphin: A Challenging and Diverse Benchmark for Arabic NLG
  Authors: El Moatez Billah Nagoudi, AbdelRahim Elmadany, Ahmed El-Shangiti, Muhammad Abdul-Mageed
  Package: Unknown
- TaTA: A Multilingual Table-to-Text Dataset for African Languages
  Authors: Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Jan Botha, Michael Chavinda, Ankur Parikh, Clara Rivera
  Package: Unknown
- Towards Mitigating LLM Hallucination via Self Reflection
  Authors: Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, Pascale Fung
  Package: Unknown
- ExplainCPE: A Free-text Explanation Benchmark of Chinese Pharmacist Examination
  Authors: Dongfang Li, Jindi Yu, Baotian Hu, Zhenran Xu, Min Zhang
  Package: Unknown
- Multi-Granularity Information Interaction Framework for Incomplete Utterance Rewriting
  Authors: Haowei Du, Dinghao Zhang, Chen Li, Yang Li, Dongyan Zhao
  Package: Unknown
- Accuracy is not enough: Evaluating Personalization in Summarizers
  Authors: Rahul Vansh, Darsh Rank, Sourish Dasgupta, Tanmoy Chakraborty
  Notes: Abstract Mentions Rouge Scores
  Package: Unknown
- MaXM: Towards Multilingual Visual Question Answering
  Authors: Soravit Changpinyo, Linting Xue, Michal Yarom, Ashish Thapliyal, Idan Szpektor, Julien Amelot, Xi Chen, Radu Soricut
  Package: Unknown
- Understanding HTML with Large Language Models
  Authors: Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, Aleksandra Faust
  Package: Unknown
- Can you Summarize my learnings? Towards Perspective-based Educational Dialogue Summarization
  Authors: Raghav Jain, Tulika Saha, Jhagrut Lalwani, Sriparna Saha
  Package: Unknown
- Towards Informative Open-ended Text Generation with Dynamic Knowledge Triples
  Authors: Zixuan Ren, Yang Zhao, Chengqing Zong
  Package: Unknown
- Ask Language Model to Clean Your Noisy Translation Data
  Authors: Quinten Bolding, Baohao Liao, Brandon Denis, Jun Luo, Christof Monz
  Package: Unknown
- Multi-User MultiWOZ: Task-Oriented Dialogues among Multiple Users
  Authors: Yohan Jo, Xinyan Zhao, Arijit Biswas, Nikoletta Basiou, Vincent Auvray, Nikolaos Malandrakis, Angeliki Metallinou, Alexandros Potamianos
  Package: Unknown
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization
  Authors: Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You, Lidong Bing
  Package: Unknown
- Mind the Gap: Automated Corpus Creation for Enthymeme Detection and Reconstruction in Learner Arguments
  Authors: Maja Stahl, Nick Düsterhus, Mei-Hua Chen, Henning Wachsmuth
  Package: Unknown
- The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
  Authors: Dung Nguyen, Le Nam, Anh Dau, Anh Nguyen, Khanh Nghiem, Jin Guo, Nghi Bui
  Package: Unknown
- INVITE: a Testbed of Automatically Generated Invalid Questions to Evaluate Large Language Models for Hallucinations
  Authors: Anil Ramakrishna, Rahul Gupta, Jens Lehmann, Morteza Ziyadi
  Package: Unknown
- Enhancing Conversational Search: Large Language Model-Aided Informative Query Rewriting
  Authors: Fanghua Ye, Meng Fang, Shenghui Li, Emine Yilmaz
  Package: Unknown
- Leveraging Structured Information for Explainable Multi-hop Question Answering and Reasoning
  Authors: Ruosen Li, Xinya Du
  Package: Unknown
- TRIP: Accelerating Document-level Multilingual Pre-training via Triangular Document-level Pre-training on Parallel Data Triplets
  Authors: Hongyuan Lu, Haoyang Huang, Shuming Ma, Dongdong Zhang, Wai Lam, Zhaochuan Gao, Anthony Aue, Arul Menezes, Furu Wei
  Notes: Abstract Mentions Rouge Scores
  Package: Unknown
- Beyond Candidates: Adaptive Dialogue Agent Utilizing Persona and Knowledge
  Authors: Jungwoo Lim, Myunghoon Kang, Jinsung Kim, Jeongwook Kim, Yuna Hur, Heuiseok Lim
  Package: Unknown
- ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding
  Authors: Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, Omer Levy
  Package: Unknown
- Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model
  Authors: Yinghan Long, Sayeed Chowdhury, Kaushik Roy
  Notes: Abstract Mentions Rouge Scores
  Package: Unknown
- Large Language Models Meet Harry Potter: A Dataset for Aligning Dialogue Agents with Characters
  Authors: Nuo Chen, Yan Wang, Haiyun Jiang, Deng Cai, Yuhan Li, Ziyang Chen, Longyue Wang, Jia Li
  Package: Unknown
- Citance-Contextualized Summarization of Scientific Papers
  Authors: Shahbaz Syed, Ahmad Hakimi, Khalid Al-Khatib, Martin Potthast
  Package: Unknown
- A Rewriting Approach for Gender Inclusivity in Portuguese
  Authors: Leonor Veloso, Luisa Coheur, Rui Ribeiro
  Package: Unknown
- LEGO: A Multi-agent Collaborative Framework with Role-playing and Iterative Feedback for Causality Explanation Generation
  Authors: Zhitao He, Pengfei Cao, Yubo Chen, Kang Liu, Ruopeng Li, Mengshu Sun, Jun Zhao
  Package: Unknown
- CITB: A Benchmark for Continual Instruction Tuning
  Authors: Zihan Zhang, Meng Fang, Ling Chen, Mohammad-Reza Namazi-Rad
  Package: Unknown
- Large Language Models as Source Planner for Personalized Knowledge-grounded Dialogues
  Authors: Hongru Wang, Minda Hu, Yang Deng, Rui Wang, Fei Mi, Weichao Wang, Yasheng Wang, Wai-Chung Kwan, Irwin King, Kam-Fai Wong
  Package: Unknown
- LLM aided semi-supervision for efficient Extractive Dialog Summarization
  Authors: Nishant Mishra, Gaurav Sahu, Iacer Calixto, Ameen Abu-Hanna, Issam Laradji
  Notes: Abstract Mentions Rouge Scores
  Package: Unknown
- Exploring In-Context Learning for Knowledge Grounded Dialog Generation
  Authors: Qinyu Chen, Wenhao Wu, Sujian Li
  Notes: Abstract Mentions Rouge Scores
  Package: Unknown
- InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector through Instruction Tuning
  Authors: Zhexin Zhang, Jiale Cheng, Hao Sun, Jiawen Deng, Minlie Huang
  Package: Unknown
- SummIt: Iterative Text Summarization via ChatGPT
  Authors: Haopeng Zhang, Xiao Liu, Jiawei Zhang
  Package: Unknown
- HuatuoGPT, Towards Taming Language Model to Be a Doctor
  Authors: Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Guiming Chen, Jianquan Li, Xiangbo Wu, Zhang Zhiyi, Qingying Xiao, Xiang Wan, Benyou Wang, Haizhou Li
  Package: Unknown
- Diffusion Language Model with Query-Document Relevance for Query-Focused Summarization
  Authors: Shaoyao Huang, Luozheng Qin, Ziqiang Cao
  Notes: Abstract Mentions Rouge Scores
  Package: Unknown
- TokenDrop + BucketSampler: Towards Efficient Padding-free Fine-tuning of Language Models
  Authors: Amrit Nagarajan, Anand Raghunathan
  Package: Unknown
- Using In-Context Learning to Improve Dialogue Safety
  Authors: Nicholas Meade, Spandana Gella, Devamanyu Hazarika, Prakhar Gupta, Di Jin, Siva Reddy, Yang Liu, Dilek Hakkani-Tur
  Package: Unknown
- Improving Consistency for Text Summarization with Energy Functions
  Authors: Qi Zeng, Qingyu Yin, Zheng Li, Yifan Gao, Sreyashi Nag, Zhengyang Wang, Bing Yin, Heng Ji, Chao Zhang
  Package: Unknown
- PR-MCS: Perturbation Robust Metric for MultiLingual Image Captioning
  Authors: Yongil Kim, Yerin Hwang, Hyeongu Yun, Seunghyun Yoon, Trung Bui, Kyomin Jung
  Package: Unknown
- LLMs – the Good, the Bad or the Indispensable?: A Use Case on Legal Statute Prediction and Legal Judgment Prediction on Indian Court Cases
  Authors: Shaurya Vats, Atharva Zope, Somsubhra De, Anurag Sharma, Upal Bhattacharya, Shubham Nigam, Shouvik Guha, Koustav Rudra, Kripabandhu Ghosh
  Package: Unknown
- ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning
  Authors: Viet Lai, Nghia Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, Thien Nguyen
  Package: Unknown
- PARROT: Zero-Shot Narrative Reading Comprehension via Parallel Reading
  Authors: Chao Zhao, Anvesh Vijjini, Snigdha Chaturvedi
  Package: Unknown
- Synthesize, if you do not have: Effective Synthetic Dataset Creation Strategies for Self-Supervised Opinion Summarization in E-commerce
  Authors: Tejpalsingh Siledar, Suman Banerjee, Amey Patil, Sudhanshu Singh, Muthusamy Chelliah, Nikesh Garera, Pushpak Bhattacharyya
  Package: Unknown
- InheritSumm: A General, Versatile and Compact Summarizer by Distilling from GPT
  Authors: Yichong Xu, Ruochen Xu, Dan Iter, Yang Liu, Shuohang Wang, Chenguang Zhu, Michael Zeng
  Package: Unknown
- DialGuide: Aligning Dialogue Model Behavior with Developer Guidelines
  Authors: Prakhar Gupta, Yang Liu, Di Jin, Behnam Hedayatnia, Spandana Gella, Sijia Liu, Patrick Lange, Julia Hirschberg, Dilek Hakkani-Tur
  Package: Unknown
- Evaluating Subjective Cognitive Appraisals of Emotions from Large Language Models
  Authors: Hongli Zhan, Desmond Ong, Junyi Li
  Package: Unknown
- 1-PAGER: One Pass Answer Generation and Evidence Retrieval
  Authors: Palak Jain, Livio Soares, Tom Kwiatkowski
  Package: Unknown
- LMGQS: A Large-scale Dataset for Query-focused Summarization
  Authors: Ruochen Xu, Song Wang, Yang Liu, Shuohang Wang, Yichong Xu, Dan Iter, Pengcheng He, Chenguang Zhu, Michael Zeng
  Package: Unknown
- Extrapolating Multilingual Understanding Models as Multilingual Generators
  Authors: Bohong Wu, Fei Yuan, Hai Zhao, Lei Li, Jingjing Xu
  Notes: Abstract Mentions Rouge Scores
  Package: Unknown
3rd Workshop on Multi-lingual Representation Learning
- Generating Continuations in Multilingual Idiomatic Contexts
  Authors: Rhitabrat Pokharel, Ameeta Agrawal
  Package: Unknown
4th New Frontiers in Summarization Workshop
- Is ChatGPT a Good NLG Evaluator? A Preliminary Study
  Authors: Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, Jie Zhou
  Package: Unknown
- In-context Learning of Large Language Models for Controlled Dialogue Summarization: A Holistic Benchmark and Empirical Analysis
  Authors: Yuting Tang, Ratish Puduppully, Zhengyuan Liu, Nancy Chen
  Package: Unknown
- From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting
  Authors: Griffin Adams, Alex Fabbri, Faisal Ladhak, Eric Lehman, Noémie Elhadad
  Package: Unknown
- Generating Extractive and Abstractive Summaries in Parallel from Scientific Articles Incorporating Citing Statements
  Authors: Sudipta Singha Roy, Robert E. Mercer
  Package: Unknown
- Analyzing Multi-Sentence Aggregation in Abstractive Summarization via the Shapley Value
  Authors: Jingyi He, Meng Cao, Jackie Chi Kit Cheung
  Package: Unknown
Natural Legal Language Processing Workshop 2023
- Questions about Contracts: Prompt Templates for Structured Answer Generation
  Authors: Adam Roegiest, Radha Chitta, Jonathan Donnelly, Maya Lash, Alexandra Vtyurina, Francois Longtin
  Package: Unknown
3rd Workshop for NLP Open Source Software
- The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
  Authors: Dung Nguyen Manh, Nam Le Hai, Anh T. V. Dau, Anh Minh Nguyen, Khanh Nghiem, Jin Guo, Nghi D. Q. Bui
  Package: Unknown
Incorrect Rouge Scores: EMNLP 2023 Papers
These papers or their code releases reference Rouge software packages that compute incorrect Rouge scores because of implementation errors. Incorrect Rouge scores differ from those of the official ROUGE-1.5.5 reference implementation of Rouge. See the packages section for more detail.
Proceedings of CoNLL 2023
- ArchBERT: Bi-Modal Understanding of Neural Architectures and Natural Languages
  Authors: Mohammad Akbari, Saeed Ranjbar Alvar, Behnam Kamranian, Amin Banitalebi-Dehkordi, Yong Zhang
  Package: LA/torchmetrics
EMNLP 2023 Main Proceedings
- Reading Books is Great, But Not if You Are Driving! Visually Grounded Reasoning about Defeasible Commonsense Norms
  Authors: Seungju Han, Junhyeok Kim, Jack Hessel, Liwei Jiang, Jiwan Chung, Yejin Son, Yejin Choi, Youngjae Yu
  Package: GL/rougescore
- BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations
  Authors: Qizhi Pei, Wei Zhang, Jinhua Zhu, Kehan Wu, Kaiyuan Gao, Lijun Wu, Yingce Xia, Rui Yan
  Package: GL/rougescore
- MemeCap: A Dataset for Captioning and Interpreting Memes
  Authors: EunJeong Hwang, Vered Shwartz
  Package: MS/rouge
- Fast and Accurate Factual Inconsistency Detection Over Long Documents
  Authors: Barrett Lattimer, Patrick Chen, Xinyuan Zhang, Yi Yang
  Package: GL/rougescore, PT/rouge
- Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks
  Authors: Po-Nien Kung, Fan Yin, Di Wu, Kai-Wei Chang, Nanyun Peng
  Package: GL/rougescore
- API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
  Authors: Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li
  Package: PT/rouge
- Lion: Adversarial Distillation of Proprietary Large Language Models
  Authors: Yuxin Jiang, Chunkit Chan, Mingyang Chen, Wei Wang
  Package: GL/rougescore
- Lost in Translation, Found in Spans: Identifying Claims in Multilingual Social Media
  Authors: Shubham Mittal, Megha Sundriyal, Preslav Nakov
  Package: PT/rouge
- Investigating Efficiently Extending Transformers for Long Input Summarization
  Authors: Jason Phang, Yao Zhao, Peter Liu
  Package: GL/rougescore
- mRedditSum: A Multimodal Abstractive Summarization Dataset of Reddit Threads with Images
  Authors: Keighley Overbay, Jaewoo Ahn, Fatemeh Pesaran zadeh, Joonsuk Park, Gunhee Kim
  Package: PT/rouge
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
  Authors: Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun
  Package: GL/rougescore
- Towards Interpretable Mental Health Analysis with Large Language Models
  Authors: Kailai Yang, Shaoxiong Ji, Tianlin Zhang, Qianqian Xie, Ziyan Kuang, Sophia Ananiadou
  Package: GL/rougescore
- Modeling Empathic Similarity in Personal Narratives
  Authors: Jocelyn Shen, Maarten Sap, Pedro Colon-Hernandez, Hae Park, Cynthia Breazeal
  Package: PT/rouge
- Enabling Large Language Models to Generate Text with Citations
  Authors: Tianyu Gao, Howard Yen, Jiatong Yu, Danqi Chen
  Package: GL/rougescore
- A Systematic Study of Performance Disparities in Multilingual Task-Oriented Dialogue Systems
  Authors: Songbo Hu, Han Zhou, Moy Yuan, Milan Gritta, Guchun Zhang, Ignacio Iacobacci, Anna Korhonen, Ivan Vulić
  Package: GL/rougescore
- CiteBench: A Benchmark for Scientific Citation Text Generation
  Authors: Martin Funkquist, Ilia Kuznetsov, Yufang Hou, Iryna Gurevych
  Package: GL/rougescore
- Instructive Dialogue Summarization with Query Aggregations
Authors: Bin Wang, Zhengyuan Liu, Nancy ChenPackage: DI/pyrouge
,GL/rougescore
- Enhancing Biomedical Lay Summarisation with External Knowledge Graphs
Authors: Tomas Goldsack, Zhihao Zhang, Chen Tang, Carolina Scarton, Chenghua Lin
Package: GL/rougescore
- Background Summarization of Event Timelines
Authors: Adithya Pratapa, Kevin Small, Markus Dreyer
Notes: Received Paper Award
Package: GL/rougescore
- trlX: A Framework for Large Scale Reinforcement Learning from Human Feedback
Authors: Alexander Havrilla, Maksym Zhuravinskyi, Duy Phung, Aman Tiwari, Jonathan Tow, Stella Biderman, Quentin Anthony, Louis Castricato
Package: GL/rougescore
- Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks
Authors: Andrea Sottana, Bin Liang, Kai Zou, Zheng Yuan
Package: GL/rougescore
- Detecting and Mitigating Hallucinations in Multilingual Summarisation
Authors: Yifu Qiu, Yftah Ziser, Anna Korhonen, Edoardo Ponti, Shay Cohen
Package: GL/rougescore, PT/rouge
- Revisiting Instruction Fine-tuned Model Evaluation to Guide Industrial Applications
Authors: Manuel Faysse, Gautier Viaud, Céline Hudelot, Pierre Colombo
Package: GL/rougescore
- Instruct and Extract: Instruction Tuning for On-Demand Information Extraction
Authors: Yizhu Jiao, Ming Zhong, Sha Li, Ruining Zhao, Siru Ouyang, Heng Ji, Jiawei Han
Package: GL/rougescore
- ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness
Authors: Archiki Prasad, Swarnadeep Saha, Xiang Zhou, Mohit Bansal
Package: GL/rougescore
- Contrastive Learning for Inference in Dialogue
Authors: Etsuko Ishii, Yan Xu, Bryan Wilie, Ziwei Ji, Holy Lovenia, Willy Chung, Pascale Fung
Package: MS/rouge
- Paraphrase Types for Generation and Detection
Authors: Jan Wahle, Bela Gipp, Terry Ruas
Package: PT/rouge
- Hallucination Mitigation in Natural Language Generation from Large-Scale Open-Domain Knowledge Graphs
Authors: Xiao Shi, Zhengyuan Zhu, Zeyu Zhang, Chengkai Li
Package: GL/rougescore
- Multilingual Large Language Models Are Not (Yet) Code-Switchers
Authors: Ruochen Zhang, Samuel Cahyawijaya, Jan Christian Blaise Cruz, Genta Winata, Alham Aji
Package: GL/rougescore
- KCTS: Knowledge-Constrained Tree Search Decoding with Token-Level Hallucination Detection
Authors: Sehyun Choi, Tianqing Fang, Zhaowei Wang, Yangqiu Song
Package: GL/rougescore
- CP-BCS: Binary Code Summarization Guided by Control Flow Graph and Pseudo Code
Authors: Tong Ye, Lingfei Wu, Tengfei Ma, Xuhong Zhang, Yangkai Du, Peiyu Liu, Shouling Ji, Wenhai Wang
Package: MS/rouge
- Length Does Matter: Summary Length can Bias Summarization Metrics
Authors: Xiaobo Guo, Soroush Vosoughi
Package: BZ/pyrouge
- Argue with Me Tersely: Towards Sentence-Level Counter-Argument Generation
Authors: Jiayu Lin, Rong Ye, Meng Han, Qi Zhang, Ruofei Lai, Xinyu Zhang, Zhao Cao, Xuanjing Huang, Zhongyu Wei
Package: PT/rouge
Findings of the ACL: EMNLP 2023
- DocAsRef: An Empirical Study on Repurposing Reference-based Summary Quality Metrics as Reference-free Metrics
Authors: Forrest Bao, Ruixuan Tu, Ge Luo, Yinfei Yang, Hebi Li, Minghui Qiu, Youbiao He, Cen Chen
Package: GL/rougescore
- Execution-Based Evaluation for Open-Domain Code Generation
Authors: Zhiruo Wang, Shuyan Zhou, Daniel Fried, Graham Neubig
Package: GL/rougescore
- Improving the Robustness of Summarization Models by Detecting and Removing Input Noise
Authors: Kundan Krishna, Yao Zhao, Jie Ren, Balaji Lakshminarayanan, Jiaming Luo, Mohammad Saleh, Peter Liu
Notes: Abstract Mentions Rouge Scores
Package: GL/rougescore
- Extractive Summarization via ChatGPT for Faithful Summary Generation
Authors: Haopeng Zhang, Xiao Liu, Jiawei Zhang
Notes: Abstract Mentions Rouge Scores
Package: GL/rougescore
- InstructExcel: A Benchmark for Natural Language Instruction in Excel
Authors: Justin Payan, Swaroop Mishra, Mukul Singh, Carina Negreanu, Christian Poelitz, Chitta Baral, Subhro Roy, Rasika Chakravarthy, Benjamin Van Durme, Elnaz Nouri
Package: GL/rougescore
- Multi-step Jailbreaking Privacy Attacks on ChatGPT
Authors: Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, Yangqiu Song
Package: GL/rougescore
- FREDSum: A Dialogue Summarization Corpus for French Political Debates
Authors: Virgile Rennard, Guokan Shang, Damien Grari, Julie Hunter, Michalis Vazirgiannis
Package: GL/rougescore
- Frugal Prompting for Dialog Models
Authors: Bishal Santra, Sakya Basak, Abhinandan De, Manish Gupta, Pawan Goyal
Package: GL/rougescore
- The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation
Authors: Mutian He, Philip Garner
Package: GL/rougescore
- Is ChatGPT a Good Multi-Party Conversation Solver?
Authors: Chao-Hong Tan, Jia-Chen Gu, Zhen-Hua Ling
Package: GL/rougescore
- Bipartite Graph Pre-training for Unsupervised Extractive Summarization with Graph Convolutional Auto-Encoders
Authors: Qianren Mao, Shaobo Zhao, Jiarui Li, Xiaolei Gu, Shizhu He, Bo Li, Jianxin Li
Package: BZ/pyrouge
- Adapting Pretrained Text-to-Text Models for Long Text Sequences
Authors: Wenhan Xiong, Anchit Gupta, Shubham Toshniwal, Yashar Mehdad, Scott Yih
Package: PT/files2rouge
- Large-Scale and Multi-Perspective Opinion Summarization with Diverse Review Subsets
Authors: Han Jiang, Rui Wang, Zhihua Wei, Yu Li, Xinpeng Wang
Package: PT/files2rouge
- Topic-Informed Dialogue Summarization using Topic Distribution and Prompt-based Modeling
Authors: Jaeah You, Youngjoong Ko
Notes: Abstract Mentions Rouge Scores
Package: DI/pyrouge
- A Hierarchical Encoding-Decoding Scheme for Abstractive Multi-document Summarization
Authors: Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You, Lidong Bing
Notes: Abstract Mentions Rouge Scores
Package: GL/rougescore
- NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models
Authors: Jongwoo Ko, Seungjoon Park, Yujin Kim, Sumyeong Ahn, Du-Seong Chang, Euijai Ahn, Se-Young Yun
Package: GL/rougescore
- Inverse Reinforcement Learning for Text Summarization
Authors: Yu Fu, Deyi Xiong, Yue Dong
Notes: Abstract Mentions Rouge Scores
Package: GL/rougescore
- From Chaos to Clarity: Claim Normalization to Empower Fact-Checking
Authors: Megha Sundriyal, Tanmoy Chakraborty, Preslav Nakov
Package: DI/pyrouge
- Lexical Repetitions Lead to Rote Learning: Unveiling the Impact of Lexical Overlap in Train and Test Reference Summaries
Authors: Prafulla Choubey, Alexander Fabbri, Caiming Xiong, Chien-Sheng Wu
Notes: Abstract Mentions Rouge Scores
Package: GL/rougescore
- Open Domain Multi-document Summarization: A Comprehensive Study of Model Brittleness under Retrieval
Authors: John Giorgi, Luca Soldaini, Bo Wang, Gary Bader, Kyle Lo, Lucy Wang, Arman Cohan
Package: GL/rougescore
- USB: A Unified Summarization Benchmark Across Tasks and Domains
Authors: Kundan Krishna, Prakhar Gupta, Sanjana Ramprasad, Byron Wallace, Jeffrey Bigham, Zachary Lipton
Package: GL/rougescore
- Domain Adaptation for Conversational Query Production with the RAG Model Feedback
Authors: Ante Wang, Linfeng Song, Ge Xu, Jinsong Su
Package: GL/rougescore
- DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models
Authors: Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, Lingpeng Kong
Package: LA/torchmetrics
- PivotFEC: Enhancing Few-shot Factual Error Correction with a Pivot Task Approach using Large Language Models
Authors: Xingwei He, A-Long Jin, Jun Ma, Yuan Yuan, Siu Yiu
Package: GL/rougescore
- Can Large Language Models Fix Data Annotation Errors? An Empirical Study Using Debatepedia for Query-Focused Text Summarization
Authors: Md Tahmid Rahman Laskar, Mizanur Rahman, Israt Jahan, Enamul Hoque, Jimmy Huang
Package: GL/rougescore
- Prompting and Evaluating Large Language Models for Proactive Dialogues: Clarification, Target-guided, and Non-collaboration
Authors: Yang Deng, Lizi Liao, Liang Chen, Hongru Wang, Wenqiang Lei, Tat-Seng Chua
Package: BZ/pyrouge, MS/rouge
- Natural Response Generation for Chinese Reading Comprehension
Authors: Nuo Chen, Hongguang Li, Yinan Bao, Baoyuan Wang, Jia Li
Package: Custom reimplementation of Rouge
- Is ChatGPT a Good Causal Reasoner? A Comprehensive Evaluation
Authors: Jinglong Gao, Xiao Ding, Bing Qin, Ting Liu
Package: PT/rouge
- Enhancing Accessible Communication: from European Portuguese to Portuguese Sign Language
Authors: Catarina Sousa, Luisa Coheur, Mara Moita
Package: GL/rougescore
- HEAR: Hearing Enhanced Audio Response for Video-grounded Dialogue
Authors: Sunjae Yoon, Dahyun Kim, Eunseop Yoon, Hee Yoon, Junyeong Kim, Chang Yoo
Package: MS/rouge
- Ensemble-Instruct: Instruction Tuning Data Generation with a Heterogeneous Mixture of LMs
Authors: Young-Suk Lee, Md Sultan, Yousef El-Kurdi, Tahira Naseem, Asim Munawar, Radu Florian, Salim Roukos, Ramón Astudillo
Package: GL/rougescore
- Don’t Add, don’t Miss: Effective Content Preserving Generation from Pre-Selected Text Spans
Authors: Aviv Slobodkin, Avi Caciularu, Eran Hirsch, Ido Dagan
Notes: Abstract Mentions Rouge Scores
Package: GL/rougescore
- COMET-M: Reasoning about Multiple Events in Complex Sentences
Authors: Sahithya Ravi, Raymond Ng, Vered Shwartz
Package: GL/rougescore, MS/rouge
- Cross-modality Data Augmentation for End-to-End Sign Language Translation
Authors: Jinhui Ye, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Hui Xiong
Package: MS/rouge
- InstOptima: Evolutionary Multi-objective Instruction Optimization via Large Language Model-based Instruction Operators
Authors: Heng Yang, Ke Li
Package: PT/rouge
- Re-Examining Summarization Evaluation across Multiple Quality Criteria
Authors: Ori Ernst, Ori Shapira, Ido Dagan, Ran Levy
Package: BZ/pyrouge
- NarrativeXL: a Large-scale Dataset for Long-Term Memory Models
Authors: Arsenii Moskvichev, Ky-Vinh Mai
Package: GL/rougescore
- PIVOINE: Instruction Tuning for Open-world Entity Profiling
Authors: Keming Lu, Xiaoman Pan, Kaiqiang Song, Hongming Zhang, Dong Yu, Jianshu Chen
Package: ND/easyrouge, PT/rouge
- Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension
Authors: Nuo Chen, Hongguang Li, Junqing He, Yinan Bao, Xinshi Lin, Qi Yang, Jianfeng Liu, Ruyi Gan, Jiaxing Zhang, Baoyuan Wang, Jia Li
Package: Custom reimplementation of Rouge
- Mitigating Intrinsic Named Entity-Related Hallucinations of Abstractive Text Summarization
Authors: Jianbin Shen, Junyu Xuan, Christy Liang
Package: GL/rougescore
3rd Workshop on Multi-lingual Representation Learning
- Findings of the 1st Shared Task on Multi-lingual Multi-task Information Retrieval at MRL 2023
Authors: Francesco Tinner, David Ifeoluwa Adelani, Chris Emezue, Mammad Hajili, Omer Goldman, Muhammad Farid Adilazuarda, Muhammad Dehan Al Kautsar, Aziza Mirsaidova, Müge Kural, Dylan Massey, Chiamaka Chukwuneke, Chinedu Mbonu, Damilola Oluwaseun Oloyede, Kayode Olaleye, Jonathan Atala, Benjamin A. Ajibade, Saksham Bassi, Rahul Aralikatte, Najoung Kim, Duygu Ataman
Package: GL/rougescore
4th New Frontiers in Summarization Workshop
- Extract, Select and Rewrite: A Modular Sentence Summarization Method
Authors: Shuo Guan, Vishakh Padmakumar
Package: PT/rouge
- Improving Multi-Stage Long Document Summarization with Enhanced Coarse Summarizer
Authors: Jinhyeong Lim, Hyun-Je Song
Package: BZ/pyrouge
3rd Workshop for NLP Open Source Software
- nanoT5: Fast & Simple Pre-training and Fine-tuning of T5 Models with Limited Resources
Authors: Piotr Nawrot
Package: GL/rougescore
Incorrect Rouge Packages — Cited at EMNLP 2023
These packages have implementation or configuration errors that result in incorrect Rouge scores. These errors were first identified in the ACL 2023 Rogue Scores paper by comparing their output scores to ROUGE-1.5.5 under various evaluation conditions.
- Package With Errors: GL/rougescore
Incorrect implementation of Porter stemming. Incorrect default implementation of Rouge-L. Bootstrapping introduces random noise into scores (minor issue). Distributed by both Google Research (GL/rougescore) and Hugging Face (HF/evaluate).
- Package With Errors: PT/rouge
Implementation errors in both Rouge-N and Rouge-L algorithms. Not capable of performing stemming or bootstrapping.
- Package With Errors: PT/files2rouge
Incorrectly tokenizes sentences using the period character (“.”), ignoring existing tokenization. Bootstrapping introduces random noise into scores (minor issue).
- Package With Errors: DI/pyrouge
Unclear implementation errors cause incorrect Rouge scores for approximately 4% of model outputs during testing. Not capable of performing bootstrapping.
- Package With Errors: MS/rouge
Accidentally computes recall-biased Rouge F-scores using $\beta=1.2$. (Rouge F-scores are almost universally computed with $\beta=1.0$.) Performs incorrect sentence tokenization. Not capable of performing stemming or bootstrapping.
- Package With Errors: BZ/pyrouge
Contains a single line of code that silently enables stemming, even when the user attempts to disable stemming. Bootstrapping introduces random noise into scores (minor issue). Distributed and reused by several other packages, including YL/summeval.
- Package With Errors: ND/easyrouge
Omits many major components of Rouge scores: “Preprocessing like stopword removal, stemming and tokenization is left to the client.”
- Package With Errors: LA/torchmetrics
This custom reimplementation of Rouge has not been evaluated for correctness. It appears to be based on the incorrect GL/rougescore implementation, including replicating the incorrect default Rouge-L behavior.
- Custom Reimplementations
Some papers link to code that contains custom ad hoc reimplementations or wrappers of Rouge not evaluated in Rogue Scores. The correctness of custom implementations is determined by static analysis during review of the code release.
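To make the effect of the F-score weighting concrete, the sketch below computes the standard weighted F-measure, $F_\beta = (1+\beta^2)PR / (R + \beta^2 P)$, for a single precision and recall pair under $\beta=1.0$ and $\beta=1.2$. The function name `f_beta` and the example precision and recall values are illustrative assumptions, not taken from any of the packages above.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted F-measure; beta > 1 weights recall more heavily than precision."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (recall + b2 * precision)


# Hypothetical system output where recall is higher than precision.
precision, recall = 0.25, 0.50

balanced = f_beta(precision, recall, beta=1.0)       # conventional F1
recall_biased = f_beta(precision, recall, beta=1.2)  # recall-weighted variant

print(f"beta=1.0: {balanced:.4f}")
print(f"beta=1.2: {recall_biased:.4f}")
```

When recall exceeds precision, the $\beta=1.2$ score comes out higher than the $\beta=1.0$ score for the same output, and lower when precision exceeds recall, so scores computed under the two settings are not directly comparable.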
Timeline — Paper and Code Review
- December 7, 2023 — Public release of the complete EMNLP 2023 proceedings.
- December 8, 2023 — Review of EMNLP 2023 Main Proceedings papers and code releases.
- December 10, 2023 — Review of EMNLP 2023 Findings papers and code releases.
- December 12, 2023 — Review of papers and code releases for all remaining EMNLP 2023 collocated events including demonstrations, tutorials, industry track, and workshops.
Methods — Paper and Code Review
- The review includes all EMNLP 2023 papers that compute Rouge scores. Papers and citation information are downloaded from the ACL Anthology.
- A preliminary identification of Rouge papers is conducted automatically by searching for “rouge” across all full-text paper PDFs and excluding papers that do not match.
- Matching papers are reviewed manually to identify if they compute Rouge scores. This includes Rouge scores computed but not reported, such as during model training. Papers not computing Rouge scores are excluded from the review. Remaining papers are included in the review.
- Remaining papers are first searched for in-text paper citations of Rouge packages. Papers with in-text package citations are labeled accordingly and the review of the paper concludes.
- Papers without in-text paper Rouge citations are searched for in-text code release links. Papers without code links are labeled as “unknown package” and the review of the paper concludes.
- Paper code releases are searched for references to Rouge, including README documents, repository issues and pull requests, standard code files, shell scripts, and package management files such as requirements.txt or environment.yml.
- Papers with code referencing a Rouge package are labeled accordingly. Papers whose code does not reference a Rouge package are labeled as “unknown package.” Review concludes.
Challenges and Limitations — Paper and Code Review
- Parameter Differences. This review only examines use of Rouge software packages. It does not examine parameter differences, which can also lead to substantial differences in scores.
- Automated Search. A preliminary case-insensitive search for “rouge” is conducted for all papers. Only matching papers receive a full manual paper and code review. Papers that evaluate with Rouge without explicitly naming the evaluation metric, and papers that reference Rouge only inside non-searchable images, may be excluded from the review.
- Human Annotation. Manual paper review is used to identify Rouge computation and packages. Despite best efforts, all human annotation has the potential to introduce labeling errors.
- Code Availability. Because in-text package citations are rare, most identifications of Rouge packages are made through code releases. At time of review, many papers link to code releases without code. It is possible that many papers currently labeled “unknown” will eventually link to code that contains an identifiable Rouge package.
- Non-Evaluation Metrics. Some papers use Rouge for reasons other than evaluation, such as feature generation or internal training validation. This review does not make any distinction between evaluation and non-evaluation Rouge.
- Assumed Correctness. The review assumes all papers that use ROUGE-1.5.5 directly (rather than using a wrapper or reimplementation) report correct Rouge scores. However, many of these papers may run ROUGE-1.5.5 via custom ad hoc wrapper code that (like many wrapper packages) is also implemented incorrectly and introduces scoring errors.
- Package Inference. Many code releases are missing explicit dependency specifications, making it challenging to identify exact Rouge packages. In these cases, function signatures are used to identify the most likely Rouge package.
- Multiple Packages. When a code release contains multiple Rouge packages, an attempt is made to identify which packages are used to compute Rouge scores reported in the paper. If this is unclear, all Rouge packages appearing in the code release are listed.
- External Materials. Only main paper text, appendices, and code linked in papers are reviewed. External materials such as websites, slides, videos, or code with no link appearing in papers are not examined as part of this review.
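As an illustration of the package inference limitation, the sketch below shows one way call names in a Python source file could be matched against known package entry points when no dependency file is present. It is a hypothetical example, not the procedure used in this review: the `SIGNATURE_HINTS` table is an assumption covering only two class names, and several packages share similar call names, so a match suggests a likely package without confirming it.

```python
import ast

# Illustrative mapping from a called class name to the package it most
# likely comes from (an assumption, deliberately incomplete).
SIGNATURE_HINTS = {
    "RougeScorer": "GL/rougescore",
    "Rouge155": "BZ/pyrouge",
}


def infer_packages(source: str) -> list:
    """Return sorted package labels suggested by call names in Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handles both `module.Name(...)` and bare `Name(...)` calls.
        if isinstance(func, ast.Attribute):
            name = func.attr
        else:
            name = getattr(func, "id", None)
        if name in SIGNATURE_HINTS:
            found.add(SIGNATURE_HINTS[name])
    return sorted(found)


# Hypothetical evaluation script with no requirements file alongside it.
script = (
    "from rouge_score import rouge_scorer\n"
    "scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)\n"
)
print(infer_packages(script))
```

Parsing the syntax tree rather than searching raw text avoids matches inside comments and strings, but it cannot tell which of several detected packages produced the scores a paper reports.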