Evaluation Software Errors in the ACL 2023 Proceedings

ACL 2023 Internal Discussion Comment — Shared July 7, 2023
Over the past two days, I have spent a significant amount of time reviewing and annotating every paper in the ACL 2023 main proceedings, which became available several days ago. I would like to share the results, which I think are worrying.
At least 8% of ACL 2023 main proceedings papers report incorrect Rouge model evaluation scores. I discovered this by searching for in-text citations and in-code usage of evaluation software with known scoring errors that affect Rouge scores. I have previously tested many frequently cited Rouge packages in my Rogue Scores paper, and nearly all of them produce incorrect scores. However, as we all know, many papers do not cite software packages and do not release code. Therefore, 8% is the lower bound for incorrect evaluation at ACL 2023. The upper bound is the number of papers that use Rouge, which is ~~approximately 17%~~ 16% ¹ of ACL 2023 papers. Because nearly all papers with Rouge software citations refer to incorrect packages, evaluation errors may exist in closer to ~~17%~~ 16% ¹ of the proceedings.
My review identified only 88 specific ACL 2023 papers that cite incorrect Rouge packages; for each one, it records the paper section or codebase line number in which the package is referenced. This is only a subset of the total papers that may be affected by this issue. I think it would be worthwhile for authors to review their evaluation protocol to make sure it is aligned with their specific NLP tasks and research objectives, and to make corrections if necessary.

Model Evaluation Errors at ACL 2023

ACL is the flagship conference of the Association for Computational Linguistics and one of the major publication venues for AI/ML/NLP research. However, this review of papers and their code finds that between 8% and 16% of all ACL 2023 main proceedings publications evaluate models using incorrect Rouge software packages with scoring errors. These errors may harm the reproducibility, comparability, and correctness of a large number of results.

Rouge is frequently used at ACL 2023:

Figure: Many ACL 2023 papers evaluate models using the Rouge metric. Rouge papers: 16%. Papers without Rouge: 84%.

Rouge is a benchmark model evaluation metric. It computes the similarity between texts generated by a model and human-written reference texts. Over the past 20 years, Rouge has become one of the most common evaluation metrics for generative language models. This is reflected in ACL 2023 proceedings, where 16% of papers use Rouge (n = 170).
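To make the idea of "similarity between generated and reference texts" concrete, here is a minimal sketch of clipped n-gram overlap, the core idea behind Rouge-N. It is an illustration only, not a replacement for `ROUGE-1.5.5`: the function names are hypothetical, tokenization is assumed to be a lowercase whitespace split, and stemming, sentence splitting, and bootstrapping are all omitted, so its numbers should never be reported as Rouge scores.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def toy_rouge_n(candidate, reference, n=1):
    """Illustrative n-gram overlap: returns (precision, recall, f1)."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection clips each n-gram count to the smaller of the two.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = toy_rouge_n("the cat sat on the mat", "the cat is on the mat")
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

Even in this toy version, every choice (tokenization, clipping, how precision and recall are combined) changes the number that comes out, which is why the exact package and configuration matter.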

Researchers use Rouge scores to measure model performance and compare models developed by different research teams. Because it is still uncommon for researchers to publicly release model code and parameters, it is especially important that the Rouge scores reported in papers are reproducible, comparable, and correct.

Unfortunately, this is not the case:

Figure: Almost all ACL 2023 Rouge papers evaluate using incorrect or unknown packages. Incorrect: 52%. Unknown: 45%. Correct: 4%.

Out of ACL 2023 Rouge papers (n = 170), only 4% cite correct Rouge software (n = 6). On the other hand, 52% of papers cite Rouge software packages with errors (n = 88), which compute incorrect scores based on testing in Rogue Scores. Additionally, many papers evaluate with unknown software: 45% do not cite any Rouge package (n = 76).
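The percentages in this review follow directly from the paper counts. The short sketch below recomputes them; the counts come from this article (including the 1,074 total from the conference handbook), while the variable and function names are only illustrative.

```python
# Counts as reported in this review.
TOTAL_PAPERS = 1074  # accepted main proceedings papers (conference handbook figure)
ROUGE_PAPERS = 170   # papers that use Rouge
INCORRECT, CORRECT, UNKNOWN = 88, 6, 76


def pct(part, whole):
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


# The three categories should account for every Rouge paper.
assert INCORRECT + CORRECT + UNKNOWN == ROUGE_PAPERS

shares = {
    "incorrect, of Rouge papers": pct(INCORRECT, ROUGE_PAPERS),
    "correct, of Rouge papers": pct(CORRECT, ROUGE_PAPERS),
    "unknown, of Rouge papers": pct(UNKNOWN, ROUGE_PAPERS),
    "lower bound, of all papers": pct(INCORRECT, TOTAL_PAPERS),
    "upper bound, of all papers": pct(ROUGE_PAPERS, TOTAL_PAPERS),
}

for label, value in shares.items():
    print(f"{label}: {value:.1f}%")
```

Because each share is rounded independently, the three Rouge-paper percentages (52%, 4%, 45%) add up to 101% rather than 100%.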

These results suggest the majority of ACL 2023 papers compute incorrect Rouge scores. Even though many papers do not cite specific Rouge evaluation software, most papers with Rouge citations use Rouge packages with errors. Unfortunately, because Rouge package citations are so frequently omitted, it is impossible to determine the degree of incorrectness of Rouge scores reported in a large number of ACL 2023 papers. ²

Are award-winning papers also affected?
In total, ACL 2023 issued 3 best paper awards and 38 outstanding paper awards. Affected papers include 1 best paper (incorrect package) and 9 outstanding papers (3 incorrect packages, 6 unknown packages). This is disappointing because awardee papers received an additional round of review by a nominating committee that used methodological soundness as a criterion for award selection. ³

Incorrect Evaluation Software at ACL 2023

Table 1. List of Rouge evaluation packages found in ACL 2023 main proceedings papers. In total, 88 papers cite packages with errors and 76 papers contain no package citations. Note that, because some papers use multiple Rouge packages, the total number of package citations in this table is larger than 88. Only 3 papers appear to use the original ROUGE-1.5.5 implementation of Rouge and only 3 papers appear to use an acceptable alternative Rouge package for multilingual evaluation.

Package	Papers	Software Errors
`GL/rougescore`	44	MAJOR ERRORS Performs incorrect Porter stemming. The default implementation of Rouge-L performs incorrect sentence tokenization. This package performs nondeterministic bootstrap resampling that introduces small random noise into Rouge scores. Each run produces slightly different scores. In some configurations, it is possible to compute correct Rouge scores using this package, except for the random noise introduced during bootstrapping.
		Note: This package is cited under many different names, including the original Google package `GL/rougescore` and several Python packages from HuggingFace including `HF/datasets` and `HF/evaluate`.
`MS/rouge`	17	SEVERE ERRORS Accidentally computes recall-biased Rouge F-scores with $ \beta=1.2 $. Rouge F-scores are normally computed using $ \beta=1.0 $. There is no evidence suggesting any paper has ever used $ \beta=1.2 $ until papers started using this package. Fails to perform sentence tokenization when computing Rouge-L scores. Computes scores without bootstrapping. It is impossible to compute correct Rouge scores using this package.
		Note: This package was originally developed at Microsoft to evaluate MSCOCO caption generation and later MSMARCO. This package is cited under many different names, including `MS/rouge`, `MS/nlgeval`, and `SA/pycocoevalcap`.
`PT/rouge`	17	SEVERE ERRORS Contains incorrect implementation of both Rouge-N and Rouge-L algorithms. Not capable of performing Porter stemming. This package is a modified version of another incorrect package, `GL/seq2seq`, and inherits many of its issues. Computes scores without bootstrapping. It is impossible to compute correct Rouge scores using this package.
`BZ/pyrouge`	7	MAJOR ERRORS Unintentionally enables Porter stemming even when users try to disable it. Bootstrapping introduces small deterministic random noise. In some configurations, it is possible to compute correct Rouge scores using this package, except for the random noise introduced during bootstrapping.
`DI/pyrouge`	7	MINOR ERRORS Unknown implementation errors cause incorrect Rouge scores for 4% of model outputs, based on the correctness evaluation conducted in the Rogue Scores paper. May also perform incorrect Porter stemming. Computes scores without bootstrapping. It is impossible to compute correct Rouge scores using this package.
`PT/files2rouge`	6	MINOR ERRORS Ignores user-provided sentence tokenization and re-tokenizes sentences by the period character (“.”), which falsely assumes all sentences end in periods and that all periods denote the end of sentences. Bootstrapping introduces small deterministic random noise. It is impossible to compute correct Rouge scores using this package.
`GL/seq2seq`	2	SEVERE ERRORS Contains incorrect implementation of both Rouge-N and Rouge-L algorithms. Not capable of performing Porter stemming. It appears this package was originally intended for quick non-publishable evaluation during model training only, not intended to be used to compute official scores reported in papers. Computes scores without bootstrapping. It is impossible to compute correct Rouge scores using this package.
`Others`	5	VARIOUS ERRORS Less common packages that were not tested in the Rogue Scores paper (`YL/summeval`, `LI/torchmetrics`, `PT/ignite`), and other custom Rouge implementations used by single papers. Untested and custom packages may have various errors, but they have not been tested against `ROUGE-1.5.5`. All packages are Python reimplementations of Rouge, and it is unlikely any of them can be used to compute correct Rouge scores under any configuration.
		Note: Packages that are used exclusively in the ACL 2023 paper Rogue Scores are excluded from this table. For example, no ACL 2023 papers except Rogue Scores appear to cite the package `DD/sacrerouge`, so this package has been excluded from this table.


`ROUGE-1.5.5`	3	NO ERRORS This is the original reference implementation of Rouge developed in 2004, written in Perl. The scores computed by this package define the Rouge metric and they are correct by definition.
		Note: Although the `ROUGE-1.5.5` package is correct, it is possible that papers citing this package actually use an incorrect wrapper package instead of directly using the original Perl implementation. Additionally, some authors may implement a custom ad hoc wrapper script for the `ROUGE-1.5.5` package that is implemented incorrectly and results in scoring errors.
`Multilingual`	3	NO ERRORS These packages are used for non-English or multilingual evaluation. There are no standardized methods for computing these scores. For this reason, these packages are labeled as correct despite differing from `ROUGE-1.5.5`.


`Unknown`	76	UNKNOWN ERRORS There are 76 papers published in ACL 2023 that do not cite any Rouge package, either in the paper text or in a code release. Because most papers that cite Rouge packages use one with errors, it is likely most of these papers also use incorrect packages. Unfortunately, each incorrect package produces different scores, and because these papers do not cite a software package, it is unclear how their scores have been computed and how incorrect they are.
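To illustrate why the $ \beta=1.2 $ setting noted for `MS/rouge` in the table matters, the sketch below uses the standard F-beta definition, $ F_\beta = (1+\beta^2)PR / (R + \beta^2 P) $. This is not code from any of the packages listed; the function name and the precision/recall values are made up for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (recall + b2 * precision)


# Hypothetical precision/recall pairs, chosen only to show the direction of the shift.
for precision, recall in [(0.25, 0.50), (0.50, 0.25), (0.40, 0.40)]:
    balanced = f_beta(precision, recall, beta=1.0)
    recall_biased = f_beta(precision, recall, beta=1.2)
    print(f"P={precision:.2f} R={recall:.2f} "
          f"F(beta=1.0)={balanced:.4f} F(beta=1.2)={recall_biased:.4f}")
```

When recall exceeds precision, the $ \beta=1.2 $ score is higher than the usual F1; when precision exceeds recall, it is lower. The two settings agree only when precision and recall are equal, so scores computed with different values of $ \beta $ are not directly comparable.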

For ACL 2023 Authors: Papers on this list cite Rouge software packages with errors. When tested, these packages compute scores that deviate from the ROUGE-1.5.5 reference implementation. More information on this testing is included in the ACL 2023 paper Rogue Scores.
This does not necessarily mean papers using these packages should be corrected. For example, the MS/rouge package computes incorrect Rouge scores. However, it is also the most common Rouge package in caption generation and many NLG tasks. Replacing this common incorrect package with an uncommon correct package would cause model scores to be incomparable with prior work.
It is unfortunate that authors must make this difficult research decision between comparability and correctness. However, even when researchers use incorrect packages for comparability reasons, the resulting Rouge scores are still incorrect. Therefore, these papers are included in this table.
These difficult research decisions today are the result of nearly two decades of systematic failure by the research community to enforce model evaluation reporting standards, and systematic failure to enforce the validity of the scientific record through paper retraction and correction.

¹ Sources report slightly different totals for accepted ACL 2023 papers, which changes this percentage. This article uses the most recent figure, 1,074, found in the conference handbook (version 3):
“To sum long and short papers, ACL 2023 accepted 1074 (22.08%) papers for the conference.”
² Many papers also omit critical Rouge configuration parameters. Omission of these details harms the reproducibility and comparability of papers, even when they use a correct package. Although these configuration discrepancies are equally troubling, this review focuses only on software errors.
³ Are non-award papers more likely to have Rouge citation issues than award-winning papers? There are 41 total best/outstanding papers with 10 affected: 4 with incorrect packages and 6 with unknown packages. There are 1,074 total papers with 164 affected: 88 with incorrect packages and 76 with unknown packages. This means there are 1,033 non-award papers with 154 affected. While a larger proportion of award-winning papers are affected than non-award papers, a Fisher’s exact test (p = 0.1178) indicates the sample size is too small to reasonably conclude that award-winning papers are more likely to be affected by Rouge software issues.
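The award versus non-award comparison can be rechecked with a small standard-library implementation of a two-sided Fisher's exact test. This is a sketch under stated assumptions: the function name is hypothetical, the 2x2 table is built from the counts in this article (10 of 41 award papers affected, 154 of 1,033 non-award papers affected), and the two-sided rule used here (sum every table no more probable than the observed one) is the common convention but may not be the exact procedure behind the reported p = 0.1178.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)

    def prob(k):
        # Hypergeometric probability of k affected papers in the first row,
        # with all row and column totals held fixed.
        return comb(col1, k) * comb(total - col1, row1 - k) / denom

    lo = max(0, row1 - (total - col1))
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of every table at least as unlikely as the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Rows: award papers, non-award papers. Columns: affected, not affected.
p_value = fisher_exact_two_sided(10, 31, 154, 879)
print(f"two-sided p = {p_value:.4f}")
```

A p-value above the conventional 0.05 threshold is consistent with the conclusion above: the difference in proportions (about 24% versus 15%) is not enough, at this sample size, to conclude that award-winning papers are more likely to be affected.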

Analyses.org
