ACL 2023 Internal Discussion Comment — Shared July 7, 2023
Over the past two days, I have spent a significant amount of time reviewing and annotating every paper in the ACL 2023 main proceedings, which became available several days ago. I would like to share the results, which I think are worrying.
At least 8% of ACL 2023 main proceedings papers report incorrect Rouge model evaluation scores. I discovered this by searching for in-text citations and in-code usage of evaluation software with known scoring errors that affect Rouge scores. I have previously tested many frequently cited Rouge packages in my Rogue Scores paper, and nearly all of them produce incorrect scores. However, as we all know, many papers do not cite software packages and do not release code. Therefore, 8% is the lower bound for incorrect evaluation at ACL 2023. The upper bound is the number of papers that use Rouge, which is approximately 16%¹ of ACL 2023 papers. Because nearly all papers with Rouge software citations refer to incorrect packages, evaluation errors may exist in closer to 16% of the proceedings.
My review was only able to identify 88 specific ACL 2023 papers with incorrect Rouge package citations (and includes the paper sections or codebase line numbers in which the packages are referenced). This is only a subset of the total papers that may be affected by this issue. I think it would be worthwhile for authors to review their evaluation protocol to make sure it is aligned with their specific NLP tasks and research objectives, and to make corrections if necessary.
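For readers who want to run a similar check on their own paper codebases, here is a minimal sketch of the kind of search described above. It is my own illustration rather than the tooling used for this review: it simply flags any line in a repository's Python files that mentions Rouge, so the specific package can then be compared against Table 1 by hand.

```python
# Minimal sketch (not the review's actual tooling): flag files in a codebase
# that appear to use a Rouge implementation, so the specific package can be
# checked against Table 1 by hand.
import re
import sys
from pathlib import Path

ROUGE_PATTERN = re.compile(r"rouge", re.IGNORECASE)

def find_rouge_usage(repo_root: str) -> None:
    """Print every line mentioning Rouge in Python files under repo_root."""
    for path in Path(repo_root).rglob("*.py"):
        try:
            lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
        except OSError:
            continue  # unreadable file; skip it
        for lineno, line in enumerate(lines, start=1):
            if ROUGE_PATTERN.search(line):
                print(f"{path}:{lineno}: {line.strip()}")

if __name__ == "__main__":
    # Usage: python find_rouge.py path/to/paper-codebase
    find_rouge_usage(sys.argv[1] if len(sys.argv) > 1 else ".")
```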
Model Evaluation Errors at ACL 2023
ACL is the flagship conference of the Association for Computational Linguistics and one of the major publication venues for AI/ML/NLP research. However, this review of papers and their code finds that between 8% and 16% of all ACL 2023 main proceedings publications evaluate models using incorrect Rouge scores.
Researchers use Rouge with the expectation that its scores are computed correctly and are comparable across papers. Unfortunately, this is not the case: out of the ACL 2023 papers that evaluate with Rouge, 88 cite software packages with known scoring errors, and a further 76 cite no identifiable package at all, so their scores cannot be verified.
These results suggest the majority of ACL 2023 papers that evaluate with Rouge compute incorrect scores.
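To make these bounds concrete, the arithmetic below uses the 1,074-paper total from footnote 1 and treats the per-category paper counts reported in Table 1 (88 incorrect, 76 unknown, 3 + 3 correct) as disjoint. This is my own back-of-the-envelope reconstruction, not the exact calculation behind the figures above.

```python
# Back-of-the-envelope reconstruction of the 8%-16% range (assumes the paper
# counts below are disjoint; not the original calculation).
TOTAL_PAPERS = 1074    # ACL 2023 main proceedings (handbook v3, footnote 1)
INCORRECT = 88         # papers citing Rouge packages with known errors
UNKNOWN = 76           # Rouge papers with no identifiable package
CORRECT = 3 + 3        # ROUGE-1.5.5 and multilingual packages (Table 1)

lower_bound = INCORRECT / TOTAL_PAPERS
rouge_papers = INCORRECT + UNKNOWN + CORRECT
upper_bound = rouge_papers / TOTAL_PAPERS

print(f"lower bound: {lower_bound:.1%}")                     # ~8.2%  -> "at least 8%"
print(f"Rouge papers: {rouge_papers} ({upper_bound:.1%})")   # ~15.8% -> "approximately 16%"
```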
Are award-winning papers also affected?
In total, ACL 2023 issued 3 best paper awards and 38 outstanding paper awards. Affected papers include 1 best paper (incorrect package) and 9 outstanding papers (3 with incorrect packages, 6 with unknown packages). This is disappointing because awardee papers received an additional round of review by a nominating committee that used methodological soundness as a criterion for award selection.³
Incorrect Evaluation Software at ACL 2023
Table 1. List of Rouge software packages cited or used by ACL 2023 papers, with errors measured against the ROUGE-1.5.5 reference implementation of Rouge.
| Package | Papers | Software Errors |
|---|---|---|
| GL/rougescore | 44 | MAJOR ERRORS. Performs incorrect Porter stemming. The default implementation of Rouge-L also differs from the reference implementation. Note: This package is cited under many different names, including the original Google package GL/rougescore and several Python packages from HuggingFace, including HF/datasets and HF/evaluate. |
| MS/rouge | 17 | SEVERE ERRORS. Accidentally computes recall-biased Rouge scores. Note: This package was originally developed at Microsoft to evaluate MSCOCO caption generation and later MSMARCO. It is cited under many different names, including MS/rouge, MS/nlgeval, and SA/pycocoevalcap. |
| | 17 | SEVERE ERRORS. Contains an incorrect Rouge implementation derived from GL/seq2seq and inherits many of its issues. Computes scores without bootstrapping. It is impossible to compute correct Rouge scores with this package. |
| | 7 | MAJOR ERRORS. Unintentionally enables Porter stemming even when users try to disable it. Bootstrapping introduces a small amount of deterministic random noise. In some configurations, it is possible to compute correct Rouge scores with this package. |
| | 7 | MINOR ERRORS. Unknown implementation errors cause incorrect Rouge scores. |
| | 6 | MINOR ERRORS. Ignores user-provided sentence tokenization and re-tokenizes sentences on the period character (“.”), which falsely assumes that all sentences end in periods and that all periods denote sentence boundaries. Bootstrapping introduces a small amount of deterministic random noise. It is impossible to compute correct Rouge scores with this package. |
| | 2 | SEVERE ERRORS. Contains incorrect implementations of multiple Rouge variants. |
| Others | 5 | VARIOUS ERRORS. Less common packages that were not tested in the Rogue Scores paper (YL/summeval, LI/torchmetrics, PT/ignite), as well as other custom ROUGE-1.5.5 wrappers. All of these packages are Python reimplementations of Rouge. Note: Packages used exclusively in the ACL 2023 paper Rogue Scores are excluded from this table. For example, no ACL 2023 papers except Rogue Scores appear to cite the package DD/sacrerouge, so it has been excluded. |
| ROUGE-1.5.5 | 3 | NO ERRORS. This is the original reference implementation of Rouge. Note: Although the ROUGE-1.5.5 package is correct, it is possible that papers citing this package actually use an incorrect wrapper package rather than the original Perl implementation directly. Additionally, some authors may write a custom ad hoc wrapper script around ROUGE-1.5.5 that is implemented incorrectly and produces scoring errors. |
| Multilingual | 3 | NO ERRORS. These packages are used for non-English or multilingual evaluation. There are no standardized methods for computing these scores. For this reason, these packages are labeled as correct despite differing from ROUGE-1.5.5. |
| Unknown | 76 | UNKNOWN ERRORS. There are 76 papers published in ACL 2023 that do not cite any identifiable Rouge software package, so it is unknown whether their scores are computed correctly. |
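To illustrate why the same implementation appears in Table 1 under several names, the sketch below scores one example both through HF/evaluate and directly through GL/rougescore. To the best of my understanding, the HuggingFace metric wraps the Google package, so both calls should exercise the same underlying scoring code, but this should be verified against the specific versions installed.

```python
# Illustration of package aliasing (my sketch, not from the review): the
# HF/evaluate "rouge" metric is assumed to wrap the GL/rougescore package,
# so both code paths below reach the same implementation.
import evaluate                         # HF/evaluate
from rouge_score import rouge_scorer    # GL/rougescore

predictions = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

# Path 1: through HF/evaluate (returns aggregated F-measures).
hf_rouge = evaluate.load("rouge")
print(hf_rouge.compute(predictions=predictions, references=references,
                       use_stemmer=True))

# Path 2: directly through GL/rougescore (returns precision/recall/F per type).
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
print(scorer.score(references[0], predictions[0]))
```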
For ACL 2023 Authors: Papers on this list cite Rouge software packages with errors. When tested, these packages compute scores that deviate from the ROUGE-1.5.5 reference implementation. More information on this testing is included in the ACL 2023 paper Rogue Scores.
This does not necessarily mean that papers using these packages should be corrected. For example, the MS/rouge package computes incorrect Rouge scores. However, it is also the most common Rouge package in caption generation and many NLG tasks. Replacing this common incorrect package with an uncommon correct package would cause model scores to be incomparable with prior work.
It is unfortunate that authors must make this difficult research decision between comparability and correctness. However, even when researchers use incorrect packages for comparability reasons, the resulting Rouge scores are still incorrect. Therefore, these papers are included in this table.
These difficult research decisions today are the result of nearly two decades of systematic failure by the research community to enforce model evaluation reporting standards, and to enforce the validity of the scientific record through paper retraction and correction.
¹ There are slightly different counts of the total number of accepted ACL 2023 papers, which change this percentage. This article uses the most recent figure of 1,074 papers from the conference handbook (version 3): “To sum long and short papers, ACL 2023 accepted 1074 (22.08%) papers for the conference.”
² Many papers also omit critical Rouge configuration parameters. Omitting these details harms the reproducibility and comparability of papers, even when they use a correct package. Although these configuration discrepancies are equally troubling, this review focuses only on software errors.
³ Are non-award papers more likely to have Rouge citation issues than award-winning papers? There are 41 best/outstanding papers in total, of which 10 are affected: 4 with incorrect packages and 6 with unknown packages. There are 1,074 total papers, of which 164 are affected: 88 with incorrect packages and 76 with unknown packages. This means there are 1,033 non-award papers, of which 154 are affected. While a larger proportion of award-winning papers is affected than non-award papers, a Fisher’s exact test (p = 0.1178) indicates the sample size is too small to reasonably conclude that award-winning papers are more likely to be affected by Rouge software issues.
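The test in footnote 3 can be reproduced, at least approximately, from the counts given there. The sketch below assumes a two-sided Fisher’s exact test on an affected/unaffected split and is my reconstruction rather than the script used for the review.

```python
# Reconstruction of the footnote-3 comparison (assumes a two-sided test; the
# 2x2 table is derived from the counts in the footnote, not the underlying data).
from scipy.stats import fisher_exact

#                   affected  unaffected
award_papers     = [10,       41 - 10]      # 41 best/outstanding papers
non_award_papers = [154,      1033 - 154]   # 1,033 remaining papers

odds_ratio, p_value = fisher_exact([award_papers, non_award_papers],
                                   alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")  # reported p = 0.1178
```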