## ACL 2023 Internal Discussion Comment — Shared July 7, 2023

Over the past two days, I have spent a significant amount of time reviewing and annotating every paper in the ACL 2023 main proceedings, which became available several days ago. I would like to share the results, which I think are worrying.

At least 8% of ACL 2023 main proceedings papers report incorrect Rouge model evaluation scores. I discovered this by searching for in-text citations and in-code usage of evaluation software with known scoring errors that affect Rouge scores. I have previously tested many frequently cited Rouge packages in my Rogue Scores paper, and nearly all of them produce incorrect scores. However, as we all know, many papers do not cite software packages and do not release code. Therefore, 8% is a lower bound for incorrect evaluation at ACL 2023. The upper bound is the number of papers that use Rouge, which is ~~approximately 17%~~ 16%^{1} of ACL 2023 papers. Because nearly all papers with Rouge software citations refer to incorrect packages, evaluation errors may exist in closer to ~~17%~~ 16%^{1} of the proceedings.

My review was only able to identify 88 specific ACL 2023 papers with incorrect Rouge package citations (and includes the paper sections or codebase line numbers in which the packages are referenced). This is only a subset of the total papers that may be affected by this issue. I think it would be worthwhile for authors to review their evaluation protocols to make sure they are aligned with their specific NLP tasks and research objectives, and to make corrections if necessary.
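For readers who want to run a similar check over their own codebases, the in-code search can be sketched roughly as follows. This is a minimal illustration only, not the actual review tooling; the regex patterns are assumed example names standing in for the real package list:

```python
import re
from pathlib import Path

# Hypothetical patterns for Rouge packages with known scoring errors.
# These names are illustrative; the actual review used its own package list.
INCORRECT_PACKAGE_PATTERNS = [
    r"rouge_score",    # GL/rougescore and its HuggingFace wrappers
    r"pycocoevalcap",  # SA/pycocoevalcap
    r"nlgeval",        # MS/nlgeval
]

def find_suspect_usage(codebase: Path) -> list[tuple[str, int, str]]:
    """Return (file, line number, matched pattern) for each suspect reference."""
    hits = []
    for path in codebase.rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            for pattern in INCORRECT_PACKAGE_PATTERNS:
                if re.search(pattern, line):
                    hits.append((str(path), lineno, pattern))
    return hits
```

A paper-side search (for in-text citations) works the same way over extracted paper text rather than code.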

## Model Evaluation Errors at ACL 2023

ACL is the flagship conference of the Association for Computational Linguistics and one of the major publication venues for AI/ML/NLP research. However, this review of papers and their code finds that between 8% and 16% of all ACL 2023 main proceedings publications evaluate models using incorrect Rouge scores *(n = 170)*.

Researchers use Rouge software packages with the expectation that they compute correct scores. Unfortunately, this is not the case:

Out of the ACL 2023 papers that use Rouge *(n = 170)*, only 4% cite correct Rouge software *(n = 6)*. On the other hand, 52% of papers cite incorrect Rouge packages *(n = 88)*, which compute incorrect scores based on testing in *Rogue Scores*. Additionally, many papers evaluate with unknown software: 45% do not cite any Rouge software at all *(n = 76)*.

These results suggest the majority of ACL 2023 Rouge papers compute incorrect Rouge scores.^{2}
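The percentages above can be checked directly from the reported counts:

```python
# Counts reported above for the n = 170 ACL 2023 papers that use Rouge.
correct, incorrect, unknown = 6, 88, 76
total = correct + incorrect + unknown
assert total == 170  # every Rouge paper falls into exactly one category

# Rounded shares match the percentages quoted in the text (4%, 52%, 45%).
shares = {k: round(100 * v / total) for k, v in
          {"correct": correct, "incorrect": incorrect, "unknown": unknown}.items()}
print(shares)  # {'correct': 4, 'incorrect': 52, 'unknown': 45}
```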

## Are award-winning papers also affected?

In total, ACL 2023 issued 3 best paper awards and 38 outstanding paper awards. Affected papers include 1 best paper (incorrect package) and 9 outstanding papers (3 incorrect packages, 6 unknown packages). This is disappointing because awardee papers received an additional round of review by a nominating committee that used methodological soundness as a criterion for award selection.^{3}

## Incorrect Evaluation Software at ACL 2023

**Table 1**. List of Rouge software packages cited by ACL 2023 papers, with scoring errors determined relative to the original `ROUGE-1.5.5` reference implementation of Rouge.

| Package | Papers | Software Errors |
|---|---|---|
| `GL/rougescore` | 44 | MAJOR ERRORS: Performs incorrect Porter stemming. The default implementation of … *Note: this package is cited under many different names, including the original Google package `GL/rougescore` and several Python packages from HuggingFace, including `HF/datasets` and `HF/evaluate`.* |
| `MS/rouge` | 17 | SEVERE ERRORS: Accidentally computes recall-biased Rouge scores. *Note: this package was originally developed at Microsoft to evaluate MSCOCO caption generation and later MSMARCO. It is cited under many different names, including `MS/rouge`, `MS/nlgeval`, and `SA/pycocoevalcap`.* |
| | 17 | SEVERE ERRORS: Contains incorrect implementations of both … and `GL/seq2seq`, and inherits many of its issues. Computes scores without bootstrapping. It is impossible to compute correct Rouge scores. |
| | 7 | MAJOR ERRORS: Unintentionally enables Porter stemming even when users try to disable it. Bootstrapping introduces small deterministic random noise. In some configurations, it is possible to compute correct Rouge scores. |
| | 7 | MINOR ERRORS: Unknown implementation errors caused incorrect Rouge scores in testing for the *Rogue Scores* paper. May also perform incorrect Porter stemming. Computes scores without bootstrapping. It is impossible to compute correct Rouge scores. |
| | 6 | MINOR ERRORS: Ignores user-provided sentence tokenization and re-tokenizes sentences on the period character (“.”), which falsely assumes that all sentences end in periods and that all periods denote the end of sentences. Bootstrapping introduces small deterministic random noise. It is impossible to compute correct Rouge scores. |
| | 2 | SEVERE ERRORS: Contains incorrect implementations of both … |
| `Others` | 5 | VARIOUS ERRORS: Less common packages that were not tested in the *Rogue Scores* paper (`YL/summeval`, `LI/torchmetrics`, `PT/ignite`), and other custom `ROUGE-1.5.5` … All packages are Python reimplementations of Rouge. *Note: packages used exclusively in the ACL 2023 paper Rogue Scores are excluded from this table. For example, no ACL 2023 papers except Rogue Scores appear to cite the package `DD/sacrerouge`, so it has been excluded from this table.* |
| `ROUGE-1.5.5` | 3 | NO ERRORS: This is the original reference implementation of Rouge. *Note: although the `ROUGE-1.5.5` package is correct, papers citing it may actually use an incorrect wrapper package rather than the original Perl implementation directly. Additionally, some authors may write a custom ad hoc wrapper script for `ROUGE-1.5.5` that is implemented incorrectly and produces scoring errors.* |
| `Multilingual` | 3 | NO ERRORS: These packages are used for non-English or multilingual evaluation. There are no standardized methods for computing these scores, so these packages are labeled as correct despite differing from `ROUGE-1.5.5`. |
| `Unknown` | 76 | UNKNOWN ERRORS: There are 76 papers published in ACL 2023 that do not cite any Rouge software. |

**For ACL 2023 Authors:** Papers on this list cite Rouge software packages with errors. When tested, these packages compute scores that deviate from the `ROUGE-1.5.5` reference implementation. More information on this testing is included in the ACL 2023 paper *Rogue Scores*.

This does not necessarily mean that papers using these packages should be corrected. For example, the `MS/rouge` package computes incorrect Rouge scores. However, it is also the most common Rouge package in caption generation and many NLG tasks. Replacing this common incorrect package with an uncommon correct package would cause model scores to be incomparable with prior work.

It is unfortunate that authors must make this difficult research decision between comparability and correctness. However, even when researchers use incorrect packages for comparability reasons, the resulting Rouge scores are still incorrect. Therefore, these papers are included in this table.

These difficult research decisions today are the result of nearly two decades of systematic failure by the research community to enforce model evaluation reporting standards, and to enforce the validity of the scientific record through paper retraction and correction.
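The scoring divergence described in Table 1 is easy to reproduce in a toy setting. The sketch below is a deliberately simplified ROUGE-1 F-measure (it is not any of the packages in Table 1, and the crude suffix-stripping rule merely stands in for Porter stemming); it shows how a single preprocessing default changes the score for the same candidate/reference pair:

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str, stem: bool = False) -> float:
    """Simplified ROUGE-1 F-measure over whitespace unigrams.

    `stem` applies a crude trailing-"s" stripping rule, standing in for
    Porter stemming; real packages differ in exactly this kind of default.
    """
    def tokens(text):
        toks = text.lower().split()
        if stem:
            toks = [t[:-1] if t.endswith("s") else t for t in toks]
        return toks

    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

cand = "the model generates summaries"
ref = "the models generate good summaries"
print(rouge1_f(cand, ref, stem=False))  # ≈ 0.44
print(rouge1_f(cand, ref, stem=True))   # ≈ 0.89
```

For this pair the two settings give F-scores of 4/9 versus 8/9, a gap far larger than typical reported model improvements, which is why silently differing package defaults matter.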

^{1} Slightly different totals for accepted ACL 2023 papers appear in different sources, which changes this percentage. This article uses the most recent figure of 1,074 papers, found in the conference handbook (version 3):

> To sum long and short papers, ACL 2023 accepted 1074 (22.08%) papers for the conference.

^{2} Many papers also omit critical Rouge configuration parameters. Omission of these details harms the reproducibility and comparability of papers, even when they use a correct package. Although these configuration discrepancies are equally troubling, this review focuses only on software errors.

^{3} Are award-winning papers more likely to have Rouge citation issues than non-award papers? There are 41 total best/outstanding papers, with 10 affected: 4 with incorrect packages and 6 with unknown packages. There are 1,074 total papers, with 164 affected: 88 with incorrect packages and 76 with unknown packages. This means there are 1,033 non-award papers, with 154 affected. While a larger proportion of award-winning papers is affected than of non-award papers, a Fisher's exact test *(p = 0.1178)* indicates the sample size is too small to reasonably conclude that award-winning papers are more likely to be affected by Rouge software issues.
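The significance test above can be reproduced from these counts with a stdlib-only two-sided Fisher's exact test, using the standard "sum of no-more-likely tables" definition (the same one `scipy.stats.fisher_exact` documents). The counts are taken from this footnote:

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def p_table(x: int) -> float:
        # Hypergeometric probability of the table with cell (1, 1) equal to x.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against float rounding at equality.
    return sum(p for x in range(lo, hi + 1)
               if (p := p_table(x)) <= p_obs * (1 + 1e-9))

# Award vs. non-award papers, affected vs. unaffected (counts from the text):
# 10 of 41 award papers affected; 154 of 1,033 non-award papers affected.
p = fisher_exact_two_sided(10, 41 - 10, 154, 1033 - 154)
print(round(p, 4))
```

Run on the review's table, this should produce a p-value close to the quoted 0.1178, well above the conventional 0.05 threshold.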