
Rogue Scores: Evaluation Reproducibility Project

For two decades, thousands of AI language models have been evaluated with a standard benchmark metric called Rouge. This systematic review discovers that model evaluation discrepancies, unspecified protocol, and widespread software errors have quietly transformed thousands of Rouge scores into rogue scores.

Evaluation Software Errors at ACL 2023 »
Between 8% and 16% of publications use incorrect packages.

ACL 2023 Paper Reproducibility Guide » Software Guide »
Reproducibility guide now available. Software guide coming soon.

Figure 1. Watch a five-minute video that introduces the main objectives, methods, and results of the Rogue Scores project.

Figure 2. A sample of tasks and models that use Rouge evaluation. Rouge scores are used for evaluating model performance in many important language and vision tasks including summarization, image and video captioning, question answering, reading comprehension, dialogue, and other generation tasks. Rouge is also used for benchmarking large language models like T5, BART, PaLM, and GPT.


Is Language Model Evaluation Reproducible, Comparable, and Correct?

The Rogue Scores project examines machine learning reproducibility with a systematic review of model evaluation using Rouge, one of the most common evaluation metrics for generative language models, appearing in thousands of papers for language and vision tasks. Machine learning evaluation is supposed to be reproducible, comparable, and correct. Does model evaluation using Rouge meet these three evaluation criteria?

A review of 2,834 Rouge papers finds that model evaluation is hard to reproduce because important evaluation details are routinely omitted from papers. Rouge protocol sensitivity testing shows that model evaluation is difficult to compare because frequently unreported evaluation details affect reported scores. Finally, software testing of 17 common Rouge packages reveals that most popular packages have scoring errors. Because many papers appear to use Rouge packages with errors, model evaluation is frequently incorrect.

These findings suggest 2,000+ language and vision papers report incorrect Rouge scores. While some Rouge software package errors are minor, others dramatically affect scores. However, because the majority of papers do not cite software packages, it is nearly impossible to distinguish between correct Rouge scores and incorrect rogue scores.

Why are these results significant?

Rogue Scores is the largest machine learning reproducibility study to date. It discovers widespread evaluation integrity issues affecting thousands of papers published over two decades. The results affect the validity and interpretation of Rouge, a widespread benchmark model evaluation metric. They suggest that major evaluation issues may exist throughout machine learning research.


Model Evaluation is Hard to Reproduce

Figure 3. Language model evaluations using Rouge are more challenging to reproduce than results in other scientific fields.

Rogue Scores Project (2023): 20% reproducible (2,834 language model evaluations using Rouge)

Compared with reproducibility studies in other scientific fields:

Open Science Collaboration (2015): 39% reproducible (100 psychology studies)
Camerer et al. (2016): 61% reproducible (18 economics studies)
Camerer et al. (2018): 62% reproducible (21 social science studies)
Errington et al. (2021): 46% reproducible (23 cancer biology studies)

What percentage of Rouge model evaluations are reproducible?

The Rogue Scores project assesses the reproducibility of Rouge scores computed by papers published in over 40 major open-access machine learning venues, including the complete ACL Anthology. Over 100,000 full-text papers published between 2004 and 2022 are downloaded and searched for Rouge evaluation details. A total of 2,834 Rouge papers and their 831 associated codebases are identified and annotated using a combination of automated labeling and expert human review. Each paper and codebase is labeled according to whether it meets the basic reproducibility criteria required to reproduce its Rouge scores.
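
As an illustration of the automated labeling step, the short sketch below is not the project's actual annotation pipeline: the package names and regular expressions are simplified assumptions used only to show how a paper's full text can be flagged for Rouge usage, software citations, and configuration details before expert human review.

```python
# Illustrative sketch of automated paper labeling (NOT the project's actual
# pipeline); the patterns below are simplified assumptions for demonstration.
import re

PATTERNS = {
    # Does the paper report any Rouge variant (ROUGE-1, ROUGE-2, ROUGE-L, ...)?
    "uses_rouge": re.compile(r"\bROUGE(?:-(?:1|2|L|W|SU4))?\b", re.IGNORECASE),
    # Does it name a Rouge implementation (examples only, not an exhaustive list)?
    "cites_package": re.compile(r"\b(?:ROUGE-?1\.5\.5|pyrouge|rouge-score|files2rouge)\b", re.IGNORECASE),
    # Does it mention configuration details such as stemming or stopword removal?
    "lists_config": re.compile(r"\b(?:porter stemm\w*|stopword removal)\b", re.IGNORECASE),
}

def label_paper(full_text: str) -> dict:
    """Return coarse boolean reproducibility labels for one paper's full text."""
    return {name: bool(pattern.search(full_text)) for name, pattern in PATTERNS.items()}

# Example: a paper that reports ROUGE-L, cites pyrouge, and mentions stemming.
print(label_paper("We report ROUGE-L computed with pyrouge using Porter stemming."))
# {'uses_rouge': True, 'cites_package': True, 'lists_config': True}
```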

Reproducibility studies in other scientific fields like psychology and economics were able to reproduce between 39% (Open Science Collaboration, 2015) and 62% (Camerer et al., 2018) of published results. By comparison, only 20% of language model evaluations met the basic reproducibility criteria for Rouge evaluation. The remaining 80% of papers and codebases omit critical details that make Rouge score reproduction challenging or impossible.

Does AI have a reproducibility problem?

Rogue Scores estimates that only 1 in 5 Rouge scores in peer-reviewed papers could be reproduced using details provided in papers and codebases. The reproduction rate of model evaluation using Rouge is much lower than comparable reproduction rates of results from other scientific fields.


Model Evaluation is Difficult to Compare

Figure 4. Language model evaluations with Rouge omit critical details that affect Rouge scoring, making scores more difficult to compare.

12% of papers release code with Rouge evaluation
5% of papers list Rouge configuration parameters
33% of papers cite any Rouge software package
6% of papers perform Rouge significance testing

How much do configuration differences affect reported Rouge scores?

Rogue Scores measures Rouge score sensitivity to 10 different Rouge configuration differences that are infrequently reported in peer-reviewed papers. Configurations include application of Porter stemming, removal of stopwords, selection of word and sentence tokenization, and other score reporting choices. The configuration sensitivity analysis finds many rarely-reported evaluation details materially affect reported Rouge scores.
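
For example, the short sketch below, which assumes the rouge-score Python package (one of several implementations cited in papers) and uses invented example texts, shows how toggling Porter stemming alone shifts ROUGE-1 and ROUGE-L scores for the same model output.

```python
# Minimal sketch of configuration sensitivity (illustrative texts, not project data).
# Uses the rouge-score package; other packages expose similar options.
from rouge_score import rouge_scorer

reference = "The quick brown foxes jumped over the lazy dogs."
prediction = "Quick brown fox jumps over a lazy dog."

for use_stemmer in (False, True):
    # Toggling a single, rarely reported parameter changes the reported scores.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=use_stemmer)
    scores = scorer.score(reference, prediction)
    print(f"stemming={use_stemmer}: "
          f"ROUGE-1 F1={scores['rouge1'].fmeasure:.3f}, "
          f"ROUGE-L F1={scores['rougeL'].fmeasure:.3f}")
```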

The Rogue Scores systematic review finds only 5% of papers list Rouge configuration parameters and only 12% of papers include Rouge evaluation in their code release. Missing evaluation details introduce uncertainty into Rouge scores. This uncertainty makes comparing Rouge scores between models evaluated in different papers difficult.

Why does configuration variability matter?

Rogue Scores finds that Rouge configuration differences can meaningfully affect model evaluation scores. However, most papers do not report Rouge configuration details. This means many models may be evaluated with different unreported configurations, making it challenging to separate legitimate modeling improvements from artificial configuration-related score variability.


Model Evaluation is Frequently Incorrect

Figure 5. Language model evaluations with Rouge are frequently performed using untested, incorrect software packages.

16 of 17 packages compute incorrect Rouge scores
76% of citations reference an incorrect Rouge package
2,000+ papers may report incorrect Rouge scores

Are Rouge scores reported in papers correct?

Rogue Scores identifies 17 Rouge evaluation packages cited in peer-reviewed papers. Until now, none of these packages appears to have been tested against the original ROUGE-1.5.5 reference implementation (Lin, 2004). For the first time, detailed model-output-level correctness testing is conducted for all 17 packages. This testing finds that only 1 of the 17 common packages used in peer-reviewed papers computes fully correct Rouge scores; the other 16 have software errors that affect scores.
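
As a simple illustration of such discrepancies, the sketch below (not the project's test suite; the texts are invented) scores the same output with two commonly cited Python Rouge packages, which can report different ROUGE-L values for identical input.

```python
# Minimal sketch: score identical input with two commonly cited Rouge packages
# (pip install rouge-score rouge). Illustrative texts, not project data.
from rouge_score import rouge_scorer   # Google's rouge-score package
from rouge import Rouge                # the "rouge" package on PyPI

reference = "The committee approved the new budget after a long debate."
prediction = "After a long debate, the committee approved the budget."

# Package 1: rouge-score
a = rouge_scorer.RougeScorer(["rougeL"]).score(reference, prediction)
print("rouge-score ROUGE-L F1:", round(a["rougeL"].fmeasure, 3))

# Package 2: rouge (note the different argument order: hypothesis first)
b = Rouge().get_scores(prediction, reference)[0]
print("rouge       ROUGE-L F1:", round(b["rouge-l"]["f"], 3))
```

Any disagreement between the two printed values on identical input is exactly the kind of package-level divergence that the correctness testing is designed to surface.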

The Rogue Scores systematic review finds 76% of papers with Rouge software citations cite one of these 16 packages with errors. However, only 35% of papers cite Rouge packages. For the 65% of papers without software citations, there is not enough information to determine if Rouge scores are correct. If 76% of all 2,834 reviewed Rouge papers also use packages with errors, over 2,000 papers may report incorrect Rouge scores.

How does this compare to other scientific errors?

Rogue Scores identifies errors in important evaluation software likely used in thousands of peer-reviewed papers. Software errors at this scale are unprecedented in scientific research, and this is likely the most significant and widespread research integrity issue in machine learning to date.


Evaluation Case Study: Mistake of the Art

Figure 6. Any model can have state-of-the-art scores: just pick an incorrect Rouge package and configuration found in peer-reviewed papers!

State-of-the-Art Summarization Model      R-1      R-2      R-L
Lead-3 (Simple Baseline)                  40.34    17.55    36.58
T5 (Raffel et al., 2020)                  43.52    21.55    40.69
BART (Lewis et al., 2020)                 44.16    21.28    40.90
PEGASUS (Zhang et al., 2020)              44.17    21.47    41.11
SIMCLS (Liu and Liu, 2021)                46.67    22.15    43.54
BRIO (Liu et al., 2022)                   47.78    23.55    44.57
Rogue-3 (Incorrect Lead-3)                73.89    55.80    73.89

Note: Rogue-3 is Lead-3 re-evaluated with a peer-reviewed incorrect Rouge configuration.

Could Rouge software errors materially affect research results?

In this evaluation case study, Rogue Scores uses nonstandard Rouge software and parameters found in other peer-reviewed papers to boost the scores of a simple baseline system (Lead-3) and transform it into a seemingly-state-of-the-art model (Rogue-3).

This case study evaluates models for text summarization, a well-studied and competitive language generation task that uses Rouge as its primary evaluation metric. One simple baseline (Lead-3) and five state-of-the-art transformer-based models are evaluated on the CNN / Daily Mail benchmark single-document summarization dataset. The rule-based Lead-3 baseline, which extracts the first three sentences of an article and uses them as its summary, has relatively low Rouge scores that recent work easily outperforms.
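
For readers unfamiliar with the baseline, here is a minimal sketch of the Lead-3 idea (not the exact implementation evaluated in the case study), assuming a naive period-based sentence split rather than a proper sentence tokenizer.

```python
# Minimal Lead-3-style baseline sketch: take the first k sentences as the summary.
# A naive period split stands in for a real sentence tokenizer (an assumption).
def lead3_summary(article: str, k: int = 3) -> str:
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    return ". ".join(sentences[:k]) + "."

article = (
    "The city council met on Tuesday. Members debated the transit plan. "
    "A vote is scheduled for next month. Residents voiced mixed opinions."
)
print(lead3_summary(article))
# The city council met on Tuesday. Members debated the transit plan. A vote is scheduled for next month.
```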

On the other hand, the incorrectly evaluated Rogue-3 exceeds the Rouge scores of the actual state-of-the-art models by nearly a factor of two! While this is an extreme example, packages with errors account for most Rouge software citations in peer-reviewed papers. So, until machine learning venues strengthen reporting requirements to prevent publication of incorrect model evaluations, perhaps Rogue-3 is the new state-of-the-art summarization model!

Are all incorrect scores this extreme?

Rogue Scores uses an incorrect package that dramatically inflates scores by 20+ Rouge points. Not all incorrect packages have such obvious errors. Many have smaller or unpredictable errors, or errors that decrease scores. However, even small and subtle errors harm the integrity of the research record.


Learn More About Rogue Scores

  1. Watch the Overview Video — View a five-minute video that summarizes the objectives, methods, and major findings of the Rogue Scores project.

  2. Review the ACL 2023 Proceedings — Read a report on evaluation software errors found in the proceedings of ACL 2023, a major natural language processing conference.

  3. Read the ACL 2023 Paper — Learn more about the systematic review, configuration sensitivity experiments, and Rouge package testing methods and results.

  4. Consult the Software Guide — Read about specific errors discovered in common Rouge software packages used for language model evaluation in peer-reviewed papers.

    (Software guide will be available soon.)

  5. Follow the Reproducibility Guide — Follow a step-by-step guide for recreating all of the figures and tables in the Rogue Scores paper using the code and data release.

  6. Download the Code & Data — Experiment with Rouge packages and configurations used in peer-reviewed papers and examine the systematic review dataset.