Rogue Scores: Reproducibility Guide

Want to try reproducing the entire Rogue Scores paper? Download the code and data and follow this guide.

Rogue Scores Overview Data & Code [GZIP]
See overview for more detail on Rouge errors.

Table 1. Description of each field in the systematic review dataset. Download this dataset (above) and follow along with this reproducibility guide!

Paper Bibliographic Information
paper_ident: Unique paper identifier.
paper_url: Paper online abstract page URL.
paper_author: Author list.
paper_title: Paper title.
paper_venue: Venue abbreviation.
paper_year: Publication year.
paper_month: Publication month.
paper_booktitle: BibTeX booktitle field.
paper_address: BibTeX address field.
paper_publisher: BibTeX publisher field.
paper_pages: BibTeX pages field.

Paper Systematic Review Labels
paper_rouge: Does the paper compute Rouge scores? Manually reviewed.
paper_rouge_prelim: Preliminary Rouge identification using automated regular expressions, before any human review.
paper_params: Rouge parameter list cited in paper. Manually reviewed.
paper_protocol: Rouge protocol terms related to bootstrapping, stemming, and tokenization. Identified using automated regular expressions.
paper_packages: List of Rouge packages cited in the paper.
paper_variants: Rouge variant terms, such as precision, recall, and f-score. Identified using regular expressions.

Codebase Systematic Review Labels
code_url: URL of code repository cited in paper.
code_rouge: Does the code perform reproducible Rouge evaluation? Manual static review, subjective assessment of Rouge evaluation reproducibility.
code_rouge_prelim: Does the code mention Rouge? Identified using GitHub API search.
code_packages: List of Rouge packages found in code. Identified using manual review.

Additional Review Labels
packages: Complete list of all Rouge packages found in paper and code, consolidated, cleaned, de-aliased, and de-duplicated.
software_error: Does the paper/code use a Rouge package with software errors?
reproducible: Is the work reproducible? According to Figure 1(A) definition.

Paper Processing Pipeline Metadata
error_download: Error occurred in paper downloading?
error_extract: Error occurred in paper text extraction?

Download and Setup Environment

Download the dataset and code release using the link provided at the top of the page. Extract roguescores.tar.gz to the /roguescores directory.

To improve reproducibility, the entire project runs inside an Ubuntu container. If Docker is not yet installed on your computer, follow the Docker installation instructions. The only other required dependency is GNU Make, which you probably already have. 1

Enter the /roguescores directory and run:

$ make build    # Creates Docker image. 

This will create a new image called roguescores. This image contains all required Python and Perl dependencies. After the image has been created, start a container by running:

$ make start    # Starts console used in next sections. 

This launches into an ipython console. The container has access to your local /roguescores directory. When you exit the console by running exit or pressing Control+D, the container is automatically cleaned up. The image will not be removed. When you are finished using the roguescores image, manually remove it:

$ make clean    # WARNING: Removes Docker image! 

The rest of this guide will involve running Python functions in the console.

Overview of the Dataset

Rogue Scores conducts a systematic review that covers over 100,000 machine learning papers. This review results in a large papers dataset, which is provided as a JSON lines file. For more information about the structure of this dataset, see Table 1.

Part of the systematic review involves identifying papers that evaluate using Rouge. The dataset tracks the entire data processing pipeline, from start to finish. It includes all papers touched by the review, including papers that could not be downloaded or extracted. It also contains citations for many papers that do not evaluate using Rouge.

These entries are necessary to reproduce several figures, like Figure 4, which provides an overview of the systematic review process. An easy way to reduce the dataset to only Rouge papers is to filter entries by paper_rouge = true.
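As a minimal sketch of that filter in Python: the field names paper_rouge and paper_ident come from Table 1, but the file name papers.jsonl and both helper function names are hypothetical, and the sketch assumes paper_rouge is stored as a JSON boolean. Check the extracted release for the real file name before running it.

```python
import json
from typing import Dict, Iterable, Iterator, List


def iter_rouge_papers(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield dataset entries whose paper_rouge label is true."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        entry = json.loads(line)
        # Assumption: paper_rouge is a JSON boolean (true/false).
        if entry.get("paper_rouge") is True:
            yield entry


def load_rouge_papers(path: str) -> List[Dict]:
    """Read a JSON lines file and keep only the Rouge papers."""
    with open(path, encoding="utf-8") as handle:
        return list(iter_rouge_papers(handle))


# Hypothetical usage; replace the path with the dataset file in the release:
# rouge_papers = load_rouge_papers("papers.jsonl")
# print(len(rouge_papers))
```

Keeping the filter separate from the file reading makes it easy to try on a few hand-written records first.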

What to Expect When Running the Code

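There are eight figures and tables in Rogue Scores. This guide will walk you through generating all of these tables and figures, except two figures that are purely explanatory and do not contain any numbers. Here is a list of the figures and tables this guide will reproduce:

Figure 1: Systematic Review Overview
Figure 3: Reproducibility and Correctness Plots
Figure 4: Systematic Review Procedure
Table 1: Rouge Comparability Experiments
Table 2: Rouge Correctness Experiments
Table 3: Rogue-3 “Model” Case Study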

The code release includes a Python script, Dockerfile, and Makefile. The functions in the Python script form a directed acyclic graph (DAG) that reproduces the entire paper end to end, including downloading and setting up the specific Rouge package versions evaluated in the paper.

Figure 1: Systematic Review Overview

This figure is on the first page of Rogue Scores and gives a summary of the major results of the paper. Use the ipython console to generate this figure by running:

In [1]: generate_overview_figure() 

The output is a basic text-only version of the paper figure.

===================
(A) REPRODUCIBILITY
===================

2834 model evaluations using ROUGE
20% reproducible

(NOTE: see paper for details on comparison studies)

=================
(B) COMPARABILITY
=================

Release code -- including incomplete and nonfunctional
33% papers

Release code with ROUGE evaluation
12% papers

Perform ROUGE significance testing / bootstrapping
6% papers

List ROUGE configuration parameters
5% papers

Cite ROUGE software package -- including unofficial
35% papers

===============
(C) CORRECTNESS
===============

Percentage of ROUGE software citations that reference software with scoring errors
76% papers

Simply rerunning these functions is useful. But the real value of a code release is exploring how each of these items is operationalized. Try running generate_overview_figure?? in the ipython console, which will allow you to page through the function definition.

In [2]: generate_overview_figure?? 

The figure claims “76% of papers with Rouge software citations reference software with scoring errors.” Do you see the process by which that number was computed?

Figure 2: Rouge Evaluation Diagram

This figure is a simple TikZ diagram that demonstrates a Rouge evaluation. The figure does not contain any data to reproduce using the code and dataset.

In [3]: print("Done!") 

Figure 3: Reproducibility and Correctness Plots

This dual figure expands on the Figure 1 statistics by showing how Rouge reproducibility and correctness have changed over time since ROUGE-1.5.5 was introduced in 2004.

In [4]: generate_historical_plot() 

The output is two PDF files (correctness.pdf and reproducibility.pdf) which form the top and bottom halves of Figure 3 in Rogue Scores. They look like this:

[Correctness plot: papers performing ROUGE evaluation per year, 2004 to 2022 (vertical axis 0 to 600). Legend: No Package Citation (n = 1,835); Cites Incorrect Package (n = 755); Cites Correct Package (n = 244).]

[Reproducibility plot: papers performing ROUGE evaluation (vertical axis 0 to 600). Legend: Meets Basic Reproducibility Criteria (n = 568); Fails Basic Reproducibility Criteria (n = 2,266).]
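The legend counts above can be cross-checked against the Figure 1 percentages with a few lines of arithmetic. This is only a sanity check on the published numbers, not a copy of how generate_overview_figure computes them; the variable names and the percent helper are made up for illustration.

```python
# Counts copied from the Figure 3 legends in this guide.
no_citation = 1835
cites_incorrect = 755
cites_correct = 244
meets_criteria = 568
fails_criteria = 2266

# "2834 model evaluations using ROUGE" from the Figure 1 output.
total_rouge_papers = 2834


def percent(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# Both legends should partition the same set of Rouge papers.
assert no_citation + cites_incorrect + cites_correct == total_rouge_papers
assert meets_criteria + fails_criteria == total_rouge_papers

cited = cites_incorrect + cites_correct  # papers citing any Rouge package

reproducible_pct = percent(meets_criteria, total_rouge_papers)
citing_pct = percent(cited, total_rouge_papers)
incorrect_pct = percent(cites_incorrect, cited)

print(reproducible_pct, citing_pct, incorrect_pct)  # 20 35 76
```

Under this reading, the 76% figure is a share of the papers that cite a Rouge package, not a share of all Rouge papers.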

Figure 4: Systematic Review Procedure

This figure is based on the Prisma flow diagram, which is a standardized way to represent inclusion and exclusion processes and criteria for systematic reviews.

In [5]: generate_process_figure() 

The output is the raw information underlying the figure, which can be plugged into the appropriate places to create the Rogue Scores TikZ diagram.

Overall Citations Collected
===========================

ACL Citations: 70676
DBLP Citations: 40013
----------
Total Citations: 110689

Download and Extract Text
=========================

Before 2002: 6976
Paper Inaccessible: 3101
Extraction Errors: 30
----------
Citations Excluded: 10107

Full-Text Machine Learning Papers
=================================

100582

Screen Papers for ROUGE
=======================

Automated Rules: 96861
Manual Review: 887
----------
Papers Excluded: 97748

ROUGE Papers Included in Review
===============================

2834

Screen Code for ROUGE
=====================

Code Unavailable: 1697
Linking Errors: 306
----------
Codebases Excluded: 2003

ROUGE Codebases Included in Review
==================================

831
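Each stage of the flow diagram is the previous stage minus its exclusions, so the printed counts can be checked for internal consistency. The sketch below only replays the numbers from the output above; the remaining_after helper is hypothetical and is not part of the code release.

```python
from typing import Iterable


def remaining_after(start: int, exclusions: Iterable[int]) -> int:
    """Items left after removing every exclusion count from a stage."""
    return start - sum(exclusions)


# Overall citations collected (ACL + DBLP).
total_citations = 70676 + 40013

# Download and extract text: before 2002, inaccessible, extraction errors.
full_text_papers = remaining_after(total_citations, [6976, 3101, 30])

# Screen papers for ROUGE: automated rules, manual review.
rouge_papers = remaining_after(full_text_papers, [96861, 887])

# Screen code for ROUGE: code unavailable, linking errors.
rouge_codebases = remaining_after(rouge_papers, [1697, 306])

print(total_citations, full_text_papers, rouge_papers, rouge_codebases)
# 110689 100582 2834 831
```

If a rerun of generate_process_figure prints different totals, replaying the subtraction this way shows which stage diverged.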

Table 1: Rouge Comparability Experiments

This table lists Rouge score discrepancies introduced by ROUGE-1.5.5 configuration differences, highlighting the importance of reporting exact evaluation configurations.

In [6]: generate_configs_table() 

Generating Table 1 involves running the ROUGE-1.5.5 package 11 times (1 baseline configuration and 10 comparisons), which is slow on the first run. The results are cached automatically after the first run, so later runs are faster.

You will see something like this for 10-30 minutes:

Protocol Experiments:  18%|████                | 2/11 [11:00<22:00, 120.00s/it]

[RUNNING] run_configs_period_sentence_splits
[RUNNING] run_configs_no_sentence_splits
[RUNNING] run_configs_apply_stemming
[RUNNING] run_configs_baseline
[RUNNING] run_configs_truncate_75_bytes
[RUNNING] run_configs_remove_stopwords
[RUNNING] run_configs_nltk_sentence_splits
[RUNNING] run_configs_nltk_tokenize
[RUNNING] run_configs_truncate_100_words
[RUNNING] run_configs_fscore_beta12
[RUNNING] run_configs_misreport_recall

... full output omitted ...

When complete, the following results table will be printed. 2

=================================================
Rogue Scores Table 1: Comparability Experiments

                        ROUGE-1  ROUGE-2  ROUGE-L
apply_stemming             1.68     0.54     1.31
remove_stopwords          -2.21    -0.58    -0.99
no_sentence_splits         0.00     0.00   -11.17
period_sentence_splits     0.00     0.00    -3.44
nltk_sentence_splits       0.00     0.00    -0.16
nltk_tokenize             -0.00     0.00    -0.00
fscore_beta12              1.33     0.61     1.21
misreport_recall          10.88     5.00     9.92
truncate_75_bytes        -27.92   -12.93   -33.44
truncate_100_words        -0.07    -0.05    -0.07
=================================================

Rogue Scores slightly reorders this table and adds sections (e.g., “Preprocessing”). There are slight differences for all nltk_tokenize results, reported as “<0.01” in Rogue Scores. Because sentence splits only affect Rouge-L, Rouge-1 and Rouge-2 scores corresponding to sentence split experiments are exactly zero (as indicated in Rogue Scores).

Table 2: Rouge Correctness Experiments

This table tests nonstandard Rouge packages against the official ROUGE-1.5.5 reference implementation, recording how frequently they compute incorrect scores.

In [7]: generate_packages_table() 

This table will also take a long time to generate.

Correctness Experiments:  49%|██████████          | 18/37 [3:00:00<3:00:00, 580.00s/it]

[RUNNING] run_packages_baseline_nostem
[RUNNING] run_packages_baseline_stem
[RUNNING] run_packages_andersjo_pyrouge_nostem
[RUNNING] run_packages_andersjo_pyrouge_stem
[RUNNING] run_packages_bheinzerling_pyrouge_nostem
[RUNNING] run_packages_bheinzerling_pyrouge_stem
[RUNNING] run_packages_chakkiworks_sumeval_nostem
[RUNNING] run_packages_chakkiworks_sumeval_stem
[RUNNING] run_packages_chakkiworks_sumeval_stopwords_nostem
[RUNNING] run_packages_chakkiworks_sumeval_stopwords_stem
[RUNNING] run_packages_danieldeutsch_sacrerouge_wrapper_nostem
[RUNNING] run_packages_danieldeutsch_sacrerouge_wrapper_stem
[RUNNING] run_packages_danieldeutsch_sacrerouge_reimplementation_nostem
[RUNNING] run_packages_danieldeutsch_sacrerouge_reimplementation_stem
[RUNNING] run_packages_diego999_pyrouge_nostem
[RUNNING] run_packages_diego999_pyrouge_stem
[RUNNING] run_packages_google_rougescore_nostem
[RUNNING] run_packages_google_rougescore_stem
[RUNNING] run_packages_google_rougescore_lsum_nostem
[RUNNING] run_packages_google_rougescore_lsum_stem
[RUNNING] run_packages_google_seq2seq_nostem
[RUNNING] run_packages_kavgan_rouge20_nostem
[RUNNING] run_packages_kavgan_rouge20_stem
[RUNNING] run_packages_kavgan_rouge20_stopwords_nostem
[RUNNING] run_packages_kavgan_rouge20_stopwords_stem
[RUNNING] run_packages_liplus_rougemetric_wrapper_nostem
[RUNNING] run_packages_liplus_rougemetric_wrapper_stem
[RUNNING] run_packages_liplus_rougemetric_reimplementation_nostem
[RUNNING] run_packages_neuraldialoguemetrics_easyrouge_nostem
[RUNNING] run_packages_pltrdy_files2rouge_nostem
[RUNNING] run_packages_pltrdy_files2rouge_stem
[RUNNING] run_packages_pltrdy_pyrouge_nostem
[RUNNING] run_packages_pltrdy_pyrouge_stem
[RUNNING] run_packages_pltrdy_rouge_nostem
[RUNNING] run_packages_tagucci_pythonrouge_nostem
[RUNNING] run_packages_tagucci_pythonrouge_stem
[RUNNING] run_packages_tylin_cococaption_nostem

... full output omitted ...

This version of Table 2 is presented in alphabetical order by package name. In Rogue Scores, this table has been reordered (to distinguish between wrappers and reimplementations).

Additionally, some packages do not implement Porter stemming. These packages are not evaluated with stemming, so their stemmed entries are missing from the output. For example, tylin_cococaption_stem is missing because this package does not stem; Rogue Scores Table 2 marks this with a dash. Other packages (or configurations) support only certain Rouge variants, and the output marks their unsupported scores with NaN. For example, google_rougescore_lsum_nostem evaluates an alternative version of Rouge-L and does not compute Rouge-N; Rogue Scores Table 2 marks this with a dash as well.

===========================================================================
Rogue Scores Table 2: Correctness Experiments

                                                  ROUGE-1  ROUGE-2  ROUGE-L
andersjo_pyrouge_nostem                             100.0    100.0    100.0
andersjo_pyrouge_stem                               100.0    100.0    100.0
bheinzerling_pyrouge_nostem                          46.0     28.0     56.0
bheinzerling_pyrouge_stem                             0.0      0.0      0.0
chakkiworks_sumeval_nostem                           98.0     97.0    100.0
chakkiworks_sumeval_stem                             98.0     97.0    100.0
chakkiworks_sumeval_stopwords_nostem                  0.0      0.0     97.0
chakkiworks_sumeval_stopwords_stem                   73.0     61.0     99.0
danieldeutsch_sacrerouge_reimplementation_nostem      0.0      0.0     97.0
danieldeutsch_sacrerouge_reimplementation_stem        0.0      0.0     98.0
danieldeutsch_sacrerouge_wrapper_nostem               0.0      0.0      0.0
danieldeutsch_sacrerouge_wrapper_stem                 0.0      0.0      0.0
diego999_pyrouge_nostem                               4.0      4.0      4.0
diego999_pyrouge_stem                                 4.0      4.0      4.0
google_rougescore_lsum_nostem                         NaN      NaN      0.0
google_rougescore_lsum_stem                           NaN      NaN     19.0
google_rougescore_nostem                              0.0      0.0     97.0
google_rougescore_stem                               14.0      6.0     98.0
google_seq2seq_nostem                                98.0     97.0    100.0
kavgan_rouge20_nostem                                98.0     97.0    100.0
kavgan_rouge20_stem                                  98.0     97.0    100.0
kavgan_rouge20_stopwords_nostem                      93.0     97.0    100.0
kavgan_rouge20_stopwords_stem                        94.0     97.0    100.0
liplus_rougemetric_reimplementation_nostem           97.0     95.0     99.0
liplus_rougemetric_wrapper_nostem                     0.0      0.0      0.0
liplus_rougemetric_wrapper_stem                      13.0      6.0     18.0
neuraldialoguemetrics_easyrouge_nostem               98.0     97.0    100.0
pltrdy_files2rouge_nostem                             0.0      0.0     83.0
pltrdy_files2rouge_stem                              13.0      6.0     86.0
pltrdy_pyrouge_nostem                                 0.0      0.0      0.0
pltrdy_pyrouge_stem                                   0.0      0.0      0.0
pltrdy_rouge_nostem                                  98.0     96.0    100.0
tagucci_pythonrouge_nostem                          100.0    100.0     84.0
tagucci_pythonrouge_stem                            100.0    100.0     86.0
tylin_cococaption_nostem                              NaN      NaN    100.0
===========================================================================

Many of these packages have configuration options that influence Rouge scores. However, as noted in Rogue Scores, Rouge configuration options are rarely included in papers.

Unfortunately, this makes assessing package correctness more difficult. For example, some packages have subtle defaults that authors may not notice (e.g., removal of stopwords by default, default stemming, default no stemming). Should packages be evaluated based on their unusual defaults? Or, should we assume authors will notice and correct them?

This means there is some guesswork involved in picking which package configurations should be evaluated: what is the most likely way (or ways) that authors would configure this package? The code provides the exact details of how each implementation was evaluated. For example, let’s investigate how kavgan_rouge20 was evaluated.

To see a specific package experiment, view run_packages_{package}_{config}:

In [8]: run_packages_kavgan_rouge20_stem?? 

Or, to see an entire package wrapper implementation, view rouge_package_{package}:

In [9]: rouge_package_kavgan_rouge20?? 

Table 3: Rogue-3 “Model” Case Study

This table evaluates a baseline model (Lead-3) using a carefully chosen nonstandard Rouge package, under which it achieves state-of-the-art scores in summarization!

In [10]: generate_models_table() 

This table could easily take several hours to generate: the Rouge package used to evaluate Rogue-3 is extremely slow. Unfortunately, there is no way to parallelize this slow Rouge package without potentially compromising the integrity of its (incorrect) scores.

Case Study Experiments:  100%|████████████████████| 2/2 [3:30:00<00:00, 6300.00s/it]

[RUNNING] run_models_lead_3
[RUNNING] run_models_rogue_3

... full output omitted ...

Once complete, the code will print a model evaluation table containing only Lead-3 and Rogue-3. The results for the other models in Rogue Scores Table 3 were copied from their respective papers.

==================================
Rogue Scores Table 3: Rogue-3 Case Study

         ROUGE-1  ROUGE-2  ROUGE-L
lead_3     40.34    17.55    36.58
rogue_3    73.89    55.80    73.89
==================================

To see exactly how the “state of the art” Rogue-3 model was evaluated, try viewing:

In [11]: run_models_rogue_3?? 

Figure 5: Rouge Package Code Excerpt

The code excerpt is found on GitHub. The $ \beta = 1.2 $ error occurs on line 43.


In Rogue Scores, code comments were condensed and reformatted for presentation.

# Description: Computes ROUGE-L metric
# as described by Lin and Hovey (2004)

class Rouge():

    '''Class for computing ROUGE-L score for a set of
    candidate sentences for the MS COCO test set'''

    def __init__(self):

        # updated the value below
        # based on discussion with Hovey

        self.beta = 1.2
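To see why the value of beta matters, here is a minimal sketch of the standard F-beta score, the weighted harmonic mean of precision and recall. The excerpt above shows only the beta assignment, so the formula here is the textbook definition, assumed rather than copied from the package; the function name f_beta and the precision and recall values are illustrative only.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (the F-beta score).

    beta > 1 weights recall more heavily; beta = 1 gives the balanced F1.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when both are zero
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (recall + beta_sq * precision)


# Illustrative values only; not taken from the paper or the dataset.
precision, recall = 0.30, 0.50

balanced = f_beta(precision, recall, beta=1.0)
skewed = f_beta(precision, recall, beta=1.2)

print(f"beta=1.0: {balanced:.4f}")
print(f"beta=1.2: {skewed:.4f}")
```

Whenever recall exceeds precision, raising beta above 1 raises the score, which is consistent with the positive fscore_beta12 differences in the comparability experiments.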

  1. If you don’t have make, you probably know why.

  2. Note: A recent rerun of the code produced a slightly different result. This does not meaningfully change any of the findings of this table (the difference is one hundredth of a Rouge point), but is being investigated. It may be the result of different versions of dependencies being used in the code release.

    =================================================
    Rogue Scores Table 1: Comparability Experiments

                            ROUGE-1  ROUGE-2  ROUGE-L
    apply_stemming             1.68     0.54     1.31
    remove_stopwords          -2.20    -0.58    -0.99
    no_sentence_splits         0.00     0.00   -11.17
    period_sentence_splits     0.00     0.00    -3.44
    nltk_sentence_splits       0.00     0.00    -0.16
    nltk_tokenize             -0.00     0.00    -0.00
    fscore_beta12              1.33     0.61     1.21
    misreport_recall          10.88     5.00     9.92
    truncate_75_bytes        -27.91   -12.93   -33.43 <- different results
                             -27.92   -12.93   -33.44 <- numbers in paper
    truncate_100_words        -0.07    -0.05    -0.07
    =================================================