Table 1. Description of each field in the systematic review dataset. Download this dataset (above) and follow along with this reproducibility guide!
Paper Bibliographic Information | |
paper_ident | Unique paper identifier. |
paper_url | Paper online abstract page URL. |
paper_author | Author list. |
paper_title | Paper title. |
paper_venue | Venue abbreviation. |
paper_year | Publication year. |
paper_month | Publication month. |
paper_booktitle | BibTeX booktitle field. |
paper_address | BibTeX address field. |
paper_publisher | BibTeX publisher field. |
paper_pages | BibTeX pages. |
Paper Systematic Review Labels | |
paper_rouge | Does the paper compute Rouge scores? |
paper_rouge_prelim | Preliminary Rouge label from automated screening. |
paper_params | Does the paper list Rouge configuration parameters? |
paper_protocol | Does the paper describe its Rouge evaluation protocol? |
paper_packages | List of Rouge software packages cited in the paper. |
paper_variants | List of Rouge variants reported in the paper. |
Codebase Systematic Review Labels | |
code_url | URL of code repository cited in paper. |
code_rouge | Does the code perform reproducible Rouge evaluation? |
code_rouge_prelim | Does the code mention Rouge? |
code_packages | List of Rouge software packages used in the code. |
Additional Review Labels | |
packages | Complete list of all Rouge software packages across the paper and code. |
software_error | Does the paper/code use a Rouge software package with scoring errors? |
reproducible | Is the work reproducible, according to the Figure 1(A) definition? |
Paper Processing Pipeline Metadata | |
error_download | Did an error occur while downloading the paper? |
error_extract | Did an error occur during paper text extraction? |
Download and Setup Environment
Download the dataset and code release using the link provided at the top of the page. Extract roguescores.tar.gz to the /roguescores directory.
To improve reproducibility, the entire project runs inside an Ubuntu container. If Docker is not yet installed on your computer, follow the Docker installation instructions. The only other required dependency is GNU Make, which you probably already have. 1
Enter the /roguescores directory and run:
$ make build # Creates Docker image.
This will create a new image called roguescores. This image contains all required Python and Perl dependencies. After the image has been created, start a container by running:
$ make start # Starts console used in next sections.
This launches into an ipython console. The container has access to your local /roguescores directory. When you exit the console by running exit or pressing Control+D, the container is automatically cleaned up; the image will not be removed. When you are finished using the roguescores image, manually remove it:
$ make clean # WARNING: Removes Docker image!
The rest of this guide will involve running Python functions in the console.
Overview of the Dataset
Rogue Scores conducts a systematic review that covers over 100,000 machine learning papers. This review results in a large papers dataset, which is provided as a JSON lines file. For more information about the structure of this dataset, see Table 1.
Part of the systematic review involves identifying papers that evaluate using Rouge. These entries are necessary to reproduce several figures, like Figure 4, which provides an overview of the systematic review process. An easy way to reduce the dataset to only Rouge papers is to filter for entries with paper_rouge = true.
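For example, here is a minimal sketch of loading and filtering the dataset with plain Python (the filename papers.jsonl is an assumption for illustration; substitute the actual dataset file from the release):

import json

# Load the systematic review dataset (one JSON object per line).
# NOTE: "papers.jsonl" is an assumed filename for illustration only.
with open("papers.jsonl") as f:
    papers = [json.loads(line) for line in f]

# Keep only papers labeled as computing Rouge scores.
rouge_papers = [p for p in papers if p.get("paper_rouge")]
print(len(papers), "total entries;", len(rouge_papers), "Rouge papers")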
What to Expect When Running the Code
There are eight figures and tables in Rogue Scores. This guide will walk you through generating all of them, except two figures that are purely explanatory and do not contain any numbers. Here is a list of the figures and tables this guide will reproduce:
Figure 1, Figure 3, Figure 4: These figures are based on the systematic review dataset. They require little additional computation and will be generated very quickly.
Table 1, Table 2, Table 3: These tables are generated after running many different configurations and Rouge packages. If possible on your computer, this will be parallelized across multiple processes. They may take substantial time.
Figure 2, Figure 5: These figures do not contain any numbers to reproduce.
The code release includes a Python script, Dockerfile, and Makefile. The functions provided in the Python file form a DAG that reproduces the entire paper end to end, including downloading and setting up specific Rouge packages.
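For reference, these are the generator functions you will call from the ipython console in the sections that follow, in the order this guide uses them:

# Run inside the ipython console started by `make start`.
generate_overview_figure()    # Figure 1: systematic review overview
generate_historical_plot()    # Figure 3: reproducibility/correctness over time
generate_process_figure()     # Figure 4: systematic review procedure
generate_configs_table()      # Table 1: comparability experiments
generate_packages_table()     # Table 2: correctness experiments
generate_models_table()       # Table 3: Rogue-3 case study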
Figure 1: Systematic Review Overview
This figure is on the first page of Rogue Scores and gives a summary of the major results of the paper. Use the ipython console to generate this figure by running:
In [1]: generate_overview_figure()
The output is a basic text-only version of the paper figure.
=================== (A) REPRODUCIBILITY ===================
2834 model evaluations using ROUGE            20% reproducible
(NOTE: see paper for details on comparison studies)

================= (B) COMPARABILITY =================
Release code -- including incomplete and nonfunctional   33% papers
Release code with ROUGE evaluation                       12% papers
Perform ROUGE significance testing / bootstrapping        6% papers
List ROUGE configuration parameters                       5% papers
Cite ROUGE software package -- including unofficial      35% papers

=============== (C) CORRECTNESS ===============
Percentage of ROUGE software citations that
reference software with scoring errors                   76% papers
Simply rerunning these functions is useful, but the real value of a code release is exploring how each of these items is operationalized. Try running generate_overview_figure?? in the ipython console, which will allow you to page through the function definition.
In [2]: generate_overview_figure??
The figure claims that 76% of papers with Rouge software citations reference software with scoring errors; paging through the function definition shows exactly how this statistic is computed from the dataset fields.
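As a rough cross-check, and continuing from the loading sketch in the dataset overview, a statistic like this can be approximated from the Table 1 dataset fields. This is an illustration only and not necessarily the exact operationalization used by generate_overview_figure:

# Illustrative sketch only: approximate the Figure 1(C) statistic.
# Assumes `rouge_papers` from the earlier loading sketch, and that the
# `packages` and `software_error` fields behave as described in Table 1.
citing = [p for p in rouge_papers if p.get("packages")]
errors = [p for p in citing if p.get("software_error")]
if citing:
    print(f"{100 * len(errors) / len(citing):.0f}% of package-citing papers")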
Figure 2: Rouge Evaluation Diagram
This figure is a simple TikZ diagram that demonstrates a Rouge evaluation. There are no numbers to reproduce.
In [3]: print("Done!")
Done!
Figure 3: Reproducibility and Correctness Plots
This dual figure expands on the Figure 1 statistics by showing how Rouge reproducibility and correctness have changed over time since ROUGE-1.5.5 was introduced in 2004.
In [4]: generate_historical_plot()
The output is two PDF files (correctness.pdf and reproducibility.pdf) which form the top and bottom halves of Figure 3 in Rogue Scores; open them to compare against the published figure.
Figure 4: Systematic Review Procedure
This figure is based on the systematic review dataset and its processing pipeline metadata.
In [5]: generate_process_figure()
The output is the raw information underlying the figure, which can be plugged into the appropriate places to create the Rogue Scores TikZ diagram.
Overall Citations Collected
===========================
ACL Citations:           70676
DBLP Citations:          40013
                    ----------
Total Citations:        110689

Download and Extract Text
=========================
Before 2002:              6976
Paper Inaccessible:       3101
Extraction Errors:          30
                    ----------
Citations Excluded:      10107

Full-Text Machine Learning Papers
=================================
100582

Screen Papers for ROUGE
=======================
Automated Rules:         96861
Manual Review:             887
                    ----------
Papers Excluded:         97748

ROUGE Papers Included in Review
===============================
2834

Screen Code for ROUGE
=====================
Code Unavailable:         1697
Linking Errors:            306
                    ----------
Codebases Excluded:       2003

ROUGE Codebases Included in Review
==================================
831
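If you want to spot-check a few of these counts against the dataset itself, a quick illustrative sketch (continuing from the earlier loading sketch, and assuming the released file includes excluded entries and that the error fields are booleans) might look like:

# Illustrative sketch only: tally a few pipeline counts from the metadata fields.
download_errors = sum(bool(p.get("error_download")) for p in papers)
extract_errors  = sum(bool(p.get("error_extract")) for p in papers)
rouge_included  = sum(bool(p.get("paper_rouge")) for p in papers)
print("Download errors:", download_errors)
print("Extraction errors:", extract_errors)
print("ROUGE papers included in review:", rouge_included)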
Table 1: Rouge Comparability Experiments
This table lists ROUGE-1.5.5 configuration differences, highlighting the importance of reporting exact evaluation configurations.
In [6]: generate_configs_table()
Generating Table 1 involves running the ROUGE-1.5.5 package 11 times (for the 1 baseline configuration and 10 comparisons); each run takes a couple of minutes. The results are automatically cached after the first run, so subsequent runs will be faster.
You will see something like this for 10-30 minutes:
Protocol Experiments:  18%|████      | 2/11 [11:00<22:00, 120.00s/it]
[RUNNING] run_configs_period_sentence_splits
[RUNNING] run_configs_no_sentence_splits
[RUNNING] run_configs_apply_stemming
[RUNNING] run_configs_baseline
[RUNNING] run_configs_truncate_75_bytes
[RUNNING] run_configs_remove_stopwords
[RUNNING] run_configs_nltk_sentence_splits
[RUNNING] run_configs_nltk_tokenize
[RUNNING] run_configs_truncate_100_words
[RUNNING] run_configs_fscore_beta12
[RUNNING] run_configs_misreport_recall
... full output omitted ...
When complete, the following results table will be printed. 2
=================================================
 Rogue Scores Table 1: Comparability Experiments
                         ROUGE-1  ROUGE-2  ROUGE-L
apply_stemming              1.68     0.54     1.31
remove_stopwords           -2.21    -0.58    -0.99
no_sentence_splits          0.00     0.00   -11.17
period_sentence_splits      0.00     0.00    -3.44
nltk_sentence_splits        0.00     0.00    -0.16
nltk_tokenize              -0.00     0.00    -0.00
fscore_beta12               1.33     0.61     1.21
misreport_recall           10.88     5.00     9.92
truncate_75_bytes         -27.92   -12.93   -33.44
truncate_100_words         -0.07    -0.05    -0.07
=================================================
Rogue Scores slightly reorders this table and adds sections (e.g., “Preprocessing”). There are slight differences for all nltk_tokenize results, reported as “<0.01” in Rogue Scores. Because sentence splits only affect ROUGE-L, the sentence splitting configurations change scores only in the ROUGE-L column.
Table 2: Rouge Correctness Experiments
This table tests nonstandard Rouge packages against the ROUGE-1.5.5 reference implementation, recording how frequently they compute incorrect scores.
In [7]: generate_packages_table()
This table will also take a long time to generate.
Correctness Experiments:  49%|██████████          | 18/37 [3:00:00<3:00:00, 580.00s/it]
[RUNNING] run_packages_baseline_nostem
[RUNNING] run_packages_baseline_stem
[RUNNING] run_packages_andersjo_pyrouge_nostem
[RUNNING] run_packages_andersjo_pyrouge_stem
[RUNNING] run_packages_bheinzerling_pyrouge_nostem
[RUNNING] run_packages_bheinzerling_pyrouge_stem
[RUNNING] run_packages_chakkiworks_sumeval_nostem
[RUNNING] run_packages_chakkiworks_sumeval_stem
[RUNNING] run_packages_chakkiworks_sumeval_stopwords_nostem
[RUNNING] run_packages_chakkiworks_sumeval_stopwords_stem
[RUNNING] run_packages_danieldeutsch_sacrerouge_wrapper_nostem
[RUNNING] run_packages_danieldeutsch_sacrerouge_wrapper_stem
[RUNNING] run_packages_danieldeutsch_sacrerouge_reimplementation_nostem
[RUNNING] run_packages_danieldeutsch_sacrerouge_reimplementation_stem
[RUNNING] run_packages_diego999_pyrouge_nostem
[RUNNING] run_packages_diego999_pyrouge_stem
[RUNNING] run_packages_google_rougescore_nostem
[RUNNING] run_packages_google_rougescore_stem
[RUNNING] run_packages_google_rougescore_lsum_nostem
[RUNNING] run_packages_google_rougescore_lsum_stem
[RUNNING] run_packages_google_seq2seq_nostem
[RUNNING] run_packages_kavgan_rouge20_nostem
[RUNNING] run_packages_kavgan_rouge20_stem
[RUNNING] run_packages_kavgan_rouge20_stopwords_nostem
[RUNNING] run_packages_kavgan_rouge20_stopwords_stem
[RUNNING] run_packages_liplus_rougemetric_wrapper_nostem
[RUNNING] run_packages_liplus_rougemetric_wrapper_stem
[RUNNING] run_packages_liplus_rougemetric_reimplementation_nostem
[RUNNING] run_packages_neuraldialoguemetrics_easyrouge_nostem
[RUNNING] run_packages_pltrdy_files2rouge_nostem
[RUNNING] run_packages_pltrdy_files2rouge_stem
[RUNNING] run_packages_pltrdy_pyrouge_nostem
[RUNNING] run_packages_pltrdy_pyrouge_stem
[RUNNING] run_packages_pltrdy_rouge_nostem
[RUNNING] run_packages_tagucci_pythonrouge_nostem
[RUNNING] run_packages_tagucci_pythonrouge_stem
[RUNNING] run_packages_tylin_cococaption_nostem
... full output omitted ...
This version of Table 2 is presented in alphabetical order by package name. In Rogue Scores, this table has been reordered (to distinguish between wrappers and reimplementations).
Additionally, some packages do not implement Porter stemming. These packages are not evaluated using stemming in Table 2, which is why they are missing entries. For example, tylin_cococaption_stem is missing because this package does not stem; this is indicated with a dash in Rogue Scores Table 2. Other packages (or configurations) only support certain Rouge variants, which appear as NaN in Table 2. For example, google_rougescore_lsum_nostem evaluates an alternative version of ROUGE-L, so its ROUGE-1 and ROUGE-2 entries are NaN.
===========================================================================
 Rogue Scores Table 2: Correctness Experiments
                                                   ROUGE-1  ROUGE-2  ROUGE-L
andersjo_pyrouge_nostem                              100.0    100.0    100.0
andersjo_pyrouge_stem                                100.0    100.0    100.0
bheinzerling_pyrouge_nostem                           46.0     28.0     56.0
bheinzerling_pyrouge_stem                              0.0      0.0      0.0
chakkiworks_sumeval_nostem                            98.0     97.0    100.0
chakkiworks_sumeval_stem                              98.0     97.0    100.0
chakkiworks_sumeval_stopwords_nostem                   0.0      0.0     97.0
chakkiworks_sumeval_stopwords_stem                    73.0     61.0     99.0
danieldeutsch_sacrerouge_reimplementation_nostem       0.0      0.0     97.0
danieldeutsch_sacrerouge_reimplementation_stem         0.0      0.0     98.0
danieldeutsch_sacrerouge_wrapper_nostem                0.0      0.0      0.0
danieldeutsch_sacrerouge_wrapper_stem                  0.0      0.0      0.0
diego999_pyrouge_nostem                                4.0      4.0      4.0
diego999_pyrouge_stem                                  4.0      4.0      4.0
google_rougescore_lsum_nostem                          NaN      NaN      0.0
google_rougescore_lsum_stem                            NaN      NaN     19.0
google_rougescore_nostem                               0.0      0.0     97.0
google_rougescore_stem                                14.0      6.0     98.0
google_seq2seq_nostem                                 98.0     97.0    100.0
kavgan_rouge20_nostem                                 98.0     97.0    100.0
kavgan_rouge20_stem                                   98.0     97.0    100.0
kavgan_rouge20_stopwords_nostem                       93.0     97.0    100.0
kavgan_rouge20_stopwords_stem                         94.0     97.0    100.0
liplus_rougemetric_reimplementation_nostem            97.0     95.0     99.0
liplus_rougemetric_wrapper_nostem                      0.0      0.0      0.0
liplus_rougemetric_wrapper_stem                       13.0      6.0     18.0
neuraldialoguemetrics_easyrouge_nostem                98.0     97.0    100.0
pltrdy_files2rouge_nostem                              0.0      0.0     83.0
pltrdy_files2rouge_stem                               13.0      6.0     86.0
pltrdy_pyrouge_nostem                                  0.0      0.0      0.0
pltrdy_pyrouge_stem                                    0.0      0.0      0.0
pltrdy_rouge_nostem                                   98.0     96.0    100.0
tagucci_pythonrouge_nostem                           100.0    100.0     84.0
tagucci_pythonrouge_stem                             100.0    100.0     86.0
tylin_cococaption_nostem                               NaN      NaN    100.0
===========================================================================
Many of these packages have configuration options that influence Rouge scores.
Unfortunately, this makes assessing package correctness more difficult. For example, some packages have subtle defaults that authors may not notice (e.g., removal of stopwords by default, default stemming, default no stemming). Should packages be evaluated based on their unusual defaults? Or, should we assume authors will notice and correct them?
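For instance, the google_rougescore package evaluated in Table 2 appears to correspond to the google-research rouge_score package, which defaults to no stemming. Here is a minimal standalone sketch of how that one default changes scores; it uses the PyPI rouge-score package directly and is not the wrapper used by this code release (see rouge_package_google_rougescore?? for that):

# Illustrative sketch only (pip install rouge-score); not the code release wrapper.
from rouge_score import rouge_scorer

reference = "the cats sat quietly on the mats"
candidate = "the cat sat quietly on the mat"

for use_stemmer in (False, True):  # the package default is use_stemmer=False
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=use_stemmer)
    scores = scorer.score(reference, candidate)
    print(f"use_stemmer={use_stemmer}: "
          f"ROUGE-1 F1 = {scores['rouge1'].fmeasure:.3f}, "
          f"ROUGE-L F1 = {scores['rougeL'].fmeasure:.3f}")

Whether such a package should be scored with its default or with stemming enabled is exactly the kind of judgment call described above.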
This means there is some guesswork involved in picking which package configurations should be evaluated: what is the most likely way (or ways) that authors would configure this package? The code provides the exact details of how each implementation was evaluated. For example, let’s investigate how kavgan_rouge20 was evaluated.
To see a specific package experiment, view run_packages_{package}_{config}:
In [8]: run_packages_kavgan_rouge20_stem??
Or, to see an entire package wrapper implementation, view rouge_package_{package}:
In [9]: rouge_package_kavgan_rouge20??
Table 3: Rogue-3 “Model” Case Study
This table contains an evaluation of a baseline model (Lead-3) using a carefully chosen nonstandard Rouge evaluation protocol, reported as the “Rogue-3” model.
In [10]: generate_models_table()
This table could easily take several hours to generate: each of the two case study evaluations takes well over an hour to run.
Case Study Experiments: 100%|████████████████████| 2/2 [3:30:00<00:00, 6300.00s/it]
[RUNNING] run_models_lead_3
[RUNNING] run_models_rogue_3
... full output omitted ...
Once complete, the code will print a model evaluation table containing only Lead-3 and Rogue-3; the results for the other models shown in the paper’s Table 3 were copied from their respective papers.
==========================================
 Rogue Scores Table 3: Rogue-3 Case Study
            ROUGE-1  ROUGE-2  ROUGE-L
lead_3        40.34    17.55    36.58
rogue_3       73.89    55.80    73.89
==========================================
To see exactly how the “state of the art” Rogue-3 model was evaluated, try viewing:
In [11]: run_models_rogue_3??
Figure 5: Rouge Package Code Excerpt
The code excerpt is found on GitHub. The $ \beta = 1.2 $ error occurs on line 43.
In Rogue Scores, code comments were condensed and reformatted for presentation.
# Description: Computes ROUGE-L metric
# as described by Lin and Hovey (2004)

class Rouge():
    '''Class for computing ROUGE-L score for a set of
    candidate sentences for the MS COCO test set'''

    def __init__(self):
        # updated the value below
        # based on discussion with Hovey
        self.beta = 1.2
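To get intuition for why $ \beta = 1.2 $ changes scores (the fscore_beta12 row in Table 1 above), recall that ROUGE-L reports an F-measure $ F_\beta = (1 + \beta^2) P R / (R + \beta^2 P) $, which weights recall more heavily as $ \beta $ grows. A quick illustrative sketch with made-up precision and recall values:

# Illustrative sketch only: how the F-measure beta parameter shifts scores.
def rouge_fscore(precision, recall, beta=1.0):
    # F_beta = (1 + beta^2) * P * R / (R + beta^2 * P)
    if precision == 0 or recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

p, r = 0.40, 0.50                               # made-up precision/recall values
print(round(rouge_fscore(p, r, beta=1.0), 4))   # 0.4444, the balanced F1
print(round(rouge_fscore(p, r, beta=1.2), 4))   # 0.4535, slightly recall-weighted

When recall exceeds precision, as it does here, the $ \beta = 1.2 $ score comes out slightly higher than the balanced F1.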
1. If you don’t have make, you probably know why.

2. Note: A recent rerun of the code produced a slightly different result. This does not meaningfully change any of the findings of this table (the difference is one hundredth of a Rouge point), but is being investigated. It may be the result of different versions of dependencies being used in the code release.

=================================================
 Rogue Scores Table 1: Comparability Experiments
                         ROUGE-1  ROUGE-2  ROUGE-L
apply_stemming              1.68     0.54     1.31
remove_stopwords           -2.20    -0.58    -0.99
no_sentence_splits          0.00     0.00   -11.17
period_sentence_splits      0.00     0.00    -3.44
nltk_sentence_splits        0.00     0.00    -0.16
nltk_tokenize              -0.00     0.00    -0.00
fscore_beta12               1.33     0.61     1.21
misreport_recall           10.88     5.00     9.92
truncate_75_bytes         -27.91   -12.93   -33.43   <- different results
                          -27.92   -12.93   -33.44   <- numbers in paper
truncate_100_words         -0.07    -0.05    -0.07
=================================================