Table 1. Description of each field in the systematic review dataset. Download this dataset (above) and follow along with this reproducibility guide!
Paper Bibliographic Information | |
paper_ident | Unique paper identifier. |
paper_url | Paper online abstract page URL. |
paper_author | Author list. |
paper_title | Paper title. |
paper_venue | Venue abbreviation. |
paper_year | Publication year. |
paper_month | Publication month. |
paper_booktitle | BibTeX booktitle field. |
paper_address | BibTeX address field. |
paper_publisher | BibTeX publisher field. |
paper_pages | BibTeX pages. |
Paper Systematic Review Labels | |
paper_rouge | Does the paper compute Rouge scores? |
paper_rouge_prelim | Preliminary Rouge label from automated screening. |
paper_params | Does the paper list Rouge configuration parameters? |
paper_protocol | Does the paper describe its Rouge evaluation protocol? |
paper_packages | List of Rouge software packages cited in the paper. |
paper_variants | List of Rouge variants reported in the paper. |
Codebase Systematic Review Labels | |
code_url | URL of code repository cited in paper. |
code_rouge | Does the code perform reproducible Rouge evaluation? |
code_rouge_prelim | Does the code mention Rouge? |
code_packages | List of Rouge software packages used in the code. |
Additional Review Labels | |
packages | Complete list of all Rouge software packages across the paper and code. |
software_error | Does the paper/code use a Rouge software package with scoring errors? |
reproducible | Is the work reproducible, according to the Figure 1(A) definition? |
Paper Processing Pipeline Metadata | |
error_download | Did an error occur while downloading the paper? |
error_extract | Did an error occur during paper text extraction? |
Download and Setup Environment
Download the dataset and code release using the link provided at the top of the page. Extract roguescores.tar.gz to the /roguescores directory.
To improve reproducibility, the entire project runs inside an Ubuntu container. If Docker is not yet installed on your computer, follow the Docker installation instructions. The only other required dependency is GNU Make, which you probably already have. 1
Enter the /roguescores directory and run:
$ make build # Creates Docker image.
This will create a new image called roguescores. This image contains all required Python and Perl dependencies. After the image has been created, start a container by running:
$ make start # Starts console used in next sections.
This launches into an ipython console. The container has access to your local /roguescores directory. When you exit the console by running exit or pressing Control+D, the container is automatically cleaned up; the image will not be removed. When you are finished using the roguescores image, manually remove it:
$ make clean # WARNING: Removes Docker image!
The rest of this guide will involve running Python functions in the console.
Overview of the Dataset
Rogue Scores conducts a systematic review that covers over 100,000 machine learning papers. This review results in a large papers dataset, which is provided as a JSON lines file. For more information about the structure of this dataset, see Table 1.
Part of the systematic review involves identifying papers that evaluate using Rouge. These entries are necessary to reproduce several figures, like Figure 4, which provides an overview of the systematic review process. An easy way to reduce the dataset to only Rouge papers is to filter for entries with paper_rouge = true.
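For example, here is a minimal sketch of loading and filtering the dataset with plain Python (the filename papers.jsonl is an assumption for illustration; substitute the actual dataset file from the release):

import json

# Load the systematic review dataset (one JSON object per line).
# NOTE: "papers.jsonl" is an assumed filename for illustration only.
with open("papers.jsonl") as f:
    papers = [json.loads(line) for line in f]

# Keep only papers labeled as computing Rouge scores.
rouge_papers = [p for p in papers if p.get("paper_rouge")]
print(len(papers), "total entries;", len(rouge_papers), "Rouge papers")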
What to Expect When Running the Code
There are eight figures and tables in Rogue Scores. This guide will walk you through generating all of them, except two figures that are purely explanatory and do not contain any numbers. Here is a list of the figures and tables this guide will reproduce:
Figure 1, Figure 3, Figure 4: These figures are based on the systematic review dataset. They require little additional computation and will be generated very quickly.
Table 1, Table 2, Table 3: These tables are generated after running many different configurations and Rouge packages. If possible on your computer, this will be parallelized across multiple processes. They may take substantial time.
Figure 2, Figure 5: These figures do not contain any numbers to reproduce.
The code release includes a Python script, Dockerfile, and Makefile. The functions provided in the Python file form a DAG that reproduces the entire paper end to end, including downloading and setting up specific Rouge packages.
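For reference, these are the generator functions you will call from the ipython console in the sections that follow, in the order this guide uses them:

# Run inside the ipython console started by `make start`.
generate_overview_figure()    # Figure 1: systematic review overview
generate_historical_plot()    # Figure 3: reproducibility/correctness over time
generate_process_figure()     # Figure 4: systematic review procedure
generate_configs_table()      # Table 1: comparability experiments
generate_packages_table()     # Table 2: correctness experiments
generate_models_table()       # Table 3: Rogue-3 case study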
Figure 1: Systematic Review Overview
This figure is on the first page of Rogue Scores and gives a summary of the major results of the paper. Use the ipython console to generate this figure by running:
In [1]: generate_overview_figure()
The output is a basic text-only version of the paper figure.
=================== (A) REPRODUCIBILITY ===================
2834 model evaluations using ROUGE            20% reproducible
(NOTE: see paper for details on comparison studies)

================= (B) COMPARABILITY =================
Release code -- including incomplete and nonfunctional   33% papers
Release code with ROUGE evaluation                       12% papers
Perform ROUGE significance testing / bootstrapping        6% papers
List ROUGE configuration parameters                       5% papers
Cite ROUGE software package -- including unofficial      35% papers

=============== (C) CORRECTNESS ===============
Percentage of ROUGE software citations that
reference software with scoring errors                   76% papers
Simply rerunning these functions is useful, but the real value of a code release is exploring how each of these items is operationalized. Try running generate_overview_figure?? in the ipython console, which will allow you to page through the function definition.
In [2]: generate_overview_figure??
The figure claims that 76% of papers with Rouge software citations reference software with scoring errors; paging through the function definition shows exactly how this statistic is computed from the dataset fields.
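As a rough cross-check, and continuing from the loading sketch in the dataset overview, a statistic like this can be approximated from the Table 1 dataset fields. This is an illustration only and not necessarily the exact operationalization used by generate_overview_figure:

# Illustrative sketch only: approximate the Figure 1(C) statistic.
# Assumes `rouge_papers` from the earlier loading sketch, and that the
# `packages` and `software_error` fields behave as described in Table 1.
citing = [p for p in rouge_papers if p.get("packages")]
errors = [p for p in citing if p.get("software_error")]
if citing:
    print(f"{100 * len(errors) / len(citing):.0f}% of package-citing papers")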
Figure 2: Rouge Evaluation Diagram
This figure is a simple TikZ diagram that demonstrates a Rouge evaluation. There are no numbers to reproduce.
In [3]: print("Done!")
Done!
Figure 3: Reproducibility and Correctness Plots
This dual figure expands on the Figure 1 statistics by showing how Rouge reproducibility and correctness have changed over time since ROUGE-1.5.5 was introduced in 2004.
In [4]: generate_historical_plot()
The output is two PDF files (correctness.pdf and reproducibility.pdf) which form the top and bottom halves of Figure 3 in Rogue Scores; open them to compare against the published figure.
Figure 4: Systematic Review Procedure
This figure is based on the systematic review dataset and its processing pipeline metadata.
In [5]: generate_process_figure()
The output is the raw information underlying the figure, which can be plugged into the appropriate places to create the Rogue Scores TikZ diagram.
Overall Citations Collected
===========================
ACL Citations:           70676
DBLP Citations:          40013
                    ----------
Total Citations:        110689

Download and Extract Text
=========================
Before 2002:              6976
Paper Inaccessible:       3101
Extraction Errors:          30
                    ----------
Citations Excluded:      10107

Full-Text Machine Learning Papers
=================================
100582

Screen Papers for ROUGE
=======================
Automated Rules:         96861
Manual Review:             887
                    ----------
Papers Excluded:         97748

ROUGE Papers Included in Review
===============================
2834

Screen Code for ROUGE
=====================
Code Unavailable:         1697
Linking Errors:            306
                    ----------
Codebases Excluded:       2003

ROUGE Codebases Included in Review
==================================
831
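If you want to spot-check a few of these counts against the dataset itself, a quick illustrative sketch (continuing from the earlier loading sketch, and assuming the released file includes excluded entries and that the error fields are booleans) might look like:

# Illustrative sketch only: tally a few pipeline counts from the metadata fields.
download_errors = sum(bool(p.get("error_download")) for p in papers)
extract_errors  = sum(bool(p.get("error_extract")) for p in papers)
rouge_included  = sum(bool(p.get("paper_rouge")) for p in papers)
print("Download errors:", download_errors)
print("Extraction errors:", extract_errors)
print("ROUGE papers included in review:", rouge_included)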
Table 1: Rouge Comparability Experiments
This table lists ROUGE-1.5.5 configuration differences, highlighting the importance of reporting exact evaluation configurations.
In [6]: generate_configs_table()
Generating Table 1 involves running the ROUGE-1.5.5 package 11 times (for the 1 baseline configuration and 10 comparisons); each run takes a couple of minutes. The results are automatically cached after the first run, so subsequent runs will be faster.
You will see something like this for 10-30 minutes:
Protocol Experiments:  18%|████      | 2/11 [11:00<22:00, 120.00s/it]
[RUNNING] run_configs_period_sentence_splits
[RUNNING] run_configs_no_sentence_splits
[RUNNING] run_configs_apply_stemming
[RUNNING] run_configs_baseline
[RUNNING] run_configs_truncate_75_bytes
[RUNNING] run_configs_remove_stopwords
[RUNNING] run_configs_nltk_sentence_splits
[RUNNING] run_configs_nltk_tokenize
[RUNNING] run_configs_truncate_100_words
[RUNNING] run_configs_fscore_beta12
[RUNNING] run_configs_misreport_recall
... full output omitted ...
When complete, the following results table will be printed. 2
=================================================
 Rogue Scores Table 1: Comparability Experiments
                         ROUGE-1  ROUGE-2  ROUGE-L
apply_stemming              1.68     0.54     1.31
remove_stopwords           -2.21    -0.58    -0.99
no_sentence_splits          0.00     0.00   -11.17
period_sentence_splits      0.00     0.00    -3.44
nltk_sentence_splits        0.00     0.00    -0.16
nltk_tokenize              -0.00     0.00    -0.00
fscore_beta12               1.33     0.61     1.21
misreport_recall           10.88     5.00     9.92
truncate_75_bytes         -27.92   -12.93   -33.44
truncate_100_words         -0.07    -0.05    -0.07
=================================================
Rogue Scores slightly reorders this table and adds sections (e.g., “Preprocessing”). There are slight differences for all nltk_tokenize results, reported as “<0.01” in Rogue Scores. Because sentence splits only affect ROUGE-L, the sentence splitting configurations change scores only in the ROUGE-L column.
Table 2: Rouge Correctness Experiments
This table tests nonstandard Rouge packages against the ROUGE-1.5.5 reference implementation, recording how frequently they compute incorrect scores.
In [7]: generate_packages_table()
This table will also take a long time to generate.
Correctness Experiments:  49%|██████████          | 18/37 [3:00:00<3:00:00, 580.00s/it]
[RUNNING] run_packages_baseline_nostem
[RUNNING] run_packages_baseline_stem
[RUNNING] run_packages_andersjo_pyrouge_nostem
[RUNNING] run_packages_andersjo_pyrouge_stem
[RUNNING] run_packages_bheinzerling_pyrouge_nostem
[RUNNING] run_packages_bheinzerling_pyrouge_stem
[RUNNING] run_packages_chakkiworks_sumeval_nostem
[RUNNING] run_packages_chakkiworks_sumeval_stem
[RUNNING] run_packages_chakkiworks_sumeval_stopwords_nostem
[RUNNING] run_packages_chakkiworks_sumeval_stopwords_stem
[RUNNING] run_packages_danieldeutsch_sacrerouge_wrapper_nostem
[RUNNING] run_packages_danieldeutsch_sacrerouge_wrapper_stem
[RUNNING] run_packages_danieldeutsch_sacrerouge_reimplementation_nostem
[RUNNING] run_packages_danieldeutsch_sacrerouge_reimplementation_stem
[RUNNING] run_packages_diego999_pyrouge_nostem
[RUNNING] run_packages_diego999_pyrouge_stem
[RUNNING] run_packages_google_rougescore_nostem
[RUNNING] run_packages_google_rougescore_stem
[RUNNING] run_packages_google_rougescore_lsum_nostem
[RUNNING] run_packages_google_rougescore_lsum_stem
[RUNNING] run_packages_google_seq2seq_nostem
[RUNNING] run_packages_kavgan_rouge20_nostem
[RUNNING] run_packages_kavgan_rouge20_stem
[RUNNING] run_packages_kavgan_rouge20_stopwords_nostem
[RUNNING] run_packages_kavgan_rouge20_stopwords_stem
[RUNNING] run_packages_liplus_rougemetric_wrapper_nostem
[RUNNING] run_packages_liplus_rougemetric_wrapper_stem
[RUNNING] run_packages_liplus_rougemetric_reimplementation_nostem
[RUNNING] run_packages_neuraldialoguemetrics_easyrouge_nostem
[RUNNING] run_packages_pltrdy_files2rouge_nostem
[RUNNING] run_packages_pltrdy_files2rouge_stem
[RUNNING] run_packages_pltrdy_pyrouge_nostem
[RUNNING] run_packages_pltrdy_pyrouge_stem
[RUNNING] run_packages_pltrdy_rouge_nostem
[RUNNING] run_packages_tagucci_pythonrouge_nostem
[RUNNING] run_packages_tagucci_pythonrouge_stem
[RUNNING] run_packages_tylin_cococaption_nostem
... full output omitted ...
This version of Table 2 is presented in alphabetical order by package name. In Rogue Scores, this table has been reordered (to distinguish between wrappers and reimplementations).
Additionally, some packages do not implement Porter stemming. These packages are not evaluated using stemming in Table 2, which is why they are missing entries. For example, tylin_cococaption_stem is missing because this package does not stem; this is indicated with a dash in Rogue Scores Table 2. Other packages (or configurations) only support certain Rouge variants, which appear as NaN in Table 2. For example, google_rougescore_lsum_nostem evaluates an alternative version of ROUGE-L, so its ROUGE-1 and ROUGE-2 entries are NaN.
===========================================================================
 Rogue Scores Table 2: Correctness Experiments
                                                   ROUGE-1  ROUGE-2  ROUGE-L
andersjo_pyrouge_nostem                              100.0    100.0    100.0
andersjo_pyrouge_stem                                100.0    100.0    100.0
bheinzerling_pyrouge_nostem                           46.0     28.0     56.0
bheinzerling_pyrouge_stem                              0.0      0.0      0.0
chakkiworks_sumeval_nostem                            98.0     97.0    100.0
chakkiworks_sumeval_stem                              98.0     97.0    100.0
chakkiworks_sumeval_stopwords_nostem                   0.0      0.0     97.0
chakkiworks_sumeval_stopwords_stem                    73.0     61.0     99.0
danieldeutsch_sacrerouge_reimplementation_nostem       0.0      0.0     97.0
danieldeutsch_sacrerouge_reimplementation_stem         0.0      0.0     98.0
danieldeutsch_sacrerouge_wrapper_nostem                0.0      0.0      0.0
danieldeutsch_sacrerouge_wrapper_stem                  0.0      0.0      0.0
diego999_pyrouge_nostem                                4.0      4.0      4.0
diego999_pyrouge_stem                                  4.0      4.0      4.0
google_rougescore_lsum_nostem                          NaN      NaN      0.0
google_rougescore_lsum_stem                            NaN      NaN     19.0
google_rougescore_nostem                               0.0      0.0     97.0
google_rougescore_stem                                14.0      6.0     98.0
google_seq2seq_nostem                                 98.0     97.0    100.0
kavgan_rouge20_nostem                                 98.0     97.0    100.0
kavgan_rouge20_stem                                   98.0     97.0    100.0
kavgan_rouge20_stopwords_nostem                       93.0     97.0    100.0
kavgan_rouge20_stopwords_stem                         94.0     97.0    100.0
liplus_rougemetric_reimplementation_nostem            97.0     95.0     99.0
liplus_rougemetric_wrapper_nostem                      0.0      0.0      0.0
liplus_rougemetric_wrapper_stem                       13.0      6.0     18.0
neuraldialoguemetrics_easyrouge_nostem                98.0     97.0    100.0
pltrdy_files2rouge_nostem                              0.0      0.0     83.0
pltrdy_files2rouge_stem                               13.0      6.0     86.0
pltrdy_pyrouge_nostem                                  0.0      0.0      0.0
pltrdy_pyrouge_stem                                    0.0      0.0      0.0
pltrdy_rouge_nostem                                   98.0     96.0    100.0
tagucci_pythonrouge_nostem                           100.0    100.0     84.0
tagucci_pythonrouge_stem                             100.0    100.0     86.0
tylin_cococaption_nostem                               NaN      NaN    100.0
===========================================================================
Many of these packages have configuration options that influence Rouge scores.
Unfortunately, this makes assessing package correctness more difficult. For example, some packages have subtle defaults that authors may not notice (e.g., removal of stopwords by default, default stemming, default no stemming). Should packages be evaluated based on their unusual defaults? Or, should we assume authors will notice and correct them?
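For instance, the google_rougescore package evaluated in Table 2 appears to correspond to the google-research rouge_score package, which defaults to no stemming. Here is a minimal standalone sketch of how that one default changes scores; it uses the PyPI rouge-score package directly and is not the wrapper used by this code release (see rouge_package_google_rougescore?? for that):

# Illustrative sketch only (pip install rouge-score); not the code release wrapper.
from rouge_score import rouge_scorer

reference = "the cats sat quietly on the mats"
candidate = "the cat sat quietly on the mat"

for use_stemmer in (False, True):  # the package default is use_stemmer=False
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=use_stemmer)
    scores = scorer.score(reference, candidate)
    print(f"use_stemmer={use_stemmer}: "
          f"ROUGE-1 F1 = {scores['rouge1'].fmeasure:.3f}, "
          f"ROUGE-L F1 = {scores['rougeL'].fmeasure:.3f}")

Whether such a package should be scored with its default or with stemming enabled is exactly the kind of judgment call described above.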
This means there is some guesswork involved in picking which package configurations should be evaluated: what is the most likely way (or ways) that authors would configure this package? The code provides the exact details of how each implementation was evaluated. For example, let’s investigate how kavgan_rouge20 was evaluated.
To see a specific package experiment, view run_packages_{package}_{config}:
In [8]: run_packages_kavgan_rouge20_stem??
Or, to see an entire package wrapper implementation, view rouge_package_{package}:
In [9]: rouge_package_kavgan_rouge20??
Table 3: Rogue-3 “Model” Case Study
This table contains an evaluation of a baseline model (Lead-3) using a carefully chosen nonstandard Rouge evaluation protocol, reported as the “Rogue-3” model.
In [10]: generate_models_table()
This table could easily take several hours to generate: each of the two case study evaluations takes well over an hour to run.
Case Study Experiments: 100%|████████████████████| 2/2 [3:30:00<00:00, 6300.00s/it]
[RUNNING] run_models_lead_3
[RUNNING] run_models_rogue_3
... full output omitted ...
Once complete, the code will print a model evaluation table containing only Lead-3 and Rogue-3; the results for the other models shown in the paper’s Table 3 were copied from their respective papers.
==========================================
 Rogue Scores Table 3: Rogue-3 Case Study
            ROUGE-1  ROUGE-2  ROUGE-L
lead_3        40.34    17.55    36.58
rogue_3       73.89    55.80    73.89
==========================================
To see exactly how the “state of the art” Rogue-3 model was evaluated, try viewing:
In [11]: run_models_rogue_3??
Figure 5: Rouge Package Code Excerpt
The code excerpt is found on GitHub. The $ \beta = 1.2 $ error occurs on line 43.
In Rogue Scores, code comments were condensed and reformatted for presentation.
# Description: Computes ROUGE-L metric
# as described by Lin and Hovey (2004)

class Rouge():
    '''Class for computing ROUGE-L score for a set of
    candidate sentences for the MS COCO test set'''

    def __init__(self):
        # updated the value below
        # based on discussion with Hovey
        self.beta = 1.2
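To get intuition for why $ \beta = 1.2 $ changes scores (the fscore_beta12 row in Table 1 above), recall that ROUGE-L reports an F-measure $ F_\beta = (1 + \beta^2) P R / (R + \beta^2 P) $, which weights recall more heavily as $ \beta $ grows. A quick illustrative sketch with made-up precision and recall values:

# Illustrative sketch only: how the F-measure beta parameter shifts scores.
def rouge_fscore(precision, recall, beta=1.0):
    # F_beta = (1 + beta^2) * P * R / (R + beta^2 * P)
    if precision == 0 or recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

p, r = 0.40, 0.50                               # made-up precision/recall values
print(round(rouge_fscore(p, r, beta=1.0), 4))   # 0.4444, the balanced F1
print(round(rouge_fscore(p, r, beta=1.2), 4))   # 0.4535, slightly recall-weighted

When recall exceeds precision, as it does here, the $ \beta = 1.2 $ score comes out slightly higher than the balanced F1.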
1. If you don’t have make, you probably know why.

2. Note: A recent rerun of the code produced a slightly different result. This does not meaningfully change any of the findings of this table (the difference is one hundredth of a Rouge point), but is being investigated. It may be the result of different versions of dependencies being used in the code release.

=================================================
 Rogue Scores Table 1: Comparability Experiments
                         ROUGE-1  ROUGE-2  ROUGE-L
apply_stemming              1.68     0.54     1.31
remove_stopwords           -2.20    -0.58    -0.99
no_sentence_splits          0.00     0.00   -11.17
period_sentence_splits      0.00     0.00    -3.44
nltk_sentence_splits        0.00     0.00    -0.16
nltk_tokenize              -0.00     0.00    -0.00
fscore_beta12               1.33     0.61     1.21
misreport_recall           10.88     5.00     9.92
truncate_75_bytes         -27.91   -12.93   -33.43   <- different results
                          -27.92   -12.93   -33.44   <- numbers in paper
truncate_100_words         -0.07    -0.05    -0.07
=================================================