The metrics used to benchmark AI and machine learning models often fail to reflect those models' real performance. That is the conclusion of a preprint study by researchers at the Institute for Artificial Intelligence and Decision Support in Vienna, who analyzed data from over 3,000 model performance results on the web-based open source platform Papers with Code. They report that alternative, more appropriate metrics are rarely used in benchmarking, and that metric reporting is inconsistent and underspecified, creating confusion.
Benchmarking is an important driver of progress in AI research. A task (or tasks) and the metric (or metrics) associated with it can be seen as an abstraction of a problem the scientific community is trying to solve. Benchmark datasets are designed as fixed, representative samples of the tasks a model is meant to solve. While benchmarks have been established for a range of tasks, including machine translation, object recognition, and question answering, the paper argues that the metrics reported for them often paint an incomplete or misleading picture of model performance.
In their analysis, the researchers examined 32,209 benchmark results across 2,298 datasets from 3,867 publications published between 2000 and June 2020. They found that the studies used a total of 187 distinct top-level metrics, and that the most commonly used metric, accuracy, appeared in 38% of the benchmark datasets. The second and third most frequently reported metrics were "precision," the proportion of relevant instances among the retrieved instances, and "F-measure," the harmonic mean of precision and recall (recall being the proportion of all relevant instances that were actually retrieved). For the subset of papers dealing with natural language processing, the top three metrics were the BLEU score (used for tasks like summarization and text generation), the ROUGE metrics (video captioning and summarization), and METEOR (question answering).
For more than three-quarters (77.2%) of the benchmark datasets analyzed, only a single performance metric was reported, according to the researchers. A smaller fraction (14.4%) of the benchmark datasets had two top-level metrics, and 6% had three.
The researchers also note irregularities in how metrics are reported, such as referring simply to "area under the curve," or "AUC." AUC is a measure of accuracy that can be interpreted in different ways depending on what is plotted: precision against recall (PR-AUC), or the true positive rate against the false positive rate (ROC-AUC). Similarly, several papers referenced the natural language processing benchmark ROUGE without specifying the variant used. ROUGE has precision- and recall-oriented sub-variants, and while the recall sub-variant is more common, the ambiguity can create confusion when comparing results across papers, the researchers argue.
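To see why a bare "AUC" is ambiguous, here is a small illustrative sketch (not from the paper, with made-up data): on the same imbalanced ranking, ROC-AUC and PR-AUC (approximated here as average precision) can tell very different stories.

```python
def roc_auc(labels, scores):
    # ROC-AUC as the probability that a random positive outranks a random negative
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    # PR-AUC approximated as average precision over the ranked list
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, ap = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            ap += hits / k
    return ap / sum(labels)

# One positive among ten examples; the positive is ranked second.
y = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

print(round(roc_auc(y, s), 3))            # 0.889 -- looks strong
print(round(average_precision(y, s), 3))  # 0.5   -- much less flattering
```

With the same predictions, ROC-AUC suggests a strong model while PR-AUC does not, which is why papers that report "AUC" without qualification are hard to compare.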
Inconsistencies aside, many of the metrics used in the papers studied are problematic, the researchers say. Accuracy, often used to evaluate binary and multi-class classifiers, is uninformative on unbalanced corpora with large differences in the number of instances per class: if a classifier predicts the majority class in all cases, its accuracy equals that class's share of the total. For example, if a particular "Class A" makes up 95% of all instances, a classifier that constantly predicts "Class A" has an accuracy of 95%.
Precision and recall have limitations as well, in that they focus only on instances a classifier predicted as positive, or only on true positives (correct positive predictions); both ignore a model's ability to correctly predict negative cases. F-scores combine precision and recall, sometimes weighting one more than the other, and can yield misleading results for classifiers biased toward the majority class. In addition, they capture performance on only one class at a time.
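Continuing the same toy setup (synthetic data, not from the paper), precision, recall, and F1 all flatter the degenerate majority-class predictor because none of them looks at the negatives:

```python
labels = [1] * 95 + [0] * 5   # 1 = majority class, treated as "positive"
preds = [1] * 100             # always predict the majority class

tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))

precision = tp / (tp + fp)    # 0.95
recall = tp / (tp + fn)       # 1.0
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))           # 0.974 -- yet the model never predicts a negative
```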
In natural language processing, the researchers highlight problems with benchmarks such as BLEU and ROUGE. BLEU does not take recall into account and does not correlate well with human judgments of machine translation quality, while ROUGE is poorly suited to tasks that rely on extensive paraphrasing, such as abstractive summarization, or to the extractive summarization of transcripts with many different speakers, e.g. meeting minutes.
The researchers found that better metric alternatives, such as the Matthews correlation coefficient and the Fowlkes-Mallows index, which address some of the shortcomings of the accuracy and F-score metrics, were not used in any of the work they analyzed. In fact, in 83.1% of the benchmark datasets that reported accuracy as a top-level metric, no other top-level metrics appeared at all, and F-measure was the sole metric in 60.9% of the datasets. The same held for the natural language processing metrics: METEOR, which has been shown to correlate strongly with human judgment across tasks, was used only 13 times, and GLEU, which assesses how well generated text conforms to "normal" language use, was used only three times.
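As a hedged sketch of why the Matthews correlation coefficient is a useful alternative (using its standard textbook formula and the same synthetic 95/5 split as above, not data from the study): MCC stays at zero for the degenerate majority-class predictor that accuracy and F1 both flatter.

```python
import math

def mcc(labels, preds):
    # Matthews correlation coefficient from the binary confusion matrix;
    # by convention, returns 0.0 when the denominator is zero.
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

labels = [1] * 95 + [0] * 5
print(mcc(labels, [1] * 100))  # 0.0 -- the degenerate predictor is exposed
print(mcc(labels, labels))     # 1.0 -- a perfect predictor
```

Unlike accuracy and F1, MCC uses all four cells of the confusion matrix, so it cannot be inflated by ignoring the minority class.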
The researchers acknowledge that their decision to analyze preprints, rather than articles accepted by scientific journals, could skew the results of their study. Still, they stand by their conclusion that the majority of metrics currently used to evaluate AI benchmarking tasks have properties that can result in an inadequate reflection of a classifier's performance, especially on unbalanced datasets. "Although alternative metrics that address problematic properties have been proposed, they are rarely applied as performance metrics in benchmarking tasks, where a small set of historically used metrics dominates instead. NLP-specific tasks pose additional challenges for metric design due to linguistic and task-specific complexity," the researchers write.
A growing number of scientists are calling for a focus on scientific progress in AI rather than on ever-better benchmark performance. In a June interview, Denny Britz, a former member of the Google Brain team, said that chasing state of the art is a bad practice because there are too many confounding variables, and that it favors large, well-funded labs like DeepMind and OpenAI. Separately, Zachary Lipton (an assistant professor at Carnegie Mellon University) and Jacob Steinhardt (a member of the statistics department at the University of California, Berkeley) argued in a recent meta-analysis that AI researchers should examine how and why an approach works rather than just its performance, and should conduct more error analyses, ablation studies, and robustness checks as research progresses.