Qualitatively Evaluating Large Language Models
Why evaluation is hard and a new approach to interpretable evaluations
Choosing the right language model to use can be a daunting task. Traditional evaluation methods, while useful, often fall short in providing a comprehensive understanding of a model's capabilities. In this post, we'll explore the limitations of current evaluation techniques and introduce a novel approach: qualitative Report Cards for language models.
Pitfalls of existing approaches
At first glance, platforms like ChatbotArena seem to offer a solution. ChatbotArena allows users to compare two anonymous models by asking questions and voting for the better answer. An Elo-rating system then generates a leaderboard based on these votes. However, this approach is vulnerable to Goodhart's Law, which states, "When a measure becomes a target, it ceases to be a good measure."
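For intuition, here is a minimal sketch of how an Elo-style rating can be updated from pairwise votes. This illustrates the general mechanism rather than ChatbotArena's exact implementation; the K-factor and the starting rating of 1000 are assumptions.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_elo(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after a single pairwise vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))


# Example: both models start at 1000 and a user votes for model A.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = update_elo(
    ratings["model_a"], ratings["model_b"], a_won=True
)
```

Because the rating reflects only which answer users preferred, anything that reliably wins votes, such as style or compliance, raises a model's score. That is exactly the Goodhart's Law concern.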
As Nathan Lambert aptly summarizes:
For most of its existence, people correlated the general capabilities tested in ChatbotArena with a definitive ranking of which models can do the hardest things for me.
This is not true. The ChatbotArena ranking benchmark shows the strongest correlations with: certain stylistic outputs from language models, and language models that have high rates of complying with user requests.
Any objective which can be specified can become optimized, and companies were good and became better at tuning their models to maximize their scores on pairwise comparisons on ChatbotArena.
Moving away from human-rated evaluations, another common approach relies on summary statistics such as validation set accuracy. These statistics are concise, easily comparable, and are what leaderboards such as the Holistic Evaluation of Language Models (HELM) commonly report. One challenge with these scores is that they can be hard to interpret: what does 80% accuracy on a high school math dataset really mean in terms of practical capabilities?
Recent research has shown that model performance can vary significantly across different types of math problems. For example, Liu et al. (2024) found that some models performed better on theoretical versus applied math problems, and there were tradeoffs when assessing math abilities in a bilingual context.
Introducing Qualitative Report Cards
To address these limitations, our recent paper, which will be featured as a spotlight at the NeurIPS Socially Responsible Language Modeling workshop, introduces a novel framework for evaluating large language models (LLMs).
Our approach aims to bridge the gap between easily comparable metrics and comprehensive qualitative assessments. We propose automated Report Cards that provide interpretable, detailed insights into a model's capabilities.
We wanted our evaluations to achieve three objectives:
Specificity: A Report Card should accurately describe unique aspects of model behavior, so that it may be used to distinguish between models.
Faithfulness: The specific behaviors described by a Report Card, taken as a whole, should accurately capture the model’s overall capability with respect to the skill it describes.
Interpretability: A Report Card should be relevant, informative, and clear to humans.
We posit that these are necessary properties that any good qualitative report should satisfy. The first two properties can also be measured with minimal supervision.
Evaluating Report Cards
To assess the effectiveness of our Report Cards, we developed several evaluation methods:
Specificity: We test whether it's possible to match a model with its responses based solely on its Report Card (a code sketch of a single matching trial follows this list). This pairwise comparison, depicted below, verifies that the reports capture distinctive characteristics of each model.
Faithfulness: We compare Elo scores generated by pairwise comparisons of Report Cards with Elo scores based on direct model outputs. This checks whether the reports accurately reflect true model capabilities.
Interpretability: We recruited volunteers to rate Report Card excerpts on a 5-point Likert scale, assessing clarity, relevance, and informativeness.
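To make the specificity test concrete, here is a minimal sketch of a single matching trial. The `judge` callable is a hypothetical wrapper around an LLM API call, and the prompt format is an illustrative assumption rather than the exact protocol from the paper.

```python
import random
from typing import Callable, Sequence


def specificity_trial(
    judge: Callable[[str], str],   # hypothetical wrapper around an LLM API call
    report_card: str,              # Report Card describing model A
    responses_a: Sequence[str],    # responses actually produced by model A
    responses_b: Sequence[str],    # responses from a distractor model B
) -> bool:
    """Run one matching trial: can the judge tell which set of responses
    belongs to the model described by the Report Card?"""
    # Shuffle presentation order so the correct answer is not always "A".
    candidates = [responses_a, responses_b]
    random.shuffle(candidates)
    labels = ["A", "B"]
    correct_label = labels[next(i for i, c in enumerate(candidates) if c is responses_a)]

    prompt = (
        "Report Card:\n" + report_card + "\n\n"
        + "\n\n".join(
            f"Candidate {lbl}:\n" + "\n".join(resp)
            for lbl, resp in zip(labels, candidates)
        )
        + "\n\nWhich candidate was produced by the model described above? Answer 'A' or 'B'."
    )
    return judge(prompt).strip().upper().startswith(correct_label)
```

Averaging correct guesses over many question sets and model pairs yields a matching accuracy: a specific Report Card makes this task easy, while a generic one leaves the judge at chance.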
Our results show that the Report Cards perform strongly on all three metrics, outperforming baseline methods. Importantly, they tend to capture substantive properties of model behavior rather than superficial stylistic features.
Generating Report Cards
We've also developed an initial algorithm for generating these Report Cards. The process works as follows:
A "teacher" LLM summarizes a student model's answers to an initial set of questions, creating a draft Report Card. We found that the teacher had to have a certain level of capability and primarily used Claude-3.5 Sonnet.
As more questions are processed, the teacher updates the report to incorporate new insights about the model's capabilities.
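Here is a minimal sketch of the iterative loop. The `student_answer` and `teacher_update` helpers are hypothetical wrappers around the underlying LLM calls, and the batch size is an arbitrary choice; the exact prompting details are described in the paper.

```python
from typing import Callable, Sequence


def generate_report_card(
    questions: Sequence[str],
    student_answer: Callable[[str], str],                         # hypothetical: query the student model
    teacher_update: Callable[[str, list[tuple[str, str]]], str],  # hypothetical: teacher LLM revises the draft
    batch_size: int = 5,
) -> str:
    """Iteratively build a Report Card: summarize an initial batch of
    question/answer pairs, then refine the draft as later batches reveal
    new strengths and weaknesses."""
    report_card = ""  # empty draft before any answers have been seen
    for start in range(0, len(questions), batch_size):
        batch = questions[start:start + batch_size]
        qa_pairs = [(q, student_answer(q)) for q in batch]
        # The teacher sees the current draft plus the new Q/A pairs
        # and returns a revised Report Card.
        report_card = teacher_update(report_card, qa_pairs)
    return report_card
```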
This iterative approach allows the report to capture nuances that might be missed in a single-pass evaluation. This method can be applied to various datasets, including knowledge-based school subjects and those assessing potential model safety risks:
Qualitative examples show that Report Cards can offer insights on unseen test examples. Here, the Report Card correctly identifies a weakness in reasoning about combinatorial concepts, which matches the model's behavior on an unseen test question.
To learn more about Report Cards, we invite you to read the paper or explore the model dashboard.
This is a first step, and there are important limitations to our work. Our experiments are limited to specific topics and datasets; future work could apply Report Cards to a wider range of domains, including open-ended tasks like reasoning, creative writing, and emotional understanding. LLM-based evaluations also carry potential biases, which remain an important challenge for fair and comprehensive assessment. Future research could explore methods to mitigate these biases and investigate how Report Cards could be used in model development and selection processes.
As language models continue to advance, the need for comprehensive, interpretable evaluation methods becomes increasingly crucial. Our Report Card approach is a step towards more holistic LLM assessment, providing practitioners with insights beyond simple leaderboard rankings.
This work would not have been possible without the hard work and dedication of our entire team: Blair Yang, Fuyang Cui, Keiran Paster, Jimmy Ba, Pashootan Vaezipoor, and Silviu Pitis.