Benchmarking Overton Pluralism in LLMs

TL;DR

Models achieve OvertonScores of 0.35–0.41 on average; DeepSeek V3 performs best, but all models fall far short of the theoretical maximum of 1.0, revealing substantial room for improvement.
OvertonScore formalizes pluralistic coverage as the fraction of opinion clusters — derived from a large-scale U.S.-representative study (N = 1,208) — that a model's response represents.
An automated LLM judge achieves ρ = 0.88 rank correlation with human OvertonScores, enabling scalable evaluation without repeated human studies.
Dataset, code, and benchmark are publicly available to support progress toward more pluralistic LLMs.

Abstract

Overview of the OvertonBench pipeline: LLM responses are rated by diverse human participants, opinions are clustered, and OvertonScore measures how many clusters a response covers.

Figure 1. OvertonBench pipeline. Prolific participants rate LLM responses on how well they represent their viewpoint. Ratings are clustered into opinion groups, and OvertonScore measures what fraction of clusters a response covers above a threshold τ.

Results

Dataset

Scoring

#	Model	OvertonScore ↓	Adj. Coverage	95% CI (vs. grand mean)	p-value

τ = 4.0. Adj. Coverage = OLS-adjusted mean coverage with question fixed effects and cluster-robust SEs. The 95% CI bar shows each model's deviation from the grand mean; bars fully to the right (blue) or left (red) of zero indicate significance. Bold p-values are significant at α = 0.05.

Citation

@inproceedings{poole-dayan2026benchmarking,
  author    = {Poole-Dayan, Elinor and Wu, Jiayi and Sorensen, Taylor
               and Pei, Jiaxin and Bakker, Michiel A.},
  title     = {Benchmarking Overton Pluralism in {LLMs}},
  booktitle = {The Fourteenth International Conference on Learning
               Representations (ICLR)},
  year      = {2026},
  month     = apr,
  url       = {https://arxiv.org/abs/2512.01351}
}