Figure 1. OvertonBench pipeline. Prolific participants rate LLM responses on how well they represent their viewpoint. Ratings are clustered into opinion groups, and OvertonScore measures what fraction of clusters a response covers above a threshold τ.
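The coverage idea in the caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `coverage`, the 1-to-5 rating scale, and the rule that a cluster counts as covered when its mean rating is at least τ are all assumptions made for illustration.

```python
from collections import defaultdict


def coverage(ratings, clusters, tau=4.0):
    """Fraction of opinion clusters whose mean rating of one response is >= tau.

    ratings:  participant id -> rating of the response (assumed 1-5 scale)
    clusters: participant id -> opinion-cluster label
    The "mean rating >= tau" rule is an illustrative assumption.
    """
    by_cluster = defaultdict(list)
    for participant, rating in ratings.items():
        by_cluster[clusters[participant]].append(rating)
    if not by_cluster:
        raise ValueError("no ratings supplied")
    covered = sum(1 for rs in by_cluster.values() if sum(rs) / len(rs) >= tau)
    return covered / len(by_cluster)


# Hypothetical data: five participants in three opinion clusters.
ratings = {"p1": 5, "p2": 4, "p3": 2, "p4": 3, "p5": 5}
clusters = {"p1": "A", "p2": "A", "p3": "B", "p4": "B", "p5": "C"}
score = coverage(ratings, clusters)  # clusters A and C are covered, B is not: 2/3
```

Averaging this per-response value over questions would give one number per model, though the exact aggregation used for the leaderboard is not specified here.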
| # | Model | OvertonScore ↓ | Adj. Coverage | 95% CI (vs. grand mean) | p-value |
|---|---|---|---|---|---|
τ = 4.0. Adj. Coverage = OLS-adjusted mean coverage with question fixed effects and cluster-robust SEs. The 95% CI bar shows each model's deviation from the grand mean; bars fully to the right (blue) or left (red) of zero indicate significance. Bold p-values are significant at α = 0.05.
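To make the adjustment in the note concrete, the sketch below estimates model effects net of question fixed effects by alternating least-squares updates, which converge to the OLS fit of coverage on model and question dummies. It is a simplified illustration under stated assumptions: the function name `adjusted_deviations` is hypothetical, the grand mean is taken as the unweighted mean of the model effects, and cluster-robust standard errors (and therefore CIs and p-values) are not computed.

```python
from collections import defaultdict


def adjusted_deviations(rows, n_iter=500):
    """Model effects net of question fixed effects, centred on their mean.

    rows: iterable of (model, question, coverage) observations.
    Alternates between updating model effects and question effects, which
    converges to the OLS solution for coverage ~ model + question.
    Standard errors are deliberately omitted from this sketch.
    """
    rows = list(rows)
    alpha = defaultdict(float)  # model effects
    beta = defaultdict(float)   # question fixed effects
    for _ in range(n_iter):
        # Update model effects given the current question effects.
        sums, counts = defaultdict(float), defaultdict(int)
        for m, q, y in rows:
            sums[m] += y - beta[q]
            counts[m] += 1
        for m in sums:
            alpha[m] = sums[m] / counts[m]
        # Update question effects given the current model effects.
        sums, counts = defaultdict(float), defaultdict(int)
        for m, q, y in rows:
            sums[q] += y - alpha[m]
            counts[q] += 1
        for q in sums:
            beta[q] = sums[q] / counts[q]
    centre = sum(alpha.values()) / len(alpha)  # assumed definition of the grand mean
    return {m: a - centre for m, a in alpha.items()}


# Hypothetical, unbalanced data: model m3 has no observation for question q2.
true_model = {"m1": 0.5, "m2": 0.3, "m3": 0.1}
true_question = {"q1": 0.0, "q2": 0.2, "q3": -0.1}
rows = [
    (m, q, a + b)
    for m, a in true_model.items()
    for q, b in true_question.items()
    if (m, q) != ("m3", "q2")
]
deviations = adjusted_deviations(rows)
```

Because the question effects are estimated jointly, a model is not penalised for having been evaluated on a harder subset of questions, which is the point of the fixed-effects adjustment.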
@inproceedings{poole-dayan2026benchmarking,
author = {Poole-Dayan, Elinor and Wu, Jiayi and Sorensen, Taylor
and Pei, Jiaxin and Bakker, Michiel A.},
title = {Benchmarking Overton Pluralism in {LLMs}},
booktitle = {The Fourteenth International Conference on Learning
Representations (ICLR)},
year = {2026},
month = apr,
url = {https://arxiv.org/abs/2512.01351}
}
Questions or feedback? Reach out to Elinor Poole-Dayan.