ICLR 2026

Benchmarking Overton Pluralism in LLMs

Elinor Poole-Dayan¹ Jiayi Wu¹,² Taylor Sorensen³ Jiaxin Pei⁴ Michiel A. Bakker¹

¹MIT ²Brown University ³University of Washington ⁴Stanford University

8 Models Assessed

60 Questions

28,992 Human Ratings

Abstract

Overview of the OvertonBench pipeline: LLM responses are rated by diverse human participants, opinions are clustered, and OvertonScore measures how many clusters a response covers.

Figure 1. OvertonBench pipeline. Prolific participants rate LLM responses on how well they represent their viewpoint. Ratings are clustered into opinion groups, and OvertonScore measures what fraction of clusters a response covers above a threshold τ.
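The coverage computation described in the caption can be sketched for a single response as follows. This is a minimal illustration, not the authors' implementation: it assumes that a cluster counts as "covered" when the mean rating from its members is at least τ, and that the weighted variant weights each cluster by its number of raters. The function name `overton_score` and the input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def overton_score(ratings, clusters, tau=4.0, weighted=False):
    """Fraction of opinion clusters covered by one response.

    ratings:  dict mapping participant id -> rating of the response
    clusters: dict mapping participant id -> opinion-cluster label
    tau:      coverage threshold
    weighted: if True, weight each cluster by its number of raters

    ASSUMPTION: a cluster is "covered" when its mean rating is >= tau.
    """
    by_cluster = defaultdict(list)
    for pid, rating in ratings.items():
        by_cluster[clusters[pid]].append(rating)
    if not by_cluster:
        raise ValueError("no ratings supplied")

    total = 0.0
    covered = 0.0
    for member_ratings in by_cluster.values():
        weight = len(member_ratings) if weighted else 1
        total += weight
        if mean(member_ratings) >= tau:
            covered += weight
    return covered / total


# Toy data (made up for illustration): three clusters, five raters.
toy_ratings = {"p1": 5, "p2": 4, "p3": 2, "p4": 3, "p5": 4}
toy_clusters = {"p1": "A", "p2": "A", "p3": "B", "p4": "B", "p5": "C"}

unweighted = overton_score(toy_ratings, toy_clusters)
size_weighted = overton_score(toy_ratings, toy_clusters, weighted=True)
```

In the toy data, clusters A and C meet the threshold and B does not, so the unweighted score is 2/3 and the size-weighted score is 3/5.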

Results

#	Model	OvertonScore ↓	95% CI	Company

τ = 4.0. OvertonScore = OLS-adjusted fraction of opinion clusters covered (question fixed effects). 95% bootstrap CIs from 1,000 question-level resamples. Unweighted scores range 0.35–0.42; weighted scores (upweighting larger opinion clusters) range 0.43–0.53. Variation across models is modest.
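A question-level bootstrap of the kind mentioned in the note can be sketched as below. This is an illustration under stated assumptions, not the authors' procedure: it resamples per-question scores with replacement and takes a percentile interval of the plain mean, and it omits the OLS adjustment with question fixed effects. The function name `bootstrap_ci` and its parameters are hypothetical.

```python
import random


def bootstrap_ci(per_question_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-question scores.

    Each resample draws len(per_question_scores) questions with replacement.
    ASSUMPTION: plain mean per resample; no OLS adjustment is applied here.
    """
    if not per_question_scores:
        raise ValueError("need at least one question score")

    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(per_question_scores)

    resampled_means = []
    for _ in range(n_resamples):
        sample = rng.choices(per_question_scores, k=n)
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()

    # Indices of the alpha/2 and 1 - alpha/2 percentiles, clamped to range.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return resampled_means[lo_idx], resampled_means[hi_idx]


# Toy per-question scores (made up for illustration).
toy_scores = [0.25, 0.5, 0.5, 0.75, 0.0, 1.0, 0.5, 0.25]
ci_low, ci_high = bootstrap_ci(toy_scores)
```

Resampling whole questions rather than individual ratings keeps all ratings for a question together, so the interval reflects variation across questions.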

Citation

@inproceedings{poole-dayan2026benchmarking,
  author = {Poole-Dayan, Elinor and Wu, Jiayi and Sorensen, Taylor and Pei, Jiaxin and Bakker, Michiel A.},
  title = {Benchmarking Overton Pluralism in LLMs},
  booktitle = {The Fourteenth International Conference on Learning Representations (ICLR)},
  year = {2026},
  month = apr,
  url = {https://arxiv.org/abs/2512.01351}
}

Questions or feedback? Reach out to Elinor Poole-Dayan.