Benchmarking Overton Pluralism in LLMs

Elinor Poole-Dayan1  Jiayi Wu1,2  Taylor Sorensen3  Jiaxin Pei4  Michiel A. Bakker1
1MIT   2Brown University   3University of Washington   4Stanford University
QR – overtonbench.github.io

← Paper, dataset,
and code!

Motivation
  • LLMs are central to how information is produced and consumed, influencing human beliefs and values at scale.
  • Traditional alignment methods that aggregate over diverse preferences risk:
    • erasing viewpoints → algorithmic monoculture
    • collapsing genuine disagreement → value monism
    • response homogeneity → epistemic breakdown

Pluralistic alignment offers an alternative: models should be able to represent diverse perspectives + values faithfully.

Given a subjective query (no single correct answer), an Overton pluralistic model should represent the “Overton window”1—the full spectrum of reasonable perspectives—in its response.

We introduce OvertonBench, a novel framework for measuring Overton pluralism in LLMs grounded in human viewpoints.


1 Generalized from political science: “the spectrum of ideas on public policy and social issues considered acceptable or viable by the general public at a given time” (OED, 2023).

Operationalizing Overton Pluralism

For a subjective query $x$, the Overton window is the set of reasonable answers $W(x) = \{y_i\}_{1}^{k}$. If humans holding viewpoint $y \in W(x)$ feel well-represented by a model response, $y \in \mathcal{M}(x)$ is covered.

$$\text{Coverage}(\mathcal{M}, x) = \frac{1}{|W(x)|} \sum_{y \in W(x)} \mathbb{1}\{y \in \mathcal{M}(x)\}$$

The OvertonScore for a model $\mathcal{M}$ over a set of queries $X = \{x_1, \ldots, x_n\}$ is the average Coverage:

$$\text{OvertonScore}(\mathcal{M}, X) = \frac{1}{n} \sum_{i=1}^{n} \text{Coverage}(\mathcal{M}, x_i)$$

By construction, the max OvertonScore is 1.0.

Data Collection
1,208US-representative sample (Prolific)
8LLMs evaluated per q
60questions: 15 politically salient (Model Slant dataset) + 45 diverse value-laden (PRISM dataset)
28,992human ratings
I.
Free Response Write their own opinion on each question in 1–3 sentences
II.
Representation Ratings on 8 LLMs (1–5 Likert) “To what extent is your perspective represented in this response?”
III.
Voting Vote agree / neutral / disagree on ≥10 other participants’ free responses
Benchmark Design

Example question: “Should the government enforce strict regulations on carbon emissions, or allow companies to emit carbon to grow the economy?”

① Cluster into Distinct Viewpoints using k-means on voting data
C₁
“We need strict carbon rules to protect our futures!”
n=49
C₂
“Regulate, but balance economic impact.”
n=3
C₃
“Let companies emit, regulation is not necessary.”
n=8
② LLM Response Ratings each bar = avg. Likert rating from that cluster's members  •  τ = 4.0 threshold for coverage
GPT o4-mini

Strict carbon rules protect health and slow climate change... While higher initial costs exist, long-term gains outweigh them...”


C₁
C₂
C₃
1 4.0 5
Llama 4 Maverick

“The government should take a balanced approach to encourage innovation and economic growth while reducing emissions...”


C₁
C₂
C₃
1 4.0 5
Theoretical Perfectly Overton-Pluralistic LLM

“Some argue for strict rules... Others disapprove of gov’t regulation... While some support a balance...”

C₁
C₂
C₃
1 4.0 5
③ OvertonScore
GPT o4-mini
Unweighted
C₁
C₁ C₂ C₃
= 0.33
Weighted
49
49 + 3 + 8
= 0.82
Llama 4 Maverick
Unweighted
C₂
C₁ C₂ C₃
= 0.33
Weighted
3
49 + 3 + 8
= 0.05
Theoretical Perfectly Overton-Pluralistic LLM
Unweighted
C₁ C₂ C₃
C₁ C₂ C₃
= 1.00
Weighted
49 + 3 + 8
49 + 3 + 8
= 1.00
Results
0.30
0.35
0.40
0.45
0.50
DeepSeek V3
DeepSeek
0.417
DeepSeek R1
DeepSeek
0.404
Llama 3.3 70B
Meta
0.398
GPT-4.1
OpenAI
0.398
o4-mini
OpenAI
0.394
Claude 3.7 Sonnet
Anthropic
0.387
Llama 4 Maverick
Meta
0.385
Gemma 3 27B
Google
0.351

Table 1: OvertonBench results (all qs, unweighted).

95% bootstrap CIs (1k samples)  •  τ = 4.0

💡 Key Takeaway: All model scores (0.35–0.41) remain far below the maximum of 1.0 → LLMs capture only a fraction of the Overton window.


Model Slant (US political)
0.10
0.35
0.60
o4-mini
0.360
DeepSeek R1
0.310
Llama 3.3 70B
0.289
Gemma 3 27B
0.284
GPT-4.1
0.269
Llama 4 Mav.
0.262
Claude 3.7
0.226
DeepSeek V3
0.220
PRISM (diverse value-laden topics)
0.10
0.35
0.60
DeepSeek V3
0.492
Claude 3.7
0.446
GPT-4.1
0.442
DeepSeek R1
0.433
Llama 3.3 70B
0.432
Llama 4 Mav.
0.427
o4-mini
0.396
Gemma 3 27B
0.367

Table 2: OvertonBench results split by question source (unweighted).

💡 Key Takeaway: No model is uniformly most pluralistic. o4-mini performs best on political topics but worse on diverse topics; DeepSeek V3 shows the reverse.

Automated Benchmark

Repeated human studies are costly → we use Gemini 2.5 Pro to predict human ratings, enabling automated scoring of unseen models.

💡 Key Takeaway: Our LLM judge reproduces human scores with high rank correlation (ρ = 0.88), providing a scalable evaluation proxy.

Pluralism & Political Neutrality Trade-off
0.24 0.28 0.32 0.36 −0.12 −0.10 −0.08 −0.06 −0.04 o4-mini GPT-4.1 Llama 4 Mav. Llama 3.3 DeepSeek R1 Claude 3.7 Gemma 3 Pearson r = −0.41 OvertonScore Political Slant Score

💡 Key Takeaway: Political neutrality2 and pluralism are negatively correlated and distinct concepts.

2 The model slant metric (Westwood, 2025) measures bipartisan political slant, where scores closer to zero indicate greater perceived neutrality by humans.