Benchmarking Overton Pluralism in LLMs

Elinor Poole-Dayan¹ Jiayi Wu^1,2 Taylor Sorensen³ Jiaxin Pei⁴ Michiel A. Bakker¹

¹MIT ²Brown University ³University of Washington ⁴Stanford University

← Paper, dataset,
and code!

Motivation

LLMs are central to how information is produced and consumed, influencing human beliefs and values at scale.
Traditional alignment methods that aggregate over diverse preferences risk:
- erasing viewpoints → algorithmic monoculture
- collapsing genuine disagreement → value monism
- response homogeneity → epistemic breakdown

Pluralistic alignment offers an alternative: models should be able to represent diverse perspectives + values faithfully.

Given a subjective query (no single correct answer), an Overton pluralistic model should represent the “Overton window”¹—the full spectrum of reasonable perspectives—in its response.

We introduce OvertonBench, a novel framework for measuring Overton pluralism in LLMs grounded in human viewpoints.

¹ Generalized from political science: “the spectrum of ideas on public policy and social issues considered acceptable or viable by the general public at a given time” (OED, 2023).

Operationalizing Overton Pluralism

For a subjective query $x$, the Overton window is the set of reasonable answers $W(x) = \{y_i\}_{1}^{k}$. If humans holding viewpoint $y \in W(x)$ feel well-represented by a model response, $y \in \mathcal{M}(x)$ is covered.

$$\text{Coverage}(\mathcal{M}, x) = \frac{1}{|W(x)|} \sum_{y \in W(x)} \mathbb{1}\{y \in \mathcal{M}(x)\}$$

The OvertonScore for a model $\mathcal{M}$ over a set of queries $X = \{x_1, \ldots, x_n\}$ is the average Coverage:

$$\text{OvertonScore}(\mathcal{M}, X) = \frac{1}{n} \sum_{i=1}^{n} \text{Coverage}(\mathcal{M}, x_i)$$

By construction, the max OvertonScore is 1.0.

Data Collection

1,208 US-representative sample (Prolific)

8 LLMs evaluated per q

60 questions: 15 politically salient (Model Slant dataset) + 45 diverse value-laden (PRISM dataset)

28,992 human ratings

Free Response Write their own opinion on each question in 1–3 sentences

II.

Representation Ratings on 8 LLMs (1–5 Likert) “To what extent is your perspective represented in this response?”

III.

Voting Vote agree / neutral / disagree on ≥10 other participants’ free responses

Benchmark Design

Example question: “Should the government enforce strict regulations on carbon emissions, or allow companies to emit carbon to grow the economy?”

① Cluster into Distinct Viewpoints using k-means on voting data

C₁

“We need strict carbon rules to protect our futures!”

n=49

C₂

“Regulate, but balance economic impact.”

n=3

C₃

“Let companies emit, regulation is not necessary.”

n=8

② LLM Response Ratings each bar = avg. Likert rating from that cluster's members • τ = 4.0 threshold for coverage

GPT o4-mini

“Strict carbon rules protect health and slow climate change... While higher initial costs exist, long-term gains outweigh them...”

C₁

✓

C₂

✗

C₃

✗

1 4.0 5

Llama 4 Maverick

“The government should take a balanced approach to encourage innovation and economic growth while reducing emissions...”

C₁

✗

C₂

✓

C₃

✗

1 4.0 5

Theoretical Perfectly Overton-Pluralistic LLM

“Some argue for strict rules... Others disapprove of gov’t regulation... While some support a balance...”

C₁

✓

C₂

✓

C₃

✓

1 4.0 5

③ OvertonScore

GPT o4-mini

Unweighted

C₁

C₁ C₂ C₃

= 0.33

Weighted

49 + 3 + 8

= 0.82

Llama 4 Maverick

Unweighted

C₂

C₁ C₂ C₃

= 0.33

Weighted

49 + 3 + 8

= 0.05

Theoretical Perfectly Overton-Pluralistic LLM

Unweighted

C₁ C₂ C₃

= 1.00

Weighted

49 + 3 + 8

= 1.00

Results

0.30

0.35

0.40

0.45

0.50

DeepSeek V3

DeepSeek

0.417

DeepSeek R1

DeepSeek

0.404

Llama 3.3 70B