evaluation · regional-bias · methodology

How much does geography actually change ChatGPT's answer? A 10-country test

We ran 8,000 requests (80 prompts, 10 replicates each) against GPT-4o from 10 country origins to measure how much the answer to the same question changes when the request's IP geography changes. The delta is smaller than we expected, larger than zero, and concentrated in a specific class of prompt.

· Hamza Rahim · 4 min read

A recurring claim in AI-safety discourse is that commercial LLM APIs return meaningfully different responses depending on where the request originates. We had a testable version of that claim and a SquadProxy network of residential exits in 10 countries, so we ran it.

Setup:

  • 80 prompts across four categories: factual (50%), ambiguous-cultural (20%), policy-adjacent (20%), benign-open-ended (10%).
  • Each prompt sent to GPT-4o via the public API, 10 times per origin, from residential exits in US, UK, Germany, France, Japan, Netherlands, Canada, Singapore, South Korea, Australia.
  • Per-request rotation, no account-level identifiers, no user field, English prompts only (for this run — a multilingual follow-up is pending).
  • Responses scored for: length, refusal (classifier), top-3 entities mentioned, and stance where applicable.
  • 10 countries × 80 prompts × 10 replicates = 8,000 responses. Full methodology and data disclosure notes below.
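The design above reduces to a simple experiment matrix. A minimal sketch of that loop — the `send_via_exit` helper and the origin codes are illustrative placeholders, not the actual harness:

```python
import itertools

# Illustrative origin codes; the run used residential exits in these countries.
ORIGINS = ["US", "UK", "DE", "FR", "JP", "NL", "CA", "SG", "KR", "AU"]
REPLICATES = 10

def run_matrix(prompts, send_via_exit):
    """Send every prompt from every origin, REPLICATES times each."""
    responses = []
    for origin, prompt, rep in itertools.product(ORIGINS, prompts, range(REPLICATES)):
        responses.append({
            "origin": origin,
            "prompt": prompt,
            "replicate": rep,
            # One residential exit per request; no account-level identifiers.
            "response": send_via_exit(origin, prompt),
        })
    return responses

# With 80 prompts this yields 10 * 80 * 10 = 8,000 responses.
# Demo with one prompt and a stub sender:
demo = run_matrix(["Explain the photoelectric effect"],
                  lambda origin, prompt: f"[stub reply from {origin}]")
print(len(demo))  # 10 origins * 1 prompt * 10 replicates = 100
```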

We want to be upfront: this is one run, one model version (gpt-4o-2024-11-20, snapshot taken November 2025), and the specific behaviours we describe should be treated as anecdotal from our setup rather than reproducible claims about the model. The goal is methodology, not a leaderboard.

The boring finding

On factual and benign-open-ended prompts — about 60% of our prompts — regional IP made no measurable difference. "Explain the photoelectric effect" returns substantively the same answer whether the request originates in Ohio or Osaka. Length varied ±8% within-prompt across origins, which is within the temperature-driven variance.
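One way to express the "±8% within-prompt" check is the maximum deviation of per-origin mean length from the grand mean for a single prompt. A minimal sketch with toy token counts, not the run's data:

```python
from statistics import mean

def length_spread(lengths_by_origin):
    """Max fractional deviation of per-origin mean response length
    from the grand mean; the post's +/-8% corresponds to <= 0.08."""
    per_origin = {o: mean(ls) for o, ls in lengths_by_origin.items()}
    grand = mean(per_origin.values())
    return max(abs(m - grand) / grand for m in per_origin.values())

# Toy data: token counts for one prompt across three origins.
spread = length_spread({
    "US": [510, 495, 505],
    "JP": [500, 490, 500],
    "DE": [520, 515, 505],
})
print(round(spread, 3))
```

A spread inside the replicate-to-replicate (temperature-driven) variance is what we treat as "no measurable difference".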

This is the finding we expected going in: for the majority of LLM queries, the request's IP geography is not a major signal compared to the prompt itself.

The interesting finding

Policy-adjacent prompts showed consistent regional variation. Specifically: prompts about political figures, contested historical events, and jurisdiction-specific legal questions. Examples:

  • "Summarise the key criticisms of [named politician]" returned more guarded framing from EU origins than US origins, with wider variance in the responses themselves.
  • "Is it legal to [jurisdiction-specific act]" returned correctly localised advice about 70% of the time when the origin matched the jurisdiction, and about 30% of the time when it didn't; from US origins the model defaulted to US-framed advice regardless of the jurisdiction in the prompt.
  • Refusal rates on sensitive-topic prompts varied by 3–12% between origins at p<0.05.
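A standard way to test a refusal-rate difference like this is a two-proportion z-test; the counts below are toy numbers for illustration, not the study's data:

```python
from math import sqrt, erf

def two_proportion_z(refusals_a, n_a, refusals_b, n_b):
    """Two-sided z-test for a difference in refusal rates between two origins."""
    p_a, p_b = refusals_a / n_a, refusals_b / n_b
    pooled = (refusals_a + refusals_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF (via erf).
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Toy numbers: 12% vs 5% refusals over 200 requests per origin.
z, p = two_proportion_z(24, 200, 10, 200)
print(round(z, 2), round(p, 4))
```

At the run's sample sizes (800 requests per origin), differences at the low end of the 3–12% range are still detectable at p<0.05.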

We are not claiming these are intentional provider behaviours. They may be emergent from training data that over-represents US web content, from provider-side localisation, or from our prompt construction. As for the methodology question ("does this kind of measurement matter, or is the variance all temperature noise?"), the answer is that it is not all noise.

The ambiguous-cultural finding

Prompts like "recommend a good restaurant for a first date" or "write a polite business email declining a meeting" returned outputs that often matched the origin country's cultural defaults. This was probably the cleanest "the model infers locality from provider-attached metadata" signal in the run: different cities per origin, different directness levels, different topical anchors.

What this means for evaluation

If your eval is measuring factual QA or coding help, single-origin testing is probably fine. If your eval touches policy, jurisdiction-specific content, safety, or cultural adaptation, single-origin testing under-samples and you will miss meaningful variance.

For teams doing this kind of work, we recommend the following minimum regional footprint:

  • US (default origin for most commercial providers)
  • UK or Germany (EU policy anchoring)
  • Japan (Article 30-4 and APAC cultural anchoring)
  • One representative Global South origin (our run didn't include one — a follow-up will)

Methodology notes and caveats

  • We used residential exits (not datacenter) because datacenter IPs are classified instantly and the provider may route them through a specific stack. We wanted the "end-user" path.
  • We did not use the user parameter in the API, because it carries an account-level identifier and would conflict with the geography signal.
  • The refusal classifier was a small fine-tuned model; false positive rate is ~2%.
  • Prompt selection was done before running the eval to avoid cherry-picking toward confirmable results.
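Given the ~2% false-positive rate, observed refusal rates can be corrected before comparing origins. This sketch assumes perfect recall (FNR = 0), which the post does not state — it is an assumption of the example:

```python
def correct_refusal_rate(observed, fpr=0.02, fnr=0.0):
    """Invert classifier error to estimate the true refusal rate.
    Model: observed = true * (1 - fnr) + (1 - true) * fpr.
    ASSUMPTION: fnr defaults to 0 (perfect recall); the post only quotes FPR."""
    return (observed - fpr) / (1 - fpr - fnr)

# A 10% observed rate corrects down slightly once the 2% FPR is removed.
print(round(correct_refusal_rate(0.10), 4))
```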

Data disclosure: we will publish the prompt set and the response transcripts as part of a follow-up long-form writeup, subject to review for any prompts that surfaced material we shouldn't redistribute. Expect the writeup in Q1 2026.
