
Proxies as methodology for multilingual LLM benchmarks

Multilingual LLM evaluation that uses only US-cloud-origin requests under-reports regional content policy and geo-dependent response divergence. A proxy layer anchored to each benchmark language's primary country is methodology, not infrastructure.

Reeya Patel · 6 min read

Multilingual evaluation has a methodology gap that's older than it should be: benchmarks like MMLU-ProX, FLORES, Aya, and the newer multilingual safety suites are typically run from a single origin region (the lab's AWS account, usually us-east-1) even when the benchmark explicitly tests regional competence. This produces results that measure the model's language competence but not its geo-anchored competence — which, for any production deployment, is what the operator actually needs.

Proxies anchored to each benchmark language's primary country aren't infrastructure in this context; they're part of the evaluation methodology. This post walks through why, and how to wire the layer into an eval harness without breaking reproducibility.

What a single-origin eval misses

Commercial LLM APIs apply regional policy in three ways that matter for multilingual evaluation:

  1. Content policy per origin region. The same request in the same language may receive different refusal patterns if the origin IP is in the target country vs. US-cloud. This is documented behaviour for some APIs and undocumented for others; either way it's measurable.
  2. Regional routing at the inference layer. Providers increasingly route inference to region-local POPs when the origin is in-region. The routing can pick different model-serving fleets, different safety checkpoints, or different content filters applied per deployment.
  3. Retrieval-augmented responses (where present). For providers that add retrieval, the retrieval corpus is often region-specific. US-origin requests with a Japanese-language prompt may pull English-language retrieval sources; JP-origin requests pull Japanese ones.

A benchmark that ignores these three factors is measuring "model behaviour in the lab's home region" and calling it "multilingual competence." The gap between those two things is sometimes small, sometimes very large, and the sign of the gap differs by language-region pair.
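Factor 1 is the easiest to measure directly: compare refusal rates for the same prompts across origins. A minimal sketch, assuming a crude marker-based refusal check (the marker list and function names here are illustrative; a real harness would use a trained refusal classifier):

```python
REFUSAL_MARKERS = ("i can't help", "i cannot assist", "against our policy")

def is_refusal(text: str) -> bool:
    # crude substring match; production harnesses use a classifier
    t = text.lower()
    return any(m in t for m in REFUSAL_MARKERS)

def refusal_rate(responses: list[str]) -> float:
    # fraction of responses flagged as refusals for one origin
    return sum(is_refusal(r) for r in responses) / len(responses)
```

Run the same prompt set through this from each origin and the per-origin refusal rates make the content-policy divergence directly comparable.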

The proxy layer as an evaluation variable

The methodological move is to treat origin region as an evaluation variable and run the benchmark at least twice: once from US-cloud (the baseline that matches most published numbers), and once from an origin in the target language's primary region. The delta is the regional policy effect, and it's reportable.

For MMLU-ProX runs across 29 languages, these are the target origins that matter in our experience:

  • fr — France or Canada (Quebec)
  • de — Germany
  • es — Spain or Mexico (they differ)
  • pt — Portugal or Brazil (they differ substantially)
  • ja — Japan
  • ko — Korea
  • zh — Taiwan or Singapore (mainland China access is a separate problem, covered below)
  • ar — UAE or Saudi Arabia
  • hi — India
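
Spelled out as code, the list above becomes a language → country-code map (ISO 3166-1 alpha-2). The specific picks for the split cases (es, pt, zh, ar) are illustrative; the harness below keeps only a subset:

```python
# Language → proxy-exit country for the MMLU-ProX subset listed above.
# Where two regions differ (es, pt), pick one and keep it fixed for
# the whole run; the choices here are illustrative, not prescriptive.
ORIGINS = {
    "fr": "fr",  # France (or "ca" for Quebec)
    "de": "de",
    "es": "es",  # or "mx" — they differ
    "pt": "br",  # or "pt" — they differ substantially
    "ja": "jp",
    "ko": "kr",
    "zh": "tw",  # or "sg"; mainland China is a separate problem
    "ar": "ae",  # or "sa"
    "hi": "in",
}
```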

The residential proxy page lists the pool specifics. For multilingual eval, residential is the right class — the model APIs classify datacenter origins differently even when the IP geolocation is correct.

A minimal multi-origin eval harness

import asyncio, httpx

ORIGINS = {
    "fr": "fr",
    "de": "de",
    "ja": "jp",
    "ko": "kr",
    # ... map benchmark language → country code
}

PROXY = "http://USER:PASS@gateway.squadproxy.com:7777"
TOKEN = "..."  # provider API key

async def eval_prompt(prompt: str, lang: str, model: str):
    country = ORIGINS[lang]
    async with httpx.AsyncClient(
        proxy=PROXY,  # httpx < 0.26: proxies={"https://": PROXY}
        headers={
            "X-Squad-Class": "residential",
            "X-Squad-Country": country,
            "X-Squad-Session": "sticky-10m",
        },
        timeout=httpx.Timeout(120.0),
    ) as client:
        return await client.post(
            f"https://api.provider.example/v1/{model}/complete",
            json={"prompt": prompt},
            headers={"Authorization": f"Bearer {TOKEN}"},
        )

async def eval_prompt_direct(prompt: str, model: str):
    # US-cloud baseline: same request, no proxy layer
    async with httpx.AsyncClient(timeout=httpx.Timeout(120.0)) as client:
        return await client.post(
            f"https://api.provider.example/v1/{model}/complete",
            json={"prompt": prompt},
            headers={"Authorization": f"Bearer {TOKEN}"},
        )

async def run(benchmark_rows):
    # Each row has (prompt, language, expected). Run each against
    # both US-cloud (direct) and the target-country origin; emit
    # both scores. score() is the benchmark's own scorer.
    results = []
    for row in benchmark_rows:
        us_resp = await eval_prompt_direct(row.prompt, row.model)
        regional_resp = await eval_prompt(row.prompt, row.language, row.model)
        us_score = score(us_resp, row.expected)
        regional_score = score(regional_resp, row.expected)
        results.append({
            "prompt_id": row.id,
            "language": row.language,
            "us_origin_score": us_score,
            "regional_origin_score": regional_score,
            "regional_delta": regional_score - us_score,
        })
    return results

The sticky-10m session window is deliberate: multi-turn evaluation needs consistent IP across the conversation, and 10-minute stickiness covers most eval turns without locking in a single IP long enough to hit provider-side rate limits.
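For multi-turn benchmarks, the practical pattern is one set of session headers per conversation, reused across every turn, so all turns exit from the same IP for the sticky window. A minimal sketch, assuming the gateway keys stickiness on the `X-Squad-Session` value (the per-conversation suffix here is a hypothetical convention, not documented gateway behaviour):

```python
import uuid

def session_headers(country: str) -> dict:
    # hypothetical per-conversation session label; assumes the gateway
    # pins one exit IP per distinct X-Squad-Session value for 10 minutes
    session_id = f"sticky-10m-{uuid.uuid4().hex[:8]}"
    return {
        "X-Squad-Class": "residential",
        "X-Squad-Country": country,
        "X-Squad-Session": session_id,
    }
```

Build one `httpx.AsyncClient` with these headers per conversation and send every turn through it; a fresh conversation gets fresh headers and, with them, a fresh exit IP.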

What to do with the delta

The two-column output (US-origin score, regional-origin score) makes the regional policy effect legible. For publication purposes, report both. For safety evaluation specifically, the regional column is usually the more honest measurement — that's the context your users actually hit.

A delta in the ~5-10% range is within test-retest noise for repeated runs from the same origin. A delta above ~15% is real and worth investigating. Deltas over 30% are almost always a content-policy or retrieval-source difference; they measure the model-as-deployed, not the model-as-weights.
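Those bands can be encoded as a crude triage function. A sketch, assuming scores in [0, 1] and treating the thresholds as approximate, not exact cutoffs:

```python
def classify_delta(us_score: float, regional_score: float) -> str:
    """Bucket the regional policy effect per the approximate bands above."""
    delta = abs(regional_score - us_score) * 100  # percentage points
    if delta < 10:
        return "noise"            # within test-retest variance
    if delta < 15:
        return "borderline"       # re-run before drawing conclusions
    if delta <= 30:
        return "real"             # worth investigating
    return "policy-or-retrieval"  # model-as-deployed effect
```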

Edge cases we've hit

Mainland China targets. Access to Chinese-language model APIs from a mainland-China residential IP requires separate compliance infrastructure and isn't well-served by a Western-hosted proxy network. Benchmark Chinese-language competence from Taiwan or Singapore residentials instead; note the limitation in the write-up.

Language-region dissonance. "Spanish" is not a single target origin. es-MX and es-ES measure different things because the deployed model applies different safety stacks per region. If the benchmark design is language-level, pick one target origin per language and stick with it across the full benchmark run for reproducibility; note the region selection.

Model provider region routing that doesn't honour the origin. Some model APIs route all requests to a single global region regardless of client origin. In that case the origin experiment shows no delta — which is itself a finding: that provider doesn't apply regional policy at the routing layer for your workload. Report it.
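One way to make the "no delta" finding concrete is to compare the mean absolute per-prompt delta against the test-retest noise floor for that origin. A sketch (function name and the 0.10 default floor are illustrative):

```python
from statistics import mean

def origin_effect_summary(deltas: list[float], noise_floor: float = 0.10) -> str:
    """deltas: per-prompt (regional_score - us_score) for one language.
    Mean absolute delta under the noise floor suggests the provider
    global-routes this workload regardless of client origin."""
    mad = mean(abs(d) for d in deltas)
    if mad < noise_floor:
        return "no regional policy observed (likely global routing)"
    return f"regional effect: mean |delta| = {mad:.2f}"
```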

Reproducibility

For a multi-origin eval to be reproducible by a reviewer, the published harness needs:

  • Exact country codes used per language
  • Exit class used (residential, ISP, etc.)
  • Session stickiness window
  • Timestamp of the eval run (regional policy drifts on the order of weeks)
  • Provider-side model version pins where the API exposes them

Without these, a future re-run diverges in ways that look like regression but are actually infrastructure drift. The safety red-team use case covers the adjacent case of geo-anchored red-teaming methodology, which shares the same reproducibility constraints.
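The reproducibility fields reduce to a small manifest serialized alongside the eval results. A sketch under assumed field names (nothing here is a standard schema):

```python
import json
from datetime import datetime, timezone

def run_manifest(origins: dict, exit_class: str, stickiness: str,
                 model_versions: dict) -> str:
    """Serialize the reproducibility fields listed above; publish the
    output next to the benchmark scores."""
    return json.dumps({
        "origins": origins,                # exact country code per language
        "exit_class": exit_class,          # residential, ISP, ...
        "session_stickiness": stickiness,  # e.g. "sticky-10m"
        "run_timestamp": datetime.now(timezone.utc).isoformat(),
        "model_versions": model_versions,  # provider-side pins where exposed
    }, indent=2)
```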

Cost shape

Multi-origin eval is bandwidth-cheap (prompts + completions are small) but concurrency-heavy during a benchmark run. A run across 29 languages at 100 concurrent per language completes in under an hour on our Team plan's 1000-concurrent ceiling; see the pricing page for how the plans scale. The Lab plan adds BGP-dedicated prefixes per country where a research publication needs to cite infrastructure stability for the eval run.
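The concurrency-heavy shape argues for one shared pool capped at the plan ceiling rather than fixed per-language fan-out. A minimal sketch using a semaphore (the ceiling default mirrors the figure above; how you batch per language is up to the harness):

```python
import asyncio

async def bounded_run(tasks, ceiling: int = 1000):
    """Cap in-flight requests at the plan's concurrency ceiling;
    all languages draw from one shared pool."""
    sem = asyncio.Semaphore(ceiling)

    async def _guarded(coro):
        async with sem:
            return await coro

    # gather preserves input order, so results line up with tasks
    return await asyncio.gather(*(_guarded(t) for t in tasks))
```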

What this is not

Multi-origin eval is not a way to bypass content policy. If a model refuses a prompt from one origin and answers it from another, both answers are real data; the methodology is about measuring the difference, not about routing around it. Our AUP is explicit that circumvention of platform-level access controls is out of scope for the network, and the workflows this post describes are entirely within intended-use for commercial model APIs that don't block residential origins.
