Proxies for Common Crawl: when you need one, when you don't, and how to route
Common Crawl publishes ~250 TB of web content per monthly snapshot and makes most of it freely accessible from S3. Proxies still have a role — but a narrower one than most scraping guides suggest. A working engineer's breakdown.
· Nathan Brecher · 6 min read
Common Crawl is unusual among large web archives: the operators want you to have the data. Access is staged through AWS S3 requester-pays buckets, indexed via the CDX server, and the whole archive is updated on a roughly monthly cadence. None of that requires a proxy if you stay inside the intended access pattern.
Where proxies matter for Common Crawl work is at the edges — and knowing which edges are which is the difference between paying for bandwidth you didn't need and burning days on rate-limits you could have avoided.
What Common Crawl actually publishes
Each monthly snapshot ships three artifact classes, all in S3 under the public commoncrawl bucket:
- WARC files — the raw HTTP response payloads, ~90 TB per month, gzipped. This is what you want if you're building a training corpus; it's the source of truth for what the open web looked like at crawl time.
- WAT files — metadata extractions (headers, links), ~20 TB. Useful for link-graph analysis and dedup.
- WET files — plaintext extractions from the WARC, ~8 TB. Useful if you don't want to run your own text extractor.
The monthly index (columnar Parquet in S3) lets you pick out exactly the URLs you want without scanning the full WARC. Most AI teams pulling selected slices of Common Crawl work from the index, not the raw WARC. A typical training run pulls ~10 TB from a specific month's WET, not the whole 250 TB snapshot.
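A slice-selection pass over that columnar index can be expressed as a single query. A minimal sketch, assuming duckdb as the query engine; the prefix layout mirrors Common Crawl's published cc-index table structure, but actually running the query needs network access and requester-pays credentials, so only the SQL is constructed here:

```python
# Sketch: build a columnar-index query for one domain's captures.
# index_prefix and QUERY are illustrative names; the partition layout
# (crawl=/subset=) matches the published cc-index table structure.

def index_prefix(crawl: str, subset: str = "warc") -> str:
    """S3 prefix for one crawl's columnar (Parquet) index partition."""
    return (
        "s3://commoncrawl/cc-index/table/cc-main/warc/"
        f"crawl={crawl}/subset={subset}/"
    )

QUERY = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM read_parquet('{prefix}*.parquet', hive_partitioning=1)
WHERE url_host_registered_domain = 'example.fr'
"""

if __name__ == "__main__":
    # Feed this to duckdb (or Athena, with its own table DDL) to get
    # the exact WARC offsets for the slice you want.
    print(QUERY.format(prefix=index_prefix("CC-MAIN-2026-04")))
```

Swapping the WHERE clause (by host, language, or MIME type columns in the index schema) is how you carve out a slice without touching the raw WARC.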
Access patterns, and where proxies fit
Pattern 1: direct S3 pull (no proxy needed)
The intended path. Requester-pays S3 is happy to serve bulk traffic to any AWS account that can pay egress. Your EC2 instances in us-east-1 can aws s3 cp at line rate and nothing will rate-limit you. Proxies here add cost and latency for no benefit.
The only caveat: if your collection pipeline runs outside AWS, you pay egress when pulling from AWS S3, which is real money at terabyte scale. That's an architecture problem, not a proxy problem — the answer is "move your collection job into AWS," not "add a residential pool in front of it."
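For completeness, the direct pull looks like this in boto3, with no proxy in the path. The function names are illustrative; RequestPayer="requester" is the real boto3 parameter for requester-pays buckets, and the byte range comes straight from the index columns:

```python
def warc_range_header(offset: int, length: int) -> str:
    """HTTP Range header for one WARC record (end byte is inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"

def fetch_record(key: str, offset: int, length: int) -> bytes:
    """Pull a single gzipped WARC record from the requester-pays bucket.

    Direct S3, the intended access pattern: no proxy, no rotation.
    """
    import boto3  # third-party; imported lazily so the range helper
                  # above stays dependency-free

    s3 = boto3.client("s3")
    resp = s3.get_object(
        Bucket="commoncrawl",
        Key=key,
        Range=warc_range_header(offset, length),
        RequestPayer="requester",  # you pay transfer, not Common Crawl
    )
    return resp["Body"].read()
```

Run it from us-east-1 and the transfer line on the bill stays at zero.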
Pattern 2: the CDX index server (proxies sometimes help)
Common Crawl's CDX server is the lookup surface for "what URLs did you see for this domain in this month?" It's free, HTTP-accessible, and rate-limited. The limit is generous for research use but real — running an enumeration job across 500 domains from a single IP will trip it, and responses degrade to timeouts before you see outright 429s.
Proxies help here by spreading the queries across multiple origins. Datacenter is fine — CDX doesn't care about origin ASN, it cares about request rate per source. A rotating datacenter pool of 50–100 IPs lets an enumeration job finish in hours instead of days. See the datacenter proxy page for how we configure this class for CDX workloads specifically.
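A minimal sketch of that spread, assuming the standard requests library and a list of proxy URLs. The CDX endpoint, its url/output/page parameters, and showNumPages=true are part of the public index.commoncrawl.org API; enumerate_domain itself is an illustrative shape, not a library function:

```python
from itertools import cycle
from urllib.parse import urlencode

CDX = "https://index.commoncrawl.org/{crawl}-index"

def cdx_url(crawl: str, domain: str, page: int = 0) -> str:
    """CDX API query for every capture under a domain (JSON lines out)."""
    qs = urlencode({"url": f"{domain}/*", "output": "json", "page": page})
    return CDX.format(crawl=crawl) + "?" + qs

def enumerate_domain(crawl: str, domain: str, proxies: list[str]):
    import requests  # third-party; lazy so the URL helper stays importable

    pool = cycle(proxies)  # round-robin over the datacenter pool
    # Ask the server up front how many result pages exist.
    pages = requests.get(
        cdx_url(crawl, domain) + "&showNumPages=true",
        proxies={"https": next(pool)}, timeout=30,
    ).json()["pages"]
    for page in range(pages):
        resp = requests.get(
            cdx_url(crawl, domain, page),
            proxies={"https": next(pool)},  # each page exits a new IP
            timeout=30,
        )
        resp.raise_for_status()
        yield from resp.text.splitlines()
```

Round-robin is enough here: the CDX server cares about rate per source IP, not about which ASN the IP belongs to.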
Pattern 3: pulling from Common Crawl mirrors or derivatives
A growing secondary ecosystem republishes Common Crawl subsets with different processing — Dolma, C4, RefinedWeb, SlimPajama, and a long tail of academic mirrors hosted on university infrastructure. These mirrors tend to have their own access limits, some require attribution tokens, and some are hosted behind Cloudflare with bot-management rules.
This is where residential proxies occasionally earn their keep: university mirrors sitting behind Cloudflare will frequently 403 requests from known cloud ASNs during peak hours. A small residential pool anchored to the mirror's region unblocks the download. See the residential proxy page.
Pattern 4: re-fetching URLs Common Crawl indexed
The most common AI-team use case that actually needs proxies. You've used the CDX index to find URLs of interest (regional news archives, specific forum threads, domain-specific knowledge), and now you want to re-fetch them live — either because the snapshot is stale or because Common Crawl's text extraction lost structure you need (tables, code blocks, metadata).
Live re-fetch is where the proxy strategy from residential vs datacenter for AI workloads applies in full. The routing matrix there — source class → exit class — is what you want for the re-fetch job, not for the Common Crawl pull itself.
Rate-limits to plan around
Common Crawl's documented limits are permissive, but a few edges bite in practice:
- CDX server: ~3 requests per second per IP is comfortable. Higher and you'll see 429s. A pool of 30–50 datacenter proxies comfortably runs at 100 rps aggregate.
- S3 requester-pays: no rate limit in the normal sense, but you will see bandwidth and request-count charges. Budget for ~$0.09 per GB egress if pulling outside us-east-1. Run your collector in us-east-1 and it's free.
- S3 object-per-second: S3 doesn't advertise a hard per-prefix limit, but pulling many small objects (WET segments) with high concurrency can hit 503s. Keep per-prefix concurrency under 5,000; batch small object reads where possible.
A working configuration
A Common Crawl ingestion job we've seen work reliably for AI teams pulling ~20 TB per month:
# Concurrency shape
# CDX index queries: ~20 workers, rotating datacenter proxies
# WARC/WET pulls: direct S3, in us-east-1, no proxy
# Secondary mirrors: ~5 workers, residential (target-country)
# Live re-fetch: per-source routing via X-Squad-Class
from squadproxy_client import Client

client = Client(
    endpoint="gateway.squadproxy.com:7777",
    default_class="datacenter",
)

# CDX enumeration
for hit in client.cdx_query(
    domain="example.fr",
    crawl="CC-MAIN-2026-04",
    exit_class="datacenter",  # rotating
):
    ...

# Live re-fetch (see routing matrix)
resp = client.fetch(
    url=hit.url,
    exit_class="residential",  # .fr geoblocks AWS
    country="fr",
    session="per-request",
)
The classification of each source stays in the pipeline, not in the proxy layer — that's the shape that scales.
When proxies are the wrong answer
Three patterns we see regularly that are not a proxy problem:
- "I keep getting S3 403s on requester-pays." Check your AWS credentials and bucket region. A proxy cannot fix an auth problem.
- "Egress is too expensive." Run inside us-east-1. A proxy adds to the bill; it doesn't reduce it.
- "WARC decompression is slow." That's a pipeline compute problem — you need more workers, parallel decompression, or a different text-extraction stage. No proxy helps here.
Where to route Common Crawl traffic through SquadProxy
The short version: datacenter for index queries and mirror pulls, residential for regional-mirror unblocks, direct S3 for the actual WARC/WET volume. Our Team plan includes the concurrency to run this comfortably; see pricing.
For the pipeline-level picture on how Common Crawl slots into a training-data architecture, the RAG data collection use case covers the upstream side (what to pull and why), and the training-corpus engineering blog series covers the downstream side (tokenization, dedup, versioning).