SquadProxy engineering

WARC format

Definition

WARC (Web ARChive) is the ISO-standardised file format (ISO 28500) for storing multiple web resources and their associated HTTP metadata in one file. A WARC file is a sequence of records; a response record stores the target URL, the full HTTP response (headers and body), and additional metadata about the capture, such as its timestamp.

Common Crawl and the Internet Archive publish in WARC. Most large-scale web-archive corpora use WARC as the canonical interchange format.
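The record structure above can be sketched with a minimal pure-Python parser. This is an illustration of the record layout (version line, named headers, a blank line, then a `Content-Length`-delimited body), not a production reader; real pipelines typically use a library such as warcio, which also handles gzipped records. The sample record bytes are hand-built for the example.

```python
import io

def parse_warc_record(stream):
    """Parse a single uncompressed WARC record from a binary stream.

    Returns (headers_dict, body_bytes). Minimal sketch: a record is a
    version line, WARC named headers, a blank line, then a body of
    exactly Content-Length bytes.
    """
    version = stream.readline().decode("utf-8").strip()
    assert version.startswith("WARC/"), "not a WARC record"
    headers = {}
    while True:
        line = stream.readline().decode("utf-8")
        if line in ("\r\n", "\n", ""):
            break  # blank line ends the header block
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    body = stream.read(int(headers["Content-Length"]))
    return headers, body

# Hand-built toy record (a real response record's body would itself
# contain the raw HTTP status line, headers, and payload):
raw = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"Hello, world!"
)
hdrs, body = parse_warc_record(io.BytesIO(raw))
```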

Why WARC matters for AI training

If your training-corpus pipeline ingests from Common Crawl or similar archive sources, WARC is the primary format you'll handle:

  • Compressed size: WARC files are gzipped per record; a single monthly CC snapshot is ~90 TB of compressed WARCs.
  • Record-level access: because each record is its own gzip member, you can seek to a specific record by byte offset (published in CC's index) and fetch selected URLs without scanning whole files.
  • HTTP headers preserved: useful for downstream processing (Content-Language, Content-Type, and Last-Modified all inform corpus quality filtering).
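Record-level access works because each record is an independent gzip member, so a byte slice decompresses on its own. A minimal sketch, using a hand-built two-record file in place of a real download; in production the slice would come from an HTTP Range request (`Range: bytes=offset-<offset+length-1>`) against the WARC's URL, with `offset` and `length` taken from CC's CDX index:

```python
import gzip

def extract_record(warc_bytes, offset, length):
    """Decompress one record from per-record-gzipped WARC data, given
    the byte offset and compressed length reported by an index."""
    return gzip.decompress(warc_bytes[offset:offset + length])

# Toy WARC: two concatenated gzip members standing in for two records.
rec1 = gzip.compress(b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\n")
rec2 = gzip.compress(b"WARC/1.0\r\nWARC-Type: response\r\n\r\n")
warc = rec1 + rec2

# Jump straight to the second record without scanning the first.
second = extract_record(warc, offset=len(rec1), length=len(rec2))
```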

Processing WARC at AI-corpus scale

If you only need plaintext for a training corpus, Common Crawl also publishes WET files (plaintext extractions), which skip the WARC parsing step entirely. For most AI training workloads, WET is sufficient and faster.

WAT files (metadata-only) are useful for link-graph analysis and dedup work that doesn't need the full text.

When to go to raw WARC rather than WET: (1) you need HTML structure for downstream processing (tables, code blocks, metadata), (2) you're applying your own text extraction instead of CC's default, or (3) you need the preserved HTTP headers, e.g. for Accept-Language-aware deduplication.
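Case (3) can be sketched as a header-based corpus filter. The body of a WARC response record is the raw HTTP response, so its headers can be parsed directly; `keep_for_corpus` below is a hypothetical filter (names and language set are assumptions for the example), keeping HTML documents whose declared Content-Language is in a target set and passing untagged documents through for a later language-ID stage:

```python
def parse_http_headers(http_block):
    """Parse the status line and headers from a raw HTTP response block
    (as preserved in a WARC response record's body). Header names are
    lower-cased for lookup. Minimal sketch, not a full HTTP parser."""
    head, _, _body = http_block.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")
    status = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers

def keep_for_corpus(headers, wanted_langs=frozenset({"en", "de"})):
    """Hypothetical quality filter: keep HTML whose declared language
    is wanted; untagged pages pass through for later language ID."""
    if not headers.get("content-type", "").startswith("text/html"):
        return False
    lang = headers.get("content-language", "").split("-")[0].lower()
    return lang == "" or lang in wanted_langs

raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: text/html; charset=utf-8\r\n"
       b"Content-Language: en-US\r\n"
       b"\r\n"
       b"<html>...</html>")
status, http_hdrs = parse_http_headers(raw)
```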
