training-data · deduplication · corpus-engineering

Tokenization-aware dedup at scrape time, not after

Most training-corpus pipelines run MinHash dedup after collection finishes. Running it at scrape time with tokenizer-aware signatures saves terabytes and produces a cleaner corpus. Here is the approach that worked for us and why it matters.

· Imogen Reyes · 4 min read

Most training-corpus pipelines collect first and dedup second. Collect a few TB, run MinHash-LSH, drop the near-duplicates, move on. The canonical reference for this is Lee et al. 2021 ("Deduplicating Training Data Makes Language Models Better", arXiv:2107.06499), which used 5-gram MinHash signatures on space-tokenized text and showed material perplexity and downstream eval improvements on GPT-style models trained on deduped data.

Llama 3's data pipeline, per the Meta technical report (and the LSHBloom paper that builds on that approach), ran MinHash over the full corpus, plus CCNet-style line-level deduplication that dropped any line appearing more than six times.

The pattern all of these share: dedup is a post-processing stage. Collect, then compact.

We want to argue the other direction: do a first-pass dedup at scrape time, with tokenizer-aware signatures, before the content ever hits long-term storage. The wins add up in ways that matter at TB scale.

What "at scrape time" actually means

A scraper running against Common Crawl, news archives, and academic mirrors typically fetches URLs, extracts body text, and writes a Parquet shard to object storage. We add one step between extract and write:

  1. Extract and canonicalise text (Trafilatura, Readability, or equivalent — the choice of extractor is its own long discussion).
  2. Tokenise with your model's tokenizer (BPE for most GPT-family, SentencePiece for Llama, etc.) — take the first N=1024 tokens.
  3. Compute MinHash signature over k-gram tokens (k=5 is the research baseline, k=3 is faster at the cost of more collapse).
  4. LSH bucket against a Redis/ScyllaDB keyed on a rolling 30-day window of signatures.
  5. If near-duplicate: skip write, emit a "seen-as: $url" reference instead, for retrieval-time provenance.
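Steps 2–4 can be sketched in a few lines. A minimal, self-contained version, with a whitespace split standing in for the real BPE/SentencePiece tokenizer and per-permutation blake2b seeds standing in for proper hash permutations (the names and constants here are illustrative, not our production code):

```python
import hashlib

NUM_PERM = 128   # number of MinHash permutations
K = 5            # k-gram size (the research baseline from Lee et al.)
MAX_TOKENS = 1024

def tokenize(text):
    # Stand-in for a real BPE/SentencePiece tokenizer; a whitespace
    # split keeps this sketch self-contained.
    return text.lower().split()[:MAX_TOKENS]

def minhash_signature(tokens, num_perm=NUM_PERM, k=K):
    """MinHash over token k-grams: one minimum per seeded hash function."""
    sig = [float("inf")] * num_perm
    for i in range(len(tokens) - k + 1):
        gram = " ".join(tokens[i:i + k]).encode()
        for p in range(num_perm):
            # blake2b with a per-permutation salt emulates num_perm
            # independent hash functions.
            h = int.from_bytes(
                hashlib.blake2b(gram, digest_size=8,
                                salt=p.to_bytes(2, "big")).digest(), "big")
            if h < sig[p]:
                sig[p] = h
    return sig

def jaccard_estimate(sig_a, sig_b):
    """Fraction of agreeing slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

A real deployment would use a vectorised MinHash implementation (e.g. datasketch or a Rust port); the per-gram inner loop here is written for clarity, not speed.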

The rolling 30-day window matters. You don't want to dedup against a five-year archive at scrape time (that's the post-processing MinHash stage, still valuable), but you do want to avoid writing the eighth near-copy of the same Reuters wire story you pulled off eight news aggregators this week.
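The windowed lookup can be modelled as classic LSH banding over the signature, with each bucket carrying an insertion timestamp. A sketch of the idea with an in-memory dict standing in for the Redis/ScyllaDB store (where you'd use key TTLs instead of explicit timestamps); the class and parameter names are hypothetical:

```python
import time

BANDS, ROWS = 16, 8               # 16 bands × 8 rows = 128 permutations
WINDOW_SECS = 30 * 24 * 3600      # rolling 30-day window

class RollingLSHIndex:
    """In-memory stand-in for the Redis-backed LSH buckets. Each band
    key stores the first URL seen plus its insertion time; lookups
    ignore entries older than the rolling window."""

    def __init__(self, window=WINDOW_SECS, clock=time.time):
        self.window = window
        self.clock = clock          # injectable clock, handy for testing
        self.buckets = {}           # (band_idx, band_tuple) -> (url, t)

    def _band_keys(self, sig):
        for b in range(BANDS):
            yield (b, tuple(sig[b * ROWS:(b + 1) * ROWS]))

    def query(self, sig):
        """Return the URL of a recent near-duplicate, or None."""
        now = self.clock()
        for key in self._band_keys(sig):
            hit = self.buckets.get(key)
            if hit and now - hit[1] <= self.window:
                return hit[0]
        return None

    def insert(self, sig, url):
        now = self.clock()
        for key in self._band_keys(sig):
            self.buckets.setdefault(key, (url, now))
```

Two signatures collide if any band matches exactly, which is what makes LSH a sub-linear lookup rather than an all-pairs comparison.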

Why tokenization-aware beats character-level

The short answer: your downstream model's compression is what matters for the "is this duplicate training value" question, not the surface-form similarity.

Character-level or space-token MinHash marks two documents as duplicates only if they share substantial literal 5-grams. Two news stories about the same event, quoting the same AP wire but written with different bylines and slightly different framing, look distinct at character level, yet collapse to very similar token sequences once a BPE tokenizer has merged common phrases into single tokens.

Running MinHash on the token sequence surfaces near-dupes that character-level dedup misses. In our own scrape of English-language news, about 40M documents over a six-week window, this reduced stored document count by roughly 14% relative to character-level MinHash at the same threshold. The papers above see similar deltas when comparing tokenizer choices, although direct apples-to-apples numbers depend on scrape composition.
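The quantity MinHash is estimating in both cases is the Jaccard similarity of k-gram shingle sets; the dedup decision is just a threshold on that estimate. A minimal sketch of that underlying machinery (the token-vs-character gap described above only materialises with a real BPE tokenizer, so this example just shows the shingling, not the gap):

```python
def kgram_shingles(seq, k=5):
    """Set of k-grams over any sequence: a token list gives token-level
    shingles, a plain string gives character-level shingles."""
    return {tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity; MinHash approximates this value."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Duplicate decision at a threshold (0.8 is a common choice, not a
# universal constant):
def is_near_duplicate(doc_a_tokens, doc_b_tokens, threshold=0.8):
    return jaccard(kgram_shingles(doc_a_tokens),
                   kgram_shingles(doc_b_tokens)) >= threshold
```

Swapping the input from `text.split()` to `tokenizer.encode(text)` is the entire "tokenization-aware" change; everything downstream of the shingle set is identical.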

Performance

A cheap objection: tokenising at scrape time adds latency. In practice, Hugging Face's tokenizers library (the Rust-backed implementation) processes on the order of hundreds of thousands of tokens per second per CPU core, well above realistic scraper throughput per worker. The signature computation is ~30ms for a typical 1024-token window. At our scrape rates this sits comfortably inside the latency we already pay to fetch and parse the document.

Redis lookup for LSH bucketing is ~1ms. ScyllaDB for longer-horizon state is ~3ms. Neither is a bottleneck.
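These numbers are from our hardware and our document mix, so measure your own. A tiny best-of-N timing helper is enough to check where your signature computation lands (the helper name is ours, nothing standard):

```python
import time

def bench_ms(fn, repeat=5):
    """Best-of-N wall-clock time for one call to fn, in milliseconds.
    Best-of rather than mean, so OS scheduler noise inflates it less."""
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best * 1000.0
```

Wrap your tokenize-plus-signature path in a lambda, e.g. `bench_ms(lambda: signature(tokenize(doc)))`, and compare against your per-document fetch latency; if the dedup step is an order of magnitude under the network time, it's free in practice.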

What we learned

  • Storage savings compound with scrape count. A 14% dedup rate at week six grows by week twelve, because the tail of recurring sources keeps accumulating.
  • Provenance matters more than we thought. Keeping the seen-as: reference per skipped document lets us answer "was this document available when?" at retrieval time without storing the duplicate.
  • Don't dedup across languages. A Spanish-language Reuters story and its English original can have near-identical token sequences after BPE — don't collapse them. Pin the MinHash comparison to same-language.
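The language pin is a one-line change if you fold the language code into the LSH bucket key, so signatures from different languages live in disjoint keyspaces and can never collide. A sketch of the key scheme (the format is hypothetical, not a standard):

```python
def bucket_key(lang, band_index, band_hash):
    """LSH bucket key pinned to document language: an es-language
    signature can never land in an en-language bucket, so a
    translation is never collapsed into its original."""
    return f"lsh:{lang}:{band_index}:{band_hash:x}"
```

The language code itself comes from whatever language-ID stage the pipeline already runs (fastText's lid.176 is the usual choice); the key scheme only has to be consistent between insert and query.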

What still needs post-processing

This scrape-time approach doesn't replace full-corpus MinHash before training. Cross-shard near-duplicates, and duplicates separated by more than the rolling window, still need the heavy post-processing pass. But the scrape-time stage substantially shrinks the input to that pass, which makes full-corpus dedup tractable at sizes where it previously wasn't.

References

  • Lee et al., "Deduplicating Training Data Makes Language Models Better" (arXiv:2107.06499)
  • LSHBloom: Internet-Scale Text Deduplication (arXiv:2411.04257)
  • Meta Llama 3 model card and accompanying technical report
