Research engineer — Common Crawl + open data

Elena Novak

Research engineer at SquadProxy focused on Common Crawl sampling, open-web corpus construction, and the provenance chains that make research-grade training data defensible.

Six years on data pipelines for open-data projects, including direct Common Crawl operations experience and open-source dataset curation.

Elena's background is in the open-data side of training-corpus construction: how Common Crawl actually samples the web, how derivative datasets propagate (and sometimes amplify) the resulting biases, and how to document provenance well enough that a research paper citing your dataset doesn't collapse when a reviewer asks "where did this come from?"
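As a concrete illustration of that provenance point, here is a minimal sketch of a per-document provenance record for text drawn from a Common Crawl archive. The schema and field names are hypothetical, not a standard; the WARC path and offsets are placeholders. The idea is simply to capture enough detail that a reviewer could re-fetch the exact bytes behind any document.

```python
# Hypothetical sketch: a minimal provenance record for one document
# sampled from a Common Crawl monthly archive. Field names are
# illustrative, not any published schema.
import hashlib
import json

def provenance_record(url, warc_path, warc_offset, warc_length,
                      crawl_id, text):
    """Record enough detail that the exact source bytes can be re-fetched."""
    return {
        "source": "Common Crawl",
        "crawl_id": crawl_id,               # e.g. "CC-MAIN-2024-10"
        "url": url,
        "warc_path": warc_path,             # path within the crawl's storage
        "warc_record_offset": warc_offset,  # byte offset of the WARC record
        "warc_record_length": warc_length,
        "text_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }

# Illustrative values only.
record = provenance_record(
    url="https://example.com/",
    warc_path="crawl-data/CC-MAIN-2024-10/segments/0/warc/0.warc.gz",
    warc_offset=123456,
    warc_length=7890,
    crawl_id="CC-MAIN-2024-10",
    text="Example page text.",
)
print(json.dumps(record, indent=2))
```

Hashing the extracted text alongside the WARC coordinates lets a reviewer verify both that the record points at real crawl data and that the text was not silently altered afterward.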

Background

Elena worked at an open-data collaborative (a public-benefit data-sharing organization) before joining SquadProxy. Her familiarity with Common Crawl's quirks comes from first-hand sampling work across several years of its monthly archives.

Writing on SquadProxy

What she's working on

An expanded analysis of sampling bias across Common Crawl derivatives (Dolma, SlimPajama, RefinedWeb, FineWeb-Edu) that quantifies how different post-processing choices amplify or attenuate the base-CC biases. Target publication Q3 2026.
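One way to make "amplify or attenuate" concrete is to compare a domain's share of documents in the base crawl against its share in a derivative, and report the ratio. The sketch below uses made-up counts and a deliberately simple per-domain share metric; the actual analysis may use different statistics entirely.

```python
# Hypothetical sketch: quantifying whether a derivative dataset amplifies
# or attenuates a base-crawl bias, via per-domain document shares.
# All counts here are invented for illustration.
from collections import Counter

def domain_shares(domain_counts):
    """Normalize raw document counts into per-domain shares."""
    total = sum(domain_counts.values())
    return {d: n / total for d, n in domain_counts.items()}

def amplification(base_counts, derived_counts):
    """Ratio of a domain's share in the derivative to its share in the base.
    A ratio > 1 means post-processing amplified that domain's presence;
    < 1 means it was attenuated."""
    base = domain_shares(base_counts)
    derived = domain_shares(derived_counts)
    return {d: derived.get(d, 0.0) / base[d] for d in base}

base = Counter({"en.wikipedia.org": 50, "example-blog.net": 30, "forum.example": 20})
derived = Counter({"en.wikipedia.org": 70, "example-blog.net": 25, "forum.example": 5})

for domain, ratio in sorted(amplification(base, derived).items()):
    print(f"{domain}: {ratio:.2f}")
```

Run over real base-CC and derivative snapshots, the same ratio (or a more robust statistic over many domains) gives a direct, citable number for how much a given post-processing pipeline shifts the corpus toward or away from particular sources.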

Contact

For questions about Common Crawl processing, open-data dataset construction, or provenance chains for published research, email hello@squadproxy.com with "open-data" in the subject line.
