Research engineer — Common Crawl + open data

Elena Novak

Research engineer at SquadProxy focused on Common Crawl sampling, open-web corpus construction, and the provenance chains that make research-grade training data defensible.

Six years on data pipelines for open-data projects, including direct Common Crawl operations experience and open-source dataset curation.

Elena's background is in the open-data side of training-corpus construction: how Common Crawl actually samples the web, how derivative datasets propagate (and sometimes amplify) the resulting biases, and how to document provenance well enough that a research paper citing your dataset doesn't collapse when a reviewer asks "where did this come from?"
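As a concrete illustration of that provenance point, here is a minimal sketch of a per-document provenance record for text drawn from a Common Crawl archive. The schema and field names are hypothetical, not a standard; the WARC path and offsets are placeholders. The idea is simply to capture enough detail that a reviewer could re-fetch the exact bytes behind any document.

```python
# Hypothetical sketch: a minimal provenance record for one document
# sampled from a Common Crawl monthly archive. Field names are
# illustrative, not any published schema.
import hashlib
import json

def provenance_record(url, warc_path, warc_offset, warc_length,
                      crawl_id, text):
    """Record enough detail that the exact source bytes can be re-fetched."""
    return {
        "source": "Common Crawl",
        "crawl_id": crawl_id,               # e.g. "CC-MAIN-2024-10"
        "url": url,
        "warc_path": warc_path,             # path within the crawl's storage
        "warc_record_offset": warc_offset,  # byte offset of the WARC record
        "warc_record_length": warc_length,
        "text_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }

# Illustrative values only.
record = provenance_record(
    url="https://example.com/",
    warc_path="crawl-data/CC-MAIN-2024-10/segments/0/warc/0.warc.gz",
    warc_offset=123456,
    warc_length=7890,
    crawl_id="CC-MAIN-2024-10",
    text="Example page text.",
)
print(json.dumps(record, indent=2))
```

Hashing the extracted text alongside the WARC coordinates lets a reviewer verify both that the record points at real crawl data and that the text was not silently altered afterward.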

Background

Elena worked at an open-data collaborative (a public-benefit data-sharing organization) before joining SquadProxy. Her familiarity with Common Crawl's quirks comes from first-hand sampling work across several years of its monthly archives.

Writing on SquadProxy

What she's working on

An expanded analysis of sampling bias across Common Crawl derivatives (Dolma, SlimPajama, RefinedWeb, FineWeb-Edu) that quantifies how different post-processing choices amplify or attenuate the base-CC biases. Target publication Q3 2026.
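One way to make "amplify or attenuate" concrete is to compare a domain's share of documents in the base crawl against its share in a derivative, and report the ratio. The sketch below uses made-up counts and a deliberately simple per-domain share metric; the actual analysis may use different statistics entirely.

```python
# Hypothetical sketch: quantifying whether a derivative dataset amplifies
# or attenuates a base-crawl bias, via per-domain document shares.
# All counts here are invented for illustration.
from collections import Counter

def domain_shares(domain_counts):
    """Normalize raw document counts into per-domain shares."""
    total = sum(domain_counts.values())
    return {d: n / total for d, n in domain_counts.items()}

def amplification(base_counts, derived_counts):
    """Ratio of a domain's share in the derivative to its share in the base.
    A ratio > 1 means post-processing amplified that domain's presence;
    < 1 means it was attenuated."""
    base = domain_shares(base_counts)
    derived = domain_shares(derived_counts)
    return {d: derived.get(d, 0.0) / base[d] for d in base}

base = Counter({"en.wikipedia.org": 50, "example-blog.net": 30, "forum.example": 20})
derived = Counter({"en.wikipedia.org": 70, "example-blog.net": 25, "forum.example": 5})

for domain, ratio in sorted(amplification(base, derived).items()):
    print(f"{domain}: {ratio:.2f}")
```

Run over real base-CC and derivative snapshots, the same ratio (or a more robust statistic over many domains) gives a direct, citable number for how much a given post-processing pipeline shifts the corpus toward or away from particular sources.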

Contact

For questions about Common Crawl processing, open-data dataset construction, or provenance chains for published research, email hello@squadproxy.com with "open-data" in the subject line.
