Elena Novak
Research engineer at SquadProxy focused on Common Crawl sampling, open-web corpus construction, and the provenance chains that make research-grade training data defensible.
Six years on data pipelines for open-data projects, including direct Common Crawl operations experience and open-source dataset curation.
Elena's background is in the open-data side of training-corpus construction: how Common Crawl actually samples the web, how derivative datasets propagate (and sometimes amplify) the resulting biases, and how to document provenance well enough that a research paper citing your dataset doesn't collapse when a reviewer asks "where did this come from?"
Background
Elena worked at an open-data collaborative (a public-benefit data-sharing organization) before joining SquadProxy. Her familiarity with Common Crawl's quirks comes from first-hand sampling work across several years of the monthly archives.
What she's working on
An expanded analysis of sampling bias across Common Crawl derivatives (Dolma, SlimPajama, RefinedWeb, FineWeb-Edu) that quantifies how different post-processing choices amplify or attenuate the biases of the base Common Crawl data. Target publication: Q3 2026.
Contact
For questions about Common Crawl processing, open-data dataset construction, or provenance chains for published research, email hello@squadproxy.com with "open-data" in the subject line.