A single script can scrape a few thousand pages. Push it to millions of pages, on a schedule, with failures handled and data you can trust, and that script breaks. Large scale extraction is a distributed systems problem. This guide is a reference architecture for a web scraping system that is reliable, observable, and compliant by design.
Key takeaways
- Use a decoupled, queue driven pipeline so a slow site or a parsing spike never stalls the whole system.
- A URL frontier schedules and prioritises work and enforces politeness per domain.
- Default to plain HTTP fetching. Use a headless browser only when content truly needs JavaScript.
- Treat retries with backoff, a dead letter queue, deduplication, and data validation as core parts of the design, not optional extras.
- Legal and ethical limits are part of the architecture: respect robots.txt and terms, rate limit politely, and never harvest personal data.
Why single machine scrapers break
One process has a ceiling. Past a certain scale it cannot fetch fast enough, and everything is coupled, so one slow or blocked site stalls the entire job. There is no clean recovery when it crashes halfway. The answer is to break the work into independent stages that scale and fail on their own.
The reference architecture
A scalable scraper is a pipeline of decoupled stages connected by queues. Each stage does one job and can be scaled by itself.
The URL frontier and scheduler
The frontier is the brain. It decides what to crawl next, in what order, and how often to revisit. It does two jobs at once. A front queue prioritises by value and freshness, so important or fast changing pages get crawled sooner. A back queue enforces politeness so you never hit one domain too hard. The frontier lives outside process memory, in a database or distributed queue, because the set of discovered but unvisited URLs grows far faster than workers can fetch it, and it has to survive a crash or restart. It also normalises URLs and limits crawl depth so the crawler does not fall into crawler traps, such as calendar pages that link to the next month forever.
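The two roles of the frontier can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: the class name `Frontier`, the `delay_seconds` and `max_depth` parameters, and the "lower number means higher priority" convention are all assumptions made for the example, and a real frontier would keep this state in a database or distributed queue as described above.

```python
import heapq
import time
from urllib.parse import urlsplit


class Frontier:
    """Toy frontier: a priority heap (front queue) plus a per-domain rest window (back queue)."""

    def __init__(self, delay_seconds=1.0, max_depth=5):
        self.delay_seconds = delay_seconds  # assumed politeness gap per domain
        self.max_depth = max_depth          # assumed depth limit against crawler traps
        self._heap = []                     # entries: (priority, sequence, url, depth)
        self._next_allowed = {}             # domain -> earliest time of the next fetch
        self._seq = 0                       # tie-breaker so equal priorities stay FIFO

    def add(self, url, priority, depth=0):
        """Queue a URL unless it is too deep. Lower priority numbers are served first."""
        if depth > self.max_depth:
            return False
        heapq.heappush(self._heap, (priority, self._seq, url, depth))
        self._seq += 1
        return True

    def pop_ready(self, now=None):
        """Return the best URL whose domain is outside its rest window, or None."""
        now = time.monotonic() if now is None else now
        deferred = []
        result = None
        while self._heap:
            item = heapq.heappop(self._heap)
            domain = urlsplit(item[2]).netloc
            if self._next_allowed.get(domain, 0.0) <= now:
                self._next_allowed[domain] = now + self.delay_seconds
                result = item[2]
                break
            deferred.append(item)  # domain is still resting, try the next-best URL
        for item in deferred:
            heapq.heappush(self._heap, item)
        return result
```

Keeping priority and politeness as separate checks means a high value page on a busy domain waits its turn while lower priority pages on idle domains keep the workers busy.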
Work distribution and the message broker
A message queue sits between the frontier and the workers. The frontier produces URLs, a pool of workers consumes them, and you scale throughput by adding workers. The broker you choose is a real tradeoff.
| Broker | Strengths | Notes |
|---|---|---|
| Redis or RabbitMQ (often through a task queue library such as Celery or Bull) | Light, simple, fast to start | Great for small to mid scale |
| AWS SQS | Managed, visibility timeout, native dead letter queue | Low ops, good default in the cloud |
| Apache Kafka | Very high throughput, durable log, replay | More to operate, best at large scale |
Bound your queues and watch their depth. Queue depth is your natural throttle: if it keeps growing, producers are adding work faster than consumers can clear it, and that signal should drive autoscaling of the consuming stage or slow the producing one. This is backpressure, and it beats ad hoc sleeps for keeping the system stable.
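As a rough sketch of backpressure, the example below uses Python's standard `queue.Queue` with a `maxsize` as a stand-in for a real broker. The helper names `offer` and `desired_workers`, and the parameters `per_worker_rate` and `drain_seconds`, are hypothetical and chosen only to show the idea of turning queue depth into a scaling target.

```python
import math
import queue


def offer(q, item):
    """Try to enqueue without blocking. False tells the producer to slow down."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


def desired_workers(depth, per_worker_rate, drain_seconds, min_workers=1, max_workers=50):
    """Workers needed to drain `depth` items within `drain_seconds`, clamped to a range.

    per_worker_rate is items per second that one worker handles (an assumed, measured value).
    """
    needed = math.ceil(depth / (per_worker_rate * drain_seconds))
    return max(min_workers, min(max_workers, needed))
```

For example, with a backlog of 3000 URLs, workers that each handle 5 URLs per second, and a target of draining the backlog in 60 seconds, `desired_workers(3000, 5.0, 60)` asks for 10 workers.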
Fetcher workers: HTTP versus headless
Most of the web can be fetched with a plain HTTP request, which is fast and cheap. Reach for a headless browser such as Playwright or Puppeteer only when the content is rendered by JavaScript and is not in the raw HTML. Browsers are 10 to 100 times more expensive in CPU and memory, so do not make them the default. When you do need them at scale, run a pool, recycle browser contexts between jobs, cap memory, and scale the pool horizontally, for example on Kubernetes.
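One simple way to decide between the two fetchers is to fetch the raw HTML first and check whether the content you need is already there. The function below is a hedged heuristic, not a general detector: `needs_browser` and its `required_markers` argument are hypothetical names, and the markers are whatever strings you expect in a fully rendered page for that site.

```python
def needs_browser(raw_html, required_markers):
    """True if any marker expected in the rendered page is absent from the raw HTML."""
    lowered = raw_html.lower()
    return any(marker.lower() not in lowered for marker in required_markers)


# A server-rendered page already contains the price element.
static_page = '<html><body><span class="price">19.99</span></body></html>'
# A JavaScript shell only contains a mount point and a script tag.
js_shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

print(needs_browser(static_page, ['class="price"']))  # False
print(needs_browser(js_shell, ['class="price"']))     # True
```

Running this check once per site or template, and caching the answer, keeps the expensive browser pool reserved for the pages that need it.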
Proxy and rotation, done responsibly
At scale you distribute requests across many IP addresses so you do not overload a single source or trip rate limits. Datacenter proxies are cheap, while residential and mobile proxies look more like normal users. Size the pool so that every IP can sit out its per domain cooldown before it is reused, so no single IP hammers one site within its rest window. The right framing is polite distribution of load on public data, not evading access controls. If a site blocks logged out access or sits behind a login, that is a signal to stop, not a puzzle to defeat.
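The cooldown rule can be made concrete with a small selector that tracks when each proxy last touched each domain. This is an illustrative sketch: `ProxyPool`, `cooldown_seconds`, and the proxy labels are hypothetical, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class ProxyPool:
    """Hands out proxies so that no proxy revisits a domain inside its rest window."""

    def __init__(self, proxies, cooldown_seconds):
        self.proxies = list(proxies)
        self.cooldown_seconds = cooldown_seconds
        self._last_used = {}  # (proxy, domain) -> timestamp of last use

    def _last(self, proxy, domain):
        return self._last_used.get((proxy, domain), float("-inf"))

    def acquire(self, domain, now):
        """Return the least recently used rested proxy for this domain, or None."""
        rested = [
            p for p in self.proxies
            if now - self._last(p, domain) >= self.cooldown_seconds
        ]
        if not rested:
            return None  # every proxy is resting: the caller should wait, not force it
        choice = min(rested, key=lambda p: self._last(p, domain))
        self._last_used[(choice, domain)] = now
        return choice
```

Returning `None` when the whole pool is resting is deliberate in this sketch: an exhausted pool is a signal to slow the crawl for that domain, not to reuse an IP early.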
Reliability: retries, backoff, idempotency
Networks fail and sites have bad moments, so reliability is built in, not bolted on.
- Exponential backoff with jitter: on failure, wait longer each time, with randomness so workers do not retry in lockstep.
- A retry budget: try 3 to 5 times, then send the URL to a dead letter queue for review instead of silently dropping it.
- Idempotency: write results in a way that is safe to repeat, so a redelivered message does not create duplicates.
- Conditional requests: send If-None-Match (with the ETag you stored from the last response) or If-Modified-Since, and accept a 304 Not Modified to skip re-downloading unchanged pages, which saves bandwidth and respects the site.
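The retry rules above can be sketched as follows. This is a minimal illustration under stated assumptions: `fetch` is any callable you supply, `dead_letters` is a plain list standing in for a real dead letter queue, and the function names and defaults are hypothetical. A real worker would retry only on errors it knows to be transient rather than on every exception.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full jitter: a random wait between 0 and min(cap, base * 2**attempt) seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def fetch_with_retries(url, fetch, dead_letters, max_attempts=4, sleep=time.sleep):
    """Call fetch(url) up to max_attempts times, then park the URL for review."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception as exc:  # illustration only: narrow this to transient errors
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    dead_letters.append((url, repr(last_error)))  # never drop a failure silently
    return None


def conditional_headers(etag=None, last_modified=None):
    """Build request headers that let the server answer 304 Not Modified."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers
```

Idempotency is the matching rule on the write side: if results are stored under a stable key such as the canonical URL, a redelivered message overwrites the same record instead of adding a second one.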
Rate limiting and politeness
Politeness is both ethics and survival. Hammer a site and you will get blocked, and you may cause real harm. Keep a per domain limiter, typically around one request per second per domain, or whatever the site's robots.txt Crawl-delay states. Slow down automatically when you see HTTP 429 or 503, which mean the server is asking you to back off, and honor a Retry-After header when one is sent. Schedule heavy crawls for off peak hours where you can.
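A per domain limiter with automatic slowdown might look like the sketch below. The class name `DomainLimiter`, the doubling and halving policy, and the default values are assumptions for illustration. The starting delay could come from a site's Crawl-delay, which Python's standard `urllib.robotparser.RobotFileParser.crawl_delay()` exposes when the directive is present.

```python
class DomainLimiter:
    """Tracks a delay per domain, backing off on 429/503 and recovering on success."""

    def __init__(self, default_delay=1.0, max_delay=60.0):
        self.default_delay = default_delay
        self.max_delay = max_delay
        self._delay = {}      # domain -> current delay in seconds
        self._next_time = {}  # domain -> earliest time of the next request

    def wait_time(self, domain, now):
        """Seconds the caller should still wait before hitting this domain."""
        return max(0.0, self._next_time.get(domain, 0.0) - now)

    def record(self, domain, status, now, retry_after=None):
        """Update the domain's delay from the response status just received."""
        delay = self._delay.get(domain, self.default_delay)
        if status in (429, 503):
            delay = min(self.max_delay, delay * 2)  # server asked us to back off
            if retry_after is not None:
                delay = min(self.max_delay, max(delay, retry_after))
        else:
            delay = max(self.default_delay, delay / 2)  # ease back toward normal
        self._delay[domain] = delay
        self._next_time[domain] = now + delay
```

Halving the delay on success rather than resetting it is a design choice in this sketch: it avoids bouncing straight back to full speed against a server that has only just recovered.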
Deduplication and data quality
Two kinds of duplication waste resources and dirty your data. URL duplication is handled by canonicalising URLs (stripping session and tracking parameters) and checking a seen set. A Bloom filter is a memory efficient way to ask "have I seen this URL" at scale: it never gives a false negative, so a "not seen" answer is always trustworthy. It can give a false positive, so a small, tunable fraction of new URLs will be wrongly treated as seen and skipped. Content duplication, where the same content appears under different URLs, is caught by hashing the page content and comparing hashes, which finds exact copies. Near duplicates need a similarity fingerprint such as SimHash.
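Both layers can be sketched with the standard library alone. Everything named here is an assumption made for the example: the `TRACKING_PARAMS` list is a small illustrative sample rather than a complete one, and the `BloomFilter` sizing (`size_bits`, `num_hashes`) is arbitrary and would need tuning to your URL volume and acceptable false positive rate.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed sample of parameters that do not change page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "fbclid", "gclid"}


def canonicalise(url):
    """Lowercase scheme and host, drop tracking params and fragment, sort the query."""
    parts = urlsplit(url)
    query = sorted(
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
    )
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), ""))


class BloomFilter:
    """Fixed-size bit array with k derived hash positions per item."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self._bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self._bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def content_fingerprint(text):
    """Hash of whitespace-normalised text. Catches exact copies only, not near duplicates."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()
```

In use, a worker would canonicalise each discovered URL, skip it if `url in seen`, and otherwise call `seen.add(url)` before enqueueing it, then compare `content_fingerprint` values after fetching to drop pages that repeat under a different address.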
Then validate. Check types, ranges, and completeness, and alert on schema drift, which is when a site changes its layout and your selectors quietly start returning empty fields. A useful pattern is a three tier data model: raw captured data, a cleaned and validated layer, and an analytics ready layer. Field level null rates are often the first sign that a scraper has silently broken.
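Field level null rates are cheap to compute per batch. The sketch below assumes records are plain dictionaries and treats `None`, an empty string, or a missing key as empty. The function names and the `threshold` value are hypothetical and should be set from each field's normal baseline.

```python
def null_rates(records, fields):
    """Fraction of records in which each field is missing, None, or an empty string."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for r in records if r.get(field) in (None, "")) / total
        for field in fields
    }


def drifted_fields(records, fields, threshold=0.2):
    """Fields whose null rate exceeds the threshold, a likely sign of schema drift."""
    rates = null_rates(records, fields)
    return sorted(field for field, rate in rates.items() if rate > threshold)


batch = [
    {"title": "A", "price": 10.0},
    {"title": "B", "price": None},
    {"title": "C"},
    {"title": "", "price": 5.0},
]
print(null_rates(batch, ["title", "price"]))           # {'title': 0.25, 'price': 0.5}
print(drifted_fields(batch, ["title", "price"], 0.3))  # ['price']
```

Comparing each batch against the previous one, rather than against a fixed number, tends to catch a layout change sooner for fields that are legitimately sparse.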
Observability and self healing
Run a scraper blind and you will not notice it is broken until the data is wrong. Track success rate, block rate, field level extraction rate, queue depth, worker throughput, and data freshness. A sudden rise in empty fields usually means a target site changed its structure. With queue depth driving autoscaling and structure change alerts driving fixes, the system becomes mostly self healing.
Legal and ethical compliance
This is not a footnote. It shapes the architecture and it protects you. Responsible scraping follows a few firm rules.
- Respect robots.txt and Crawl-delay. They are not always legally binding, but ignoring them is a clear bad faith signal and, in the EU, honoring opt outs supports a lawful basis.
- Honor terms of service and the public data line. In hiQ versus LinkedIn, the US Ninth Circuit found that scraping publicly accessible data is likely not unauthorized access under the CFAA, but a site's terms can still be enforceable, so stay logged out and never bypass paywalls or logins.
- Avoid personal data. Public personal data is still protected under GDPR and CCPA, and you need a lawful basis to process it. Regulators have issued very large fines for scraping personal data at scale, so the safe and ethical default is to not collect it.
We build extraction systems for public, non personal data, with provenance and audit trails, and we decline work that needs personal data harvested or access controls bypassed. Compliance is a feature, not a limitation. For more on putting automation to work safely, see our guide to business tasks worth automating.
FAQ
What is the core architecture of a scalable web scraping system?
A decoupled, queue driven pipeline: a URL frontier prioritises and gates URLs, a message broker hands work to a pool of fetcher workers, a proxy layer spreads polite load, parsers extract data, a dedup and validation stage cleans it, and storage plus monitoring close the loop. Each stage scales on its own, so one slow site never stalls everything.
How do you deduplicate data at large scale?
In two layers. First, canonicalise URLs and check a seen set, where a Bloom filter answers "have I seen this" using little memory. Second, hash page content to catch identical pages served under different URLs, and add a similarity fingerprint such as SimHash if near identical pages must also be caught, so each unique record is stored once.
Is large scale web scraping legal?
Scraping public, non personal data is generally lawful in the US and EU when you stay logged out, respect robots.txt and a site's terms, rate limit politely, and avoid personal data. Courts have held public data is not unauthorized access under the CFAA, but terms of service and GDPR or CCPA rules on personal data still apply, so responsible systems avoid personal data and never bypass access controls.
Working with Apex Logic
We build reliable, observable, and compliant data extraction pipelines for public web data. If you need clean data at scale without the legal risk, see our services or tell us what data you need for a fixed, fair quote.
References
System design references on web crawlers - URL frontier, dedup, and throughput figures.
hiQ Labs versus LinkedIn - public data and the US CFAA.
GDPR and CCPA guidance - personal data obligations that apply to scraped data.
Apex Logic project data (2024 to 2026) - compliant extraction pipelines.