
AI News

14 Jan 2026

9 min read

Fix HTTP 429 when scraping with 7 proven tactics

Fix HTTP 429 when scraping and regain steady access with seven proven tactics to bypass rate limits.

To fix HTTP 429 when scraping, slow your requests, honor Retry-After, cap concurrency, and add exponential backoff. Rotate IPs and user agents, keep sessions stable, and cache pages. Use first‑party APIs when possible and schedule crawls. These seven tactics reduce blocks and keep your data pipeline healthy.

Scrapers hit 429 when a site says, “Too many requests.” It is a rate limit, not a bug. Good scrapers act like polite users. They space out calls, reuse sessions, and follow rules. If you need to fix HTTP 429 when scraping at scale, you must lower your footprint and look more like normal traffic.

Why sites return 429

  • Bursts of requests hit the same host or path too fast.
  • Many parallel workers open too many connections at once.
  • Bot signatures: identical user agents, missing cookies, no referrer.
  • Ignored rules: robots.txt, crawl-delay, or published API limits.
  • Repeat downloads of unchanged pages that waste server budget.

    The fix starts with control: limit velocity, respect signals, and spread load.

    7 proven ways to fix HTTP 429 when scraping

    1) Respect rate limits and the Retry-After header

    If the server sends Retry-After, wait that long before the next try. If there is no header, back off for at least 30–60 seconds on repeat 429s. Read docs for posted limits, like “60 requests per minute.” Quick steps:
  • Track requests per host and throttle to a safe ceiling.
  • Use per-endpoint limits if they differ (e.g., /search vs /item).
  • Log and audit 429s to tune your ceiling.
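The header handling above can be sketched with the standard library alone. The `parse_retry_after` helper and its 45-second default are illustrative, not from any particular framework; Retry-After may arrive as a number of seconds or as an HTTP-date, and both forms are handled.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_retry_after(value, default=45):
    """Return how many seconds to wait, given a Retry-After header value.

    The header may be a plain number of seconds or an HTTP-date.
    Fall back to `default` when the header is missing or unparsable.
    """
    if not value:
        return default
    try:
        return max(0, int(value))
    except ValueError:
        pass
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2026 07:28:00 GMT"
        when = parsedate_to_datetime(value)
        return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return default

# Usage after a 429 response `resp` (requires `import time`):
# time.sleep(parse_retry_after(resp.headers.get("Retry-After")))
```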

    2) Add exponential backoff with jitter

    Do not retry at fixed times. Use exponential backoff (e.g., 1s, 2s, 4s, 8s) plus jitter (a small random delay). This spreads retries so you do not create new bursts that trigger more 429s. Quick steps:
  • On 429 or 503, pause with backoff; reset after a success.
  • Add a random 10–30% jitter to each delay.
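The retry schedule above can be sketched as a generator. The name `backoff_delays` and the defaults (30% jitter, 60-second cap) are illustrative choices, not fixed rules:

```python
import random

def backoff_delays(base=1.0, cap=60.0, jitter=0.3):
    """Yield exponentially growing delays (1s, 2s, 4s, ...) with jitter."""
    delay = base
    while True:
        # Add up to `jitter` (e.g. 30%) of random slack so parallel
        # workers do not all retry at the same instant.
        yield delay + delay * random.uniform(0, jitter)
        delay = min(cap, delay * 2)

# Usage on a 429/503 (requires `import time`):
# delays = backoff_delays()
# time.sleep(next(delays))   # ...and recreate the generator after a success
```

Recreating the generator after a successful request is what resets the schedule back to the base delay.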

    3) Cap concurrency and queue requests

    Most 429s come from too much parallelism. Limit concurrent requests per domain and per path. Use a central queue so workers do not stampede the same host. Quick steps:
  • Set global concurrency (e.g., 5–10 per domain).
  • Use HTTP/2 keep-alive to reuse connections.
  • Spread tasks in time with a token bucket or leaky bucket.

    This alone can fix HTTP 429 when scraping busy pages because it keeps your peak load low.
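A minimal token-bucket limiter, assuming a threaded worker pool; the class name and the example rates are illustrative. Each worker calls `acquire()` before sending a request, and a `threading.Semaphore` alongside it caps how many requests are in flight at once:

```python
import threading
import time

class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate              # tokens added per second
        self.capacity = capacity     # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        """Block until one token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                # Refill tokens based on elapsed time, never past capacity.
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                needed = (1 - self.tokens) / self.rate
            time.sleep(needed)  # sleep outside the lock

# Example per-domain setup (numbers are illustrative):
# bucket = TokenBucket(rate=2.0, capacity=5)   # ~2 req/s, bursts of 5
# in_flight = threading.Semaphore(5)           # at most 5 concurrent requests
```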

    4) Rotate IPs (ethically) and regions

    Many sites rate limit per IP. Use a quality proxy pool (datacenter or residential) with clear compliance and opt-in sources. Rotate slowly, not on every request, to avoid fingerprints. Quick steps:
  • Bind 20–50 requests to one IP before rotating.
  • Match region to target audience (e.g., US site → US IPs).
  • Avoid known abusive ranges; monitor blocklists.

    Always check the site’s terms. Some sites ban automated access. If that is the case, seek permission or use their official API.
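Session-bound rotation can be sketched as below. The `ProxyRotator` class and the proxy URLs are hypothetical; the point is that each proxy serves a randomized run of 20–50 requests before the pool advances, rather than rotating on every call:

```python
import itertools
import random

class ProxyRotator:
    """Bind a fixed run of requests to each proxy before moving on."""

    def __init__(self, proxies, per_proxy=(20, 50)):
        self.pool = itertools.cycle(proxies)
        self.lo, self.hi = per_proxy
        self.current = next(self.pool)
        self.remaining = random.randint(self.lo, self.hi)

    def get(self):
        """Return the proxy to use for the next request."""
        if self.remaining <= 0:
            # Run exhausted: advance to the next proxy with a fresh budget.
            self.current = next(self.pool)
            self.remaining = random.randint(self.lo, self.hi)
        self.remaining -= 1
        return self.current

# Usage (placeholder proxy addresses):
# rotator = ProxyRotator(["http://proxy-a:8080", "http://proxy-b:8080"])
# proxy = rotator.get()   # pass to your HTTP client for this request
```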

    5) Randomize headers and keep sessions

    Stable, human-like sessions reduce suspicion. Random changes on every call look fake. Use real browser headers and keep them steady within a session. Quick steps:
  • Pick a modern user agent per session and reuse it.
  • Send common headers: Accept, Accept-Language, Referer, DNT.
  • Keep cookies; log in once if allowed, then reuse the session.
  • Avoid headless-only tells; consider a headless browser when needed.
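One way to sketch "pick once per session, then reuse": build a single header set up front and attach it to every call in that session. The user-agent strings below are stand-ins for a maintained, up-to-date pool:

```python
import random

# Example user-agent strings; a real pool should be kept current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Gecko/20100101 "
    "Firefox/121.0",
]

def new_session_headers():
    """Pick one realistic header set per session; reuse it for every call."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
                  "*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://example.com/",  # placeholder; set per target
        "DNT": "1",
    }

# Usage: call once at session start, then send the same dict (plus any
# cookies the site sets) with every request in that session.
```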

    6) Cache and use conditional requests

    If a page did not change, do not re-download it. Cache responses and use ETag and Last-Modified to ask only for updates. This cuts your request volume and reduces 429 risk. Quick steps:
  • Store ETag/Last-Modified, send If-None-Match/If-Modified-Since.
  • Honor Cache-Control and Expires headers.
  • Skip assets you do not need (images, fonts) to save requests.
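A minimal validator store illustrating the idea; the `ConditionalCache` class is hypothetical and pairs with any HTTP client. Store the validators from each 200 response, send them back on the next request, and serve the cached body when the server answers 304 Not Modified:

```python
class ConditionalCache:
    """Remember ETag/Last-Modified per URL and build revalidation headers."""

    def __init__(self):
        self.entries = {}  # url -> (etag, last_modified, body)

    def request_headers(self, url):
        """Headers to send so the server can answer 304 if unchanged."""
        etag, last_mod, _ = self.entries.get(url, (None, None, None))
        headers = {}
        if etag:
            headers["If-None-Match"] = etag
        if last_mod:
            headers["If-Modified-Since"] = last_mod
        return headers

    def store(self, url, etag, last_modified, body):
        """Record validators and body from a 200 response."""
        self.entries[url] = (etag, last_modified, body)

    def cached_body(self, url):
        """Body to reuse when the server returns 304 Not Modified."""
        return self.entries[url][2]
```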

    7) Prefer first‑party APIs, sitemaps, and scheduling

    APIs often have clear limits and stable data. Sitemaps show fresh URLs without heavy crawling. Smart schedules avoid peak hours. Quick steps:
  • Use official APIs where available.
  • Read robots.txt and respect crawl‑delay and disallow rules.
  • Crawl during off-peak times; spread jobs across the day.
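Python's standard library can read published crawl rules directly. A sketch, where the "mybot" agent name is illustrative:

```python
from urllib.robotparser import RobotFileParser

def crawl_policy(robots_txt, agent="mybot"):
    """Parse robots.txt text and report whether and how fast we may crawl."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {
        "allowed_root": rp.can_fetch(agent, "/"),
        "crawl_delay": rp.crawl_delay(agent),  # None if not published
    }

# Usage with a fetched robots.txt body:
# policy = crawl_policy(robots_text)
# if policy["crawl_delay"]:
#     minimum_gap = policy["crawl_delay"]   # seconds between requests
```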

    Implementation checklist

  • Add per-host throttling with a token bucket.
  • Implement exponential backoff with jitter on 429/503.
  • Honor Retry-After; parse it as seconds or HTTP date.
  • Limit concurrency; centralize a queue to avoid stampedes.
  • Use a compliant proxy pool; rotate by session, not per request.
  • Stabilize headers and cookies; reuse sessions.
  • Enable caching and conditional GETs.
  • Prefer APIs and sitemaps; schedule off-peak crawls.
  • Monitor 2xx/3xx/4xx/5xx rates; alert on rising 429s.

    Code and monitoring tips

  • Set timeouts and circuit breakers so blocked hosts rest before retry.
  • Group logs by domain to see which sites need stricter limits.
  • Record Retry-After values to choose better default waits.
  • Use per-site profiles: different caps, headers, and schedules.
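A per-domain status tracker along these lines might look like the sketch below; the `StatusMonitor` name and the 5% alert threshold are illustrative defaults, not recommendations from any specific tool:

```python
from collections import Counter, defaultdict

class StatusMonitor:
    """Track response status codes per domain and flag rising 429 rates."""

    def __init__(self, alert_ratio=0.05):
        self.counts = defaultdict(Counter)  # domain -> Counter of statuses
        self.alert_ratio = alert_ratio

    def record(self, domain, status):
        self.counts[domain][status] += 1

    def rate_limited(self, domain):
        """True when the 429 share for this domain crosses the threshold."""
        c = self.counts[domain]
        total = sum(c.values())
        return total > 0 and c[429] / total >= self.alert_ratio

# Usage: call record() after every response; when rate_limited() fires,
# tighten that domain's profile (lower caps, longer waits).
```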

    Common mistakes to avoid

  • Retrying immediately after 429 without waiting.
  • Scaling workers before fixing rate control.
  • Rotating user agents or IPs on every single request.
  • Ignoring robots.txt and posted API limits.
  • Re-downloading unchanged pages without conditional requests.

    Strong scrapers act politely. They ask less often, wait when told, and blend with normal traffic. You can fix HTTP 429 when scraping by lowering pressure, reading server signals, and choosing data sources that welcome automated access.

    Conclusion: When you combine throttling, backoff, smart concurrency, ethical IP rotation, stable sessions, caching, and API-first strategies, you fix HTTP 429 when scraping and keep your pipeline steady over time.



    FAQ

    Q: What does HTTP 429 mean when scraping?
    A: HTTP 429 means “Too Many Requests” and indicates a server-side rate limit rather than a bug. To fix HTTP 429 when scraping, slow your requests, honor Retry-After headers, cap concurrency, and add exponential backoff.

    Q: Why do sites return 429 errors?
    A: Sites return 429 when they detect bursts of requests to the same host or path, too many parallel workers, bot-like signatures (identical user agents, missing cookies, no referrer), ignored robots or API limits, or repeated downloads of unchanged pages. Good scrapers act like polite users by spacing calls, reusing sessions, and following rules.

    Q: What should I do immediately after receiving a 429?
    A: If the server sends a Retry-After header, wait the specified time before retrying; if there is no header, back off for at least 30–60 seconds on repeat 429s. Also implement exponential backoff with jitter and cap concurrency to help fix HTTP 429 when scraping.

    Q: How do I implement exponential backoff correctly?
    A: Use exponential delays (for example 1s, 2s, 4s, 8s) combined with a small random jitter (about 10–30%), and reset the delay after a successful request. Apply this strategy on 429 or 503 responses to spread retries and avoid creating new request bursts.

    Q: How can caching and conditional requests reduce 429s?
    A: Cache responses and use ETag or Last-Modified with If-None-Match/If-Modified-Since so you only fetch content that changed, and honor Cache-Control and Expires headers. Skipping unnecessary assets like images and fonts further cuts request volume.

    Q: What are safe practices for rotating IPs and user agents?
    A: Rotate IPs ethically by using a compliant proxy pool and binding multiple requests (for example 20–50) to one IP before rotating, and pick a modern user agent per session and reuse it. Match proxy region to the target site, avoid known abusive ranges, and check the site’s terms or prefer an official API if automated access is banned.

    Q: How should I cap concurrency to prevent 429s?
    A: Limit concurrent requests per domain to modest levels (around 5–10) and use a central queue with token-bucket or leaky-bucket algorithms to spread tasks in time. Reusing HTTP/2 keep-alive connections and keeping sessions stable also helps fix HTTP 429 when scraping busy pages.

    Q: What monitoring and profiling helps avoid future 429 errors?
    A: Log and audit 429s by domain, record Retry-After values to choose better default waits, and monitor 2xx/3xx/4xx/5xx rates with alerts on rising 429s. Use per-site profiles with different caps, headers, and schedules to proactively reduce blocks.
