
AI News

26 Jan 2026

Read 10 min

How to fix 403 when scraping and unblock your scraper

How to fix 403 when scraping and unblock pages quickly using retries, headers, and rotating proxies.

Learn how to fix 403 when scraping with a clear, step-by-step plan. Check if the page needs login, copy real browser headers, keep cookies, slow your requests, and rotate trusted proxies. When simple requests fail, switch to a real browser like Playwright, handle tokens, and respect robots.txt to lower block rates. A 403 means the server saw your request and refused to serve it. Sites do this to stop bots, preserve resources, or enforce rules. You can still collect data in a safe and stable way. This guide shows how to look like a real user, control your speed, and unblock your scraper without waste.

Understand what a 403 means (and what it does not)

Common status codes

  • 403 Forbidden: The server understood you but will not allow access.
  • 401 Unauthorized: You must log in or send a valid token.
  • 429 Too Many Requests: You sent requests too fast.

How sites detect bots

  • Missing or fake headers (User-Agent, Accept-Language, Referer).
  • No or stale cookies, missing CSRF tokens.
  • High speed, high concurrency, or fixed timing.
  • IP reputation, data center IPs, blocked regions.
  • JavaScript not executed; anti-bot checks not passed.
  • Browser fingerprint mismatch (fonts, WebGL, timezone, HTTP/2/TLS quirks).

Quick triage before deep fixes

  • Open the page in a normal browser. If it is behind login or paywall, you must authenticate.
  • Use your browser’s DevTools to copy request headers and cookies from a successful load.
  • Check robots.txt and the site’s terms. Make sure you are allowed to fetch the data.
  • Try from a different IP or network. Some ranges are blocked.
  • Reduce speed to 0.5–1 request per second and add jitter.
  • Fetch the homepage first, then the target page, carrying cookies forward.
  • Confirm you need JavaScript. If yes, plan to use a real browser.
  • Compare the 403 HTML body to a normal page. Some “403” pages return 200 with a block message.
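The last triage step, spotting block pages that hide behind a 200, can be automated with a small check. A minimal sketch; the helper name `looks_blocked` and the keyword list are illustrative and should be tuned per site:

```python
# Hypothetical helper: detect an explicit block (403/429) or a "soft 403",
# i.e. a 200 response whose body is actually a block page.
BLOCK_MARKERS = ("access denied", "forbidden", "captcha", "unusual traffic")

def looks_blocked(status: int, body: str) -> bool:
    """Return True when the response is an explicit or disguised block."""
    if status in (401, 403, 429):
        return True
    text = body.lower()
    return any(marker in text for marker in BLOCK_MARKERS)
```

Logging this signal per route lets you see whether a site blocks with real status codes or with disguised 200 pages.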

How to fix 403 when scraping: a step-by-step playbook

Before you begin, make a checklist. Start with the least invasive steps, and move to heavier tools only if needed.

Mimic a real browser

  • Use a modern User-Agent string (Chrome, Firefox, or Safari) and keep it consistent.
  • Send realistic headers: Accept, Accept-Language, Accept-Encoding, Referer, and Connection.
  • Use HTTP/2 if your client supports it. Many sites expect it.
  • Match language and timezone to your target audience.
  • Keep header order and casing stable across requests.
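The header advice above can be captured in one helper so every request reuses the same profile. A minimal sketch; the exact User-Agent string is an example, and whatever you choose should stay consistent across the session:

```python
# Hypothetical helper: build one browser-like header profile and reuse it.
def browser_headers(referer: str = "https://example.com/") -> dict:
    return {
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36"),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": referer,  # set to the page you navigated from
        "Connection": "keep-alive",
    }
```

Building the dict once and passing it to every request keeps header order and casing stable, which is part of what the server fingerprints.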

Keep and reuse cookies and tokens

  • Persist cookies between requests. Do not start a fresh session every time.
  • If the site shows a CSRF token, include it with form posts or AJAX calls.
  • If login is required, log in once per session, then reuse the session until it fails.
  • Fetch the homepage first to obtain consent or anti-bot cookies.
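A standard-library sketch of a cookie-keeping session; `make_session` is a hypothetical helper name, and the commented calls show the homepage-first pattern:

```python
import urllib.request
from http.cookiejar import CookieJar

# One CookieJar shared by an opener: cookies set on the first response
# (e.g. consent or anti-bot cookies from the homepage) are sent
# automatically on every later request in the same session.
def make_session():
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

# opener, jar = make_session()
# opener.open("https://example.com/")        # homepage first: collect cookies
# opener.open("https://example.com/target")  # same cookies carried forward
```

The same pattern applies with a `requests.Session` if you use that library; the point is one persistent jar per session, not a fresh one per request.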

Slow down and add randomness

  • Set a base delay (e.g., 800–1500 ms) and randomize it.
  • Limit concurrency. Start with 2–4 threads and scale carefully.
  • Use exponential backoff on 403/429 and stop after a few retries.
  • Stagger crawl times. Avoid bursts at the top of the minute.
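The pacing rules above can be sketched as two small helpers; the base delay, spread, cap, and multiplier are example values to tune:

```python
import random

def jittered_delay(base_ms: int = 800, spread_ms: int = 700) -> float:
    """Seconds to sleep between ordinary requests (randomized 0.8-1.5 s)."""
    return (base_ms + random.uniform(0, spread_ms)) / 1000.0

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry `attempt` (0-based) after a 403/429.

    Doubles each attempt and caps out, so a handful of retries never
    turns into a hammering loop.
    """
    return min(cap, base * (2 ** attempt))
```

Call `time.sleep(jittered_delay())` between pages and `time.sleep(backoff_delay(attempt))` inside the retry loop, giving up after a few attempts.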

Rotate IPs and choose the right proxies

  • Prefer residential or mobile proxies for tough sites.
  • Use sticky sessions to keep the same IP for a flow that needs state.
  • Rotate on 403 or after N pages per IP to spread load.
  • Match geolocation to the site’s main users or the content region.
  • Avoid known VPN/datacenter IPs when blocks are strict.
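A minimal rotation sketch under these rules; the `ProxyRotator` class and proxy URLs are illustrative, and real pools come from your provider:

```python
import itertools

class ProxyRotator:
    """Rotate to the next proxy on a block, or after a per-IP page budget."""

    def __init__(self, proxies, pages_per_ip=50):
        self._cycle = itertools.cycle(proxies)
        self._pages_per_ip = pages_per_ip
        self._pages = 0
        self.current = next(self._cycle)

    def after_response(self, status: int) -> str:
        """Record one fetched page; advance the proxy if needed."""
        self._pages += 1
        if status in (403, 429) or self._pages >= self._pages_per_ip:
            self.current = next(self._cycle)
            self._pages = 0
        return self.current
```

For flows that need state (login, carts), pin one sticky session to one proxy instead of rotating mid-flow.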

Use a real browser when needed

  • Adopt Playwright or Puppeteer to execute JavaScript and pass dynamic checks.
  • Enable stealth features to reduce bot signals. Keep a stable viewport and device metrics.
  • Block heavy assets (images, fonts, video) to save bandwidth while still running scripts.
  • Wait for network idle or a key selector before scraping.
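A sketch of the headless setup described above. The asset-blocking predicate is plain Python; the commented wiring assumes Playwright's Python sync API (`pip install playwright`) and a placeholder URL:

```python
# Resource types considered "heavy": skipped to save bandwidth while
# scripts, stylesheets, and XHR still run and anti-bot checks can pass.
HEAVY_TYPES = {"image", "media", "font"}

def should_block(resource_type: str) -> bool:
    """Decide whether to abort a request based on its resource type."""
    return resource_type in HEAVY_TYPES

# from playwright.sync_api import sync_playwright
# with sync_playwright() as p:
#     browser = p.chromium.launch(headless=True)
#     page = browser.new_page(viewport={"width": 1366, "height": 768})
#     page.route("**/*", lambda route: route.abort()
#                if should_block(route.request.resource_type)
#                else route.continue_())
#     page.goto("https://example.com/", wait_until="networkidle")
#     html = page.content()
#     browser.close()
```

Keeping the viewport and device metrics fixed across runs avoids the fingerprint churn that rotating them would create.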

Pass common anti-bot checks

  • Send a valid Referer when moving between pages.
  • Honor cookies set by consent banners and anti-bot providers.
  • Set sensible Accept-Language (e.g., en-US,en;q=0.9).
  • Align timezone, locales, and platform strings with your chosen User-Agent.

Handle CAPTCHAs and WAF pages

  • Detect challenge pages. Do not retry blindly.
  • Solve CAPTCHAs only if allowed and appropriate. Consider manual review or official site APIs.
  • Sometimes the best move is to pause and request access from the site owner.

Advanced signals that can trigger blocks

  • TLS and HTTP/2 fingerprints (JA3, cipher suites, frame settings). Managed proxy/bot services can help normalize this.
  • Header order and pseudo-headers in HTTP/2. Real browsers use consistent patterns.
  • WebGL, canvas, and font lists that do not match your User-Agent.
  • Too-perfect timing. Human traffic is noisy; add jitter to clicks and scrolls in headless runs.

Resilience: logging, metrics, and fallbacks

  • Log status codes, response sizes, and fetch times. Track 403 rate by route and IP.
  • Store sample HTML when a block occurs for later review.
  • Use circuit breakers: stop a task when 403 spikes and switch IP pools or slow down.
  • Cache pages with ETags/Last-Modified to avoid needless hits.
  • Queue retries with longer delays and a different session.
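The circuit-breaker idea can be sketched with a sliding window of recent statuses; the `BlockBreaker` name, window size, and trip threshold are example values:

```python
from collections import deque

class BlockBreaker:
    """Trip when the 403/429 rate over the last `window` responses spikes."""

    def __init__(self, window=50, max_block_rate=0.2):
        self._recent = deque(maxlen=window)
        self._max_rate = max_block_rate

    def record(self, status: int) -> None:
        self._recent.append(status in (403, 429))

    def tripped(self) -> bool:
        """True when the task should pause, slow down, or switch IP pools."""
        if len(self._recent) < self._recent.maxlen:
            return False  # not enough data yet
        return sum(self._recent) / len(self._recent) >= self._max_rate
```

Check `tripped()` after each response; when it fires, stop the task, switch IP pools or lengthen delays, then resume with a fresh window.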

Stay ethical and reduce risk

  • Respect robots.txt and terms. Ask for permission when in doubt.
  • Only collect public data you need. Do not scrape personal data without consent.
  • Identify yourself in a contact email or User-Agent comment where possible.
  • Prefer official APIs or exports if they exist. They are more stable and safer.

These steps resolve most 403 blocks. Start simple: copy real browser behavior, slow down, and keep state. Move to proxies and headless browsers only if needed. With good pacing, clean headers, and stable sessions, your block rate will fall. In short, focus on three pillars: look real, go slow, and keep sessions. Combine these with smart proxy use and respectful conduct, and your scraper will run longer with fewer blocks.

    (Source: https://www.axios.com/2026/01/21/google-anthropic-microsoft-education)


    FAQ

Q: What does a 403 error mean when scraping a page?
A: A 403 means the server understood your request and refused to allow access. Sites often return it to stop bots, preserve resources, or enforce site rules.

Q: How do websites detect and block scrapers?
A: Through missing or fake headers, absent or stale cookies, missing CSRF tokens, and unusual request speed or concurrency. They also use IP reputation, data-center ranges, JavaScript execution checks, and browser fingerprint mismatches such as WebGL, fonts, timezone, or TLS quirks.

Q: What quick triage steps should I take when I see a 403?
A: Open the page in a normal browser, copy request headers and cookies from a successful load, and confirm whether login or a paywall is required. Also try a different IP or network, check robots.txt and the site’s terms, and slow your requests to about 0.5–1 per second with added jitter.

Q: How can I mimic a real browser to reduce 403 responses?
A: Use a modern User-Agent string and send realistic headers such as Accept, Accept-Language, Accept-Encoding, Referer, and Connection while keeping header order and casing stable. If your client supports it, use HTTP/2 and match language, timezone, and other platform strings to your chosen User-Agent.

Q: Why are cookies and tokens important, and how should I handle them?
A: Persist cookies between requests, fetch the homepage first to obtain consent or anti-bot cookies, and reuse a logged-in session until it fails rather than starting fresh each time. Include CSRF tokens with form posts or AJAX calls and carry cookies forward to maintain state.

Q: When should I switch to a real browser like Playwright or Puppeteer?
A: When the site requires JavaScript execution or runs dynamic anti-bot checks that simple HTTP clients cannot pass. Use Playwright or Puppeteer with stealth features and stable viewport and device metrics, block heavy assets to save bandwidth, and wait for network idle or a key selector before scraping.

Q: How should I use proxies and IP rotation to lower block rates?
A: Prefer residential or mobile proxies for tougher sites, use sticky sessions when a flow needs a consistent IP, and rotate IPs on 403 or after a set number of pages to spread load. Match proxy geolocation to the site’s main users and avoid known VPN/datacenter IPs when blocks are strict.

Q: What logging, fallbacks, and ethical steps help keep a scraper running safely?
A: Log status codes, response sizes, and fetch times; store sample HTML when a block occurs; and use circuit breakers to stop tasks when 403s spike, then switch IP pools or slow down. Respect robots.txt and terms of service, prefer official APIs when available, and ask site owners for permission when in doubt.
