Retry Improvements + Rate Limit Support #758

ikreymer opened this issue Feb 7, 2025 · 1 comment

ikreymer commented Feb 7, 2025

Following up to #132 (and also #392, #360), we need a more sophisticated retry strategy, also considering what to do with rate-limiting status codes.
We already have --failOnInvalidStatus, --maxPageRetries, --failOnFailedSeed, and --failOnFailedLimit, and probably need to add a few more flags.

This is getting slightly messy, but hopefully there's a clear path to figure this out.

There are a few options to consider:

  • Which status codes should be counted as page failures, for purposes of ending the crawl
  • Which status codes should result in retrying the page
  • Should capture of pages with invalid status codes be skipped when they will be retried?
  • Which status codes should result in slowing down the crawl / adding a delay before loading those pages again if retrying

It's probably useful to list the various use cases:

  • The crawler should treat 4xx and 5xx as failed, possibly with the option to customize which status codes are included.
  • The crawler should fail the crawl if a certain number of pages have failed or if any of the seeds have failed.
  • The crawler should retry failed pages a certain number of times, possibly customizing which status codes are eligible for retries (see the sketch after this list).
  • The crawler should not write any data for pages that are being retried until the final retry.
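
To make the use cases above concrete, here is a minimal sketch of how the retry decision could look. All names here (RetryPolicy, shouldRetry, shouldWriteCapture) are hypothetical and only illustrate the behavior described above, not the crawler's actual internals:

```ts
// Hypothetical sketch only: these names are illustrative, not part of the crawler.

interface RetryPolicy {
  maxPageRetries: number;      // mirrors the existing --maxPageRetries flag
  retryStatusCodes: number[];  // the proposed --retryStatusCodes flag
}

// Treat 4xx and 5xx as failed (first use case above).
function isFailedStatus(status: number): boolean {
  return status >= 400;
}

// Retry only while attempts remain and the status is in the retryable set.
function shouldRetry(status: number, attempt: number, policy: RetryPolicy): boolean {
  return (
    isFailedStatus(status) &&
    policy.retryStatusCodes.includes(status) &&
    attempt < policy.maxPageRetries
  );
}

// Write WARC data only if the page will not be retried again (last use case above).
function shouldWriteCapture(status: number, attempt: number, policy: RetryPolicy): boolean {
  return !shouldRetry(status, attempt, policy);
}
```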

With this in mind, we should probably add at least:

  • A --retryStatusCodes flag indicating which status codes should be retried.
  • Is there also a need to specify an --invalidStatusCodes option separate from --retryStatusCodes? Leaning against it.
  • Is there also a need to specify whether failed pages that are being retried should be captured to WARC? Sort of leaning against it as well, since retries are part of the capture process.
  • How to handle rate limiting, e.g. adding exponential backoff via pageExtraDelay for certain status codes like 429, 503, and maybe 403, possibly using Retry-After if available (from Slow down + retry on HTTP 429 errors #392); a rough sketch follows below.
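
For the rate-limiting point, here is a rough sketch of what the delay calculation could look like, assuming Retry-After is honored when present and an exponential backoff (e.g. applied on top of pageExtraDelay) is used otherwise. The helper names are hypothetical:

```ts
// Hypothetical sketch: these helpers are illustrative, not existing crawler functions.

// Retry-After may be either a number of seconds or an HTTP date.
function parseRetryAfter(header: string | null): number | null {
  if (!header) return null;
  const seconds = Number(header);
  if (!Number.isNaN(seconds)) return seconds * 1000;
  const date = Date.parse(header);
  return Number.isNaN(date) ? null : Math.max(0, date - Date.now());
}

// Exponential backoff fallback when no Retry-After is given, capped at 5 minutes.
function backoffDelay(attempt: number, baseMs = 1000, maxMs = 5 * 60 * 1000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Extra delay (in ms) before retrying a page with a rate-limiting status code.
function delayForStatus(status: number, retryAfter: string | null, attempt: number): number {
  if (![429, 503, 403].includes(status)) return 0;
  return parseRetryAfter(retryAfter) ?? backoffDelay(attempt);
}
```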
ikreymer changed the title from Retry + Rate Limit Improvements to Retry Improvements + Rate Limit Support on Feb 7, 2025

Mr0grog commented Mar 17, 2025

Unsolicited comment, apologies if not helpful…

I’m really interested in better rate limiting and 429 support. A strategy I’ve found useful in some other tools is treating my working queue as a queue of queues, one for each hostname. Each of the subqueues tracks a time until which it is delayed (so if I receive a 429 response, or a response with a Retry-After, X-RateLimit, or RateLimit header, I delay the queue for that host). The main queue just loops through the subqueues and yields the next value from the first queue with no delay, or waits for the queue with the shortest delay and then yields the next from it. One issue here might be high overhead for really broad crawls with lots of hosts; most of the stuff I’ve used this for has had a countably small number of hosts involved. Depending on the hostnames involved, sometimes I’ve grouped by eTLD+1 instead of the full domain name, which might be a tunable way to manage the number of queues involved.
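
A rough sketch of what that queue-of-queues idea could look like; nothing here exists in the crawler (a real version would presumably live in its URL queue), it just illustrates the scheduling described above:

```ts
// Illustrative only: per-host queues with a "not before" delay per host.

interface HostQueue {
  urls: string[];
  notBefore: number; // epoch ms; 0 means the host is not currently delayed
}

class PerHostQueue {
  private hosts = new Map<string, HostQueue>();

  add(url: string): void {
    const host = new URL(url).hostname; // could group by eTLD+1 instead
    const q = this.hosts.get(host) ?? { urls: [], notBefore: 0 };
    q.urls.push(url);
    this.hosts.set(host, q);
  }

  // Called after a 429 or a response with Retry-After / RateLimit headers.
  delayHost(host: string, ms: number): void {
    const q = this.hosts.get(host);
    if (q) q.notBefore = Date.now() + ms;
  }

  // Yield the next URL from any host that is not delayed, otherwise wait for
  // the host with the shortest remaining delay.
  async next(): Promise<string | null> {
    for (;;) {
      let soonest = Infinity;
      for (const q of this.hosts.values()) {
        if (q.urls.length === 0) continue;
        if (q.notBefore <= Date.now()) return q.urls.shift()!;
        soonest = Math.min(soonest, q.notBefore);
      }
      if (soonest === Infinity) return null; // every host queue is empty
      await new Promise((resolve) => setTimeout(resolve, soonest - Date.now()));
    }
  }
}
```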
