Retry Improvements + Rate Limit Support #758

ikreymer opened this issue Feb 7, 2025 · 1 comment

ikreymer commented Feb 7, 2025

Following up to #132 (and also #392, #360), we need a more sophisticated retry strategy, also considering what to do with rate-limiting status codes.
We already have --failOnInvalidStatus, --maxPageRetries, --failOnFailedSeed, and --failOnFailedLimit, and probably need to add a few more flags.

This is getting slightly messy, but hopefully there's a clear path to figure this out.

There are a few options to consider:

  • Which status codes should be counted as page failures, for purposes of ending the crawl
  • Which status codes should result in retrying the page
  • Should capture of pages with invalid status codes be skipped when they will be retried?
  • Which status codes should result in slowing down the crawl / adding a delay before loading those pages again if retrying

It's probably useful to list the various use cases:

  • The crawler should treat 4xx and 5xx as failed, possibly with the option to customize which status codes are included.
  • The crawler should fail the crawl if a certain number of pages have failed or if any of the seeds have failed.
  • The crawler should retry failed pages a certain number of times, possibly customizing which status codes are eligible for retries (see the sketch after this list).
  • The crawler should not write any data for pages that are being retried until the final retry.
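
To make the use cases above concrete, here is a minimal sketch of how the retry decision could look. All names here (RetryPolicy, shouldRetry, shouldWriteCapture) are hypothetical and only illustrate the behavior described above, not the crawler's actual internals:

```ts
// Hypothetical sketch only: these names are illustrative, not part of the crawler.

interface RetryPolicy {
  maxPageRetries: number;      // mirrors the existing --maxPageRetries flag
  retryStatusCodes: number[];  // the proposed --retryStatusCodes flag
}

// Treat 4xx and 5xx as failed (first use case above).
function isFailedStatus(status: number): boolean {
  return status >= 400;
}

// Retry only while attempts remain and the status is in the retryable set.
function shouldRetry(status: number, attempt: number, policy: RetryPolicy): boolean {
  return (
    isFailedStatus(status) &&
    policy.retryStatusCodes.includes(status) &&
    attempt < policy.maxPageRetries
  );
}

// Write WARC data only if the page will not be retried again (last use case above).
function shouldWriteCapture(status: number, attempt: number, policy: RetryPolicy): boolean {
  return !shouldRetry(status, attempt, policy);
}
```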

With this in mind, we should probably add at least:

  • A --retryStatusCodes flag indicating which status codes should be retried.
  • Is there also a need to specify an --invalidStatusCodes option separate from --retryStatusCodes? Leaning against it.
  • Is there also a need to specify whether failed pages that are being retried should be captured to WARC? Sort of leaning against it as well, since retries are part of the capture process.
  • How to handle rate limiting, e.g. adding exponential backoff via pageExtraDelay for certain status codes like 429, 503, and maybe 403, possibly using Retry-After if available (from Slow down + retry on HTTP 429 errors #392); a rough sketch follows below.
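
For the rate-limiting point, here is a rough sketch of what the delay calculation could look like, assuming Retry-After is honored when present and an exponential backoff (e.g. applied on top of pageExtraDelay) is used otherwise. The helper names are hypothetical:

```ts
// Hypothetical sketch: these helpers are illustrative, not existing crawler functions.

// Retry-After may be either a number of seconds or an HTTP date.
function parseRetryAfter(header: string | null): number | null {
  if (!header) return null;
  const seconds = Number(header);
  if (!Number.isNaN(seconds)) return seconds * 1000;
  const date = Date.parse(header);
  return Number.isNaN(date) ? null : Math.max(0, date - Date.now());
}

// Exponential backoff fallback when no Retry-After is given, capped at 5 minutes.
function backoffDelay(attempt: number, baseMs = 1000, maxMs = 5 * 60 * 1000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Extra delay (in ms) before retrying a page with a rate-limiting status code.
function delayForStatus(status: number, retryAfter: string | null, attempt: number): number {
  if (![429, 503, 403].includes(status)) return 0;
  return parseRetryAfter(retryAfter) ?? backoffDelay(attempt);
}
```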
ikreymer changed the title from Retry + Rate Limit Improvements to Retry Improvements + Rate Limit Support on Feb 7, 2025

Mr0grog commented Mar 17, 2025

Unsolicited comment, apologies if not helpful…

I’m really interested in better rate limiting and 429 support. A strategy I’ve found useful in some other tools is treating my working queue as a queue of queues, one for each hostname. Each of the subqueues tracks a time until which it is delayed (so if I receive a 429 response, or a response with a Retry-After, X-RateLimit, or RateLimit header, I delay the queue for that host). The main queue just loops through the subqueues and yields the next value from the first queue with no delay, or waits for the queue with the shortest delay and then yields the next from it. One issue here might be high overhead for really broad crawls with lots of hosts; most of the stuff I’ve used this for has had a countably small number of hosts involved. Depending on the hostnames involved, sometimes I’ve grouped by eTLD+1 instead of the full domain name, which might be a tunable way to manage the number of queues involved.
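
A rough sketch of what that queue-of-queues idea could look like; nothing here exists in the crawler (a real version would presumably live in its URL queue), it just illustrates the scheduling described above:

```ts
// Illustrative only: per-host queues with a "not before" delay per host.

interface HostQueue {
  urls: string[];
  notBefore: number; // epoch ms; 0 means the host is not currently delayed
}

class PerHostQueue {
  private hosts = new Map<string, HostQueue>();

  add(url: string): void {
    const host = new URL(url).hostname; // could group by eTLD+1 instead
    const q = this.hosts.get(host) ?? { urls: [], notBefore: 0 };
    q.urls.push(url);
    this.hosts.set(host, q);
  }

  // Called after a 429 or a response with Retry-After / RateLimit headers.
  delayHost(host: string, ms: number): void {
    const q = this.hosts.get(host);
    if (q) q.notBefore = Date.now() + ms;
  }

  // Yield the next URL from any host that is not delayed, otherwise wait for
  // the host with the shortest remaining delay.
  async next(): Promise<string | null> {
    for (;;) {
      let soonest = Infinity;
      for (const q of this.hosts.values()) {
        if (q.urls.length === 0) continue;
        if (q.notBefore <= Date.now()) return q.urls.shift()!;
        soonest = Math.min(soonest, q.notBefore);
      }
      if (soonest === Infinity) return null; // every host queue is empty
      await new Promise((resolve) => setTimeout(resolve, soonest - Date.now()));
    }
  }
}
```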
