404 response with empty body causes crawler to think page crashed and not record response in WARC #789
Comments
This is an interesting edge case. I think the browser considers this a crash, as it shows the Chrome error page here, since the response generates no content and can't be loaded.
- chrome returns net::ERR_HTTP_RESPONSE_CODE_FAILURE
- store WARC record with empty response
- don't retry page, save with loadState: 1
- fixes #789
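A minimal sketch of the decision this fix describes, assuming the navigation error text and the HTTP status are both available at the point where the page load fails. The type and function names (`PageOutcome`, `classifyNavigation`) are hypothetical and not taken from the crawler's code; only the error code and `loadState: 1` come from the description above.

```typescript
// Hypothetical types and names for illustration; not the crawler's real API.
type PageOutcome =
  | { kind: "loaded" }
  | { kind: "emptyErrorResponse"; status: number; loadState: 1; retry: false }
  | { kind: "failed"; retry: true };

// Error code named in the fix description.
const HTTP_RESPONSE_CODE_FAILURE = "net::ERR_HTTP_RESPONSE_CODE_FAILURE";

function classifyNavigation(
  errorText: string | null,
  status: number | null,
): PageOutcome {
  // No navigation error: the page loaded normally.
  if (errorText === null) {
    return { kind: "loaded" };
  }
  // A non-2xx status plus this specific error code means the server
  // answered, but with an empty body. Record it and do not retry.
  if (
    status !== null &&
    (status < 200 || status > 299) &&
    errorText.includes(HTTP_RESPONSE_CODE_FAILURE)
  ) {
    return { kind: "emptyErrorResponse", status, loadState: 1, retry: false };
  }
  // Anything else is still treated as a real failure and may be retried.
  return { kind: "failed", retry: true };
}
```

With this shape, a 404 with an empty body is separated from a genuine crash, so only the latter goes through the retry path.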
Oh interesting, I tried it in Safari and Firefox, which just show a blank screen and no error, but did not try Chrome. I wonder if it would make sense to handle

Taking a quick look at the Chromium source, it looks like it intentionally bails out and declares this error code if there is no response body on a non-2xx response (see the corresponding header file for a comment explaining a bit more). Later on it uses that signal to render a custom page instead of a blank screen like other browsers.

So FWIW, I don’t think Chromium is really considering this a crash so much as it’s taking a kind of roundabout way to render a nice error message for users (much nicer than the blank screen!) that happens to have weird results for CDP/Puppeteer consumers.
Yes please to both of these.
When trying to archive a URL that returns a 404 status code and an empty response body, the crawler logs that the page crashed, retries a few times, and then never records the request and response in the WARC, despite the fact that it is a correct, complete, successful HTTP response. Skimming the code, I suspect this might be the case for any non-2xx response, since that causes the direct fetch to fail. But there are obviously also issues in the part of this that is automating the browser, too.
Here’s an example URL with this behavior: https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf
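To make the distinction in this report concrete, here is a small sketch that separates "the HTTP exchange completed" from "the status was a success". The `FetchAttempt` shape and both function names are assumptions made for illustration; they do not describe how the crawler's direct fetch is actually implemented.

```typescript
// Hypothetical record of one fetch attempt; field names are assumptions.
interface FetchAttempt {
  networkError: string | null; // set only when no HTTP response arrived
  status: number | null; // HTTP status code, if a response arrived
  bodyLength: number; // response body size in bytes
}

// A complete HTTP exchange: a response arrived, whatever its status or size.
function isCompleteResponse(attempt: FetchAttempt): boolean {
  return attempt.networkError === null && attempt.status !== null;
}

// A success status: 2xx only.
function isSuccessStatus(attempt: FetchAttempt): boolean {
  return attempt.status !== null && attempt.status >= 200 && attempt.status <= 299;
}

// The empty 404 described in this issue.
const empty404: FetchAttempt = { networkError: null, status: 404, bodyLength: 0 };
```

For `empty404`, `isCompleteResponse` returns `true` while `isSuccessStatus` returns `false`. Gating WARC recording on the second check rather than the first would drop responses like this one.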
To reproduce:
Using the webrecorder/browsertrix-crawler:1.5.8 Docker image and the following config:
And the following command:
Logs warnings and errors like:
Complete Log Output