404 response with empty body causes crawler to think page crashed and not record response in WARC #789

Open
Mr0grog opened this issue Mar 8, 2025 · 2 comments

Mr0grog commented Mar 8, 2025

When trying to archive a URL that returns a 404 status code and an empty response body, the crawler logs that the page crashed, retries a few times, and then never records the request and response in the WARC, even though it is a correct, complete, successful HTTP response. Skimming the code, I suspect this might be the case for any non-2xx response, since that causes the direct fetch to fail. But there are obviously issues in the part that automates the browser, too.

Here’s an example URL with this behavior: https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf

To reproduce:

Using the webrecorder/browsertrix-crawler:1.5.8 Docker image and the following config:

# test.crawl.yaml
scopeType: page
rolloverSize: 8000000000
workers: 1
saveStateHistory: 1

seeds:
  - 'https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf'

And the following command:

docker run \
    --rm \
    --attach stdout --attach stderr \
    --volume "./test.crawl.yaml:/app/config.yaml" \
    --volume "./crawls:/crawls/" \
    webrecorder/browsertrix-crawler:1.5.8 \
    crawl \
    --config /app/config.yaml \
    --collection "test--20250307182348" \
    --saveState always \
    --logging debug,stats

Logs warnings and errors like:

{"context":"recorder","message":"Skipping URL from unknown frame","details":{"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","frameId":"3FB853DBA7F1F9F08A07C0C540C98813"}}
{"context":"recorder","message":"Request failed","details":{"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","errorText":"net::ERR_HTTP_RESPONSE_CODE_FAILURE","type":"Document","status":404,"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","workerid":0}}
{"context":"pageStatus","message":"Page Crashed on Load: will retry","details":{"retry":0,"retries":2,"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","status":404,"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","workerid":0}}

...Repeat a few times...

{"context":"pageStatus","message":"Page Crashed on Load: retry limit reached","details":{"retry":2,"retries":2,"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","status":404,"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","workerid":0}}
Complete Log Output
{"timestamp":"2025-03-08T02:23:49.505Z","logLevel":"info","context":"general","message":"Browsertrix-Crawler 1.5.8 (with warcio.js 2.4.3)","details":{}}
{"timestamp":"2025-03-08T02:23:49.505Z","logLevel":"info","context":"general","message":"Seeds","details":[{"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","scopeType":"page","include":[],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"auth":null,"_authEncoded":null,"maxExtraHops":0,"maxDepth":0}]}
{"timestamp":"2025-03-08T02:23:49.505Z","logLevel":"info","context":"general","message":"Link Selectors","details":[{"selector":"a[href]","extract":"href","isAttribute":false}]}
{"timestamp":"2025-03-08T02:23:49.505Z","logLevel":"info","context":"general","message":"Behavior Options","details":{"message":"{\"autoplay\":true,\"autofetch\":true,\"autoscroll\":true,\"siteSpecific\":true,\"log\":\"__bx_log\",\"startEarly\":true,\"clickSelector\":\"a\"}"}}
{"timestamp":"2025-03-08T02:23:49.544Z","logLevel":"debug","context":"state","message":"Storing state via Redis redis://localhost:6379/0 @ key prefix \"4c25568b7dc8\"","details":{}}
{"timestamp":"2025-03-08T02:23:49.544Z","logLevel":"debug","context":"state","message":"Max Page Time: 190 seconds","details":{}}
{"timestamp":"2025-03-08T02:23:49.545Z","logLevel":"debug","context":"state","message":"Saving crawl state every 300 seconds, keeping last 1 states","details":{}}
{"timestamp":"2025-03-08T02:23:49.551Z","logLevel":"debug","context":"general","message":"Text Extraction: None","details":{}}
{"timestamp":"2025-03-08T02:23:49.552Z","logLevel":"debug","context":"general","message":"Text Extraction: None","details":{}}
{"timestamp":"2025-03-08T02:23:49.564Z","logLevel":"debug","context":"links","message":"Queued new page url","details":{"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf"}}
{"timestamp":"2025-03-08T02:23:49.784Z","logLevel":"info","context":"worker","message":"Creating 1 workers","details":{}}
{"timestamp":"2025-03-08T02:23:49.784Z","logLevel":"info","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"timestamp":"2025-03-08T02:23:49.787Z","logLevel":"debug","context":"worker","message":"Getting page in new window","details":{"workerid":0}}
{"timestamp":"2025-03-08T02:23:49.848Z","logLevel":"debug","context":"browser","message":"Service Workers: always disabled","details":{}}
{"timestamp":"2025-03-08T02:23:49.855Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf"}}
{"timestamp":"2025-03-08T02:23:49.856Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":1,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2025-03-08T02:23:49.786Z\",\"extraHops\":0,\"url\":\"https:\\/\\/www.whitehouse.gov\\/wp-content\\/uploads\\/2023\\/01\\/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf\",\"added\":\"2025-03-08T02:23:49.563Z\",\"depth\":0}"]}}
{"timestamp":"2025-03-08T02:23:49.856Z","logLevel":"debug","context":"memoryStatus","message":"Memory","details":{"maxHeapUsed":41141704,"maxHeapTotal":72593408,"rss":129142784,"heapTotal":72593408,"heapUsed":41141704,"external":5673430,"arrayBuffers":571525}}
{"timestamp":"2025-03-08T02:23:49.867Z","logLevel":"debug","context":"recorder","message":"Async started: fetch","details":{"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf"}}
{"timestamp":"2025-03-08T02:23:50.088Z","logLevel":"debug","context":"fetch","message":"Direct fetch response not accepted, continuing with browser fetch","details":{"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","workerid":0}}
{"timestamp":"2025-03-08T02:23:50.088Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","workerid":0}}
{"timestamp":"2025-03-08T02:23:50.257Z","logLevel":"warn","context":"recorder","message":"Skipping URL from unknown frame","details":{"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","frameId":"3FB853DBA7F1F9F08A07C0C540C98813"}}
{"timestamp":"2025-03-08T02:23:50.259Z","logLevel":"debug","context":"general","message":"Setting page timestamp","details":{"ts":"2025-03-08T02:23:50.091Z","url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","status":404}}
{"timestamp":"2025-03-08T02:23:50.262Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","errorText":"net::ERR_HTTP_RESPONSE_CODE_FAILURE","type":"Document","status":404,"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","workerid":0}}
{"timestamp":"2025-03-08T02:23:50.293Z","logLevel":"debug","context":"behaviorScript","message":"Using AutoFetcher","details":{"page":"chrome-error://chromewebdata/","workerid":0}}
{"timestamp":"2025-03-08T02:23:50.294Z","logLevel":"debug","context":"behaviorScript","message":"Using Autoplay","details":{"page":"chrome-error://chromewebdata/","workerid":0}}
{"timestamp":"2025-03-08T02:23:50.294Z","logLevel":"debug","context":"behaviorScript","message":"Using Autoscroll","details":{"page":"chrome-error://chromewebdata/","workerid":0}}
{"timestamp":"2025-03-08T02:23:50.861Z","logLevel":"warn","context":"pageStatus","message":"Page Crashed on Load: will retry","details":{"retry":0,"retries":2,"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","status":404,"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","workerid":0}}
{"timestamp":"2025-03-08T02:23:50.862Z","logLevel":"debug","context":"worker","message":"Closing page","details":{"crashed":false,"workerid":0}}
{"timestamp":"2025-03-08T02:23:50.894Z","logLevel":"debug","context":"recorder","message":"WARC Record Written","details":{"type":"pageinfo","url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf"}}
{"timestamp":"2025-03-08T02:23:50.897Z","logLevel":"debug","context":"worker","message":"Getting page in new window","details":{"workerid":0}}
{"timestamp":"2025-03-08T02:23:50.967Z","logLevel":"debug","context":"browser","message":"Service Workers: always disabled","details":{}}
{"timestamp":"2025-03-08T02:23:50.980Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf"}}
{"timestamp":"2025-03-08T02:23:50.980Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":1,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"url\":\"https:\\/\\/www.whitehouse.gov\\/wp-content\\/uploads\\/2023\\/01\\/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf\",\"seedId\":0,\"started\":\"2025-03-08T02:23:50.896Z\",\"extraHops\":0,\"depth\":0,\"added\":\"2025-03-08T02:23:49.563Z\",\"retry\":1}"]}}
{"timestamp":"2025-03-08T02:23:50.980Z","logLevel":"debug","context":"memoryStatus","message":"Memory","details":{"maxHeapUsed":46035520,"maxHeapTotal":72855552,"rss":150286336,"heapTotal":72855552,"heapUsed":46035520,"external":7002810,"arrayBuffers":846374}}
{"timestamp":"2025-03-08T02:23:50.981Z","logLevel":"debug","context":"recorder","message":"Async started: fetch","details":{"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf"}}
{"timestamp":"2025-03-08T02:23:51.020Z","logLevel":"debug","context":"fetch","message":"Direct fetch response not accepted, continuing with browser fetch","details":{"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","workerid":0}}
{"timestamp":"2025-03-08T02:23:51.020Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","workerid":0}}
{"timestamp":"2025-03-08T02:23:51.092Z","logLevel":"warn","context":"recorder","message":"Skipping URL from unknown frame","details":{"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","frameId":"4B9D860A4321D84512E77C75AAE9A47F"}}
{"timestamp":"2025-03-08T02:23:51.092Z","logLevel":"debug","context":"general","message":"Setting page timestamp","details":{"ts":"2025-03-08T02:23:51.022Z","url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","status":404}}
{"timestamp":"2025-03-08T02:23:51.095Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","errorText":"net::ERR_HTTP_RESPONSE_CODE_FAILURE","type":"Document","status":404,"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","workerid":0}}
{"timestamp":"2025-03-08T02:23:51.125Z","logLevel":"debug","context":"behaviorScript","message":"Using AutoFetcher","details":{"page":"chrome-error://chromewebdata/","workerid":0}}
{"timestamp":"2025-03-08T02:23:51.126Z","logLevel":"debug","context":"behaviorScript","message":"Using Autoplay","details":{"page":"chrome-error://chromewebdata/","workerid":0}}
{"timestamp":"2025-03-08T02:23:51.126Z","logLevel":"debug","context":"behaviorScript","message":"Using Autoscroll","details":{"page":"chrome-error://chromewebdata/","workerid":0}}
{"timestamp":"2025-03-08T02:23:51.680Z","logLevel":"warn","context":"pageStatus","message":"Page Crashed on Load: will retry","details":{"retry":1,"retries":2,"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","status":404,"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","workerid":0}}
{"timestamp":"2025-03-08T02:23:51.681Z","logLevel":"debug","context":"worker","message":"Closing page","details":{"crashed":false,"workerid":0}}
{"timestamp":"2025-03-08T02:23:51.699Z","logLevel":"debug","context":"recorder","message":"WARC Record Written","details":{"type":"pageinfo","url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf"}}
{"timestamp":"2025-03-08T02:23:51.702Z","logLevel":"debug","context":"worker","message":"Getting page in new window","details":{"workerid":0}}
{"timestamp":"2025-03-08T02:23:51.782Z","logLevel":"debug","context":"browser","message":"Service Workers: always disabled","details":{}}
{"timestamp":"2025-03-08T02:23:51.789Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf"}}
{"timestamp":"2025-03-08T02:23:51.790Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":1,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"extraHops\":0,\"seedId\":0,\"started\":\"2025-03-08T02:23:51.701Z\",\"url\":\"https:\\/\\/www.whitehouse.gov\\/wp-content\\/uploads\\/2023\\/01\\/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf\",\"depth\":0,\"added\":\"2025-03-08T02:23:49.563Z\",\"retry\":2}"]}}
{"timestamp":"2025-03-08T02:23:51.790Z","logLevel":"debug","context":"memoryStatus","message":"Memory","details":{"maxHeapUsed":48391960,"maxHeapTotal":72855552,"rss":152907776,"heapTotal":72855552,"heapUsed":48391960,"external":7452239,"arrayBuffers":1033659}}
{"timestamp":"2025-03-08T02:23:51.791Z","logLevel":"debug","context":"recorder","message":"Async started: fetch","details":{"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf"}}
{"timestamp":"2025-03-08T02:23:51.828Z","logLevel":"debug","context":"fetch","message":"Direct fetch response not accepted, continuing with browser fetch","details":{"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","workerid":0}}
{"timestamp":"2025-03-08T02:23:51.828Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","workerid":0}}
{"timestamp":"2025-03-08T02:23:51.868Z","logLevel":"warn","context":"recorder","message":"Skipping URL from unknown frame","details":{"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","frameId":"98B10A39F38EEE8BDCCE7A493F1488C8"}}
{"timestamp":"2025-03-08T02:23:51.869Z","logLevel":"debug","context":"general","message":"Setting page timestamp","details":{"ts":"2025-03-08T02:23:51.830Z","url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","status":404}}
{"timestamp":"2025-03-08T02:23:51.871Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","errorText":"net::ERR_HTTP_RESPONSE_CODE_FAILURE","type":"Document","status":404,"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","workerid":0}}
{"timestamp":"2025-03-08T02:23:51.899Z","logLevel":"debug","context":"behaviorScript","message":"Using AutoFetcher","details":{"page":"chrome-error://chromewebdata/","workerid":0}}
{"timestamp":"2025-03-08T02:23:51.900Z","logLevel":"debug","context":"behaviorScript","message":"Using Autoplay","details":{"page":"chrome-error://chromewebdata/","workerid":0}}
{"timestamp":"2025-03-08T02:23:51.900Z","logLevel":"debug","context":"behaviorScript","message":"Using Autoscroll","details":{"page":"chrome-error://chromewebdata/","workerid":0}}
{"timestamp":"2025-03-08T02:23:52.458Z","logLevel":"error","context":"pageStatus","message":"Page Crashed on Load: retry limit reached","details":{"retry":2,"retries":2,"url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","status":404,"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","workerid":0}}
{"timestamp":"2025-03-08T02:23:52.458Z","logLevel":"debug","context":"worker","message":"Closing page","details":{"crashed":false,"workerid":0}}
{"timestamp":"2025-03-08T02:23:52.471Z","logLevel":"debug","context":"recorder","message":"WARC Record Written","details":{"type":"pageinfo","url":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf"}}
{"timestamp":"2025-03-08T02:23:52.475Z","logLevel":"info","context":"general","message":"Saving crawl state to: /crawls/collections/test--20250307182348/crawls/crawl-20250308022352-4c25568b7dc8.yaml","details":{}}
{"timestamp":"2025-03-08T02:23:52.485Z","logLevel":"info","context":"worker","message":"Worker done, all tasks complete","details":{"workerid":0}}
{"timestamp":"2025-03-08T02:23:52.486Z","logLevel":"debug","context":"recorder","message":"Finishing Fetcher Queue","details":{"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","workerid":0}}
{"timestamp":"2025-03-08T02:23:52.486Z","logLevel":"debug","context":"recorder","message":"Finishing WARC writing","details":{"page":"https://www.whitehouse.gov/wp-content/uploads/2023/01/01-2023-Framework-for-Federal-Scientific-Integrity-Policy-and-Practice.pdf","workerid":0}}
{"timestamp":"2025-03-08T02:23:52.534Z","logLevel":"info","context":"general","message":"Saving crawl state to: /crawls/collections/test--20250307182348/crawls/crawl-20250308022352-4c25568b7dc8.yaml","details":{}}
{"timestamp":"2025-03-08T02:23:52.535Z","logLevel":"info","context":"general","message":"Removing old save-state: /crawls/collections/test--20250307182348/crawls/crawl-20250308022352-4c25568b7dc8.yaml","details":{}}
{"timestamp":"2025-03-08T02:23:52.536Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":1,"pending":0,"failed":1,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"timestamp":"2025-03-08T02:23:52.536Z","logLevel":"debug","context":"memoryStatus","message":"Memory","details":{"maxHeapUsed":50368720,"maxHeapTotal":73117696,"rss":154611712,"heapTotal":73117696,"heapUsed":50368720,"external":7833883,"arrayBuffers":1153159}}
{"timestamp":"2025-03-08T02:23:52.537Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2025-03-08T02:23:52.542Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: done","details":{}}
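
For anyone reproducing this, the warnings and errors can be pulled out of the debug output with a small filter. This is only a sketch: it assumes the log is JSON lines with the `timestamp`, `logLevel`, `context`, `message`, and `details` fields seen above, and the names `CrawlLogEntry` and `filterProblems` are hypothetical helpers, not part of browsertrix-crawler.

```typescript
// Hypothetical helper; field names are taken from the log output above.
interface CrawlLogEntry {
  timestamp: string;
  logLevel: string;
  context: string;
  message: string;
  details: unknown;
}

// Returns only the "warn" and "error" entries from JSON-lines log text.
function filterProblems(jsonLines: string): CrawlLogEntry[] {
  const problems: CrawlLogEntry[] = [];
  for (const line of jsonLines.split("\n")) {
    const trimmed = line.trim();
    // Skip blank lines and anything that is clearly not a JSON object.
    if (!trimmed.startsWith("{")) continue;
    let entry: CrawlLogEntry;
    try {
      entry = JSON.parse(trimmed) as CrawlLogEntry;
    } catch {
      // Ignore lines that fail to parse rather than aborting.
      continue;
    }
    if (entry.logLevel === "warn" || entry.logLevel === "error") {
      problems.push(entry);
    }
  }
  return problems;
}
```

Run against the complete log above, this should leave the "Skipping URL from unknown frame", "Request failed", and "Page Crashed on Load" entries.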

ikreymer (Member) commented Apr 1, 2025

This is an interesting edge case. I think the browser considers this a crash, as it shows the chrome error page here, since the response generates no content and can't be loaded.
It's possible to detect and write to WARC, though. Perhaps also shouldn't retry? I guess that's probably better than the current behavior.

ikreymer added a commit that referenced this issue Apr 1, 2025
- chrome returns net::ERR_HTTP_RESPONSE_CODE_FAILURE
- store WARC record with empty response
- don't retry page, save with loadState: 1
- fixes #789

Mr0grog (Author) commented Apr 2, 2025

> I think the browser considers this a crash, as it shows the chrome error page here

Oh interesting, I tried it in Safari and Firefox, which just show a blank screen and no error, but did not try Chrome. I wonder if it would make sense to handle net::ERR_HTTP_RESPONSE_CODE_FAILURE specially.

Taking a quick look at the Chromium source, it looks like it intentionally bails out and declares this error code if there is no response body on a non-2xx response (see the corresponding header file for a comment explaining a bit more). Later on it uses that signal to render a custom page instead of a blank screen like other browsers. So FWIW, I don’t think Chromium is really considering this a crash so much as it’s taking a kind of roundabout way to render a nice error message for users (much nicer than the blank screen!) that happens to have weird results for CDP/Puppeteer consumers.

> It's possible to detect and write to WARC, though. Perhaps also shouldn't retry?

Yes please to both of these.
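
A minimal sketch of what handling `net::ERR_HTTP_RESPONSE_CODE_FAILURE` specially could look like, based only on the behavior described in this thread. The `LoadFailureDecision` type and `classifyLoadFailure` function are hypothetical names for illustration, not the crawler's actual code, and the non-2xx check is an assumption drawn from the description above rather than from the Chromium source itself.

```typescript
// Hypothetical names for illustration; not browsertrix-crawler's real API.
interface LoadFailureDecision {
  recordEmptyResponse: boolean; // write the response to the WARC with an empty body
  retry: boolean;               // re-queue the page for another attempt
}

// Error text Chromium reports in the "Request failed" log entries above.
const RESPONSE_CODE_FAILURE = "net::ERR_HTTP_RESPONSE_CODE_FAILURE";

function classifyLoadFailure(
  errorText: string,
  status: number | undefined,
): LoadFailureDecision {
  // Assumption: this error plus a known non-2xx status means the server sent a
  // complete response with no body, so the exchange can be archived as-is.
  const isNon2xx = status !== undefined && (status < 200 || status > 299);
  if (errorText === RESPONSE_CODE_FAILURE && isNon2xx) {
    return { recordEmptyResponse: true, retry: false };
  }
  // Any other failure keeps the existing behavior: nothing recorded, retry.
  return { recordEmptyResponse: false, retry: true };
}
```

With the values from the logs in this issue (`errorText` as above, `status: 404`), this returns `{ recordEmptyResponse: true, retry: false }`, which matches the two requests: write the record and skip the retries.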

Labels: None yet
Projects: Status: Triage
Development: No branches or pull requests
Participants: 2