Include extracted titles in `urn:pageinfo:` WARC records #786

Mr0grog · 2025-03-05T02:42:19Z

I noticed the pages.jsonl file includes the page’s title, which can be really useful on pages where the <title> element is created dynamically via JavaScript (I’m hitting a handful of those in my crawls). It would be lovely if this info were also included directly in the WARC.

The most obvious place seems like the urn:pageinfo:<url> records, although having it as metadata on the response record (maybe in WARC-JSON-Metadata?) or in a metadata record attached to the response record could make sense, too. (Side question: when first working with Browsertrix WARCs, I was surprised these pageinfo records were plain old resource records instead of metadata. Is there a specific reason for that?)


ikreymer · 2025-03-29T00:25:28Z

Yeah, I suppose the title could optionally be included in the pageinfo record, since it has that available when the record is written. It would be the final title of the page, in case it changed from earlier.
We decided to go with resource as it is a new type of resource, with a distinct URN, rather than metadata about a particular WARC record, which is how metadata records are traditionally used. The pageinfo record, just like the extracted text and screenshots, is a resource created from the entire rendered page, not just a single response record.
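
To illustrate the distinction, here is a minimal sketch of the WARC headers for each record type. The helper names and the `WarcHeaders` type are hypothetical and are not browsertrix-crawler's actual API; mandatory fields such as `WARC-Record-ID` and `Content-Length` are left out for brevity.

```typescript
// Hypothetical helper type for illustration; not part of browsertrix-crawler.
type WarcHeaders = Record<string, string>;

// A resource record stands on its own: it is identified by its own target URI
// (here a urn:pageinfo: URN) and does not point at any other record.
function pageInfoResourceHeaders(pageUrl: string, date: Date): WarcHeaders {
  return {
    "WARC-Type": "resource",
    "WARC-Target-URI": `urn:pageinfo:${pageUrl}`,
    "WARC-Date": date.toISOString(),
    "Content-Type": "application/json",
  };
}

// A metadata record traditionally describes one specific record, which it
// names through WARC-Refers-To (the ID of the record being described).
function metadataHeaders(
  targetUrl: string,
  refersToRecordId: string,
  date: Date,
): WarcHeaders {
  return {
    "WARC-Type": "metadata",
    "WARC-Target-URI": targetUrl,
    "WARC-Refers-To": refersToRecordId,
    "WARC-Date": date.toISOString(),
    "Content-Type": "application/json",
  };
}
```

Because a pageinfo record summarizes the whole rendered page, there is no single record ID that would fit naturally in `WARC-Refers-To`, which is consistent with the choice of a resource record.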

Mr0grog · 2025-03-31T17:20:34Z

Makes sense! I can take a crack at this if that's useful; it looks like the right thing is to have the crawler tell the recorder (if it exists) what the title is?

browsertrix-crawler/src/crawler.ts

Lines 1064 to 1069 in 02c4353

    
           data.title = await timedRun( 
        
             page.title(), 
        
             PAGE_OP_TIMEOUT_SECS, 
        
             "Timed out getting page title, something is likely wrong", 
        
             logDetails, 
        
           );

Looking at this, I wonder if the content of pages.jsonl and urn:pageinfo: records could/should be unified more generally. Maybe the recorder should just be handed the PageState record at the end of Crawler.crawlPage()? Dunno if I’m wanting to stuff too much into the WARC, though. (For my current uses, there’s a lot more that would be helpful to see in the pageinfo records [or pages.jsonl, I guess], such as ID or timestamp references to all the request/response records for the page and various resources that are already listed.)

ikreymer · 2025-04-02T05:22:29Z

Yes, the actual pageinfo record is written here:
https://github.com/webrecorder/browsertrix-crawler/blob/main/src/util/worker.ts#L304
and the title should already be available in data.title.
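
As a rough sketch of the change being discussed, the title could be merged into the pageinfo payload just before it is serialized. `PageInfoSketch`, `withTitle`, and `serializePageInfo` are hypothetical names with a simplified shape assumed for illustration; the real record type and writer in the crawler may differ.

```typescript
// Hypothetical, simplified payload shape; the crawler's real type may differ.
interface PageInfoSketch {
  pageid: string;
  url: string;
  title?: string;
  urls: Record<string, { status: number; mime?: string }>;
}

// Return a copy of the payload with the final page title attached.
// Empty or whitespace-only titles are skipped so the field stays optional.
function withTitle(
  info: PageInfoSketch,
  title: string | undefined,
): PageInfoSketch {
  const trimmed = title?.trim();
  return trimmed ? { ...info, title: trimmed } : info;
}

// Serialize the payload as the JSON body of the urn:pageinfo: record.
function serializePageInfo(info: PageInfoSketch): string {
  return JSON.stringify(info, null, 2);
}
```

Keeping `title` optional means existing readers of pageinfo records would not need to change if the field is absent.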

Yeah, probably everything that is in the pages.jsonl can be in the pageinfo record, too, as these were added later.
They sort of serve different purposes: pages.jsonl is for fast lookup of all pages, while the pageinfo record is to be able to determine which resources were used on each page at capture (or replay time if using QA).
We use this for the QA system to do a comparison, but it's sort of experimental still.
Would be open to adding additional data.
Not every resource is written to the WARC every time it is loaded; some may be cached by the browser or already written previously (we could surface this info as well...)
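
For comparison, the "fast lookup" role of pages.jsonl can be sketched as a simple index from URL to title. This assumes each line is a JSON object and that page entries carry `url` and `title` fields; the field names and the `indexTitles` helper are assumptions made for illustration.

```typescript
// Assumed shape of a page entry in pages.jsonl; field names are assumptions.
interface PageEntrySketch {
  url: string;
  title?: string;
}

// Build a URL -> title map from the text of a JSON Lines file.
// Lines without both a url and a title (such as a header line) are skipped.
function indexTitles(jsonl: string): Map<string, string> {
  const titles = new Map<string, string>();
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue;
    const entry = JSON.parse(line) as Partial<PageEntrySketch>;
    if (typeof entry.url === "string" && typeof entry.title === "string") {
      titles.set(entry.url, entry.title);
    }
  }
  return titles;
}
```

A lookup like this needs only the pages.jsonl file, whereas getting the same title out of the WARC alone would require it to be present in the pageinfo record, which is what this issue requests.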

github-project-automation bot added this to Webrecorder Projects Mar 5, 2025

github-project-automation bot moved this to Triage in Webrecorder Projects Mar 5, 2025
