Include extracted titles in urn:pageinfo: WARC records #786

Open
Mr0grog opened this issue Mar 5, 2025 · 3 comments

Mr0grog commented Mar 5, 2025

I noticed the pages.jsonl file includes the page’s title, which can be really useful on pages where the <title> element is created dynamically via JavaScript (I’m hitting a handful of those in my crawls). It would be lovely if this info were also included directly in the WARC.

The most obvious place seems like the urn:pageinfo:<url> records, although having it as metadata on the response record (maybe in WARC-JSON-Metadata?) or in a metadata record attached to the response record could make sense, too. (Side question: when first working with Browsertrix WARCs, I was surprised these pageinfo records were plain old resource records instead of metadata records. Is there a specific reason for that?)

@ikreymer
Member

Yeah, I suppose the title could optionally be included in the pageinfo record, since it is available when the record is written. It would be the final title of the page, in case it changed from earlier.
We decided to go with resource as it is a new type of resource, with a distinct URN, rather than metadata about a particular WARC record, which is how metadata records are traditionally used. The pageinfo record, just like the extracted text and screenshots, is a resource created from the entire rendered page, not just a single response record.
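
To illustrate the distinction drawn above, here is a rough sketch of how the two record types differ at the header level. It is not taken from the crawler: the function names are hypothetical, the `Content-Type` value is an assumption, and mandatory fields such as `WARC-Record-ID` and `Content-Length` are left out for brevity.

```typescript
type WarcHeaders = Record<string, string>;

// A resource record stands on its own: it is identified by its own
// target URI, which here is a URN derived from the page URL.
function pageInfoResourceHeaders(pageUrl: string, date: string): WarcHeaders {
  return {
    "WARC-Type": "resource",
    "WARC-Target-URI": `urn:pageinfo:${pageUrl}`,
    "WARC-Date": date,
    "Content-Type": "application/json", // assumed payload type
  };
}

// A metadata record describes one other record and points at it by
// that record's ID instead of carrying an identity of its own.
function metadataHeaders(describedRecordId: string, date: string): WarcHeaders {
  return {
    "WARC-Type": "metadata",
    "WARC-Refers-To": describedRecordId,
    "WARC-Date": date,
  };
}
```

Because a pageinfo record summarizes the whole rendered page, there is no single record ID that a `WARC-Refers-To` header could sensibly point at, which fits the reasoning given for choosing `resource`.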

Author

Mr0grog commented Mar 31, 2025

Makes sense! I can take a crack at this if that's useful; it looks like the right thing is to have the crawler tell the recorder (if it exists) what the title is?

data.title = await timedRun(
  page.title(),
  PAGE_OP_TIMEOUT_SECS,
  "Timed out getting page title, something is likely wrong",
  logDetails,
);

Looking at this, I wonder if the content of pages.jsonl and urn:pageinfo: records could/should be unified more generally. Maybe the recorder should just be handed the PageState record at the end of Crawler.crawlPage()? Dunno if I’m wanting to stuff too much into the WARC, though. (For my current uses, there’s a lot more that would be helpful to see in the pageinfo records [or pages.jsonl, I guess], such as ID or timestamp references to all the request/response records for the page and various resources that are already listed.)
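
A minimal sketch of the idea, assuming the title is fetched with a timeout and then copied into the pageinfo payload. `withTimeout` is a simplified stand-in for the crawler's `timedRun()`, not its actual implementation, and `PageInfoSketch` and `buildPageInfo` are hypothetical names whose fields are illustrative only.

```typescript
// Simplified stand-in for timedRun(): resolves to undefined (after
// logging a warning) if the promise does not settle in time.
async function withTimeout<T>(
  promise: Promise<T>,
  seconds: number,
  message: string,
): Promise<T | undefined> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<undefined>((resolve) => {
    timer = setTimeout(() => {
      console.warn(message);
      resolve(undefined);
    }, seconds * 1000);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Hypothetical shape for a pageinfo payload that carries the title.
interface PageInfoSketch {
  pageid: string;
  url: string;
  title?: string;
}

// Only include the title when one was actually extracted.
function buildPageInfo(pageid: string, url: string, title?: string): PageInfoSketch {
  const info: PageInfoSketch = { pageid, url };
  if (title) info.title = title;
  return info;
}
```

With something like this, the crawler side could call `buildPageInfo(id, url, await withTimeout(page.title(), 5, "Timed out getting page title"))`, and a timed-out or empty title would simply be omitted from the record.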

Member

ikreymer commented Apr 2, 2025

Yes, the actual pageinfo record is written here:
https://github.com/webrecorder/browsertrix-crawler/blob/main/src/util/worker.ts#L304
and the title should already be available in data.title.

Yeah, probably everything that is in the pages.jsonl can be in the pageinfo record, too, as these were added later.
They sort of serve different purposes: pages.jsonl is for fast lookup of all pages, while the pageinfo record is to be able to determine which resources were used on each page at capture (or replay time if using QA).
We use this for the QA system to do a comparison, but it's sort of experimental still.
Would be open to adding additional data.
Not every resource is written to the WARC every time it is loaded; some may be cached by the browser or already written previously (could surface this info as well...)
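
One way that last point might be surfaced, purely as a sketch: each resource listed for a page could carry a marker saying how it was satisfied. The type names, the `source` field, and its three values are hypothetical and are not existing pageinfo fields.

```typescript
// Hypothetical marker for how a page resource was satisfied.
type ResourceSource = "written" | "browser-cache" | "already-written";

// Hypothetical per-resource entry in a pageinfo payload.
interface PageResourceSketch {
  url: string;
  status: number;
  source: ResourceSource;
}

// Count how many of a page's resources fall into each category.
function summarizeSources(
  resources: PageResourceSketch[],
): Record<ResourceSource, number> {
  const counts: Record<ResourceSource, number> = {
    written: 0,
    "browser-cache": 0,
    "already-written": 0,
  };
  for (const resource of resources) {
    counts[resource.source] += 1;
  }
  return counts;
}
```

A consumer looking for the request/response records behind a page could then tell which resources to expect in the same WARC and which were loaded without a new record being written.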
