Include extracted titles in `urn:pageinfo:` WARC records #786
Comments
Yeah, I suppose the title could optionally be included in the pageinfo record, since it has that available when the record is written. It would be the final title of the page, in case it changed from earlier.
Makes sense! I can take a crack at this if that's useful; it looks like the right thing is to have the crawler tell the recorder (if it exists) what the title is?

browsertrix-crawler/src/crawler.ts, lines 1064 to 1069 in 02c4353
Looking at this, I wonder if the content of
Yes, the actual pageinfo record is written here:

Yeah, probably everything that is in the `pages.jsonl` can be in the pageinfo record, too, as these were added later.
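To make the idea concrete, here is a rough sketch of a pageinfo payload that carries the title only when one was extracted. Everything named here (`PageInfoPayload`, `buildPageInfoPayload`, `pageInfoUri`, and the `pageid`/`url`/`ts` fields) is hypothetical and is not taken from the browsertrix-crawler source; the real record may be shaped differently.

```typescript
// Hypothetical payload shape for illustration only; not the crawler's actual type.
interface PageInfoPayload {
  pageid: string;
  url: string;
  ts: string;
  title?: string; // optional: omitted when no title was extracted
}

// Build the JSON body for a pageinfo record. The title passed in is assumed to be
// the final title of the page at the time the record is written.
function buildPageInfoPayload(
  pageid: string,
  url: string,
  ts: Date,
  title?: string,
): PageInfoPayload {
  const payload: PageInfoPayload = { pageid, url, ts: ts.toISOString() };
  // Skip missing or whitespace-only titles rather than writing an empty field.
  if (title !== undefined && title.trim() !== "") {
    payload.title = title.trim();
  }
  return payload;
}

// Target URI for the record, following the urn:pageinfo:<url> pattern from the issue.
function pageInfoUri(url: string): string {
  return `urn:pageinfo:${url}`;
}
```

Keeping `title` optional means readers of existing WARCs, where the field is absent, would not need to change.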
I noticed the `pages.jsonl` file includes the page’s title, which can be really useful on pages where the `<title>` element is created dynamically via JavaScript (I’m hitting a handful of those in my crawls). It would be lovely if this info were also included directly in the WARC.

The most obvious place seems like the `urn:pageinfo:<url>` records, although having it as metadata on the `response` record (maybe in `WARC-JSON-Metadata`?) or in a `metadata` record attached to the response record could make sense, too.

(Side question: when first working with Browsertrix WARCs, I was surprised these pageinfo records were plain old `resource` records instead of `metadata`. Is there a specific reason for that?)
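
For context, this is roughly how the titles can be pulled out of `pages.jsonl` today. It is a minimal sketch that assumes each line is a standalone JSON object with string `url` and `title` fields (the `url` field name is an assumption); lines without both, such as a header line, are skipped. `titlesFromPagesJsonl` is a hypothetical helper, not part of any Browsertrix tooling.

```typescript
// Collect a url -> title map from the text content of a pages.jsonl file.
// Assumes one JSON object per line; malformed or incomplete lines are ignored.
function titlesFromPagesJsonl(text: string): Map<string, string> {
  const titles = new Map<string, string>();
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    let entry: unknown;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // not valid JSON, skip this line
    }
    if (typeof entry !== "object" || entry === null) continue;
    const { url, title } = entry as { url?: unknown; title?: unknown };
    if (typeof url === "string" && typeof title === "string") {
      titles.set(url, title);
    }
  }
  return titles;
}
```

Having the title in the pageinfo record would make this side lookup unnecessary for anyone working from the WARC alone.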