Add WARC-Protocol header #715

Merged
merged 8 commits into from
May 20, 2025

Conversation

ikreymer
Member

@ikreymer ikreymer commented Nov 7, 2024

A few caveats:
- For now, this only adds WARC-Protocol, as WARC-Cipher-Suite needs more testing.
- The WARC-Cipher-Suite data is not directly available from the browser, so it must be inferred from the available info.

@tw4l
Member

tw4l commented Nov 11, 2024

Re: the WARC-Protocol header being comma-separated rather than repeated, it looks like the IIPC membership is trying to reach consensus this week on whether this is acceptable. See the ongoing discussion here: iipc/warc-specifications#42

Member

@tw4l tw4l left a comment

One suggestion re: import cleanup.

Community consensus in iipc/warc-specifications#42 seems to be for repeated headers rather than a single header with a comma-separated list, so we should probably modify this PR to take that approach.

@ikreymer ikreymer requested a review from tw4l November 20, 2024 19:48
@ikreymer ikreymer marked this pull request as ready for review November 20, 2024 19:50
@ikreymer
Member Author

> One suggestion re: import cleanup.
>
> Community consensus in iipc/warc-specifications#42 seems to be for repeated headers rather than a single header with a comma-separated list, so we should probably modify this PR to take that approach.

Updated to now generate multiple WARC-Protocol headers, per consensus there.
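
The switch from one comma-separated header to repeated headers can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `warcProtocolHeaderLines` is a hypothetical helper, and the protocol values are examples only.

```typescript
// Hypothetical helper (not the crawler's actual code): emits one
// "WARC-Protocol" header line per protocol rather than a single
// comma-separated value.
function warcProtocolHeaderLines(protocols: string[]): string[] {
  const seen = new Set<string>();
  const lines: string[] = [];
  for (const raw of protocols) {
    const value = raw.trim();
    // Skip empty entries and duplicates so each protocol appears once.
    if (!value || seen.has(value)) {
      continue;
    }
    seen.add(value);
    lines.push(`WARC-Protocol: ${value}`);
  }
  return lines;
}

// Illustrative values only; the real values come from the browser.
const exampleLines = warcProtocolHeaderLines(["h2", "tls/1.3"]);
console.log(exampleLines.join("\r\n"));
// WARC-Protocol: h2
// WARC-Protocol: tls/1.3
```

Dropping duplicates is an assumption made for the sketch; the discussion only settles that each protocol gets its own header line.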

@ikreymer
Member Author

Though, we may also want to get clarification on WARC-Cipher-Suite, since it's not an exact one-to-one mapping there.

ikreymer and others added 5 commits May 12, 2025 16:08
…parated

- add WARC-Cipher-Suite header, mapping Chrome NetworkSecurityDetails to known cipher suites
- fixes #641
support WARC-Protocol as multiple headers
tests: add tests for WARC-Protocol, WARC-Cipher-Suite
Co-authored-by: Tessa Walsh <[email protected]>
ikreymer added 2 commits May 12, 2025 16:10
check if protocol ever matches HTTP/1.0 and use that in WARC header, otherwise always use HTTP/1.1
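
One possible reading of the HTTP/1.0 rule in the commit above, as a minimal sketch: `httpVersionForRecord` is a hypothetical name, and the case-insensitive comparison is an assumption rather than something the commit message states.

```typescript
// Hypothetical helper (not the crawler's actual code): picks the HTTP
// version string to write, given the protocol reported by the browser.
// Only an exact HTTP/1.0 match is preserved; anything else, including
// h2, h3, or a missing value, falls back to HTTP/1.1.
function httpVersionForRecord(
  protocol: string | undefined,
): "HTTP/1.0" | "HTTP/1.1" {
  const normalized = (protocol ?? "").trim().toUpperCase();
  return normalized === "HTTP/1.0" ? "HTTP/1.0" : "HTTP/1.1";
}

console.log(httpVersionForRecord("http/1.0")); // HTTP/1.0
console.log(httpVersionForRecord("h2")); // HTTP/1.1
```
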
@ikreymer ikreymer changed the title WARC-Protocol + WARC-Cipher-Suite headers Add WARC-Protocol header May 12, 2025
@ikreymer ikreymer requested a review from tw4l May 12, 2025 23:29
Member

@tw4l tw4l left a comment

Tested independently with the crawler and this is working as expected. Nice work.

@ikreymer ikreymer merged commit e72b343 into main May 20, 2025
4 checks passed
@ikreymer ikreymer deleted the cipher-protocol branch May 20, 2025 01:59
Development

Successfully merging this pull request may close these issues.

Add support for WARC-Protocol and WARC-Cipher-Suite headers
2 participants