-
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Fix linkcheck anchor encoding issue #13621
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
98f3a53
to
f896df9
Compare
- Enhanced AnchorCheckParser to handle multiple anchor variations - Added comprehensive test coverage for encoded anchors - Fixed false 'Anchor not found' errors for URLs with encoded characters - Maintains full backward compatibility - All linting checks pass
f896df9
to
5265662
Compare
self.search_variations = { | ||
search_anchor, # decoded (current behavior) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend that we try to search the anchor name variations in the order listed in the WHATWG HTML spec for Scrolling to a fragment
: https://html.spec.whatwg.org/commit-snapshots/4467ddf3235732614a0202729d6ec0e04c33b597/#scrolling-to-a-fragment
To do that, I think we may need to use a different datastructure than Python's set
, because it doesn't necessarily iterate over elements in the order they were added/inserted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mistake; we only use the search_variations
set attribute for presence-checking as we progress our way through each HTML document. Iteration order does not matter. So: I think set
is fine, and we do not need to worry about the order that we add items -- provided that the resulting collection is equivalent to the anchor names a browser could check for.
# Add the original encoded version if provided | ||
if original_encoded_anchor: | ||
self.search_variations.add(original_encoded_anchor) | ||
|
||
# Add a re-encoded version if the decoded anchor contains characters | ||
# that would be encoded | ||
if search_anchor != quote(search_anchor, safe=''): | ||
self.search_variations.add(quote(search_anchor, safe='')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to extract the anchor-to-search-variations code into a helper function/method; doing so would hopefully also be convenient to write unit tests around its behaviour.
@@ -706,15 +713,37 @@ def contains_anchor(response: Response, anchor: str) -> bool: | |||
class AnchorCheckParser(HTMLParser): | |||
"""Specialised HTML parser that looks for a specific anchor.""" | |||
|
|||
def __init__(self, search_anchor: str) -> None: | |||
def __init__(self, search_anchor: str, original_encoded_anchor: str = '') -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A function call that contains the same value encoded in two different ways is, to me, something of an undesirable code smell. It's probably not the first item to focus on, but it would be nice if this initializer could be updated to accept only one form of the anchor (I'd probably recommend the original value received before any encoding-related operations have been attempted).
After some initial confusion (documented to some extent in the linked issue thread #13620), I'm now supportive of this functionality, and would like to see this merged. I would like a few adjustments/refactorings to be made before then, though. |
Fix linkcheck anchor encoding issue (#13620)
Description
This PR fixes an issue where the linkcheck builder incorrectly reports "Anchor not found" errors
for URLs with encoded characters in fragment identifiers (anchors), despite these URLs working
correctly in web browsers.
Current Behavior
When encountering a URL with percent-encoded characters in the anchor/fragment (e.g.,
https://example.com/page#standard-input%2Foutput-stdio
), the linkcheck builder:standard-input%2Foutput-stdio
standard-input/output-stdio
id="standard-input/output-stdio"
orname="standard-input/output-stdio"
Changes Made
AnchorCheckParser
to check for multiple variants of the anchor:contains_anchor
function to accept both decoded and original encoded anchorsTesting Done
AnchorCheckParser
classFixes
Fixes #13620