Fix linkcheck anchor encoding issue #13621

Open: nrdlngr wants to merge 2 commits into master

@nrdlngr commented Jun 6, 2025

Fix linkcheck anchor encoding issue (#13620)

Description

This PR fixes an issue where the linkcheck builder incorrectly reports "Anchor not found" errors
for URLs with encoded characters in fragment identifiers (anchors), despite these URLs working
correctly in web browsers.

Current Behavior

When encountering a URL with percent-encoded characters in the anchor/fragment (e.g.,
https://example.com/page#standard-input%2Foutput-stdio), the linkcheck builder:

  1. Extracts the fragment: standard-input%2Foutput-stdio
  2. Decodes it to: standard-input/output-stdio
  3. Searches for an HTML element with id="standard-input/output-stdio" or
    name="standard-input/output-stdio"
  4. Reports a broken link when the element isn't found, even though the URL works in browsers
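
The four steps above can be reproduced outside Sphinx with the standard library alone. This is a minimal sketch, not the linkcheck implementation: `IdCollector` is a hypothetical helper, and the HTML snippet is an assumed page whose `id` attribute holds the percent-encoded text literally.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class IdCollector(HTMLParser):
    """Hypothetical helper: collects every id/name attribute value in a document."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key in ('id', 'name') and value is not None:
                self.anchors.add(value)


url = 'https://example.com/page#standard-input%2Foutput-stdio'
# Assumed page content: the id contains the literal characters "%2F".
html = '<h2 id="standard-input%2Foutput-stdio">Standard input/output</h2>'

fragment = urlsplit(url).fragment   # step 1: urlsplit leaves the fragment encoded
decoded = unquote(fragment)         # step 2: 'standard-input/output-stdio'

collector = IdCollector()
collector.feed(html)

found_decoded = decoded in collector.anchors    # step 3: False, hence the false report
found_encoded = fragment in collector.anchors   # True: the encoded form is what the page uses
```

Searching only for the decoded form misses the element, while the encoded form as written in the URL matches it.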

Changes Made

  • Enhanced AnchorCheckParser to check for multiple variants of the anchor:
    • The decoded version (current behavior)
    • The original encoded version
    • A re-encoded version, if the decoded version contains characters that require percent-encoding
  • Added comprehensive tests to verify the new behavior
  • Updated the contains_anchor function to accept both decoded and original encoded anchors
  • Added entry to CHANGES.rst
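
To make the three variants concrete, here is a small sketch using `urllib.parse`. The fragment `a%2fb` is an invented input, chosen because its lowercase hex digits make all three variants differ; for the PR's own example the re-encoded form coincides with the original.

```python
from urllib.parse import quote, unquote

original = 'a%2fb'                    # fragment exactly as written in the URL
decoded = unquote(original)           # 'a/b'
reencoded = quote(decoded, safe='')   # 'a%2Fb' (quote emits uppercase hex digits)

search_variations = {decoded, original, reencoded}

# The PR's example is already in the form quote() produces,
# so the set collapses to two entries.
pr_original = 'standard-input%2Foutput-stdio'
pr_decoded = unquote(pr_original)
pr_variations = {pr_decoded, pr_original, quote(pr_decoded, safe='')}
```

Because the variants are stored in a set, a re-encoded form that equals one of the other two adds nothing.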

Testing Done

  • Added unit tests for the AnchorCheckParser class
  • Added integration tests with a mock HTTP server that serves HTML with encoded anchors
  • Verified that all tests pass with the new implementation
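
An integration test of the kind described could look roughly like the following. It is a sketch built on the standard library's `http.server` and `http.client`, not on Sphinx's own test utilities; `EncodedAnchorHandler` and `PAGE` are made-up names, and the anchor check is a plain substring test standing in for the real parser.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import unquote

# Assumed page: the id attribute holds the percent-encoded text literally.
PAGE = b'<html><body><h2 id="standard-input%2Foutput-stdio">stdio</h2></body></html>'


class EncodedAnchorHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: always serves one page with an encoded id."""

    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep test output quiet


server = HTTPServer(('127.0.0.1', 0), EncodedAnchorHandler)  # port 0: pick a free port
thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()
try:
    conn = http.client.HTTPConnection('127.0.0.1', server.server_address[1])
    conn.request('GET', '/page')  # fragments are never sent to the server
    body = conn.getresponse().read().decode('utf-8')
    conn.close()

    fragment = 'standard-input%2Foutput-stdio'
    encoded_hit = f'id="{fragment}"' in body
    decoded_hit = f'id="{unquote(fragment)}"' in body
finally:
    server.shutdown()
    server.server_close()
```

A real test would run the linkcheck builder against the server's URL and assert that no "Anchor not found" entry is reported.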

Fixes

Fixes #13620

@nrdlngr force-pushed the fix-linkcheck-anchor-encoding branch 3 times, most recently from 98f3a53 to f896df9, on June 6, 2025 17:42
- Enhanced AnchorCheckParser to handle multiple anchor variations
- Added comprehensive test coverage for encoded anchors
- Fixed false 'Anchor not found' errors for URLs with encoded characters
- Maintains full backward compatibility
- All linting checks pass
@nrdlngr force-pushed the fix-linkcheck-anchor-encoding branch from f896df9 to 5265662 on June 6, 2025 17:53
Comment on lines +728 to +730
        self.search_variations = {
            search_anchor,  # decoded (current behavior)
        }

I'd recommend that we try to search the anchor name variations in the order listed in the WHATWG HTML spec for Scrolling to a fragment: https://html.spec.whatwg.org/commit-snapshots/4467ddf3235732614a0202729d6ec0e04c33b597/#scrolling-to-a-fragment

To do that, I think we may need to use a different data structure than Python's set, because a set doesn't necessarily iterate over its elements in the order they were added.

My mistake; we only use the search_variations set attribute for presence-checking as we work our way through each HTML document. Iteration order does not matter. So: I think set is fine, and we do not need to worry about the order in which we add items, provided that the resulting collection is equivalent to the anchor names a browser could check for.
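
The reason ordering is irrelevant can be seen in a small sketch: the scan is driven by the document, and the collection is only asked whether a given name is acceptable. The id list and the `first_match` helper below are illustrative assumptions, not code from the PR.

```python
# ids as they would be encountered while parsing a document top to bottom
ids_in_document_order = ['intro', 'standard-input%2Foutput-stdio', 'footer']

# The same two names, added in opposite orders.
variations_a = {'standard-input/output-stdio', 'standard-input%2Foutput-stdio'}
variations_b = {'standard-input%2Foutput-stdio', 'standard-input/output-stdio'}


def first_match(ids, variations):
    """Return the first id, in document order, that is one of the accepted names."""
    return next((i for i in ids if i in variations), None)


match_a = first_match(ids_in_document_order, variations_a)
match_b = first_match(ids_in_document_order, variations_b)
no_match = first_match(ids_in_document_order, {'missing'})
```

Both sets yield the same match because membership tests do not depend on insertion order.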

Comment on lines +732 to +739
        # Add the original encoded version if provided
        if original_encoded_anchor:
            self.search_variations.add(original_encoded_anchor)

        # Add a re-encoded version if the decoded anchor contains characters
        # that would be encoded
        if search_anchor != quote(search_anchor, safe=''):
            self.search_variations.add(quote(search_anchor, safe=''))

It would be nice to extract the anchor-to-search-variations code into a helper function/method; doing so would hopefully also make it convenient to write unit tests around its behaviour.
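
One possible shape for such a helper is sketched below. The name `anchor_search_variations` is hypothetical; the body mirrors the logic in the diff (decoded form, optional original form, re-encoded form when it differs), and the checks that follow show how unit tests could target it directly.

```python
from urllib.parse import quote


def anchor_search_variations(search_anchor, original_encoded_anchor=''):
    """Hypothetical helper: return every anchor name the parser should accept."""
    variations = {search_anchor}  # decoded form (existing behaviour)

    # The form exactly as it appeared in the URL, when the caller supplies it.
    if original_encoded_anchor:
        variations.add(original_encoded_anchor)

    # A re-encoded form, only when encoding actually changes the text.
    reencoded = quote(search_anchor, safe='')
    if reencoded != search_anchor:
        variations.add(reencoded)

    return variations


plain = anchor_search_variations('intro')
pr_example = anchor_search_variations(
    'standard-input/output-stdio', 'standard-input%2Foutput-stdio'
)
non_ascii = anchor_search_variations('café')
```

Computing `quote()` once and reusing the result also avoids the duplicate call present in the diff.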

@@ -706,15 +713,37 @@ def contains_anchor(response: Response, anchor: str) -> bool:
 class AnchorCheckParser(HTMLParser):
     """Specialised HTML parser that looks for a specific anchor."""

-    def __init__(self, search_anchor: str) -> None:
+    def __init__(self, search_anchor: str, original_encoded_anchor: str = '') -> None:

A function call that contains the same value encoded in two different ways is, to me, something of an undesirable code smell. It's probably not the first item to focus on, but it would be nice if this initializer could be updated to accept only one form of the anchor (I'd probably recommend the original value received before any encoding-related operations have been attempted).
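
A sketch of what a single-input initializer might look like, assuming the parser receives the fragment exactly as written in the URL and derives the other forms itself. `SingleInputAnchorParser` and `has_anchor` are hypothetical names, not part of Sphinx.

```python
from html.parser import HTMLParser
from urllib.parse import quote, unquote


class SingleInputAnchorParser(HTMLParser):
    """Hypothetical parser: takes only the fragment as written in the URL."""

    def __init__(self, raw_anchor):
        super().__init__()
        decoded = unquote(raw_anchor)
        # Every encoding-related form is derived here, from one input.
        self.search_variations = {raw_anchor, decoded, quote(decoded, safe='')}
        self.found = False

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key in ('id', 'name') and value in self.search_variations:
                self.found = True
                break


def has_anchor(html, raw_anchor):
    """Return True if any element's id or name matches a variant of raw_anchor."""
    parser = SingleInputAnchorParser(raw_anchor)
    parser.feed(html)
    return parser.found


raw = 'standard-input%2Foutput-stdio'
matches_encoded_id = has_anchor('<h2 id="standard-input%2Foutput-stdio">x</h2>', raw)
matches_decoded_name = has_anchor('<a name="standard-input/output-stdio"></a>', raw)
matches_unrelated = has_anchor('<h2 id="other">x</h2>', raw)
```

With this shape the caller no longer has to keep two encodings of the same value in sync.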

@jayaddison
Contributor

After some initial confusion (documented to some extent in the linked issue thread #13620), I'm now supportive of this functionality, and would like to see this merged. I would like a few adjustments/refactorings to be made before then, though.

Successfully merging this pull request may close these issues.

Linkcheck Builder: False Positive "Anchor not found" Errors with Encoded Characters in Fragment Identifiers
2 participants