Fix linkcheck anchor encoding issue #13621

nrdlngr · 2025-06-06T02:48:13Z

Fix linkcheck anchor encoding issue (#13620)

Description

This PR fixes an issue where the linkcheck builder incorrectly reports "Anchor not found" errors
for URLs with encoded characters in fragment identifiers (anchors), despite these URLs working
correctly in web browsers.

Current Behavior

When encountering a URL with percent-encoded characters in the anchor/fragment (e.g.,
https://example.com/page#standard-input%2Foutput-stdio), the linkcheck builder:

- Extracts the fragment: standard-input%2Foutput-stdio
- Decodes it to: standard-input/output-stdio
- Searches for an HTML element with id="standard-input/output-stdio" or name="standard-input/output-stdio"
- Reports a broken link when the element isn't found, even though the URL works in browsers
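
The steps above can be reproduced outside Sphinx with the standard library alone. This is a minimal sketch, not the linkcheck implementation: the `IdCollector` class and the `PAGE` markup are hypothetical, and the page is assumed to carry the percent-encoded text literally in its `id` attribute.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class IdCollector(HTMLParser):
    """Collect every id/name attribute value seen in a document (illustrative only)."""

    def __init__(self) -> None:
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key in ('id', 'name') and value is not None:
                self.anchors.add(value)


# Assumption: the target page uses the literal text "%2F" inside its id.
PAGE = '<h2 id="standard-input%2Foutput-stdio">Standard input/output</h2>'

url = 'https://example.com/page#standard-input%2Foutput-stdio'
fragment = urlsplit(url).fragment  # fragment exactly as written in the URL
decoded = unquote(fragment)        # what a decoded-only search looks for

collector = IdCollector()
collector.feed(PAGE)
collector.close()

raw_match = fragment in collector.anchors     # the encoded form is present
decoded_match = decoded in collector.anchors  # the decoded form is not
print(raw_match, decoded_match)
```

Under that assumption, a search that only uses the decoded form misses an element that the raw fragment would have matched.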

Changes Made

- Enhanced AnchorCheckParser to check for multiple variants of the anchor:
  - The decoded version (current behavior)
  - The original encoded version
  - A re-encoded version if the decoded version contains encoding-required characters
- Added comprehensive tests to verify the new behavior
- Updated the contains_anchor function to accept both decoded and original encoded anchors
- Added entry to CHANGES.rst

Testing Done

- Added unit tests for the AnchorCheckParser class
- Added integration tests with a mock HTTP server that serves HTML with encoded anchors
- Verified that all tests pass with the new implementation

Fixes

Fixes #13620

- Enhanced AnchorCheckParser to handle multiple anchor variations
- Added comprehensive test coverage for encoded anchors
- Fixed false 'Anchor not found' errors for URLs with encoded characters
- Maintains full backward compatibility
- All linting checks pass

jayaddison · 2025-06-13T10:24:08Z

sphinx/builders/linkcheck.py

+        self.search_variations = {
+            search_anchor,  # decoded (current behavior)
+        }


I'd recommend that we try to search the anchor name variations in the order listed in the WHATWG HTML spec for Scrolling to a fragment: https://html.spec.whatwg.org/commit-snapshots/4467ddf3235732614a0202729d6ec0e04c33b597/#scrolling-to-a-fragment

To do that, I think we may need to use a different data structure than Python's set, because it doesn't necessarily iterate over elements in the order they were added/inserted.

My mistake; we only use the search_variations set attribute for presence-checking as we progress our way through each HTML document. Iteration order does not matter. So: I think set is fine, and we do not need to worry about the order that we add items -- provided that the resulting collection is equivalent to the anchor names a browser could check for.

jayaddison · 2025-06-13T10:25:10Z

sphinx/builders/linkcheck.py

+        # Add the original encoded version if provided
+        if original_encoded_anchor:
+            self.search_variations.add(original_encoded_anchor)
+
+        # Add a re-encoded version if the decoded anchor contains characters
+        # that would be encoded
+        if search_anchor != quote(search_anchor, safe=''):
+            self.search_variations.add(quote(search_anchor, safe=''))


It would be nice to extract the anchor-to-search-variations code into a helper function/method; doing so would hopefully also make it convenient to write unit tests around its behaviour.
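
One possible shape for such a helper is sketched below. The name `anchor_search_variations` is hypothetical and does not exist in Sphinx. Unlike the diff above, the sketch assumes it receives the fragment exactly as written in the URL and derives the decoded and re-encoded forms itself.

```python
from urllib.parse import quote, unquote


def anchor_search_variations(anchor: str) -> frozenset:
    """Return the anchor names worth searching for (hypothetical helper).

    `anchor` is assumed to be the fragment as it appears in the URL,
    before any decoding has been attempted.
    """
    decoded = unquote(anchor)
    variations = {decoded, anchor}  # decoded form plus the original form

    # Add a fully percent-encoded form only when it differs from the decoded one.
    re_encoded = quote(decoded, safe='')
    if re_encoded != decoded:
        variations.add(re_encoded)

    return frozenset(variations)
```

Because the function is pure, unit tests can call it directly with a handful of fragments and compare the returned sets, without constructing a parser or an HTTP response.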

jayaddison · 2025-06-13T10:27:29Z

sphinx/builders/linkcheck.py

@@ -706,15 +713,37 @@ def contains_anchor(response: Response, anchor: str) -> bool:
 class AnchorCheckParser(HTMLParser):
    """Specialised HTML parser that looks for a specific anchor."""

-    def __init__(self, search_anchor: str) -> None:
+    def __init__(self, search_anchor: str, original_encoded_anchor: str = '') -> None:


A function call that contains the same value encoded in two different ways is, to me, something of an undesirable code smell. It's probably not the first item to focus on, but it would be nice if this initializer could be updated to accept only one form of the anchor (I'd probably recommend the original value received before any encoding-related operations have been attempted).
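
As a rough illustration of that suggestion, the sketch below shows a parser whose initializer takes only the anchor as originally received and builds the search variations internally. The class name `SingleFormAnchorParser` and the wrapper `html_contains_anchor` are hypothetical; this is not the code in sphinx/builders/linkcheck.py.

```python
from html.parser import HTMLParser
from urllib.parse import quote, unquote


class SingleFormAnchorParser(HTMLParser):
    """Look for an anchor, given only its original (undecoded) form."""

    def __init__(self, anchor: str) -> None:
        super().__init__()
        decoded = unquote(anchor)
        # Original, decoded, and fully re-encoded forms; duplicates collapse in the set.
        self.search_variations = {anchor, decoded, quote(decoded, safe='')}
        self.found = False

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            # Membership test only, so the set's iteration order is irrelevant.
            if key in ('id', 'name') and value in self.search_variations:
                self.found = True
                break


def html_contains_anchor(html: str, anchor: str) -> bool:
    """Feed a complete HTML string to the parser and report whether the anchor was seen."""
    parser = SingleFormAnchorParser(anchor)
    parser.feed(html)
    parser.close()
    return parser.found
```

With this shape, callers never need to pass the same value in two encodings, and any encoding logic stays in one place.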

jayaddison · 2025-06-13T10:28:51Z

After some initial confusion (documented to some extent in the linked issue thread #13620), I'm now supportive of this functionality, and would like to see this merged. I would like a few adjustments/refactorings to be made before then, though.

nrdlngr force-pushed the fix-linkcheck-anchor-encoding branch 3 times, most recently from 98f3a53 to f896df9, June 6, 2025 17:42

nrdlngr force-pushed the fix-linkcheck-anchor-encoding branch from f896df9 to 5265662, June 6, 2025 17:53

Merge branch 'master' into fix-linkcheck-anchor-encoding

cb19286

jayaddison mentioned this pull request Jun 12, 2025

Linkcheck Builder: False Positive "Anchor not found" Errors with Encoded Characters in Fragment Identifiers #13620

Open

jayaddison reviewed Jun 13, 2025
