Description
Describe the bug
Linkcheck Builder: False Positive "Anchor not found" Errors with Encoded Characters in Fragment Identifiers
Problem Description
The Sphinx linkcheck builder incorrectly reports "Anchor not found" errors for URLs with encoded characters in fragment identifiers (anchors), despite these URLs working correctly in web browsers.
Current Behavior
When encountering a URL with percent-encoded characters in the anchor/fragment (e.g., https://example.com/page#standard-input%2Foutput-stdio), the linkcheck builder:
- Extracts the fragment: standard-input%2Foutput-stdio
- Decodes it to: standard-input/output-stdio
- Searches for an HTML element with id="standard-input/output-stdio" or name="standard-input/output-stdio"
- Reports a broken link when the element isn't found
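The steps above can be sketched with the standard library alone. This is a minimal illustration of the decode-then-search behavior, not Sphinx's actual code: the `AnchorFinder` class and the `decoded_anchor_found` helper are hypothetical names, and the HTML snippet is made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class AnchorFinder(HTMLParser):
    """Hypothetical stand-in for an anchor parser (not Sphinx's class)."""

    def __init__(self, search_anchor: str) -> None:
        super().__init__()
        self.search_anchor = search_anchor
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Look for an id or name attribute equal to the anchor being searched.
        for key, value in attrs:
            if key in ("id", "name") and value == self.search_anchor:
                self.found = True


def decoded_anchor_found(url: str, html: str) -> bool:
    """Search only for the percent-decoded fragment, as described above."""
    fragment = urlsplit(url).fragment  # 'standard-input%2Foutput-stdio'
    anchor = unquote(fragment)         # 'standard-input/output-stdio'
    parser = AnchorFinder(anchor)
    parser.feed(html)
    parser.close()
    return parser.found


url = "https://example.com/page#standard-input%2Foutput-stdio"
html = '<h2 id="standard-input%2Foutput-stdio">Standard input/output</h2>'
print(decoded_anchor_found(url, html))  # False: the encoded id is never compared
```

Because only the decoded string is compared, a page whose id attribute contains the literal %2F never matches.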
Expected Behavior
The linkcheck builder should recognize URLs with encoded characters in fragments that work in browsers. This includes cases where:
- The HTML contains the encoded form in the id attribute: id="standard-input%2Foutput-stdio"
- The browser handles the URL correctly through its own anchor matching rules
Example
URL being checked:
https://example.com/page#standard-input%2Foutput-stdio
Sphinx error:
broken link: Anchor 'standard-input/output-stdio' not found
Actual HTML in target page:
<element id="standard-input%2Foutput-stdio">...</element>
This URL works correctly in web browsers but fails in Sphinx linkcheck.
Affected Code
The issue appears to be in the following locations:
- sphinx/builders/linkcheck.py in HyperlinkAvailabilityCheckWorker._check_uri()
- sphinx/builders/linkcheck.py in AnchorCheckParser.handle_starttag()
Impact
This issue causes false positive "broken link" reports in documentation that references modern web pages with encoded characters in anchor IDs, leading to:
- Failed CI/CD pipelines due to false positive link checks
- Documentation teams needing to create workarounds or ignore valid URLs
- Extra maintenance overhead to verify links manually
Possible Solution Approach
Enhance anchor checking to consider both the decoded anchor (current behavior) and the original encoded anchor as valid matches. This would better align with browser behavior while maintaining backward compatibility.
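A minimal sketch of this approach, assuming a parser that is handed the raw fragment and compares each id/name attribute against both forms. `LenientAnchorFinder` and `anchor_exists` are hypothetical names used for illustration; this is not the code in sphinx/builders/linkcheck.py and not necessarily the shape of the eventual PR.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class LenientAnchorFinder(HTMLParser):
    """Hypothetical parser that matches either form of the anchor."""

    def __init__(self, raw_fragment: str) -> None:
        super().__init__()
        # Accept the fragment as written and its percent-decoded form.
        self.candidates = {raw_fragment, unquote(raw_fragment)}
        self.found = False

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key in ("id", "name") and value in self.candidates:
                self.found = True


def anchor_exists(url: str, html: str) -> bool:
    """Return True if the page has an element matching either anchor form."""
    parser = LenientAnchorFinder(urlsplit(url).fragment)
    parser.feed(html)
    parser.close()
    return parser.found
```

With this check, a page using id="standard-input%2Foutput-stdio" and a page using id="standard-input/output-stdio" would both satisfy the same link, while a page with neither is still reported as broken.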
Additional Context
- Modern web platforms often use URL-safe encodings in element IDs
- Browsers try the fragment both as written and in percent-decoded form when locating the target element
- This impacts documentation referencing web frameworks, APIs, and various technical documentation sites
How to Reproduce
When encountering a URL with percent-encoded characters in the anchor/fragment (e.g., https://example.com/page#standard-input%2Foutput-stdio), the linkcheck builder:
- Extracts the fragment: standard-input%2Foutput-stdio
- Decodes it to: standard-input/output-stdio
- Searches for an HTML element with id="standard-input/output-stdio" or name="standard-input/output-stdio"
- Reports a broken link when the element isn't found
Environment Information
Platform: darwin; (macOS-15.5-arm64-arm-64bit)
Python version: 3.12.7 (main, Oct 8 2024, 14:30:16) [Clang 16.0.0 (clang-1600.0.26.3)])
Python implementation: CPython
Sphinx version: 8.3.0+/fd7eef97b
Docutils version: 0.21.2
Jinja2 version: 3.1.6
Pygments version: 2.19.1
Sphinx extensions
Additional context
I already have a fix implemented that I'd like to contribute via a PR. I have tested it locally, and it resolves the false positive for my use case.