Linkcheck Builder: False Positive "Anchor not found" Errors with Encoded Characters in Fragment Identifiers #13620

Open
@nrdlngr

Describe the bug

Problem Description

The Sphinx linkcheck builder incorrectly reports "Anchor not found" errors for URLs with encoded characters in fragment identifiers (anchors), despite these URLs working correctly in web browsers.

Current Behavior

When encountering a URL with percent-encoded characters in the anchor/fragment (e.g., https://example.com/page#standard-input%2Foutput-stdio), the linkcheck builder:

  1. Extracts the fragment: standard-input%2Foutput-stdio
  2. Decodes it to: standard-input/output-stdio
  3. Searches for an HTML element with id="standard-input/output-stdio" or name="standard-input/output-stdio"
  4. Reports a broken link when the element isn't found
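The four steps above can be sketched with the standard library alone. This is a simplified illustration, not Sphinx's actual code: the class DecodedAnchorFinder and the function decoded_anchor_found are hypothetical names standing in for the anchor parser and the check routine.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urldefrag


class DecodedAnchorFinder(HTMLParser):
    """Hypothetical stand-in for the anchor parser; not Sphinx's code."""

    def __init__(self, search_anchor: str) -> None:
        super().__init__()
        self.search_anchor = search_anchor
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Step 3: look for id="..." or name="..." equal to the anchor.
        for key, value in attrs:
            if key in ("id", "name") and value == self.search_anchor:
                self.found = True


def decoded_anchor_found(url: str, html: str) -> bool:
    fragment = urldefrag(url).fragment  # Step 1: extract the fragment
    anchor = unquote(fragment)          # Step 2: percent-decode it
    parser = DecodedAnchorFinder(anchor)
    parser.feed(html)
    parser.close()
    return parser.found                 # Step 4: False means "broken link"


url = "https://example.com/page#standard-input%2Foutput-stdio"
encoded_page = '<h2 id="standard-input%2Foutput-stdio">Stdio</h2>'
decoded_page = '<h2 id="standard-input/output-stdio">Stdio</h2>'

print(decoded_anchor_found(url, encoded_page))
print(decoded_anchor_found(url, decoded_page))
```

Only the decoded form is compared, so a page whose id attribute contains the literal text %2F is reported as missing the anchor.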

Expected Behavior

The linkcheck builder should accept a URL with a percent-encoded fragment whenever that URL resolves to its target in a browser. This includes cases where:

  1. The HTML contains the encoded form in the id attribute: id="standard-input%2Foutput-stdio"
  2. The browser handles the URL correctly through its own anchor matching rules

Example

URL being checked:

https://example.com/page#standard-input%2Foutput-stdio

Sphinx error:

broken link: Anchor 'standard-input/output-stdio' not found

Actual HTML in target page:

<element id="standard-input%2Foutput-stdio">...</element>

This URL works correctly in web browsers but fails in Sphinx linkcheck.

Affected Code

The issue appears to be in the following places, both in the same module:

  • sphinx/builders/linkcheck.py in HyperlinkAvailabilityCheckWorker._check_uri()
  • sphinx/builders/linkcheck.py in AnchorCheckParser.handle_starttag()

Impact

This issue causes false positive "broken link" reports in documentation that references modern web pages with encoded characters in anchor IDs, leading to:

  • Failed CI/CD pipelines due to false positive link checks
  • Documentation teams needing to create workarounds or ignore valid URLs
  • Extra maintenance overhead to verify links manually
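One workaround of the kind mentioned above is to exclude the affected anchor from checking in conf.py. The sketch below uses the linkcheck_anchors_ignore option, which takes regular expressions matched against anchors. The pattern is illustrative only, and it accepts both the encoded and decoded spelling because this sketch does not assume which form the option is compared against.

```python
# conf.py (fragment): skip anchor verification for the affected anchor.
# The pattern is an example for the URL in this report; adjust it per project.
linkcheck_anchors_ignore = [
    r"^!",  # Sphinx's default entry, kept so the default behavior is preserved
    r"^standard-input(%2F|/)output-stdio$",
]
```

The cost of this workaround is that the anchor is no longer verified at all, so a genuinely broken anchor matching the pattern would go unreported.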

Possible Solution Approach

Enhance anchor checking to consider both the decoded anchor (current behavior) and the original encoded anchor as valid matches. This would better align with browser behavior while maintaining backward compatibility.
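A minimal sketch of this approach, assuming the check has access to the fragment exactly as written in the URL. The names EitherFormAnchorFinder and anchor_found are hypothetical; this is not the code from the planned PR nor Sphinx's current implementation.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class EitherFormAnchorFinder(HTMLParser):
    """Hypothetical parser that accepts the encoded or the decoded anchor."""

    def __init__(self, raw_fragment: str) -> None:
        super().__init__()
        # Accept the fragment as written and its percent-decoded form.
        self.candidates = {raw_fragment, unquote(raw_fragment)}
        self.found = False

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key in ("id", "name") and value in self.candidates:
                self.found = True


def anchor_found(raw_fragment: str, html: str) -> bool:
    parser = EitherFormAnchorFinder(raw_fragment)
    parser.feed(html)
    parser.close()
    return parser.found


fragment = "standard-input%2Foutput-stdio"
print(anchor_found(fragment, '<h2 id="standard-input%2Foutput-stdio">A</h2>'))
print(anchor_found(fragment, '<h2 id="standard-input/output-stdio">B</h2>'))
print(anchor_found(fragment, '<h2 id="unrelated">C</h2>'))
```

Because the decoded form is still among the candidates, pages that pass the check today continue to pass; the change only adds the encoded spelling as a second accepted match.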

Additional Context

  • Modern web platforms often use URL-safe encodings in element IDs
  • Browsers try the fragment exactly as written first, then fall back to its percent-decoded form, when locating the target element
  • This impacts documentation referencing web frameworks, APIs, and various technical documentation sites

Environment Information

Platform:              darwin; (macOS-15.5-arm64-arm-64bit)
Python version:        3.12.7 (main, Oct  8 2024, 14:30:16) [Clang 16.0.0 (clang-1600.0.26.3)])
Python implementation: CPython
Sphinx version:        8.3.0+/fd7eef97b
Docutils version:      0.21.2
Jinja2 version:        3.1.6
Pygments version:      2.19.1

Sphinx extensions

Additional context

I already have a fix implemented that I'd like to contribute via a PR. I've tested it locally and it works for my use case.
