
Apply Unicode normalization to text layer and extracted text #7096

Closed
wants to merge 1 commit into from


robertknight
Member

@robertknight robertknight commented May 19, 2025

Text quote anchoring in PDFs relies on the content of the rendered text layer matching the text produced by PDF.js's text extraction APIs, except for differences in whitespace which are handled by a translateOffsets helper. This assumption no longer holds in more recent versions of PDF.js because different Unicode normalization is applied to the extracted text versus the text layer.

Resolve the issue by applying our own consistent normalization to both the text layer and extracted text.

Part of #6784.


TODO:


codecov bot commented May 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.39%. Comparing base (7029044) to head (4dd1b74).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7096   +/-   ##
=======================================
  Coverage   99.39%   99.39%           
=======================================
  Files         279      279           
  Lines       11326    11348   +22     
  Branches     2727     2731    +4     
=======================================
+ Hits        11257    11279   +22     
  Misses         69       69           


@robertknight
Member Author

A summary of Unicode normalization in PDF.js:

  1. It was originally introduced in mozilla/pdf.js@c85ec05 as part of the text search feature. At that time there was no normalization API in JS, so a manual mapping was used.
  2. In 2023, mozilla/pdf.js#16200 ("[api-minor] Don't normalize the text used in the text layer") changed text layer construction to avoid normalization. Normalization was instead applied when copying text or using the "find" feature. At the same time, PDF.js switched to the built-in JS normalization API (String.prototype.normalize) to avoid hardcoding a large map. This, however, normalized more characters than before (see mozilla/pdf.js#16200 (comment)). To preserve the existing limited normalization, a regex filter is applied so that NFKC normalization only affects a subset of characters.
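
As a rough illustration of the filtered normalization described in item 2, here is a minimal sketch. The character class and the name `normalizeSubset` are assumptions made up for this example; PDF.js uses its own, much larger set of characters.

```typescript
// Hypothetical subset of characters that should be NFKC-normalized.
// Here: the Latin ligatures U+FB00..U+FB06. This is NOT the PDF.js regex.
const NORMALIZE_SUBSET = /[\uFB00-\uFB06]/g;

// Apply NFKC only to characters matched by the filter; leave the rest as-is.
function normalizeSubset(text: string): string {
  return text.replace(NORMALIZE_SUBSET, (ch) => ch.normalize("NFKC"));
}

// "\uFB01" is the "fi" ligature, "\u2460" is CIRCLED DIGIT ONE.
const sample = "\uFB01nd \u2460";

// Filtered normalization expands the ligature but keeps the circled digit.
const filtered = normalizeSubset(sample);

// Unfiltered NFKC also rewrites the circled digit to a plain "1".
const full = sample.normalize("NFKC");
```

The difference between `filtered` and `full` is the "normalized more than before" problem: plain `String.prototype.normalize("NFKC")` touches characters that the original hand-written mapping left alone.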

Per the PR linked above, the rationale for not normalizing the text layer was to better align the characters in the text layer with the rendered page. Consider, for example, the "ﬂ" ligature (U+FB02), which is a single character, and in many fonts a single glyph, but becomes the two-character pair "fl" after NFKC normalization.

> When normalized some chars can be replaced by several ones and it induced to have some extra chars in the text layer.
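
To make the length change concrete, here is a small sketch using only the standard `String.prototype.normalize` API. The sample word is an arbitrary choice for illustration.

```typescript
// "re" + U+FB02 (the "fl" ligature) + "ow"
const original = "re\uFB02ow";
const nfkc = original.normalize("NFKC");

// The ligature is one UTF-16 code unit; NFKC expands it to "f" + "l".
const originalLength = original.length;
const nfkcLength = nfkc.length;

// Every offset after the ligature shifts by one in the normalized string.
const offsetInOriginal = original.indexOf("ow");
const offsetInNormalized = nfkc.indexOf("ow");
```

Any offset computed against one form of the text is therefore off by one (or more) when applied to the other form, which is why both sides need the same normalization.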

This PR re-introduces normalization in the text layer, so it will cause some misalignment in some cases compared to using PDF.js without Hypothesis. So far in testing I haven't seen a case where it causes major problems, so I think we can get away with it.

If we find in future that this causes significant problems, we might need to implement a different approach which allows for mapping positions in the normalized extracted text to positions in the non-normalized text layer.
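
One possible shape for such a mapping, as a sketch only: normalize the text one code point at a time and record, for each character of the normalized output, the offset it came from in the original. The name `normalizeWithOffsets` and the return shape are hypothetical and not part of this PR. Note also that per-code-point normalization can differ from normalizing the whole string when combining sequences are involved, so a real implementation would need to handle that.

```typescript
interface NormalizedText {
  // The normalized string.
  normalized: string;
  // offsets[i] is the UTF-16 offset in the original text of the character
  // that produced normalized[i]. A final entry maps the end of the string.
  offsets: number[];
}

function normalizeWithOffsets(text: string): NormalizedText {
  let normalized = "";
  const offsets: number[] = [];
  let index = 0;
  // Iterating a string with for...of yields whole code points.
  for (const ch of text) {
    const norm = ch.normalize("NFKC");
    for (let i = 0; i < norm.length; i++) {
      offsets.push(index);
    }
    normalized += norm;
    index += ch.length;
  }
  offsets.push(text.length); // sentinel for end-of-text positions
  return { normalized, offsets };
}

// Map a [start, end) range in the normalized text back to the original.
function toOriginalRange(
  map: NormalizedText,
  start: number,
  end: number,
): [number, number] {
  return [map.offsets[start], map.offsets[end]];
}

const mapped = normalizeWithOffsets("re\uFB02ow");
// "ow" occupies [4, 6) in "reflow" and [3, 5) in the original.
const range = toOriginalRange(mapped, 4, 6);
```

With a table like this, a quote could be matched against the normalized extracted text and then located in a non-normalized text layer.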

@robertknight
Member Author

PDF.js now has an option to disable normalization during text extraction: the `disableNormalization` flag passed to `getTextContent`. Using it would align the extracted text with the text layer in the opposite way (disable normalization of both, instead of normalizing both). A caveat is that the extracted text will then no longer match previously extracted text. We do have some ability to tolerate that via fuzzy matching.
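
A sketch of what that could look like. The `PageLike` and `TextItemLike` interfaces below are minimal stand-ins declared here so the example is self-contained; they are assumptions that mirror only the parts of the PDF.js page API used in this snippet, and `extractRawText` is a hypothetical helper name.

```typescript
// Minimal stand-ins for the PDF.js types (assumed shapes, not the real typings).
interface TextItemLike {
  str?: string; // marked-content items have no `str`
}

interface PageLike {
  getTextContent(params?: {
    disableNormalization?: boolean;
  }): Promise<{ items: TextItemLike[] }>;
}

// Extract page text without Unicode normalization, so that it matches a
// text layer that is also not normalized.
async function extractRawText(page: PageLike): Promise<string> {
  const content = await page.getTextContent({ disableNormalization: true });
  return content.items.map((item) => item.str ?? "").join("");
}
```

Text extracted this way keeps characters such as ligatures intact, so quotes saved from earlier, normalized extractions would only match through fuzzy matching.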

@robertknight robertknight force-pushed the pdf-text-layer-norm branch 3 times, most recently from df4fb36 to 77c03bc Compare May 20, 2025 16:16
@robertknight robertknight force-pushed the pdf-text-layer-norm branch from 77c03bc to f89436a Compare May 28, 2025 12:24
Text quote anchoring in PDFs relies on the content of the rendered text layer
matching the text produced by PDF.js's text extraction APIs, except for
differences in whitespace which are handled by a `translateOffsets` helper. This
assumption no longer holds in more recent versions of PDF.js because Unicode
normalization is no longer applied to the text layer.

Resolve the issue by applying normalization to the text layer ourselves.

Part of #6784.
@robertknight
Member Author

Tentatively closing in favor of a different approach: #7123
