Apply Unicode normalization to text layer and extracted text #7096

robertknight · 2025-05-19T14:26:21Z

Text quote anchoring in PDFs relies on the content of the rendered text layer matching the text produced by PDF.js's text extraction APIs, except for differences in whitespace which are handled by a translateOffsets helper. This assumption no longer holds in more recent versions of PDF.js because different Unicode normalization is applied to the extracted text versus the text layer.

Resolve the issue by applying our own consistent normalization to both the text layer and extracted text.

Part of #6784.

TODO:

Track down the history of the custom Unicode normalization that PDF.js uses and add a summary of why it is used (see #7096 (comment))
Tests

codecov · 2025-05-19T14:28:54Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.39%. Comparing base (7029044) to head (4dd1b74).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #7096   +/-   ##
=======================================
  Coverage   99.39%   99.39%           
=======================================
  Files         279      279           
  Lines       11326    11348   +22     
  Branches     2727     2731    +4     
=======================================
+ Hits        11257    11279   +22     
  Misses         69       69

robertknight · 2025-05-20T11:36:23Z

A summary of Unicode normalization in PDF.js:

It was originally introduced in mozilla/pdf.js@c85ec05 as part of the text search feature. At that time there was no normalization API in JS, so a manual mapping was used.
In 2023, mozilla/pdf.js#16200 ("[api-minor] Don't normalize the text used in the text layer") changed text layer construction to avoid normalization. Normalization was instead applied when copying text or using the "find" feature. At the same time, the built-in JS normalization API (String.prototype.normalize) was adopted to avoid hardcoding a large map in PDF.js. However, this normalized more characters than the manual mapping had (see mozilla/pdf.js#16200 (comment)). To preserve the existing limited normalization, a filter is applied (via a regex) so that NFKC normalization is only applied to a subset of characters.

Per the PR linked above, the rationale for not normalizing the text layer was to better align the characters in the text layer with the rendered page. Consider for example a ﬂ ligature which is visually one character in many fonts but becomes a pair "fl" after NFKC normalization.

When normalized some chars can be replaced by several ones and it induced to have some extra chars in the text layer.
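To make the ligature example concrete, here is a minimal sketch of regex-filtered NFKC normalization. The `LIGATURES` pattern and the `normalizeSubset` name are assumptions for illustration only; the actual character subset that PDF.js filters on is different and larger.

```typescript
// Hypothetical filter: only the Latin ligature block U+FB00..U+FB06.
// PDF.js uses its own, different character set.
const LIGATURES = /[\uFB00-\uFB06]/g;

// Apply NFKC only to characters matched by the filter, leaving the rest as-is.
function normalizeSubset(text: string): string {
  return text.replace(LIGATURES, ch => ch.normalize('NFKC'));
}

const raw = 'over\uFB02ow'; // "overﬂow" written with U+FB02 (the "fl" ligature)
const normalized = normalizeSubset(raw);

// The ligature expands to two characters, so every offset after it shifts by one.
console.log(raw.length, normalized.length); // 7 8

// Unfiltered NFKC touches more characters, e.g. superscript two becomes "2".
console.log('x\u00B2'.normalize('NFKC')); // "x2"
console.log(normalizeSubset('x\u00B2') === 'x\u00B2'); // true
```

The length change is the source of the misalignment: a text layer built from `raw` has one character where the normalized text has two.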

This PR re-introduces normalization in the text layer, so it will cause some misalignment in some cases compared to using PDF.js without Hypothesis. So far in testing I haven't seen a case where it causes major problems, so I think we can get away with it.

If we find in future that this causes significant problems, we might need to implement a different approach which allows for mapping positions in the normalized extracted text to positions in the non-normalized text layer.
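One way such a mapping could work is sketched below, purely as an illustration and not as the approach taken in this PR or its follow-up. The function name `normalizeWithOffsets` is hypothetical. It normalizes one code point at a time, which keeps the bookkeeping simple but means combining sequences (for example "e" followed by U+0301) are not composed the way whole-string NFKC would compose them.

```typescript
interface NormalizedText {
  normalized: string;
  // offsets[i] is the offset in the original text of the code point that
  // produced normalized[i]. A final sentinel entry maps the end position.
  offsets: number[];
}

function normalizeWithOffsets(text: string): NormalizedText {
  let normalized = '';
  const offsets: number[] = [];
  let pos = 0;
  while (pos < text.length) {
    // Read one full code point so surrogate pairs stay together.
    const ch = String.fromCodePoint(text.codePointAt(pos) as number);
    const norm = ch.normalize('NFKC');
    for (let i = 0; i < norm.length; i++) {
      offsets.push(pos);
    }
    normalized += norm;
    pos += ch.length;
  }
  offsets.push(pos);
  return { normalized, offsets };
}

const result = normalizeWithOffsets('a\uFB02b'); // "aﬂb"
console.log(result.normalized); // "aflb"
console.log(result.offsets); // [0, 1, 1, 2, 3]
```

Both "f" and "l" in the normalized text map back to offset 1, the position of the single ligature character in the non-normalized text.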

robertknight · 2025-05-20T11:45:33Z

There is now an option to disable normalization when we extract text via the disableNormalization flag to getTextContent. This would align the extracted text with the text layer in the opposite way (disable normalization of both, instead of normalizing both). A caveat is that the extracted text will then no longer match previously extracted text. We do have some ability to tolerate that via fuzzy matching.
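A rough sketch of that alternative follows. `PageLike` and `TextItemLike` are minimal stand-in types written for this example rather than the real PDF.js typings, and `extractRawText` is a hypothetical helper. How the items are joined here is also an assumption and does not reflect how Hypothesis handles whitespace between items.

```typescript
// Stand-in types covering only what this sketch needs.
interface TextItemLike {
  str?: string; // marked-content items have no `str`
}

interface PageLike {
  getTextContent(params?: {
    disableNormalization?: boolean;
  }): Promise<{ items: TextItemLike[] }>;
}

// Extract page text without PDF.js's normalization, so that it matches
// the non-normalized text layer.
async function extractRawText(page: PageLike): Promise<string> {
  const content = await page.getTextContent({ disableNormalization: true });
  return content.items.map(item => item.str ?? '').join('');
}
```

With this approach a ligature such as "ﬂ" stays a single character in both the extracted text and the text layer, at the cost of no longer matching quotes that were captured from normalized text.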

Text quote anchoring in PDFs relies on the content of the rendered text layer matching the text produced by PDF.js's text extraction APIs, except for differences in whitespace which are handled by a `translateOffsets` helper. This assumption no longer holds in more recent versions of PDF.js because Unicode normalization is no longer applied to the text layer. Resolve the issue by applying normalization to the text layer ourselves. Part of #6784.

robertknight · 2025-05-29T13:33:12Z

Tentatively closing in favor of a different approach: #7123

robertknight mentioned this pull request May 19, 2025

Update PDF.js (2025 edition) #6784

Open

robertknight force-pushed the pdf-text-layer-norm branch 3 times, most recently from df4fb36 to 77c03bc Compare May 20, 2025 16:16

robertknight force-pushed the pdf-text-layer-norm branch from 77c03bc to f89436a Compare May 28, 2025 12:24

robertknight force-pushed the pdf-text-layer-norm branch from f89436a to 4dd1b74 Compare May 29, 2025 10:44

robertknight mentioned this pull request May 29, 2025

Handle Unicode normalization differences in translateOffsets #7123

Draft

2 tasks

robertknight closed this May 29, 2025
