Apply Unicode normalization to text layer and extracted text #7096
Codecov Report: All modified and coverable lines are covered by tests ✅
Coverage diff, main vs. #7096: Coverage 99.39% (unchanged); Files 279 (unchanged); Lines 11326 to 11348 (+22); Branches 2727 to 2731 (+4); Hits 11257 to 11279 (+22); Misses 69 (unchanged).
A summary of Unicode normalization in PDF.js:
Per the PR linked above, the rationale for not normalizing the text layer was to better align the characters in the text layer with the rendered page. Consider, for example, an "ﬂ" ligature (U+FB02), which is a single character in many fonts but becomes the pair "fl" after NFKC normalization.
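To make the ligature example concrete, here is a minimal sketch using the standard `String.prototype.normalize` method. The variable names are illustrative only and are not taken from the PR.

```typescript
// U+FB02 LATIN SMALL LIGATURE FL is a single UTF-16 code unit.
const ligature = "\uFB02";

// NFKC applies compatibility decomposition, which splits the
// ligature into the two separate letters "f" and "l".
const normalizedLigature = ligature.normalize("NFKC");

console.log(ligature.length); // 1
console.log(normalizedLigature.length); // 2
console.log(normalizedLigature === "fl"); // true
```

The change in length is what shifts character offsets between a normalized string and the glyphs drawn on the page.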
This PR re-introduces normalization in the text layer, so in some cases it will cause misalignment compared to using PDF.js without Hypothesis. So far in testing I haven't seen a case where it causes major problems, so I think we can get away with it. If we find in the future that this causes significant problems, we might need to implement a different approach which allows for mapping positions in the normalized extracted text to positions in the non-normalized text layer.
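For illustration, the alternative approach of mapping positions could look roughly like the sketch below. It is not code from this PR: `normalizeWithOffsets` and `NormalizedText` are hypothetical names, and the sketch normalizes one code point at a time, which is a simplification. Whole-string NFKC can also compose adjacent characters (for example a base letter followed by a combining accent), so a real implementation would need to handle those sequences.

```typescript
interface NormalizedText {
  // The normalized string.
  text: string;
  // offsets[i] is the index in the original string of the
  // character that produced text[i].
  offsets: number[];
}

// Hypothetical helper: normalize each code point separately and
// record where each output character came from.
function normalizeWithOffsets(original: string): NormalizedText {
  let text = "";
  const offsets: number[] = [];
  let pos = 0;
  for (const ch of Array.from(original)) {
    const norm = ch.normalize("NFKC");
    for (let i = 0; i < norm.length; i++) {
      offsets.push(pos);
    }
    text += norm;
    pos += ch.length; // astral code points occupy two UTF-16 units
  }
  return { text, offsets };
}

const mapped = normalizeWithOffsets("a\uFB02ow");
console.log(mapped.text); // "aflow"
console.log(mapped.offsets); // [0, 1, 1, 2, 3]
```

With such a map, an offset found in the normalized text can be translated back to the non-normalized text layer, at the cost of keeping an extra array per page.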
There is now an option to disable normalization when we extract text via the |
Text quote anchoring in PDFs relies on the content of the rendered text layer matching the text produced by PDF.js's text extraction APIs, except for differences in whitespace which are handled by a `translateOffsets` helper. This assumption no longer holds in more recent versions of PDF.js because Unicode normalization is no longer applied to the text layer. Resolve the issue by applying normalization to the text layer ourselves. Part of #6784.
Tentatively closing in favor of a different approach: #7123
Text quote anchoring in PDFs relies on the content of the rendered text layer matching the text produced by PDF.js's text extraction APIs, except for differences in whitespace which are handled by a `translateOffsets` helper. This assumption no longer holds in more recent versions of PDF.js because different Unicode normalization is applied to the extracted text versus the text layer.

Resolve the issue by applying our own consistent normalization to both the text layer and extracted text.
Part of #6784.
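As a rough sketch of the idea (not the code in this PR), applying one shared normalization function to both sources makes their contents comparable again. `normalizeForAnchoring` is a hypothetical name, the sample strings are made up, and the choice of NFKC here is an assumption.

```typescript
// Hypothetical helper: a single normalization form used for both
// the text layer and the extracted text. NFKC is an assumption.
function normalizeForAnchoring(s: string): string {
  return s.normalize("NFKC");
}

// Made-up sample: one source keeps the ligature, the other does not.
const extractedSample = "\uFB02ow chart";
const textLayerSample = "flow chart";

// Without a shared normalization the two strings differ.
console.log(extractedSample === textLayerSample); // false

// After normalizing both, they match.
console.log(
  normalizeForAnchoring(extractedSample) ===
    normalizeForAnchoring(textLayerSample)
); // true
```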
TODO: