feat(layout&typesetting): Precise layout recognition & multi-available space typesetting #89

xxnuo · 2025-02-12T16:49:19Z

Is your feature request related to a problem? Please describe.

Text coverage and paragraph detection are currently inaccurate. Getting good translation results requires tuning short-line-split-factor separately for each document, so no single setting works universally, and line breaks often end up in the wrong place.

Describe the solution you'd like

Like the PDF translation implementation of https://doclingo.cn:

Scale the translated text according to the position of the original text so that the content fits completely, keeping the layout unchanged.
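The scale-to-fit idea can be sketched roughly as follows. This is only an illustration, not how doclingo or this project implements it: the function names, the per-character width model (wide CJK characters count as 1 em, everything else as 0.5 em), and the greedy per-character wrapping are all assumptions. A real implementation would read glyph advances from the font.

```python
import unicodedata


def char_advance(ch: str) -> float:
    """Rough advance width in ems (assumption: wide/fullwidth = 1.0, others = 0.5)."""
    return 1.0 if unicodedata.east_asian_width(ch) in ("W", "F") else 0.5


def lines_needed(text: str, box_width: float, font_size: float) -> int:
    """Count the lines produced by greedy per-character wrapping."""
    lines, used = 1, 0.0
    for ch in text:
        w = char_advance(ch) * font_size
        if used > 0 and used + w > box_width:
            lines += 1
            used = 0.0
        used += w
    return lines


def fit_font_size(text, box_width, box_height,
                  max_size=12.0, min_size=1.0, line_height=1.2, tol=0.01):
    """Binary-search the largest font size whose wrapped text fits the box.

    Returns min_size if even that does not fit; callers should check.
    """
    def fits(size: float) -> bool:
        height = lines_needed(text, box_width, size) * size * line_height
        return height <= box_height

    if fits(max_size):
        return max_size
    lo, hi = min_size, max_size
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fits(mid):
            lo = mid  # mid fits, try larger
        else:
            hi = mid  # mid overflows, try smaller
    return lo
```

For example, `fit_font_size("Figure 1", 100, 20)` keeps the original 12 pt size because the text already fits, while a longer string in the same box gets a smaller size.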

If any problems come up, I'm happy to help. 😊

Additional context
sample-complex-document.updated-zh-CN_doclingo.ai.pdf

awwaawwa · 2025-02-12T16:56:12Z

I'm considering directly optimizing the OCR model to improve accuracy and solve this problem.

The short line segmentation coefficient is just a workaround; it will be removed later.

awwaawwa · 2025-02-12T16:58:05Z

For tables of contents, authors, citations, cross-column and cross-page paragraphs, code and algorithm blocks, and other areas, I want to solve them all with OCR-based methods rather than with rule-based processing of information extracted directly from the PDF.

awwaawwa · 2025-02-12T17:01:54Z

For paragraphs spanning columns and pages, the layout engine needs to support multi-available space paragraphs, which will be resolved by refactoring the layout later.

Text can become too small after dynamic scaling because the glyphs in the translation are taller than those in the original. It can also be caused by excessive horizontal expansion (such as "图1" being translated to "Figure 1"). I'm working on solutions for such cases, such as appropriately expanding the available space.

Additionally, I'm researching center alignment, justified alignment, and the Knuth-Plass (KP) line-breaking algorithm...

PS: Currently only considering left-to-right text layout.
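For readers unfamiliar with the line-breaking approach mentioned above, here is a heavily simplified sketch of its core idea: choose break points by minimizing a global cost instead of filling each line greedily. It is not the full Knuth-Plass algorithm (no stretchable glue, penalties, or hyphenation) and not this project's code. The function name `break_lines`, the squared-slack cost, and the one-unit space width are assumptions for illustration.

```python
def break_lines(words, max_width, measure=len):
    """Split words into lines minimizing the sum of squared trailing slack.

    The last line is free (cost 0), as is usual for ragged-right paragraphs.
    `measure` returns a word's width; a space is assumed to be 1 unit wide.
    """
    n = len(words)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i]: minimal cost of laying out words[i:]
    nxt = [n] * (n + 1)      # nxt[i]: index of the first word of the next line
    best[n] = 0.0

    for i in range(n - 1, -1, -1):
        width = -1  # cancels the space added before the first word
        for j in range(i, n):
            width += 1 + measure(words[j])
            if width > max_width:
                break
            slack = max_width - width
            cost = 0.0 if j == n - 1 else float(slack ** 2)
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                nxt[i] = j + 1

    if best[0] == inf:
        raise ValueError("a word is wider than max_width")

    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:nxt[i]]))
        i = nxt[i]
    return lines
```

With `break_lines("aaa bb cc ddddd".split(), 6)`, a greedy breaker would produce `aaa bb / cc / ddddd`, leaving a very short middle line, while this sketch returns `aaa / bb cc / ddddd`, which spreads the slack more evenly.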

awwaawwa · 2025-02-12T17:03:29Z

The examples of paragraphs with non-rectangular available space that you provided are indeed a valuable reference. This can be achieved by running OCR layout analysis, then parsing the precise positions of the PDF characters to obtain the exact available space, and finally laying out the text with the multi-available-space feature.
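One way to picture multi-available-space layout is to treat the available space as an ordered list of rectangles (a non-rectangular region can be approximated by stacking several) and flow the words through them in reading order. The sketch below is a guess at the general shape, not the planned layout engine: `Box`, `PlacedLine`, `flow_text`, the fixed per-character width, and the top-down y axis are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: float
    y: float        # top edge; y grows downward (assumption)
    width: float
    height: float


@dataclass
class PlacedLine:
    box_index: int
    x: float
    y: float
    text: str


def flow_text(words, boxes, font_size, line_height=1.2, char_em=0.5):
    """Greedily fill each box with lines, moving to the next box when one is full.

    Returns (placed_lines, leftover_words). Width is estimated as
    len(text) * char_em * font_size, which is only a stand-in for font metrics.
    """
    def width(s: str) -> float:
        return len(s) * char_em * font_size

    leading = font_size * line_height
    placed = []
    i = 0
    for b_idx, box in enumerate(boxes):
        y = box.y
        while i < len(words) and y + leading <= box.y + box.height:
            line = words[i]
            i += 1  # always take at least one word so the loop makes progress
            while i < len(words) and width(line + " " + words[i]) <= box.width:
                line += " " + words[i]
                i += 1
            placed.append(PlacedLine(b_idx, box.x, y, line))
            y += leading
    return placed, words[i:]
```

A paragraph spanning two columns would pass the bottom of the first column and the top of the second as two boxes. Any leftover words signal that the text needs to be scaled down or the available space expanded.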

awwaawwa · 2025-02-12T17:11:49Z

I estimate this improvement will begin after the layout engine refactoring.

xxnuo · 2025-02-12T23:16:29Z

I've tried the surya OCR model (https://github.com/VikParuchuri/surya/tree/master) before. It has been optimized for table and text-line recognition and works pretty well, but it requires a professional GPU and is extremely slow.

Paddle OCR is slightly less accurate but faster. An optimized ONNX OCR model repository is available at https://github.com/jingsongliujing/OnnxOCR.

But I'm not sure how much improvement it has over the current model.

For reference only.

xxnuo · 2025-02-21T00:32:42Z

> I estimate this improvement will begin after the layout engine refactoring.

https://github.com/PaddlePaddle/PaddleX/blob/release/3.0-rc/docs/module_usage/tutorials/ocr_modules/layout_detection.md

awwaawwa · 2025-02-21T02:02:30Z

> > I estimate this improvement will begin after the layout engine refactoring.
>
> https://github.com/PaddlePaddle/PaddleX/blob/release/3.0-rc/docs/module_usage/tutorials/ocr_modules/layout_detection.md

@xxnuo Thank you for your suggestion, I'll ask my colleagues to look into it.

xxnuo added the enhancement New feature or request label Feb 12, 2025

awwaawwa self-assigned this Feb 12, 2025

awwaawwa added the Normal Priority label Feb 12, 2025

awwaawwa changed the title ~~feat: Imagine a new text overlay method~~ feat(layout&typesetting): Precise layout recognition & multi-available space typesetting Feb 12, 2025
