feat(layout&typesetting): Precise layout recognition & multi-available space typesetting #89

Open
xxnuo opened this issue Feb 12, 2025 · 8 comments
Assignees: awwaawwa
Labels: enhancement (New feature or request), Normal Priority

Comments

@xxnuo
Contributor

xxnuo commented Feb 12, 2025

Is your feature request related to a problem? Please describe.

Current text coverage and paragraph detection are inaccurate. Different documents need different short-line-split-factor values to translate well, so no single setting works universally, and line breaks often go wrong.

Describe the solution you'd like

Like the PDF translation implementation of https://doclingo.cn:

[Three images]

Scale the translated text according to the position of the original text, so that the content fits completely and the layout stays unchanged.

If you run into problems, I can help. 😊
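A minimal sketch of the scale-to-fit idea, not doclingo's or this project's actual code. It assumes a crude glyph model (every character is `char_width_em` ems wide, every line is `line_height_em` ems tall); `lines_needed` and `fit_font_size` are hypothetical names.

```python
import textwrap


def lines_needed(text, font_size, box_width, char_width_em=0.5):
    """Count wrapped lines at font_size, assuming fixed-width glyphs."""
    chars_per_line = max(1, int(box_width / (font_size * char_width_em)))
    return len(textwrap.wrap(text, width=chars_per_line)) or 1


def fit_font_size(text, box_width, box_height, original_size,
                  min_size=1.0, line_height_em=1.2, step=0.25):
    """Return the largest size <= original_size at which text fits the box.

    Searches downward in fixed steps; falls back to min_size if nothing fits.
    """
    size = original_size
    while size > min_size:
        height = lines_needed(text, size, box_width) * size * line_height_em
        if height <= box_height:
            return size
        size -= step
    return min_size


if __name__ == "__main__":
    # A short label keeps its size; a long paragraph is shrunk to fit.
    print(fit_font_size("Figure 1", 100, 13, 10.0))
    print(fit_font_size("word " * 40, 100, 30, 10.0))
```

A real implementation would measure glyph widths from the font rather than assume a fixed ratio.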

Additional context
sample-complex-document.updated-zh-CN_doclingo.ai.pdf

@xxnuo xxnuo added the enhancement New feature or request label Feb 12, 2025
@awwaawwa
Member

I'm considering directly optimizing the OCR model to improve accuracy and solve this problem.

The short-line split factor is just a workaround; it will be removed later.

@awwaawwa
Member

I want to handle tables of contents, authors, citations, cross-column and cross-page paragraphs, code and algorithm blocks, and similar areas through OCR methods, rather than with rules applied to information extracted directly from the PDF.

@awwaawwa
Member

For paragraphs spanning columns and pages, the layout engine needs to support multi-available space paragraphs, which will be resolved by refactoring the layout later.
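One way to picture a "multi-available space" paragraph is a single word stream that is poured into several boxes in order, for example the bottom of one column followed by the top of the next. The sketch below is only an illustration under a fixed-width glyph assumption; `Box` and `flow_paragraph` are hypothetical names, not the layout engine's API.

```python
from dataclasses import dataclass


@dataclass
class Box:
    width: float   # usable line width in points
    height: float  # usable height in points


def flow_paragraph(words, boxes, font_size=10.0,
                   char_width_em=0.5, line_height_em=1.2):
    """Greedily fill lines box by box.

    Returns (lines placed in each box, words that did not fit anywhere).
    """
    placed = []
    i = 0
    for box in boxes:
        max_chars = max(1, int(box.width / (font_size * char_width_em)))
        max_lines = int(box.height / (font_size * line_height_em))
        lines = []
        while i < len(words) and len(lines) < max_lines:
            line = words[i]
            i += 1
            # Keep appending words while the line still fits the box width.
            while i < len(words) and len(line) + 1 + len(words[i]) <= max_chars:
                line += " " + words[i]
                i += 1
            lines.append(line)
        placed.append(lines)
    return placed, words[i:]


if __name__ == "__main__":
    words = "a b c d e f".split()
    print(flow_paragraph(words, [Box(15, 13), Box(25, 25)]))
```

Leftover words signal that the text needs scaling or more space before it can be placed.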

When text becomes too small after dynamic scaling, one possible cause is that glyph height increases when the original text is translated. Another is excessive horizontal expansion (for example, "图1" becoming "Figure 1"). I'm working on solutions for such cases, such as appropriately expanding the available space.
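As a rough illustration of expanding the available space, the sketch below widens a box to the right until it meets the nearest vertically overlapping obstacle or the page edge. The function name `expand_right` and the `(x0, y0, x1, y1)` tuple layout are assumptions made for this example only.

```python
def expand_right(box, obstacles, page_right):
    """Extend box (x0, y0, x1, y1) rightwards up to the nearest obstacle.

    Only obstacles that overlap the box vertically and start at or beyond
    its right edge limit the expansion; otherwise page_right is the limit.
    """
    x0, y0, x1, y1 = box
    limit = page_right
    for ox0, oy0, ox1, oy1 in obstacles:
        overlaps_vertically = oy0 < y1 and oy1 > y0
        if overlaps_vertically and ox0 >= x1:
            limit = min(limit, ox0)
    return (x0, y0, max(x1, limit), y1)


if __name__ == "__main__":
    caption = (10, 10, 30, 20)            # sized for the original short label
    others = [(50, 12, 80, 18), (40, 30, 60, 40)]
    print(expand_right(caption, others, page_right=100))
```

The second obstacle is ignored because it sits below the box, so only the first one caps the expansion.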

Additionally, I'm researching center alignment, justified alignment, and the Knuth-Plass (KP) line-breaking algorithm...
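For context, the core idea behind Knuth-Plass is to choose all breaks of a paragraph together instead of greedily line by line. The sketch below is a much simplified variant, not the full box/glue/penalty model: it minimizes the sum of squared trailing slack over every line except the last, with widths counted in characters. `break_lines` is a hypothetical name.

```python
def break_lines(words, max_width):
    """Pick line breaks minimizing squared slack on all lines but the last."""
    if any(len(w) > max_width for w in words):
        raise ValueError("a word is longer than max_width")
    n = len(words)
    best = [float("inf")] * (n + 1)  # best[i]: minimal cost to set words[i:]
    nxt = [n] * (n + 1)              # nxt[i]: end index of the line starting at i
    best[n] = 0.0
    for i in range(n - 1, -1, -1):
        width = -1  # cancels the space counted before the first word
        for j in range(i, n):
            width += 1 + len(words[j])
            if width > max_width:
                break
            cost = 0.0 if j == n - 1 else float((max_width - width) ** 2)
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                nxt[i] = j + 1
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:nxt[i]]))
        i = nxt[i]
    return lines


if __name__ == "__main__":
    # A greedy breaker would produce "aaa bb" / "cc" / "ddddd" here.
    print(break_lines("aaa bb cc ddddd".split(), 6))
```

The example input is one where the globally chosen breaks differ from the greedy ones: a slightly shorter first line avoids a very ragged second line.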

PS: Currently only considering left-to-right text layout.

@awwaawwa
Member

awwaawwa commented Feb 12, 2025

The examples of paragraphs in non-rectangular available space that you provided are a valuable reference. This can be achieved by running OCR layout analysis, then parsing the precise positions of the PDF characters to obtain the exact available space, and finally laying out text across multiple available spaces.
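A non-rectangular region can be approximated as a stack of per-line rectangles derived from character positions. The sketch below assumes the character boxes have already been extracted as `(x0, y0, x1, y1)` tuples inside one detected layout region; `line_rects` and `y_tolerance` are hypothetical names for this illustration.

```python
def line_rects(char_boxes, y_tolerance=2.0):
    """Group character boxes (x0, y0, x1, y1) into lines.

    Characters whose y0 lies within y_tolerance of the current row's y0 are
    merged into that row. Returns one bounding rectangle per line, in order
    of increasing y0.
    """
    rows = []  # each row is a mutable [x0, y0, x1, y1]
    for x0, y0, x1, y1 in sorted(char_boxes, key=lambda b: (b[1], b[0])):
        if rows and abs(rows[-1][1] - y0) <= y_tolerance:
            row = rows[-1]
            row[0] = min(row[0], x0)
            row[2] = max(row[2], x1)
            row[3] = max(row[3], y1)
        else:
            rows.append([x0, y0, x1, y1])
    return [tuple(row) for row in rows]


if __name__ == "__main__":
    chars = [(0, 0, 5, 10), (5, 0, 10, 10),       # a short first line
             (20, 12, 25, 22), (25, 12, 30, 22)]  # an indented second line
    print(line_rects(chars))
```

Each resulting rectangle could then serve as one available space for a multi-space layout pass.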

@awwaawwa awwaawwa self-assigned this Feb 12, 2025
@awwaawwa awwaawwa changed the title from "feat: Imagine a new text overlay method" to "feat(layout&typesetting): Precise layout recognition & multi-available space typesetting" Feb 12, 2025
@awwaawwa
Member

I estimate this improvement will begin after the layout engine refactoring.

@xxnuo
Contributor Author

xxnuo commented Feb 12, 2025

I’ve tried the surya OCR model from https://github.com/VikParuchuri/surya/tree/master before. It is optimized for table and text-line recognition and works pretty well, but it requires a professional GPU and is extremely slow.

Paddle OCR is slightly less accurate but faster. An optimized ONNX OCR model repository is available at https://github.com/jingsongliujing/OnnxOCR.

But I'm not sure how much of an improvement it would be over the current model.

For reference only.

@xxnuo
Contributor Author

xxnuo commented Feb 21, 2025

> I estimate this improvement will begin after the layout engine refactoring.

https://github.com/PaddlePaddle/PaddleX/blob/release/3.0-rc/docs/module_usage/tutorials/ocr_modules/layout_detection.md

@awwaawwa
Member

> > I estimate this improvement will begin after the layout engine refactoring.
>
> https://github.com/PaddlePaddle/PaddleX/blob/release/3.0-rc/docs/module_usage/tutorials/ocr_modules/layout_detection.md

@xxnuo Thank you for your suggestion, I'll ask my colleagues to look into it.
