feat(layout&typesetting): Precise layout recognition & multi-available space typesetting #89
Comments
I'm considering optimizing the OCR model directly to improve accuracy and solve this problem. The short-line split factor is just a workaround; it will be removed later.
I want to handle tables of contents, authors, citations, cross-column and cross-page paragraphs, code, algorithms, and similar regions through OCR methods, rather than through rule-based processing of information extracted directly from the PDF.
For paragraphs spanning columns and pages, the layout engine needs to support paragraphs that flow across multiple available spaces; this will be resolved later by refactoring the layout engine. Text that becomes too small after dynamic scaling may be caused by glyphs that are taller in the translation than in the original text. It could also be caused by excessive horizontal expansion (such as "图1" being translated to "Figure 1"). I'm working on solutions for such cases, like appropriately expanding the available space. Additionally, I'm researching center alignment, justified alignment, and the KP line-breaking algorithm. PS: Currently only left-to-right text layout is being considered.
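To make the scaling problem concrete, here is a minimal sketch of shrink-to-fit, not the project's actual implementation. The function names, the fixed per-character width ratio `char_w`, and the step size are all assumptions for illustration; a real engine would measure glyphs from font metrics.

```python
def wrap_greedy(words, font_size, box_width, char_w=0.5):
    """Greedy line wrap. Word width is estimated as len(word) * char_w * font_size
    (an assumption; real layout would use font metrics)."""
    lines, current, current_w = [], [], 0.0
    space_w = char_w * font_size
    for word in words:
        w = len(word) * char_w * font_size
        extra = w if not current else space_w + w
        if current and current_w + extra > box_width:
            # Word does not fit on the current line: start a new one.
            lines.append(" ".join(current))
            current, current_w = [word], w
        else:
            current.append(word)
            current_w += extra
    if current:
        lines.append(" ".join(current))
    return lines


def fit_font_size(text, box_width, box_height, max_size,
                  min_size=4.0, line_height=1.2, step=0.5, char_w=0.5):
    """Return (font_size, lines) for the largest size that fits the box,
    or None if even min_size overflows."""
    words = text.split()
    size = max_size
    while size >= min_size:
        lines = wrap_greedy(words, size, box_width, char_w)
        widest = max((len(line) * char_w * size for line in lines), default=0.0)
        fits_height = len(lines) * line_height * size <= box_height
        if fits_height and widest <= box_width:
            return size, lines
        size -= step
    return None


# The box was sized for the short original; the translation forces a smaller font.
print(fit_font_size("图1", 25, 12.5, 10))
print(fit_font_size("Figure 1", 25, 12.5, 10))
```

Under these assumptions the translated label ends up noticeably smaller than the original, which is why expanding the available space is an alternative to pure scaling.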
The paragraphs with non-rectangular available space that you provided are indeed a valuable reference. This can be achieved by running OCR layout analysis, then parsing the precise positions of the PDF characters to obtain the exact available space, and finally laying out the text with the multiple-available-space feature.
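As a rough illustration of what laying out one paragraph across multiple available spaces could look like, the sketch below flows words through an ordered list of rectangles (for example, the end of one column followed by the top of the next). The `Box` type, `flow_text`, and the fixed character-width estimate are hypothetical names and simplifications, not the project's API.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """One rectangular available space, reduced to its size for this sketch."""
    width: float
    height: float


def flow_text(words, boxes, font_size, char_w=0.5, line_height=1.2):
    """Fill each box in order with greedily wrapped lines.

    Returns (placed, leftover): placed is a list of (box_index, line_text),
    leftover is the words that did not fit in any box.
    A single word wider than its box is still placed and will overflow.
    """
    placed = []
    i = 0
    leading = line_height * font_size
    for box_index, box in enumerate(boxes):
        max_lines = int(box.height // leading)
        for _ in range(max_lines):
            if i >= len(words):
                break
            line = [words[i]]
            width = len(words[i]) * char_w * font_size
            i += 1
            while i < len(words):
                # One space plus the next word.
                extra = (1 + len(words[i])) * char_w * font_size
                if width + extra > box.width:
                    break
                line.append(words[i])
                width += extra
                i += 1
            placed.append((box_index, " ".join(line)))
    return placed, words[i:]


boxes = [Box(width=80, height=15), Box(width=80, height=30)]
words = "a paragraph spanning two columns".split()
placed, leftover = flow_text(words, boxes, font_size=10)
print(placed, leftover)
```

A non-rectangular region could be approximated the same way by slicing it into a stack of narrow rectangles, one per line, each with its own width.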
I estimate this improvement will begin after the layout engine refactoring.
I’ve tried the surya OCR model from https://github.com/VikParuchuri/surya/tree/master before. It has been optimized for table and text line recognition and works pretty well, but it requires a professional GPU and is extremely slow. Paddle OCR is slightly less accurate but faster. There is also an optimized ONNX OCR model repository, https://github.com/jingsongliujing/OnnxOCR, but I'm not sure how much of an improvement it is over the current model. For reference only.
@xxnuo Thank you for your suggestion. I'll ask my colleagues to look into it.
Is your feature request related to a problem? Please describe.
The current text coverage and paragraph detection are inaccurate. Different documents need different short-line-split-factor values to get good translation results, so no single setting works universally, and line breaks often go wrong.
Describe the solution you'd like
Like the PDF translation implementation of https://doclingo.cn:
Based on the position of the original text, scale the translated text so that it fits completely, keeping the layout unchanged.
If you run into problems, I can help. 😊
Additional context
sample-complex-document.updated-zh-CN_doclingo.ai.pdf