fix: Return error for invalid PDFs #277

Merged
merged 8 commits into from
Jun 18, 2025

mpolomdeepsense
Contributor

  • Adds a function that checks PDF files before the request is sent and raises an appropriate error if the file is invalid.
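
For illustration only, a minimal sketch of what such a pre-request check could look like. This is not the code merged in this PR: `check_pdf` and `PDFValidationError` are hypothetical names, and the reader is assumed to be any object exposing `root_object` and `pages`, as in the snippet reviewed in this thread.

```python
class PDFValidationError(Exception):
    """Hypothetical error raised when a PDF fails the pre-request check."""


def check_pdf(reader):
    """Return `reader` unchanged if it parses, else raise PDFValidationError.

    `reader` is assumed to be any object exposing `root_object` and `pages`.
    """
    try:
        # Accessing the root object raises if the document catalog is unreadable.
        reader.root_object  # pylint: disable=pointless-statement
        # Materializing the page list raises if any page entry is corrupted.
        list(reader.pages)
    except Exception as exc:  # a real check would catch the parser's own error types
        raise PDFValidationError(
            f"File does not appear to be a valid PDF: {exc}"
        ) from exc
    return reader
```

Run before the request is built, a check like this makes an invalid file fail locally with a clear message instead of failing after upload.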

@mpolomdeepsense mpolomdeepsense marked this pull request as ready for review June 18, 2025 10:46
@mpolomdeepsense mpolomdeepsense requested review from awalker4 and pawel-kmiecik and removed request for awalker4 June 18, 2025 10:46
Contributor

@pawel-kmiecik pawel-kmiecik left a comment

Generally looks OK. I'm just a bit worried about the check's impact on performance (though I think we already do these operations in the split-pdf function anyway).

pdf.root_object # pylint: disable=pointless-statement

# This will raise if the file's pages are corrupted
list(pdf.pages)
Contributor

@pawel-kmiecik pawel-kmiecik Jun 18, 2025

Did we profile it for memory/time on large PDFs (~1k pages)?

Contributor Author

I did not, but maybe we could check only the first page? I think it would result in the same error. WDYT?

Contributor Author

Tested. The check does not slow down execution or affect memory usage, even with large files (tested on PDFs with 200, 1,000, and 10,000 pages).

Results for the 10k-page PDF:
mem_profiler_results.txt
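
For anyone who wants to repeat this kind of measurement, here is a minimal sketch using only the standard library. It is not the tool that produced the attached results: `profile_call` is a hypothetical helper, and `tracemalloc` only counts allocations made through Python's allocator, so memory used inside native extensions may not show up.

```python
import time
import tracemalloc


def profile_call(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds, peak_traced_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        # Capture timing and peak memory even if func raises.
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    # Stand-in workload; swap in the PDF check applied to a large file.
    _, seconds, peak_bytes = profile_call(lambda n: list(range(n)), 1_000_000)
    print(f"elapsed: {seconds:.3f} s, peak: {peak_bytes / 1e6:.1f} MB")
```

Comparing the numbers with and without the check on the same file gives a rough view of its overhead.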

@mpolomdeepsense mpolomdeepsense enabled auto-merge (squash) June 18, 2025 14:31
@mpolomdeepsense mpolomdeepsense disabled auto-merge June 18, 2025 14:31
@mpolomdeepsense mpolomdeepsense enabled auto-merge (squash) June 18, 2025 14:49
@mpolomdeepsense mpolomdeepsense merged commit a8e484c into main Jun 18, 2025
25 of 26 checks passed
@mpolomdeepsense mpolomdeepsense deleted the fix/return-422-for-failing-pdfs branch June 18, 2025 14:55
2 participants