fix: Return error for invalid PDFs #277
mpolomdeepsense
commented
Jun 18, 2025
- Adds function for checking PDF files before the request gets sent. Throws appropriate error message in case the file is invalid.
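A minimal sketch of what such a pre-request check could look like, for readers without the diff open. It assumes a reader-like object exposing `root_object` and `pages` (as in the snippet under review); `InvalidPDFError` and `check_pdf` are hypothetical names used only for illustration, not the names used in this PR.

```python
class InvalidPDFError(Exception):
    """Hypothetical error type; the PR's actual exception class is not shown here."""


def check_pdf(pdf):
    """Return `pdf` unchanged, or raise InvalidPDFError if it cannot be read.

    `pdf` is assumed to be any reader-like object with `root_object` and
    `pages` attributes.
    """
    try:
        # Accessing the document root fails when the file structure is broken.
        pdf.root_object
        # Materializing the page list fails when the pages are corrupted.
        list(pdf.pages)
    except Exception as exc:
        raise InvalidPDFError(f"File does not appear to be a valid PDF: {exc}") from exc
    return pdf
```

Raising one dedicated error type before the request is built lets the caller report the problem without a network round trip.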
Generally looks OK. I'm just a bit worried about the check's impact on performance (though I think we already do these operations in the split-pdf function anyway).
pdf.root_object  # pylint: disable=pointless-statement

# This will raise if the file's pages are corrupted
list(pdf.pages)
Did we profile it for memory/time on large PDFs (like ~1k pages)?
I did not, but maybe we could just check the first page? I think it will result in the same error. WDYT?
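For reference, the first-page variant could be sketched roughly as follows. `first_page_is_readable` is a hypothetical helper, and `pdf` is assumed to be a reader-like object whose `pages` supports indexing. Whether this surfaces the same error as the full check is not verified here; by construction it cannot notice corruption that only affects later pages.

```python
def first_page_is_readable(pdf):
    """Return True if the first page of a reader-like object can be accessed.

    Only `pdf.pages[0]` is touched, so damage confined to later pages
    goes undetected. An empty page list also counts as unreadable.
    """
    try:
        pdf.pages[0]
    except Exception:
        return False
    return True
```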
Tested. This does not slow down execution or affect memory usage, even with big files (tested on PDFs with 200, 1,000, and 10,000 pages).
Results for the 10k-page PDF:
mem_profiler_results.txt
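For anyone who wants to repeat a measurement like this, here is a small sketch using only the standard library (`time.perf_counter` and `tracemalloc`). This is not necessarily the tool that produced the attached results, and `profile_call` is a hypothetical helper name.

```python
import time
import tracemalloc


def profile_call(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds, peak_traced_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        # Collect measurements and stop tracing even if func raises.
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak
```

Wrapping the check as `profile_call(check, pdf)` once for each file size gives comparable time and peak-allocation figures. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory used inside native extensions is not counted.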