fix: Return error for invalid PDFs #277

mpolomdeepsense · 2025-06-18T08:31:09Z

Adds a function that checks PDF files before the request is sent, and raises an error with an appropriate message if the file is invalid.


pawel-kmiecik

Generally looks OK. I'm just a bit worried about the check's impact on performance (though I think we already do these operations in the split-pdf function anyway).

pawel-kmiecik · 2025-06-18T11:04:41Z

src/unstructured_client/_hooks/custom/pdf_utils.py

+        pdf.root_object  # pylint: disable=pointless-statement
+
+        # This will raise if the file's pages are corrupted
+        list(pdf.pages)
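
For readers without the full diff, here is a minimal sketch of how a check built around these two statements might be shaped. It is not the PR's actual function: `check_pdf_sketch` and `PDFValidationError` are hypothetical names, and `pdf` is only assumed to be a reader object exposing `root_object` and `pages`, as in the hunk above.

```python
class PDFValidationError(Exception):
    """Hypothetical error type; the exception class used in the PR is not shown in this hunk."""


def check_pdf_sketch(pdf):
    """Return `pdf` unchanged if its root object and pages can be read.

    Assumption: `pdf` exposes `root_object` and `pages`, like the reader
    object in the hunk above.
    """
    try:
        # Accessing the root object raises if the document structure is broken.
        pdf.root_object  # pylint: disable=pointless-statement

        # Materializing the page list raises if the file's pages are corrupted.
        list(pdf.pages)
    except Exception as exc:  # broad on purpose: the exact exception types are not shown here
        raise PDFValidationError(f"File does not appear to be a valid PDF: {exc}") from exc
    return pdf
```

Wrapping both accesses in one `try` means the caller sees a single error type before any request is sent, whichever of the two accesses fails.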


Did we profile it for memory/time on large PDFs (e.g. ~1k pages)?

I did not, but maybe we could check only the first page? I think it would result in the same error. WDYT?

Tested. This does not slow down execution and doesn't affect memory usage, even with big files (tested on 200-, 1,000- and 10,000-page PDFs).

Results for the 10k-page PDF:
mem_profiler_results.txt
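
The attached results are not reproduced here. As a rough way to repeat a measurement like this, the sketch below times a call and records its peak Python-level memory using only the standard library. `profile_call` is a hypothetical helper, not part of this PR, and not necessarily the tool used for the attached results.

```python
import time
import tracemalloc


def profile_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds, peak_bytes).

    Note: tracemalloc only sees allocations made through Python's allocator,
    so memory used inside native extensions may not be counted.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    # Stand-in workload; replace with the PDF check applied to a large file.
    _, seconds, peak_bytes = profile_call(lambda n: list(range(n)), 1_000_000)
    print(f"elapsed: {seconds:.3f}s, peak: {peak_bytes / 1e6:.1f} MB")
```

Running the same helper once with the check and once without it gives a simple before/after comparison for a given file size.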

mpolomdeepsense added 7 commits June 18, 2025 10:23

fix: return appropriate error for failing pdf, before sending the req…

4de9a67

…uest

test: pdf check tests before request

d9c3b4a

fix: add new line at the end of a file

d094ac9

docs: fix check_pdf function docs

bd7f658

fix: pylint errors and warnings

b861538

chore: update changelog

4689fc1

test: before_request split pdf hook unit tests

2a15f55

mpolomdeepsense marked this pull request as ready for review June 18, 2025 10:46

mpolomdeepsense requested review from awalker4 and pawel-kmiecik and removed request for awalker4 June 18, 2025 10:46

pawel-kmiecik approved these changes Jun 18, 2025


Merge branch 'main' into fix/return-422-for-failing-pdfs

2a39907

mpolomdeepsense enabled auto-merge (squash) June 18, 2025 14:31

mpolomdeepsense disabled auto-merge June 18, 2025 14:31

mpolomdeepsense enabled auto-merge (squash) June 18, 2025 14:49

mpolomdeepsense merged commit a8e484c into main Jun 18, 2025
25 of 26 checks passed

mpolomdeepsense deleted the fix/return-422-for-failing-pdfs branch June 18, 2025 14:55
