Unable to load file #3097

vlavorini · 2024-05-24T11:12:27Z

Maybe related to this. When partition_pdf is given an open binary file instead of a filename, an error is thrown.

Example:

from unstructured.partition.pdf import partition_pdf

with open("./that.pdf", 'rb') as f:
    elements = partition_pdf(
        file=f,
        strategy='hi_res',
        is_image=False,
        include_page_breaks=True,
        analysis=True,
        infer_table_structure=True,
    )

Error:

PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'C:\Users\LavoriV\AppData\Local\Temp\tmp_fqqu798': No error.


MthwRobinson · 2024-05-24T14:14:25Z

Hi @vlavorini - could you let us know which versions and operating system you're using, and share an example file we could use to reproduce?
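One way to collect the requested details is sketched below. It is only an illustration: the helper name environment_report and the default package list are assumptions, not part of unstructured.

```python
import platform
from importlib import metadata


def environment_report(packages=("unstructured", "unstructured-inference", "pdf2image")):
    """Return a short text report with OS, Python and package versions.

    `environment_report` is a hypothetical helper for this thread; the
    package names in the default tuple are an assumed, editable list.
    """
    lines = [
        f"OS: {platform.platform()}",
        f"Python: {platform.python_version()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            version = "not installed"
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


print(environment_report())
```

The printed text can be pasted into the issue as-is.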

vlavorini · 2024-05-28T11:24:22Z

I'm on Windows, with Unstructured version 0.14.2. Here is the file I use:
population in EU.pdf

MthwRobinson · 2024-05-28T12:00:44Z

Thanks, @vlavorini !

andy1213aa · 2024-07-03T14:26:17Z

I also encountered this issue.
Is there any progress or update on this matter?

scanny · 2024-07-03T18:11:23Z

@andy1213aa can you post the code you used and also mention whether you are on Windows?

andy1213aa · 2024-07-03T18:21:17Z

I used the same code provided above by @vlavorini, and yes, I am also running it on Windows.
Is there currently a way to solve this problem on Windows? Thank you!

simonschoe · 2024-08-19T08:49:28Z

@MthwRobinson do you have any updates on the matter?

scanny · 2024-08-19T17:51:33Z

@simonschoe can you post a stack trace?

simonschoe · 2024-08-19T18:28:29Z

import base64
import io

from unstructured.partition.pdf import partition_pdf

with open("testfile_with_images.pdf", 'rb') as f:
    base64str = base64.b64encode(f.read()).decode('utf-8')

file_bytes = base64.b64decode(base64str)
file_bytes = io.BytesIO(file_bytes)

doc_elements = partition_pdf(
    file=file_bytes,
    #filename="testfile_with_images.pdf",
    languages=['deu'],
    strategy="hi_res", 
    hi_res_model_name="yolox",
)

---------------------------------------------------------------------------
[shortened]

File ~\pdf2image\pdf2image.py:127, in convert_from_path(pdf_path, dpi, output_folder, first_page, last_page, fmt, jpegopt, thread_count, userpw, ownerpw, use_cropbox, strict, transparent, single_file, output_file, poppler_path, grayscale, size, paths_only, use_pdftocairo, timeout, hide_annotations)
    124 if isinstance(poppler_path, PurePath):
    125     poppler_path = poppler_path.as_posix()
--> 127 page_count = pdfinfo_from_path(
    128     pdf_path, userpw, ownerpw, poppler_path=poppler_path
    129 )["Pages"]
    131 # We start by getting the output format, the buffer processing function and if we need pdftocairo
    132 parsed_fmt, final_extension, parse_buffer_func, use_pdfcairo_format = _parse_format(
    133     fmt, grayscale
    134 )

File ~\pdf2image\pdf2image.py:611, in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout, first_page, last_page)
    607     raise PDFInfoNotInstalledError(
    608         "Unable to get page count. Is poppler installed and in PATH?"
    609     )
    610 except ValueError:
--> 611     raise PDFPageCountError(
    612         f"Unable to get page count.\n{err.decode('utf8', 'ignore')}"
    613     )

PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'C:\Users\...\Temp\tmpzorb9m7z': No error.

The issue does not occur if I load the file using the filename arg though.
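Since passing a filename works, a possible stopgap on a stable release is to copy the in-memory bytes to a temporary path and hand that path to the partitioner. The sketch below is an assumption-laden illustration, not the library's fix: partition_via_temp_path is a hypothetical helper, and the partitioner is passed in as a callable so the helper does not depend on unstructured itself.

```python
import os
import tempfile
from typing import Any, BinaryIO, Callable


def partition_via_temp_path(
    file: BinaryIO,
    partition: Callable[..., Any],
    suffix: str = ".pdf",
    **kwargs: Any,
) -> Any:
    """Copy `file` to a temporary path and call `partition(filename=...)`.

    Hypothetical helper: `partition` is any callable that accepts a
    `filename` keyword, for example `partition_pdf`.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = os.path.join(tmp_dir, "document" + suffix)
        with open(tmp_path, "wb") as tmp:
            tmp.write(file.read())
        # The handle is closed at this point, so the path can be opened
        # again by other code while the directory still exists.
        return partition(filename=tmp_path, **kwargs)


# Example call (requires unstructured to be installed):
#   from unstructured.partition.pdf import partition_pdf
#   elements = partition_via_temp_path(file_bytes, partition_pdf, strategy="hi_res")
```

The temporary directory is removed when the call returns, so the result must not rely on the file still being on disk afterwards.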

scanny · 2024-08-19T19:10:55Z

@simonschoe thanks for this :)

Okay, this looks like a bug that has been fixed on main but not released yet.
Unstructured-IO/unstructured-inference@7804e0d

Can you try installing unstructured-inference from the main branch on GitHub? I think that's going to solve the problem. Something like this IIRC:

$ pip install -U "unstructured-inference @ git+https://github.com/Unstructured-IO/unstructured-inference"

I'll see about moving along a release.

simonschoe · 2024-08-21T18:17:51Z

Thanks for the feedback! Unfortunately, I have to stick to a stable release version. I will look out for the upcoming unstructured-inference release.

HuangBugWei · 2024-09-10T09:06:59Z

@simonschoe
I've faced the issue due to some dependencies not being installed. The unstructured version I used is 0.15.9.

sudo apt-get install poppler-utils # recommended by https://stackoverflow.com/questions/53481088/poppler-in-path-for-pdf2image
sudo apt install tesseract-ocr # recommended by https://tesseract-ocr.github.io/tessdoc/Installation.html
sudo apt install libtesseract-dev # recommended by https://tesseract-ocr.github.io/tessdoc/Installation.html
pip install tesseract # recommended by https://stackoverflow.com/a/52231794
pip install tesseract-ocr # recommended by https://stackoverflow.com/a/52231794

I also found the command sudo apt-get install libleptonica-dev tesseract-ocr tesseract-ocr-dev libtesseract-dev python3-pil tesseract-ocr-eng tesseract-ocr-script-latn, but I think it is outdated, since it fails with the error E: Package 'tesseract-ocr-dev' has no installation candidate. If you still get PDFPageCountError: Unable to get page count. after running the commands above, you can try that older command as well.
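To check whether missing system dependencies are the cause, it can help to see which command-line tools are visible on PATH before reinstalling anything. The snippet below is a minimal sketch: missing_tools is a hypothetical helper, and the default tool list (pdfinfo and pdftoppm from poppler, plus tesseract) is an assumption about what a hi_res PDF run needs.

```python
import shutil


def missing_tools(tools=("pdfinfo", "pdftoppm", "tesseract")):
    """Return the names from `tools` that cannot be found on PATH.

    Hypothetical helper; the default tuple is an assumed list of the
    executables used for PDF rendering and OCR.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All listed tools were found on PATH.")
```

If every tool is found and the error persists, the dependency explanation is less likely to apply.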

scanny · 2024-12-16T21:34:27Z

Fixed by #3395.

vlavorini added the bug label May 24, 2024

MthwRobinson added the awaiting-response label May 24, 2024

MthwRobinson added needs follow up and removed awaiting-response labels May 28, 2024

scanny added the pdf label Jul 3, 2024

scanny closed this as completed Dec 16, 2024
