Unable to load file #3097

Closed
vlavorini opened this issue May 24, 2024 · 13 comments
Labels
bug, pdf

Comments

@vlavorini

Maybe related to this. When partition_pdf is called with a file object opened in binary mode, an error is thrown.

Example:

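from unstructured.partition.pdf import partition_pdf

# Pass the open binary file object instead of a filename.
with open("./that.pdf", "rb") as f:
    elements = partition_pdf(
        file=f,
        strategy="hi_res",
        is_image=False,
        include_page_breaks=True,
        analysis=True,
        infer_table_structure=True,
    )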

Error:

PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'C:\Users\LavoriV\AppData\Local\Temp\tmp_fqqu798': No error.
@vlavorini added the bug label May 24, 2024
@MthwRobinson
Contributor

Hi @vlavorini - could you let us know which versions and operating system you're using, and share an example file we could use to reproduce?

@vlavorini
Author

I'm on Windows, with Unstructured version 0.14.2. Here is the file I use:
population in EU.pdf

@MthwRobinson
Contributor

Thanks, @vlavorini !

@andy1213aa

I also encountered this issue.
Is there any progress or update on this matter?

@scanny added the pdf label Jul 3, 2024
@scanny
Contributor

scanny commented Jul 3, 2024

@andy1213aa can you post the code you used and also mention whether you are on Windows?

@andy1213aa

I used the same code provided above by @vlavorini, and yes, I am also running it on Windows.
Is there currently a way to solve this problem on Windows? Thank you!

@simonschoe

@MthwRobinson do you have any updates on the matter?

@scanny
Contributor

scanny commented Aug 19, 2024

@simonschoe can you post a stack trace?

@simonschoe

simonschoe commented Aug 19, 2024

import base64
import io

from unstructured.partition.pdf import partition_pdf

# Round-trip the PDF through base64 to simulate receiving it as an encoded payload.
with open("testfile_with_images.pdf", "rb") as f:
    base64str = base64.b64encode(f.read()).decode("utf-8")

# Decode back to raw bytes and wrap them in an in-memory binary stream.
file_bytes = base64.b64decode(base64str)
file_bytes = io.BytesIO(file_bytes)

doc_elements = partition_pdf(
    file=file_bytes,
    # filename="testfile_with_images.pdf",
    languages=["deu"],
    strategy="hi_res",
    hi_res_model_name="yolox",
)
---------------------------------------------------------------------------
[shortened]

File ~\pdf2image\pdf2image.py:127, in convert_from_path(pdf_path, dpi, output_folder, first_page, last_page, fmt, jpegopt, thread_count, userpw, ownerpw, use_cropbox, strict, transparent, single_file, output_file, poppler_path, grayscale, size, paths_only, use_pdftocairo, timeout, hide_annotations)
    124 if isinstance(poppler_path, PurePath):
    125     poppler_path = poppler_path.as_posix()
--> 127 page_count = pdfinfo_from_path(
    128     pdf_path, userpw, ownerpw, poppler_path=poppler_path
    129 )["Pages"]
    131 # We start by getting the output format, the buffer processing function and if we need pdftocairo
    132 parsed_fmt, final_extension, parse_buffer_func, use_pdfcairo_format = _parse_format(
    133     fmt, grayscale
    134 )

File ~\pdf2image\pdf2image.py:611, in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout, first_page, last_page)
    607     raise PDFInfoNotInstalledError(
    608         "Unable to get page count. Is poppler installed and in PATH?"
    609     )
    610 except ValueError:
--> 611     raise PDFPageCountError(
    612         f"Unable to get page count.\n{err.decode('utf8', 'ignore')}"
    613     )

PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'C:\Users\...\Temp\tmpzorb9m7z': No error.

The issue does not occur if I load the file using the filename arg though.
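Since the filename argument works, one possible workaround for in-memory data is to write the bytes to a closed temporary file and pass its path. The sketch below is an assumption, not something confirmed in this thread: as_named_file is a hypothetical helper, and the idea that the failure is caused by a temp file that is still open (which Windows does not allow a second process to reopen) is only a plausible guess.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def as_named_file(stream, suffix=".pdf"):
    """Copy a binary stream to a closed temp file and yield its path.

    Hypothetical helper; it is not part of unstructured or pdf2image.
    """
    # delete=False keeps the file on disk after close(). On Windows, a file
    # created by NamedTemporaryFile cannot be reopened while it is still open.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(stream.read())
        tmp.close()  # close before handing the path to other code
        yield tmp.name
    finally:
        tmp.close()  # harmless if already closed
        os.remove(tmp.name)


# Intended usage (requires unstructured; shown as comments so the sketch
# stays runnable without it):
#
# with as_named_file(io.BytesIO(pdf_bytes)) as path:
#     doc_elements = partition_pdf(filename=path, strategy="hi_res")
```

The temp file is removed when the with block exits, so the partition call has to happen inside it.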

@scanny
Contributor

scanny commented Aug 19, 2024

@simonschoe thanks for this :)

Okay, this looks like a bug that has been fixed on main but not released yet.
Unstructured-IO/unstructured-inference@7804e0d

Can you try installing unstructured-inference from the main branch on GitHub? I think that's going to solve the problem. Something like this IIRC:

$ pip install -U "unstructured-inference @ git+https://github.com/Unstructured-IO/unstructured-inference"

I'll see about moving along a release.

@simonschoe

Thanks for the feedback! Unfortunately, I have to stick with a stable release version. I will look out for the upcoming unstructured-inference release.
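To tell whether a new release has been picked up, the installed versions can be checked from Python. This is a minimal sketch using the standard library; installed_version is a hypothetical helper name, and the thread does not say which release number contains the fix, so the output has to be compared against the release notes.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Packages mentioned in this thread.
for name in ("unstructured", "unstructured-inference", "pdf2image"):
    print(name, installed_version(name) or "not installed")
```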

@HuangBugWei

@simonschoe
I've faced the issue due to some dependencies not being installed. The unstructured version I used is 0.15.9.

sudo apt-get install poppler-utils # recommended by https://stackoverflow.com/questions/53481088/poppler-in-path-for-pdf2image
sudo apt install tesseract-ocr # recommended by https://tesseract-ocr.github.io/tessdoc/Installation.html
sudo apt install libtesseract-dev # recommended by https://tesseract-ocr.github.io/tessdoc/Installation.html
pip install tesseract # recommended by https://stackoverflow.com/a/52231794
pip install tesseract-ocr # recommended by https://stackoverflow.com/a/52231794

I've also found the command sudo apt-get install libleptonica-dev tesseract-ocr tesseract-ocr-dev libtesseract-dev python3-pil tesseract-ocr-eng tesseract-ocr-script-latn, but I think it is outdated, since it fails with the error E: Package 'tesseract-ocr-dev' has no installation candidate. You can still try it if you keep getting the PDFPageCountError: Unable to get page count. error after running the commands above.
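Before reinstalling anything, it may help to check whether the external binaries are actually visible on PATH. The sketch below assumes the relevant tools are pdfinfo and pdftoppm (the poppler programs pdf2image calls) and tesseract; missing_tools is a hypothetical helper, not part of any of these libraries.

```python
import shutil

# pdf2image shells out to poppler's pdfinfo and pdftoppm; tesseract is the OCR binary.
REQUIRED_TOOLS = ("pdfinfo", "pdftoppm", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If everything is found and the error still appears, missing dependencies are probably not the cause in that environment.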

@scanny
Contributor

scanny commented Dec 16, 2024

Fixed by #3395.

@scanny closed this as completed Dec 16, 2024