Unable to load file #3097
Comments
Hi @vlavorini - could you let us know what versions and operating system you're using, and share an example file we could use to reproduce?
I'm on Windows, with Unstructured version 0.14.2. Here is the file I use
Thanks, @vlavorini!
I also encountered this issue.
@andy1213aa can you post the code you used and also mention whether you are on Windows?
I used the same code provided above by @vlavorini, and yes, I am also running it on Windows.
@MthwRobinson do you have any updates on the matter?
@simonschoe can you post a stack trace?
import base64
import io

from unstructured.partition.pdf import partition_pdf

# Round-trip the PDF through base64 to mimic receiving it as an encoded payload.
with open("testfile_with_images.pdf", 'rb') as f:
    base64str = base64.b64encode(f.read()).decode('utf-8')

file_bytes = base64.b64decode(base64str)
file_bytes = io.BytesIO(file_bytes)

# Passing the in-memory buffer via `file=` is what triggers the error.
doc_elements = partition_pdf(
    file=file_bytes,
    # filename="testfile_with_images.pdf",
    languages=['deu'],
    strategy="hi_res",
    hi_res_model_name="yolox",
)
The issue does not occur if I load the file using the |
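Until a fixed release is available, one possible workaround is to write the in-memory bytes to a temporary file and pass its path instead of the buffer. The sketch below assumes the sentence above refers to the `filename` parameter that is commented out in the snippet. `bytes_as_temp_pdf` and `partition_from_bytes` are hypothetical helper names, not part of unstructured, and the partition function is passed in as an argument so the helper itself does not depend on the library.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def bytes_as_temp_pdf(data: bytes):
    """Hypothetical helper: write PDF bytes to a temp file and yield its path.

    mkstemp is used instead of NamedTemporaryFile so the file handle is
    closed before another library opens the path, which matters on Windows.
    """
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        yield path
    finally:
        # Clean up the temporary file once partitioning is done.
        os.remove(path)


def partition_from_bytes(data: bytes, partition_fn, **kwargs):
    """Hypothetical wrapper: call partition_fn(filename=<temp path>, **kwargs)."""
    with bytes_as_temp_pdf(data) as path:
        return partition_fn(filename=path, **kwargs)
```

With the snippet above, this could be called as `partition_from_bytes(file_bytes.getvalue(), partition_pdf, languages=['deu'], strategy="hi_res", hi_res_model_name="yolox")`. Whether this avoids the error on a given release has not been verified here.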
@simonschoe thanks for this :) Okay, this looks like a bug that has already been fixed. Can you try installing from the repository?
$ pip install -U "unstructured-ingest @ git+https://github.com/Unstructured-IO/unstructured-ingest"
I'll see about moving along a release.
Thanks for the feedback! Unfortunately, I have to resort to a stable release version. I will look out for the upcoming
@simonschoe
sudo apt-get install poppler-utils # recommended by https://stackoverflow.com/questions/53481088/poppler-in-path-for-pdf2image
sudo apt install tesseract-ocr # recommended by https://tesseract-ocr.github.io/tessdoc/Installation.html
sudo apt install libtesseract-dev # recommended by https://tesseract-ocr.github.io/tessdoc/Installation.html
pip install tesseract # recommended by https://stackoverflow.com/a/52231794
pip install tesseract-ocr # recommended by https://stackoverflow.com/a/52231794
I've also found the command
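To check whether the system packages listed above are actually visible to Python, a quick PATH lookup can help. This is a minimal sketch. The executable names are assumptions based on what the packages normally ship (`pdftoppm` from poppler-utils, `tesseract` from tesseract-ocr), and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Assumed executable names: poppler-utils provides `pdftoppm`,
# tesseract-ocr provides `tesseract`.
REQUIRED_TOOLS = ("pdftoppm", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the given tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

An empty result only shows that the executables are discoverable. It does not confirm that the installed versions are compatible with the `hi_res` strategy.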
Fixed by #3395.
Maybe related to this. When using in the context of a binary file an error is thrown.
Example:
Error: