Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

while converting a URI/URL, requests needs timeout when server doesn't close the stream #1167

Open
ThePatelCode opened this issue Apr 5, 2025 · 4 comments

Comments

@ThePatelCode
Copy link

ThePatelCode commented Apr 5, 2025

        elif uri.startswith("http:") or uri.startswith("https:"):
            response = self._requests_session.get(uri, stream=True)
            response.raise_for_status()
            return self.convert_response(
                response,
                stream_info=stream_info,
                file_extension=file_extension,
                url=mock_url,
                **kwargs,
            )
        else:
            raise ValueError(
                f"Unsupported URI scheme: {uri.split(':')[0]}. Supported schemes are: file:, data:, http:, https:"
            )

I have encountered issues when server doesn't close the steam and it hangs forever trying to buffer the stream

How do reproduce:

from markitdown import MarkItDown

client = OpenAI()
md = MarkItDown()
result = md.convert("https://alletting.dot.state.al.us/")
markdown = result.text_content

I propose adding a timeout value. I have tested this locally, and if you give me direction, i.e how do we pass the timeout value to this function, I can make the change myself.

I propose modifying stream_info blob to add timeout.

@afourney
Copy link
Member

afourney commented Apr 7, 2025

Yes, this is a good point and an easy fix. Let me pencil this in for the next update.

@ThePatelCode
Copy link
Author

@afourney , if you give me sense of direction which you are think, I am happy to contribute as well

@afourney
Copy link
Member

afourney commented Apr 7, 2025

@afourney , if you give me sense of direction which you are think, I am happy to contribute as well

I'm thinking of something in this area: https://github.com/microsoft/markitdown/blob/3fcd48cdfc651cbf508071c8d2fb7d82aeb075de/packages/markitdown/src/markitdown/_markitdown.py#L439C1-L465C38

Either when calling get or reading from the stream. Let's start with a good/sane default, then we can parameterize it.

@sonnylaskar
Copy link

I was about to report the same issue. This seems to be an easy fix. Looking forward to the next build. Thanks to the team.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants