Removing Real-time JavaScript in HTML Conversion Causes Loss of Dynamic Data #1169

Open
lazur07 opened this issue Apr 7, 2025 · 0 comments

lazur07 commented Apr 7, 2025

Description:
MarkItDown's HTML converter, which wraps Markdownify, removes all script and style blocks before conversion, including scripts responsible for loading real-time data. As a result, content that depends on real-time data (e.g., live feeds, dynamic updates) no longer appears in the converted output. The removal happens in the converter's convert method:

def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> DocumentConverterResult:
    encoding = "utf-8" if stream_info.charset is None else stream_info.charset
    soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)

    # Remove javascript and style blocks
    for script in soup(["script", "style"]):
        script.extract()

Steps to Reproduce:

  1. Convert an HTML document containing JavaScript that loads real-time data (e.g., live updates or dynamic content).
  2. Observe that the script blocks, and the real-time data they load, are missing from the converted output.

Expected Behavior:
The HTML converter should:

  • Preserve JavaScript responsible for fetching or displaying real-time data.
  • Provide a configuration option to selectively retain or remove JavaScript, so that dynamic content is not stripped unconditionally.
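
To make the requested option concrete, here is a minimal sketch of a configurable stripping step. It is not MarkItDown's code: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without dependencies, and the names ConfigurableStripper, strip_html, and strip_tags are hypothetical. Note that retaining a script block only keeps its source in the document; nothing in this step executes it.

```python
from html.parser import HTMLParser


class ConfigurableStripper(HTMLParser):
    """Rebuilds HTML while dropping the tags listed in strip_tags.

    Hypothetical helper for illustration; comments and doctype
    declarations are dropped for brevity.
    """

    def __init__(self, strip_tags=("script", "style")):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.strip_tags = {t.lower() for t in strip_tags}
        self.depth = 0  # > 0 while inside a tag that is being dropped
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.strip_tags:
            self.depth += 1
        elif self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no body, so depth is unchanged.
        if tag not in self.strip_tags and self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.strip_tags:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append(f"&#{name};")


def strip_html(html, strip_tags=("script", "style")):
    """Return html with every tag in strip_tags (and its content) removed."""
    parser = ConfigurableStripper(strip_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = (
    '<p>Price: <span id="p">--</span></p>'
    '<script>fetch("/live")</script>'
    "<style>p{color:red}</style>"
)

default_output = strip_html(page)                       # current behavior
keep_scripts = strip_html(page, strip_tags=("style",))  # requested option
```

With the default setting both blocks are removed, matching the behavior described in this issue; passing strip_tags=("style",) keeps the script block in place for whatever downstream step wants it.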