Skip to content

Memory corruption (probably since 0.21.0) #253

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
tamasfe opened this issue May 16, 2025 · 23 comments
Open

Memory corruption (probably since 0.21.0) #253

tamasfe opened this issue May 16, 2025 · 23 comments

Comments

@tamasfe
Copy link

tamasfe commented May 16, 2025

We have a service at my company that uses scraper for extracting data from HTML that recently started crashing.
I've tracked down the issue that appears after bumping scraper from 0.19.0 to the latest version.

The crash happens once in a blue moon with the system allocator, however with Jemalloc on linux, it happens almost immediately.

I have extracted some of our logic that consistently reproduces the issue and uploaded it here: https://github.com/tamasfe/scraper-segfault .
Simply run cargo run and observe the segmentation fault message.

I did not yet dive into the causes and attempts to fix it as tracking it down took me several hours already, I hope the repro is minimal enough for people who are familiar with the project internals.

@adamreichold
Copy link
Member

Still reproduces using the latest code from master. Valgrind has

==512451== Use of uninitialised value of size 8
==512451==    at 0x5A64A4: selectors::matching::matches_simple_selector (matching.rs:1191)
==512451==    by 0x5A7525: selectors::matching::matches_compound_selector::{{closure}} (matching.rs:1172)
==512451==    by 0x5A73BB: any_value<&selectors::parser::Component<scraper::selector::Simple>, &mut selectors::parser::SelectorIter<scraper::selector::Simple>, selectors::kleene_value::{impl#1}::any_false::{closure_env#0}<&selectors::parser::Component<scraper::selector::Simple>, &mut selectors::parser::SelectorIter<scraper::selector::Simple>, selectors::matching::matches_compound_selector::{closure_env#0}<scraper::element_ref::ElementRef>>, selectors::matching::matches_compound_selector::{closure_env#0}<scraper::element_ref::ElementRef>> (kleene_value.rs:65)
==512451==    by 0x5A73BB: any_false<&selectors::parser::Component<scraper::selector::Simple>, &mut selectors::parser::SelectorIter<scraper::selector::Simple>, selectors::matching::matches_compound_selector::{closure_env#0}<scraper::element_ref::ElementRef>> (kleene_value.rs:52)
==512451==    by 0x5A73BB: selectors::matching::matches_compound_selector (matching.rs:1170)
==512451==    by 0x5A9A8A: selectors::matching::matches_complex_selector_internal (matching.rs:825)
==512451==    by 0x5A411C: matches_complex_selector<scraper::element_ref::ElementRef> (matching.rs:445)
==512451==    by 0x5A411C: matches_selector_kleene<scraper::element_ref::ElementRef> (matching.rs:295)
==512451==    by 0x5A411C: matches_selector<scraper::element_ref::ElementRef> (matching.rs:265)
==512451==    by 0x5A411C: scraper::selector::Selector::matches_with_scope_and_cache::{{closure}} (selector.rs:73)
==512451==    by 0x5ADF30: <core::slice::iter::Iter<T> as core::iter::traits::iterator::Iterator>::any (macros.rs:308)
==512451==    by 0x529F8E: scraper::selector::Selector::matches_with_scope_and_cache (selector.rs:70)
==512451==    by 0x5646D0: <scraper::html::Select as core::iter::traits::iterator::Iterator>::next (mod.rs:163)
==512451==    by 0x227EEB: scraper_broken::extract_redacted_from_html (main.rs:37)
==512451==    by 0x2278F5: scraper_broken::main (main.rs:15)
==512451==    by 0x22696A: core::ops::function::FnOnce::call_once (function.rs:250)
==512451==    by 0x229C0D: std::sys::backtrace::__rust_begin_short_backtrace (backtrace.rs:152)
==512451== 
==512451== Use of uninitialised value of size 8
==512451==    at 0x5A66C0: to_unconditional_case_sensitivity<scraper::element_ref::ElementRef> (matching.rs:1322)
==512451==    by 0x5A66C0: selectors::matching::matches_simple_selector (matching.rs:1213)
==512451==    by 0x5A7525: selectors::matching::matches_compound_selector::{{closure}} (matching.rs:1172)
==512451==    by 0x5A73BB: any_value<&selectors::parser::Component<scraper::selector::Simple>, &mut selectors::parser::SelectorIter<scraper::selector::Simple>, selectors::kleene_value::{impl#1}::any_false::{closure_env#0}<&selectors::parser::Component<scraper::selector::Simple>, &mut selectors::parser::SelectorIter<scraper::selector::Simple>, selectors::matching::matches_compound_selector::{closure_env#0}<scraper::element_ref::ElementRef>>, selectors::matching::matches_compound_selector::{closure_env#0}<scraper::element_ref::ElementRef>> (kleene_value.rs:65)
==512451==    by 0x5A73BB: any_false<&selectors::parser::Component<scraper::selector::Simple>, &mut selectors::parser::SelectorIter<scraper::selector::Simple>, selectors::matching::matches_compound_selector::{closure_env#0}<scraper::element_ref::ElementRef>> (kleene_value.rs:52)
==512451==    by 0x5A73BB: selectors::matching::matches_compound_selector (matching.rs:1170)
==512451==    by 0x5A9A8A: selectors::matching::matches_complex_selector_internal (matching.rs:825)
==512451==    by 0x5A411C: matches_complex_selector<scraper::element_ref::ElementRef> (matching.rs:445)
==512451==    by 0x5A411C: matches_selector_kleene<scraper::element_ref::ElementRef> (matching.rs:295)
==512451==    by 0x5A411C: matches_selector<scraper::element_ref::ElementRef> (matching.rs:265)
==512451==    by 0x5A411C: scraper::selector::Selector::matches_with_scope_and_cache::{{closure}} (selector.rs:73)
==512451==    by 0x5ADF30: <core::slice::iter::Iter<T> as core::iter::traits::iterator::Iterator>::any (macros.rs:308)
==512451==    by 0x529F8E: scraper::selector::Selector::matches_with_scope_and_cache (selector.rs:70)
==512451==    by 0x5646D0: <scraper::html::Select as core::iter::traits::iterator::Iterator>::next (mod.rs:163)
==512451==    by 0x227EEB: scraper_broken::extract_redacted_from_html (main.rs:37)
==512451==    by 0x2278F5: scraper_broken::main (main.rs:15)
==512451==    by 0x22696A: core::ops::function::FnOnce::call_once (function.rs:250)
==512451==    by 0x229C0D: std::sys::backtrace::__rust_begin_short_backtrace (backtrace.rs:152)
==512451== 
==512451== 
==512451== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==512451==  Bad permissions for mapped region at address 0x7A1CE5
==512451==    at 0x7A1CE5: ???
==512451==    by 0x22BB26: std::rt::lang_start (rt.rs:198)
==512451==    by 0x228C8D: main (in /home/adam/scraper-segfault/target/debug/scraper-broken)
==512451== 
==512451== HEAP SUMMARY:
==512451==     in use at exit: 0 bytes in 0 blocks
==512451==   total heap usage: 7 allocs, 7 frees, 2,072 bytes allocated
==512451== 
==512451== All heap blocks were freed -- no leaks are possible
==512451== 
==512451== Use --track-origins=yes to see where uninitialised values come from
==512451== For lists of detected and suppressed errors, rerun with: -s
==512451== ERROR SUMMARY: 8 errors from 8 contexts (suppressed: 0 from 0)

to say about it.

@adamreichold
Copy link
Member

Weirdly enough, the uninitialized memory is stack allocations created by the parsers in selectors and cssparser.

@adamreichold
Copy link
Member

I minimized the test code further to

use scraper::{Html, Selector};
use tikv_jemallocator::Jemalloc;

#[global_allocator]
static ALLOC: Jemalloc = Jemalloc;

fn main() {
    let test_data = include_str!("../repro.html");

    loop {
        extract_redacted_from_html(test_data);
    }
}

fn extract_redacted_from_html(html: &str) {
    let fragment = Html::parse_fragment(html);
    let row_selector = Selector::parse("tr").unwrap();

    fragment.select(&row_selector).next();
}

@adamreichold
Copy link
Member

adamreichold commented May 16, 2025

Weirdly enough, the uninitialized memory is stack allocations created by the parsers in selectors and cssparser.

This also fits with keeping the selector itself alive working around the crash:

diff --git a/src/main.rs b/src/main.rs
index 97dfb96..6cf9a95 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -13,8 +13,8 @@ fn main() {
 }
 
 fn extract_redacted_from_html(html: &str) {
-    let fragment = Html::parse_fragment(html);
     let row_selector = Selector::parse("tr").unwrap();
+    let fragment = Html::parse_fragment(html);
 
     fragment.select(&row_selector).next();
 }

@adamreichold
Copy link
Member

This also explain why we never saw this in production: Our code is basically always structure to parse/allocate selectors one in the beginning and then re-use them for all documents, i.e. the selectors will always outlive the documents in our usage.

@adamreichold
Copy link
Member

So this appears to happen when the document is dropped after the selector via an invalid string_cache::atom::Atom<markup5ever::LocalNameStaticSet>
thread 'main' panicked at /home/adam/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/string_cache-0.8.9/src/atom.rs:262:25:
null pointer dereference occurred
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread caused non-unwinding panic. aborting.

Program received signal SIGABRT, Aborted.
Downloading 4.48 K source file /usr/src/debug/glibc-2.41/nptl/pthread_kill.c
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44                                                                                                                                                     
44            return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
(gdb) bt full
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
        tid = <optimized out>
        ret = 0
        pd = <optimized out>
        old_mask = {__val = {17037417024}}
        ret = <optimized out>
#1  0x00007ffff7c9b453 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:89
No locals.
#2  0x00007ffff7c41cb6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
        ret = <optimized out>
#3  0x00007ffff7c2938b in __GI_abort () at abort.c:73
        act = {__sigaction_handler = {sa_handler = 0x7fffffffca60, sa_sigaction = 0x7fffffffca60}, sa_mask = {__val = {93824998144200, 3758096416, 140737488341608, 93824998147272, 93824997890192, 0, 140737488341880, 140737488341648, 93824997673827, 
              18446744065119640128, 0, 0, 140737488341520, 140737488341504, 140737488341615, 140737488341815}}, sa_flags = 0, sa_restorer = 0x0}
#4  0x0000555555a89a5a in std::sys::pal::unix::abort_internal () at library/std/src/sys/pal/unix/mod.rs:366
No locals.
#5  0x0000555555a88913 in std::panicking::rust_panic_with_hook () at library/std/src/panicking.rs:37
No locals.
#6  0x0000555555a884f6 in std::panicking::begin_panic_handler::{closure#0} () at library/std/src/panicking.rs:699
No locals.
#7  0x0000555555a875a9 in std::sys::backtrace::__rust_end_short_backtrace<std::panicking::begin_panic_handler::{closure_env#0}, !> () at library/std/src/sys/backtrace.rs:168
No locals.
#8  0x0000555555a881bd in std::panicking::begin_panic_handler () at library/std/src/panicking.rs:697
No locals.
#9  0x0000555555aa3ccd in core::panicking::panic_nounwind_fmt::runtime () at library/core/src/panicking.rs:117
No locals.
#10 core::panicking::panic_nounwind_fmt () at library/core/src/intrinsics/mod.rs:3175
No locals.
#11 0x0000555555aa3ead in core::panicking::panic_null_pointer_dereference () at library/core/src/panicking.rs:311
No locals.
#12 0x0000555555a4ca5a in string_cache::atom::{impl#6}::drop<markup5ever::LocalNameStaticSet> (self=0x7ffff780e340) at /home/adam/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/string_cache-0.8.9/src/atom.rs:262
        entry = 0x0
#13 0x0000555555a48bdb in core::ptr::drop_in_place<string_cache::atom::Atom<markup5ever::LocalNameStaticSet>> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#14 0x0000555555a489c0 in core::ptr::drop_in_place<markup5ever::interface::QualName> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#15 0x00005555559ad537 in core::ptr::drop_in_place<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> ()
    at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#16 0x00005555559916f1 in core::ptr::mut_ptr::{impl#0}::drop_in_place<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> (self=0x7ffff780e338)
    at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mut_ptr.rs:1491
No locals.
#17 hashbrown::raw::Bucket<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)>::drop<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> (
    self=0x7fffffffcda8) at /rust/deps/hashbrown-0.15.3/src/raw/mod.rs:505
No locals.
#18 hashbrown::raw::RawTableInner::drop_elements<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> (self=0x7ffff78b6568) at /rust/deps/hashbrown-0.15.3/src/raw/mod.rs:2106
        item = hashbrown::raw::Bucket<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> {ptr: core::ptr::non_null::NonNull<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> {pointer: 0x7ffff780e360}}
        iter = hashbrown::raw::RawIter<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> {iter: hashbrown::raw::RawIterRange<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> {current_group: hashbrown::control::bitmask::BitMaskIter (hashbrown::control::bitmask::BitMask (16384)), data: hashbrown::raw::Bucket<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> {ptr: core::ptr::non_null::NonNull<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>)> {pointer: 0x7ffff780e540}}, next_ctrl: 0x7ffff780e550, end: 0x7ffff780e550}, items: 1}
#19 0x00005555559922ad in hashbrown::raw::RawTableInner::drop_inner_table<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>), alloc::alloc::Global> (self=0x7ffff78b6568, alloc=0x7ffff78b6588, table_layout=...)
    at /rust/deps/hashbrown-0.15.3/src/raw/mod.rs:2161
No locals.
#20 0x000055555599102c in hashbrown::raw::{impl#18}::drop<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>), alloc::alloc::Global> (self=0x7ffff78b6568) at /rust/deps/hashbrown-0.15.3/src/raw/mod.rs:3348
No locals.
#21 0x00005555559ade2b in core::ptr::drop_in_place<hashbrown::raw::RawTable<(markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>), alloc::alloc::Global>> ()
    at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#22 0x00005555559ae3fb in core::ptr::drop_in_place<hashbrown::map::HashMap<markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>, ahash::random_state::RandomState, alloc::alloc::Global>> ()
    at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#23 0x00005555559ae4cb in core::ptr::drop_in_place<std::collections::hash::map::HashMap<markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>, ahash::random_state::RandomState>> ()
    at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#24 0x00005555559add4b in core::ptr::drop_in_place<ahash::hash_map::AHashMap<markup5ever::interface::QualName, tendril::tendril::Tendril<tendril::fmt::UTF8, tendril::tendril::NonAtomic>, ahash::random_state::RandomState>> ()
    at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#25 0x00005555559ae9ef in core::ptr::drop_in_place<scraper::node::Element> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#26 0x00005555559ae8c1 in core::ptr::drop_in_place<scraper::node::Node> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#27 0x00005555559af00b in core::ptr::drop_in_place<ego_tree::Node<scraper::node::Node>> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#28 0x00005555559af540 in core::ptr::drop_in_place<[ego_tree::Node<scraper::node::Node>]> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
--Type <RET> for more, q to quit, c to continue without paging--
No locals.
#29 0x0000555555a1ca83 in alloc::vec::{impl#25}::drop<ego_tree::Node<scraper::node::Node>, alloc::alloc::Global> (self=0x7fffffffd0b8) at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/mod.rs:3804
No locals.
#30 0x00005555559afb77 in core::ptr::drop_in_place<alloc::vec::Vec<ego_tree::Node<scraper::node::Node>, alloc::alloc::Global>> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#31 0x00005555559af01b in core::ptr::drop_in_place<ego_tree::Tree<scraper::node::Node>> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#32 0x00005555559ae81c in core::ptr::drop_in_place<scraper::html::Html> () at /home/adam/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:524
No locals.
#33 0x0000555555668f5c in scraper_broken::extract_redacted_from_html (
    html="\r\n\r\n\r\n<!DOCTYPE HTML>\r\n<html class=\"no-js\">\r\n<head>\r\n    <meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />\r\n    <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge,chrome=1\">\r\n    "...) at src/main.rs:20
No locals.
#34 0x0000555555668d5a in scraper_broken::main () at src/main.rs:11
        test_data = "\r\n\r\n\r\n<!DOCTYPE HTML>\r\n<html class=\"no-js\">\r\n<head>\r\n    <meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />\r\n    <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge,chrome=1\">\r\n    "...

@adamreichold
Copy link
Member

The only interaction between selectors and string_cache I can see is our CssLocalName which was not introduced in 0.21 but did gain an implementation of PrecomputedHash.

@adamreichold
Copy link
Member

Hhhmmm, replacing CssLocalName(LocalName) by CssLocalName(String) does not help, so this seems unrelated after all.

The example does minimize further though, i.e.

use scraper::{Html, Selector};
use tikv_jemallocator::Jemalloc;

#[global_allocator]
static ALLOC: Jemalloc = Jemalloc;

fn main() {
    let test_data = include_str!("../repro.html");

    loop {
        let fragment = Html::parse_fragment(test_data);
        let row_selector = Selector::parse("tr").unwrap();
    }
}

@adamreichold
Copy link
Member

The selector value itself seems irrelevant, but the HTML test data is not... So a bug in html5ever seems possible?

@adamreichold
Copy link
Member

So I tried to reduce the HTML input and this does indeed move around the crash site.
<!DOCTYPE HTML>
<html class="no-js">
<body class="culture--en-GB">

    <table class="table"  uiid="0-5-0-0-0-0-1" data-sort-parameter="order-by">
        <tbody class="table__body">
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted 2016
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">


                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted 2017
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">

                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted 2018
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">


                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted 2019
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">


                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted 2020
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">

                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted 2021
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">

                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted 2022
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">

                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted redacted
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">

                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted Today
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">

                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted Yesterday
                    </td>
                    <td class="table__cell table__cell--grow">

                        Raw (XML)
                    </td>
                    <td class="table__cell">


<button name="run-redacted" ui-ajax="ui-ajax" value="0876d065-95f2-4ce6-9ec4-ced1e2b48e7e" uiid="0-5-0-0-0-0-1-34" id="0-5-0-0-0-0-1-34" class=" button--secondary" type="button">
    Run
</button>

                    </td>
            </tr>
            <tr class="table__row">
                    <td class="table__cell">

                        Redacted redacted
                    </td>
                    <td class="table__cell table__cell--grow">

                        Excel (unformatted .xlsx)
                    </td>
                    <td class="table__cell">


<button name="run-redacted" ui-ajax="ui-ajax" value="7bf8a530-7890-4134-a87f-ed8b6be6f944" uiid="0-5-0-0-0-0-1-37" id="0-5-0-0-0-0-1-37" class=" button--secondary" type="button">
    Run
</button>

                    </td>
            </tr>
        </tbody>
    </table>

</body>
</html>

@tamasfe I am sorry to have nothing more definitive, but could you try to get someone from html5ever involved? scraper itself does not have unsafe code and the unsafe code in ego-tree does not seem too likely even though it is possible (downgrading to ego-tree 0.6.2 as in scraper 0.19 also did not fix the issue).

@adamreichold
Copy link
Member

But then again, doing

use html5ever::tendril::TendrilSink;
use html5ever::{parse_fragment, ns, QualName, local_name};
use markup5ever_rcdom::{RcDom};
let dom = parse_fragment(RcDom::default(), Default::default(), QualName::new(None, ns!(html), local_name!("body")), Vec::new()).one(test_data);

instead of

let fragment = Html::parse_fragment(test_data);

does not crash which would point at our TreeSink implementation in some way...

@adamreichold
Copy link
Member

does not crash which would point at our TreeSink implementation in some way...

Oh well, but after removing all unsafe code from ego-tree, it still segfaults, i.e. even if we use no unsafe code at all via scraper or ego-tree.

@tamasfe
Copy link
Author

tamasfe commented May 16, 2025

Thank you for looking into this, I'll try to bisect it and/or run it with miri over the weekend to see if I find anything.

Of course if it's not a scraper issue, I can report it upstream.

@adamreichold
Copy link
Member

adamreichold commented May 16, 2025

miri over the weekend

I tried Miri, but stacked borrows fails somewhere inside tendril and tree borrows does not handle the pointer-to-integer casts in string-cache.

@tamasfe
Copy link
Author

tamasfe commented May 16, 2025

I reduced the test data a bit further, the issue seems to be a combination of things in it, the html itself is well-formed, and adding/removing nodes randomly will trigger the issue.

I also looked into downgrading html5ever itself, but there are many breaking changes and I'm not familiar with the code so I gave up on that approach.

@adamreichold
Copy link
Member

Another possibility besides Miri and Valgrind would be Address Sanitizer, e.g.

> RUSTFLAGS=-Zsanitizer=address cargo +nightly run -Zbuild-std --target x86_64-unknown-linux-gnu

but I also did not get an actually useful stack trace out of that.

I wonder if the aim should be to run the non-crashing variant and get one of those tools to emit something useful instead of crashing into invalid stack frames etc.

@tamasfe
Copy link
Author

tamasfe commented May 16, 2025

Address Sanitizer

Oddly enough the bug does not seem to trigger it.

Also I've noticed that I was using Html::parse_fragment, changing it to Html::parse_document works correctly, which is fair since we are parsing an entire document, I don't know what difference that makes though.

@adamreichold
Copy link
Member

Oddly enough the bug does not seem to trigger it.

I get a panic in our tree sink implementation, i.e. html5ever wants to know the name of an element which does not exist. And then Rust's panic handler segfaults trying to create a backtrace which suggests stack-based memory corruption (which jives with Valgrind's report).

@tamasfe
Copy link
Author

tamasfe commented May 16, 2025

Also I've noticed that I was using Html::parse_fragment, changing it to Html::parse_document works correctly, which is fair since we are parsing an entire document, I don't know what difference that makes though.

Sorry, it works on the reduced test data, but running it with the original test data, it triggers a panic, which may or may not be related to this:

thread 'main' panicked at scraper/scraper/src/html/tree_sink.rs:64:18:
called `Option::unwrap()` on a `None` value

@adamreichold
Copy link
Member

Also I've noticed that I was using Html::parse_fragment, changing it to Html::parse_document works correctly, which is fair since we are parsing an entire document, I don't know what difference that makes though.

The main difference is that we tell html5ever to assume that everything is wrapped within a body element:

pub fn parse_document(document: &str) -> Self {
    let parser =
        driver::parse_document(HtmlTreeSink::new(Self::new_document()), Default::default());
    parser.one(document)
}

pub fn parse_fragment(fragment: &str) -> Self {
    let parser = driver::parse_fragment(
        HtmlTreeSink::new(Self::new_fragment()),
        Default::default(),
        QualName::new(None, ns!(html), local_name!("body")),
        Vec::new(),
    );
    parser.one(fragment)
}

Both are safe functions though, so it should not be possible to trigger a segmentation fault by one or the other.

@adamreichold
Copy link
Member

Also I've noticed that I was using Html::parse_fragment, changing it to Html::parse_document works correctly, which is fair since we are parsing an entire document, I don't know what difference that makes though.

Sorry, it works on the reduced test data, but running it with the original test data, it triggers a panic, which may or may not be related to this:

thread 'main' panicked at scraper/scraper/src/html/tree_sink.rs:64:18:
called `Option::unwrap()` on a `None` value

Running the initial reproduction but using parse_document repeatedly, I sometimes get just a segfault and sometimes this panic with a follow-up segfault. As written above, this really smells of stack-based memory corruption which of course makes the backtraces and such dubious.

@dvishal485
Copy link

I used this crate v0.20.0 with wasm-bindgen, which used to compile and run just fine. But since v0.21.0, there seems to be some issue with allocator. With console error hook, I identified that the dlmalloc code panics.

panicked at /rust/deps/dlmalloc-0.2.7/src/dlmalloc.rs:1198:13:
assertion failed: psize <= size + max_overhead

I suspect that this issue is related to my wasm issue, and might be related to rustwasm/wasm-pack#1389. (I'm on cargo 1.86.0 btw)

@cfvescovo
Copy link
Member

cfvescovo commented May 18, 2025

I had never seen this before! Will start debugging tomorrow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants