-
Notifications
You must be signed in to change notification settings - Fork 115
Memory corruption (probably since 0.21.0) #253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Still reproduces using the latest code from
to say about it. |
Weirdly enough, the uninitialized memory is stack allocations created by the parsers in |
I minimized the test code further to use scraper::{Html, Selector};
use tikv_jemallocator::Jemalloc;
#[global_allocator]
static ALLOC: Jemalloc = Jemalloc;
fn main() {
let test_data = include_str!("../repro.html");
loop {
extract_redacted_from_html(test_data);
}
}
fn extract_redacted_from_html(html: &str) {
let fragment = Html::parse_fragment(html);
let row_selector = Selector::parse("tr").unwrap();
fragment.select(&row_selector).next();
} |
This also fits with keeping the selector itself alive working around the crash: diff --git a/src/main.rs b/src/main.rs
index 97dfb96..6cf9a95 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -13,8 +13,8 @@ fn main() {
}
fn extract_redacted_from_html(html: &str) {
- let fragment = Html::parse_fragment(html);
let row_selector = Selector::parse("tr").unwrap();
+ let fragment = Html::parse_fragment(html);
fragment.select(&row_selector).next();
} |
This also explain why we never saw this in production: Our code is basically always structure to parse/allocate selectors one in the beginning and then re-use them for all documents, i.e. the selectors will always outlive the documents in our usage. |
So this appears to happen when the document is dropped after the selector via an invalid string_cache::atom::Atom<markup5ever::LocalNameStaticSet>
|
The only interaction between |
Hhhmmm, replacing The example does minimize further though, i.e. use scraper::{Html, Selector};
use tikv_jemallocator::Jemalloc;
#[global_allocator]
static ALLOC: Jemalloc = Jemalloc;
fn main() {
let test_data = include_str!("../repro.html");
loop {
let fragment = Html::parse_fragment(test_data);
let row_selector = Selector::parse("tr").unwrap();
}
} |
The selector value itself seems irrelevant, but the HTML test data is not... So a bug in |
So I tried to reduce the HTML input and this does indeed move around the crash site.<!DOCTYPE HTML>
<html class="no-js">
<body class="culture--en-GB">
<table class="table" uiid="0-5-0-0-0-0-1" data-sort-parameter="order-by">
<tbody class="table__body">
<tr class="table__row">
<td class="table__cell">
Redacted 2016
</td>
<td class="table__cell table__cell--grow">
Raw (XML)
</td>
<td class="table__cell">
</td>
</tr>
<tr class="table__row">
<td class="table__cell">
Redacted 2017
</td>
<td class="table__cell table__cell--grow">
Raw (XML)
</td>
<td class="table__cell">
</td>
</tr>
<tr class="table__row">
<td class="table__cell">
Redacted 2018
</td>
<td class="table__cell table__cell--grow">
Raw (XML)
</td>
<td class="table__cell">
</td>
</tr>
<tr class="table__row">
<td class="table__cell">
Redacted 2019
</td>
<td class="table__cell table__cell--grow">
Raw (XML)
</td>
<td class="table__cell">
</td>
</tr>
<tr class="table__row">
<td class="table__cell">
Redacted 2020
</td>
<td class="table__cell table__cell--grow">
Raw (XML)
</td>
<td class="table__cell">
</td>
</tr>
<tr class="table__row">
<td class="table__cell">
Redacted 2021
</td>
<td class="table__cell table__cell--grow">
Raw (XML)
</td>
<td class="table__cell">
</td>
</tr>
<tr class="table__row">
<td class="table__cell">
Redacted 2022
</td>
<td class="table__cell table__cell--grow">
Raw (XML)
</td>
<td class="table__cell">
</td>
</tr>
<tr class="table__row">
<td class="table__cell">
Redacted redacted
</td>
<td class="table__cell table__cell--grow">
Raw (XML)
</td>
<td class="table__cell">
</td>
</tr>
<tr class="table__row">
<td class="table__cell">
Redacted Today
</td>
<td class="table__cell table__cell--grow">
Raw (XML)
</td>
<td class="table__cell">
</td>
</tr>
<tr class="table__row">
<td class="table__cell">
Redacted Yesterday
</td>
<td class="table__cell table__cell--grow">
Raw (XML)
</td>
<td class="table__cell">
<button name="run-redacted" ui-ajax="ui-ajax" value="0876d065-95f2-4ce6-9ec4-ced1e2b48e7e" uiid="0-5-0-0-0-0-1-34" id="0-5-0-0-0-0-1-34" class=" button--secondary" type="button">
Run
</button>
</td>
</tr>
<tr class="table__row">
<td class="table__cell">
Redacted redacted
</td>
<td class="table__cell table__cell--grow">
Excel (unformatted .xlsx)
</td>
<td class="table__cell">
<button name="run-redacted" ui-ajax="ui-ajax" value="7bf8a530-7890-4134-a87f-ed8b6be6f944" uiid="0-5-0-0-0-0-1-37" id="0-5-0-0-0-0-1-37" class=" button--secondary" type="button">
Run
</button>
</td>
</tr>
</tbody>
</table>
</body>
</html> @tamasfe I am sorry to have nothing more definitive, but could you try to get someone from |
But then again, doing use html5ever::tendril::TendrilSink;
use html5ever::{parse_fragment, ns, QualName, local_name};
use markup5ever_rcdom::{RcDom};
let dom = parse_fragment(RcDom::default(), Default::default(), QualName::new(None, ns!(html), local_name!("body")), Vec::new()).one(test_data); instead of let fragment = Html::parse_fragment(test_data); does not crash which would point at our |
Oh well, but after removing all unsafe code from |
Thank you for looking into this, I'll try to bisect it and/or run it with miri over the weekend to see if I find anything. Of course if it's not a scraper issue, I can report it upstream. |
I tried Miri, but stacked borrows fails somewhere inside |
I reduced the test data a bit further, the issue seems to be a combination of things in it, the html itself is well-formed, and adding/removing nodes randomly will trigger the issue. I also looked into downgrading html5ever itself, but there are many breaking changes and I'm not familiar with the code so I gave up on that approach. |
Another possibility besides Miri and Valgrind would be Address Sanitizer, e.g. > RUSTFLAGS=-Zsanitizer=address cargo +nightly run -Zbuild-std --target x86_64-unknown-linux-gnu but I also did not get an actually useful stack trace out of that. I wonder if the aim should be to run the non-crashing variant and get one of those tools to emit something useful instead of crashing into invalid stack frames etc. |
Oddly enough the bug does not seem to trigger it. Also I've noticed that I was using |
I get a panic in our tree sink implementation, i.e. |
Sorry, it works on the reduced test data, but running it with the original test data, it triggers a panic, which may or may not be related to this:
|
The main difference is that we tell pub fn parse_document(document: &str) -> Self {
let parser =
driver::parse_document(HtmlTreeSink::new(Self::new_document()), Default::default());
parser.one(document)
}
pub fn parse_fragment(fragment: &str) -> Self {
let parser = driver::parse_fragment(
HtmlTreeSink::new(Self::new_fragment()),
Default::default(),
QualName::new(None, ns!(html), local_name!("body")),
Vec::new(),
);
parser.one(fragment)
} Both are safe functions though, so it should not be possible to trigger a segmentation fault by one or the other. |
Running the initial reproduction but using |
I used this crate v0.20.0 with wasm-bindgen, which used to compile and run just fine. But since v0.21.0, there seems to be some issue with allocator. With console error hook, I identified that the dlmalloc code panics.
I suspect that this issue is related to my wasm issue, and might be related to rustwasm/wasm-pack#1389. (I'm on cargo 1.86.0 btw) |
I had never seen this before! Will start debugging tomorrow |
We have a service at my company that uses scraper for extracting data from HTML that recently started crashing.
I've tracked down the issue that appears after bumping scraper from
0.19.0
to the latest version.The crash happens once in a blue moon with the system allocator, however with Jemalloc on linux, it happens almost immediately.
I have extracted some of our logic that consistently reproduces the issue and uploaded it here: https://github.com/tamasfe/scraper-segfault .
Simply run
cargo run
and observe the segmentation fault message.I did not yet dive into the causes and attempts to fix it as tracking it down took me several hours already, I hope the repro is minimal enough for people who are familiar with the project internals.
The text was updated successfully, but these errors were encountered: