This issue is a collection of some structural changes I want to make to html5ever (and its integration in the servo project).
The intent is both for me to act as a rubber duck and for reviewers to make sure what I plan makes sense (so complaints/questions are welcome).
Current architecture
For simplicity, the following diagram does not include XML parsing; XML is not really relevant for now.
There are multiple downsides to this architecture:
- Encoding support with <meta charset> tags is hard, because the decoder is far away from the parser.
- Reentrancy requires awkward interior mutability in html5ever, because the TreeSink can invoke the ServoParser again with document.write.
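To make the reentrancy problem concrete, here is a minimal sketch of why a parser that can be re-entered ends up relying on interior mutability. `ToyParser`, `feed`, `feed_scoped` and the `on_script` callback are hypothetical names for illustration only; they are not html5ever or Servo APIs. The callback stands in for a TreeSink that runs a script which calls document.write.

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for a parser whose sink can call back into it.
struct ToyParser {
    // `feed` takes `&self` so the callback can reach the parser while it is
    // running; mutation therefore has to go through a RefCell.
    buffer: RefCell<Vec<String>>,
}

impl ToyParser {
    fn new() -> Self {
        ToyParser { buffer: RefCell::new(Vec::new()) }
    }

    /// Number of inputs buffered so far.
    fn buffered(&self) -> usize {
        self.buffer.borrow().len()
    }

    /// Buggy variant: the mutable borrow is still alive while the callback
    /// runs, so a reentrant call fails at runtime instead of compile time.
    fn feed(
        &self,
        input: &str,
        on_script: &dyn Fn(&ToyParser) -> Result<(), String>,
    ) -> Result<(), String> {
        let mut buffer = self
            .buffer
            .try_borrow_mut()
            .map_err(|_| String::from("buffer already borrowed"))?;
        buffer.push(input.to_string());
        on_script(self) // `buffer` is only dropped after this call returns
    }

    /// Working variant: the borrow is a temporary that ends before the
    /// callback is invoked.
    fn feed_scoped(
        &self,
        input: &str,
        on_script: &dyn Fn(&ToyParser) -> Result<(), String>,
    ) -> Result<(), String> {
        self.buffer
            .try_borrow_mut()
            .map_err(|_| String::from("buffer already borrowed"))?
            .push(input.to_string());
        on_script(self)
    }
}

/// Returns true if a nested `feed` call is rejected.
fn reentrant_feed_fails() -> bool {
    let parser = ToyParser::new();
    parser
        .feed("<script>", &|p| p.feed("written", &|_| Ok(())))
        .is_err()
}
```

Both variants compile, which is the point: the borrow checker cannot tell them apart, and with a plain `borrow_mut()` the first one would panic instead of returning an error.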
Encoding support
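When the parser encounters a <meta charset> tag, it cannot proceed immediately. Instead, a message needs to be bubbled up from the parser, through the tokenizer, to the place where the decoding happens, and there we need to change the encoding while parsing. This is why the parser architecture would benefit from the decoder being closer to the tree builder.
Right now the decoding happens in the network_decoder and network_input fields on ServoParser: https://github.com/servo/servo/blob/4e9993128b81b5a3757970786d47fb165ed3ebca/components/script/dom/servoparser/mod.rs#L111-L116.
The problem with naively moving the decoding process into html5ever is that all input would be decoded twice (once in the prefetcher and once in the "main" parser).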
The final design plan is to have an (optional) wrapper around the Tokenizer which handles both decoding and buffering of input.
Pretty much all of that is already implemented in #590 with the DecodingParser type. Note that this type also takes care of the intricacies of document.write, which might benefit other users of html5ever. Nico requested that this DecodingParser stay behind a feature flag.
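As a rough illustration of what "a wrapper that handles both decoding and buffering" has to do, here is a minimal sketch. It is not the DecodingParser from #590; `DecodingBuffer`, `Encoding` and all method names are assumptions made up for this example. The decoder is a toy that only knows UTF-8 and true ISO-8859-1, and it re-decodes the whole buffer on every call; a real implementation would use a streaming decoder such as the encoding_rs crate.

```rust
/// Toy set of encodings (assumption: a real wrapper would support many more).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Encoding {
    Utf8,
    Latin1,
}

/// Decodes all of `bytes` with `encoding`.
fn decode(bytes: &[u8], encoding: Encoding) -> String {
    match encoding {
        // Invalid sequences become U+FFFD.
        Encoding::Utf8 => String::from_utf8_lossy(bytes).into_owned(),
        // ISO-8859-1 maps every byte to the code point with the same value.
        Encoding::Latin1 => bytes.iter().map(|&b| b as char).collect(),
    }
}

/// Hypothetical wrapper that sits next to the tokenizer: it keeps the raw
/// bytes so the document can be decoded again if the encoding changes.
struct DecodingBuffer {
    raw: Vec<u8>,
    encoding: Encoding,
}

impl DecodingBuffer {
    fn new(encoding: Encoding) -> Self {
        DecodingBuffer { raw: Vec::new(), encoding }
    }

    /// Buffers another chunk received from the network.
    fn push_bytes(&mut self, chunk: &[u8]) {
        self.raw.extend_from_slice(chunk);
    }

    /// Decodes everything buffered so far with the current encoding.
    fn decoded(&self) -> String {
        decode(&self.raw, self.encoding)
    }

    /// Called when the tree builder sees a <meta charset> tag.
    /// Returns true if the caller has to restart parsing from `decoded()`.
    fn change_encoding(&mut self, new_encoding: Encoding) -> bool {
        if new_encoding == self.encoding {
            return false;
        }
        self.encoding = new_encoding;
        true
    }
}
```

Because the wrapper owns both the bytes and the decoder, the encoding switch is a local method call instead of a message that has to travel from the tree builder back out to ServoParser.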
Parallel Parsing / Real Prefetching
My current plan is very similar to servo/servo#19203, except that I want the input stream to live in the parser thread. Otherwise you have to send (read: clone) the input each time you invoke the parser. The input stream is a spec concept that buffers input which has been received from the network but not yet processed by the tokenizer. It is also where the decoding from bytes to UTF-8 happens. html5ever does not currently implement this.
Below is a somewhat simplified diagram. The extra info sent to the parser thread mostly relates to document.write and is not included here for simplicity.
A parse operation in the diagram above could be something like AppendChild or SetQuirksMode - mirroring the methods of the current TreeSink trait.
Notice how this design allows us to support reentrancy without interior mutability in html5ever - the parser thread does not need to know about reentrant parsing at all, since it just processes input from the main thread.
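The shape of this design can be sketched with two channels. Everything here is an assumption for illustration: `ParseOp`, its payloads, `spawn_parser` and `parse_on_thread` are hypothetical names, and the "tokenizer" is a toy that turns each complete line into one operation. The part that mirrors the plan is that the input stream is owned by the parser thread and only parse operations travel back to the main thread.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical parse operations, loosely mirroring TreeSink methods.
#[derive(Debug, PartialEq)]
enum ParseOp {
    AppendChild(String),
    SetQuirksMode(bool),
}

/// Spawns the parser thread. It owns the input stream and never touches
/// the DOM; it only turns input into operations.
fn spawn_parser(
    input_rx: mpsc::Receiver<Vec<u8>>,
    ops_tx: mpsc::Sender<ParseOp>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // Bytes received but not yet consumed by the (toy) tokenizer.
        let mut input_stream: Vec<u8> = Vec::new();
        for chunk in input_rx {
            input_stream.extend_from_slice(&chunk);
            // Toy tokenizer: every complete line becomes one operation.
            while let Some(end) = input_stream.iter().position(|&b| b == b'\n') {
                let line: Vec<u8> = input_stream.drain(..=end).collect();
                let text = String::from_utf8_lossy(&line[..end]).into_owned();
                let op = if text.starts_with("<!DOCTYPE") {
                    ParseOp::SetQuirksMode(false)
                } else {
                    ParseOp::AppendChild(text)
                };
                if ops_tx.send(op).is_err() {
                    return; // the main thread went away
                }
            }
        }
    })
}

/// Main-thread side: forwards network chunks and collects the operations
/// it would apply to the DOM.
fn parse_on_thread(chunks: Vec<Vec<u8>>) -> Vec<ParseOp> {
    let (input_tx, input_rx) = mpsc::channel();
    let (ops_tx, ops_rx) = mpsc::channel();
    let handle = spawn_parser(input_rx, ops_tx);
    for chunk in chunks {
        input_tx.send(chunk).expect("parser thread is alive");
    }
    drop(input_tx); // signals end of input to the parser thread
    let ops: Vec<ParseOp> = ops_rx.iter().collect();
    handle.join().expect("parser thread panicked");
    ops
}
```

In this sketch a document.write call on the main thread would simply be one more message on the input channel, so the parser thread needs no RefCell and no knowledge of reentrancy.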
Ordering of changes
The current plan is to:
1. Move buffering of input into html5ever. This makes everything else easier, but it will be a significant breaking change to the API!
2. Implement parallel HTML parsing in servo, to be able to implement encoding support without decoding everything twice.
3. Implement support for <meta charset>.
I don't quite follow all the details (and this sketch doesn't include all of them!), but this broadly sounds good to me.
I would be very glad if we could get rid of the interior mutability. Since it was introduced, it has been a significant source of bugs in my parser (mostly RefCell panics) that the borrow checker previously protected against.