Skip to content

Make TextEncoder and TextDecoder be transform streams #127

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 17 commits into from

Conversation

ricea
Copy link
Collaborator

@ricea ricea commented Dec 11, 2017

Add readable and writable attributes to TextEncoder and TextDecoder
objects to permit them to be used to transform ReadableStreams via the
pipeThrough() method.

This integrates the Encoding Standard with the Streams Standard. See
https://streams.spec.whatwg.org/#ts-model for the definition of a
"transform stream" and https://streams.spec.whatwg.org/#rs-pipe-through
for an explanation of the pipeThrough() method.

A TextEncoder object can be used to transform a stream of strings to a
stream of bytes in UTF-8 encoding. A TextDecoder object can be used to
transform a stream of bytes in the encoding passed to the constructor to
strings.

The implementation delegates to a TransformStream object internally,
which provides the glue logic that ties the readable and writable
together.

There is a human-readable version of these changes at
http://htmlpreview.github.io/?https://github.com/ricea/encoding-streams/blob/master/patch.html

There is a prollyfill and tests for the new functionality at
https://github.com/GoogleChromeLabs/text-encode-transform-prollyfill

Closes #72.


Preview | Diff

Add `readable` and `writable` attributes to TextEncoder and TextDecoder
objects to permit them to be used to transform ReadableStreams via the
`pipeThrough()` method.

This integrates the Encoding Standard with the Streams Standard. See
https://streams.spec.whatwg.org/#ts-model for the definition of a
"transform stream" and https://streams.spec.whatwg.org/#rs-pipe-through
for an explanation of the `pipeThrough()` method.

A TextEncoder object can be used to transform a stream of strings to a
stream of bytes in UTF-8 encoding. A TextDecoder object can be used to
transform a stream of bytes in the encoding passed to the constructor to
strings.

The implementation delegates to a TransformStream object internally,
which provides the glue logic that ties the `readable` and `writable`
together.

There is a human-readable version of these changes at
http://htmlpreview.github.io/?https://github.com/ricea/encoding-streams/blob/master/patch.html

There is a prollyfill and tests for the new functionality at
https://github.com/GoogleChromeLabs/text-encode-transform-prollyfill
@inexorabletash
Copy link
Member

Yay! Comments in random order:

Since it's a common enough request, would it be possible to include an example in the spec itself showing how to use streams to encode into an existing ArrayBuffer?

Can the tests be upgraded to use async/await ?

In the tests, stream-properties.html should probably just be an idlharness test. (I don't know how well that plays with p(r)ollyfills, though.)

What's the behavior for split surrogate pairs when encoding? We basically pretend that's not a problem in the non-streaming API, by making the input is a USVString and thus not needing a {stream} flag. In other words, we assume that the input is always a complete text, not just a randomly partitioned sequence of code units. In this patch, the text reads Let input be the result of converting chunk to a USVString.... which sounds like it might introduce problems when chunk sizes cross surrogate boundaries. This worries me. Also, needs test cases.

@ricea
Copy link
Collaborator Author

ricea commented Dec 12, 2017

Since it's a common enough request, would it be possible to include an example in the spec itself showing how to use streams to encode into an existing ArrayBuffer?

Sadly, bring-your-own-buffer style usage isn't supported by TransformStream yet. There's an open question about what to do when there isn't enough space in the supplied ArrayBuffer to fit the transformed input.

Since this is basically an optimisation feature, I don't plan to support it in the first release. I would rather take my time to get it right.

Can the tests be upgraded to use async/await ?

Certainly. I'm never quite sure what version of ES to target in tests. For the Streams Standard tests I've been avoiding introducing async/await, but I'm happy to use them here.

In the tests, stream-properties.html should probably just be an idlharness test. (I don't know how well that plays with p(r)ollyfills, though.)

I haven't tried it. I will give it a go.

What's the behavior for split surrogate pairs when encoding? We basically pretend that's not a problem in the non-streaming API, by making the input is a USVString and thus not needing a {stream} flag. In other words, we assume that the input is always a complete text, not just a randomly partitioned sequence of code units. In this patch, the text reads Let input be the result of converting chunk to a USVString.... which sounds like it might introduce problems when chunk sizes cross surrogate boundaries. This worries me. Also, needs test cases.

Yes, it's broken. I've been planning to write a test for this at some point. What I expect to see is if you encode "😻" in a single chunk, it works as intended. If you split it into two chunks, you get two chunks as output each containing U+FFFD encoded as UTF-8.

If there is consensus that this should be fixed, I think it can be done by converting to DOMString rather than USVString and special-casing the final code unit when it is the first half of a surrogate pair.

@annevk
Copy link
Member

annevk commented Dec 13, 2017

I think a problem here is that if a script running in a document invokes the constructor of either TextDecoder or TextEncoder in an <iframe> element the tear-off transform-related objects will have the associated global of the document (the current global/realm) and not the <iframe>'s document.

See also https://www.w3.org/Bugs/Public/show_bug.cgi?id=24652 and https://w3c-test.org/WebIDL/current-realm.html.

Note that if you change it to accept a DOMString you also need to do conversions from code unit to scalar value (you probably meant that; stating it here so I don't forget to verify it).

(Tests can use JavaScript features that ship in the latest release of all browsers. At least, that's my rule of thumb.)

@domenic
Copy link
Member

domenic commented Dec 13, 2017

I think we should indeed make split surrogates work. It seems like a big value proposition for having this built-in and shared, that people don't have to figure that out by themselves.

I think a problem here is that if a script running in a document invokes the constructor of either TextDecoder or TextEncoder in an <iframe> element the tear-off transform-related objects will have the associated global of the document (the current global/realm) and not the <iframe>'s document.

I don't think that is true. You're talking about new childIframe.contentWindow.TextEncoder(), right? The current realm inside the TextEncoder constructor is that of the constructor itself, i.e. of the iframe, so CreateTransformStream will still be in that realm, and the TransformStream objects will be from the same realm as the TextEncoder object constructed.

@annevk
Copy link
Member

annevk commented Dec 13, 2017

Ah, great, thanks for confirming. Let's make sure to test that as part of the test I referenced.

@ricea
Copy link
Collaborator Author

ricea commented Dec 14, 2017

@inexorabletash GoogleChromeLabs/text-encode-transform-polyfill#6 updates the tests to use async / await (much more readable) and adds idlharness.html. The prollyfill fails a couple of the idlharness tests because it (intentionally) doesn't do brand checks. Apart from that, everything looks good.

@ricea
Copy link
Collaborator Author

ricea commented Dec 14, 2017

I have been writing tests for surrogate pairs split between chunks and in the process realised there's a problem.

We're now explicitly taking a sequence of code units and normalising it to Unicode. Which means we're doing error handling. Which means the developer should probably be able to specify what type of error handling should be done. Which means adding a fatal option to the constructor. Which means that encode() should logically also support it. Which means that encode() also needs to be changed to take a DOMString. Which is a much more disruptive change than this one.

My preference is to land this first, and then worry about providing control over error behaviour and changing encode() later, since that is a much riskier change.

@annevk
Copy link
Member

annevk commented Dec 14, 2017

Which means we're doing error handling.

Well, but not any different than IDL always did, right? We're just changing the layering a bit. I'm not necessarily convinced String -> USVString conversion needs to be subject to fatal the same way that bytes <> scalar values needs to be. Note that the reason we have it in the latter case is primarily XML, which is the only format I know of that demands such handling. We exposed it to the API as a nicety mostly.

@ricea
Copy link
Collaborator Author

ricea commented Dec 14, 2017

Well, but not any different than IDL always did, right? We're just changing the layering a bit. I'm not necessarily convinced String -> USVString conversion needs to be subject to fatal the same way that bytes <> scalar values needs to be. Note that the reason we have it in the latter case is primarily XML, which is the only format I know of that demands such handling. We exposed it to the API as a nicety mostly.

That works for me too.

@ricea
Copy link
Collaborator Author

ricea commented Dec 19, 2017

I have written tests for split surrogates but I haven't written the standard language yet.

I am working on tests for ensuring that objects are all created in the correct realms. This is taking a while.

I probably won't have the standard language updated for split surrogates until the new year.

@annevk
Copy link
Member

annevk commented Dec 19, 2017

@vyv03354 are you willing to review this as well? Thanks!

Previously if the high surrogate and low surrogate were split between
two chunks, they would be each replaced with the replacement
character. Now the correct character will correctly appear in the
output.

Also mark "readable" and "writable" properties [SameObject].
@ricea
Copy link
Collaborator Author

ricea commented Jan 22, 2018

I have updated the standard language for split surrogates. Sorry for the delay. I tried a few approaches and adding the "convert code unit to scalar value" algorithm seemed to work best. PTAL.

Copy link
Member

@annevk annevk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly style nits (note that I've only reported the first instance).

I was actually hoping IDL's algorithm could become https://infra.spec.whatwg.org/#javascript-string-convert, but I couldn't really find a way to reuse that here so we might be stuck with some UTF-16 logic in multiple places forever.

encoding.bs Outdated
type:dfn; text:writable stream; url:#writable-stream
type:abstract-op; text:CreateTransformStream; url: #create-transform-stream
type:abstract-op; text:TransformStreamDefaultControllerEnqueue; url: #transform-stream-default-controller-enqueue
type:interface; text:TransformStream; url: #ts-class
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's not exported yet in streams should be exported so there's no need to list it here.

encoding.bs Outdated
text: type; url: #sec-ecmascript-data-types-and-values; type: dfn
text: IsDetachedBuffer; url: #sec-isdetachedbuffer; type: abstract-op
text: IsSharedArrayBuffer; url: #sec-issharedarraybuffer; type: abstract-op
text: internal slot; url: #sec-object-internal-methods-and-internal-slots; type: dfn
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally don't cross-reference "internal slot".

encoding.bs Outdated
@@ -1186,10 +1246,31 @@ if <a for=TextDecoder>error mode</a> is "<code>fatal</code>", and false otherwis
<p>The <dfn attribute for=TextDecoder><code>ignoreBOM</code></dfn> attribute's getter must return
true if <a for=TextDecoder>ignore BOM flag</a> is set, and false otherwise.

<p>The <dfn attribute for=TextDecoder><code>readable</code></dfn> attribute's getter must return the
contents of <a for=TextDecoder>transform</a>'s \[[readable]] <a>internal slot</a>.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think rather than "the contents of x" we usually say "x's value". @domenic?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just "return transform.[[readable]]" usually?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, indeed.

encoding.bs Outdated
slot</a>.

<li><p>If <a abstract-op>IsReadableStreamLocked</a>(<var>readable</var>) is true, throw a
{{TypeError}} exception.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then throw* (xref throw?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to xref throw I think; we haven't elsewhere

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"throw" is hyperlinked elsewhere in this standard. It would be nice to be consistent.

encoding.bs Outdated
{{TextDecoder}} <var>dec</var> and a <var>chunk</var>, runs these steps:

<ol>
<li><p>If the <a>type</a> of of <var>chunk</var> is not Object, or <var>chunk</var> does not have
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be noted as Type(chunk)?

encoding.bs Outdated
<li><p>If the <a>type</a> of of <var>chunk</var> is not Object, or <var>chunk</var> does not have
an \[[ArrayBufferData]] <a>internal slot</a>, or <a
abstract-op>IsDetachedBuffer</a>(<var>chunk</var>) is true, or <a
abstract-op>IsSharedArrayBuffer</a>(<var>chunk</var>) is true then return <a>a promise rejected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comma before then. Also I'm not sure I want to reference the promises guide as it really needs to be folded into IDL and has a bunch of things that confuse folks. Can we just say "a new promise rejected with a TypeError exception"?

encoding.bs Outdated
abstract-op>IsSharedArrayBuffer</a>(<var>chunk</var>) is true then return <a>a promise rejected
with</a> a {{TypeError}}.

<li><p>If the <a for=TextDecoder>do not flush flag</a> for <var>dec</var> is unset, set
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then set*

encoding.bs Outdated
<li><p>Let <var>output</var> be a new <a for=/>stream</a>.

<li>
<p>While true, run these substeps:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

steps* we try to avoid "substeps" and friends these days.

encoding.bs Outdated
</ol>

<li><p>Otherwise, return <a>a promise rejected with</a> a {{TypeError}}.
</ol>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like you ate a newline here.

encoding.bs Outdated

<p>Typically this will be used via the {{ReadableStream/pipeThrough()}} method on a {{ReadableStream}} source.

<pre class=example id=example-textencode-writable>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<pre also needs two spaces in front of it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this should probably use <code> too, just like the other examples.

@hsivonen
Copy link
Member

Since it's a common enough request, would it be possible to include an example in the spec itself showing how to use streams to encode into an existing ArrayBuffer?

I'd like to reiterate the concerns about interoperability hazards associated with existing ArrayBuffer that I mentioned in the issue dedicated to that request.

I think we should indeed make split surrogates work.

While JS strings need to be able to exist in invalid-UTF-16 states when a JS program constructs a valid UTF-16 string, making unpaired surrogate preservation in crossing the WebIDL boundary into an explicit feature is a novel thing for the Web Platform and one that makes me uneasy.

So far, we have USVString, which explicitly replaces unpaired surrogates and various DOMString APIs that in UTF-16-based Web engines pass unpaired surrogates through, but that's mostly incidental and Web developers don't have good reasons to pass unpaired surrogates through, so not preserving unpaired surrogates appears to be feasible in terms of Web compat.

In Gecko, we were able to make CSSOM effectively change from DOMString to USVString without breaking the Web. If I've read the Servo source comments correctly, Servo panics if a DOMString contains unpaired surrogates and the absence of such crash reports suggests that the Web doesn't actually depend on unpaired surrogates crossing the WebIDL boundary (to the extent Servo has enough DOM features to be exercised with real Web content).

While of all the Web Platform APIs this might be the one where letting unpaired surrogates cross the WebIDL boundary might make sense as an explicit feature, I think we should consider a bit more carefully if that's necessary and if it's feasible to require textual streams not to split surrogate pairs across chunk boundaries.

@ricea
Copy link
Collaborator Author

ricea commented Jan 23, 2018

@hsivonen Consider the following transform stream:

const rechunker = new TransformStream({
  transform(chunk, controller) {
    for (let start = 0; start < chunk.length; start += 1024) {
      const end = Math.min(start + 1024, chunk.length);
      controller.enqueue(chunk.substring(start, end));
    }
  }
});

This takes a stream of strings and splits up any chunk that is longer than 1024 code units.

Here are the questions I ask myself:

  1. Is this a useful and sensible thing to want to do? I believe so. If the stream you're processing may have arbitrarily large chunks, you're forwarding it to some expensive operation, and you want to avoid jank, I think the correct approach is to include something like this in your pipe.
  2. It is reasonable for a developer to expect this to work and not corrupt the input? I think so.
    1. Corruption wouldn't even happen unless you streamed it to TextEncoder, and that could be several steps away down the pipe.
    2. It just applies the Javascript built-in substring() function, and nothing on
      https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substring says anything about data corruption.
    3. It will be a long time before any problems are noticed.
  3. Is this implementation correct? Maybe? Maybe not? I don't know.

@hsivonen
Copy link
Member

I'm not fully familiar with Streams. Where do the string chunks in the rechunker example cross the WebIDL boundary (if anywhere)? The controller seems to be expressed in ECMA spec style and not in WebIDL style. Is there really no WebIDL boundary here and the prose sentence "Let input be the result of converting chunk to a DOMString." defines the boundary that is usually expressed in WebIDL?

The "convert code unit to scalar value" algorithm would not handle stray
lead surrogates correctly. They would erroneously eat the next code
unit. Fix the algorithm to replace the lead surrogate and leave the next
code unit alone.

Fix up the imports now that the correct things are exported from the
Streams Standard.

Consistently use ", then" in if statements.

Don't use "substeps". Just "steps". It's cleaner.

Use the formulation "transform.[[slot]]" instead of "transform's
[[slot]] internal slot". Also stop linking "internal slot" to the
Javascript standard.

Write "Type(chunk)" instead of "the type of chunk".

Don't link to the Promises Guide as it is viewed as problematic.
Copy link
Collaborator Author

@ricea ricea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed everything from @annevk's comments. I didn't respond "done" to each one individually as it might be a bit spammy.

I considered combining the "convert code unit to scalar value" algorithm with "shared UTF-16 decoder" but it would significantly complicate the latter for little benefit.

encoding.bs Outdated
slot</a>.

<li><p>If <a abstract-op>IsReadableStreamLocked</a>(<var>readable</var>) is true, throw a
{{TypeError}} exception.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"throw" is hyperlinked elsewhere in this standard. It would be nice to be consistent.

@ricea
Copy link
Collaborator Author

ricea commented Jan 23, 2018

I'm not fully familiar with Streams. Where do the string chunks in the rechunker example cross the WebIDL boundary (if anywhere)?

Sorry for not explaining clearly. The chunks don't cross the WebIDL boundary within the rechunker example. It's only if you then pipe the stream to TextEncoder or some other stream defined in a WebIDL-based standard that the chunks will cross the WebIDL boundary.

The controller seems to be expressed in ECMA spec style and not in WebIDL style. Is there really no WebIDL boundary here and the prose sentence "Let input be the result of converting chunk to a DOMString." defines the boundary that is usually expressed in WebIDL?

I think that is correct. My intent is for "converting chunk to a DOMString" to imply the same processing that would normally happen upon calling a WebIDL method. Is this something that needs to be made more explicit? The same applies to the first step of the "decode and enqueue a chunk" algorithm.

@annevk
Copy link
Member

annevk commented Jan 23, 2018

@ricea if transform streams gain a "bring your own buffer" capability does that need explicit changes to this API or would we run into #69 (comment) as a side effect of that capability being added? In case it's the latter, we'll need to address @hsivonen's concerns somehow now, I think.

@ricea
Copy link
Collaborator Author

ricea commented Jan 23, 2018

@annevk It won't be an automatic upgrade. There will be a different Create operation to call, similar to CreateReadableByteStream. While this is a bit regrettable, it would be bad if a change in the Streams Standard suddenly exposed internals of Encoding and other standards.

Copy link
Member

@annevk annevk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have many nits that I'm also willing to help address once you consider things done.

Reading through #72 again it seems that with @jakearchibald you agreed on mode locking, but I cannot find that in the current draft. Did I miss a comment saying why we went back on that?

It would also be nice to add at least one example.

type:interface; text:ReadableStream
type:dfn; text:chunk
type:dfn; text:readable stream
type:dfn; text:writable stream
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these needed? Do you get some conflict that cannot be resolved?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"ReadableStream" is also exported by Fetch, and so needs to be disambiguated. I have modified the Streams Standard to export "chunk", "readable stream" and "writable stream" in whatwg/streams#871, so those won't be needed once Bikeshed has picked up the change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we make Fetch stop doing that then? It doesn't really seem like Fetch's way of managing it handles anything.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yutakahirano Who is using the ReadableStream definition that is exported from Fetch?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find none, but serviceworker uses related ones ("construct a ReadableStream object", "empty ReadableStream"). See also: whatwg/webidl#445

encoding.bs Outdated
<dfn for=TextDecoder>do not flush flag</dfn> (initially unset).
<dfn for=TextDecoder>error mode</dfn> (initially "<code>replacement</code>"),
<dfn for=TextDecoder>do not flush flag</dfn> (initially unset), and
<dfn for=TextDecoder>transform</dfn> (a {{TransformStream}} object).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems best to not say anything here between parenthesis as that's only used for initial values currently (which is inconsistent with other specifications, I realize). We could do a follow-up where we state the type for all of these and then put the initial value elsewhere.

encoding.bs Outdated
@@ -1135,6 +1150,31 @@ control.
<dt><code><var>decoder</var> . <a attribute for=TextDecoder>ignoreBOM</a></code>
<dd><p>Returns true if <a for=TextDecoder>ignore BOM flag</a> is set, and false otherwise.

<dt><code><var>decoder</var> . <a attribute for=TextDecoder>readable</a></code>
<dd>
<p>Returns a <a>readable stream</a> whose <a>chunks</a> are strings resulting from running <a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't wrap inline/phrasing elements in any manner.

encoding.bs Outdated

<dt><code><var>decoder</var> . <a attribute for=TextDecoder>writable</a></code>
<dd>
<p>Returns a <a>writable stream</a> which accepts {{BufferSource}} chunks and runs them through <a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

encoding.bs Outdated
.pipeTo(textWritable);</code></pre>

<p>If the <a for=TextDecoder>error mode</a> is "<code>fatal</code>" and
<a for=TextDecoder>encoding</a>'s <a for=/>decoder</a> returns <a>error</a>, both {{TextDecoder/readable}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This exceeds 100 columns.

encoding.bs Outdated
{{TextEncoder}} <var>enc</var>, runs these steps:

<ol>
<li><p>If <var>enc</var>'s <a>pending high surrogate</a> is set, run these steps:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<li> with multiple children again.

@@ -1241,6 +1314,88 @@ method, when invoked, must run these steps:
</ol>
</ol>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having an <hr> here would be good I think, to separate it a bit from the member definitions.

encoding.bs Outdated

<p class=note>This is equivalent to the <a spec=webidl>convert a DOMString to a sequence of Unicode
scalar values</a> algorithm from [[WEBIDL]], but allows for surrogate pairs that are split between
strings.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking, it would be better to reference https://infra.spec.whatwg.org/#javascript-string-convert here since eventually we'll update IDL to use that too and then this would need to be updated.

encoding.bs Outdated
<ol>
<li><p>Let <var>high surrogate</var> be <var>enc</var>'s <a>pending high surrogate</a>.

<li><p>Unset <var>enc</var>'s <a>pending high surrogate</a>.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be slightly clearer if we instead made pending high surrogate's type "null or a code unit" instead of the current mixture of flag and code unit. So check for non-null and set it to null when done.

@@ -1314,6 +1524,113 @@ must run these steps:
</ol>
</ol>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An <hr> would be good here too.

@hsivonen
Copy link
Member

I think that is correct. My intent is for "converting chunk to a DOMString" to imply the same processing that would normally happen upon calling a WebIDL method. Is this something that needs to be made more explicit?

Please add a very conspicuous note explaining that unlike in cases where the JS string to DOMString boundary passes through data that doesn't conform to Unicode as a performance optimization / legacy artifact of failure to enforce Unicode conformance, in this case letting code unit sequences that do not conform to Unicode travel across the boundary is an explicit feature with the intent of allowing the sequence of code units to become conforming once concatenated on the other side of the JS string / DOMString boundary.

ricea added 2 commits January 25, 2018 23:59
Do not wrap inside tags. Do not add lines that are longer than 100
characters.
Add <hr>s before the streaming-related algorithms to set them apart from
the method algorithms. Remove some blank lines that shouldn't be there.
ricea added 5 commits January 26, 2018 19:13
It was noted as being equivalent to the algorithm in WebIDL. Instead
note that it is equivalent to the "convert a JavaScript string into a
scalar value string" algorithm from INFRA.

Also add a trailing full-stop.
TextEncoder's "pending high surrogate" was either a code unit or
unset. Make it "null or a code unit" instead.
@ricea
Copy link
Collaborator Author

ricea commented Jan 26, 2018

@hsivonen wrote:

Please add a very conspicuous note explaining that unlike in cases where the JS string to DOMString boundary passes through data that doesn't conform to Unicode as a performance optimization / legacy artifact of failure to enforce Unicode conformance, in this case letting code unit sequences that do not conform to Unicode travel across the boundary is an explicit feature with the intent of allowing the sequence of code units to become conforming once concatenated on the other side of the JS string / DOMString boundary.

I'm concerned that this may contradict the advice from the WebIDL standard, namely:

Specifications should only use USVString for APIs that perform text processing and need a string of Unicode scalar values to operate on. Most APIs that use strings should instead be using DOMString, which does not make any interpretations of the code units in the string. When in doubt, use DOMString.

I'd prefer to have some clarity about whether there is a contradiction here and what the resolution should be before adding an editorial note.

@ricea
Copy link
Collaborator Author

ricea commented Jan 26, 2018

@annevk I had completely forgotten what I agreed to in #72. Thanks for catching that. There's quite a few extra tests needed so it won't be ready until next week.

@ricea
Copy link
Collaborator Author

ricea commented Jan 29, 2018

I've started a side-discussion about how to best implement mode-locking at whatwg/streams#873. The question is basically whether to add some complexity to the Streams Standard to reduce some complexity in the Encoding Standard.

@ricea
Copy link
Collaborator Author

ricea commented Feb 19, 2018

FYI: I've added tests that objects are created in the correct realms: GoogleChromeLabs/text-encode-transform-polyfill#11

ricea added 3 commits March 22, 2018 23:07
A new TextDecoder variable, "decForTransform" is introduced to the
constructor to achieve this.
Create a separate TextDecoder object "encForTransform" in the constructor, and use that to
initialise "transform".

Also clean up some wrapping problems in TextEncoder.
No point in running the encoder when the answer is always the same.

Also fix markup indentation for notes.
@ricea
Copy link
Collaborator Author

ricea commented Mar 22, 2018

I've attempted to implement the "Constructor version" of the API discussed #72 (comment). It doesn't seem too bad.

I have some tests in https://github.com/GoogleChromeLabs/text-encode-transform-prollyfill/blob/f4de56ca18b98e55a79e7f2c9cb7658097b2aaf4/tests/non-interference.html. From the perspective of a consumer of the API, the behaviour looks pleasantly unsurprising to me.

This is not a final decision to go with the "Constructor version" of the API. I could switch to the "Static method version" depending on how people think this looks.

encoding.bs Outdated
<p class=note>{{DOMString}} is used here so that a surrogate pair that is split between chunks can
be correctly reassembled into the appropriate code point in the output. The behaviour is otherwise
identical to {{USVString}}. In particular, replacement characters will be used where necessary to
make the output valid UTF-8.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if we stated that lone surrogates will be replaced with U+FFFD.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. How is the updated version? I'm not sure whether "valid UTF-8" or "well-formed UTF-8" is better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'd just say "is UTF-8", but for a note it seems fine. Although having said that, it's a little weird that it both talks about code points in the output and says the output is UTF-8 (which is bytes).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of options I can think of

  1. Stop mentioning UTF-8 in this note altogether.
  2. Say "the appropriate character in the output" instead of "code point". Although talking about "characters" in relation to Unicode can be ambiguous.

Other options?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

{{DOMString}} is used here so that a surrogate pair that is split between chunks can be reassembled into the appropriate scalar value. The behaviour is otherwise identical to {{USVString}}. In particular, lone surrogates will be replaced with U+FFFD.

@ricea
Copy link
Collaborator Author

ricea commented Jul 18, 2018

Closing in favour of #149.

@ricea ricea closed this Jul 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants