Make TextEncoder and TextDecoder be transform streams #127

ricea · 2017-12-11T09:03:58Z

Add readable and writable attributes to TextEncoder and TextDecoder
objects to permit them to be used to transform ReadableStreams via the
pipeThrough() method.

This integrates the Encoding Standard with the Streams Standard. See
https://streams.spec.whatwg.org/#ts-model for the definition of a
"transform stream" and https://streams.spec.whatwg.org/#rs-pipe-through
for an explanation of the pipeThrough() method.

A TextEncoder object can be used to transform a stream of strings to a
stream of bytes in UTF-8 encoding. A TextDecoder object can be used to
transform a stream of bytes in the encoding passed to the constructor to
strings.

The implementation delegates to a TransformStream object internally,
which provides the glue logic that ties the readable and writable
together.

There is a human-readable version of these changes at
http://htmlpreview.github.io/?https://github.com/ricea/encoding-streams/blob/master/patch.html

There is a prollyfill and tests for the new functionality at
https://github.com/GoogleChromeLabs/text-encode-transform-prollyfill

Closes #72.

Preview | Diff

Add `readable` and `writable` attributes to TextEncoder and TextDecoder objects to permit them to be used to transform ReadableStreams via the `pipeThrough()` method. This integrates the Encoding Standard with the Streams Standard. See https://streams.spec.whatwg.org/#ts-model for the definition of a "transform stream" and https://streams.spec.whatwg.org/#rs-pipe-through for an explanation of the `pipeThrough()` method. A TextEncoder object can be used to transform a stream of strings to a stream of bytes in UTF-8 encoding. A TextDecoder object can be used to transform a stream of bytes in the encoding passed to the constructor to strings. The implementation delegates to a TransformStream object internally, which provides the glue logic that ties the `readable` and `writable` together. There is a human-readable version of these changes at http://htmlpreview.github.io/?https://github.com/ricea/encoding-streams/blob/master/patch.html There is a prollyfill and tests for the new functionality at https://github.com/GoogleChromeLabs/text-encode-transform-prollyfill

inexorabletash · 2017-12-11T18:23:06Z

Yay! Comments in random order:

Since it's a common enough request, would it be possible to include an example in the spec itself showing how to use streams to encode into an existing ArrayBuffer?

Can the tests be upgraded to use async/await ?

In the tests, stream-properties.html should probably just be an idlharness test. (I don't know how well that plays with p(r)ollyfills, though.)

What's the behavior for split surrogate pairs when encoding? We basically pretend that's not a problem in the non-streaming API, by making the input is a USVString and thus not needing a {stream} flag. In other words, we assume that the input is always a complete text, not just a randomly partitioned sequence of code units. In this patch, the text reads Let input be the result of converting chunk to a USVString.... which sounds like it might introduce problems when chunk sizes cross surrogate boundaries. This worries me. Also, needs test cases.

ricea · 2017-12-12T09:26:50Z

Since it's a common enough request, would it be possible to include an example in the spec itself showing how to use streams to encode into an existing ArrayBuffer?

Sadly, bring-your-own-buffer style usage isn't supported by TransformStream yet. There's an open question about what to do when there isn't enough space in the supplied ArrayBuffer to fit the transformed input.

Since this is basically an optimisation feature, I don't plan to support it in the first release. I would rather take my time to get it right.

Can the tests be upgraded to use async/await ?

Certainly. I'm never quite sure what version of ES to target in tests. For the Streams Standard tests I've been avoiding introducing async/await, but I'm happy to use them here.

In the tests, stream-properties.html should probably just be an idlharness test. (I don't know how well that plays with p(r)ollyfills, though.)

I haven't tried it. I will give it a go.

What's the behavior for split surrogate pairs when encoding? We basically pretend that's not a problem in the non-streaming API, by making the input is a USVString and thus not needing a {stream} flag. In other words, we assume that the input is always a complete text, not just a randomly partitioned sequence of code units. In this patch, the text reads Let input be the result of converting chunk to a USVString.... which sounds like it might introduce problems when chunk sizes cross surrogate boundaries. This worries me. Also, needs test cases.

Yes, it's broken. I've been planning to write a test for this at some point. What I expect to see is if you encode "😻" in a single chunk, it works as intended. If you split it into two chunks, you get two chunks as output each containing U+FFFD encoded as UTF-8.

If there is consensus that this should be fixed, I think it can be done by converting to DOMString rather than USVString and special-casing the final code unit when it is the first half of a surrogate pair.

annevk · 2017-12-13T11:20:20Z

I think a problem here is that if a script running in a document invokes the constructor of either TextDecoder or TextEncoder in an <iframe> element the tear-off transform-related objects will have the associated global of the document (the current global/realm) and not the <iframe>'s document.

See also https://www.w3.org/Bugs/Public/show_bug.cgi?id=24652 and https://w3c-test.org/WebIDL/current-realm.html.

Note that if you change it to accept a DOMString you also need to do conversions from code unit to scalar value (you probably meant that; stating it here so I don't forget to verify it).

(Tests can use JavaScript features that ship in the latest release of all browsers. At least, that's my rule of thumb.)

domenic · 2017-12-13T15:53:16Z

I think we should indeed make split surrogates work. It seems like a big value proposition for having this built-in and shared, that people don't have to figure that out by themselves.

I think a problem here is that if a script running in a document invokes the constructor of either TextDecoder or TextEncoder in an <iframe> element the tear-off transform-related objects will have the associated global of the document (the current global/realm) and not the <iframe>'s document.

I don't think that is true. You're talking about new childIframe.contentWindow.TextEncoder(), right? The current realm inside the TextEncoder constructor is that of the constructor itself, i.e. of the iframe, so CreateTransformStream will still be in that realm, and the TransformStream objects will be from the same realm as the TextEncoder object constructed.

annevk · 2017-12-13T16:46:17Z

Ah, great, thanks for confirming. Let's make sure to test that as part of the test I referenced.

ricea · 2017-12-14T12:58:56Z

@inexorabletash GoogleChromeLabs/text-encode-transform-polyfill#6 updates the tests to use async / await (much more readable) and adds idlharness.html. The prollyfill fails a couple of the idlharness tests because it (intentionally) doesn't do brand checks. Apart from that, everything looks good.

ricea · 2017-12-14T13:05:39Z

I have been writing tests for surrogate pairs split between chunks and in the process realised there's a problem.

We're now explicitly taking a sequence of code units and normalising it to Unicode. Which means we're doing error handling. Which means the developer should probably be able to specify what type of error handling should be done. Which means adding a fatal option to the constructor. Which means that encode() should logically also support it. Which means that encode() also needs to be changed to take a DOMString. Which is a much more disruptive change than this one.

My preference is to land this first, and then worry about providing control over error behaviour and changing encode() later, since that is a much riskier change.

annevk · 2017-12-14T13:09:58Z

Which means we're doing error handling.

Well, but not any different than IDL always did, right? We're just changing the layering a bit. I'm not necessarily convinced String -> USVString conversion needs to be subject to fatal the same way that bytes <> scalar values needs to be. Note that the reason we have it in the latter case is primarily XML, which is the only format I know of that demands such handling. We exposed it to the API as a nicety mostly.

ricea · 2017-12-14T13:10:59Z

Well, but not any different than IDL always did, right? We're just changing the layering a bit. I'm not necessarily convinced String -> USVString conversion needs to be subject to fatal the same way that bytes <> scalar values needs to be. Note that the reason we have it in the latter case is primarily XML, which is the only format I know of that demands such handling. We exposed it to the API as a nicety mostly.

That works for me too.

ricea · 2017-12-19T14:10:50Z

I have written tests for split surrogates but I haven't written the standard language yet.

I am working on tests for ensuring that objects are all created in the correct realms. This is taking a while.

I probably won't have the standard language updated for split surrogates until the new year.

annevk · 2017-12-19T15:17:50Z

@vyv03354 are you willing to review this as well? Thanks!

Previously if the high surrogate and low surrogate were split between two chunks, they would be each replaced with the replacement character. Now the correct character will correctly appear in the output. Also mark "readable" and "writable" properties [SameObject].

ricea · 2018-01-22T15:21:43Z

I have updated the standard language for split surrogates. Sorry for the delay. I tried a few approaches and adding the "convert code unit to scalar value" algorithm seemed to work best. PTAL.

annevk

Mostly style nits (note that I've only reported the first instance).

I was actually hoping IDL's algorithm could become https://infra.spec.whatwg.org/#javascript-string-convert, but I couldn't really find a way to reuse that here so we might be stuck with some UTF-16 logic in multiple places forever.

annevk · 2018-01-22T15:53:35Z

encoding.bs

+    type:dfn; text:writable stream; url:#writable-stream
+    type:abstract-op; text:CreateTransformStream; url: #create-transform-stream
+    type:abstract-op; text:TransformStreamDefaultControllerEnqueue; url: #transform-stream-default-controller-enqueue
+    type:interface; text:TransformStream; url: #ts-class


What's not exported yet in streams should be exported so there's no need to list it here.

annevk · 2018-01-22T15:53:48Z

encoding.bs

+    text: type; url: #sec-ecmascript-data-types-and-values; type: dfn
+    text: IsDetachedBuffer; url: #sec-isdetachedbuffer; type: abstract-op
+    text: IsSharedArrayBuffer; url: #sec-issharedarraybuffer; type: abstract-op
+    text: internal slot; url: #sec-object-internal-methods-and-internal-slots; type: dfn


We generally don't cross-reference "internal slot".

annevk · 2018-01-22T15:54:58Z

encoding.bs

@@ -1186,10 +1246,31 @@ if <a for=TextDecoder>error mode</a> is "<code>fatal</code>", and false otherwis
 <p>The <dfn attribute for=TextDecoder><code>ignoreBOM</code></dfn> attribute's getter must return
 true if <a for=TextDecoder>ignore BOM flag</a> is set, and false otherwise.

+<p>The <dfn attribute for=TextDecoder><code>readable</code></dfn> attribute's getter must return the
+contents of <a for=TextDecoder>transform</a>'s \[[readable]] <a>internal slot</a>.


I think rather than "the contents of x" we usually say "x's value". @domenic?

Just "return transform.[[readable]]" usually?

Ah yeah, indeed.

annevk · 2018-01-22T15:55:17Z

encoding.bs

+ slot</a>.
+
+ <li><p>If <a abstract-op>IsReadableStreamLocked</a>(<var>readable</var>) is true, throw a
+ {{TypeError}} exception.


then throw* (xref throw?)

No need to xref throw I think; we haven't elsewhere

"throw" is hyperlinked elsewhere in this standard. It would be nice to be consistent.

annevk · 2018-01-22T15:55:43Z

encoding.bs

+{{TextDecoder}} <var>dec</var> and a <var>chunk</var>, runs these steps:
+
+<ol>
+ <li><p>If the <a>type</a> of of <var>chunk</var> is not Object, or <var>chunk</var> does not have


Shouldn't this be noted as Type(chunk)?

annevk · 2018-01-22T15:57:01Z

encoding.bs

+ <li><p>If the <a>type</a> of of <var>chunk</var> is not Object, or <var>chunk</var> does not have
+ an \[[ArrayBufferData]] <a>internal slot</a>, or <a
+ abstract-op>IsDetachedBuffer</a>(<var>chunk</var>) is true, or <a
+ abstract-op>IsSharedArrayBuffer</a>(<var>chunk</var>) is true then return <a>a promise rejected


Comma before then. Also I'm not sure I want to reference the promises guide as it really needs to be folded into IDL and has a bunch of things that confuse folks. Can we just say "a new promise rejected with a TypeError exception"?

annevk · 2018-01-22T15:57:13Z

encoding.bs

+ abstract-op>IsSharedArrayBuffer</a>(<var>chunk</var>) is true then return <a>a promise rejected
+ with</a> a {{TypeError}}.
+
+ <li><p>If the <a for=TextDecoder>do not flush flag</a> for <var>dec</var> is unset, set


annevk · 2018-01-22T15:58:16Z

encoding.bs

+ <li><p>Let <var>output</var> be a new <a for=/>stream</a>.
+
+ <li>
+  <p>While true, run these substeps:


steps* we try to avoid "substeps" and friends these days.

annevk · 2018-01-22T15:58:59Z

encoding.bs

+ </ol>
+
+ <li><p>Otherwise, return <a>a promise rejected with</a> a {{TypeError}}.
+</ol>


It seems like you ate a newline here.

annevk · 2018-01-22T15:59:35Z

encoding.bs

+
+  <p>Typically this will be used via the {{ReadableStream/pipeThrough()}} method on a {{ReadableStream}} source.
+
+<pre class=example id=example-textencode-writable>


<pre also needs two spaces in front of it.

Also, this should probably use <code> too, just like the other examples.

hsivonen · 2018-01-22T20:32:54Z

Since it's a common enough request, would it be possible to include an example in the spec itself showing how to use streams to encode into an existing ArrayBuffer?

I'd like to reiterate the concerns about interoperability hazards associated with existing ArrayBuffer that I mentioned in the issue dedicated to that request.

I think we should indeed make split surrogates work.

While JS strings need to be able to exist in invalid-UTF-16 states when a JS program constructs a valid UTF-16 string, making unpaired surrogate preservation in crossing the WebIDL boundary into an explicit feature is a novel thing for the Web Platform and one that makes me uneasy.

So far, we have USVString, which explicitly replaces unpaired surrogates and various DOMString APIs that in UTF-16-based Web engines pass unpaired surrogates through, but that's mostly incidental and Web developers don't have good reasons to pass unpaired surrogates through, so not preserving unpaired surrogates appears to be feasible in terms of Web compat.

In Gecko, we were able to make CSSOM effectively change from DOMString to USVString without breaking the Web. If I've read the Servo source comments correctly, Servo panics if a DOMString contains unpaired surrogates and the absence of such crash reports suggests that the Web doesn't actually depend on unpaired surrogates crossing the WebIDL boundary (to the extent Servo has enough DOM features to be exercised with real Web content).

While of all the Web Platform APIs this might be the one where letting unpaired surrogates cross the WebIDL boundary might make sense as an explicit feature, I think we should consider a bit more carefully if that's necessary and if it's feasible to require textual streams not to split surrogate pairs across chunk boundaries.

ricea · 2018-01-23T07:01:03Z

@hsivonen Consider the following transform stream:

const rechunker = new TransformStream({
  transform(chunk, controller) {
    for (let start = 0; start < chunk.length; start += 1024) {
      const end = Math.min(start + 1024, chunk.length);
      controller.enqueue(chunk.substring(start, end));
    }
  }
});

This takes a stream of strings and splits up any chunk that is longer than 1024 code units.

Here are the questions I ask myself:

Is this a useful and sensible thing to want to do? I believe so. If the stream you're processing may have arbitrarily large chunks, you're forwarding it to some expensive operation, and you want to avoid jank, I think the correct approach is to include something like this in your pipe.
It is reasonable for a developer to expect this to work and not corrupt the input? I think so.
1. Corruption wouldn't even happen unless you streamed it to TextEncoder, and that could be several steps away down the pipe.
2. It just applies the Javascript built-in substring() function, and nothing on
  https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substring says anything about data corruption.
3. It will be a long time before any problems are noticed.
Is this implementation correct? Maybe? Maybe not? I don't know.

hsivonen · 2018-01-23T12:43:25Z

I'm not fully familiar with Streams. Where do the string chunks in the rechunker example cross the WebIDL boundary (if anywhere)? The controller seems to be expressed in ECMA spec style and not in WebIDL style. Is there really no WebIDL boundary here and the prose sentence "Let input be the result of converting chunk to a DOMString." defines the boundary that is usually expressed in WebIDL?

The "convert code unit to scalar value" algorithm would not handle stray lead surrogates correctly. They would erroneously eat the next code unit. Fix the algorithm to replace the lead surrogate and leave the next code unit alone. Fix up the imports now that the correct things are exported from the Streams Standard. Consistently use ", then" in if statements. Don't use "substeps". Just "steps". It's cleaner. Use the formulation "transform.[[slot]]" instead of "transform's [[slot]] internal slot". Also stop linking "internal slot" to the Javascript standard. Write "Type(chunk)" instead of "the type of chunk". Don't link to the Promises Guide as it is viewed as problematic.

ricea

Fixed everything from @annevk's comments. I didn't respond "done" to each one individually as it might be a bit spammy.

I considered combining the "convert code unit to scalar value" algorithm with "shared UTF-16 decoder" but it would significantly complicate the latter for little benefit.

ricea · 2018-01-23T12:10:13Z

encoding.bs

+ slot</a>.
+
+ <li><p>If <a abstract-op>IsReadableStreamLocked</a>(<var>readable</var>) is true, throw a
+ {{TypeError}} exception.


"throw" is hyperlinked elsewhere in this standard. It would be nice to be consistent.

ricea · 2018-01-23T15:06:35Z

I'm not fully familiar with Streams. Where do the string chunks in the rechunker example cross the WebIDL boundary (if anywhere)?

Sorry for not explaining clearly. The chunks don't cross the WebIDL boundary within the rechunker example. It's only if you then pipe the stream to TextEncoder or some other stream defined in a WebIDL-based standard that the chunks will cross the WebIDL boundary.

The controller seems to be expressed in ECMA spec style and not in WebIDL style. Is there really no WebIDL boundary here and the prose sentence "Let input be the result of converting chunk to a DOMString." defines the boundary that is usually expressed in WebIDL?

I think that is correct. My intent is for "converting chunk to a DOMString" to imply the same processing that would normally happen upon calling a WebIDL method. Is this something that needs to be made more explicit? The same applies to the first step of the "decode and enqueue a chunk" algorithm.

annevk · 2018-01-23T15:14:13Z

@ricea if transform streams gain a "bring your own buffer" capability does that need explicit changes to this API or would we run into #69 (comment) as a side effect of that capability being added? In case it's the latter, we'll need to address @hsivonen's concerns somehow now, I think.

ricea · 2018-01-23T15:19:19Z

@annevk It won't be an automatic upgrade. There will be a different Create operation to call, similar to CreateReadableByteStream. While this is a bit regrettable, it would be bad if a change in the Streams Standard suddenly exposed internals of Encoding and other standards.

annevk

I have many nits that I'm also willing to help address once you consider things done.

Reading through #72 again it seems that with @jakearchibald you agreed on mode locking, but I cannot find that in the current draft. Did I miss a comment saying why we went back on that?

It would also be nice to add at least one example.

annevk · 2018-01-24T09:00:31Z

encoding.bs

+    type:interface; text:ReadableStream
+    type:dfn; text:chunk
+    type:dfn; text:readable stream
+    type:dfn; text:writable stream


Why are these needed? Do you get some conflict that cannot be resolved?

"ReadableStream" is also exported by Fetch, and so needs to be disambiguated. I have modified the Streams Standard to export "chunk", "readable stream" and "writable stream" in whatwg/streams#871, so those won't be needed once Bikeshed has picked up the change.

Shall we make Fetch stop doing that then? It doesn't really seem like Fetch's way of managing it handles anything.

@yutakahirano Who is using the ReadableStream definition that is exported from Fetch?

I find none, but serviceworker uses related ones ("construct a ReadableStream object", "empty ReadableStream"). See also: whatwg/webidl#445

annevk · 2018-01-24T09:02:10Z

encoding.bs

-<dfn for=TextDecoder>do not flush flag</dfn> (initially unset).
+<dfn for=TextDecoder>error mode</dfn> (initially "<code>replacement</code>"),
+<dfn for=TextDecoder>do not flush flag</dfn> (initially unset), and
+<dfn for=TextDecoder>transform</dfn> (a {{TransformStream}} object).


It seems best to not say anything here between parenthesis as that's only used for initial values currently (which is inconsistent with other specifications, I realize). We could do a follow-up where we state the type for all of these and then put the initial value elsewhere.

annevk · 2018-01-24T09:02:45Z

encoding.bs

@@ -1135,6 +1150,31 @@ control.
 <dt><code><var>decoder</var> . <a attribute for=TextDecoder>ignoreBOM</a></code>
 <dd><p>Returns true if <a for=TextDecoder>ignore BOM flag</a> is set, and false otherwise.

+ <dt><code><var>decoder</var> . <a attribute for=TextDecoder>readable</a></code>
+ <dd>
+  <p>Returns a <a>readable stream</a> whose <a>chunks</a> are strings resulting from running <a


Please don't wrap inline/phrasing elements in any manner.

annevk · 2018-01-24T09:03:03Z

encoding.bs

+
+ <dt><code><var>decoder</var> . <a attribute for=TextDecoder>writable</a></code>
+ <dd>
+  <p>Returns a <a>writable stream</a> which accepts {{BufferSource}} chunks and runs them through <a


annevk · 2018-01-24T09:03:34Z

encoding.bs

+  .pipeTo(textWritable);</code></pre>
+
+  <p>If the <a for=TextDecoder>error mode</a> is "<code>fatal</code>" and
+  <a for=TextDecoder>encoding</a>'s <a for=/>decoder</a> returns <a>error</a>, both {{TextDecoder/readable}}


This exceeds 100 columns.

annevk · 2018-01-24T09:14:20Z

encoding.bs

+{{TextEncoder}} <var>enc</var>, runs these steps:
+
+<ol>
+ <li><p>If <var>enc</var>'s <a>pending high surrogate</a> is set, run these steps:


<li> with multiple children again.

annevk · 2018-01-24T09:18:50Z

encoding.bs

@@ -1241,6 +1314,88 @@ method, when invoked, must run these steps:
  </ol>
 </ol>



Having an <hr> here would be good I think, to separate it a bit from the member definitions.

annevk · 2018-01-24T09:22:48Z

encoding.bs

+
+<p class=note>This is equivalent to the <a spec=webidl>convert a DOMString to a sequence of Unicode
+scalar values</a> algorithm from [[WEBIDL]], but allows for surrogate pairs that are split between
+strings.


I was thinking, it would be better to reference https://infra.spec.whatwg.org/#javascript-string-convert here since eventually we'll update IDL to use that too and then this would need to be updated.

annevk · 2018-01-24T09:24:57Z

encoding.bs

+ <ol>
+  <li><p>Let <var>high surrogate</var> be <var>enc</var>'s <a>pending high surrogate</a>.
+
+  <li><p>Unset <var>enc</var>'s <a>pending high surrogate</a>.


I think it would be slightly clearer if we instead made pending high surrogate's type "null or a code unit" instead of the current mixture of flag and code unit. So check for non-null and set it to null when done.

annevk · 2018-01-24T09:26:45Z

encoding.bs

@@ -1314,6 +1524,113 @@ must run these steps:
  </ol>
 </ol>



An <hr> would be good here too.

hsivonen · 2018-01-25T09:27:27Z

I think that is correct. My intent is for "converting chunk to a DOMString" to imply the same processing that would normally happen upon calling a WebIDL method. Is this something that needs to be made more explicit?

Please add a very conspicuous note explaining that unlike in cases where the JS string to DOMString boundary passes through data that doesn't conform to Unicode as a performance optimization / legacy artifact of failure to enforce Unicode conformance, in this case letting code unit sequences that do not conform to Unicode travel across the boundary is an explicit feature with the intent of allowing the sequence of code units to become conforming once concatenated on the other side of the JS string / DOMString boundary.

Do not wrap inside tags. Do not add lines that are longer than 100 characters.

Add <hr>s before the streaming-related algorithms to set them apart from the method algorithms. Remove some blank lines that shouldn't be there.

It was noted as being equivalent to the algorithm in WebIDL. Instead note that it is equivalent to the "convert a JavaScript string into a scalar value string" algorithm from INFRA. Also add a trailing full-stop.

TextEncoder's "pending high surrogate" was either a code unit or unset. Make it "null or a code unit" instead.

ricea · 2018-01-26T11:15:55Z

@hsivonen wrote:

Please add a very conspicuous note explaining that unlike in cases where the JS string to DOMString boundary passes through data that doesn't conform to Unicode as a performance optimization / legacy artifact of failure to enforce Unicode conformance, in this case letting code unit sequences that do not conform to Unicode travel across the boundary is an explicit feature with the intent of allowing the sequence of code units to become conforming once concatenated on the other side of the JS string / DOMString boundary.

I'm concerned that this may contradict the advice from the WebIDL standard, namely:

Specifications should only use USVString for APIs that perform text processing and need a string of Unicode scalar values to operate on. Most APIs that use strings should instead be using DOMString, which does not make any interpretations of the code units in the string. When in doubt, use DOMString.

I'd prefer to have some clarity about whether there is a contradiction here and what the resolution should be before adding an editorial note.

ricea · 2018-01-26T12:45:21Z

@annevk I had completely forgotten what I agreed to in #72. Thanks for catching that. There's quite a few extra tests needed so it won't be ready until next week.

ricea · 2018-01-29T13:41:15Z

I've started a side-discussion about how to best implement mode-locking at whatwg/streams#873. The question is basically whether to add some complexity to the Streams Standard to reduce some complexity in the Encoding Standard.

ricea · 2018-02-19T14:57:39Z

FYI: I've added tests that objects are created in the correct realms: GoogleChromeLabs/text-encode-transform-polyfill#11

A new TextDecoder variable, "decForTransform" is introduced to the constructor to achieve this.

Create a separate TextDecoder object "encForTransform" in the constructor, and use that to initialise "transform". Also clean up some wrapping problems in TextEncoder.

No point in running the encoder when the answer is always the same. Also fix markup indentation for notes.

ricea · 2018-03-22T15:17:09Z

I've attempted to implement the "Constructor version" of the API discussed #72 (comment). It doesn't seem too bad.

I have some tests in https://github.com/GoogleChromeLabs/text-encode-transform-prollyfill/blob/f4de56ca18b98e55a79e7f2c9cb7658097b2aaf4/tests/non-interference.html. From the perspective of a consumer of the API, the behaviour looks pleasantly unsurprising to me.

This is not a final decision to go with the "Constructor version" of the API. I could switch to the "Static method version" depending on how people think this looks.

annevk · 2018-03-26T09:42:17Z

encoding.bs

+ <p class=note>{{DOMString}} is used here so that a surrogate pair that is split between chunks can
+ be correctly reassembled into the appropriate code point in the output. The behaviour is otherwise
+ identical to {{USVString}}. In particular, replacement characters will be used where necessary to
+ make the output valid UTF-8.


I'd prefer if we stated that lone surrogates will be replaced with U+FFFD.

Thanks. How is the updated version? I'm not sure whether "valid UTF-8" or "well-formed UTF-8" is better.

I guess I'd just say "is UTF-8", but for a note it seems fine. Although having said that, it's a little weird that it both talks about code points in the output and says the output is UTF-8 (which is bytes).

A couple of options I can think of

Stop mentioning UTF-8 in this note altogether.

Say "the appropriate character in the output" instead of "code point". Although talking about "characters" in relation to Unicode can be ambiguous.

Other options?

How about:

{{DOMString}} is used here so that a surrogate pair that is split between chunks can be reassembled into the appropriate scalar value. The behaviour is otherwise identical to {{USVString}}. In particular, lone surrogates will be replaced with U+FFFD.

ricea · 2018-07-18T09:23:50Z

Closing in favour of #149.

domenic mentioned this pull request Jan 5, 2018

Provide ability to encode content from browser/runtime in fetch API. whatwg/fetch#653

Closed

annevk reviewed Jan 22, 2018

View reviewed changes

ricea commented Jan 23, 2018

View reviewed changes

annevk reviewed Jan 24, 2018

View reviewed changes

ricea added 2 commits January 25, 2018 23:59

Fix wrapping

497545f

Do not wrap inside tags. Do not add lines that are longer than 100 characters.

Remove unnecessary blank lines and add <hr>s

9c96008

Add <hr>s before the streaming-related algorithms to set them apart from the method algorithms. Remove some blank lines that shouldn't be there.

ricea mentioned this pull request Jan 26, 2018

Export "readable stream", "writable stream" and "chunk" whatwg/streams#871

Merged

Add missing "then" and fix spelling of behavior

1aaa427

ricea added 5 commits January 26, 2018 19:13

Fix indentation of list items with multiple children

981d360

Add "then" to "If ..., run these steps" lines

99d9837

Remove parenthesis about type of "transform" slots

b498ac0

Change what the "convert code unit to sv" algo is equivalent to

f8e1249

It was noted as being equivalent to the algorithm in WebIDL. Instead note that it is equivalent to the "convert a JavaScript string into a scalar value string" algorithm from INFRA. Also add a trailing full-stop.

Change the type of "pending high surrogate"

57d5e1c

TextEncoder's "pending high surrogate" was either a code unit or unset. Make it "null or a code unit" instead.

ricea added 3 commits March 22, 2018 23:07

Change TextDecoder to use separate state for streams

02fc31d

A new TextDecoder variable, "decForTransform" is introduced to the constructor to achieve this.

Change TextDecoder to use independent state for streams

fc6f745

Create a separate TextDecoder object "encForTransform" in the constructor, and use that to initialise "transform". Also clean up some wrapping problems in TextEncoder.

Pre-encode the replacement character in encode and flush

a734143

No point in running the encoder when the answer is always the same. Also fix markup indentation for notes.

Add note about why DOMString is used for streaming encoding

64dbb75

annevk reviewed Mar 26, 2018

View reviewed changes

ricea added 2 commits March 26, 2018 18:53

Rephrase behaviour of transform on ill-formed UTF-16 in note

16ba855

Update explanation about use of DOMString

c39d7a1

ricea closed this Jul 18, 2018


		<p>Typically this will be used via the {{ReadableStream/pipeThrough()}} method on a {{ReadableStream}} source.

		<pre class=example id=example-textencode-writable>

		@@ -1241,6 +1314,88 @@ method, when invoked, must run these steps:
		</ol>
		</ol>

Make TextEncoder and TextDecoder be transform streams #127

Make TextEncoder and TextDecoder be transform streams #127

Conversation

ricea commented Dec 11, 2017 • edited by pr-preview bot Loading

inexorabletash commented Dec 11, 2017

ricea commented Dec 12, 2017

annevk commented Dec 13, 2017

domenic commented Dec 13, 2017 • edited Loading

annevk commented Dec 13, 2017

ricea commented Dec 14, 2017 • edited Loading

ricea commented Dec 14, 2017

annevk commented Dec 14, 2017

ricea commented Dec 14, 2017

ricea commented Dec 19, 2017

annevk commented Dec 19, 2017

ricea commented Jan 22, 2018

annevk left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hsivonen commented Jan 22, 2018

ricea commented Jan 23, 2018

hsivonen commented Jan 23, 2018

ricea left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ricea commented Jan 23, 2018

annevk commented Jan 23, 2018

ricea commented Jan 23, 2018

annevk left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hsivonen commented Jan 25, 2018

ricea commented Jan 26, 2018

ricea commented Jan 26, 2018

ricea commented Jan 29, 2018

ricea commented Feb 19, 2018

ricea commented Mar 22, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ricea commented Jul 18, 2018

ricea commented Dec 11, 2017 •

edited by pr-preview bot

Loading

domenic commented Dec 13, 2017 •

edited

Loading

ricea commented Dec 14, 2017 •

edited

Loading