From 18564667dc8ffd42cf7f4aa6b0ef413ee6824017 Mon Sep 17 00:00:00 2001
From: Michael Howell
Date: Wed, 24 Apr 2019 13:03:25 -0700
Subject: [PATCH 1/7] Create 0000-char-uax-31.md

---
 text/0000-char-uax-31.md | 109 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 text/0000-char-uax-31.md

diff --git a/text/0000-char-uax-31.md b/text/0000-char-uax-31.md
new file mode 100644
index 00000000000..eefccca9cbe
--- /dev/null
+++ b/text/0000-char-uax-31.md
@@ -0,0 +1,109 @@
+- Feature Name: `char_uax_31`
+- Start Date: 2019-24-04
+- RFC PR: [rust-lang/rfcs#0000](https://github.com/rust-lang/rfcs/pull/0000)
+- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000)
+
+# Summary
+[summary]: #summary
+
+Add functionc to the standard library for testing a `char` against [UAX TR31](https://unicode.org/reports/tr31/) ("Unicode Annex 31")
+`Pattern_White_Space`, `Pattern_Syntax`, `XID_Start`, `ID_Nonstart`, and `XID_Continue`.
+
+# Motivation
+[motivation]: #motivation
+
+As a systems language, Rust is heavily used for parsing.
+As a progressive, forward-thinking language that accepts anyone,
+Rust supports Unicode and makes the definitive string types UTF-8.
+At the intersection of these needs sits *UAX #31: Unicode Identifier and Pattern Syntax* ("Annex 31"),
+a standardized set of code point categories for defining computer language syntax.
+
+This is being used in production Rust code already.
+Rust's own compiler already has functions to check against Annex 31 code point categories in the lexer,
+[but not everyone who works on the compiler knows about them](https://internals.rust-lang.org/t/for-await-loops/9819/16),
+and since they're not in the standard library,
+not everyone who works on Rust-related tooling has access to them.
+I'm not asserting that putting these in libstd would've avoided that bug,
+but if it was in the standard library,
+it would resolve the questions about whether third-party tooling can be expected to support the full range of Unicode whitespace.
+
+[Other languages](https://rosettacode.org/wiki/Unicode_variable_names#C) also follow Annex 31, such as C# and Elixir.
+Other common grammars, even ones that aren't actually for programming languages, can also be found or defined in Annex 31,
+such as hashtags and XML.
+
+It's also pretty clear what the "right" API is for this,
+since `is_whitespace` and `is_ascii_whitespace` already set the precedent here,
+so there's little need to experiment with API design.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+In addition to functions for checking "ASCII white space" and "Unicode white space,"
+some languages, such as Rust and C#, use Unicode Annex 31 to define their syntax.
+These functions are also exposed as methods on the `char` type.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## `fn char::is_xid_start(self) -> bool`
+
+Check if `self` is a member of Unicode Annex 31's `XID_Start` code point category.
+
+## `fn char::is_xid_continue(self) -> bool`
+
+Check if `self` is a member of Unicode Annex 31's `XID_Continue` code point category.
+
+## `fn char::is_id_nonstart(self) -> bool`
+
+Check if `self` is a member of Unicode Annex 31's `ID_Nonstart` code point category.
+
+## `fn char::is_pattern_syntax(self) -> bool`
+
+Check if `self` is a member of Unicode Annex 31's `Pattern_Syntax` code point category.
+
+## `fn char::is_pattern_white_space(self) -> bool`
+
+Check if `self` is a member of Unicode Annex 31's `Pattern_White_Space` code point category.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+The big problem, that has always made designing the text APIs hard,
+is that it's not clear how much of Unicode we want to include in libstd.
+The standard library certainly doesn't want a hashtag parser, even though Annex 31 describes one in section 6,
+and libstd certainly doesn't want a character shaping algorithm,
+even though Unicode places plenty of requirements on that process, too.
+
+The other problem is that a lot of languages aren't defined in terms of Annex 31 anyway,
+like Swift and HTML, which simply spell out the set of allowed code points themselves,
+so this isn't necessarily useful to allw of the language implementers.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+The design was chosen to line up with how character classification is already being done (like `is_whitespace`).
+The alternative, of providing a more generic classification API,
+seems to have enough room for debate that it would be better served in crates that provide purpose-built frameworks.
+In particular, proposal is made for the benefit of parsers, not text layout engines.
+Those will still need to use things like `rust-unic`.
+
+# Prior art
+[prior-art]: #prior-art
+
+There's already a crate that mostly provides this API, [unicode-xid](https://lib.rs/crates/unicode-xid),
+but it's actually less comprehensive than this proposal (it only provides XID_Start and XID_Continue).
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+- What about ID_Start and ID_Continue? They're deprecated by the Unicode Consortium, but probably still useful for parsing some languages.
+- `is_pattern_white_space`, like UAX 31 spells it? Or `is_pattern_whitespace`, for consistency with the rest of libstd?
+
+# Future possibilities
+[future-possibilities]: #future-possibilities
+
+What does [Mosh](https://mosh.org/) use need to know for its UTF-8 handling?
+Anything that's necessary to implement a correct UTF-8 enabled VT100 state machine seems applicable to Rust,
+since that state machine is separate from the text shaping itself, but still has to know things like combining marks,
+and what's necessary there is probably necessary for other, similar state machines like HTML and PDF,
+where you have to pick out weird combining-mark corner cases.

From 42bfc63c3e39420fa2aff3085b284b3d03cc7164 Mon Sep 17 00:00:00 2001
From: Michael Howell
Date: Wed, 24 Apr 2019 13:13:05 -0700
Subject: [PATCH 2/7] Acknowledge that is_xid_start and continue exist

---
 text/0000-char-uax-31.md | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/text/0000-char-uax-31.md b/text/0000-char-uax-31.md
index eefccca9cbe..80fd37c6c5a 100644
--- a/text/0000-char-uax-31.md
+++ b/text/0000-char-uax-31.md
@@ -7,7 +7,8 @@
 [summary]: #summary
 
 Add functionc to the standard library for testing a `char` against [UAX TR31](https://unicode.org/reports/tr31/) ("Unicode Annex 31")
-`Pattern_White_Space`, `Pattern_Syntax`, `XID_Start`, `ID_Nonstart`, and `XID_Continue`.
+`Pattern_White_Space`, `Pattern_Syntax`, `XID_Start`, `ID_Nonstart`, and `XID_Continue` (the XID ones are already in the standard
+library, but are unstable; this RFC proposes to stablize them).
 
 # Motivation
 [motivation]: #motivation
@@ -45,17 +46,10 @@ These functions are also exposed as methods on the `char` type.
 # Reference-level explanation
 [reference-level-explanation]: #reference-level-explanation
 
-## `fn char::is_xid_start(self) -> bool`
-
-Check if `self` is a member of Unicode Annex 31's `XID_Start` code point category.
-
-## `fn char::is_xid_continue(self) -> bool`
-
-Check if `self` is a member of Unicode Annex 31's `XID_Continue` code point category.
-
 ## `fn char::is_id_nonstart(self) -> bool`
 
 Check if `self` is a member of Unicode Annex 31's `ID_Nonstart` code point category.
+This function is defined as `self.is_xid_continue() && !self.is_xid_start()`.
 
 ## `fn char::is_pattern_syntax(self) -> bool`
 

From 320cf1cd7b9758fad0e8985a596188d9dafd1952 Mon Sep 17 00:00:00 2001
From: Michael <5672750+mibac138@users.noreply.github.com>
Date: Thu, 25 Apr 2019 09:14:22 -0700
Subject: [PATCH 3/7] Update text/0000-char-uax-31.md

Co-Authored-By: notriddle
---
 text/0000-char-uax-31.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/text/0000-char-uax-31.md b/text/0000-char-uax-31.md
index 80fd37c6c5a..9d2621305ca 100644
--- a/text/0000-char-uax-31.md
+++ b/text/0000-char-uax-31.md
@@ -6,7 +6,7 @@
 # Summary
 [summary]: #summary
 
-Add functionc to the standard library for testing a `char` against [UAX TR31](https://unicode.org/reports/tr31/) ("Unicode Annex 31")
+Add functions to the standard library for testing a `char` against [UAX TR31](https://unicode.org/reports/tr31/) ("Unicode Annex 31")
 `Pattern_White_Space`, `Pattern_Syntax`, `XID_Start`, `ID_Nonstart`, and `XID_Continue` (the XID ones are already in the standard
 library, but are unstable; this RFC proposes to stablize them).
 

From 8654dfbf7f4133b09adde05e451554913b12d80a Mon Sep 17 00:00:00 2001
From: Michael <5672750+mibac138@users.noreply.github.com>
Date: Thu, 25 Apr 2019 09:14:36 -0700
Subject: [PATCH 4/7] Update text/0000-char-uax-31.md

Co-Authored-By: notriddle
---
 text/0000-char-uax-31.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/text/0000-char-uax-31.md b/text/0000-char-uax-31.md
index 9d2621305ca..6051b4c77c0 100644
--- a/text/0000-char-uax-31.md
+++ b/text/0000-char-uax-31.md
@@ -70,7 +70,7 @@ even though Unicode places plenty of requirements on that process, too.
 
 The other problem is that a lot of languages aren't defined in terms of Annex 31 anyway,
 like Swift and HTML, which simply spell out the set of allowed code points themselves,
-so this isn't necessarily useful to allw of the language implementers.
+so this isn't necessarily useful to all of the language implementers.
 
 # Rationale and alternatives
 [rationale-and-alternatives]: #rationale-and-alternatives

From 6db8ba16dd1e8cabc44b8bf406cb724492668748 Mon Sep 17 00:00:00 2001
From: mzji
Date: Thu, 2 May 2019 18:34:38 -0700
Subject: [PATCH 5/7] Update text/0000-char-uax-31.md

Co-Authored-By: notriddle
---
 text/0000-char-uax-31.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/text/0000-char-uax-31.md b/text/0000-char-uax-31.md
index 6051b4c77c0..93a49bad61c 100644
--- a/text/0000-char-uax-31.md
+++ b/text/0000-char-uax-31.md
@@ -1,5 +1,5 @@
 - Feature Name: `char_uax_31`
-- Start Date: 2019-24-04
+- Start Date: 2019-04-24
 - RFC PR: [rust-lang/rfcs#0000](https://github.com/rust-lang/rfcs/pull/0000)
 - Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000)
 

From 4143fef9e6cace411302efc5d7d10f53e179ef92 Mon Sep 17 00:00:00 2001
From: Michael Howell
Date: Thu, 2 May 2019 18:36:30 -0700
Subject: [PATCH 6/7] Update 0000-char-uax-31.md

---
 text/0000-char-uax-31.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/text/0000-char-uax-31.md b/text/0000-char-uax-31.md
index 93a49bad61c..8d05f880f42 100644
--- a/text/0000-char-uax-31.md
+++ b/text/0000-char-uax-31.md
@@ -21,7 +21,7 @@ a standardized set of code point categories for defining computer language synta
 
 This is being used in production Rust code already.
 Rust's own compiler already has functions to check against Annex 31 code point categories in the lexer,
-[but not everyone who works on the compiler knows about them](https://internals.rust-lang.org/t/for-await-loops/9819/16),
+[but not everyone who works on the compiler knows about them](https://internals.rust-lang.org/t/do-we-need-unicode-whitespace/9876),
 and since they're not in the standard library,
 not everyone who works on Rust-related tooling has access to them.
 I'm not asserting that putting these in libstd would've avoided that bug,

From b9d0c6c9b377a0549ebb7461fb2abc0a66fd1a57 Mon Sep 17 00:00:00 2001
From: Michael Howell
Date: Thu, 2 May 2019 18:40:25 -0700
Subject: [PATCH 7/7] Mention the back-compat drawback

---
 text/0000-char-uax-31.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/text/0000-char-uax-31.md b/text/0000-char-uax-31.md
index 8d05f880f42..1ab9bb32999 100644
--- a/text/0000-char-uax-31.md
+++ b/text/0000-char-uax-31.md
@@ -72,6 +72,10 @@ The other problem is that a lot of languages aren't defined in terms of Annex 31
 like Swift and HTML, which simply spell out the set of allowed code points themselves,
 so this isn't necessarily useful to all of the language implementers.
 
+The other big drawback is that Unicode changes, so keeping the standard library synced with it represents a backwards-
+compatibility hazard. `is_whitespace` already has this problem, but the set of Unicode whitespace changes less
+frequently than XID does, so the behavior of these functions would be expected to change more often.
+
 # Rationale and alternatives
 [rationale-and-alternatives]: #rationale-and-alternatives
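
For illustration only (not part of the patch series): the `Pattern_White_Space` set in UAX #31 is small and fixed, so the check the RFC proposes as `char::is_pattern_white_space` can be sketched with a plain `matches!`. The free functions below are hypothetical stand-ins, not the standard library's implementation. The `*_ascii` helpers are an assumption made here: they restrict `XID_Start` and `XID_Continue` to their ASCII members, purely to show the `ID_Nonstart` definition from patch 2 (`is_xid_continue() && !is_xid_start()`) without pulling in Unicode tables.

```rust
/// Hypothetical free-function stand-in for the proposed
/// `char::is_pattern_white_space`: the eleven code points that
/// UAX #31 lists as `Pattern_White_Space`.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}' // TAB, LF, VT, FF, CR
            | '\u{0020}' // SPACE
            | '\u{0085}' // NEXT LINE
            | '\u{200E}' // LEFT-TO-RIGHT MARK
            | '\u{200F}' // RIGHT-TO-LEFT MARK
            | '\u{2028}' // LINE SEPARATOR
            | '\u{2029}' // PARAGRAPH SEPARATOR
    )
}

/// Assumption: ASCII-only approximation of `XID_Start` (letters only).
fn is_xid_start_ascii(c: char) -> bool {
    c.is_ascii_alphabetic()
}

/// Assumption: ASCII-only approximation of `XID_Continue`
/// (letters, digits, and underscore).
fn is_xid_continue_ascii(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// `ID_Nonstart` as patch 2 defines it: continue, but not start.
fn is_id_nonstart_ascii(c: char) -> bool {
    is_xid_continue_ascii(c) && !is_xid_start_ascii(c)
}
```

The two whitespace notions differ in both directions: U+200E (LEFT-TO-RIGHT MARK) is in `Pattern_White_Space` but is rejected by the existing `char::is_whitespace`, while U+00A0 (NO-BREAK SPACE) is accepted by `char::is_whitespace` but is not in `Pattern_White_Space`. That gap is one reason a parser might want a dedicated method.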