Skip to content

Unicode check for crate_name attribute #2463

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 30, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions gcc/rust/lex/rust-codepoint.h
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,8 @@
#include "rust-system.h"

namespace Rust {

// FIXME: move this to rust-unicode.h?
struct Codepoint
Comment on lines +26 to 27
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this class should be moved to rust-unicode.h.
At least it should not be put in under gcc/rust/lex/.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that seems good to me

{
uint32_t value;
Expand Down
8 changes: 8 additions & 0 deletions gcc/rust/lex/rust-lex.h
Original file line number Diff line number Diff line change
Expand Up @@ -334,6 +334,14 @@ class Lexer
return c;
}
}

tl::optional<std::vector<Codepoint>> get_chars ()
{
if (is_valid ())
return {chars};
else
return tl::nullopt;
}
};

class FileInputSource : public InputSource
Expand Down
34 changes: 24 additions & 10 deletions gcc/rust/rust-session-manager.cc
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@
#include "rust-early-name-resolver.h"
#include "rust-cfg-strip.h"
#include "rust-expand-visitor.h"
#include "rust-unicode.h"

#include "diagnostic.h"
#include "input.h"
Expand Down Expand Up @@ -107,30 +108,39 @@ infer_crate_name (const std::string &filename)
return crate;
}

/* Validate the crate name using the ASCII rules
TODO: Support Unicode version of the rules */
/* Validate the crate name using the ASCII rules */

static bool
validate_crate_name (const std::string &crate_name, Error &error)
Comment on lines 113 to 114
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function would probably benefit from the new tl::expected type we have, but that's a problem for another PR!

{
if (crate_name.empty ())
Utf8String utf8_name = {crate_name};
tl::optional<std::vector<Codepoint>> uchars_opt = utf8_name.get_chars ();

if (!uchars_opt.has_value ())
{
error = Error (UNDEF_LOCATION, "crate name is not a valid UTF-8 string");
return false;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and then add this

Suggested change
}
}
auto uchars = uchars_opt.value();


std::vector<Codepoint> uchars = uchars_opt.value ();
if (uchars.empty ())
{
error = Error (UNDEF_LOCATION, "crate name cannot be empty");
return false;
}
if (crate_name.length () > kMaxNameLength)
if (uchars.size () > kMaxNameLength)
{
error = Error (UNDEF_LOCATION, "crate name cannot exceed %lu characters",
(unsigned long) kMaxNameLength);
return false;
}
for (auto &c : crate_name)
for (Codepoint &c : uchars)
{
if (!(ISALNUM (c) || c == '_'))
if (!(is_alphabetic (c.value) || is_numeric (c.value) || c.value == '_'))
{
error = Error (UNDEF_LOCATION,
"invalid character %<%c%> in crate name: %<%s%>", c,
crate_name.c_str ());
"invalid character %<%s%> in crate name: %<%s%>",
c.as_string ().c_str (), crate_name.c_str ());
return false;
}
}
Expand Down Expand Up @@ -1273,13 +1283,17 @@ rust_crate_name_validation_test (void)
ASSERT_TRUE (Rust::validate_crate_name ("example", error));
ASSERT_TRUE (Rust::validate_crate_name ("abcdefg_1234", error));
ASSERT_TRUE (Rust::validate_crate_name ("1", error));
// FIXME: The next test does not pass as of current implementation
// ASSERT_TRUE (Rust::CompileOptions::validate_crate_name ("惊吓"));
Comment on lines -1276 to -1277
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this original comment is probably wrong.
The kanji 惊(U+60CA) are not categorized as Alphabetic.

In https://www.unicode.org/Public/14.0.0/ucd/DerivedCoreProperties.txt

...
30FF          ; Alphabetic # Lo       KATAKANA DIGRAPH KOTO
3105..312F    ; Alphabetic # Lo  [43] BOPOMOFO LETTER B..BOPOMOFO LETTER NN
3131..318E    ; Alphabetic # Lo  [94] HANGUL LETTER KIYEOK..HANGUL LETTER ARAEAE
31A0..31BF    ; Alphabetic # Lo  [32] BOPOMOFO LETTER BU..BOPOMOFO LETTER AH
31F0..31FF    ; Alphabetic # Lo  [16] KATAKANA LETTER SMALL KU..KATAKANA LETTER SMALL RO
3400..4DBF    ; Alphabetic # Lo [6592] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DBF
4E00..A014    ; Alphabetic # Lo [21013] CJK UNIFIED IDEOGRAPH-4E00..YI SYLLABLE E
A015          ; Alphabetic # Lm       YI SYLLABLE WU
A016..A48C    ; Alphabetic # Lo [1143] YI SYLLABLE BIT..YI SYLLABLE YYR
A4D0..A4F7    ; Alphabetic # Lo  [40] LISU LETTER BA..LISU LETTER OE
...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for checking haha. did you try it with rustc directly?

Copy link
Contributor Author

@tamaroning tamaroning Jul 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh my god. i just checked and rustc compiles it (@_@)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I overlooked 4E00..A014 containing 0x60CA haha

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hahaha no worries it's all good :D

ASSERT_TRUE (Rust::validate_crate_name ("クレート", error));
ASSERT_TRUE (Rust::validate_crate_name ("Sōkrátēs", error));
ASSERT_TRUE (Rust::validate_crate_name ("惊吓", error));

// NOTE: - is not allowed in the crate name ...

ASSERT_FALSE (Rust::validate_crate_name ("abcdefg-1234", error));
ASSERT_FALSE (Rust::validate_crate_name ("a+b", error));
ASSERT_FALSE (Rust::validate_crate_name ("/a+b/", error));
ASSERT_FALSE (Rust::validate_crate_name ("😸++", error));
ASSERT_FALSE (Rust::validate_crate_name ("∀", error));

/* Tests for crate name inference */
ASSERT_EQ (Rust::infer_crate_name ("c.rs"), "c");
Expand Down
23 changes: 4 additions & 19 deletions gcc/rust/util/rust-unicode.cc
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ typedef std::vector<codepoint_t> string_t;
template <std::size_t SIZE>
int64_t
binary_search_ranges (
// FIXME: use binray search function from <algorithm>
const std::array<std::pair<uint32_t, uint32_t>, SIZE> &ranges,
uint32_t target_cp)
{
Expand Down Expand Up @@ -49,6 +50,7 @@ int64_t
binary_search_sorted_array (const std::array<std::uint32_t, SIZE> &array,
uint32_t target)
{
// FIXME: use binray search function from <algorithm>
if (SIZE == 0)
return -1;

Expand Down Expand Up @@ -104,9 +106,7 @@ recursive_decomp_cano (codepoint_t c, string_t &buf)
{
string_t decomped = it->second;
for (codepoint_t cp : decomped)
{
recursive_decomp_cano (cp, buf);
}
recursive_decomp_cano (cp, buf);
}
else
buf.push_back (c);
Expand Down Expand Up @@ -152,8 +152,7 @@ recomp (string_t s)
if (s.size () > 0)
{
int last_class = -1;
// int starter_pos = 0; // Assume the first character is Starter. Correct?
// int target_pos = 1;
// Assume the first character is Starter.
codepoint_t starter_ch = s[0];
for (unsigned int src_pos = 1; src_pos < s.size (); src_pos++)
{
Expand Down Expand Up @@ -189,20 +188,6 @@ recomp (string_t s)
return buf;
}

// TODO: remove
/*
void
dump_string (std::vector<uint32_t> s)
{
std::cout << "dump=";
for (auto c : s)
{
std::cout << std::hex << c << ", ";
}
std::cout << std::endl;
}
*/

string_t
nfc_normalize (string_t s)
{
Expand Down
19 changes: 19 additions & 0 deletions gcc/rust/util/rust-unicode.h
Original file line number Diff line number Diff line change
Expand Up @@ -19,10 +19,29 @@
#ifndef RUST_UNICODE_H
#define RUST_UNICODE_H

#include "optional.h"
#include "rust-system.h"
#include "rust-lex.h"

namespace Rust {

class Utf8String
{
private:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to say a nit here to move the private fields to the bottom of the class declaration but then i realised we do it this style in AST and HIR classes.

What is the GCC style? I think having them at the top like this is probably the best now when i think about it.

@dkm @CohenArthur @tschwinge any opinions on the style of putting private fields at the top of the class or bottom?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like having private fields at the bottom of my classes personally. I never gave it much attention haha

tl::optional<std::vector<Codepoint>> chars;

public:
Utf8String (const std::string &maybe_utf8)
{
Lexer::BufferInputSource input_source = {maybe_utf8, 0};
chars = input_source.get_chars ();
}

// Returns UTF codepoints when string is valid as UTF-8, returns nullopt
// otherwise.
tl::optional<std::vector<Codepoint>> get_chars () const { return chars; }
};

// TODO: add function nfc_normalize

bool
Expand Down
2 changes: 2 additions & 0 deletions gcc/testsuite/rust/compile/bad-crate-name2.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
#![crate_name = "😅"] // { dg-error "invalid character ...." "" }
fn main() {}