Why is there a "not greedy" comment here and what does that mean? #1831

Closed
historydev opened this issue May 17, 2025 · 4 comments
Labels
A-grammar Area: Syntax and parsing Language Cleanup Improvements to existing language which is correct but not clear, or missing examples, or the like.

Comments

@historydev

https://doc.rust-lang.org/1.87.0/reference/tokens.html#raw-byte-string-literals

At first I read this as saying that not all of the 0x00-0x7F range is allowed, and that the "non-greedy" comment refers to the one excluded character, namely the carriage return (CR, 0x0D).

That reading only confused me, because it is already clear that the carriage return can't be used: the ASCII_FOR_RAW description says "except IsolatedCR".

I was told the following:
"This is a standard concept in regular expressions.
Greedy matching consumes as many characters of the string as possible while still matching the pattern; non-greedy matching consumes as few as possible.
For example, for the string axxxbxxxb, the greedy /a.*b/ captures the entire string, while the non-greedy /a.*?b/ captures only up to the first b."
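The quoted explanation can be sketched in plain Rust without a regex library. This is only an illustration: the function names `match_greedy` and `match_lazy` are hypothetical, and the code hand-emulates the two patterns for this one case (it also ignores the usual rule that `.` does not match a newline).

```rust
/// Emulates the greedy pattern /a.*b/: start at the first 'a' and
/// extend to the LAST 'b' that follows it.
fn match_greedy(s: &str) -> Option<&str> {
    let start = s.find('a')?;
    // rfind searches from the end, so `.*` takes as much as it can.
    let end = start + 1 + s[start + 1..].rfind('b')?;
    Some(&s[start..=end])
}

/// Emulates the non-greedy pattern /a.*?b/: start at the first 'a' and
/// stop at the FIRST 'b' that follows it.
fn match_lazy(s: &str) -> Option<&str> {
    let start = s.find('a')?;
    // find searches from the front, so `.*?` takes as little as it can.
    let end = start + 1 + s[start + 1..].find('b')?;
    Some(&s[start..=end])
}
```

For the input axxxbxxxb, `match_greedy` returns the whole string and `match_lazy` returns axxxb.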

@ehuss ehuss added Language Cleanup Improvements to existing language which is correct but not clear, or missing examples, or the like. A-grammar Area: Syntax and parsing labels May 17, 2025
@ehuss
Contributor

ehuss commented May 17, 2025

It's non-greedy in a sense that if you have something like:

let a = br#"example a"#;
let b = br#"example b"#;

while it is matching the ASCII_FOR_RAW bytes in example a, it knows to stop before the "# in the first line. If it were greedy (the default for the * repetition), it would continue gobbling up characters until the end of example b, and thus ASCII_FOR_RAW would match example a"#;\nlet b = br#"example b.

https://en.wikipedia.org/wiki/Regular_expression#Lazy_matching contains a little more of a description.
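
The difference described above can be sketched with a toy scanner. This is not how rustc lexes raw byte strings; the function names are hypothetical, and the sketch assumes exactly one `#` in the delimiter. It only contrasts stopping at the first closing `"#` with stopping at the last one.

```rust
/// Non-greedy: the body of the first `br#"..."#` literal ends at the
/// FIRST closing `"#` after the opening delimiter.
fn raw_body_non_greedy(src: &str) -> Option<&str> {
    // Everything after the first opening delimiter `br#"`.
    let rest = src.split_once("br#\"")?.1;
    let end = rest.find("\"#")?;
    Some(&rest[..end])
}

/// Greedy (what the grammar must NOT do): the body runs to the LAST
/// closing `"#` found anywhere in the remaining source.
fn raw_body_greedy(src: &str) -> Option<&str> {
    let rest = src.split_once("br#\"")?.1;
    let end = rest.rfind("\"#")?;
    Some(&rest[..end])
}
```

Given the two-line source from the comment above, `raw_body_non_greedy` yields example a, while `raw_body_greedy` yields example a"#;\nlet b = br#"example b, swallowing the second statement.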

@historydev
Author

Thank you very much!

Why isn't this described in the table? Is it considered too obvious to mention?

The notation page doesn't even say whether regular-expression symbols are being used, so it never occurred to me to read it that way:
https://doc.rust-lang.org/1.87.0/reference/notation.html#string-table-productions

@mattheww
Contributor

To take your questions in reverse:

No, this isn't so obvious it doesn't need documenting.

The fact that the formalism being used isn't documented is a bug in the Reference (there isn't an open issue explicitly about this, but I suppose it comes under #567).

The reason it isn't documented is that the Reference has evolved gradually from a Rust "manual" which described the lexical structure only in English.

In 2017 a contributor was kind enough to submit a form of the current "Lexer" blocks, and the editors at the time thought that was valuable enough to include without an explicit description of how they need to be interpreted.

(As I understand it he was using Antlr4 in its "lexer grammar" mode.)

@historydev
Author

Thank you very much!
