From cc7f09ac7ca36a963fa65b563ab826837f56c71a Mon Sep 17 00:00:00 2001 From: Iulia Bastys Date: Wed, 14 May 2025 23:23:25 +0200 Subject: [PATCH] Update Lesson 1.4 README. --- .../1_basic/04_disambiguation/README.md | 314 +++++++++++------- 1 file changed, 196 insertions(+), 118 deletions(-) diff --git a/k-distribution/k-tutorial/1_basic/04_disambiguation/README.md b/k-distribution/k-tutorial/1_basic/04_disambiguation/README.md index 59c05ee2954..1aeae455649 100644 --- a/k-distribution/k-tutorial/1_basic/04_disambiguation/README.md +++ b/k-distribution/k-tutorial/1_basic/04_disambiguation/README.md @@ -4,32 +4,36 @@ copyright: Copyright (c) Runtime Verification, Inc. All Rights Reserved. # Lesson 1.4: Disambiguating Parses -The purpose of this lesson is to teach how to use K's builtin features for -disambiguation to transform an ambiguous grammar into an unambiguous one that -expresses the intended ASTs. +In this lesson you will learn how to disambiguate a grammar by using K's +built-in features, rather than _artificial_ operators such as brackets. +You will learn how to define a precedence order on operators and +associativity directions for them. ## Priority blocks -In practice, very few formal languages outside the domain of natural language -processing are ambiguous. The main reason for this is that parsing unambiguous -languages is asymptotically faster than parsing ambiguous languages. -Programming language designers instead usually use the notions of operator -precedence and associativity to make expression grammars unambiguous. These -mechanisms work by instructing the parser to reject certain ASTs in favor of -others in case of ambiguities; it is often possible to remove *all* ambiguities -in a grammar with these techniques. - -While it is sometimes possible to explicitly rewrite the grammar to remove -these parses, because K's grammar specification and AST generation are -inextricably linked, this is generally discouraged. Instead, we use the -approach of explicitly expressing the relative precedence of different -operators in different situations in order to resolve the ambiguity. - -For example, in C, `&&` binds tighter in precedence than `||`, meaning that -the expression `true && false || false` has only one valid AST: -`(true && false) || false`. - -Consider, then, the third iteration on the grammar of this definition +Parsing unambiguous languages is asymptotically faster than parsing ambiguous +languages. That's why in practice, very few formal languages outside the domain +of natural language processing are ambiguous. + +Cluttering the code with brackets to remove ambiguity is not an ideal solution. +Instead, programming language designers have developed methods for +disambiguating language expressions by making use of operator precedence and +associativity directions for them. It is often possible to remove _all_ +ambiguities in a grammar with these methods, as they instruct the parser to +accept some ASTs in favor of others. + +In general, grammars can be rewritten to remove unwanted parses. However, +in K, the grammar specification and AST generation are intrinsically linked, +so we discourage this approach. You will still learn how to do this in K +towards the end of this lesson. Now we continue with showing you how to +explicitly specify the relative precedence of operators in different +situations in order to resolve the ambiguity. + +Recall that in C, `&&` binds tighter than `||`, meaning that it has higher +precedence, meaning additionally that the expression `true && false || false` +has only one valid AST: `(true && false) || false`. + +Consider, then, the third iteration on the grammar of Boolean expressions (`lesson-04-a.k`): ```k @@ -45,77 +49,96 @@ module LESSON-04-A endmodule ``` -In this example, some of the `|` symbols separating productions in a single -block have been replaced with `>`. This serves to describe the -**priority groups** associated with this block of productions. -The first priority group consists of the atoms of the -language: `true`, `false`, and the bracket operator. In general, a priority -group starts either at the `::=` or `>` operator and extends until either the -next `>` operator or the end of the production block. Thus, we can see that the -second, third, fourth, and fifth priority groups in this grammar all consist -of a single production. +In this example, some of the pipe symbols `|` separating productions have been +replaced with `>`. This serves to describe the **priority groups** associated +with a block of productions. The first priority group consists of the atoms +of the language: `true`, `false`, and the bracket operator. In general, a +priority group starts either at the `::=` or `>` operator and extends until +either the next `>` operator, or the end of the production block. Thus, we can +see that the second, third, fourth, and fifth priority groups in this grammar +all consist of a single production. The meaning of these priority groups becomes apparent when parsing programs: A symbol with a **lesser priority**, (i.e., one that **binds looser**), cannot appear as the **direct child** of a symbol with a **greater priority** (i.e., -one that **binds tighter**. In this case, the `>` operator can be seen as a +one that **binds tighter**). In this case, the `>` operator can be seen as a **greater-than** operator describing a transitive partial ordering on the productions in the production block, expressing their relative priority. To see this more concretely, let's look again at the program -`true && false || false`. As noted before, previously this program was -ambiguous because the parser could either choose that `&&` was the child of `||` -or vice versa. However, because a symbol with lesser priority (i.e., `||`) +`true && false || false`. As noted before, this program was ambiguous because +the parser could either choose `&&` to be the child of `||` or vice versa. +(Recall figures 3-A and 3-B in previous lesson.) +However, because a symbol with lesser priority (i.e., `||`) cannot appear as the direct child of a symbol with greater priority (i.e., `&&`), the parser will **reject** the parse where `||` is under the -`&&` operator. As a result, we are left with the unambiguous parse +`&&` operator (Fig. 3-B). As a result, we are left with the unambiguous parse `(true && false) || false`. Similarly, `true || false && false` parses unambiguously as `true || (false && false)`. Conversely, if the user explicitly wants the other parse, they can express this using brackets by explicitly writing `true && (false || false)`. This still parses successfully because the -`||` operator is no longer the **direct** child of the `&&` operator, but is -instead the direct child of the `()` operator, and the `&&` operator is an -**indirect** parent, which is not subject to the priority restriction. - -Astute readers, however, will already have noticed what seems to be a -contradiction: we have defined `()` as also having greater priority than `||`. -One would think that this should mean that `||` cannot appear as a direct -child of `()`. This is a problem because priority groups are applied to every -possible parse separately. That is to say, even if the term is unambiguous -prior to this disambiguation rule, we still reject that parse if it violates -the rule of priority. - -In fact, however, we do not reject this program as a parse error. Why is that? -Well, the rule for priority is slightly more complex than previously described. -In actual fact, it applies only conditionally. Specifically, it applies in +`||` operator is no longer the **direct** child of the `&&` operator, but under +the hood, is instead the direct child of the `()` operator, and the `&&` +operator is an **indirect** parent, which is not subject to the priority +restriction. + +You must have noticed that `()` has been defined as having greater priority +than `||` and we might have possibly reached a contradiction. One would think +that `||` cannot appear as a direct child of `()`. This is a problem because +priority groups are applied to every possible parse separately. That is to say, +even if the term is unambiguous, prior to this disambiguation rule, we still +reject that parse if it violates the rule of priority. + +However, we do not reject this program as a parse error because the rule for +priority is slightly more complex than previously described. +In fact, it applies only conditionally. Specifically, it applies in cases where the child is either the first or last production item in the parent's production. For example, in the production `Bool "&&" Bool`, the first `Bool` non-terminal is not preceded by any terminals, and the last `Bool` -non-terminal is not followed by any terminals. As a result of this, we apply -the priority rule to both children of `&&`. However, in the `()` operator, -the sole non-terminal is both preceded by and followed by terminals. As a -result, the priority rule is not applied when `()` is the parent. Because of -this, the program we mentioned above successfully parses. +non-terminal is not followed by any terminals. As a result, we apply the +priority rule to both children of `&&`. However, in the `()` operator, the +sole non-terminal is both preceded by and followed by terminals. Hence, the +priority rule is not applied when `()` is the parent. Because of this, the +program we mentioned above successfully parses. ### Exercise -Parse the program `true && false || false` using kast, and confirm that the AST -places `||` as the top level symbol. Then modify the definition so that you +Parse the program `true && false || false` using `kast`, and confirm that the +AST places `||` as the top-level symbol. Then modify the definition so that you will get the alternative parse. ## Associativity -Even having broken the expression grammar into priority blocks, the resulting -grammar is still ambiguous. We can see this if we try to parse the following -program (`assoc.bool`): +Sometimes, even after breaking the expression grammar into priority blocks we +still get an ambiguous grammar. Let's try to parse the following program +(`assoc.bool`): ``` true && false && false ``` Priority blocks will not help us here: the problem comes between two parses -where both possible parses have a direct parent and child which is within a -single priority block (in this case, `&&` is in the same block as itself). +where both possible ASTs (Figs. 4-A and 4-B) have a direct parent and child +which is within a single priority block (in this case, `&&` is in the same +block as itself): + +Fig. 4-A +``` + && + / \ + && false + / \ +true false +``` + +Fig. 4-B +``` + && + / \ +true && + / \ + false false +``` This is where the notion of associativity comes into play. Associativity applies the following additional rules to parses: @@ -129,7 +152,8 @@ child of a symbol with equal priority. In C, binary operators are all left-associative, meaning that the expression `true && false && false` parses unambiguously as `(true && false) && false`, -because `&&` cannot appear as the rightmost child of itself. +because `&&` cannot appear as the rightmost child of itself, i.e., only the AST +in Fig. 4-A is valid. Consider, then, the fourth iteration on the grammar of this definition (`lesson-04-b.k`): @@ -148,13 +172,12 @@ endmodule ``` Here each priority group, immediately after the `::=` or `>` operator, can -be followed by a symbol representing the associativity of that priority group: -either `left:` for left associativity, `right:` for right associativity, or +be followed by a literal representing the associativity of that priority group: +either `left:` for left-associativity, `right:` for right-associativity, or `non-assoc:` for non-associativity. In this example, each priority group we apply associativity to has only a single production, but we could equally well write a priority block with multiple productions and an associativity. - -For example, consider the following, different grammar (`lesson-04-c.k`): +For example, consider the grammar below (`lesson-04-c.k`): ```k module LESSON-04-C @@ -171,9 +194,9 @@ endmodule ``` In this example, unlike the one above, `&&`, `^`, and `||` have the same -priority. However, viewed as a group, the entire group is left associative. -This means that none of `&&`, `^`, and `||` can appear as the right child of -any of `&&`, `^`, or `||`. As a result of this, this grammar is also not +priority, as are part of the same block. Addionally, the entire group is +left-associative. This means that none of `&&`, `^`, and `||` can appear as the +right child of any of `&&`, `^`, or `||`. Hence, this grammar is also not ambiguous. However, it expresses a different grammar, and you are encouraged to think about what the differences are in practice. @@ -185,17 +208,14 @@ definition to generate the alternative parse. ## Explicit priority and associativity declarations -Previously we have only considered the case where all of the productions -which you wish to express a priority or associativity relation over are -co-located in the same block of productions. However, in practice this is not -always feasible or desirable, especially as a definition grows in size across -multiple modules. - -As a result of this, K provides a second way of declaring priority and -associativity relations. +Previously we have only considered the case where all productions expressing a +priority or associativity relation are co-located in the same block of +productions. However, in practice this is not always feasible or desirable, +especially as a definition grows in size across multiple modules. As a result, +K provides a second way of declaring priority and associativity relations. -Consider the following grammar, which we will name `lesson-04-d.k` and which -will express the exact same grammar as `lesson-04-b.k` +Consider the following grammar, which we name `lesson-04-d.k` and which +expresses the exact same grammar as `lesson-04-b.k`: ```k module LESSON-04-D @@ -215,33 +235,41 @@ endmodule ``` This introduces a couple of new features of K. First, the `group(_)` attribute -is used to conceptually group together sets of sentences under a common -user-defined name. For example, `literal` in the `syntax priority` sentence is -used to refer to all the productions marked with the `group(literal)` attribute, -i.e., `true` and `false`. A production can belong to multiple groups using -syntax such as `group(myGrp1,myGrp2)`. +of a production is used to conceptually group together sets of sentences under +a common user-defined name. For example, `literal` in the `syntax priority` +sentence is used to refer to all productions marked with the `group(literal)` +attribute, i.e., `true` and `false`, `atom` to all productions marked with +`group(atom)`, i.e., braket production, and so on and so fort. A production can +belong to multiple groups using syntax such as `group(myGrp1,myGrp2)`. Once we understand this, it becomes relatively straightforward to understand -the meaning of this grammar. Each `syntax priority` sentence defines a -priority relation where `>` separates different priority groups. Each priority -group is defined by a list of one or more group names, and consists of all -productions which are members of at least one of those named groups. - -In the same way, a `syntax left`, `syntax right`, or `syntax non-assoc` sentence -defines an associativity relation among left-, right-, or non-associative -groups. Specifically, this means that: +the meaning of this grammar. Each `syntax priority` sentence defines a priority +relation where `>` separates different priority groups. Each priority group is +defined by a list of one or more group names, and consists of all productions +which are members of at least one of those named groups. `literal` and `atom` +are only separated by space as they have the same precedence. + +In the same way, sentences `syntax left`, `syntax right`, or `syntax non-assoc` +define an associativity relation among left-, right-, or non-associative +groups, respectively. Specifically, this means that: + ``` syntax left a b ``` + is _different_ to: + ``` syntax left a syntax left b ``` -As a consequence of this, `syntax [left|right|non-assoc]` should not be used to + +TODO: Explain how they are different. + +As a consequence, `syntax [left|right|non-assoc]` should not be used to group together labels with different priority. -## Prefer/avoid +## Prefer/avoid productions Sometimes priority and associativity prove insufficient to disambiguate a grammar. In particular, sometimes it is desirable to be able to choose between @@ -249,7 +277,7 @@ two ambiguous parses directly while still not rejecting any parses if the term parsed is unambiguous. A good example of this is the famous "dangling else" problem in imperative C-like languages. -Consider the following definition (`lesson-04-E.k`): +Consider the following definition (`lesson-04-e.k`): ```k module LESSON-04-E @@ -258,6 +286,7 @@ module LESSON-04-E syntax Stmt ::= "if" "(" Exp ")" Stmt | "if" "(" Exp ")" Stmt "else" Stmt | "{" "}" + endmodule ``` @@ -269,24 +298,47 @@ if (true) if (false) {} else {} This is ambiguous because it is unclear whether the `else` clause is part of the outer `if` or the inner `if`. At first we might try to resolve this with -priorities, saying that the `if` without an `else` cannot appear as a child of -the `if` with an `else`. However, because the non-terminal in the parent symbol -is both preceded and followed by a terminal, this will not work. +priorities, specifying that the `if` without an `else` cannot appear as a child +of the `if` with an `else`. However, because the non-terminal in the parent +symbol is both preceded and followed by a terminal, this will not work. Instead, we can resolve the ambiguity directly by telling the parser to "prefer" or "avoid" certain productions when ambiguities arise. For example, when we parse this program, we see the following ambiguity as an error message: +``` +kompile lesson-04-e.k +kast --output kore dangling-else.if +``` + +gives the following output, minus the formatting: + ``` [Error] Inner Parser: Parsing ambiguity. 1: syntax Stmt ::= "if" "(" Exp ")" Stmt + `if(_)__LESSON-04-E_Stmt_Exp_Stmt`( + `true_LESSON-04-E_Exp`(.KList), + `if(_)_else__LESSON-04-E_Stmt_Exp_Stmt_Stmt`( + `false_LESSON-04-E_Exp`(.KList), + `{}_LESSON-04-E_Stmt`(.KList), + `{}_LESSON-04-E_Stmt`(.KList) + ) + ) -`if(_)__LESSON-04-E_Stmt_Exp_Stmt`(`true_LESSON-04-E_Exp`(.KList),`if(_)_else__LESSON-04-E_Stmt_Exp_Stmt_Stmt`(`false_LESSON-04-E_Exp`(.KList),`;_LESSON-04-E_Stmt`(.KList),`;_LESSON-04-E_Stmt`(.KList))) 2: syntax Stmt ::= "if" "(" Exp ")" Stmt "else" Stmt - -`if(_)_else__LESSON-04-E_Stmt_Exp_Stmt_Stmt`(`true_LESSON-04-E_Exp`(.KList),`if(_)__LESSON-04-E_Stmt_Exp_Stmt`(`false_LESSON-04-E_Exp`(.KList),`;_LESSON-04-E_Stmt`(.KList)),`;_LESSON-04-E_Stmt`(.KList)) - Source(./dangling-else.if) - Location(1,1,1,30) + `if(_)_else__LESSON-04-E_Stmt_Exp_Stmt_Stmt`( + `true_LESSON-04-E_Exp`(.KList), + `if(_)__LESSON-04-E_Stmt_Exp_Stmt`( + `false_LESSON-04-E_Exp`(.KList), + `{}_LESSON-04-E_Stmt`(.KList) + ), + `{}_LESSON-04-E_Stmt`(.KList) + ) + + Source(./dangling-else.if) + Location(1,1,1,32) + 1 | if (true) if (false) {} else {} + . ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` Roughly, we see that the ambiguity is between an `if` with an `else` or an `if` @@ -304,18 +356,42 @@ module LESSON-04-F endmodule ``` -Here we have added the `avoid` attribute to the `else` production. As a result, -when an ambiguity occurs and one or more of the possible parses has that symbol -at the top of the ambiguous part of the parse, we remove those parses from -consideration and consider only those remaining. The `prefer` attribute behaves -similarly, but instead removes all parses which do not have that attribute. -In both cases, no action is taken if the parse is not ambiguous. +Here we have added the `avoid` attribute to the `else` production. As a result, +when an ambiguity occurs and any of the possible parses has that symbol at the +top of the ambiguous part of the parse, we discard those parses, and consider +only the remaining ones. The `prefer` attribute behaves similarly, but instead +removes all parses which do not have that attribute. In both cases, no action +is taken if the parse is not ambiguous. + +If we parse the program in `dangling-else.if` with this grammar, + +``` +kompile lesson-04-f.k +kast --output kast dangling-else.if +``` + +we get the following output, minus the formatting: + +``` +`if(_)__LESSON-04-F_Stmt_Exp_Stmt`( + `true_LESSON-04-F_Exp`(.KList), + `if(_)_else__LESSON-04-F_Stmt_Exp_Stmt_Stmt`( + `false_LESSON-04-F_Exp`(.KList), + `{}_LESSON-04-F_Stmt`(.KList), + `{}_LESSON-04-F_Stmt`(.KList) + ) +) +``` + +As we expected, the AST where the `else` corresponds to the first `if` is +discarded. + ## Exercises 1. Parse the program `if (true) if (false) {} else {}` using `lesson-04-f.k` -and confirm that else clause is part of the innermost `if` statement. Then -modify the definition so that you will get the alternative parse. +and confirm that the `else` clause is part of the innermost `if` statement. +Then modify the definition so that you will get the alternative parse. 2. Modify your solution from Lesson 1.3, Exercise 2 so that unary negation should bind tighter than multiplication and division, which should bind tighter than @@ -327,13 +403,15 @@ try to write them inline. resolved via priority or associativity, and then use the `prefer` attribute to resolve that ambiguity. -4. Explain why the following grammar is not labeled ambiguous by the K parser when parsing `abb`, then make the parser realize the ambiguity. +4. Explain why the following grammar is not labeled ambiguous by the K parser +when parsing `abb`, then make the parser realize the ambiguity. + ```k module EXERCISE4 -syntax Expr ::= "a" Expr "b" - | "abb" - | "b" + syntax Expr ::= "a" Expr "b" + | "abb" + | "b" endmodule ```