
[SPARK-51695][SQL] Introduce Parser Changes for Table Constraints (CHECK, UNIQUE, PK, FK) #50496

Closed · 23 commits

Conversation

gengliangwang (Member) commented Apr 2, 2025

What changes were proposed in this pull request?


This PR introduces parser support for ANSI SQL-compatible table constraints, including:

  • CHECK
  • UNIQUE
  • PRIMARY KEY
  • FOREIGN KEY

The updated parser supports these constraints in the following statements:

  • CREATE TABLE
  • REPLACE TABLE
  • ALTER TABLE ... ADD CONSTRAINT

Key Features

  • Constraints can be named or unnamed.
  • Constraints can appear as:
    • Column constraints, at the end of a column definition.
    • Table constraints, declared among table elements (in any order).
    • ALTER TABLE … ADD CONSTRAINT statements.
  • Named constraints can be dropped via ALTER TABLE … DROP CONSTRAINT.

Table Constraint Characteristics

  • ENFORCED: The constraint is validated by the Spark engine during write operations. If data violates the constraint, Spark will raise an error.
  • NOT ENFORCED: Spark does not validate the constraint during data writes; it’s treated as metadata only.
  • RELY: A user-provided hint that the constraint is known to be valid. This allows Spark to apply query optimizations based on the assumption that the constraint holds.
  • NORELY: The default.

Spark does not rely on the constraint for optimizations unless:

  • It is explicitly marked as RELY, or
  • It is ENFORCED and has been validated by Spark.
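
For illustration, the characteristic clause trails the constraint definition, as in the examples below. A minimal sketch (assuming ENFORCED is accepted for CHECK constraints; the table names t5/t6 are made up):

-- CHECK constraint validated on write
CREATE TABLE t5 (
  age INT,
  CONSTRAINT ck_age CHECK (age > 0) ENFORCED
);

-- Metadata-only PRIMARY KEY that the optimizer may trust
CREATE TABLE t6 (
  id INT,
  CONSTRAINT pk_id PRIMARY KEY (id) NOT ENFORCED RELY
);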

✅ CHECK Constraints

-- Column-level, unnamed
CREATE TABLE t1 (
  age INT CHECK (age > 0)
);

-- Column-level, named
CREATE TABLE t2 (
  age INT CONSTRAINT ck_age CHECK (age > 0)
);

-- Table-level, unnamed
CREATE TABLE t3 (
  age INT,
  CHECK (age > 0)
);

-- Table-level, named
CREATE TABLE t4 (
  age INT,
  CONSTRAINT ck_age CHECK (age > 0)
);

🔑 PRIMARY KEY Constraints

-- Column-level, unnamed
CREATE TABLE t1 (
  id INT PRIMARY KEY
);

-- Column-level, named, RELY
CREATE TABLE t2 (
  id INT CONSTRAINT pk_id PRIMARY KEY RELY
);

-- Table-level, unnamed, NORELY
CREATE TABLE t3 (
  id INT,
  PRIMARY KEY (id) NORELY
);

-- Table-level, named
CREATE TABLE t4 (
  id INT,
  CONSTRAINT pk_id PRIMARY KEY (id)
);

🔐 UNIQUE Constraints

-- Column-level, unnamed
CREATE TABLE t1 (
  email STRING UNIQUE
);

-- Column-level, named
CREATE TABLE t2 (
  email STRING CONSTRAINT uq_email UNIQUE
);

-- Table-level, unnamed
CREATE TABLE t3 (
  email STRING,
  UNIQUE (email)
);

-- Table-level, named
CREATE TABLE t4 (
  email STRING,
  CONSTRAINT uq_email UNIQUE (email)
);

🔗 FOREIGN KEY Constraints

CREATE TABLE dept (id INT PRIMARY KEY);

-- Column-level, unnamed
CREATE TABLE emp1 (
  dept_id INT REFERENCES dept(id)
);

-- Column-level, named
CREATE TABLE emp2 (
  dept_id INT CONSTRAINT fk_dept_col REFERENCES dept(id)
);

-- Table-level, unnamed
CREATE TABLE emp3 (
  dept_id INT,
  FOREIGN KEY (dept_id) REFERENCES dept(id)
);

-- Table-level, named
CREATE TABLE emp4 (
  dept_id INT,
  CONSTRAINT fk_dept_tbl FOREIGN KEY (dept_id) REFERENCES dept(id)
);

⚙️ ALTER TABLE

-- Add named constraint
ALTER TABLE t ADD CONSTRAINT ck_positive CHECK (amount > 0);

-- Add unnamed constraint
ALTER TABLE t ADD UNIQUE (email);
ALTER TABLE t ADD PRIMARY KEY (id);
ALTER TABLE t ADD FOREIGN KEY (dept_id) REFERENCES dept(id);
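
-- A sketch of characteristics on ALTER TABLE ... ADD, assuming the same
-- trailing-clause syntax as in the CREATE TABLE examples (not taken
-- verbatim from this PR):
ALTER TABLE t ADD CONSTRAINT pk_id PRIMARY KEY (id) RELY;
ALTER TABLE t ADD FOREIGN KEY (dept_id) REFERENCES dept(id) NOT ENFORCED;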

-- Drop named constraint
ALTER TABLE t DROP CONSTRAINT ck_positive;

Why are the changes needed?

Allow users to define, modify, and enforce table constraints in connectors that support them. This will facilitate data accuracy, ensure consistency, and enable performance optimizations in Spark.

Does this PR introduce any user-facing change?

Yes. It introduces parser changes for table constraints (CHECK, UNIQUE, PRIMARY KEY, FOREIGN KEY).

How was this patch tested?

New parser unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

gengliangwang (Member Author)

cc @aokolnychyi @srielau


case class UniqueConstraint(
    columns: Seq[String],
    override val name: String = null,
Member

Did you consider making name an Option[String], so we can use None instead of null for the case without a name?

viirya (Member) commented Apr 2, 2025

It seems more consistent (e.g., ConstraintCharacteristic uses Option to represent unspecified enforced and rely).

gengliangwang (Member Author)

This is by design. The implementations of withName and withCharacteristic are simpler and more consistent this way.

s"${tableName}_chk_${base}_$rand"
}

override def sql: String = s"CONSTRAINT $name CHECK ($condition)"
viirya (Member) commented Apr 2, 2025

Hmm, if name is null, what does this sql produce? CONSTRAINT null CHECK ...? Is that valid?

gengliangwang (Member Author)

If the constraint name is not provided, all constraints get generated names. See the method generateConstraintNameIfNeeded for details.

Member

Yeah, I saw there is generateConstraintNameIfNeeded, but I was not sure when it would be used. So if no name is provided (i.e., name = null), Spark will generate a name for it.

gengliangwang (Member Author)

Yes, the name is never null.
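
To make this concrete, a hypothetical example of the generated-name scheme (the random suffix and the base part are made up; only the format follows the s"${tableName}_chk_${base}_$rand" snippet above):

-- Unnamed CHECK constraint: Spark generates a name on the user's behalf
CREATE TABLE t (age INT, CHECK (age > 0));

-- The generated name (hypothetically t_chk_age_x9q2f) can then be used:
-- ALTER TABLE t DROP CONSTRAINT t_chk_age_x9q2f;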

import org.apache.spark.sql.catalyst.plans.logical.DropConstraint
import org.apache.spark.sql.test.SharedSparkSession

class AlterTableDropConstraintParseSuite extends AnalysisTest with SharedSparkSession {
Member

This is for DropConstraint. Is there another test suite for AddConstraint too? It seems I can't find it.

gengliangwang (Member Author)

It is covered in the other test suites, such as CheckConstraintParseSuite, PrimaryKeyConstraintParseSuite, etc.

| PRIMARY KEY
;

uniqueConstraint
Contributor

Minor: it took me a while to find PK; it wasn't obvious that it is part of uniqueSpec. I would consider splitting them for clarity, but I also see that you probably wanted to cut down on the number of rules.

gengliangwang (Member Author)

Either way should work. This is from the ANSI SQL syntax, BTW.

}
}

// Generate a constraint name based on the table name if the name is not specified
aokolnychyi (Contributor) commented Apr 3, 2025

What if the table is renamed later? It seems we can simplify this logic quite a bit without the need to include the table name. I understand it is done to distinguish constraints but I wonder if we can leverage the catalog, namespace, table identifier in that code rather than attempting to generate a unique enough name here?

Contributor

I do understand we replicate Postgres behavior here, but we don't guarantee there are no duplicates. Let's discuss a bit more how we can implement something like the pg_constraint table.

gengliangwang (Member Author)

> I wonder if we can leverage the catalog, namespace, table identifier in that code rather than attempting to generate a unique enough name here?

This is possible for PK & FK, but hard for CHECK and UNIQUE constraints.

I am OK with including catalog and namespace, probably in the analyzer rule ResolveTableSpec. However, it is a bit tricky to get the table name from the current V2 CreateTable plan:

case class CreateTable(
    name: LogicalPlan,
    columns: Seq[ColumnDefinition],
    partitioning: Seq[Transform],
    tableSpec: TableSpecBase,
    ignoreIfExists: Boolean)

* @param ctx Parser context for error reporting
* @return New TableConstraint instance
*/
def withCharacteristic(c: ConstraintCharacteristic, ctx: ParserRuleContext): TableConstraint
aokolnychyi (Contributor) commented Apr 3, 2025

I wonder if it is a good idea to depend on ParserRuleContext here. It feels like TableConstraint, which is mixed into expressions, also handles parsing aspects. Can we handle name generation and characteristics in the code that instantiates these constraints?

gengliangwang (Member Author)

This is for better error messages. If ParseException could accept the current origin without specifying the start/stop Origin, we could remove this parameter.
Since these are internal classes, I suggest a follow-up to improve it.

    c: ConstraintCharacteristic,
    ctx: ParserRuleContext): TableConstraint = {
  if (c.enforced.contains(true)) {
    throw new ParseException(
Contributor

I feel we have to parse the characteristics and validate them prior to constructing this class.

gengliangwang (Member Author) commented Apr 3, 2025

The idea is to parse characteristics and constraints separately.

[screenshot: grammar rules showing the constraint and characteristic clauses parsed separately]

Otherwise, we would need duplicate syntax rules for each constraint and would have to put the characteristics checks in the AstBuilder (which has over 6k lines of code).

}

case class UniqueConstraint(
    columns: Seq[String],
Contributor

Question: If I remember correctly, PK can't reference nested columns. What about UNIQUE? Asking to confirm whether this should be a sequence of name parts.

gengliangwang (Member Author)

I don't think nested sub-columns (e.g., col_1.col_2) are supported in unique constraints. The syntax for UNIQUE is mostly the same as for PK.
cc @srielau for confirmation
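
For reference, a sketch of what the columns: Seq[String] shape implies (multi-column keys over top-level columns; the STRUCT example is hypothetical):

-- Top-level columns, including multi-column keys
CREATE TABLE t (a INT, b INT, UNIQUE (a, b));

-- Nested sub-columns are not expected to parse, per this thread:
-- CREATE TABLE t2 (s STRUCT<x: INT>, UNIQUE (s.x));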

gengliangwang (Member Author) commented Apr 4, 2025

@aokolnychyi @viirya All the tests passed. Any further comments on this one?
I will have follow-ups to revisit the points discussed above (e.g., the ParserRuleContext dependency).

viirya (Member) left a comment

I have no more comments. See if others still have some comments.

gengliangwang (Member Author)

@srielau @aokolnychyi @viirya Thanks for the review. I am merging this one to master.
