GH-3070: Add Variant logical type annotation to parquet-java #3072

Merged
rdblue merged 11 commits into apache:master from aixu-add-variant-logical-type
Apr 17, 2025

aihuaxu
Contributor

@aihuaxu aihuaxu commented Nov 22, 2024

Rationale for this change

This adds the Variant logical type to parquet-java so that dependent projects can use it.

What changes are included in this PR?

The Variant logical type has been added to LogicalTypeAnnotation. For variant columns, the corresponding Parquet group is annotated as VARIANT(), indicating that the variant data may be encoded according to the specified or lower version. Readers can use this version information to validate compatibility and fail early if the version is not supported.
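As a rough illustration of the "fail early" check described above (not code from this PR), a reader could compare the version carried by the annotation against the highest version it supports. Everything in this sketch is hypothetical: the class `VariantVersionCheck`, its exception type, the `MAX_SUPPORTED_VERSION` value, and the assumption that versions start at 1. It does not use the parquet-java API.

```java
/** Hypothetical sketch of a reader-side version check; not the parquet-java API. */
final class VariantVersionCheck {
    /** Highest Variant spec version this illustrative reader understands (assumed value). */
    static final int MAX_SUPPORTED_VERSION = 1;

    /** Thrown when a column declares a version the reader cannot decode. */
    static final class UnsupportedVariantVersionException extends RuntimeException {
        UnsupportedVariantVersionException(String message) {
            super(message);
        }
    }

    /**
     * Data annotated with a given version may be encoded at that version or lower,
     * so the reader must support at least the annotated version.
     * Assumption: valid versions start at 1.
     */
    static boolean isReadable(int annotatedVersion, int maxSupported) {
        return annotatedVersion >= 1 && annotatedVersion <= maxSupported;
    }

    /** Fails before any data is read if the annotated version is not supported. */
    static void validate(String columnName, int annotatedVersion) {
        if (!isReadable(annotatedVersion, MAX_SUPPORTED_VERSION)) {
            throw new UnsupportedVariantVersionException(
                "Column '" + columnName + "' declares Variant version " + annotatedVersion
                    + ", but this reader supports versions up to " + MAX_SUPPORTED_VERSION);
        }
    }

    public static void main(String[] args) {
        validate("event_payload", 1); // supported version: no exception
        try {
            validate("event_payload", 2); // newer than the reader supports
        } catch (UnsupportedVariantVersionException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Running the check when the schema is opened, rather than when values are decoded, is what lets an incompatible file be rejected before any variant bytes are interpreted.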

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes. Variant logical type is available.

Closes #3070

@aihuaxu
Contributor Author

aihuaxu commented Nov 22, 2024

@Fokko, @wgtmac Can you help check whether this is the right implementation? Thanks.

@wgtmac
Member

wgtmac commented Nov 22, 2024

Do we need to support writing and reading variant data?

@aihuaxu
Contributor Author

aihuaxu commented Nov 22, 2024

The Variant reading and writing are getting implemented in Iceberg and/or the engines themselves. I think later we can think of pulling the implementation to Parquet if needed.

@emkornfield
Contributor

> The Variant reading and writing are getting implemented in Iceberg and/or the engines themselves. I think later we can think of pulling the implementation to Parquet if needed.

I think this is problematic if the spec lives in parquet and doesn't have a complete implementation per previously agreed upon guidelines for new parquet features. This probably warrants a discussion on the mailing list. CC @julienledem @rdblue @RussellSpitzer

@Fokko
Contributor

Fokko commented Nov 26, 2024

@aihuaxu I agree with @emkornfield that the iceberg-java implementation should be able to read and write the variant type.

It would also be great to add some example Parquet files to https://github.com/apache/parquet-testing. This will also help adoption by other implementations; see apache/parquet-format#456 (comment).

@wgtmac
Member

wgtmac commented Nov 26, 2024

Usually we need two reference implementations for spec changes like this. I'm not sure if there is any chance to have another implementation ready in a timely manner. IMO, at least parquet-java should support basic roundtrip read and write.

@aihuaxu
Contributor Author

aihuaxu commented Nov 26, 2024

I see. Per the guideline, we need to have an implementation in parquet-java and then a second one. Do we usually include the implementation with this annotation change, or should it be separate?

> Completeness: The goal of this phase is to ensure the feature is viable, there is no ambiguity in its specification by demonstrating compatibility between implementations. Once a change has lazy consensus, two implementations of the feature demonstrating interoperability must also be provided. One implementation MUST be parquet-java. It is preferred that the second implementation be parquet-cpp or parquet-rs,

@wgtmac
Member

wgtmac commented Nov 27, 2024

I think it should be in one change. parquet-format cannot be released without a concrete PoC implementation in parquet-java. Without that release, separate changes may break CI and thus cannot be merged.

@aihuaxu
Contributor Author

aihuaxu commented Feb 22, 2025

@wgtmac With https://github.com/apache/parquet-java/pull/3117/files implementing encoding/decoding, should we consider merging this separately?

@wgtmac
Member

wgtmac commented Feb 23, 2025

I think it at least needs the conversion to and from the Thrift definition of the variant type, so we need to wait for the release of parquet-format 2.11.0.
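For illustration only, the conversion between an in-memory annotation and its Thrift counterpart could look roughly like the sketch below. `ThriftVariantType`, `VariantAnnotation`, the field names, and the fallback to version 1 when the Thrift field is unset are all hypothetical stand-ins; they are not the generated parquet-format classes or the code in this PR.

```java
/** Hypothetical sketch of a to/from-Thrift round trip; not the generated parquet-format classes. */
final class VariantThriftRoundTrip {
    /** Assumed fallback when the Thrift side leaves the version unset. */
    static final byte ASSUMED_DEFAULT_VERSION = 1;

    /** Stand-in for the Thrift-side struct; null models an unset optional field. */
    static final class ThriftVariantType {
        final Byte specificationVersion;

        ThriftVariantType(Byte specificationVersion) {
            this.specificationVersion = specificationVersion;
        }
    }

    /** Stand-in for the in-memory logical type annotation. */
    static final class VariantAnnotation {
        final byte specVersion;

        VariantAnnotation(byte specVersion) {
            this.specVersion = specVersion;
        }
    }

    /** Writes the annotation's version into the Thrift-side struct. */
    static ThriftVariantType toThrift(VariantAnnotation annotation) {
        return new ThriftVariantType(annotation.specVersion);
    }

    /** Reads the version back, falling back to the assumed default when unset. */
    static VariantAnnotation fromThrift(ThriftVariantType thrift) {
        byte version = thrift.specificationVersion != null
            ? thrift.specificationVersion
            : ASSUMED_DEFAULT_VERSION;
        return new VariantAnnotation(version);
    }
}
```

The property worth testing in any real implementation is that `fromThrift(toThrift(a))` yields an annotation equal to `a`, so that file metadata survives a write followed by a read.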

@aihuaxu aihuaxu force-pushed the aixu-add-variant-logical-type branch from 9473e1f to e7c97e6 Compare March 25, 2025 00:57
@aihuaxu aihuaxu requested a review from rdblue March 25, 2025 01:36
@aihuaxu aihuaxu force-pushed the aixu-add-variant-logical-type branch from b67c034 to a683f6a Compare March 25, 2025 16:47
@aihuaxu aihuaxu force-pushed the aixu-add-variant-logical-type branch 2 times, most recently from af4576a to ba4bbdf Compare March 26, 2025 18:01
@aihuaxu aihuaxu requested a review from rdblue March 26, 2025 18:07
@aihuaxu aihuaxu requested a review from rdblue April 2, 2025 00:00
@aihuaxu aihuaxu force-pushed the aixu-add-variant-logical-type branch from 20d29b5 to 707a0a0 Compare April 10, 2025 16:20
@aihuaxu aihuaxu force-pushed the aixu-add-variant-logical-type branch from 59bad9e to 3dcb486 Compare April 11, 2025 23:50
@aihuaxu aihuaxu force-pushed the aixu-add-variant-logical-type branch from c717a05 to eb679e8 Compare April 15, 2025 22:25
@rdblue rdblue merged commit 66e0c4e into apache:master Apr 17, 2025
7 checks passed
@rdblue
Contributor

rdblue commented Apr 17, 2025

Thanks, @aihuaxu! And thanks to @emkornfield for taking a look as well.
