[PATHFINDING] Parse json as variant #7403

Draft
wants to merge 2 commits into
base: main

scovich
Contributor

@scovich scovich commented Apr 10, 2025

This is a pathfinding exercise, to see how easy or hard it might be to parse JSON text into parquet's new variant type using the tape decoder. It is not intended to merge; it is more of a conversation starter.

In particular:

  • It would be better to leverage a general variant library for variant bit-wrangling instead of doing it all manually here.
  • TBD Where/how to expose this functionality through a public API
  • Still TBD how to assemble a bunch of variant metadata+value pairs inside an arrow array data that can eventually become a usable arrow array
  • For comparison, the same exercise is repeated using serde_json instead. This would almost certainly not belong in an actual contribution to arrow-json.
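
To make "variant bit-wrangling" concrete, here is a minimal std-only sketch of encoding a few scalar values. It is not code from this PR. It assumes the header layout from the Parquet variant encoding spec as I read it (low 2 bits are the basic type, high 6 bits are the primitive type ID or the short-string length, payloads are little-endian), and every function and constant name is hypothetical. Check the constants against the spec before relying on them.

```rust
// Hypothetical helpers for illustration only; not part of this PR or of any arrow-rs API.
// Assumed header layout: (value_header << 2) | basic_type.

const BASIC_PRIMITIVE: u8 = 0; // assumed basic_type for primitives
const BASIC_SHORT_STRING: u8 = 1; // assumed basic_type for strings shorter than 64 bytes

// Primitive type IDs, as assumed from the spec.
const TYPE_NULL: u8 = 0;
const TYPE_TRUE: u8 = 1;
const TYPE_FALSE: u8 = 2;
const TYPE_INT8: u8 = 3;
const TYPE_INT16: u8 = 4;
const TYPE_INT32: u8 = 5;
const TYPE_INT64: u8 = 6;

/// Pack the 2-bit basic type and the 6-bit value header into one byte.
fn header(basic_type: u8, value_header: u8) -> u8 {
    debug_assert!(basic_type < 4 && value_header < 64);
    (value_header << 2) | basic_type
}

/// Build a primitive value: one header byte followed by the payload bytes.
fn primitive(type_id: u8, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + payload.len());
    out.push(header(BASIC_PRIMITIVE, type_id));
    out.extend_from_slice(payload);
    out
}

fn encode_null() -> Vec<u8> {
    primitive(TYPE_NULL, &[])
}

fn encode_bool(b: bool) -> Vec<u8> {
    primitive(if b { TYPE_TRUE } else { TYPE_FALSE }, &[])
}

/// Encode an integer using the narrowest width that can hold it.
fn encode_int(v: i64) -> Vec<u8> {
    if let Ok(x) = i8::try_from(v) {
        primitive(TYPE_INT8, &x.to_le_bytes())
    } else if let Ok(x) = i16::try_from(v) {
        primitive(TYPE_INT16, &x.to_le_bytes())
    } else if let Ok(x) = i32::try_from(v) {
        primitive(TYPE_INT32, &x.to_le_bytes())
    } else {
        primitive(TYPE_INT64, &v.to_le_bytes())
    }
}

/// Encode a short string; returns `None` if the byte length does not fit in 6 bits.
fn encode_short_string(s: &str) -> Option<Vec<u8>> {
    let len = u8::try_from(s.len()).ok().filter(|n| *n < 64)?;
    let mut out = vec![header(BASIC_SHORT_STRING, len)];
    out.extend_from_slice(s.as_bytes());
    Some(out)
}
```

Under these assumptions `encode_int(300)` picks the 16-bit form, and a string of 64 bytes or more is rejected by `encode_short_string` and would need the long-string primitive instead. Objects and arrays, which also need the metadata dictionary, are where a shared variant library would pay off most.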

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 10, 2025
@scovich
Contributor Author

scovich commented Apr 10, 2025

Attn @alamb

@alamb
Contributor

alamb commented Apr 11, 2025

See also the related PR for variant here:

@alamb
Contributor

alamb commented Apr 11, 2025

Thank you for this PR @scovich

It would be better to leverage a general variant library for variant bit-wrangling instead of doing it all manually here.

TBD Where/how to expose this functionality through a public API

In my mind this functionality feels like a "computation kernel" (i.e. similar to the functions in https://docs.rs/arrow/latest/arrow/compute/index.html)

The signature seems like it would roughly be something like:

/// Convert text stored as JSON in an input `StringArray`, `LargeStringArray` or `StringViewArray` into
/// a single "Variant" array (`StructArray` with an extension type)
fn json_to_variant(input: &ArrayRef) -> ArrayRef {
 ...
}

Since the arrow-json crate is currently for converting JSON to arrow, it is not 100% clear to me that this functionality belongs in the arrow-json crate at all, especially as variant does not seem to be part of the "core" arrow spec.

Still TBD how to assemble a bunch of variant metadata+value pairs inside an arrow array data that can eventually become a usable arrow array

I think we will sort this out as part of implementing variant in #6736. TL;DR: via a StructArray annotated with an extension type, I think.
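
As a rough illustration of the columnar shape this implies, the sketch below collects metadata+value pairs into two variable-length binary columns that share a row index. It is a std-only stand-in, not arrow code: `BinaryColumn` and `VariantColumn` are hypothetical names, `usize` offsets are used for simplicity, and nulls and the extension-type annotation are left out. A real implementation would presumably build the equivalent `StructArray` with arrow's own builders.

```rust
/// Std-only stand-in for one variable-length binary child column:
/// `offsets[i]..offsets[i + 1]` delimits row `i` inside `data`.
#[derive(Debug)]
struct BinaryColumn {
    offsets: Vec<usize>,
    data: Vec<u8>,
}

impl BinaryColumn {
    fn new() -> Self {
        // The offsets buffer always starts with 0 and has one more entry than there are rows.
        Self { offsets: vec![0], data: Vec::new() }
    }

    fn push(&mut self, bytes: &[u8]) {
        self.data.extend_from_slice(bytes);
        self.offsets.push(self.data.len());
    }

    fn get(&self, row: usize) -> &[u8] {
        &self.data[self.offsets[row]..self.offsets[row + 1]]
    }

    fn len(&self) -> usize {
        self.offsets.len() - 1
    }
}

/// Hypothetical stand-in for a struct-typed column with `metadata` and `value` children.
#[derive(Debug)]
struct VariantColumn {
    metadata: BinaryColumn,
    value: BinaryColumn,
}

impl VariantColumn {
    fn new() -> Self {
        Self { metadata: BinaryColumn::new(), value: BinaryColumn::new() }
    }

    /// Append one row; both children grow together so their row counts stay equal.
    fn push(&mut self, metadata: &[u8], value: &[u8]) {
        self.metadata.push(metadata);
        self.value.push(value);
    }

    /// Borrow the (metadata, value) byte slices for one row.
    fn get(&self, row: usize) -> (&[u8], &[u8]) {
        (self.metadata.get(row), self.value.get(row))
    }

    fn len(&self) -> usize {
        self.value.len()
    }
}
```

Each parsed JSON document would contribute one `push` call, so the per-row byte buffers produced by the decoder never need to be kept around individually.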

@scovich
Contributor Author

scovich commented Apr 16, 2025

TBD Where/how to expose this functionality through a public API

In my mind this functionality feels like a "computation kernel" (i.e. similar to the functions in https://docs.rs/arrow/latest/arrow/compute/index.html)

Since the arrow-json crate is currently for converting JSON to arrow, it is not 100% clear to me that this functionality belongs in the arrow-json crate at all, especially as variant does not seem to be part of the "core" arrow spec.

I agree something like arrow-compute makes a lot of sense. Unfortunately, the tape decoder machinery is private to the arrow-json crate, so I had to do the initial pathfinding here. Is there a better way forward?

@alamb
Contributor

alamb commented Apr 16, 2025

I agree something like arrow-compute makes a lot of sense. Unfortunately, the tape decoder machinery is private to the arrow-json crate, so I had to do the initial pathfinding here. Is there a better way forward?

Some other options might be (not sure which one we should go with):

  1. copy/paste the code to avoid a dependency
  2. refactor the tape machinery into a new crate that they can both depend on

I have been thinking a lot about how we should introduce variant. What do you think about a crate structure like this:

  • variant: Core definition of the open variant type, no dependencies
  • arrow-variant: Arrow extension type for variant, including conversion to/from JSON and arrow arrays (e.g. a compute kernel, etc)

I think, depending on how arrow-variant is implemented, maybe it depends directly on arrow-json, and maybe we expose the relevant parts.

@alamb
Contributor

alamb commented Apr 18, 2025

I filed #7423 to track this item
