Add support for Arrow string view type #1252


Merged
merged 1 commit into from
Mar 12, 2025

ivant
Contributor

@ivant ivant commented Mar 3, 2025

What this PR does / why we need it:

This PR adds support for Arrow's StringView data type. It is needed to correctly handle string data received from data sources that use newer Arrow libraries, which often send string-typed fields as StringView.

Which issue(s) this PR fixes: N/A

Special notes for your reviewer:

This change is needed for implementing a data source plugin for modern arrow-based data sources if one tries to rely on Grafana Plugin SDK's own Arrow-to-Frame translation. Without it, data source implementors would need to reimplement the Arrow-to-Frame translation in its entirety.

There might be other new Arrow data types that are not being handled in the Grafana Plugin SDK. This PR is intentionally narrowly focused on StringView (since this is one of the most common types), but once this is merged it should be easy to add more types and tests for them.

@ivant ivant requested a review from a team as a code owner March 3, 2025 22:51
@CLAassistant

CLAassistant commented Mar 3, 2025

CLA assistant check
All committers have signed the CLA.

@wbrowne
Contributor

wbrowne commented Mar 5, 2025

Hey @ivant 👋 Thanks for your contribution.

This change is needed for implementing a data source plugin for modern arrow-based data sources if one tries to rely on Grafana Plugin SDK's own Arrow-to-Frame translation.

We're always curious about how people are using our SDKs and whether there are areas to improve or dedicate more focus to. Is this aimed at SDK-built plugins that target Arrow-based backends, or is this for a separate use case?

@ivant
Contributor Author

ivant commented Mar 5, 2025

Hi @wbrowne! Happy to give more context!

The problem I am currently trying to solve is the lack of a well supported Arrow Flight plugin that can handle connections to datasources that use up-to-date versions of Arrow Flight libraries (see, for example, grafana/grafana#99936). Some recent changes to Arrow Flight made the old plugin all but incompatible (particularly due to introduction of StringView for representing strings, see also this excellent explanation).

One example where this can come up is using something like Roapi as a datasource. Old versions of Roapi were working OK (not great, but workable) with the old plugin, but an attempt to upgrade Roapi past the point where something like StringView was introduced would suddenly break all the queries that returned string-typed columns. Given that Arrow is being very actively developed, a plugin implementing support for it needs to track the newer versions of Arrow libraries, not stay frozen in time (as a "Public archive").
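
For readers unfamiliar with the type, here is a minimal sketch of how one StringView entry resolves to a string. It follows the layout described in the Arrow columnar format: a 16-byte view that starts with a 4-byte length, followed either by up to 12 bytes of inline data or by a 4-byte prefix, a buffer index, and an offset into that buffer. It uses only the Go standard library; `decodeView` and `makeView` are hypothetical helpers written for illustration and are not part of the SDK or of the Arrow Go library.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeView resolves one 16-byte StringView entry to a Go string.
// buffers holds the data buffers that long strings point into.
// Hypothetical helper for illustration only.
func decodeView(view [16]byte, buffers [][]byte) (string, error) {
	length := int(int32(binary.LittleEndian.Uint32(view[0:4])))
	if length < 0 {
		return "", fmt.Errorf("negative length %d", length)
	}
	if length <= 12 {
		// Short strings are stored inline, right after the length.
		return string(view[4 : 4+length]), nil
	}
	// Long strings: bytes 4..8 hold a prefix, then buffer index and offset.
	bufIdx := int(int32(binary.LittleEndian.Uint32(view[8:12])))
	offset := int(int32(binary.LittleEndian.Uint32(view[12:16])))
	if bufIdx < 0 || bufIdx >= len(buffers) {
		return "", fmt.Errorf("buffer index %d out of range", bufIdx)
	}
	buf := buffers[bufIdx]
	if offset < 0 || offset+length > len(buf) {
		return "", fmt.Errorf("view exceeds buffer of %d bytes", len(buf))
	}
	return string(buf[offset : offset+length]), nil
}

// makeView builds a view for s. For long strings it assumes s is stored
// at the given offset of buffer bufIdx. Hypothetical helper.
func makeView(s string, bufIdx, offset int32) [16]byte {
	var v [16]byte
	binary.LittleEndian.PutUint32(v[0:4], uint32(len(s)))
	if len(s) <= 12 {
		copy(v[4:], s)
		return v
	}
	copy(v[4:8], s[:4])
	binary.LittleEndian.PutUint32(v[8:12], uint32(bufIdx))
	binary.LittleEndian.PutUint32(v[12:16], uint32(offset))
	return v
}

func main() {
	data := []byte("a string longer than twelve bytes")
	views := [][16]byte{
		makeView("short", 0, 0),
		makeView(string(data), 0, 0),
	}
	for _, v := range views {
		s, err := decodeView(v, [][]byte{data})
		fmt.Printf("%q %v\n", s, err)
	}
}
```

A reader that only understands the classic offsets-plus-data string layout has no way to interpret these views, which is consistent with string-typed columns breaking after an upgrade.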

There are other Arrow data types that are currently missing:

  • BinaryView (same approach as StringView, but represented as json.RawMessage)
  • Timestamp (current support assumes UTC/nanoseconds, which does not always hold and breaks queries that return timestamps in other units or timezones)
  • Date32/Date64 (can be converted to Timestamp?)
  • Duration (would need support on the Grafana library/UI side)
  • Interval (distinct from Duration in that it is calendar aware)
  • Float16 (probably can be converted to Float32 for simplicity)
  • Time32/Time64 (these should probably be converted to something like a time interval, but vector doesn't support it)
  • Decimal32/Decimal64/Decimal128/Decimal256 (useful for representing currency amounts, would need support on the Grafana library/UI side)
  • Map and Struct (can be represented as json.RawMessage, though that might be suboptimal)
  • Dictionary
  • ...

Obviously, not all of these need to be implemented at once, and it might be OK to not support some of the types that do not make sense to represent in the Grafana UI (leaving it up to the user queries to perform the conversions to something that is representable in the UI).
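
Two of the conversions suggested in the list above are mechanical, so a rough sketch may help. It assumes the Arrow definitions of Date32 (days since the UNIX epoch), Date64 (milliseconds since the UNIX epoch), and Float16 (IEEE 754 half precision). The function names are hypothetical and nothing here reflects how the SDK would actually implement these types.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// date32ToTime converts an Arrow Date32 value (days since the UNIX epoch).
func date32ToTime(days int32) time.Time {
	return time.Unix(int64(days)*86400, 0).UTC()
}

// date64ToTime converts an Arrow Date64 value (milliseconds since the epoch).
func date64ToTime(ms int64) time.Time {
	return time.UnixMilli(ms).UTC()
}

// float16ToFloat32 widens IEEE 754 half-precision bits to a float32.
// Every half-precision value is exactly representable as a float32.
func float16ToFloat32(h uint16) float32 {
	sign := uint32(h>>15) << 31
	exp := uint32(h>>10) & 0x1f
	frac := uint32(h) & 0x3ff
	switch exp {
	case 0:
		// Zero or subnormal: the value is frac * 2^-24.
		v := float32(math.Ldexp(float64(frac), -24))
		if sign != 0 {
			v = -v
		}
		return v
	case 0x1f:
		// Infinity or NaN: keep the payload bits.
		return math.Float32frombits(sign | 0xff<<23 | frac<<13)
	default:
		// Normal number: rebias the exponent from 15 to 127.
		return math.Float32frombits(sign | (exp+112)<<23 | frac<<13)
	}
}

func main() {
	fmt.Println(date32ToTime(18993).Format(time.RFC3339))
	fmt.Println(date64ToTime(1640995200000).Format(time.RFC3339))
	fmt.Println(float16ToFloat32(0x3c00), float16ToFloat32(0xc000))
}
```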

From my point of view, I'd probably need to tackle the Timestamp issues soonish, because they cause very non-obvious problems in the Grafana UI when querying data represented by timestamps in different timezones or in different units. For example: the data returned by the datasource is in one timezone, but another timezone is assumed and presented to the user; the user thinks that the underlying data is stored in the wrong timezone and changes the query, which makes the query calculations wrong. I have a couple of easily reproducible issues that I need to convert to test cases; I will follow up with an issue/PR when I do.
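
To make the unit problem concrete, here is a small sketch of a unit-aware conversion. It assumes the Arrow Timestamp definition: an int64 count of seconds, milliseconds, microseconds, or nanoseconds since the UNIX epoch, plus an optional timezone string. `TimeUnit` and `timestampToTime` are hypothetical names, not SDK or Arrow library identifiers. Treating an empty timezone as UTC is a simplification made for this sketch, and timezones given as fixed offsets such as "+07:30" are not handled here.

```go
package main

import (
	"fmt"
	"time"
)

// TimeUnit is a hypothetical stand-in for the unit carried by an Arrow
// Timestamp type.
type TimeUnit int

const (
	Second TimeUnit = iota
	Millisecond
	Microsecond
	Nanosecond
)

// timestampToTime interprets v according to unit and attaches the named
// timezone for display. The instant itself never depends on the timezone.
func timestampToTime(v int64, unit TimeUnit, tz string) (time.Time, error) {
	var t time.Time
	switch unit {
	case Second:
		t = time.Unix(v, 0)
	case Millisecond:
		t = time.UnixMilli(v)
	case Microsecond:
		t = time.UnixMicro(v)
	case Nanosecond:
		t = time.Unix(0, v)
	default:
		return time.Time{}, fmt.Errorf("unknown time unit %d", unit)
	}
	if tz == "" {
		// Assumption of this sketch: render timezone-less values as UTC.
		return t.UTC(), nil
	}
	loc, err := time.LoadLocation(tz)
	if err != nil {
		return time.Time{}, fmt.Errorf("timezone %q: %w", tz, err)
	}
	return t.In(loc), nil
}

func main() {
	// The same raw value means very different instants under different units.
	for _, u := range []TimeUnit{Second, Nanosecond} {
		t, err := timestampToTime(1700000000, u, "")
		fmt.Println(t.Format(time.RFC3339Nano), err)
	}
}
```

Reading a seconds-based column as nanoseconds places every point about 1.7 seconds after the epoch instead of in late 2023, which is the kind of silent misinterpretation described above.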

@wbrowne
Contributor

wbrowne commented Mar 6, 2025

@ivant Thanks so much for the extra context - that's really helpful. Was the (now deprecated) Influx FlightSQL datasource plugin able to solve your use case apart from the recent Arrow updates? Is your goal now to build a new Arrow Flight plugin in that case?

Apologies as I lack a lot of context in the area of Influx specifically, but considering the note on the repo, do you think the Influx plugin would be the most appropriate place to have Arrow API support?

@ivant
Contributor Author

ivant commented Mar 7, 2025

Was the (now deprecated) Influx FlightSQL datasource plugin able to solve your use case apart from the recent Arrow updates?

It was able to solve my (company's) use case, but just barely, at a moment in time. Our needs grow and there are a bunch of things that we would really like to have, both in terms of bugfixes and functionality.

Is your goal now to build a new Arrow Flight plugin in that case?

That is my current endeavor, and I already have a prototype plugin that works with the new Arrow datasources, so we can finally start testing the new functionality of these datasources that was blocked by the lack of version support. I am starting to test this new plugin with existing queries and dashboards to see how large the gap is before we can fully migrate, and it looks surprisingly small.

Apologies as I lack a lot of context in the area of Influx specifically, but considering the note on the repo, do you think the Influx plugin would be the most appropriate place to have Arrow API support?

InfluxDB is a specific product, and at this point I have a very limited understanding of what they plan to support, though it is very likely that they will stick with Arrow for a while (due to their developer being deeply involved in Arrow's creation).

The code under https://github.com/grafana/grafana/tree/main/pkg/tsdb/influxdb/fsql seems to implement an Arrow/FlightSQL datasource as one of the 3 types of InfluxDB backends they support. From my very brief look at this code:

  • It is somewhat InfluxDB specific, which is confusing if you are setting it up with a non-InfluxDB data source.
  • (Used to be) very strict about the particular types of authentication it required.
    • I just checked and it looks like this is no longer true. For a while there was no way to use a TLS backend connection without setting up password auth.
  • Re-implements the part of the grafana-plugin-sdk-go library that handles Arrow-to-Frame conversion; it is unclear why.
  • Does not implement the StringView type, despite using a very recent Arrow library that supports it. I presume InfluxDB does not send string data using the StringView representation yet, so this is not a practical problem for them right now.
  • Does not handle timestamp units/timezones correctly.
  • Quietly fails on some queries, returning empty results instead of an error (Query Inspector shows a 200 result with an empty frame). I suspect this is related to incomplete support of Arrow data types and improper handling of them, but I'd need to investigate more to figure out.
    • This is actually really bad behavior. When an alert depends on the query, being unable to tell a query that found no data from a query that failed because the data type sent by the underlying database after an upgrade is unsupported is the difference between a missed page and a quickly detected infrastructure issue.
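
As an illustration of the last point, here is a minimal sketch of a conversion step that fails loudly on an unsupported type instead of returning an empty result. Everything in it is hypothetical: `column`, `supported`, `convert`, and the type-name strings are stand-ins invented for this example and do not correspond to the Grafana or Arrow APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUnsupportedType marks a column whose Arrow type has no conversion.
var ErrUnsupportedType = errors.New("unsupported Arrow type")

// column is a hypothetical stand-in for one Arrow column in a query result.
type column struct {
	Name      string
	ArrowType string
	Values    []string
}

// supported lists the hypothetical type names this sketch can convert.
var supported = map[string]bool{"utf8": true, "utf8_view": true}

// convert returns the converted columns, or an error naming the first
// column it cannot handle. It never uses an empty result to signal failure.
func convert(cols []column) (map[string][]string, error) {
	out := make(map[string][]string, len(cols))
	for _, c := range cols {
		if !supported[c.ArrowType] {
			return nil, fmt.Errorf("column %q has type %q: %w",
				c.Name, c.ArrowType, ErrUnsupportedType)
		}
		out[c.Name] = c.Values
	}
	return out, nil
}

func main() {
	_, err := convert([]column{{Name: "amount", ArrowType: "decimal128"}})
	fmt.Println(err, errors.Is(err, ErrUnsupportedType))
}
```

With this shape, a genuinely empty result and a failed conversion are distinguishable to the caller, so an alert rule can treat them differently.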

My opinion is that Arrow/FlightSQL datasource support should not be bundled into an InfluxDB-specific plugin, because such a plugin serves to establish compatibility with InfluxDB specifically, not with the more general variety of data sources that use Arrow/FlightSQL.

The unasked question here is probably whether my intention is to open source the plugin. The answer to that is "probably yes", though I might need to figure out whether I'd need to extract my company-specific features into a separate plugin to avoid adding "bloat" for external users who do not care about these features, and what is the best way to do that.

@wbrowne
Contributor

wbrowne commented Mar 10, 2025

Thanks again @ivant for the extra context. I'm just working to get the right folks' eyes on this and will get back to you ASAP!

@wbrowne
Contributor

wbrowne commented Mar 10, 2025

The unasked question here is probably whether my intention is to open source the plugin. The answer to that is "probably yes", though I might need to figure out whether I'd need to extract my company-specific features into a separate plugin to avoid adding "bloat" for external users who do not care about these features, and what is the best way to do that.

Awesome 👍 Feel free to consult the Plugins policy in that case. We would be happy to take a look if/when you decide the plugin belongs in the Grafana Catalog 😄

From here I think this is good to pass onto the codeowners @grafana/grafana-datasources-core-services.

Thanks again for the information and your patience, @ivant!

@wbrowne added labels on Mar 10, 2025: enhancement (New feature or request); arrow (https://godoc.org/github.com/apache/arrow/go/arrow); area/dataplane (Dataplane Project, Data type contract)
Contributor

@gabor gabor left a comment

@ivant thanks for the contribution!

PR looks fine, going to merge it.

@gabor gabor merged commit fd1af1b into grafana:main Mar 12, 2025
3 checks passed