Support Gap Filling on Time Series Data #4809

Open
@wolffcm

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

A common use case when working with time series data is to compute an aggregate value for windows of time, e.g., every minute, hour, week or whatever. It is possible to do this with the DATE_BIN function in DataFusion. However, DATE_BIN will not produce any value for a window that did not contain any rows.

For example, for this input data:

time        c0
2022-12-01  10
2022-12-03  30

We might run this query:

select
  date_bin(interval '1 day', time, timestamp '1970-01-01T00:00:00Z') as day,
  avg(c0) 
from t
group by day;

And we would get something like:

day         avg
2022-12-01  10
2022-12-03  30

Generating a row in the output for 2022-12-02 is difficult to do with ANSI SQL. Here is one attempt: "Fill Gaps in Time Series with this simple trick in SQL". Having to write SQL like this for such an intuitive and common use case is frustrating.

Describe the solution you'd like

It would be good to have a concise, idiomatic way to do this. Many vendors provide a solution for this problem. They have the following in common:

  • They provide a way to break up an interval of time into contiguous windows
  • They provide some kind of way to produce a value where there were no input rows
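To make those two properties concrete, here is a minimal Python sketch of the desired behavior for the example data. The function name `gapfilled_daily_avg` and the explicit `start`/`end` bounds are assumptions made for illustration; they are not part of DataFusion or of any vendor's API.

```python
from datetime import date, timedelta
from statistics import mean


def gapfilled_daily_avg(rows, start, end):
    """Average values per day, emitting one row for every day in [start, end].

    Hypothetical helper: days with no input rows get None instead of
    being dropped, which is the behavior DATE_BIN + GROUP BY lacks.
    """
    # Property 1: assign each row to a contiguous one-day window.
    buckets = {}
    for day, value in rows:
        buckets.setdefault(day, []).append(value)

    # Property 2: walk every window in the range, including empty ones.
    out = []
    day = start
    while day <= end:
        values = buckets.get(day)
        out.append((day, mean(values) if values else None))
        day += timedelta(days=1)
    return out


rows = [(date(2022, 12, 1), 10), (date(2022, 12, 3), 30)]
result = gapfilled_daily_avg(rows, date(2022, 12, 1), date(2022, 12, 3))
for day, avg in result:
    print(day, avg)
```

The row for 2022-12-02 is present with an empty value, and a fill strategy can then decide what to put there.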

One such solution would be to provide functions like TimeScale's time_bucket_gapfill and locf (last observation carried forward):
https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/time_bucket_gapfill/

The above query might be changed to this, using time_bucket_gapfill and locf:

select
  time_bucket_gapfill(interval '1 day', time, timestamp '1970-01-01T00:00:00Z') as day,
  avg(c0),
  locf(avg(c0))
from t
group by day;

day         avg  locf
2022-12-01  10   10
2022-12-02       10
2022-12-03  30   30

TimeScale also provides interpolate, which populates a gap with an interpolated value (in the example above, it would put 20 in the gap for 2022-12-02).
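As a rough illustration of the two fill strategies, the following Python sketch applies them to the per-window averages from the example, with None marking the empty window. These are simplified stand-ins written for this issue, not TimeScale's implementations; in particular, the `interpolate` here assumes evenly spaced windows and linear interpolation, and it leaves leading and trailing gaps unfilled.

```python
def locf(values):
    """Last observation carried forward: reuse the most recent non-empty value."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)  # stays None until the first observation is seen
    return out


def interpolate(values):
    """Linearly interpolate gaps between known values (evenly spaced windows assumed)."""
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    # Fill each run of gaps that lies between two neighboring known points.
    for lo, hi in zip(known, known[1:]):
        for i in range(lo + 1, hi):
            frac = (i - lo) / (hi - lo)
            out[i] = values[lo] + frac * (values[hi] - values[lo])
    return out


avgs = [10, None, 30]     # 2022-12-01, 2022-12-02 (gap), 2022-12-03
print(locf(avgs))         # [10, 10, 30]
print(interpolate(avgs))  # [10, 20.0, 30]
```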

I've written up an approach to this work here:
https://docs.google.com/document/d/1vIcs9uhlCX_AkD9bemcDx-YhBOVe_TW5sBbXtKCHIfk/edit?usp=sharing

Initially we (InfluxData) were going to implement this in IOx directly, but it seems like it could be worth upstreaming into DataFusion.

Describe alternatives you've considered

Postgres provides a general purpose way to generate data:
https://www.postgresql.org/docs/9.1/functions-srf.html#FUNCTIONS-SRF-SERIES
But this seems like it would be more difficult to use than something like time_bucket_gapfill.
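For comparison, here is a sketch of the generate-a-series-then-join pattern on the example data. It uses SQLite through Python's sqlite3 module only so that it can run without a database server; the recursive CTE stands in for Postgres's generate_series, and the table and column names follow the example above. Note that the user has to spell out the series bounds and the join by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (time text, c0 real)")
conn.executemany(
    "insert into t values (?, ?)",
    [("2022-12-01", 10), ("2022-12-03", 30)],
)

# The recursive CTE generates one row per day (a stand-in for
# generate_series); the left join keeps days that have no input rows.
query = """
with recursive days(day) as (
  select '2022-12-01'
  union all
  select date(day, '+1 day') from days where day < '2022-12-03'
)
select days.day, avg(t.c0)
from days
left join t on date(t.time) = days.day
group by days.day
order by days.day
"""
filled = conn.execute(query).fetchall()
for row in filled:
    print(row)
```

This produces a row for 2022-12-02 with a NULL average, but filling it with a carried-forward or interpolated value would still need further SQL on top.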
