Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A common use case when working with time series data is to compute an aggregate value for windows of time, e.g., every minute, hour, week or whatever. It is possible to do this with the DATE_BIN
function in DataFusion. However, DATE_BIN
will not produce any value for a window that did not contain any rows.
For example, for this input date:
time | c0 |
---|---|
2022-12-01 | 10 |
2022-12-03 | 30 |
We might run this query;
select
date_bin(interval '1 day', time, timestamp '1970-01-01T00:00:00Z') as day,
avg(c0)
from t
group by day;
And we would get something like:
day | avg |
---|---|
2022-12-01 | 10 |
2022-12-03 | 30 |
Generating a row in the output for 2022-12-02
is difficult to do with ANSI-SQL. Here is one attempt: Fill Gaps in Time Series with this simple trick in SQL. Having to write SQL like this for what is an intuitive and common use case is frustrating.
Describe the solution you'd like
It would be good to have a concise, idiomatic way to do this. Many vendors provide a solution for this problem. The have the following in common:
- They provide a way to break up an interval of time into contiguous windows
- They provide some kind of way to produce a value where there were no input rows
One such solution would be to use a function like TimeScale's functions time_bucket_gapfill
and locf
(last observation carried forward):
https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/time_bucket_gapfill/
The above query might be changed to this, using time_bucket_gapfill
and locf
:
select
time_bucket_gapfill(interval '1 day', time, timestamp '1970-01-01T00:00:00Z') as day,
avg(c0),
locf(avg(c0))
from t
group by day;
day | avg | locf |
---|---|---|
2022-12-01 | 10 | 10 |
2022-12-02 | 10 | |
2022-12-03 | 30 | 30 |
TimeScale also provides interpolate
to populate a gap with an interpolated value (e.g., would put 20
in the gap for the example).
I've written up an approach to this work here:
https://docs.google.com/document/d/1vIcs9uhlCX_AkD9bemcDx-YhBOVe_TW5sBbXtKCHIfk/edit?usp=sharing
Initially we (InfluxData) were going to implement this in IOx directly, but seems like it could be worthy of upstreaming into DataFusion.
Describe alternatives you've considered
Postgres provides a general purpose way to generate data:
https://www.postgresql.org/docs/9.1/functions-srf.html#FUNCTIONS-SRF-SERIES
But this seems like it would be more difficult to use than something like time_bucket_gapfill
.