Support Gap Filling on Time Series Data

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

A common use case when working with time series data is to compute an aggregate value for windows of time, e.g., every minute, hour, week or whatever. It is possible to do this with the `DATE_BIN` function in DataFusion. However, `DATE_BIN` will not produce any value for a window that did not contain any rows.

For example, for this input date:

| time | c0 |
| --- | --- |
| 2022-12-01 | 10 |
| 2022-12-03 | 30 |

We might run this query;
```sql
select
  date_bin(interval '1 day', time, timestamp '1970-01-01T00:00:00Z') as day,
  avg(c0) 
from t
group by day;
```
And we would get something like:

| day | avg |
| --- | --- |
| 2022-12-01 | 10 |
| 2022-12-03 | 30 |

Generating a row in the output for `2022-12-02` is difficult to do with ANSI-SQL.  Here is one attempt: [Fill Gaps in Time Series with this simple trick in SQL](https://medium.com/codex/fill-gaps-in-time-series-with-this-simple-trick-in-sql-81ac655e5ad7). Having to write SQL like this for what is an intuitive and common use case is frustrating.

**Describe the solution you'd like**

It would be good to have a concise, idiomatic way to do this. Many vendors provide a solution for this problem. The have the following in common:
- They provide a way to break up an interval of time into contiguous windows
- They provide some kind of way to produce a value where there were no input rows

One such solution would be to use a function like TimeScale's functions `time_bucket_gapfill` and `locf` (last observation carried forward):
https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/time_bucket_gapfill/

The above query might be changed to this, using `time_bucket_gapfill` and `locf`:
```sql
select
  time_bucket_gapfill(interval '1 day', time, timestamp '1970-01-01T00:00:00Z') as day,
  avg(c0),
  locf(avg(c0))
from t
group by day;
```

| day | avg | locf |
| --- | --- | --- |
| 2022-12-01 | 10 | 10 |
| 2022-12-02 |  | 10 |   
| 2022-12-03 | 30 | 30 |

TimeScale also provides `interpolate` to populate a gap with an interpolated value (e.g., would put `20` in the gap for the example).

I've written up an approach to this work here:
https://docs.google.com/document/d/1vIcs9uhlCX_AkD9bemcDx-YhBOVe_TW5sBbXtKCHIfk/edit?usp=sharing

Initially we (InfluxData) were going to implement this in IOx directly, but seems like it could be worthy of upstreaming into DataFusion.

**Describe alternatives you've considered**

Postgres provides a general purpose way to generate data:
https://www.postgresql.org/docs/9.1/functions-srf.html#FUNCTIONS-SRF-SERIES
But this seems like it would be more difficult to use than something like `time_bucket_gapfill`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support Gap Filling on Time Series Data #4809

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Support Gap Filling on Time Series Data #4809

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions