
Can't convert date_bin aggregated with count(*) to arrow if some windows contain null data #862

@rickspencer3

Describe the bug
When using count(*) to aggregate data with date_bin where some of the windows contain no data, the datafusion DataFrame object is created without error, but calling to_arrow_table() on it raises the exception shown below.

To Reproduce
The following code reproduces the error, as I see it.

from datafusion import SessionContext

ctx = SessionContext()

test_data = [
    {"id": 1, "created_at": "2024-09-07 10:01:05", "content": "First entry in first minute"},
    {"id": 2, "created_at": "2024-09-07 10:01:45", "content": "Second entry in first minute"},
    {"id": 3, "created_at": "2024-09-07 10:03:10", "content": "First entry in third minute"},
    {"id": 4, "created_at": "2024-09-07 10:03:55", "content": "Second entry in third minute"},
]

ctx.from_pylist(test_data, "count_me")

sql = """
    SELECT
        DATE_BIN(INTERVAL '1 minute', created_at) AS time_window,
        COUNT(*) AS count
    FROM
        count_me
    GROUP BY
        time_window
    ORDER BY
        time_window;
"""

df = ctx.sql(sql)
df.to_arrow_table()

This results in:

pyarrow.lib.ArrowInvalid: Schema at index 0 was different: 
time_window: timestamp[ns]
count: int64 not null
vs
time_window: timestamp[ns]
count: int64

Expected behavior
I would expect there to be three rows of data, one per one-minute window from 10:01 through 10:03, with the count column being [2, 0, 2].
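
To make the expected result concrete, here is a plain-Python sketch that bins the same four timestamps into one-minute windows and fills the empty window with 0. It is not how datafusion computes anything, and count_per_minute is a hypothetical helper. Note that a plain GROUP BY normally emits only groups that have at least one row, so the middle row here comes from explicit gap filling.

```python
from collections import Counter
from datetime import datetime, timedelta

# The created_at values from the reproduction above.
created_at = [
    "2024-09-07 10:01:05",
    "2024-09-07 10:01:45",
    "2024-09-07 10:03:10",
    "2024-09-07 10:03:55",
]


def count_per_minute(timestamps):
    """Count rows per one-minute window, emitting 0 for empty windows."""
    # Truncate every timestamp to the start of its minute.
    minutes = [
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(second=0)
        for ts in timestamps
    ]
    counts = Counter(minutes)
    window, last = min(minutes), max(minutes)
    rows = []
    # Walk every window from first to last, including empty ones.
    while window <= last:
        rows.append((window, counts.get(window, 0)))
        window += timedelta(minutes=1)
    return rows


rows = count_per_minute(created_at)
for window, count in rows:
    print(window, count)
```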

Labels: bug
