Some APIs feel a bit unintuitive, and I think Polars has really excelled in this area. My suggestion is that we reuse some of those APIs or take inspiration from them. Changes I am proposing (I am happy to work on these areas, especially with datafusion-ray becoming a thing):
- `DataFrame.cache() -> DataFrame` ===> `DataFrame.collect() -> DataFrame`
- `DataFrame.collect() -> list[pyarrow.RecordBatch]` ===> `DataFrame.to_batches() -> list[pyarrow.RecordBatch]`
- `DataFrame.join` ===> `DataFrame.join(right: DataFrame, on: str | Sequence[str] | None, left_on: str | Sequence[str] | None, right_on: str | Sequence[str] | None)`
- `DataFrame.schema -> pyarrow.Schema` ===> `DataFrame.schema -> datafusion.Schema` (map Rust Arrow types to datafusion-py types)
- `DataFrame.with_column` ===> `DataFrame.with_columns` (allow multiple inputs as exprs or key-value pairs)
- `DataFrame.with_column_renamed` ===> `DataFrame.rename()` (a simple rename is clear enough and should allow a dict as input)
- `DataFrame.aggregate` ===> `DataFrame.group_by().agg()` (this feels more natural coming from PySpark/Polars/Pandas)

Can remove these:

- `DataFrame.select_columns` (already covered by `DataFrame.select`)

Missing APIs:

- `DataFrame.cast` to cast a single column or multiple columns at the top level
- `DataFrame.drop` to drop columns, instead of writing a very verbose select
- `DataFrame.fill_null`/`fill_nan` to fill null or NaN values
- `DataFrame.interpolate` to interpolate values per column
- `DataFrame.head`/`tail`
- `DataFrame.pivot`
- `DataFrame.unpivot`

Optional but useful:

- `DataFrame.with_row_idx`