Some APIs feel a bit unintuitive, and I think Polars has really excelled in this area. My suggestion is that we reuse some of those APIs or take inspiration from them. Changes I am proposing (I am happy to work on these areas, especially with datafusion-ray becoming a thing):
- `DataFrame.cache() -> DataFrame` ===> `DataFrame.collect() -> DataFrame`
- `DataFrame.collect() -> list[pyarrow.RecordBatch]` ===> `DataFrame.to_batches() -> list[pyarrow.RecordBatch]`
- `DataFrame.join` ===> `DataFrame.join(right: DataFrame, on: str | Sequence[str] | None, left_on: str | Sequence[str] | None, right_on: str | Sequence[str] | None)`
- `DataFrame.schema -> pyarrow.Schema` ===> `DataFrame.schema -> datafusion.Schema` (map Rust Arrow types to datafusion-py types)
- `DataFrame.with_column` ===> `DataFrame.with_columns` (allow multiple inputs as exprs or key-value pairs)
- `DataFrame.with_column_renamed` ===> `DataFrame.rename()` (a simple rename is clear enough and should allow a dict as input)
- `DataFrame.aggregate` ===> `DataFrame.group_by().agg()` (this feels more natural coming from PySpark/Polars/Pandas)

Can remove these:

- `DataFrame.select_columns` (already covered by `DataFrame.select`)

Missing APIs:

- `DataFrame.cast` to cast a single column or multiple columns at the top level
- `DataFrame.drop` to drop columns, instead of writing a very verbose select
- `DataFrame.fill_null`/`fill_nan` to fill null or NaN values
- `DataFrame.interpolate` to interpolate values per column
- `DataFrame.head`/`tail`
- `DataFrame.pivot`
- `DataFrame.unpivot`

Optional but useful:

- `DataFrame.with_row_idx`