RFC: Re-work some DataFrame APIs #875

@ion-elgreco

Some APIs feel a bit unintuitive. I think Polars has really excelled in this area, so my suggestion is that we reuse some of its APIs or take inspiration from them. These are the changes I am proposing (I am happy to work on them, especially with datafusion-ray becoming a thing):

  • DataFrame.cache() -> DataFrame ===> DataFrame.collect() -> DataFrame
  • DataFrame.collect() -> list[pyarrow.RecordBatch] ===> DataFrame.to_batches() -> list[pyarrow.RecordBatch]
  • DataFrame.join ===> DataFrame.join(right: DataFrame, on: str | sequence[str] | None, left_on: str | sequence[str] | None, right_on: str | sequence[str] | None)
  • DataFrame.schema -> pyarrow.Schema ===> DataFrame.schema -> datafusion.Schema: map Rust arrow types to datafusion-py types
  • DataFrame.with_column ===> DataFrame.with_columns: allow multiple inputs, as exprs or key-value pairs
  • DataFrame.with_column_renamed ===> DataFrame.rename(): a simple rename is clear enough and should accept a dict as input
  • DataFrame.aggregate ===> DataFrame.group_by().agg(): this feels more natural coming from PySpark/Polars/Pandas
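
To make a few of the renames above concrete, here is a rough sketch of how the proposed names could sit on top of the current ones. Everything in it is hypothetical: `ProposedFrame` and `resolve_join_keys` are names made up for illustration, and the wrapped object is only assumed to expose the current methods mentioned in this list (`cache()`, `collect()`, `with_column(name, expr)`, `with_column_renamed(old, new)`). It is not how datafusion-python implements or would implement any of this.

```python
from __future__ import annotations

from typing import Any, Mapping, Sequence


def resolve_join_keys(
    on: str | Sequence[str] | None = None,
    left_on: str | Sequence[str] | None = None,
    right_on: str | Sequence[str] | None = None,
) -> tuple[list[str], list[str]]:
    """Turn on / left_on / right_on into a (left_keys, right_keys) pair."""

    def as_list(keys: str | Sequence[str]) -> list[str]:
        # A single column name is treated as a one-element key list.
        return [keys] if isinstance(keys, str) else list(keys)

    if on is not None:
        if left_on is not None or right_on is not None:
            raise ValueError("pass either `on` or `left_on`/`right_on`, not both")
        keys = as_list(on)
        return keys, list(keys)
    if left_on is None or right_on is None:
        raise ValueError("`left_on` and `right_on` must be given together")
    left, right = as_list(left_on), as_list(right_on)
    if len(left) != len(right):
        raise ValueError("`left_on` and `right_on` must have the same length")
    return left, right


class ProposedFrame:
    """Hypothetical adapter: proposed method names over an existing frame."""

    def __init__(self, inner: Any) -> None:
        # `inner` is assumed to offer the current method names (see lead-in).
        self.inner = inner

    def collect(self) -> ProposedFrame:
        # Proposed collect() returns a frame, which is what cache() does today.
        return ProposedFrame(self.inner.cache())

    def to_batches(self) -> list:
        # Proposed to_batches() returns what collect() returns today.
        return self.inner.collect()

    def with_columns(self, **named_exprs: Any) -> ProposedFrame:
        # Key-value form only; each pair becomes one with_column call.
        inner = self.inner
        for name, expr in named_exprs.items():
            inner = inner.with_column(name, expr)
        return ProposedFrame(inner)

    def rename(self, mapping: Mapping[str, str]) -> ProposedFrame:
        # Applied one pair at a time, in dict order.
        inner = self.inner
        for old, new in mapping.items():
            inner = inner.with_column_renamed(old, new)
        return ProposedFrame(inner)
```

With this, `ProposedFrame(df).rename({"a": "x", "b": "y"})` would issue two `with_column_renamed` calls. Because the renames are applied sequentially, a swap such as `{"a": "b", "b": "a"}` would not behave as a simultaneous rename; a real implementation would probably want to build a single projection instead.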

Can remove these:

  • DataFrame.select_columns: already covered by DataFrame.select

Missing APIs:

  • DataFrame.cast: cast a single column or multiple columns at the top level
  • DataFrame.drop: drop columns, instead of writing a very verbose select
  • DataFrame.fill_null/fill_nan: fill null or NaN values
  • DataFrame.interpolate: interpolate values per column
  • As-of join (missing in the DataFrame API?)
  • Join on (inequality join)
  • DataFrame.head/tail
  • DataFrame.pivot
  • DataFrame.unpivot
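
As an illustration of why `drop` is worth having, this is roughly the bookkeeping a caller has to do by hand today to express "everything except these columns" as a select. The helper name `columns_to_keep` is hypothetical and operates on plain column names only; it is a sketch of the idea, not a proposed implementation.

```python
from __future__ import annotations

from typing import Iterable, Sequence


def columns_to_keep(all_columns: Sequence[str], drop: Iterable[str]) -> list[str]:
    """Return the columns that remain after dropping `drop`, in original order."""
    drop_set = set(drop)
    # Fail loudly on names that do not exist, rather than silently ignoring them.
    missing = drop_set - set(all_columns)
    if missing:
        raise KeyError(f"cannot drop unknown columns: {sorted(missing)}")
    return [name for name in all_columns if name not in drop_set]
```

Assuming `df.schema()` returns a `pyarrow.Schema` (as described earlier in this issue), the current spelling would be something like `df.select_columns(*columns_to_keep(df.schema().names, ["tmp"]))`, whereas the proposed API would reduce that to `df.drop("tmp")`.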

Optional but useful:

  • DataFrame.with_row_idx

Labels: enhancement (New feature or request)