Support for "Schema evolution" / Schema Adapters #6735

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

It is sometimes necessary to convert RecordBatches from one schema so that they match another. This often comes up when data is stored in several different sources (such as parquet files) that are "compatible" but not exactly the same (e.g. newer files may have new columns).

Common transformations are:

  • Reorder columns by name (so a file with (a int, b char) and one with (b char, a int) could be read as a single stream)
  • Insert missing columns (so a file with (a int, b char) and a file with (a int) could be merged)

It is also common to want to fill in missing columns with either Null or some constant (e.g. 0), so a controllable policy would be nice.

Note that these use cases are pretty similar to casting Structs (e.g. reordering fields with the same name but different position).

Computing the transformation is often non-trivial (e.g. matching columns by name), so it would be nice to do the mapping calculation once per schema rather than once per batch / StructArray. For example, DataFusion's SchemaAdapter computes the mapping once and can then apply it to multiple batches.
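To make the "compute once, apply many times" idea concrete, here is a minimal sketch using only the Rust standard library. It is a toy model, not the arrow-rs or DataFusion API: `Batch`, `Column`, `FillPolicy`, and `SchemaMapping` are hypothetical names invented for illustration, and columns are plain `Vec<Option<i64>>` rather than Arrow arrays. It assumes columns are matched by name only and ignores data types.

```rust
use std::collections::HashMap;

/// Toy column: nullable i64 values (a stand-in for an Arrow array).
type Column = Vec<Option<i64>>;

/// Toy batch: column names plus one column per name (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    names: Vec<String>,
    columns: Vec<Column>,
}

/// Policy for columns in the target schema that the source lacks.
#[derive(Debug, Clone, Copy)]
enum FillPolicy {
    Null,
    Constant(i64),
}

/// Precomputed mapping: for each target column, the source index if present.
struct SchemaMapping {
    target_names: Vec<String>,
    source_index: Vec<Option<usize>>,
    fill: FillPolicy,
}

impl SchemaMapping {
    /// Match columns by name once per (source, target) schema pair.
    fn new(source: &[&str], target: &[&str], fill: FillPolicy) -> Self {
        let by_name: HashMap<&str, usize> =
            source.iter().enumerate().map(|(i, n)| (*n, i)).collect();
        SchemaMapping {
            target_names: target.iter().map(|n| n.to_string()).collect(),
            source_index: target.iter().map(|n| by_name.get(n).copied()).collect(),
            fill,
        }
    }

    /// Apply the mapping to one batch; no name lookups happen here.
    fn map_batch(&self, batch: &Batch) -> Batch {
        let num_rows = batch.columns.first().map_or(0, |c| c.len());
        let columns = self
            .source_index
            .iter()
            .map(|idx| match idx {
                // Column exists in the source: copy it into its target slot.
                Some(i) => batch.columns[*i].clone(),
                // Column is missing: synthesize it according to the policy.
                None => match self.fill {
                    FillPolicy::Null => vec![None; num_rows],
                    FillPolicy::Constant(v) => vec![Some(v); num_rows],
                },
            })
            .collect();
        Batch { names: self.target_names.clone(), columns }
    }
}
```

In this sketch, a mapping built with `SchemaMapping::new(&["b", "a"], &["a", "b", "c"], FillPolicy::Null)` can be reused for every batch read from the same file: `map_batch` reorders `a` and `b` and appends an all-null `c` without repeating the name matching. A real implementation would also need to handle type differences, which this model leaves out.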

Describe the solution you'd like
Add an API to arrow-rs to do this mapping.

One alternative, suggested by @tustvold, would be to add a first-party schema adapter into arrow-rs.

Describe alternatives you've considered

For anyone interested, here is the API that is in DataFusion (it now even has ASCII art and Examples, thanks to @itsjunetime and myself):

[Screenshot of the DataFusion schema adapter API documentation, 2024-11-13]

We can/should probably change the names and reduce the levels of indirection if we upstreamed this into arrow-rs.

Additional context

Labels: enhancement (any new improvement worthy of an entry in the changelog)