Describe the bug
When chaining Python/PyArrow UDFs, extension types and extension arrays aren't propagated from the output of one UDF to the input of the next.
To Reproduce
from uuid import UUID

import datafusion
import pyarrow as pa


@datafusion.udf([pa.string()], pa.uuid(), "stable")
def uuid_from_string(uuid_string):
    return pa.array((UUID(s).bytes for s in uuid_string.to_pylist()), pa.uuid())


@datafusion.udf([pa.uuid()], pa.int64(), "stable")
def uuid_version(uuid):
    return pa.array(s.version for s in uuid.to_pylist())


def main():
    ctx = datafusion.SessionContext()
    batch = pa.record_batch({"idx": pa.array(range(100))})
    tab = (
        ctx.create_dataframe([[batch]])
        .with_column("uuid_string", datafusion.functions.uuid())
        .with_column("uuid", uuid_from_string(datafusion.col("uuid_string")))
        .with_column("uuid_version", uuid_version(datafusion.col("uuid")))
    )
    #> AttributeError("'bytes' object has no attribute 'version'"), since metadata doesn't make it through
    print(tab)


if __name__ == "__main__":
    main()
Expected behavior
The pyarrow.Array that is returned from uuid_from_string() is a UuidArray:
from uuid import uuid4
import pyarrow as pa
pa.array([uuid4().bytes], pa.uuid())
#> <pyarrow.lib.UuidArray object at 0x120292350>
However, the pyarrow.Array that is passed to uuid_version() is a FixedSizeBinary array. I would have expected the array passed here to have the pa.uuid() type.
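As a possible stopgap until the extension type survives the hop between UDFs, the second UDF could accept both forms of input. The sketch below shows only the pure-Python core. `uuid_versions` is a hypothetical helper name, not part of datafusion or pyarrow. It assumes the values come from `to_pylist()`, so each one is either a `UUID` (extension type preserved) or raw 16-byte `bytes` (only the FixedSizeBinary storage arrived).

```python
from uuid import UUID, uuid4


def uuid_versions(values):
    """Return the UUID version for each value.

    Hypothetical helper: accepts UUID objects, raw 16-byte values, or None.
    """
    versions = []
    for value in values:
        if value is None:
            # Preserve nulls so the output lines up with the input rows.
            versions.append(None)
        elif isinstance(value, UUID):
            # Extension type was preserved; to_pylist() gave us UUID objects.
            versions.append(value.version)
        else:
            # Only the storage bytes arrived; rebuild the UUID from them.
            versions.append(UUID(bytes=bytes(value)).version)
    return versions


sample = uuid4()
print(uuid_versions([sample, sample.bytes, None]))  # prints [4, 4, None]
```

Inside `uuid_version()` this would be used as `pa.array(uuid_versions(uuid.to_pylist()), pa.int64())`. It works around the symptom only; the UDF still receives an array without the `pa.uuid()` type.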
Additional context
It seems that create_udf() is the mechanism used to create the UDF. However, I believe it doesn't propagate field information (and with it the extension metadata), since everything goes through the DataType:
fn new(
    name: &str,
    func: PyObject,
    input_types: PyArrowType<Vec<DataType>>,
    return_type: PyArrowType<DataType>,
    volatility: &str,
) -> PyResult<Self> {
    let function = create_udf(
        name,
        input_types.0,
        return_type.0,
        parse_volatility(volatility)?,
        to_scalar_function_impl(func),
    );
    Ok(Self { function })
The snippet above is from datafusion-python/src/udf.rs, lines 91 to 105 in 9545634.