Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to close #1453 by exposing several DataFusion scalar functions that exist upstream but were not previously available in the Python API, along with adding Python unit tests for the new bindings.
Changes:
- Added Python wrappers and exports for
arrow_metadata,get_field,union_extract,union_tag,version, plus a Python-levelrowalias forstruct. - Added unit tests covering the newly exposed functions (notably union functions and
version). - Updated
codespellskip paths formatting inpyproject.toml.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
python/datafusion/functions.py |
Adds new Python-level function wrappers/exports (arrow_metadata, get_field, union_*, version, row). |
crates/core/src/functions.rs |
Exposes new functions from the Rust extension module to Python via pyo3 (arrow_metadata, get_field, union_extract, union_tag, version). |
python/tests/test_functions.py |
Adds tests for the newly exposed functions. |
pyproject.toml |
Normalizes codespell skip path entries (removes ./ prefixes). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
192593f to
2771621
Compare
…row_metadata, version, row Expose upstream DataFusion scalar functions that were not yet available in the Python API. Closes apache#1453. - get_field: extracts a field from a struct or map by name - union_extract: extracts a value from a union type by field name - union_tag: returns the active field name of a union type - arrow_metadata: returns Arrow field metadata (all or by key) - version: returns the DataFusion version string - row: alias for the struct constructor Note: arrow_try_cast was listed in the issue but does not exist in DataFusion 53, so it is not included. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests for get_field, arrow_metadata, version, row, union_tag, and union_extract. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allow arrow_cast, get_field, and union_extract to accept plain str arguments instead of requiring Expr wrappers. Also improve arrow_metadata test coverage and fix parameter shadowing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4384c1f to
df1ead1
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def arrow_cast(expr: Expr, data_type: Expr | str) -> Expr: | ||
| """Casts an expression to a specified data type. | ||
|
|
||
| Examples: | ||
| >>> ctx = dfn.SessionContext() |
There was a problem hiding this comment.
This PR declares Closes #1453, but the issue also lists arrow_try_cast as a missing scalar function. I verified there is no arrow_try_cast wrapper anywhere in the repo (no Python wrapper in python/datafusion/functions.py and no Rust binding in crates/core/src/functions.rs). Either add arrow_try_cast (and a corresponding unit test) or adjust the PR description/linked issue closure so we’re not closing the issue prematurely.
ntjohnson1
left a comment
There was a problem hiding this comment.
Claude didn't do as good a job maintaining existing structure as the last one. Not sure how pedantic we want to be about some of the formatting stuff since there isn't a ruff rule around it. A copilot setting or custom lint rule could help enforce if desired
| >>> result.collect_column("c")[0].as_py() | ||
| 1.0 | ||
| """ | ||
| if isinstance(data_type, str): |
There was a problem hiding this comment.
Nice! I don't know if anyone has run into it yet but I wonder if a helper around strings might be nice. Hopefully most common is people just passing python strings, but I could image someone passing a numpy string or pyarrow string extracted from some other operation. Definitely follow on work/issue
| expr: An expression whose metadata to retrieve. | ||
| key: Optional metadata key to look up. Can be a string or an Expr. | ||
|
|
||
| Returns: |
There was a problem hiding this comment.
No example and I think nothing else in the file uses the Returns category. I'm not sure how consistent that is across the code base.
| expr: A struct or map expression. | ||
| name: The field name to extract. | ||
|
|
||
| Returns: |
There was a problem hiding this comment.
Echo example/returns note above
| union_expr: A union-typed expression. | ||
| field_name: The name of the field to extract. | ||
|
|
||
| Returns: |
| Args: | ||
| union_expr: A union-typed expression. | ||
|
|
||
| Returns: |
| def version() -> Expr: | ||
| """Returns the DataFusion version string. | ||
|
|
||
| Returns: |
There was a problem hiding this comment.
example/returns
In this case the returns is definitely redundant with the definition
| def row(*args: Expr) -> Expr: | ||
| """Returns a struct with the given arguments. | ||
|
|
||
| This is an alias for :py:func:`struct`. |
There was a problem hiding this comment.
Doesn't use the See Also block
| Args: | ||
| args: The expressions to include in the struct. | ||
|
|
||
| Returns: |
| import numpy as np | ||
| import pyarrow as pa | ||
| import pytest | ||
| from datafusion import SessionContext, column, literal, string_literal |
There was a problem hiding this comment.
Love that this is no longer needed
Which issue does this PR close?
Closes #1453
Rationale for this change
These functions exist upstream but were not exposed to Python.
What changes are included in this PR?
Expose functions to Python
Add unit testss
Are there any user-facing changes?
New addition only.