Is your feature request related to a problem or challenge?
This is based on the wonderful writeup from @2010YOUY01 in #7977
As previously discussed in #7110 and #7752, there are a few challenges with how ScalarFunctions are handled, notably that there are two distinct implementations: BuiltinScalarFunction and ScalarUDF
Problems with BuiltinScalarFunction
- As more functions are added, the total footprint of DataFusion grows, even for those who don't need the specific functions. This also acts to limit the number of functions built into DataFusion
- The desired semantics may be different for different users (e.g. many of the built in functions in DataFusion mirror PostgreSQL behavior, but some users wish to mimic Spark behavior)
- User defined functions are treated differently from built in functions in some ways (e.g. they can't have aliases)
- Adding a new built in function requires modifications in multiple places, which makes the barrier to contributing overly high. Built-in functions are implemented with the enum BuiltinScalarFunction, and function implementations like return_type() are large methods that match every enum variant.
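To illustrate the last point, here is a minimal sketch of the enum-plus-match pattern. The names `BuiltinFn` and `DataType` and the three variants are hypothetical stand-ins chosen for illustration, not DataFusion's actual definitions.

```rust
// Hypothetical, simplified stand-ins; NOT DataFusion's real types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Float64,
    Utf8,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum BuiltinFn {
    Abs,
    Sqrt,
    Upper,
}

impl BuiltinFn {
    // Adding a new function means adding an arm here...
    fn name(&self) -> &'static str {
        match self {
            BuiltinFn::Abs => "abs",
            BuiltinFn::Sqrt => "sqrt",
            BuiltinFn::Upper => "upper",
        }
    }

    // ...and another arm here, and in every other per-function method.
    fn return_type(&self, arg: DataType) -> Option<DataType> {
        match (self, arg) {
            (BuiltinFn::Abs, DataType::Int64) => Some(DataType::Int64),
            (BuiltinFn::Abs, DataType::Float64) => Some(DataType::Float64),
            (BuiltinFn::Sqrt, DataType::Int64 | DataType::Float64) => Some(DataType::Float64),
            (BuiltinFn::Upper, DataType::Utf8) => Some(DataType::Utf8),
            _ => None, // unsupported argument type
        }
    }
}
```

In this sketch, a new variant has to be threaded through each `match`, which is the "modifications in multiple places" cost described above.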
Problems with ScalarUDF
- The current implementation of ScalarUDF is a struct, and does not cover all the functionality of existing built-in functions
- Defining a new ScalarUDF requires constructing a struct in an imperative way, providing Arc function pointers (see examples/simple_udf.rs) for each part of the UDF, which is unfamiliar to Rust users, who more commonly see dyn Trait objects
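As a rough illustration of the imperative style described above, the sketch below builds a UDF out of separately constructed `Arc` closures. `FnPtrUdf`, `ReturnTypeFn`, `ScalarImplFn` and `make_add_one` are hypothetical names for this example only, and columns are modeled as plain `f64` slices rather than Arrow arrays.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-in; NOT DataFusion's real type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Float64,
}

// Each "part" of the UDF is a separately allocated function pointer.
type ReturnTypeFn = Arc<dyn Fn(&[DataType]) -> Option<DataType> + Send + Sync>;
type ScalarImplFn = Arc<dyn Fn(&[f64]) -> Vec<f64> + Send + Sync>;

// A struct of function pointers, assembled by hand.
struct FnPtrUdf {
    name: String,
    return_type: ReturnTypeFn,
    fun: ScalarImplFn,
}

// Defining a function means wiring every closure up manually.
fn make_add_one() -> FnPtrUdf {
    let return_type: ReturnTypeFn =
        Arc::new(|_args: &[DataType]| -> Option<DataType> { Some(DataType::Float64) });
    let fun: ScalarImplFn =
        Arc::new(|col: &[f64]| -> Vec<f64> { col.iter().map(|v| v + 1.0).collect() });
    FnPtrUdf {
        name: "add_one".to_string(),
        return_type,
        fun,
    }
}
```

Calling a part then looks like `(udf.fun)(&[1.0, 2.5])`, which is less idiomatic than a method call on a trait object.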
Describe the solution you'd like
I propose moving DataFusion to only use ScalarUDFs and removing BuiltinScalarFunction. This will ensure:
- ScalarUDFs have access to all the same functionality as "built in" functions.
- No function specific code will escape the planning phase
- DataFusion's core can remain focused, and external libraries of packages can be used to customize its use.
We will keep the existing ScalarUDF interface as much as possible, while also potentially providing an easier way to define them (ideally via a trait object).
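One way this could look is sketched below: a trait that new functions implement directly, plus an adapter so that closure-style definitions keep working. Everything here (`ScalarFunctionImpl`, `ClosureUdf`, `create_udf_sketch`, the simplified `DataType`, and `f64` slices as columns) is an assumption made for illustration and is not the API proposed in this issue or shipped by DataFusion.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-in; NOT DataFusion's real type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Float64,
}

// Hypothetical trait: one place to define everything about a function.
trait ScalarFunctionImpl: Send + Sync {
    fn name(&self) -> &str;
    // Default method: trait-based functions can opt in to aliases.
    fn aliases(&self) -> &[String] {
        &[]
    }
    fn return_type(&self, args: &[DataType]) -> Option<DataType>;
    fn invoke(&self, col: &[f64]) -> Vec<f64>;
}

// New style: implement the trait on a plain struct.
struct AddOne {
    aliases: Vec<String>,
}

impl ScalarFunctionImpl for AddOne {
    fn name(&self) -> &str {
        "add_one"
    }
    fn aliases(&self) -> &[String] {
        &self.aliases
    }
    fn return_type(&self, _args: &[DataType]) -> Option<DataType> {
        Some(DataType::Float64)
    }
    fn invoke(&self, col: &[f64]) -> Vec<f64> {
        col.iter().map(|v| v + 1.0).collect()
    }
}

type ScalarImplFn = Arc<dyn Fn(&[f64]) -> Vec<f64> + Send + Sync>;

// Old style: an adapter wraps the closure so existing callers keep working.
struct ClosureUdf {
    name: String,
    return_type: DataType,
    fun: ScalarImplFn,
}

impl ScalarFunctionImpl for ClosureUdf {
    fn name(&self) -> &str {
        &self.name
    }
    fn return_type(&self, _args: &[DataType]) -> Option<DataType> {
        Some(self.return_type)
    }
    fn invoke(&self, col: &[f64]) -> Vec<f64> {
        (self.fun)(col)
    }
}

// Hypothetical closure-based constructor kept for backwards compatibility.
fn create_udf_sketch(
    name: &str,
    return_type: DataType,
    fun: ScalarImplFn,
) -> Arc<dyn ScalarFunctionImpl> {
    Arc::new(ClosureUdf {
        name: name.to_string(),
        return_type,
        fun,
    })
}
```

Both styles end up behind the same `Arc<dyn ScalarFunctionImpl>`, so in this sketch the rest of the system would not need to distinguish between them.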
Describe alternatives you've considered
#7977 describes introducing a new trait and unifying both ScalarUDF and BuiltinScalarFunction with this trait.
This approach also allows gradually migrating existing built-in functions to the new trait, and the old UDF interface create_udf() can remain unchanged.
However, I think it is a bigger change for users, and risks making the overall complexity of DataFusion worse. As demonstrated in #8046, it is also feasible to allow new ScalarUDFs to be defined using a trait while retaining backwards compatibility for existing ScalarUDF implementations.
Additional context
Proposed implementation steps:
- pub): RFC: Make fields of ScalarUDF non pub #8039
- ScalarUDF API changes for real: Make fields of ScalarUDF, AggregateUDF and WindowUDF non pub #8079
- Expr::AggregateFunction and Expr::AggregateUDF #8346
- expr::window_function::WindowFunction to WindowFunctionDefintion for consistency #8347
- Expr creation for ScalarUDF: Resolve function calls by name during planning #8157
- ScalarUDFImpl (rather than the function pointers) #8712
- datafusion-function crate with an initial set of functions as a model (see RFC: Demonstrate what a function package might look like -- encoding expressions #8046)
- datafusion_functions crate, file tickets to track them ([Epic] Port BuiltInFunctons to datafusion-functions-* crates #9285)
- datafusion-functions-* crates #9285
- FunctionRegistry::register_udaf and FunctionRegistry::register_udwf #9074
- AggregateUDF ([Epic] Unify AggregateFunction Interface (remove built in list of AggregateFunctions), improve the system #8708) and WindowUDF ([Epic] Unify WindowFunction Interface (remove built in list of BuiltInWindowFunctions) #8709)