Describe the bug
I'm testing DataFusion for use in a system that has several thousand columns and billions of rows.
I'm excited about the flexibility and possibilities this technology provides.
The problems we ran into:
- Optimization of the logical plan is slow because some rules have to copy the whole schema.
We worked around it with prepared queries (we cache the parametrized logical plan).
- Creation of the physical plan consumes up to 35% of CPU time, which is more than its execution (we use several hundred aggregation functions, and DF shows pretty good execution time).
Some investigation showed that there are a lot of string comparisons (see the flamegraph excerpt):
29 % datafusion_physical_expr::planner::create_physical_expr
28.5 % --> datafusion_common::dfschema::DFSchema::index_of_column
28.5 % -- --> datafusion_common::dfschema::DFSchema::index_of_column_by_name
7.4 % -- -- --> __memcmp_sse4_1
14.6 % -- -- --> datafusion_common::table_reference::TableReference::resolved_eq
6.8 % -- -- -- --> __memcmp_sse4_1

The current algorithm has O(N^2) complexity (N from iterating over all the columns in
datafusion_common::dfschema::DFSchema::index_of_column_by_name
and N from datafusion_common::table_reference::TableReference::resolved_eq).
https://github.com/apache/arrow-datafusion/blob/22d03c127e7c5e56cf97ae33eb4446d5b7022eaa/datafusion/common/src/dfschema.rs#L211
Some ideas to resolve:
- Use a hash map or B-tree in DFSchema instead of a list (to reduce the complexity of resolving a column index by its name)
- Implement parametrization of the physical plan and prepared physical plans (so that it can be cached the same way as a prepared logical plan)
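To make the first idea concrete, here is a rough sketch of what a name-to-index map next to the field list could look like. It uses only the standard library; `QualifiedName`, `IndexedSchema`, `index_of_linear` and `index_of_hashed` are hypothetical names made up for this illustration and are not DataFusion's actual types or API. The linear version is only meant to resemble the shape of the lookup seen in the flamegraph, not to reproduce it.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a (qualifier, column name) pair.
/// This is NOT DataFusion's real field type.
#[derive(Clone, Debug)]
struct QualifiedName {
    qualifier: Option<String>,
    name: String,
}

/// Hypothetical schema wrapper that keeps a name -> indices map
/// alongside the ordered field list.
struct IndexedSchema {
    fields: Vec<QualifiedName>,
    by_name: HashMap<String, Vec<usize>>,
}

impl IndexedSchema {
    /// Builds the map once, in O(N) over the fields.
    fn new(fields: Vec<QualifiedName>) -> Self {
        let mut by_name: HashMap<String, Vec<usize>> = HashMap::new();
        for (i, f) in fields.iter().enumerate() {
            by_name.entry(f.name.clone()).or_default().push(i);
        }
        Self { fields, by_name }
    }

    /// Linear scan: every lookup compares against up to N fields.
    fn index_of_linear(&self, qualifier: Option<&str>, name: &str) -> Option<usize> {
        self.fields.iter().position(|f| {
            f.name == name && (qualifier.is_none() || f.qualifier.as_deref() == qualifier)
        })
    }

    /// Hashed lookup: expected O(1) to find the name, then a scan over
    /// only the fields that share that name (usually one or two).
    fn index_of_hashed(&self, qualifier: Option<&str>, name: &str) -> Option<usize> {
        self.by_name.get(name)?.iter().copied().find(|&i| {
            qualifier.is_none() || self.fields[i].qualifier.as_deref() == qualifier
        })
    }
}
```

With something like this, resolving all N columns of a plan would cost roughly O(N) in total instead of O(N^2), at the price of building and storing the map whenever a schema is created. Whether that trade-off pays off for small schemas is something I have not measured.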
Thank you for developing such a great tool!
To Reproduce
It's hard to extract code from the project, but I will try to build a simple repro.
Expected behavior
Creation of the physical plan should take much less CPU time than its execution.
Additional context
No response