Bad performance on wide tables (1000+ columns) due to planning overhead #7698

### Describe the bug

I am evaluating DataFusion for use in a system that has several thousand columns and billions of rows.
I'm excited about the flexibility and possibilities this technology provides.

The problems we ran into:
1) Optimization of the logical plan is slow because some rules have to copy the whole schema.
We worked around this with prepared queries (we cache the parametrized logical plan).
2) Creating the physical plan consumes up to 35% of CPU time, which is more than its execution (we use several hundred aggregation functions, and DataFusion shows pretty good execution time).
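
For context, the workaround in point 1 looks roughly like the sketch below. This is only an illustration of the caching idea, not our actual code: `PlanCache`, `get_or_plan` and the `plan_fn` callback are hypothetical names, and `P` stands in for whatever plan type is being cached (it is not a DataFusion API).

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical cache keyed by the parametrized query text.
/// `P` is a placeholder for the cached plan type.
struct PlanCache<P> {
    plans: HashMap<String, Arc<P>>,
}

impl<P> PlanCache<P> {
    fn new() -> Self {
        Self { plans: HashMap::new() }
    }

    /// Returns the cached plan for `sql`, calling `plan_fn` only on a miss.
    fn get_or_plan<F>(&mut self, sql: &str, plan_fn: F) -> Arc<P>
    where
        F: FnOnce(&str) -> P,
    {
        if let Some(plan) = self.plans.get(sql) {
            // Cache hit: hand out another reference to the same plan.
            return Arc::clone(plan);
        }
        // Cache miss: pay the planning cost once and remember the result.
        let plan = Arc::new(plan_fn(sql));
        self.plans.insert(sql.to_string(), Arc::clone(&plan));
        plan
    }
}
```

The same shape would apply to physical plans if they could be parametrized, which is the second idea listed further down.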

Some investigation showed that there are a lot of string comparisons (take a look at the flamegraph):
``` 
  29 %      datafusion_physical_expr::planner::create_physical_expr 
  28.5 %    --> datafusion_common::dfschema::DFSchema::index_of_column
  28.5 %    -- --> datafusion_common::dfschema::DFSchema::index_of_column_by_name
   7.4 %    -- -- --> __memcmp_sse4_1
  14.6 %    -- -- --> datafusion_common::table_reference::TableReference::resolved_eq
   6.8 %    -- -- -- --> __memcmp_sse4_1
```
![photo_2023-09-29_14-29-16](https://github.com/apache/arrow-datafusion/assets/3950601/ceb1f854-bb3f-4490-bbc4-e02c8ccd63c5)

The algorithm currently has O(N^2) complexity: N from iterating over all the columns in
`datafusion_common::dfschema::DFSchema::index_of_column_by_name`,
and N from `datafusion_common::table_reference::TableReference::resolved_eq`.

https://github.com/apache/arrow-datafusion/blob/22d03c127e7c5e56cf97ae33eb4446d5b7022eaa/datafusion/common/src/dfschema.rs#L211
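
To make the cost concrete, here is a minimal model of a linear lookup, assuming that each of the N referenced columns is resolved by scanning the field list from the start. `Field`, `index_of_column_linear` and `count_field_visits` are simplified stand-ins written for this issue, not the real `DFSchema` code, and the qualifier matching is reduced to a plain string comparison.

```rust
/// Simplified stand-in for a schema field (not DataFusion's field type).
struct Field {
    qualifier: Option<String>,
    name: String,
}

/// Linear scan: compares the qualifier and the name against each field in
/// turn until one matches, so a single lookup costs O(N) string comparisons.
fn index_of_column_linear(fields: &[Field], qualifier: Option<&str>, name: &str) -> Option<usize> {
    fields.iter().position(|f| {
        let qualifier_matches = match (qualifier, f.qualifier.as_deref()) {
            (Some(wanted), Some(actual)) => wanted == actual,
            (Some(_), None) => false,
            (None, _) => true, // an unqualified lookup accepts any qualifier
        };
        qualifier_matches && f.name == name
    })
}

/// Resolves every column of an `n`-column table once and counts how many
/// fields were visited in total.
fn count_field_visits(n: usize) -> usize {
    let fields: Vec<Field> = (0..n)
        .map(|i| Field {
            qualifier: Some("t".to_string()),
            name: format!("col_{}", i),
        })
        .collect();

    let mut visits = 0;
    for i in 0..n {
        let wanted = format!("col_{}", i);
        let idx = index_of_column_linear(&fields, Some("t"), &wanted)
            .expect("column exists");
        visits += idx + 1; // fields 0..=idx were compared before the match
    }
    visits
}
```

In this model, resolving all columns of an `n`-column table visits `n * (n + 1) / 2` fields, so the work grows quadratically with the table width.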


Some ideas to resolve this:

- Use a hashmap or btree in DFSchema instead of a list (to reduce the complexity of resolving a column index by its name)
- Implement parametrization of the physical plan and prepared physical plans (so that they can be cached the same way as a prepared logical plan)
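
A rough sketch of the first idea, assuming the index is built once when the schema is created. `IndexedSchema` is a hypothetical type for illustration only; how it would fit into the real `DFSchema` (ambiguity errors, nested qualifiers and so on) is left open.

```rust
use std::collections::HashMap;

/// Hypothetical schema wrapper that keeps a name -> positions index
/// next to the ordered field list.
struct IndexedSchema {
    /// Column name -> (qualifier, position) pairs that share that name.
    by_name: HashMap<String, Vec<(Option<String>, usize)>>,
}

impl IndexedSchema {
    /// Builds the index in one O(N) pass over `(qualifier, name)` pairs.
    fn new(fields: &[(Option<&str>, &str)]) -> Self {
        let mut by_name: HashMap<String, Vec<(Option<String>, usize)>> = HashMap::new();
        for (idx, &(qualifier, name)) in fields.iter().enumerate() {
            by_name
                .entry(name.to_string())
                .or_default()
                .push((qualifier.map(|q| q.to_string()), idx));
        }
        Self { by_name }
    }

    /// Hash lookup by name, then a scan over only the fields sharing that
    /// name (usually one, or one per joined table).
    fn index_of_column(&self, qualifier: Option<&str>, name: &str) -> Option<usize> {
        let candidates = self.by_name.get(name)?;
        candidates
            .iter()
            .find(|(q, _)| match qualifier {
                Some(wanted) => q.as_deref() == Some(wanted),
                None => true, // unqualified: take the first field with this name
            })
            .map(|(_, idx)| *idx)
    }
}
```

With something like this, each lookup is expected O(1) plus the number of same-named fields, instead of a scan over every column.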

Thank you for developing such a great tool!

### To Reproduce

It's hard to extract code from the project, but I will try to build a simple repro.

### Expected behavior

Creating the physical plan should take much less CPU time than executing it.

### Additional context

_No response_
