OOM when nested join + limit #15628

@mmooyyii

Description

Is your feature request related to a problem or challenge?

# make testdata
import pandas as pd

n_rows = 100000

data = {'id': range(1, n_rows + 1)}
df = pd.DataFrame(data)
df.to_parquet('test1.parquet')
df.to_parquet('test2.parquet')
yimo@YideMacBook-Pro PythonProject % datafusion-cli
DataFusion CLI v46.0.1
> select * from 'test1.parquet' as t1 join 'test2.parquet' as t2 on t1.id < t2.id limit 10;
zsh: killed     datafusion-cli
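For scale: without pushing the limit into the join, the engine is on the hook for every pair satisfying `t1.id < t2.id`. A quick back-of-the-envelope check (plain Python, using the same n as the testdata script above) shows why buffering that result kills the process:

```python
# Estimate the full (pre-LIMIT) output of the inequality join,
# assuming both tables hold the ids 1..n as in the pandas script above.
n = 100_000
matching_pairs = n * (n - 1) // 2  # count of pairs (a, b) with a < b
print(matching_pairs)  # 4999950000 -- ~5 billion rows before LIMIT applies
```

Roughly 5 billion output rows, of which the query needs only 10.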

Describe the solution you'd like

The same SQL in PostgreSQL completes in ~37 ms, because its executor stops the nested loop as soon as the limit is satisfied:

create table t1
(
    id integer
);

create table t2
(
    id integer
);
truncate t1;
truncate t2;
insert into t1(select *
               from generate_series(0, 100000));
insert into t2(select *
               from generate_series(0, 100000));

explain analyse select * from t1 join t2 on t1.id <= t2.id limit 10;

Limit  (cost=0.00..0.45 rows=10 width=8) (actual time=0.021..36.926 rows=10 loops=1)
  ->  Nested Loop  (cost=0.00..169452460.14 rows=3765537655 width=8) (actual time=0.021..36.924 rows=10 loops=1)
        Join Filter: (t1.id <= t2.id)
        Rows Removed by Join Filter: 299997
        ->  Seq Scan on t2  (cost=0.00..1572.65 rows=112965 width=4) (actual time=0.007..0.011 rows=4 loops=1)
        ->  Materialize  (cost=0.00..1943.02 rows=100001 width=4) (actual time=0.008..6.338 rows=75002 loops=4)
              ->  Seq Scan on t1  (cost=0.00..1443.01 rows=100001 width=4) (actual time=0.003..4.574 rows=100001 loops=1)
Planning Time: 0.044 ms
Execution Time: 37.334 ms
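The plan above works because PostgreSQL's executor is pull-based: the Limit node pulls 10 rows from the Nested Loop and then simply stops pulling. A minimal Python sketch of that streaming behavior (illustrative only, not either engine's actual code):

```python
from itertools import islice

def nested_loop_join(left, right, pred):
    # Pull-based nested-loop join: yields one matching row at a time,
    # the way a Volcano-style executor streams rows upward.
    for r in right:
        for l in left:
            if pred(l, r):
                yield (l, r)

t1 = list(range(100_001))  # ids 0..100000, as in generate_series above
t2 = list(range(100_001))

# The "Limit" here pulls exactly 10 rows and then stops; the generator is
# never resumed, so almost none of the ~5 billion candidate pairs are
# ever examined.
first_10 = list(islice(nested_loop_join(t1, t2, lambda a, b: a <= b), 10))
print(first_10)  # 10 rows after ~300k filter checks
```

With ids 0..100000 this yields the 10 rows after about 300,007 predicate evaluations, consistent with the plan's "Rows Removed by Join Filter: 299997".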

Describe alternatives you've considered

I think the limit and the join should be merged into one physical plan node.
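A sketch of what such a merged operator could look like, in plain Python (the function and parameter names are hypothetical, not DataFusion's actual API):

```python
def nested_loop_join_with_fetch(left, right, pred, fetch):
    # Hypothetical limit-aware join: the LIMIT's fetch count is a parameter
    # of the join operator itself, so it can return as soon as `fetch`
    # matches are produced, instead of buffering the full join result.
    out = []
    for r in right:
        for l in left:
            if pred(l, r):
                out.append((l, r))
                if len(out) == fetch:
                    return out
    return out

rows = nested_loop_join_with_fetch(range(100_001), range(100_001),
                                   lambda a, b: a < b, fetch=10)
print(len(rows))  # 10
```

The key property is that memory usage is bounded by `fetch`, not by the cardinality of the join output.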

Additional context

No response

Labels: enhancement (New feature or request)