Skip to content

Filters on RANDOM() are applied incorrectly when pushdown_filters is enabled. #13268

@adamfaulkner-at

Description

@adamfaulkner-at

Describe the bug

When running a query like

SELECT * FROM table WHERE RANDOM() < 0.1;

I get different results depending on the value of "datafusion.execution.parquet.pushdown_filters". When this setting is turned off, I get the results I expect, roughly 10% of the rows in the table. When it is turned on, I think I'm seeing 1% of the rows in the table.

I suspect I'm seeing these results because pushdown with TableProviderFilterPushDown::Inexact is applying this filter at both the parquet level and a FilterExec: random() <= 0.1. This results in the RANDOM() filter being evaluated twice, which causes fewer rows to be sampled.

To Reproduce

This can be reproduced with datafusion-cli version 42.2.0:

Without pushdown_filters

> create external table data stored as parquet location '/Users/adam.faulkner/Downloads/parquet_data/';
> select COUNT(*) from data WHERE RANDOM() < 0.1;
+----------+
| count(*) |
+----------+
| 605572   |
+----------+
1 row(s) fetched.
Elapsed 0.043 seconds.

With pushdown_filters (note that you must re-create the table with the updated setting):

> set datafusion.execution.parquet.pushdown_filters=true;
0 row(s) fetched.
Elapsed 0.002 seconds.

> create external table data stored as parquet location '/Users/adam.faulkner/Downloads/parquet_data/';
0 row(s) fetched.
Elapsed 0.007 seconds.

> select COUNT(*) from data WHERE RANDOM() < 0.1;
+----------+
| count(*) |
+----------+
| 60152    |
+----------+
1 row(s) fetched.
Elapsed 0.045 seconds.

Expected behavior

I would expect that a filter on RANDOM() would be applied only once, so that RANDOM() < 0.1 means that only 10% of all rows will be sampled.

It would be acceptable if RANDOM() was no longer eligible for pushdown, though I suspect this leaves a negligible amount of performance on the table compared to the alternative.

It feels like the "right" solution is to somehow guarantee that RANDOM() always returns the same value for a given row and query evaluation, perhaps by "caching" its values.

Additional context

In my custom TableProvider, I tried using ``TableProviderFilterPushDown::Exact` for these filters, and I get the results that I expect. However, it seems that this is only because my filter is really simple.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions