Is your feature request related to a problem or challenge?
The TABLESAMPLE clause is used to sample rows from a table.
Different databases implement sampling differently.
Spark:
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sampling.html
Example of sampling with replacement (Poisson sample):
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sample.html
It implements:
- Sample the table down to the given number of rows.
- Sample the table down to the given percentage.
  a. Poisson sample (only in the DataFrame API).
  b. Bernoulli sample.
Spark introduced a Sample logical plan, and many other DataFrame APIs are built on top of it, e.g. randomSplit:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.randomSplit.html
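To make the difference between the two Spark sampling modes concrete, here is a minimal Python sketch. It is not Spark's implementation; the function names are hypothetical and the Poisson draw uses Knuth's simple algorithm as a stand-in. Bernoulli sampling keeps each row at most once, while Poisson sampling (sampling with replacement) may emit the same row several times.

```python
import math
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction` (no replacement)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poisson_sample(rows, fraction, seed):
    """Emit each row k times, k ~ Poisson(fraction) (sampling with replacement)."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        out.extend([row] * poisson_draw(rng, fraction))
    return out


if __name__ == "__main__":
    rows = list(range(1000))
    print(len(bernoulli_sample(rows, 0.1, seed=42)))
    print(len(poisson_sample(rows, 0.1, seed=42)))
```

Because a Poisson sample can return one input row more than once, it cannot be expressed as a plain row predicate, which always emits a row zero or one times.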
Hive:
https://cwiki.apache.org/confluence/display/hive/languagemanual+sampling
- Sample the table down to the given number of rows.
- Sample the table down to the given percentage. (Bernoulli)
- Sample on a column, which is useful for clustered tables.
ClickHouse:
https://clickhouse.com/docs/sql-reference/statements/select/sample
- Sample the table down to the given number of rows.
- Sample the table down to the given percentage.
- Sample with offset.
Postgres:
https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation
- BERNOULLI sample.
- SYSTEM sample.
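The two Postgres methods differ in granularity: BERNOULLI decides row by row, while SYSTEM decides block by block and returns every row of a selected block. The following Python sketch illustrates that difference only; modelling a table as a list of blocks and the function names are assumptions made for illustration, not Postgres internals.

```python
import random


def bernoulli_rows(blocks, fraction, seed):
    """Row-level sampling: every row is kept independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for block in blocks for row in block if rng.random() < fraction]


def system_rows(blocks, fraction, seed):
    """Block-level sampling: each block is kept with probability `fraction`,
    and all rows of a kept block are returned."""
    rng = random.Random(seed)
    return [row for block in blocks if rng.random() < fraction for row in block]


if __name__ == "__main__":
    # Hypothetical table: 100 blocks of 10 rows each.
    blocks = [list(range(b * 10, b * 10 + 10)) for b in range(100)]
    print(len(bernoulli_rows(blocks, 0.2, seed=1)))
    print(len(system_rows(blocks, 0.2, seed=1)))
```

Block-level sampling reads fewer blocks but produces clustered results, so a Sample plan node would likely need to record which method was requested.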
Describe the solution you'd like
Add a Sample logical plan.
Describe alternatives you've considered
I have considered reusing an existing logical plan, e.g. Filter, but it seems hard to implement a Poisson sample with the existing logical plans.
In Spark, real_seed = input_seed + partition_id, so different partitions get different sample results. This makes sense to me, but it is also hard to implement with the existing logical plans.
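As a rough illustration of the per-partition seeding described above, the Python sketch below derives each partition's seed as input_seed + partition_id. The function names and the list-of-lists partition model are hypothetical; this is not Spark or DataFusion code.

```python
import random


def sample_partition(rows, fraction, input_seed, partition_id):
    """Bernoulli-sample one partition with a seed derived from its partition id."""
    real_seed = input_seed + partition_id
    rng = random.Random(real_seed)
    return [row for row in rows if rng.random() < fraction]


def sample_partitions(partitions, fraction, input_seed):
    """Sample every partition independently; the result is reproducible
    for a fixed input_seed and a fixed partitioning."""
    return [
        sample_partition(rows, fraction, input_seed, partition_id)
        for partition_id, rows in enumerate(partitions)
    ]


if __name__ == "__main__":
    # Two partitions holding identical rows still get different samples.
    partitions = [list(range(1000)), list(range(1000))]
    first, second = sample_partitions(partitions, 0.5, input_seed=42)
    print(first[:5], second[:5])
```

The sampler needs the partition id at execution time, which a generic filter expression does not have access to; that is one reason a dedicated Sample plan looks like the easier route.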
Additional context
No response