Parquet-go can write extra entries in the column index for a single column chunk. This causes parquet-java to fail with java.lang.ArrayIndexOutOfBoundsException when it uses column indexes for filtering.
DEBUG: COLUMN1 ColumnIndex=5 stats
Entry 0: nullPage=false, nullCount=0, min='', max=''
Entry 1: nullPage=false, nullCount=0, min='xyz', max='abc'
Entry 2: nullPage=false, nullCount=0, min='xyz', max='abc'
Entry 3: nullPage=false, nullCount=0, min='xyz', max='abc'
Entry 4: nullPage=false, nullCount=0, min='xyz', max='abc'
DEBUG: COLUMN1 OffsetIndex=4 pages
Entry 0: offset=..., compressedSize=..., firstRowIndex=0
Entry 1: offset=..., compressedSize=..., firstRowIndex=30
Entry 2: offset=..., compressedSize=..., firstRowIndex=60
Entry 3: offset=..., compressedSize=..., firstRowIndex=90
(Here the column index has 5 entries but the offset index has only 4 pages, i.e. there is 1 additional column index entry for the same data page.)
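The mismatch above can be detected with a simple length comparison, since the column index and offset index of a column chunk are expected to hold one entry per data page. The following is only a minimal sketch of such a check: `PageStats`, `PageLocation`, and `checkIndexAlignment` are hypothetical names made up for illustration and are not part of the parquet-go or parquet-java APIs.

```go
package main

import "fmt"

// PageStats is a hypothetical stand-in for one column index entry.
type PageStats struct {
	NullPage  bool
	NullCount int64
	Min, Max  string
}

// PageLocation is a hypothetical stand-in for one offset index entry.
type PageLocation struct {
	Offset         int64
	CompressedSize int32
	FirstRowIndex  int64
}

// checkIndexAlignment returns an error when the column index and the
// offset index of one column chunk describe a different number of pages.
func checkIndexAlignment(column string, stats []PageStats, pages []PageLocation) error {
	if len(stats) != len(pages) {
		return fmt.Errorf("column %s: column index has %d entries, offset index has %d pages",
			column, len(stats), len(pages))
	}
	return nil
}

func main() {
	// Same shape as the dump above: 5 column index entries, 4 offset index pages.
	stats := make([]PageStats, 5)
	pages := []PageLocation{
		{FirstRowIndex: 0},
		{FirstRowIndex: 30},
		{FirstRowIndex: 60},
		{FirstRowIndex: 90},
	}
	if err := checkIndexAlignment("COLUMN1", stats, pages); err != nil {
		fmt.Println(err)
	}
}
```

A check of this kind only reports the inconsistency; it does not say which side is wrong or how either library should handle it.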
at org.apache.parquet.internal.column.columnindex.OffsetIndexBuilder$OffsetIndexImpl.getFirstRowIndex(OffsetIndexBuilder.java:66)
at org.apache.parquet.internal.filter2.columnindex.RowRanges.create(RowRanges.java:144)
at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.applyPredicate(ColumnIndexFilter.java:189)
at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:126)
at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:57)
at org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:192)
at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:87)
at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:82)
at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:149)
at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:82)
at org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:1219)
at org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:875)
...
Describe the enhancement requested
Structure of indexes which lead to the failure:
There are potentially two ways to fix this:
Stacktrace:
Component(s)
No response