Describe the bug, including details regarding any error messages, version, and platform.
Noticed when upgrading from 1.13.1 to 1.14.1
java.lang.ClassCastException: class org.apache.parquet.column.values.dictionary.DictionaryValuesReader cannot be cast to class org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader (org.apache.parquet.column.values.dictionary.DictionaryValuesReader and org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader are in unnamed module of loader 'app')
at org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader.setPreviousReader(DeltaByteArrayReader.java:92)
at org.apache.parquet.column.impl.ColumnReaderBase.initDataReader(ColumnReaderBase.java:734)
at org.apache.parquet.column.impl.ColumnReaderBase.readPageV2(ColumnReaderBase.java:766)
at org.apache.parquet.column.impl.ColumnReaderBase.access$400(ColumnReaderBase.java:56)
at org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:695)
at org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:686)
at org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:232)
at org.apache.parquet.column.impl.ColumnReaderBase.readPage(ColumnReaderBase.java:686)
at org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:660)
at org.apache.parquet.column.impl.ColumnReaderBase.consume(ColumnReaderBase.java:802)
at org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:30)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:427)
This appears to be due to PARQUET-2431 - https://github.com/apache/parquet-java/pull/1274/files#diff-362b7d44b24283c1bb1f6ca3e124cb72706a33ed96d86b58bf3339f20aafb4e9R732
Looking into how my code hit this, it seems that CorruptDeltaByteArrays.requiresSequentialReads was previously acting as an implicit dataColumn instanceof RequiresPreviousReader check (CorruptDeltaByteArrays.requiresSequentialReads can only return true when encoding == Encoding.DELTA_BYTE_ARRAY, and org.apache.parquet.column.values.RequiresPreviousReader is only implemented by the *DeltaByteArrayReader classes).
Without a check on previousReader instanceof RequiresPreviousReader, the ClassCastException above becomes possible.
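To make the failure mode concrete, here is a minimal, self-contained sketch. Every type in it (ValuesReader, DictionaryReader, DeltaReader, InitDataReaderSketch) is a hypothetical stand-in written for illustration, not the real parquet-java class, and the two methods are only my guess at the shape of the logic, not the actual ColumnReaderBase source.

```scala
// Hypothetical stand-ins for the parquet-java types; simplified for illustration.
trait ValuesReader
trait RequiresPreviousReader {
  def setPreviousReader(previous: ValuesReader): Unit
}

// Plays the role of DictionaryValuesReader: it does not need a previous reader.
final class DictionaryReader extends ValuesReader

// Plays the role of DeltaByteArrayReader.
final class DeltaReader extends ValuesReader with RequiresPreviousReader {
  var previous: Option[DeltaReader] = None

  // Unchecked cast, as in the stack trace: throws for any other reader type.
  override def setPreviousReader(p: ValuesReader): Unit =
    previous = Option(p.asInstanceOf[DeltaReader])
}

object InitDataReaderSketch {
  // Only the current reader is checked, so a dictionary-encoded previous page
  // followed by a delta-encoded page triggers the ClassCastException.
  def unguarded(previous: ValuesReader, current: ValuesReader): Unit =
    current match {
      case r: RequiresPreviousReader => r.setPreviousReader(previous)
      case _                         => ()
    }

  // Both readers are checked; returns true only if the previous reader was set.
  def guarded(previous: ValuesReader, current: ValuesReader): Boolean =
    (previous, current) match {
      case (_: RequiresPreviousReader, r: RequiresPreviousReader) =>
        r.setPreviousReader(previous)
        true
      case _ =>
        false
    }
}
```

In this sketch, unguarded(new DictionaryReader, new DeltaReader) throws a ClassCastException, while guarded(...) with the same arguments skips the call and returns false.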
This is more likely to happen when using the no-argument constructor org.apache.parquet.io.ColumnIOFactory#ColumnIOFactory(), which reads files without a createdBy value. In my case I was able to fix this by passing createdBy: all the Parquet files I have were written after PARQUET-246, so with createdBy supplied CorruptDeltaByteArrays.requiresSequentialReads no longer returns true:
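import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.io.{ColumnIOFactory, MessageColumnIO}

val reader: ParquetFileReader = ...
val fileMetadata = reader.getFooter.getFileMetaData
// Pass the writer's createdBy string through instead of using the no-argument constructor.
val createdBy = fileMetadata.getCreatedBy
val columnIO: MessageColumnIO = new ColumnIOFactory(createdBy)...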
Component(s)
No response