Class VectorizedPageIterator
- java.lang.Object
-
- org.apache.iceberg.parquet.BasePageIterator
-
- org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator
-
public class VectorizedPageIterator extends BasePageIterator
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.iceberg.parquet.BasePageIterator
BasePageIterator.IntIterator
-
-
Field Summary
-
Fields inherited from class org.apache.iceberg.parquet.BasePageIterator
currentDL, currentRL, definitionLevels, desc, dictionary, hasNext, page, repetitionLevels, triplesCount, triplesRead, valueEncoding, values, writerVersion
-
-
Constructor Summary
Constructor Description
VectorizedPageIterator(org.apache.parquet.column.ColumnDescriptor desc, java.lang.String writerVersion, boolean setValidityVector)
-
Method Summary
Modifier and Type Method Description
protected void
initDataReader(org.apache.parquet.column.Encoding dataEncoding, org.apache.parquet.bytes.ByteBufferInputStream in, int valueCount)
protected void
initDefinitionLevelsReader(org.apache.parquet.column.page.DataPageV1 dataPageV1, org.apache.parquet.column.ColumnDescriptor desc, org.apache.parquet.bytes.ByteBufferInputStream in, int triplesCount)
protected void
initDefinitionLevelsReader(org.apache.parquet.column.page.DataPageV2 dataPageV2, org.apache.parquet.column.ColumnDescriptor desc)
int
nextBatchBoolean(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, NullabilityHolder nullabilityHolder)
Method for reading batches of booleans.
int
nextBatchDictionaryIds(org.apache.arrow.vector.IntVector vector, int expectedBatchSize, int numValsInVector, NullabilityHolder holder)
Method for reading a batch of dictionary ids from the dictionary encoded data pages.
int
nextBatchDoubles(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder holder)
Method for reading a batch of values of DOUBLE data type.
int
nextBatchFixedLengthDecimal(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder nullabilityHolder)
Method for reading a batch of decimals backed by fixed length byte array parquet data type.
int
nextBatchFixedWidthBinary(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder nullabilityHolder)
Method for reading batches of fixed width binary type (e.g. BYTE[7]).
int
nextBatchFloats(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder holder)
Method for reading a batch of values of FLOAT data type.
int
nextBatchIntegers(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder holder)
Method for reading a batch of values of INT32 data type.
int
nextBatchIntLongBackedDecimal(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder nullabilityHolder)
Method for reading a batch of decimals backed by INT32 and INT64 parquet data types.
int
nextBatchLongs(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder holder)
Method for reading a batch of values of INT64 data type.
int
nextBatchTimestampMillis(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder holder)
Method for reading a batch of values of TIMESTAMP_MILLIS data type.
int
nextBatchVarWidthType(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, NullabilityHolder nullabilityHolder)
Method for reading a batch of variable width data type (ENUM, JSON, UTF8, BSON).
boolean
producesDictionaryEncodedVector()
protected void
reset()
void
setAllPagesDictEncoded(boolean allDictEncoded)
-
Methods inherited from class org.apache.iceberg.parquet.BasePageIterator
currentPageCount, hasNext, initFromPage, initFromPage, setDictionary, setPage
-
-
-
-
Method Detail
-
setAllPagesDictEncoded
public void setAllPagesDictEncoded(boolean allDictEncoded)
-
reset
protected void reset()
- Overrides:
reset
in class BasePageIterator
-
nextBatchDictionaryIds
public int nextBatchDictionaryIds(org.apache.arrow.vector.IntVector vector, int expectedBatchSize, int numValsInVector, NullabilityHolder holder)
Method for reading a batch of dictionary ids from the dictionary encoded data pages. Like definition levels, dictionary ids in Parquet are encoded with the RLE/bit-packed hybrid encoding.
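To illustrate the encoding mentioned above, here is a minimal sketch of how a run header in Parquet's RLE/bit-packing hybrid encoding can be interpreted. It is not the reader this class uses; the class name RleRunHeader and the method parse are hypothetical. The header is an unsigned LEB128 varint: a low bit of 0 marks an RLE run whose repeat count is the header shifted right by one, and a low bit of 1 marks a bit-packed run made of that many groups of 8 values.

```java
// Hypothetical helper, not part of Iceberg or Parquet.
final class RleRunHeader {
  final boolean bitPacked; // true for a bit-packed run, false for an RLE run
  final int valueCount;    // number of values covered by the run

  private RleRunHeader(boolean bitPacked, int valueCount) {
    this.bitPacked = bitPacked;
    this.valueCount = valueCount;
  }

  // Reads one unsigned LEB128 varint from the start of data and decodes it
  // as a run header of the RLE/bit-packing hybrid encoding.
  static RleRunHeader parse(byte[] data) {
    int header = 0;
    int shift = 0;
    for (int i = 0; i < data.length && shift <= 28; i++) {
      int b = data[i] & 0xFF;
      header |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        boolean packed = (header & 1) == 1;
        // Bit-packed runs are counted in groups of 8 values.
        int count = packed ? (header >>> 1) * 8 : header >>> 1;
        return new RleRunHeader(packed, count);
      }
      shift += 7;
    }
    throw new IllegalArgumentException("truncated or oversized run header");
  }
}
```

After the header, an RLE run stores a single repeated value, while a bit-packed run stores its values packed at the column's bit width.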
-
nextBatchIntegers
public int nextBatchIntegers(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder holder)
Method for reading a batch of values of INT32 data type.
-
nextBatchLongs
public int nextBatchLongs(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder holder)
Method for reading a batch of values of INT64 data type.
-
nextBatchTimestampMillis
public int nextBatchTimestampMillis(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder holder)
Method for reading a batch of values of TIMESTAMP_MILLIS data type. In Iceberg, TIMESTAMP is always represented in microseconds, so values stored in milliseconds are multiplied by 1000 before being written to the vector.
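The unit conversion described above can be sketched in plain Java. This is an illustration only, not the code this method runs; the class name TimestampMillisToMicros and the method toMicros are hypothetical, and the overflow check via Math.multiplyExact is an added assumption rather than documented behavior.

```java
// Hypothetical helper, not part of Iceberg.
final class TimestampMillisToMicros {
  private static final long MICROS_PER_MILLI = 1000L;

  // Converts a batch of millisecond timestamps to microseconds.
  static long[] toMicros(long[] millis) {
    long[] micros = new long[millis.length];
    for (int i = 0; i < millis.length; i++) {
      // multiplyExact throws ArithmeticException instead of silently overflowing.
      micros[i] = Math.multiplyExact(millis[i], MICROS_PER_MILLI);
    }
    return micros;
  }
}
```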
-
nextBatchFloats
public int nextBatchFloats(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder holder)
Method for reading a batch of values of FLOAT data type.
-
nextBatchDoubles
public int nextBatchDoubles(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder holder)
Method for reading a batch of values of DOUBLE data type.
-
nextBatchIntLongBackedDecimal
public int nextBatchIntLongBackedDecimal(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder nullabilityHolder)
Method for reading a batch of decimals backed by INT32 and INT64 parquet data types. Since Arrow stores all decimals in 16 bytes, byte arrays are appropriately padded before being written to Arrow data buffers.
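As a rough illustration of the padding, the sketch below widens an unscaled INT32 or INT64 decimal value to a 16-byte buffer. It assumes the little-endian layout that this page attributes to Arrow decimal buffers and uses sign extension for the padding bytes; the class name DecimalPadding and the method toLittleEndian16 are hypothetical, and this is not necessarily how the method itself writes values.

```java
// Hypothetical helper, not part of Iceberg or Arrow.
final class DecimalPadding {
  static final int ARROW_DECIMAL_WIDTH = 16;

  // Writes the unscaled value as 16 little-endian bytes, sign-extended.
  // An INT32-backed value can be passed after widening to long.
  static byte[] toLittleEndian16(long unscaled) {
    byte[] out = new byte[ARROW_DECIMAL_WIDTH];
    for (int i = 0; i < 8; i++) {
      out[i] = (byte) (unscaled >>> (8 * i)); // least significant byte first
    }
    // Pad the upper 8 bytes with the sign so negative values stay negative.
    byte pad = (byte) (unscaled < 0 ? 0xFF : 0x00);
    for (int i = 8; i < ARROW_DECIMAL_WIDTH; i++) {
      out[i] = pad;
    }
    return out;
  }
}
```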
-
nextBatchFixedLengthDecimal
public int nextBatchFixedLengthDecimal(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder nullabilityHolder)
Method for reading a batch of decimals backed by the fixed length byte array Parquet data type. Arrow stores all decimals in 16 bytes, so this method pads the values it reads. Arrow also interprets decimals in its buffers as little endian, whereas Parquet stores fixed length decimals as big endian. This method therefore uses DecimalVector.setBigEndian(int, byte[]) so that the data in the Arrow vector is little endian.
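The byte-order change can be sketched without Arrow. The example below reverses a big-endian two's-complement value of up to 16 bytes into a sign-extended 16-byte little-endian array. It only illustrates the idea and does not call DecimalVector; the class name BigEndianDecimal and the method toLittleEndian16 are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Iceberg or Arrow.
final class BigEndianDecimal {
  // Converts a big-endian two's-complement value into 16 little-endian bytes.
  static byte[] toLittleEndian16(byte[] bigEndian) {
    if (bigEndian.length == 0 || bigEndian.length > 16) {
      throw new IllegalArgumentException("expected 1 to 16 bytes");
    }
    byte[] out = new byte[16];
    // The sign lives in the first (most significant) big-endian byte.
    byte pad = (byte) (bigEndian[0] < 0 ? 0xFF : 0x00);
    Arrays.fill(out, pad);
    for (int i = 0; i < bigEndian.length; i++) {
      out[i] = bigEndian[bigEndian.length - 1 - i]; // reverse the byte order
    }
    return out;
  }
}
```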
-
nextBatchVarWidthType
public int nextBatchVarWidthType(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, NullabilityHolder nullabilityHolder)
Method for reading a batch of variable width data type (ENUM, JSON, UTF8, BSON).
-
nextBatchFixedWidthBinary
public int nextBatchFixedWidthBinary(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder nullabilityHolder)
Method for reading batches of fixed width binary type (e.g. BYTE[7]). Spark does not support a fixed width binary data type. To work around this limitation, the data is read as fixed width binary from Parquet and stored in a VarBinaryVector in Arrow.
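One consequence of this workaround is that fixed width values end up in a variable width layout, which needs an offsets buffer with one more entry than there are values. The sketch below computes those offsets for values of a given width; it is a plain Java illustration and does not use Arrow, and the class name FixedToVarWidthOffsets and the method offsets are hypothetical.

```java
// Hypothetical helper, not part of Iceberg or Arrow.
final class FixedToVarWidthOffsets {
  // Returns valueCount + 1 offsets; value i spans [offsets[i], offsets[i + 1]).
  static int[] offsets(int valueCount, int typeWidth) {
    int[] offsets = new int[valueCount + 1];
    for (int i = 1; i <= valueCount; i++) {
      // addExact guards against int overflow for very large batches.
      offsets[i] = Math.addExact(offsets[i - 1], typeWidth);
    }
    return offsets;
  }
}
```

For three BYTE[7] values this yields the offsets 0, 7, 14, 21.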
-
producesDictionaryEncodedVector
public boolean producesDictionaryEncodedVector()
-
nextBatchBoolean
public int nextBatchBoolean(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, NullabilityHolder nullabilityHolder)
Method for reading batches of booleans.
-
initDataReader
protected void initDataReader(org.apache.parquet.column.Encoding dataEncoding, org.apache.parquet.bytes.ByteBufferInputStream in, int valueCount)
- Specified by:
initDataReader
in class BasePageIterator
-
initDefinitionLevelsReader
protected void initDefinitionLevelsReader(org.apache.parquet.column.page.DataPageV1 dataPageV1, org.apache.parquet.column.ColumnDescriptor desc, org.apache.parquet.bytes.ByteBufferInputStream in, int triplesCount) throws java.io.IOException
- Specified by:
initDefinitionLevelsReader
in class BasePageIterator
- Throws:
java.io.IOException
-
initDefinitionLevelsReader
protected void initDefinitionLevelsReader(org.apache.parquet.column.page.DataPageV2 dataPageV2, org.apache.parquet.column.ColumnDescriptor desc)
- Specified by:
initDefinitionLevelsReader
in class BasePageIterator
-
-