java.lang.Object
- org.apache.iceberg.parquet.BasePageIterator
- - org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator

public class VectorizedPageIterator
extends BasePageIterator

Nested Class Summary
- Nested classes/interfaces inherited from class org.apache.iceberg.parquet.BasePageIterator
  BasePageIterator.IntIterator

Field Summary
- Fields inherited from class org.apache.iceberg.parquet.BasePageIterator
  currentDL, currentRL, definitionLevels, desc, dictionary, hasNext, page, repetitionLevels, triplesCount, triplesRead, valueEncoding, values, writerVersion

Constructor Summary

Constructors
Constructor	Description
`VectorizedPageIterator(org.apache.parquet.column.ColumnDescriptor desc, java.lang.String writerVersion, boolean setValidityVector)`

Method Summary

Modifier and Type	Method	Description
`protected void`	`initDataReader(org.apache.parquet.column.Encoding dataEncoding, org.apache.parquet.bytes.ByteBufferInputStream in, int valueCount)`
`protected void`	`initDefinitionLevelsReader(org.apache.parquet.column.page.DataPageV1 dataPageV1, org.apache.parquet.column.ColumnDescriptor desc, org.apache.parquet.bytes.ByteBufferInputStream in, int triplesCount)`
`protected void`	`initDefinitionLevelsReader(org.apache.parquet.column.page.DataPageV2 dataPageV2, org.apache.parquet.column.ColumnDescriptor desc)`
`int`	`nextBatchBoolean(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, NullabilityHolder nullabilityHolder)`	Method for reading batches of booleans.
`int`	`nextBatchDictionaryIds(org.apache.arrow.vector.IntVector vector, int expectedBatchSize, int numValsInVector, NullabilityHolder holder)`	Method for reading a batch of dictionary ids from the dictionary encoded data pages.
`int`	`nextBatchDoubles(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder holder)`	Method for reading a batch of values of DOUBLE data type.
`int`	`nextBatchFixedLengthDecimal(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder nullabilityHolder)`	Method for reading a batch of decimals backed by the fixed length byte array Parquet data type.
`int`	`nextBatchFixedWidthBinary(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder nullabilityHolder)`	Method for reading batches of fixed width binary type (e.g. BYTE[7]).
`int`	`nextBatchFloats(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder holder)`	Method for reading a batch of values of FLOAT data type.
`int`	`nextBatchIntegers(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder holder)`	Method for reading a batch of values of INT32 data type.
`int`	`nextBatchIntLongBackedDecimal(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder nullabilityHolder)`	Method for reading a batch of decimals backed by INT32 and INT64 Parquet data types.
`int`	`nextBatchLongs(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder holder)`	Method for reading a batch of values of INT64 data type.
`int`	`nextBatchTimestampMillis(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, int typeWidth, NullabilityHolder holder)`	Method for reading a batch of values of TIMESTAMP_MILLIS data type.
`int`	`nextBatchVarWidthType(org.apache.arrow.vector.FieldVector vector, int expectedBatchSize, int numValsInVector, NullabilityHolder nullabilityHolder)`	Method for reading a batch of variable width data type (ENUM, JSON, UTF8, BSON).
`boolean`	`producesDictionaryEncodedVector()`
`protected void`	`reset()`
`void`	`setAllPagesDictEncoded(boolean allDictEncoded)`

Methods inherited from class org.apache.iceberg.parquet.BasePageIterator
currentPageCount, hasNext, initFromPage, initFromPage, setDictionary, setPage

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail

VectorizedPageIterator

public VectorizedPageIterator(org.apache.parquet.column.ColumnDescriptor desc,
                              java.lang.String writerVersion,
                              boolean setValidityVector)

Method Detail

setAllPagesDictEncoded

public void setAllPagesDictEncoded(boolean allDictEncoded)

reset
```
protected void reset()
```
Overrides: reset in class BasePageIterator

nextBatchDictionaryIds

public int nextBatchDictionaryIds(org.apache.arrow.vector.IntVector vector,
                                  int expectedBatchSize,
                                  int numValsInVector,
                                  NullabilityHolder holder)

Method for reading a batch of dictionary ids from the dictionary encoded data pages. Like definition levels, dictionary ids in Parquet are encoded with the RLE/bit-packed hybrid encoding.
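To make the encoding concrete, the following is a minimal standalone sketch of decoding an RLE/bit-packed hybrid stream as laid out in the Parquet format specification. It is not Iceberg's reader: the class and method names are hypothetical, the bit width is passed in as a parameter, and error handling is limited to whatever `ByteBuffer` throws on truncated input.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical illustration class; not part of Iceberg or Parquet.
final class RleBitPackedHybridSketch {

  // Decodes `count` values of `bitWidth` bits each from a hybrid-encoded stream.
  static int[] decode(byte[] data, int bitWidth, int count) {
    ByteBuffer in = ByteBuffer.wrap(data);
    int[] out = new int[count];
    int produced = 0;
    int valueBytes = (bitWidth + 7) / 8; // bytes used by the repeated value of an RLE run
    while (produced < count) {
      int header = readUnsignedVarInt(in);
      if ((header & 1) == 0) {
        // RLE run: header >>> 1 is the run length, followed by one little-endian value.
        int runLength = header >>> 1;
        int value = 0;
        for (int i = 0; i < valueBytes; i++) {
          value |= (in.get() & 0xFF) << (8 * i);
        }
        for (int i = 0; i < runLength && produced < count; i++) {
          out[produced++] = value;
        }
      } else {
        // Bit-packed run: header >>> 1 is the number of groups of 8 values,
        // and each group occupies exactly bitWidth bytes.
        int groups = header >>> 1;
        byte[] packed = new byte[groups * bitWidth];
        in.get(packed);
        for (int i = 0; i < groups * 8 && produced < count; i++) {
          int value = 0;
          for (int b = 0; b < bitWidth; b++) {
            int bitIndex = i * bitWidth + b; // bits are packed starting at the LSB of each byte
            int bit = (packed[bitIndex >>> 3] >>> (bitIndex & 7)) & 1;
            value |= bit << b;
          }
          out[produced++] = value;
        }
      }
    }
    return out;
  }

  // Reads a ULEB128 varint: 7 payload bits per byte, high bit set means "more bytes follow".
  static int readUnsignedVarInt(ByteBuffer in) {
    int result = 0;
    int shift = 0;
    while (true) {
      int b = in.get() & 0xFF;
      result |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return result;
      }
      shift += 7;
    }
  }

  public static void main(String[] args) {
    // An RLE run of five 4s, then one bit-packed group holding 0..7, at bit width 3.
    byte[] stream = {0x0A, 0x04, 0x03, (byte) 0x88, (byte) 0xC6, (byte) 0xFA};
    System.out.println(Arrays.toString(decode(stream, 3, 13)));
  }
}
```

The decoded ids are indexes into the column's dictionary. A vectorized reader can either write them to an `IntVector` as-is, as this method does, or resolve them to values.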

nextBatchIntegers

public int nextBatchIntegers(org.apache.arrow.vector.FieldVector vector,
                             int expectedBatchSize,
                             int numValsInVector,
                             int typeWidth,
                             NullabilityHolder holder)

Method for reading a batch of values of INT32 data type.

nextBatchLongs

public int nextBatchLongs(org.apache.arrow.vector.FieldVector vector,
                          int expectedBatchSize,
                          int numValsInVector,
                          int typeWidth,
                          NullabilityHolder holder)

Method for reading a batch of values of INT64 data type.

nextBatchTimestampMillis

public int nextBatchTimestampMillis(org.apache.arrow.vector.FieldVector vector,
                                    int expectedBatchSize,
                                    int numValsInVector,
                                    int typeWidth,
                                    NullabilityHolder holder)

Method for reading a batch of values of TIMESTAMP_MILLIS data type. In Iceberg, TIMESTAMP is always represented in microseconds, so values stored in milliseconds are multiplied by 1000 before being written to the vector.
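The unit conversion can be sketched on plain arrays as follows. The class and method names are hypothetical, and the use of `Math.multiplyExact` to fail on overflow is an assumption made for this sketch rather than documented behavior of this method.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Iceberg.
final class TimestampMillisToMicros {
  private static final long MICROS_PER_MILLI = 1000L;

  // Scales each millisecond value to microseconds.
  // multiplyExact throws ArithmeticException instead of silently wrapping on overflow.
  static long[] toMicros(long[] millis) {
    long[] micros = new long[millis.length];
    for (int i = 0; i < millis.length; i++) {
      micros[i] = Math.multiplyExact(millis[i], MICROS_PER_MILLI);
    }
    return micros;
  }

  public static void main(String[] args) {
    long[] micros = toMicros(new long[] {0L, 1L, 1_600_000_000_123L});
    System.out.println(Arrays.toString(micros));
  }
}
```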

nextBatchFloats

public int nextBatchFloats(org.apache.arrow.vector.FieldVector vector,
                           int expectedBatchSize,
                           int numValsInVector,
                           int typeWidth,
                           NullabilityHolder holder)

Method for reading a batch of values of FLOAT data type.

nextBatchDoubles

public int nextBatchDoubles(org.apache.arrow.vector.FieldVector vector,
                            int expectedBatchSize,
                            int numValsInVector,
                            int typeWidth,
                            NullabilityHolder holder)

Method for reading a batch of values of DOUBLE data type.

nextBatchIntLongBackedDecimal

public int nextBatchIntLongBackedDecimal(org.apache.arrow.vector.FieldVector vector,
                                         int expectedBatchSize,
                                         int numValsInVector,
                                         int typeWidth,
                                         NullabilityHolder nullabilityHolder)

Method for reading a batch of decimals backed by INT32 and INT64 Parquet data types. Since Arrow stores all decimals in 16 bytes, the values are padded to that width before being written to Arrow data buffers.
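As a rough illustration of what padding to 16 bytes involves, the sketch below sign-extends a `long` unscaled value into a 16-byte little-endian two's-complement array, the layout Arrow uses for 128-bit decimals. The class and method names are hypothetical, and this is not the code path Iceberg or Arrow actually uses.

```java
import java.math.BigInteger;

// Hypothetical illustration class; not part of Iceberg or Arrow.
final class LongBackedDecimalPadding {
  static final int DECIMAL128_BYTES = 16;

  // Writes the unscaled value as 16 little-endian bytes, sign-extending the upper 8 bytes.
  static byte[] toLittleEndian16(long unscaled) {
    byte[] out = new byte[DECIMAL128_BYTES];
    for (int i = 0; i < Long.BYTES; i++) {
      out[i] = (byte) (unscaled >>> (8 * i));
    }
    byte pad = (byte) (unscaled < 0 ? 0xFF : 0x00); // preserves the sign in two's complement
    for (int i = Long.BYTES; i < DECIMAL128_BYTES; i++) {
      out[i] = pad;
    }
    return out;
  }

  // Reads a little-endian two's-complement array back as a BigInteger, for checking.
  static BigInteger fromLittleEndian(byte[] littleEndian) {
    byte[] bigEndian = new byte[littleEndian.length];
    for (int i = 0; i < littleEndian.length; i++) {
      bigEndian[i] = littleEndian[littleEndian.length - 1 - i];
    }
    return new BigInteger(bigEndian);
  }

  public static void main(String[] args) {
    // The round trip should print the original unscaled value.
    System.out.println(fromLittleEndian(toLittleEndian16(-12345L)));
  }
}
```

An INT32-backed decimal can go through the same routine after widening the `int` to `long`, since Java's widening conversion already sign-extends.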

nextBatchFixedLengthDecimal
```
public int nextBatchFixedLengthDecimal(org.apache.arrow.vector.FieldVector vector,
                                       int expectedBatchSize,
                                       int numValsInVector,
                                       int typeWidth,
                                       NullabilityHolder nullabilityHolder)
```
Method for reading a batch of decimals backed by the fixed length byte array Parquet data type. Arrow stores all decimals in 16 bytes, so this method pads the decimals it reads to that width. Arrow also interprets the decimals in its buffers as little endian, whereas Parquet stores fixed length decimals as big endian. This method therefore uses DecimalVector.setBigEndian(int, byte[]) so that the data in the Arrow vector ends up little endian.
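The byte-order conversion and padding described here can be sketched on plain byte arrays. This is a standalone illustration with hypothetical names, not the implementation of `DecimalVector.setBigEndian`. It assumes the input is a big-endian two's-complement value of 1 to 16 bytes.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical illustration class; not part of Iceberg or Arrow.
final class BigEndianDecimalPadding {
  static final int DECIMAL128_BYTES = 16;

  // Reverses a big-endian two's-complement value into 16 little-endian bytes,
  // filling the unused high-order bytes with the sign byte.
  static byte[] toLittleEndian16(byte[] bigEndian) {
    if (bigEndian.length == 0 || bigEndian.length > DECIMAL128_BYTES) {
      throw new IllegalArgumentException("expected 1 to 16 bytes, got " + bigEndian.length);
    }
    byte[] out = new byte[DECIMAL128_BYTES];
    // In big-endian input the first byte carries the sign bit.
    Arrays.fill(out, (byte) (bigEndian[0] < 0 ? 0xFF : 0x00));
    for (int i = 0; i < bigEndian.length; i++) {
      out[i] = bigEndian[bigEndian.length - 1 - i];
    }
    return out;
  }

  // Reads a little-endian two's-complement array back as a BigInteger, for checking.
  static BigInteger fromLittleEndian(byte[] littleEndian) {
    byte[] bigEndian = new byte[littleEndian.length];
    for (int i = 0; i < littleEndian.length; i++) {
      bigEndian[i] = littleEndian[littleEndian.length - 1 - i];
    }
    return new BigInteger(bigEndian);
  }

  public static void main(String[] args) {
    byte[] parquetBytes = BigInteger.valueOf(-12345L).toByteArray(); // big endian
    System.out.println(fromLittleEndian(toLittleEndian16(parquetBytes)));
  }
}
```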

nextBatchVarWidthType

public int nextBatchVarWidthType(org.apache.arrow.vector.FieldVector vector,
                                 int expectedBatchSize,
                                 int numValsInVector,
                                 NullabilityHolder nullabilityHolder)

Method for reading a batch of variable width data type (ENUM, JSON, UTF8, BSON).

nextBatchFixedWidthBinary

public int nextBatchFixedWidthBinary(org.apache.arrow.vector.FieldVector vector,
                                     int expectedBatchSize,
                                     int numValsInVector,
                                     int typeWidth,
                                     NullabilityHolder nullabilityHolder)

Method for reading batches of fixed width binary type (e.g. BYTE[7]). Spark does not support a fixed width binary data type. To work around this limitation, the data is read as fixed width binary from Parquet and stored in a VarBinaryVector in Arrow.
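A variable-width binary layout locates each value through an offsets array with one more entry than there are values. For fixed width data those offsets are simply multiples of the type width, as the sketch below shows on plain arrays. The names are hypothetical, nulls are ignored for brevity, and no Arrow API is used.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Iceberg or Arrow.
final class FixedWidthAsVarBinary {

  // Builds the offsets array a variable-width layout needs: offsets[i] is where
  // value i starts and offsets[i + 1] is where it ends.
  static int[] offsetsFor(int valueCount, int typeWidth) {
    int[] offsets = new int[valueCount + 1];
    for (int i = 1; i <= valueCount; i++) {
      offsets[i] = Math.multiplyExact(i, typeWidth); // fails rather than wrapping on overflow
    }
    return offsets;
  }

  // Extracts value `index` from the contiguous data buffer using the offsets.
  static byte[] valueAt(byte[] data, int[] offsets, int index) {
    return Arrays.copyOfRange(data, offsets[index], offsets[index + 1]);
  }

  public static void main(String[] args) {
    byte[] data = {1, 2, 3, 4, 5, 6}; // three values of width 2
    int[] offsets = offsetsFor(3, 2);
    System.out.println(Arrays.toString(offsets));
    System.out.println(Arrays.toString(valueAt(data, offsets, 1)));
  }
}
```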

producesDictionaryEncodedVector

public boolean producesDictionaryEncodedVector()

nextBatchBoolean

public int nextBatchBoolean(org.apache.arrow.vector.FieldVector vector,
                            int expectedBatchSize,
                            int numValsInVector,
                            NullabilityHolder nullabilityHolder)

Method for reading batches of booleans.

initDataReader

protected void initDataReader(org.apache.parquet.column.Encoding dataEncoding,
                              org.apache.parquet.bytes.ByteBufferInputStream in,
                              int valueCount)

Specified by: initDataReader in class BasePageIterator

initDefinitionLevelsReader

protected void initDefinitionLevelsReader(org.apache.parquet.column.page.DataPageV1 dataPageV1,
                                          org.apache.parquet.column.ColumnDescriptor desc,
                                          org.apache.parquet.bytes.ByteBufferInputStream in,
                                          int triplesCount)
                                   throws java.io.IOException

Specified by: initDefinitionLevelsReader in class BasePageIterator
Throws: java.io.IOException

initDefinitionLevelsReader

protected void initDefinitionLevelsReader(org.apache.parquet.column.page.DataPageV2 dataPageV2,
                                          org.apache.parquet.column.ColumnDescriptor desc)

Specified by: initDefinitionLevelsReader in class BasePageIterator
