Class VectorizedPageIterator

    • Constructor Detail

      • VectorizedPageIterator

        public VectorizedPageIterator​(org.apache.parquet.column.ColumnDescriptor desc,
                                      java.lang.String writerVersion,
                                      boolean setValidityVector)
    • Method Detail

      • setAllPagesDictEncoded

        public void setAllPagesDictEncoded​(boolean allDictEncoded)
      • nextBatchDictionaryIds

        public int nextBatchDictionaryIds​(org.apache.arrow.vector.IntVector vector,
                                          int expectedBatchSize,
                                          int numValsInVector,
                                          NullabilityHolder holder)
        Method for reading a batch of dictionary ids from the dictionary-encoded data pages. Like definition levels, dictionary ids in Parquet are stored using the RLE/bit-packed hybrid encoding.
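        To illustrate the encoding mentioned above, here is a minimal sketch of decoding Parquet's RLE/bit-packed hybrid format into plain ints. It is not the code used by this class; the class name RleBitPackedSketch and the decode method are hypothetical, and the bit width is assumed to be known already (in a dictionary-encoded data page it is stored in the first byte, before the runs).

```java
import java.util.Arrays;

// Hypothetical helper, not part of Iceberg: decodes RLE/bit-packed hybrid runs.
final class RleBitPackedSketch {

  static int[] decode(byte[] data, int bitWidth, int valueCount) {
    int[] out = new int[valueCount];
    int pos = 0;                          // read position in bytes
    int n = 0;                            // number of values decoded so far
    int valueBytes = (bitWidth + 7) / 8;  // bytes used by an RLE run value
    while (n < valueCount) {
      // Each run starts with an unsigned LEB128 varint header.
      int header = 0;
      int shift = 0;
      int b;
      do {
        b = data[pos++] & 0xFF;
        header |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);

      if ((header & 1) == 0) {
        // RLE run: one little-endian value repeated (header >>> 1) times.
        int runLength = header >>> 1;
        int value = 0;
        for (int i = 0; i < valueBytes; i++) {
          value |= (data[pos++] & 0xFF) << (8 * i);
        }
        for (int i = 0; i < runLength && n < valueCount; i++) {
          out[n++] = value;
        }
      } else {
        // Bit-packed run: (header >>> 1) groups of 8 values, packed LSB first.
        int groupValues = (header >>> 1) * 8;
        long bitPos = (long) pos * 8;
        for (int i = 0; i < groupValues; i++) {
          int value = 0;
          for (int bit = 0; bit < bitWidth; bit++, bitPos++) {
            int current = ((data[(int) (bitPos >>> 3)] & 0xFF) >>> (int) (bitPos & 7)) & 1;
            value |= current << bit;
          }
          if (n < valueCount) {
            out[n++] = value;             // trailing padding values are dropped
          }
        }
        pos += groupValues * bitWidth / 8;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // One RLE run (4 x value 5) followed by one bit-packed group holding 0..7.
    byte[] data = {0x08, 0x05, 0x03, (byte) 0x88, (byte) 0xC6, (byte) 0xFA};
    System.out.println(Arrays.toString(decode(data, 3, 12)));
  }
}
```

        The sketch decodes one value at a time for readability; a vectorized reader would instead copy whole runs into the target vector in bulk.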
      • nextBatchIntegers

        public int nextBatchIntegers​(org.apache.arrow.vector.FieldVector vector,
                                     int expectedBatchSize,
                                     int numValsInVector,
                                     int typeWidth,
                                     NullabilityHolder holder)
        Method for reading a batch of values of INT32 data type.
      • nextBatchLongs

        public int nextBatchLongs​(org.apache.arrow.vector.FieldVector vector,
                                  int expectedBatchSize,
                                  int numValsInVector,
                                  int typeWidth,
                                  NullabilityHolder holder)
        Method for reading a batch of values of INT64 data type.
      • nextBatchTimestampMillis

        public int nextBatchTimestampMillis​(org.apache.arrow.vector.FieldVector vector,
                                            int expectedBatchSize,
                                            int numValsInVector,
                                            int typeWidth,
                                            NullabilityHolder holder)
        Method for reading a batch of values of TIMESTAMP_MILLIS data type. In Iceberg, TIMESTAMP is always represented in microseconds, so values stored in milliseconds are multiplied by 1000 before they are written to the vector.
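        The unit conversion described above can be sketched as follows. This is an illustration only: the class TimestampScalingSketch and its millisToMicros method are hypothetical names, and the overflow check via Math.multiplyExact is an assumption of this sketch rather than documented behavior of the reader.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Iceberg: scales millisecond timestamps to microseconds.
final class TimestampScalingSketch {
  private static final long MICROS_PER_MILLI = 1000L;

  static long[] millisToMicros(long[] millis) {
    long[] micros = new long[millis.length];
    for (int i = 0; i < millis.length; i++) {
      // multiplyExact throws ArithmeticException instead of silently overflowing.
      micros[i] = Math.multiplyExact(millis[i], MICROS_PER_MILLI);
    }
    return micros;
  }

  public static void main(String[] args) {
    long[] micros = millisToMicros(new long[] {0L, 1L, 1_600_000_000_123L});
    System.out.println(Arrays.toString(micros));
  }
}
```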
      • nextBatchFloats

        public int nextBatchFloats​(org.apache.arrow.vector.FieldVector vector,
                                   int expectedBatchSize,
                                   int numValsInVector,
                                   int typeWidth,
                                   NullabilityHolder holder)
        Method for reading a batch of values of FLOAT data type.
      • nextBatchDoubles

        public int nextBatchDoubles​(org.apache.arrow.vector.FieldVector vector,
                                    int expectedBatchSize,
                                    int numValsInVector,
                                    int typeWidth,
                                    NullabilityHolder holder)
        Method for reading a batch of values of DOUBLE data type.
      • nextBatchIntLongBackedDecimal

        public int nextBatchIntLongBackedDecimal​(org.apache.arrow.vector.FieldVector vector,
                                                 int expectedBatchSize,
                                                 int numValsInVector,
                                                 int typeWidth,
                                                 NullabilityHolder nullabilityHolder)
        Method for reading a batch of decimals backed by INT32 and INT64 Parquet data types. Since Arrow stores all decimals in 16 bytes, byte arrays are appropriately padded before being written to Arrow data buffers.
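        As an illustration of the padding, the sketch below widens a 64-bit unscaled value to the 16-byte little-endian two's-complement form that Arrow uses for 128-bit decimals. The class IntLongDecimalPaddingSketch and its method are hypothetical, and sign extension is this sketch's reading of "appropriately padded"; it is not taken from the reader's source.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Iceberg: pads an int/long-backed decimal to 16 bytes.
final class IntLongDecimalPaddingSketch {

  static byte[] toLittleEndian16(long unscaled) {
    byte[] out = new byte[16];
    // Low 8 bytes: the value itself, least significant byte first.
    for (int i = 0; i < 8; i++) {
      out[i] = (byte) (unscaled >>> (8 * i));
    }
    // High 8 bytes: sign extension, 0xFF for negative values and 0x00 otherwise.
    byte sign = (byte) (unscaled < 0 ? 0xFF : 0x00);
    Arrays.fill(out, 8, 16, sign);
    return out;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(toLittleEndian16(258L)));
    System.out.println(Arrays.toString(toLittleEndian16(-1L)));
  }
}
```

        An INT32-backed value can be passed through the same method, because Java widens int to long with sign extension.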
      • nextBatchFixedLengthDecimal

        public int nextBatchFixedLengthDecimal​(org.apache.arrow.vector.FieldVector vector,
                                               int expectedBatchSize,
                                               int numValsInVector,
                                               int typeWidth,
                                               NullabilityHolder nullabilityHolder)
        Method for reading a batch of decimals backed by the fixed length byte array Parquet data type. Arrow stores all decimals in 16 bytes, so this method pads the decimals it reads. Moreover, Arrow interprets the decimals in an Arrow buffer as little endian, whereas Parquet stores fixed length decimals as big endian. This method therefore uses DecimalVector.setBigEndian(int, byte[]) so that the data in the Arrow vector is little endian.
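        The byte-order conversion and padding can be pictured with the plain-Java sketch below. It only illustrates the transformation; it does not reproduce DecimalVector.setBigEndian, and the class FixedDecimalEndianSketch and its method are hypothetical names.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical helper, not part of Iceberg or Arrow: big-endian decimal bytes to 16 little-endian bytes.
final class FixedDecimalEndianSketch {

  static byte[] bigEndianToLittleEndian16(byte[] bigEndian) {
    int n = bigEndian.length;
    if (n == 0 || n > 16) {
      throw new IllegalArgumentException("Expected 1 to 16 bytes, got " + n);
    }
    byte[] out = new byte[16];
    // Reverse the byte order so the least significant byte comes first.
    for (int i = 0; i < n; i++) {
      out[i] = bigEndian[n - 1 - i];
    }
    // Pad the remaining high-order bytes with the sign of the original value.
    byte sign = (byte) (bigEndian[0] < 0 ? 0xFF : 0x00);
    Arrays.fill(out, n, 16, sign);
    return out;
  }

  public static void main(String[] args) {
    byte[] parquetBytes = new BigInteger("-12345").toByteArray(); // big endian, two's complement
    System.out.println(Arrays.toString(bigEndianToLittleEndian16(parquetBytes)));
  }
}
```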
      • nextBatchVarWidthType

        public int nextBatchVarWidthType​(org.apache.arrow.vector.FieldVector vector,
                                         int expectedBatchSize,
                                         int numValsInVector,
                                         NullabilityHolder nullabilityHolder)
        Method for reading a batch of values of variable width data types (ENUM, JSON, UTF8, BSON).
      • nextBatchFixedWidthBinary

        public int nextBatchFixedWidthBinary​(org.apache.arrow.vector.FieldVector vector,
                                             int expectedBatchSize,
                                             int numValsInVector,
                                             int typeWidth,
                                             NullabilityHolder nullabilityHolder)
        Method for reading batches of fixed width binary type (e.g. BYTE[7]). Spark does not support a fixed width binary data type. To work around this limitation, the data is read as fixed width binary from Parquet and stored in a VarBinaryVector in Arrow.
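        To show why fixed width values fit naturally into a variable width vector, the sketch below models Arrow's variable-width layout (an offsets buffer with one more entry than there are values, plus a contiguous data buffer) with plain arrays. The class FixedToVarWidthSketch and its methods are hypothetical, null handling is omitted, and the sketch does not use the Arrow API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Iceberg or Arrow: fixed width values in a var-width layout.
final class FixedToVarWidthSketch {

  // With a constant width, value i occupies bytes [i * typeWidth, (i + 1) * typeWidth).
  static int[] offsetsFor(int valueCount, int typeWidth) {
    int[] offsets = new int[valueCount + 1];
    for (int i = 0; i <= valueCount; i++) {
      offsets[i] = i * typeWidth;
    }
    return offsets;
  }

  // Reads one value back the way a variable width reader would, using only the offsets.
  static byte[] valueAt(byte[] data, int[] offsets, int index) {
    return Arrays.copyOfRange(data, offsets[index], offsets[index + 1]);
  }

  public static void main(String[] args) {
    byte[] data = "abcdefg1234567".getBytes(StandardCharsets.US_ASCII); // two BYTE[7] values
    int[] offsets = offsetsFor(2, 7);
    System.out.println(Arrays.toString(offsets));
    System.out.println(new String(valueAt(data, offsets, 1), StandardCharsets.US_ASCII));
  }
}
```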
      • producesDictionaryEncodedVector

        public boolean producesDictionaryEncodedVector()
      • nextBatchBoolean

        public int nextBatchBoolean​(org.apache.arrow.vector.FieldVector vector,
                                    int expectedBatchSize,
                                    int numValsInVector,
                                    NullabilityHolder nullabilityHolder)
        Method for reading batches of booleans.
      • initDataReader

        protected void initDataReader​(org.apache.parquet.column.Encoding dataEncoding,
                                      org.apache.parquet.bytes.ByteBufferInputStream in,
                                      int valueCount)
        Specified by:
        initDataReader in class BasePageIterator
      • initDefinitionLevelsReader

        protected void initDefinitionLevelsReader​(org.apache.parquet.column.page.DataPageV1 dataPageV1,
                                                  org.apache.parquet.column.ColumnDescriptor desc,
                                                  org.apache.parquet.bytes.ByteBufferInputStream in,
                                                  int triplesCount)
                                           throws java.io.IOException
        Specified by:
        initDefinitionLevelsReader in class BasePageIterator
        Throws:
        java.io.IOException
      • initDefinitionLevelsReader

        protected void initDefinitionLevelsReader​(org.apache.parquet.column.page.DataPageV2 dataPageV2,
                                                  org.apache.parquet.column.ColumnDescriptor desc)
        Specified by:
        initDefinitionLevelsReader in class BasePageIterator