Class ArrowReader

java.lang.Object
org.apache.iceberg.io.CloseableGroup
org.apache.iceberg.arrow.vectorized.ArrowReader
All Implemented Interfaces:
Closeable, AutoCloseable

public class ArrowReader extends CloseableGroup
Vectorized reader that returns an iterator of ColumnarBatch. See open(CloseableIterable) to learn about the behavior of the iterator.

The following Iceberg data types are supported and have been tested:

Features that don't work in this implementation:

  • Type promotion: When a column's type has been promoted, the returned Arrow vector has the data type stored in the Parquet file rather than the data type in the latest schema. See https://github.com/apache/iceberg/issues/2483.
  • Constant-value columns: Columns with constant values are physically encoded as a dictionary. The Arrow vector type is int32 instead of the type declared in the schema. See https://github.com/apache/iceberg/issues/2484.
  • Data types: Types.ListType, Types.MapType, Types.StructType, Types.FixedType and Types.DecimalType. See https://github.com/apache/iceberg/issues/2485 and https://github.com/apache/iceberg/issues/2486.
  • Delete files are not supported. See https://github.com/apache/iceberg/issues/2487.
  • Constructor Details

    • ArrowReader

      public ArrowReader(TableScan scan, int batchSize, boolean reuseContainers)
      Create a new instance of the reader.
      Parameters:
      scan - the table scan object.
      batchSize - the maximum number of rows per Arrow batch.
      reuseContainers - whether to reuse Arrow vectors when iterating through the data. If set to false, every Iterator.next() call creates new instances of Arrow vectors. If set to true, the Arrow vectors from the previous Iterator.next() call may be reused for the data returned by the current Iterator.next() call, which avoids repeated memory allocation. Irrespective of the value of reuseContainers, the Arrow vectors from the previous Iterator.next() call are closed before new instances are created in the current Iterator.next() call.
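
The lifecycle rule described for reuseContainers has a practical consequence for callers: a batch is only safe to read until the next Iterator.next() call, so values must be copied out before advancing. The following is a minimal sketch of that rule in plain Java. It does not use Iceberg or Arrow; Batch, BatchIterator and readAll are hypothetical stand-ins invented for illustration, and the sketch models only the "previous batch is closed on next()" guarantee, not vector reuse or memory allocation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class BatchLifecycleSketch {

  /** Hypothetical stand-in for a batch of Arrow vectors; not an Iceberg class. */
  static final class Batch implements AutoCloseable {
    final int[] values;
    private boolean closed = false;

    Batch(int[] values) {
      this.values = values;
    }

    boolean isClosed() {
      return closed;
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  /** Hypothetical iterator that closes the previous batch on every next() call. */
  static final class BatchIterator implements Iterator<Batch>, AutoCloseable {
    private final int totalRows;
    private final int batchSize; // maximum number of rows per batch
    private int nextRow = 0;
    private Batch previous = null;

    BatchIterator(int totalRows, int batchSize) {
      this.totalRows = totalRows;
      this.batchSize = batchSize;
    }

    @Override
    public boolean hasNext() {
      return nextRow < totalRows;
    }

    @Override
    public Batch next() {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      // The batch handed out by the previous call is released first.
      if (previous != null) {
        previous.close();
      }
      int size = Math.min(batchSize, totalRows - nextRow);
      int[] values = new int[size];
      for (int i = 0; i < size; i++) {
        values[i] = nextRow + i;
      }
      nextRow += size;
      previous = new Batch(values);
      return previous;
    }

    @Override
    public void close() {
      // Release the last batch that was handed out, if any.
      if (previous != null) {
        previous.close();
      }
    }
  }

  /** Copies values out of each batch while that batch is still open. */
  static List<Integer> readAll(int totalRows, int batchSize) {
    List<Integer> out = new ArrayList<>();
    try (BatchIterator it = new BatchIterator(totalRows, batchSize)) {
      while (it.hasNext()) {
        Batch batch = it.next();
        for (int v : batch.values) {
          out.add(v);
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(readAll(5, 2)); // prints [0, 1, 2, 3, 4]
  }
}
```

In this sketch, holding on to a Batch reference after the next call to next() would leave the caller with a closed object, which is why readAll copies the values inside the loop body.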
  • Method Details