Class ArrowReader
- java.lang.Object
  - org.apache.iceberg.io.CloseableGroup
    - org.apache.iceberg.arrow.vectorized.ArrowReader

All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable
public class ArrowReader extends CloseableGroup
Vectorized reader that returns an iterator of ColumnarBatch. See open(CloseableIterable) to learn about the behavior of the iterator.

The following Iceberg data types are supported and have been tested:
- Iceberg: Types.BooleanType, Arrow: Types.MinorType.BIT
- Iceberg: Types.IntegerType, Arrow: Types.MinorType.INT
- Iceberg: Types.LongType, Arrow: Types.MinorType.BIGINT
- Iceberg: Types.FloatType, Arrow: Types.MinorType.FLOAT4
- Iceberg: Types.DoubleType, Arrow: Types.MinorType.FLOAT8
- Iceberg: Types.StringType, Arrow: Types.MinorType.VARCHAR
- Iceberg: Types.TimestampType (both with and without timezone), Arrow: Types.MinorType.TIMEMICRO
- Iceberg: Types.BinaryType, Arrow: Types.MinorType.VARBINARY
- Iceberg: Types.DateType, Arrow: Types.MinorType.DATEDAY
Features that don't work in this implementation:
- Type promotion: When a column's type has been promoted, the returned Arrow vector has the data type stored in the Parquet file rather than the data type in the latest schema. See https://github.com/apache/iceberg/issues/2483.
- Columns with constant values: These are physically encoded as a dictionary, so the Arrow vector type is int32 instead of the type given by the schema. See https://github.com/apache/iceberg/issues/2484.
- Data types: Types.ListType, Types.MapType, Types.StructType, Types.FixedType and Types.DecimalType. See https://github.com/apache/iceberg/issues/2485 and https://github.com/apache/iceberg/issues/2486.
- Iceberg v2 spec is not supported. See https://github.com/apache/iceberg/issues/2487.
Constructor Summary
Constructors
- ArrowReader(TableScan scan, int batchSize, boolean reuseContainers): Create a new instance of the reader.
Method Summary
Methods
- void close()
- CloseableIterator<ColumnarBatch> open(CloseableIterable<CombinedScanTask> tasks): Returns a new iterator of ColumnarBatch objects.

Methods inherited from class org.apache.iceberg.io.CloseableGroup
addCloseable
Constructor Detail
ArrowReader
public ArrowReader(TableScan scan, int batchSize, boolean reuseContainers)
Create a new instance of the reader.
Parameters:
- scan - the table scan object.
- batchSize - the maximum number of rows per Arrow batch.
- reuseContainers - whether to reuse Arrow vectors when iterating through the data. If set to false, every Iterator.next() call creates new instances of Arrow vectors. If set to true, the Arrow vectors from the previous Iterator.next() call may be reused for the data returned by the current Iterator.next() call, which avoids repeated memory allocation. Irrespective of the value of reuseContainers, the Arrow vectors from the previous Iterator.next() call are closed before new instances are created in the current Iterator.next() call.
Method Detail
open
public CloseableIterator<ColumnarBatch> open(CloseableIterable<CombinedScanTask> tasks)
Returns a new iterator of ColumnarBatch objects.

Note that the reader owns the ColumnarBatch objects and takes care of closing them. The caller should not hold onto a ColumnarBatch or try to close it.

If reuseContainers is false, the Arrow vectors in the previous ColumnarBatch are closed before the next ColumnarBatch object is returned. This implies that the caller should either finish using the ColumnarBatch or transfer ownership of it before getting the next ColumnarBatch.

If reuseContainers is true, the Arrow vectors in the previous ColumnarBatch may be reused for the next ColumnarBatch. This implies that the caller should either finish using the ColumnarBatch or deep copy it before getting the next ColumnarBatch.
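
The reuse rule above can be illustrated with a toy model that uses no Iceberg or Arrow classes. In this sketch, `reusingBatches` is a hypothetical stand-in for a reader that refills one shared container on every `next()` call, and plain `int[]` arrays stand in for Arrow vectors. It is only meant to show why a reference kept across `next()` calls ends up pointing at newer data, and why a deep copy avoids that. It does not reproduce the library's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class ReuseContainersSketch {

  /**
   * Hypothetical stand-in for a reader with container reuse enabled:
   * every next() call overwrites and returns the same shared array.
   * Assumes all source batches have the same length.
   */
  static Iterator<int[]> reusingBatches(int[][] source) {
    final int[] shared = new int[source[0].length];
    return new Iterator<int[]>() {
      private int position = 0;

      @Override
      public boolean hasNext() {
        return position < source.length;
      }

      @Override
      public int[] next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        // Refill the shared container instead of allocating a new one.
        System.arraycopy(source[position++], 0, shared, 0, shared.length);
        return shared;
      }
    };
  }

  /** Keeps every batch, either by reference or as a deep copy. */
  static List<int[]> collect(Iterator<int[]> batches, boolean deepCopy) {
    List<int[]> kept = new ArrayList<>();
    while (batches.hasNext()) {
      int[] batch = batches.next();
      kept.add(deepCopy ? Arrays.copyOf(batch, batch.length) : batch);
    }
    return kept;
  }

  public static void main(String[] args) {
    int[][] source = {{1, 2}, {3, 4}};

    // Holding references: both entries alias the shared container.
    List<int[]> aliased = collect(reusingBatches(source), false);
    System.out.println(Arrays.toString(aliased.get(0))); // [3, 4]

    // Deep copying before the next call preserves each batch.
    List<int[]> copied = collect(reusingBatches(source), true);
    System.out.println(Arrays.toString(copied.get(0))); // [1, 2]
  }
}
```

With the real reader, the simplest way to stay within the contract is to process each ColumnarBatch completely inside the loop body and keep only derived values, such as row counts or aggregates.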
close
public void close() throws java.io.IOException
Specified by:
- close in interface java.lang.AutoCloseable
Specified by:
- close in interface java.io.Closeable
Overrides:
- close in class CloseableGroup
Throws:
- java.io.IOException
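
Because the reader implements java.lang.AutoCloseable, it can be managed with a try-with-resources statement. The sketch below uses a hypothetical `Tracked` class instead of real Iceberg objects. It shows only the standard Java guarantee that resources are closed in the reverse order of their declaration. A resource declared after the reader, such as the iterator returned by open (assuming, as its type name suggests, that it is closeable), is therefore closed before the reader itself.

```java
import java.util.ArrayList;
import java.util.List;

public class CloseOrderSketch {

  /** Hypothetical stand-in resource that records its name when closed. */
  static final class Tracked implements AutoCloseable {
    private final String name;
    private final List<String> log;

    Tracked(String name, List<String> log) {
      this.name = name;
      this.log = log;
    }

    @Override
    public void close() {
      log.add(name);
    }
  }

  /** Opens two resources, does some work, and returns the event log. */
  static List<String> run() {
    List<String> log = new ArrayList<>();
    // Declared first, closed last: "reader" plays the role of the reader,
    // "iterator" the role of the batch iterator obtained from it.
    try (Tracked reader = new Tracked("reader", log);
         Tracked iterator = new Tracked("iterator", log)) {
      log.add("consume");
    }
    return log;
  }

  public static void main(String[] args) {
    System.out.println(run()); // [consume, iterator, reader]
  }
}
```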