Class ArrowReader
- java.lang.Object
  - org.apache.iceberg.io.CloseableGroup
    - org.apache.iceberg.arrow.vectorized.ArrowReader
All Implemented Interfaces: java.io.Closeable, java.lang.AutoCloseable
public class ArrowReader extends CloseableGroup
Vectorized reader that returns an iterator of ColumnarBatch. See open(CloseableIterable) to learn about the behavior of the iterator.
The following Iceberg data types are supported and have been tested:
- Iceberg: Types.BooleanType, Arrow: Types.MinorType.BIT
- Iceberg: Types.IntegerType, Arrow: Types.MinorType.INT
- Iceberg: Types.LongType, Arrow: Types.MinorType.BIGINT
- Iceberg: Types.FloatType, Arrow: Types.MinorType.FLOAT4
- Iceberg: Types.DoubleType, Arrow: Types.MinorType.FLOAT8
- Iceberg: Types.StringType, Arrow: Types.MinorType.VARCHAR
- Iceberg: Types.TimestampType (both with and without timezone), Arrow: Types.MinorType.TIMEMICRO
- Iceberg: Types.BinaryType, Arrow: Types.MinorType.VARBINARY
- Iceberg: Types.DateType, Arrow: Types.MinorType.DATEDAY
- Iceberg: Types.TimeType, Arrow: Types.MinorType.TIMEMICRO
- Iceberg: Types.UUIDType, Arrow: Types.MinorType.FIXEDSIZEBINARY(16)
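The mapping above can be captured in a small lookup table, for example to check before a scan whether a column type is on the supported list. The sketch below is a stand-alone illustration that uses plain strings for the type names. The class and method names (`SupportedTypeLookup`, `arrowTypeFor`) are hypothetical and are not part of the Iceberg API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Iceberg: mirrors the documented list as strings.
final class SupportedTypeLookup {
  private static final Map<String, String> ICEBERG_TO_ARROW = new LinkedHashMap<>();

  static {
    ICEBERG_TO_ARROW.put("BooleanType", "BIT");
    ICEBERG_TO_ARROW.put("IntegerType", "INT");
    ICEBERG_TO_ARROW.put("LongType", "BIGINT");
    ICEBERG_TO_ARROW.put("FloatType", "FLOAT4");
    ICEBERG_TO_ARROW.put("DoubleType", "FLOAT8");
    ICEBERG_TO_ARROW.put("StringType", "VARCHAR");
    ICEBERG_TO_ARROW.put("TimestampType", "TIMEMICRO"); // as listed in this page
    ICEBERG_TO_ARROW.put("BinaryType", "VARBINARY");
    ICEBERG_TO_ARROW.put("DateType", "DATEDAY");
    ICEBERG_TO_ARROW.put("TimeType", "TIMEMICRO");
    ICEBERG_TO_ARROW.put("UUIDType", "FIXEDSIZEBINARY(16)");
  }

  // Returns the documented Arrow minor type, or empty if the type is not listed.
  static Optional<String> arrowTypeFor(String icebergType) {
    return Optional.ofNullable(ICEBERG_TO_ARROW.get(icebergType));
  }

  public static void main(String[] args) {
    System.out.println(arrowTypeFor("LongType").orElse("unsupported"));
    System.out.println(arrowTypeFor("DecimalType").orElse("unsupported"));
  }
}
```

A lookup that returns `Optional` makes the "not listed" case explicit, which matches the way unsupported types are called out separately on this page.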
Features that don't work in this implementation:
- Type promotion: In case of type promotion, the Arrow vector corresponding to the data type in the parquet file is returned instead of the data type in the latest schema. See https://github.com/apache/iceberg/issues/2483.
- Columns with constant values are physically encoded as a dictionary. The Arrow vector type is int32 instead of the type as per the schema. See https://github.com/apache/iceberg/issues/2484.
- Data types: Types.ListType, Types.MapType, Types.StructType, Types.FixedType and Types.DecimalType. See https://github.com/apache/iceberg/issues/2485 and https://github.com/apache/iceberg/issues/2486.
- Delete files are not supported. See https://github.com/apache/iceberg/issues/2487.
Constructor Summary
- ArrowReader(TableScan scan, int batchSize, boolean reuseContainers): Create a new instance of the reader.
Method Summary
- void close(): Close all the registered resources.
- CloseableIterator<ColumnarBatch> open(CloseableIterable<CombinedScanTask> tasks): Returns a new iterator of ColumnarBatch objects.
Methods inherited from class org.apache.iceberg.io.CloseableGroup
addCloseable, addCloseable, setSuppressCloseFailure
Constructor Detail
ArrowReader
public ArrowReader(TableScan scan, int batchSize, boolean reuseContainers)
Create a new instance of the reader.
Parameters:
- scan - the table scan object.
- batchSize - the maximum number of rows per Arrow batch.
- reuseContainers - whether to reuse Arrow vectors when iterating through the data. If set to false, every Iterator.next() call creates new instances of Arrow vectors. If set to true, the Arrow vectors from the previous Iterator.next() call may be reused for the data returned by the current Iterator.next() call, which avoids allocating memory repeatedly. Irrespective of the value of reuseContainers, the Arrow vectors from the previous Iterator.next() call are closed before new instances are created in the current Iterator.next() call.
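To make the effect of reuseContainers concrete, here is a toy model that uses only the Java standard library. An `int[]` stands in for an Arrow vector, and overwriting the array with `-1` stands in for closing it. The class `ReuseContainersDemo` and its `batches` method are assumptions made for illustration and do not reflect the actual Iceberg implementation.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Toy model only: int[] stands in for an Arrow vector; this is not Iceberg code.
final class ReuseContainersDemo {

  // Yields three batches {0,1}, {2,3}, {4,5}, optionally reusing one backing array.
  static Iterator<int[]> batches(boolean reuseContainers) {
    return new Iterator<int[]>() {
      private int cursor = 0;
      private int[] previous;

      @Override
      public boolean hasNext() {
        return cursor < 6;
      }

      @Override
      public int[] next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        int[] current;
        if (reuseContainers && previous != null) {
          current = previous; // same container, contents overwritten below
        } else {
          if (previous != null) {
            Arrays.fill(previous, -1); // models closing the previous vectors
          }
          current = new int[2];
        }
        current[0] = cursor++;
        current[1] = cursor++;
        previous = current;
        return current;
      }
    };
  }

  public static void main(String[] args) {
    for (boolean reuse : new boolean[] {false, true}) {
      Iterator<int[]> it = batches(reuse);
      int[] first = it.next();
      int[] copy = first.clone(); // copy taken before advancing the iterator
      int[] second = it.next();
      System.out.println("reuse=" + reuse
          + " first=" + Arrays.toString(first)
          + " copy=" + Arrays.toString(copy)
          + " sameContainer=" + (first == second));
    }
  }
}
```

In both modes the reference held from the first call no longer carries the first batch's data after the second call; only the copy does. That is the practical reason to finish with a batch, or copy it, before advancing.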
Method Detail
open
public CloseableIterator<ColumnarBatch> open(CloseableIterable<CombinedScanTask> tasks)
Returns a new iterator of ColumnarBatch objects.
Note that the reader owns the ColumnarBatch objects and takes care of closing them. The caller should not hold onto a ColumnarBatch or try to close it.
If reuseContainers is false, the Arrow vectors in the previous ColumnarBatch are closed before the next ColumnarBatch object is returned. This implies that the caller should either finish using the ColumnarBatch or transfer ownership of it before getting the next ColumnarBatch.
If reuseContainers is true, the Arrow vectors in the previous ColumnarBatch may be reused for the next ColumnarBatch. This implies that the caller should either finish using the ColumnarBatch or deep copy it before getting the next ColumnarBatch.
This method works only when all of the following conditions are true:
- At least one column is queried,
- There are no delete files, and
- Supported data types are queried (see SUPPORTED_TYPES).
If any of these conditions is not met, an UnsupportedOperationException is thrown.
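The three conditions listed for open can be sketched as a stand-alone validation step. This is an illustration built only on the description above: the class `OpenPreconditions`, its `validate` method, the string type names, and the `SUPPORTED` set (a stand-in for SUPPORTED_TYPES) are all assumptions, not the checks as they are actually written in Iceberg.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: string type names and this class are assumptions, not Iceberg code.
final class OpenPreconditions {
  // Stand-in for SUPPORTED_TYPES, using lowercase names for the documented types.
  private static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList(
      "boolean", "int", "long", "float", "double", "string",
      "timestamp", "timestamptz", "binary", "date", "time", "uuid"));

  // Throws UnsupportedOperationException if any documented condition is violated.
  static void validate(List<String> projectedColumnTypes, int deleteFileCount) {
    if (projectedColumnTypes.isEmpty()) {
      throw new UnsupportedOperationException("At least one column must be queried");
    }
    if (deleteFileCount > 0) {
      throw new UnsupportedOperationException("Delete files are not supported");
    }
    for (String type : projectedColumnTypes) {
      if (!SUPPORTED.contains(type)) {
        throw new UnsupportedOperationException("Unsupported data type: " + type);
      }
    }
  }

  public static void main(String[] args) {
    validate(Arrays.asList("int", "string"), 0); // passes silently
    try {
      validate(Arrays.asList("int", "decimal"), 0);
    } catch (UnsupportedOperationException e) {
      System.out.println(e.getMessage());
    }
  }
}
```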
close
public void close() throws java.io.IOException
Description copied from class: CloseableGroup
Close all the registered resources. The close method of each resource is called only once. A checked exception thrown by an AutoCloseable is wrapped in a runtime exception.
- Specified by: close in interface java.lang.AutoCloseable
- Specified by: close in interface java.io.Closeable
- Overrides: close in class CloseableGroup
- Throws: java.io.IOException
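The close-once and exception-wrapping behavior described for close can be modeled with a small stand-alone class. This is a sketch based only on the description above, under the assumption that IOException and runtime exceptions pass through unchanged; the class `CloseOnceGroup` and its `add` method are hypothetical and are not the CloseableGroup source.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of a closeable group; names and details are assumptions, not Iceberg code.
final class CloseOnceGroup implements Closeable {
  private final Deque<AutoCloseable> resources = new ArrayDeque<>();

  void add(AutoCloseable resource) {
    resources.addLast(resource);
  }

  @Override
  public void close() throws IOException {
    while (!resources.isEmpty()) {
      // Each resource is removed before it is closed, so a second close() call
      // on the group finds nothing left and never closes a resource twice.
      AutoCloseable resource = resources.removeFirst();
      try {
        resource.close();
      } catch (IOException | RuntimeException e) {
        throw e; // assumed to propagate unchanged
      } catch (Exception e) {
        // Any other checked exception is wrapped in a runtime exception.
        throw new RuntimeException("Failed to close resource", e);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    int[] closeCalls = {0};
    CloseOnceGroup group = new CloseOnceGroup();
    group.add(() -> closeCalls[0]++);
    group.close();
    group.close(); // no effect: the resource was already removed
    System.out.println("close calls: " + closeCalls[0]);
  }
}
```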