Class ColumnarBatchUtil
java.lang.Object
org.apache.iceberg.spark.data.vectorized.ColumnarBatchUtil
-
Method Summary
Modifier and TypeMethodDescriptionstatic boolean[]
buildIsDeleted
(org.apache.spark.sql.vectorized.ColumnVector[] columnVectors, DeleteFilter<org.apache.spark.sql.catalyst.InternalRow> deletes, long rowStartPosInBatch, int batchSize) Builds a boolean array to indicate if a row is deleted or not.buildRowIdMapping
(org.apache.spark.sql.vectorized.ColumnVector[] columnVectors, DeleteFilter<org.apache.spark.sql.catalyst.InternalRow> deletes, long rowStartPosInBatch, int batchSize) Builds a row ID mapping inside a batch to skip deleted rows.static org.apache.spark.sql.vectorized.ColumnVector[]
removeExtraColumns
(DeleteFilter<org.apache.spark.sql.catalyst.InternalRow> deletes, org.apache.spark.sql.vectorized.ColumnVector[] columnVectors) Removes extra column vectors added for processing equality delete filters that are not part of the final query output.
-
Method Details
-
buildRowIdMapping
public static Pair<int[],Integer> buildRowIdMapping(org.apache.spark.sql.vectorized.ColumnVector[] columnVectors, DeleteFilter<org.apache.spark.sql.catalyst.InternalRow> deletes, long rowStartPosInBatch, int batchSize) Builds a row ID mapping inside a batch to skip deleted rows.Initial state Data values: [v0, v1, v2, v3, v4, v5, v6, v7] Row ID mapping: [0, 1, 2, 3, 4, 5, 6, 7] Apply position deletes Position deletes: 2, 6 Row ID mapping: [0, 1, 3, 4, 5, 7, -, -] (6 live records) Apply equality deletes Equality deletes: v1, v2, v3 Row ID mapping: [0, 4, 5, 7, -, -, -, -] (4 live records)
- Parameters:
columnVectors
- the array of column vectors for the batchdeletes
- the delete filter containing delete informationrowStartPosInBatch
- the starting position of the row in the batchbatchSize
- the size of the batch- Returns:
- the mapping array and the number of live rows, or
null
if nothing is deleted
-
buildIsDeleted
public static boolean[] buildIsDeleted(org.apache.spark.sql.vectorized.ColumnVector[] columnVectors, DeleteFilter<org.apache.spark.sql.catalyst.InternalRow> deletes, long rowStartPosInBatch, int batchSize) Builds a boolean array to indicate if a row is deleted or not.Initial state Data values: [v0, v1, v2, v3, v4, v5, v6, v7] Is deleted array: [F, F, F, F, F, F, F, F] Apply position deletes Position deletes: 2, 6 Is deleted array: [F, F, T, F, F, F, T, F] (6 live records) Apply equality deletes Equality deletes: v1, v2, v3 Is deleted array: [F, T, T, T, F, F, T, F] (4 live records)
- Parameters:
columnVectors
- the array of column vectors for the batch.deletes
- the delete filter containing information about which rows should be deleted.rowStartPosInBatch
- the starting position of the row in the batch, used to calculate the absolute position of the rows in the context of the entire dataset.batchSize
- the number of rows in the current batch.- Returns:
- an array of boolean values to indicate if a row is deleted or not
-
removeExtraColumns
public static org.apache.spark.sql.vectorized.ColumnVector[] removeExtraColumns(DeleteFilter<org.apache.spark.sql.catalyst.InternalRow> deletes, org.apache.spark.sql.vectorized.ColumnVector[] columnVectors) Removes extra column vectors added for processing equality delete filters that are not part of the final query output.During query execution, additional columns may be included in the schema to evaluate equality delete filters. For example, if the table schema contains columns C1, C2, C3, C4, and C5, and the query is 'SELECT C5 FROM table'. While equality delete filters are applied on C3 and C4, the processing schema includes C5, C3, and C4. These extra columns (C3 and C4) are needed to identify rows to delete but are not included in the final result.
This method removes the extra column vectors from the end of column vectors array, ensuring only the expected column vectors remain.
- Parameters:
deletes
- the delete filter containing delete information.columnVectors
- the array of column vectors representing query result data- Returns:
- a new column vectors array with extra column vectors removed, or the original column vectors array if no extra column vectors are found
-