Package org.apache.iceberg.parquet
Class VariantShreddingAnalyzer<T,S>
java.lang.Object
org.apache.iceberg.parquet.VariantShreddingAnalyzer<T,S>
- Type Parameters:
T- the engine-specific row type (e.g., Spark InternalRow, Flink RowData)S- the engine-specific schema type (e.g., Spark StructType, Flink RowType)
Analyzes variant data across buffered rows to determine an optimal shredding schema.
Determinism contract: for a given set of variant values (regardless of row arrival order),
this analyzer produces the same shredded schema. When the number of distinct fields at any level
exceeds MAX_INTERMEDIATE_FIELDS, field tracking becomes insertion-order dependent and
determinism is not guaranteed.
- Object fields use a TreeMap, so field ordering is alphabetical and deterministic.
- Type selection picks the most common type with explicit tie-break priority (see TIE_BREAK_PRIORITY), not enum ordinal.
- Integer types (INT8/16/32/64) and decimal types (DECIMAL4/8/16) are each promoted to the widest observed before competing with other types.
- Fields below
MIN_FIELD_FREQUENCYare pruned. AboveMAX_SHREDDED_FIELDS, the most frequent are kept with alphabetical tie-breaking. - Recursion into nested objects/arrays stops at
MAX_SHREDDING_DEPTH(default 50). - New struct fields are not tracked once a node reaches
MAX_INTERMEDIATE_FIELDS(default 1000) to bound memory during inference.
This contract holds within a single batch. Different batches with different distributions may produce different layouts; cross-batch stability requires schema pinning (not yet implemented).
Subclasses implement extractVariantValues(java.util.List<T>, int) to convert engine-specific row types into
VariantValue instances.
-
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionorg.apache.parquet.schema.TypeanalyzeAndCreateSchema(List<T> bufferedRows, int variantFieldIndex) Analyzes buffered variant values to determine the optimal shredding schema.analyzeVariantColumns(List<T> bufferedRows, Schema icebergSchema, S engineSchema) Analyzes all variant columns in the schema, resolving column indices via the engine-specificresolveColumnIndex(S, java.lang.String)method.protected abstract List<VariantValue> extractVariantValues(List<T> bufferedRows, int variantFieldIndex) protected abstract intresolveColumnIndex(S engineSchema, String columnName) Resolves a column name to its index in the engine-specific schema.
-
Constructor Details
-
VariantShreddingAnalyzer
protected VariantShreddingAnalyzer()
-
-
Method Details
-
analyzeAndCreateSchema
public org.apache.parquet.schema.Type analyzeAndCreateSchema(List<T> bufferedRows, int variantFieldIndex) Analyzes buffered variant values to determine the optimal shredding schema.- Parameters:
bufferedRows- the buffered rows to analyzevariantFieldIndex- the index of the variant field in the rows- Returns:
- the shredded schema type, or null if no shredding should be performed
-
extractVariantValues
protected abstract List<VariantValue> extractVariantValues(List<T> bufferedRows, int variantFieldIndex) -
resolveColumnIndex
Resolves a column name to its index in the engine-specific schema. Returns -1 if the column is not found. -
analyzeVariantColumns
public Map<Integer,org.apache.parquet.schema.Type> analyzeVariantColumns(List<T> bufferedRows, Schema icebergSchema, S engineSchema) Analyzes all variant columns in the schema, resolving column indices via the engine-specificresolveColumnIndex(S, java.lang.String)method.- Parameters:
bufferedRows- the buffered rows to analyzeicebergSchema- the Iceberg table schemaengineSchema- the engine-specific schema used to resolve column indices- Returns:
- a map from Iceberg field ID to the shredded Parquet type for each variant column
-