org.apache.iceberg.parquet.VariantShreddingAnalyzer<T,S>

Type Parameters:: T - the engine-specific row type (e.g., Spark InternalRow, Flink RowData); S - the engine-specific schema type (e.g., Spark StructType, Flink RowType)

public abstract class VariantShreddingAnalyzer<T,S> extends Object

Analyzes variant data across buffered rows to determine an optimal shredding schema.

Determinism contract: for a given set of variant values (regardless of row arrival order), this analyzer produces the same shredded schema. When the number of distinct fields at any level exceeds MAX_INTERMEDIATE_FIELDS, field tracking becomes insertion-order dependent and determinism is not guaranteed.

Object fields use a TreeMap, so field ordering is alphabetical and deterministic.
Type selection picks the most common type with explicit tie-break priority (see TIE_BREAK_PRIORITY), not enum ordinal.
Integer types (INT8/16/32/64) and decimal types (DECIMAL4/8/16) are each promoted to the widest observed before competing with other types.
Fields below MIN_FIELD_FREQUENCY are pruned. Above MAX_SHREDDED_FIELDS, the most frequent are kept with alphabetical tie-breaking.
Recursion into nested objects/arrays stops at MAX_SHREDDING_DEPTH (default 50).
New struct fields are not tracked once a node reaches MAX_INTERMEDIATE_FIELDS (default 1000) to bound memory during inference.

This contract holds within a single batch. Different batches with different distributions may produce different layouts; cross-batch stability requires schema pinning (not yet implemented).

Subclasses implement extractVariantValues(java.util.List<T>, int) to convert engine-specific row types into VariantValue instances.

Constructor Summary

Constructors

Modifier

Constructor

Description

protected

VariantShreddingAnalyzer()
Method Summary

Modifier and Type

Method

Description

org.apache.parquet.schema.Type

analyzeAndCreateSchema(List<T> bufferedRows, int variantFieldIndex)

Analyzes buffered variant values to determine the optimal shredding schema.

Map<Integer,org.apache.parquet.schema.Type>

analyzeVariantColumns(List<T> bufferedRows, Schema icebergSchema, S engineSchema)

Analyzes all variant columns in the schema, resolving column indices via the engine-specific resolveColumnIndex(S, java.lang.String) method.

protected abstract List<VariantValue>

extractVariantValues(List<T> bufferedRows, int variantFieldIndex)

protected abstract int

resolveColumnIndex(S engineSchema, String columnName)

Resolves a column name to its index in the engine-specific schema.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- VariantShreddingAnalyzer
  
  protected VariantShreddingAnalyzer()
Method Details
- analyzeAndCreateSchema
  
  public org.apache.parquet.schema.Type analyzeAndCreateSchema(List<T> bufferedRows, int variantFieldIndex)
  
  Analyzes buffered variant values to determine the optimal shredding schema.
  
  Parameters:
  
  bufferedRows - the buffered rows to analyze
  
  variantFieldIndex - the index of the variant field in the rows
  
  Returns:
  
  the shredded schema type, or null if no shredding should be performed
- extractVariantValues
  
  protected abstract List<VariantValue> extractVariantValues(List<T> bufferedRows, int variantFieldIndex)
- resolveColumnIndex
  
  protected abstract int resolveColumnIndex(S engineSchema, String columnName)
  
  Resolves a column name to its index in the engine-specific schema. Returns -1 if the column is not found.
- analyzeVariantColumns
  
  public Map<Integer,org.apache.parquet.schema.Type> analyzeVariantColumns(List<T> bufferedRows, Schema icebergSchema, S engineSchema)
  
  Analyzes all variant columns in the schema, resolving column indices via the engine-specific resolveColumnIndex(S, java.lang.String) method.
  
  Parameters:
  
  bufferedRows - the buffered rows to analyze
  
  icebergSchema - the Iceberg table schema
  
  engineSchema - the engine-specific schema used to resolve column indices
  
  Returns:
  
  a map from Iceberg field ID to the shredded Parquet type for each variant column

Class VariantShreddingAnalyzer<T,S>

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Constructor Details

VariantShreddingAnalyzer

Method Details

analyzeAndCreateSchema

extractVariantValues

resolveColumnIndex

analyzeVariantColumns