Class VariantShreddingAnalyzer<T,S>

java.lang.Object
org.apache.iceberg.parquet.VariantShreddingAnalyzer<T,S>
Type Parameters:
T - the engine-specific row type (e.g., Spark InternalRow, Flink RowData)
S - the engine-specific schema type (e.g., Spark StructType, Flink RowType)

public abstract class VariantShreddingAnalyzer<T,S> extends Object
Analyzes variant data across buffered rows to determine an optimal shredding schema.

Determinism contract: for a given set of variant values (regardless of row arrival order), this analyzer produces the same shredded schema. When the number of distinct fields at any level exceeds MAX_INTERMEDIATE_FIELDS, field tracking becomes insertion-order dependent and determinism is not guaranteed.

  • Object fields use a TreeMap, so field ordering is alphabetical and deterministic.
  • Type selection picks the most common type with explicit tie-break priority (see TIE_BREAK_PRIORITY), not enum ordinal.
  • Integer types (INT8/16/32/64) and decimal types (DECIMAL4/8/16) are each promoted to the widest observed before competing with other types.
  • Fields below MIN_FIELD_FREQUENCY are pruned. Above MAX_SHREDDED_FIELDS, the most frequent are kept with alphabetical tie-breaking.
  • Recursion into nested objects/arrays stops at MAX_SHREDDING_DEPTH (default 50).
  • New struct fields are not tracked once a node reaches MAX_INTERMEDIATE_FIELDS (default 1000) to bound memory during inference.

This contract holds within a single batch. Different batches with different distributions may produce different layouts; cross-batch stability requires schema pinning (not yet implemented).

Subclasses implement extractVariantValues(java.util.List<T>, int) to convert engine-specific row types into VariantValue instances.

  • Constructor Details

    • VariantShreddingAnalyzer

      protected VariantShreddingAnalyzer()
  • Method Details

    • analyzeAndCreateSchema

      public org.apache.parquet.schema.Type analyzeAndCreateSchema(List<T> bufferedRows, int variantFieldIndex)
      Analyzes buffered variant values to determine the optimal shredding schema.
      Parameters:
      bufferedRows - the buffered rows to analyze
      variantFieldIndex - the index of the variant field in the rows
      Returns:
      the shredded schema type, or null if no shredding should be performed
    • extractVariantValues

      protected abstract List<VariantValue> extractVariantValues(List<T> bufferedRows, int variantFieldIndex)
    • resolveColumnIndex

      protected abstract int resolveColumnIndex(S engineSchema, String columnName)
      Resolves a column name to its index in the engine-specific schema. Returns -1 if the column is not found.
    • analyzeVariantColumns

      public Map<Integer,org.apache.parquet.schema.Type> analyzeVariantColumns(List<T> bufferedRows, Schema icebergSchema, S engineSchema)
      Analyzes all variant columns in the schema, resolving column indices via the engine-specific resolveColumnIndex(S, java.lang.String) method.
      Parameters:
      bufferedRows - the buffered rows to analyze
      icebergSchema - the Iceberg table schema
      engineSchema - the engine-specific schema used to resolve column indices
      Returns:
      a map from Iceberg field ID to the shredded Parquet type for each variant column