Class SparkSchemaUtil

java.lang.Object
org.apache.iceberg.spark.SparkSchemaUtil

public class SparkSchemaUtil extends Object
Helper methods for working with Spark/Hive metadata.
  • Method Details

    • schemaForTable

      public static Schema schemaForTable(org.apache.spark.sql.SparkSession spark, String name)
      Returns a Schema for the given table with fresh field ids.

      This creates a Schema for an existing table by looking up the table's schema with Spark and converting that schema. Spark/Hive partition columns are included in the schema.

      Parameters:
      spark - a Spark session
      name - a table name and (optional) database
      Returns:
      a Schema for the table, if found
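      A minimal usage sketch; the table name db.sample_table and the local Hive-enabled session are illustrative assumptions, and the session must be able to resolve the table:

      import org.apache.iceberg.Schema;
      import org.apache.iceberg.spark.SparkSchemaUtil;
      import org.apache.spark.sql.SparkSession;

      public class SchemaForTableExample {
        public static void main(String[] args) {
          SparkSession spark = SparkSession.builder()
              .master("local[1]")
              .appName("schema-for-table")
              .enableHiveSupport() // needed when the table lives in a Hive metastore
              .getOrCreate();

          // Looks up the table with Spark and converts its schema, assigning fresh field ids.
          Schema schema = SparkSchemaUtil.schemaForTable(spark, "db.sample_table");
          System.out.println(schema); // includes Spark/Hive partition columns

          spark.stop();
        }
      }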
    • specForTable

      public static PartitionSpec specForTable(org.apache.spark.sql.SparkSession spark, String name) throws org.apache.spark.sql.AnalysisException
      Returns a PartitionSpec for the given table.

      This creates a partition spec for an existing table by looking up the table's schema and creating a spec with identity partitions for each partition column.

      Parameters:
      spark - a Spark session
      name - a table name and (optional) database
      Returns:
      a PartitionSpec for the table
      Throws:
      org.apache.spark.sql.AnalysisException - if thrown by the Spark catalog
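      A sketch of building a spec from an existing table; the table name and its partition columns dt and hour are hypothetical:

      import org.apache.iceberg.PartitionSpec;
      import org.apache.iceberg.spark.SparkSchemaUtil;
      import org.apache.spark.sql.SparkSession;

      public class SpecForTableExample {
        public static void main(String[] args) throws org.apache.spark.sql.AnalysisException {
          SparkSession spark = SparkSession.builder()
              .master("local[1]")
              .appName("spec-for-table")
              .enableHiveSupport()
              .getOrCreate();

          // For a table partitioned by (dt, hour), this produces identity(dt), identity(hour).
          PartitionSpec spec = SparkSchemaUtil.specForTable(spark, "db.sample_table");
          System.out.println(spec);

          spark.stop();
        }
      }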
    • convert

      public static org.apache.spark.sql.types.StructType convert(Schema schema)
      Convert a Schema to a Spark type.
      Parameters:
      schema - a Schema
      Returns:
      the equivalent Spark type
      Throws:
      IllegalArgumentException - if the type cannot be converted to Spark
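      A minimal sketch; the two-column schema is illustrative:

      import org.apache.iceberg.Schema;
      import org.apache.iceberg.spark.SparkSchemaUtil;
      import org.apache.iceberg.types.Types;
      import org.apache.spark.sql.types.StructType;

      public class ConvertSchemaExample {
        public static void main(String[] args) {
          Schema schema = new Schema(
              Types.NestedField.required(1, "id", Types.LongType.get()),
              Types.NestedField.optional(2, "data", Types.StringType.get()));

          // Required fields become non-nullable Spark fields.
          StructType sparkType = SparkSchemaUtil.convert(schema);
          System.out.println(sparkType.catalogString()); // struct<id:bigint,data:string>
        }
      }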
    • convert

      public static org.apache.spark.sql.types.DataType convert(Type type)
      Convert a Type to a Spark type.
      Parameters:
      type - a Type
      Returns:
      the equivalent Spark type
      Throws:
      IllegalArgumentException - if the type cannot be converted to Spark
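      For example, converting individual types (a sketch; the element id 1 is arbitrary):

      import org.apache.iceberg.spark.SparkSchemaUtil;
      import org.apache.iceberg.types.Types;
      import org.apache.spark.sql.types.DataType;

      public class ConvertTypeExample {
        public static void main(String[] args) {
          DataType dec = SparkSchemaUtil.convert(Types.DecimalType.of(10, 2));
          DataType list = SparkSchemaUtil.convert(Types.ListType.ofOptional(1, Types.StringType.get()));
          System.out.println(dec.catalogString());  // decimal(10,2)
          System.out.println(list.catalogString()); // array<string>
        }
      }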
    • convert

      public static Schema convert(org.apache.spark.sql.types.StructType sparkType)
      Convert a Spark struct to a Schema with new field ids.

      This conversion assigns fresh ids.

      Some Iceberg data types are represented by the same Spark type (for example, both uuid and string map to Spark's StringType). These are converted to a default Iceberg type.

      To convert using a reference schema for field ids and ambiguous types, use convert(Schema, StructType).

      Parameters:
      sparkType - a Spark StructType
      Returns:
      the equivalent Schema
      Throws:
      IllegalArgumentException - if the type cannot be converted
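      A sketch of the fresh-id conversion; the field names and types are illustrative:

      import org.apache.iceberg.Schema;
      import org.apache.iceberg.spark.SparkSchemaUtil;
      import org.apache.spark.sql.types.DataTypes;
      import org.apache.spark.sql.types.StructType;

      public class ConvertStructExample {
        public static void main(String[] args) {
          StructType sparkType = new StructType()
              .add("id", DataTypes.LongType, false)     // non-nullable -> required
              .add("data", DataTypes.StringType, true); // nullable -> optional

          Schema schema = SparkSchemaUtil.convert(sparkType);
          System.out.println(schema); // field ids are assigned fresh
        }
      }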
    • convert

      public static Type convert(org.apache.spark.sql.types.DataType sparkType)
      Convert a Spark type to a Type with new field ids.

      This conversion assigns fresh ids.

      Some Iceberg data types are represented by the same Spark type; these are converted to a default Iceberg type.

      To convert using a reference schema for field ids and ambiguous types, use convert(Schema, StructType).

      Parameters:
      sparkType - a Spark DataType
      Returns:
      the equivalent Type
      Throws:
      IllegalArgumentException - if the type cannot be converted
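      A sketch of the default choices for ambiguous Spark types (the defaults shown are assumptions about the conversion, not part of this Javadoc):

      import org.apache.iceberg.spark.SparkSchemaUtil;
      import org.apache.iceberg.types.Type;
      import org.apache.spark.sql.types.DataTypes;

      public class ConvertDataTypeExample {
        public static void main(String[] args) {
          // StringType could correspond to string or uuid; the default is string.
          Type s = SparkSchemaUtil.convert(DataTypes.StringType);
          // BinaryType could correspond to binary or fixed; the default is binary.
          Type b = SparkSchemaUtil.convert(DataTypes.BinaryType);
          System.out.println(s + ", " + b); // string, binary
        }
      }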
    • convert

      public static Schema convert(Schema baseSchema, org.apache.spark.sql.types.StructType sparkType)
      Convert a Spark struct to a Schema based on the given schema.

      This conversion does not assign new ids; it uses ids from the base schema.

      Data types, field order, and nullability will match the Spark type. This conversion may return a schema that is not compatible with the base schema.

      Parameters:
      baseSchema - a Schema on which conversion is based
      sparkType - a Spark StructType
      Returns:
      the equivalent Schema
      Throws:
      IllegalArgumentException - if the type cannot be converted or there are missing ids
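      A sketch, assuming fields are matched to the base schema by name, so ids follow the base schema while order and nullability follow the Spark type:

      import org.apache.iceberg.Schema;
      import org.apache.iceberg.spark.SparkSchemaUtil;
      import org.apache.iceberg.types.Types;
      import org.apache.spark.sql.types.DataTypes;
      import org.apache.spark.sql.types.StructType;

      public class ConvertWithBaseExample {
        public static void main(String[] args) {
          Schema base = new Schema(
              Types.NestedField.required(1, "id", Types.LongType.get()),
              Types.NestedField.optional(2, "data", Types.StringType.get()));

          // A Spark type that reorders the fields of the base schema.
          StructType sparkType = new StructType()
              .add("data", DataTypes.StringType, true)
              .add("id", DataTypes.LongType, false);

          Schema converted = SparkSchemaUtil.convert(base, sparkType);
          System.out.println(converted); // data keeps id 2, id keeps id 1
        }
      }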
    • convert

      public static Schema convert(Schema baseSchema, org.apache.spark.sql.types.StructType sparkType, boolean caseSensitive)
      Convert a Spark struct to a Schema based on the given schema.

      This conversion does not assign new ids; it uses ids from the base schema.

      Data types, field order, and nullability will match the Spark type. This conversion may return a schema that is not compatible with the base schema.

      Parameters:
      baseSchema - a Schema on which conversion is based
      sparkType - a Spark StructType
      caseSensitive - when false, the case of field names in the schema is ignored
      Returns:
      the equivalent Schema
      Throws:
      IllegalArgumentException - if the type cannot be converted or there are missing ids
    • convertWithFreshIds

      public static Schema convertWithFreshIds(Schema baseSchema, org.apache.spark.sql.types.StructType sparkType)
      Convert a Spark struct to a Schema based on the given schema.

      This conversion will assign new ids for fields that are not found in the base schema.

      Data types, field order, and nullability will match the Spark type. This conversion may return a schema that is not compatible with the base schema.

      Parameters:
      baseSchema - a Schema on which conversion is based
      sparkType - a Spark StructType
      Returns:
      the equivalent Schema
      Throws:
      IllegalArgumentException - if the type cannot be converted or there are missing ids
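      A sketch; the added column "extra" is hypothetical. Fields found in the base schema keep their ids, while new fields receive fresh ones:

      import org.apache.iceberg.Schema;
      import org.apache.iceberg.spark.SparkSchemaUtil;
      import org.apache.iceberg.types.Types;
      import org.apache.spark.sql.types.DataTypes;
      import org.apache.spark.sql.types.StructType;

      public class ConvertWithFreshIdsExample {
        public static void main(String[] args) {
          Schema base = new Schema(
              Types.NestedField.required(1, "id", Types.LongType.get()),
              Types.NestedField.optional(2, "data", Types.StringType.get()));

          StructType sparkType = new StructType()
              .add("id", DataTypes.LongType, false)
              .add("data", DataTypes.StringType, true)
              .add("extra", DataTypes.StringType, true); // not in base: gets a fresh id

          Schema converted = SparkSchemaUtil.convertWithFreshIds(base, sparkType);
          System.out.println(converted); // id -> 1, data -> 2, extra -> a new id
        }
      }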
    • convertWithFreshIds

      public static Schema convertWithFreshIds(Schema baseSchema, org.apache.spark.sql.types.StructType sparkType, boolean caseSensitive)
      Convert a Spark struct to a Schema based on the given schema.

      This conversion will assign new ids for fields that are not found in the base schema.

      Data types, field order, and nullability will match the Spark type. This conversion may return a schema that is not compatible with the base schema.

      Parameters:
      baseSchema - a Schema on which conversion is based
      sparkType - a Spark StructType
      caseSensitive - when false, the case of field names in the schema is ignored
      Returns:
      the equivalent Schema
      Throws:
      IllegalArgumentException - if the type cannot be converted or there are missing ids
    • prune

      public static Schema prune(Schema schema, org.apache.spark.sql.types.StructType requestedType)
      Prune columns from a Schema using a Spark type projection.

      This requires that the Spark type is a projection of the Schema. Nullability and types must match.

      Parameters:
      schema - a Schema
      requestedType - a projection of the Spark representation of the Schema
      Returns:
      a Schema corresponding to the Spark projection
      Throws:
      IllegalArgumentException - if the Spark type does not match the Schema
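      A sketch of projecting two of three columns; the schema is illustrative, and the requested struct must match the schema's types and nullability:

      import org.apache.iceberg.Schema;
      import org.apache.iceberg.spark.SparkSchemaUtil;
      import org.apache.iceberg.types.Types;
      import org.apache.spark.sql.types.DataTypes;
      import org.apache.spark.sql.types.StructType;

      public class PruneExample {
        public static void main(String[] args) {
          Schema schema = new Schema(
              Types.NestedField.required(1, "id", Types.LongType.get()),
              Types.NestedField.optional(2, "data", Types.StringType.get()),
              Types.NestedField.optional(3, "ts", Types.TimestampType.withZone()));

          StructType requested = new StructType()
              .add("id", DataTypes.LongType, false)
              .add("data", DataTypes.StringType, true);

          Schema pruned = SparkSchemaUtil.prune(schema, requested);
          System.out.println(pruned); // id (1) and data (2), original ids preserved
        }
      }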
    • prune

      public static Schema prune(Schema schema, org.apache.spark.sql.types.StructType requestedType, List<Expression> filters)
      Prune columns from a Schema using a Spark type projection.

      This requires that the Spark type is a projection of the Schema. Nullability and types must match.

      The list of filter Expressions is used to ensure that columns referenced by the filters are projected.

      Parameters:
      schema - a Schema
      requestedType - a projection of the Spark representation of the Schema
      filters - a list of filters
      Returns:
      a Schema corresponding to the Spark projection
      Throws:
      IllegalArgumentException - if the Spark type does not match the Schema
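      A sketch showing a filter column being kept even though the projection omits it; the schema and filter are illustrative:

      import java.util.Collections;
      import java.util.List;
      import org.apache.iceberg.Schema;
      import org.apache.iceberg.expressions.Expression;
      import org.apache.iceberg.expressions.Expressions;
      import org.apache.iceberg.spark.SparkSchemaUtil;
      import org.apache.iceberg.types.Types;
      import org.apache.spark.sql.types.DataTypes;
      import org.apache.spark.sql.types.StructType;

      public class PruneWithFiltersExample {
        public static void main(String[] args) {
          Schema schema = new Schema(
              Types.NestedField.required(1, "id", Types.LongType.get()),
              Types.NestedField.optional(2, "data", Types.StringType.get()));

          // The projection requests only "data"...
          StructType requested = new StructType()
              .add("data", DataTypes.StringType, true);

          // ...but the filter references "id", so "id" is projected as well.
          List<Expression> filters = Collections.singletonList(Expressions.greaterThan("id", 100L));

          Schema pruned = SparkSchemaUtil.prune(schema, requested, filters);
          System.out.println(pruned); // contains both data and id
        }
      }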
    • prune

      public static Schema prune(Schema schema, org.apache.spark.sql.types.StructType requestedType, Expression filter, boolean caseSensitive)
      Prune columns from a Schema using a Spark type projection.

      This requires that the Spark type is a projection of the Schema. Nullability and types must match.

      The filter Expression is used to ensure that columns it references are projected.

      Parameters:
      schema - a Schema
      requestedType - a projection of the Spark representation of the Schema
      filter - a filter expression
      caseSensitive - when false, the case of field names is ignored
      Returns:
      a Schema corresponding to the Spark projection
      Throws:
      IllegalArgumentException - if the Spark type does not match the Schema
    • estimateSize

      public static long estimateSize(org.apache.spark.sql.types.StructType tableSchema, long totalRecords)
      Estimate the approximate table size based on the Spark schema and the total record count.
      Parameters:
      tableSchema - Spark schema
      totalRecords - total records in the table
      Returns:
      approximate size based on table schema
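      A sketch; the schema and row count are illustrative, and it assumes the estimate multiplies a per-row size derived from the Spark column types by the record count, so it is a rough in-memory approximation rather than an on-disk measurement:

      import org.apache.iceberg.spark.SparkSchemaUtil;
      import org.apache.spark.sql.types.DataTypes;
      import org.apache.spark.sql.types.StructType;

      public class EstimateSizeExample {
        public static void main(String[] args) {
          StructType tableSchema = new StructType()
              .add("id", DataTypes.LongType, false)
              .add("data", DataTypes.StringType, true);

          long bytes = SparkSchemaUtil.estimateSize(tableSchema, 1_000_000L);
          System.out.println(bytes); // approximate size for one million rows
        }
      }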
    • validateMetadataColumnReferences

      public static void validateMetadataColumnReferences(Schema tableSchema, Schema readSchema)
    • indexQuotedNameById

      public static Map<Integer,String> indexQuotedNameById(Schema schema)