Class SparkSchemaUtil


  • public class SparkSchemaUtil
    extends java.lang.Object
    Helper methods for working with Spark/Hive metadata.
    • Method Summary

      Modifier and Type Method Description
      static org.apache.spark.sql.types.StructType convert​(Schema schema)
      Convert a Schema to a Spark type.
      static Schema convert​(Schema baseSchema, org.apache.spark.sql.types.StructType sparkType)
      Convert a Spark struct to a Schema based on the given schema.
      static org.apache.spark.sql.types.DataType convert​(Type type)
      Convert a Type to a Spark type.
      static Type convert​(org.apache.spark.sql.types.DataType sparkType)
      Convert a Spark struct to a Type with new field ids.
      static Schema convert​(org.apache.spark.sql.types.StructType sparkType)
      Convert a Spark struct to a Schema with new field ids.
      static Schema convert​(org.apache.spark.sql.types.StructType sparkType, boolean useTimestampWithoutZone)
      Convert a Spark struct to a Schema with new field ids.
      static long estimateSize​(org.apache.spark.sql.types.StructType tableSchema, long totalRecords)
      Estimate approximate table size based on Spark schema and total records.
      static java.util.Map<java.lang.Integer,​java.lang.String> indexQuotedNameById​(Schema schema)  
      static Schema prune​(Schema schema, org.apache.spark.sql.types.StructType requestedType)
      Prune columns from a Schema using a Spark type projection.
      static Schema prune​(Schema schema, org.apache.spark.sql.types.StructType requestedType, java.util.List<Expression> filters)
      Prune columns from a Schema using a Spark type projection.
      static Schema prune​(Schema schema, org.apache.spark.sql.types.StructType requestedType, Expression filter, boolean caseSensitive)
      Prune columns from a Schema using a Spark type projection.
      static Schema schemaForTable​(org.apache.spark.sql.SparkSession spark, java.lang.String name)
      Returns a Schema for the given table with fresh field ids.
      static PartitionSpec specForTable​(org.apache.spark.sql.SparkSession spark, java.lang.String name)
      Returns a PartitionSpec for the given table.
      static void validateMetadataColumnReferences​(Schema tableSchema, Schema readSchema)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • schemaForTable

        public static Schema schemaForTable​(org.apache.spark.sql.SparkSession spark,
                                            java.lang.String name)
        Returns a Schema for the given table with fresh field ids.

        This creates a Schema for an existing table by looking up the table's schema with Spark and converting that schema. Spark/Hive partition columns are included in the schema.

        Parameters:
        spark - a Spark session
        name - a table name and (optional) database
        Returns:
        a Schema for the table, if found
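
        A minimal sketch of looking up a table's Iceberg schema; the session setup and the table name db.sample are assumptions for illustration:

          import org.apache.iceberg.Schema;
          import org.apache.iceberg.spark.SparkSchemaUtil;
          import org.apache.spark.sql.SparkSession;

          SparkSession spark = SparkSession.builder().getOrCreate();

          // Looks up db.sample through the Spark catalog, converts its schema,
          // and assigns fresh field ids; partition columns are included.
          Schema schema = SparkSchemaUtil.schemaForTable(spark, "db.sample");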
      • specForTable

        public static PartitionSpec specForTable​(org.apache.spark.sql.SparkSession spark,
                                                 java.lang.String name)
                                          throws org.apache.spark.sql.AnalysisException
        Returns a PartitionSpec for the given table.

        This creates a partition spec for an existing table by looking up the table's schema and creating a spec with identity partitions for each partition column.

        Parameters:
        spark - a Spark session
        name - a table name and (optional) database
        Returns:
        a PartitionSpec for the table
        Throws:
        org.apache.spark.sql.AnalysisException - if thrown by the Spark catalog
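
        A sketch of building a spec for the same assumed table db.sample; an identity partition is created for each Spark/Hive partition column:

          import org.apache.iceberg.PartitionSpec;
          import org.apache.iceberg.spark.SparkSchemaUtil;

          // Throws org.apache.spark.sql.AnalysisException if the Spark catalog
          // cannot resolve the table name.
          PartitionSpec spec = SparkSchemaUtil.specForTable(spark, "db.sample");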
      • convert

        public static org.apache.spark.sql.types.StructType convert​(Schema schema)
        Convert a Schema to a Spark type.
        Parameters:
        schema - a Schema
        Returns:
        the equivalent Spark type
        Throws:
        java.lang.IllegalArgumentException - if the type cannot be converted to Spark
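
        A sketch of converting an Iceberg Schema to a Spark StructType; the two-column schema is an assumption for illustration:

          import org.apache.iceberg.Schema;
          import org.apache.iceberg.spark.SparkSchemaUtil;
          import org.apache.iceberg.types.Types;
          import org.apache.spark.sql.types.StructType;

          Schema schema = new Schema(
              Types.NestedField.required(1, "id", Types.LongType.get()),
              Types.NestedField.optional(2, "data", Types.StringType.get()));

          // required fields become non-nullable Spark fields, optional fields nullable
          StructType sparkType = SparkSchemaUtil.convert(schema);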
      • convert

        public static org.apache.spark.sql.types.DataType convert​(Type type)
        Convert a Type to a Spark type.
        Parameters:
        type - a Type
        Returns:
        the equivalent Spark type
        Throws:
        java.lang.IllegalArgumentException - if the type cannot be converted to Spark
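
        The single-type variant works the same way; for example, converting an Iceberg string type:

          import org.apache.iceberg.types.Types;
          import org.apache.spark.sql.types.DataType;

          // Iceberg string converts to Spark's StringType
          DataType sparkType = SparkSchemaUtil.convert(Types.StringType.get());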
      • convert

        public static Schema convert​(org.apache.spark.sql.types.StructType sparkType)
        Convert a Spark struct to a Schema with new field ids.

        This conversion assigns fresh ids.

        Some data types are represented as the same Spark type. These are converted to a default type.

        To convert using a reference schema for field ids and ambiguous types, use convert(Schema, StructType).

        Parameters:
        sparkType - a Spark StructType
        Returns:
        the equivalent Schema
        Throws:
        java.lang.IllegalArgumentException - if the type cannot be converted
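
        A sketch of converting a DataFrame's Spark schema to an Iceberg Schema with fresh ids; the DataFrame and table name are assumptions for illustration:

          import org.apache.iceberg.Schema;
          import org.apache.iceberg.spark.SparkSchemaUtil;
          import org.apache.spark.sql.Dataset;
          import org.apache.spark.sql.Row;

          Dataset<Row> df = spark.table("db.sample");

          // Ambiguous Spark types are mapped to a default Iceberg type and
          // every field receives a new id.
          Schema schema = SparkSchemaUtil.convert(df.schema());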
      • convert

        public static Schema convert​(org.apache.spark.sql.types.StructType sparkType,
                                     boolean useTimestampWithoutZone)
        Convert a Spark struct to a Schema with new field ids.

        This conversion assigns fresh ids.

        Some data types are represented as the same Spark type. These are converted to a default type.

        To convert using a reference schema for field ids and ambiguous types, use convert(Schema, StructType).

        Parameters:
        sparkType - a Spark StructType
        useTimestampWithoutZone - boolean flag indicating whether timestamps should be stored as timestamp without time zone
        Returns:
        the equivalent Schema
        Throws:
        java.lang.IllegalArgumentException - if the type cannot be converted
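
        The same conversion with the flag set, so Spark timestamp columns map to timestamp without time zone (df as in the previous sketch):

          // true: store Spark TimestampType columns as timestamp without time zone
          Schema schema = SparkSchemaUtil.convert(df.schema(), true);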
      • convert

        public static Type convert​(org.apache.spark.sql.types.DataType sparkType)
        Convert a Spark struct to a Type with new field ids.

        This conversion assigns fresh ids.

        Some data types are represented as the same Spark type. These are converted to a default type.

        To convert using a reference schema for field ids and ambiguous types, use convert(Schema, StructType).

        Parameters:
        sparkType - a Spark DataType
        Returns:
        the equivalent Type
        Throws:
        java.lang.IllegalArgumentException - if the type cannot be converted
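
        A sketch of converting a single Spark DataType; nested struct fields, if any, receive new ids:

          import org.apache.iceberg.types.Type;
          import org.apache.spark.sql.types.DataTypes;

          // Spark's TimestampType is ambiguous and converts to the default Iceberg type
          Type icebergType = SparkSchemaUtil.convert(DataTypes.TimestampType);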
      • convert

        public static Schema convert​(Schema baseSchema,
                                     org.apache.spark.sql.types.StructType sparkType)
        Convert a Spark struct to a Schema based on the given schema.

        This conversion does not assign new ids; it uses ids from the base schema.

        Data types, field order, and nullability will match the Spark type. This conversion may return a schema that is not compatible with the base schema.

        Parameters:
        baseSchema - a Schema on which conversion is based
        sparkType - a Spark StructType
        Returns:
        the equivalent Schema
        Throws:
        java.lang.IllegalArgumentException - if the type cannot be converted or there are missing ids
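
        A sketch of converting a Spark struct while reusing ids from an existing Iceberg schema; baseSchema and df stand in for an existing table schema and a DataFrame, both assumptions for illustration:

          import org.apache.iceberg.Schema;
          import org.apache.iceberg.spark.SparkSchemaUtil;

          // Uses the field ids from baseSchema rather than assigning fresh ones;
          // data types, field order, and nullability follow the Spark struct.
          Schema converted = SparkSchemaUtil.convert(baseSchema, df.schema());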
      • prune

        public static Schema prune​(Schema schema,
                                   org.apache.spark.sql.types.StructType requestedType)
        Prune columns from a Schema using a Spark type projection.

        This requires that the Spark type is a projection of the Schema. Nullability and types must match.

        Parameters:
        schema - a Schema
        requestedType - a projection of the Spark representation of the Schema
        Returns:
        a Schema corresponding to the Spark projection
        Throws:
        java.lang.IllegalArgumentException - if the Spark type does not match the Schema
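
        A sketch of pruning with a Spark projection, using the two-column schema from the convert(Schema) sketch above; nullability in the projection must match the schema:

          import org.apache.iceberg.Schema;
          import org.apache.iceberg.spark.SparkSchemaUtil;
          import org.apache.spark.sql.types.DataTypes;
          import org.apache.spark.sql.types.StructType;

          // project only "id"; required fields must be non-nullable in the projection
          StructType requested = new StructType()
              .add("id", DataTypes.LongType, false);

          Schema pruned = SparkSchemaUtil.prune(schema, requested);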
      • prune

        public static Schema prune​(Schema schema,
                                   org.apache.spark.sql.types.StructType requestedType,
                                   java.util.List<Expression> filters)
        Prune columns from a Schema using a Spark type projection.

        This requires that the Spark type is a projection of the Schema. Nullability and types must match.

        The list of filter Expressions is used to ensure that columns referenced by the filters are projected.

        Parameters:
        schema - a Schema
        requestedType - a projection of the Spark representation of the Schema
        filters - a list of filters
        Returns:
        a Schema corresponding to the Spark projection
        Throws:
        java.lang.IllegalArgumentException - if the Spark type does not match the Schema
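
        Continuing the previous sketch, a filter on a column that is not in the projection keeps that column in the pruned schema:

          import java.util.Collections;
          import java.util.List;
          import org.apache.iceberg.expressions.Expression;
          import org.apache.iceberg.expressions.Expressions;

          List<Expression> filters =
              Collections.singletonList(Expressions.equal("data", "a"));

          // "data" is referenced by the filter, so it is retained alongside "id"
          Schema pruned = SparkSchemaUtil.prune(schema, requested, filters);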
      • prune

        public static Schema prune​(Schema schema,
                                   org.apache.spark.sql.types.StructType requestedType,
                                   Expression filter,
                                   boolean caseSensitive)
        Prune columns from a Schema using a Spark type projection.

        This requires that the Spark type is a projection of the Schema. Nullability and types must match.

        The filter Expression is used to ensure that columns referenced by the filter are projected.

        Parameters:
        schema - a Schema
        requestedType - a projection of the Spark representation of the Schema
        filter - a filter expression
        caseSensitive - whether column name matching is case sensitive
        Returns:
        a Schema corresponding to the Spark projection
        Throws:
        java.lang.IllegalArgumentException - if the Spark type does not match the Schema
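
        The single-filter variant of the previous sketch; based on the parameter name, caseSensitive appears to control case-sensitive column name matching when resolving the filter (an assumption here):

          import org.apache.iceberg.expressions.Expressions;

          Schema pruned = SparkSchemaUtil.prune(
              schema, requested, Expressions.equal("data", "a"), false);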
      • estimateSize

        public static long estimateSize​(org.apache.spark.sql.types.StructType tableSchema,
                                        long totalRecords)
        Estimate approximate table size based on Spark schema and total records.
        Parameters:
        tableSchema - Spark schema
        totalRecords - total records in the table
        Returns:
        approximate size based on table schema
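
        A sketch of estimating size from a Spark schema and an expected record count; the estimate is based on the schema alone, not on actual data files, and the record count here is an assumption:

          // approximate size for 1,000,000 records with this schema
          long approxSize = SparkSchemaUtil.estimateSize(df.schema(), 1_000_000L);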
      • validateMetadataColumnReferences

        public static void validateMetadataColumnReferences​(Schema tableSchema,
                                                            Schema readSchema)
      • indexQuotedNameById

        public static java.util.Map<java.lang.Integer,​java.lang.String> indexQuotedNameById​(Schema schema)