Class SparkSchemaUtil


  • public class SparkSchemaUtil
    extends java.lang.Object
    Helper methods for working with Spark/Hive metadata.
    • Method Summary

      Modifier and Type Method Description
      static org.apache.spark.sql.types.StructType convert​(Schema schema)
      Convert a Schema to a Spark type.
      static Schema convert​(Schema baseSchema, org.apache.spark.sql.types.StructType sparkType)
      Convert a Spark struct to a Schema based on the given schema.
      static Schema convert​(Schema baseSchema, org.apache.spark.sql.types.StructType sparkType, boolean caseSensitive)
      Convert a Spark struct to a Schema based on the given schema.
      static org.apache.spark.sql.types.DataType convert​(Type type)
      Convert a Type to a Spark type.
      static Type convert​(org.apache.spark.sql.types.DataType sparkType)
Convert a Spark DataType to a Type with new field ids.
      static Schema convert​(org.apache.spark.sql.types.StructType sparkType)
      Convert a Spark struct to a Schema with new field ids.
      static Schema convert​(org.apache.spark.sql.types.StructType sparkType, boolean useTimestampWithoutZone)
      Convert a Spark struct to a Schema with new field ids.
      static Schema convertWithFreshIds​(Schema baseSchema, org.apache.spark.sql.types.StructType sparkType)
      Convert a Spark struct to a Schema based on the given schema.
      static Schema convertWithFreshIds​(Schema baseSchema, org.apache.spark.sql.types.StructType sparkType, boolean caseSensitive)
      Convert a Spark struct to a Schema based on the given schema.
      static long estimateSize​(org.apache.spark.sql.types.StructType tableSchema, long totalRecords)
      Estimate approximate table size based on Spark schema and total records.
      static java.util.Map<java.lang.Integer,​java.lang.String> indexQuotedNameById​(Schema schema)  
      static Schema prune​(Schema schema, org.apache.spark.sql.types.StructType requestedType)
      Prune columns from a Schema using a Spark type projection.
      static Schema prune​(Schema schema, org.apache.spark.sql.types.StructType requestedType, java.util.List<Expression> filters)
      Prune columns from a Schema using a Spark type projection.
      static Schema prune​(Schema schema, org.apache.spark.sql.types.StructType requestedType, Expression filter, boolean caseSensitive)
      Prune columns from a Schema using a Spark type projection.
      static Schema schemaForTable​(org.apache.spark.sql.SparkSession spark, java.lang.String name)
      Returns a Schema for the given table with fresh field ids.
      static PartitionSpec specForTable​(org.apache.spark.sql.SparkSession spark, java.lang.String name)
      Returns a PartitionSpec for the given table.
      static void validateMetadataColumnReferences​(Schema tableSchema, Schema readSchema)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • schemaForTable

        public static Schema schemaForTable​(org.apache.spark.sql.SparkSession spark,
                                            java.lang.String name)
        Returns a Schema for the given table with fresh field ids.

        This creates a Schema for an existing table by looking up the table's schema with Spark and converting that schema. Spark/Hive partition columns are included in the schema.

        Parameters:
        spark - a Spark session
        name - a table name and (optional) database
        Returns:
        a Schema for the table, if found
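
        For example, a minimal sketch (the table name db.sample is hypothetical; a SparkSession with Hive support and the Iceberg Spark runtime on the classpath are assumed):

          import org.apache.iceberg.Schema;
          import org.apache.iceberg.spark.SparkSchemaUtil;
          import org.apache.spark.sql.SparkSession;

          SparkSession spark = SparkSession.builder()
              .master("local[*]")
              .enableHiveSupport()
              .getOrCreate();

          // Look up db.sample in the Spark/Hive catalog and convert its schema,
          // including partition columns, to an Iceberg Schema with fresh ids.
          Schema schema = SparkSchemaUtil.schemaForTable(spark, "db.sample");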
      • specForTable

        public static PartitionSpec specForTable​(org.apache.spark.sql.SparkSession spark,
                                                 java.lang.String name)
                                          throws org.apache.spark.sql.AnalysisException
        Returns a PartitionSpec for the given table.

        This creates a partition spec for an existing table by looking up the table's schema and creating a spec with identity partitions for each partition column.

        Parameters:
        spark - a Spark session
        name - a table name and (optional) database
        Returns:
        a PartitionSpec for the table
        Throws:
        org.apache.spark.sql.AnalysisException - if thrown by the Spark catalog
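
        A sketch of typical use, again with the hypothetical table name db.sample; note that AnalysisException is a checked exception in Java:

          import org.apache.iceberg.PartitionSpec;
          import org.apache.iceberg.spark.SparkSchemaUtil;
          import org.apache.spark.sql.AnalysisException;
          import org.apache.spark.sql.SparkSession;

          SparkSession spark = SparkSession.builder()
              .master("local[*]")
              .enableHiveSupport()
              .getOrCreate();

          try {
            // An identity-partitioned spec built from db.sample's partition columns.
            PartitionSpec spec = SparkSchemaUtil.specForTable(spark, "db.sample");
            System.out.println(spec);
          } catch (AnalysisException e) {
            // Raised by the Spark catalog, e.g. when the table does not exist.
            throw new RuntimeException(e);
          }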
      • convert

        public static org.apache.spark.sql.types.StructType convert​(Schema schema)
        Convert a Schema to a Spark type.
        Parameters:
        schema - a Schema
        Returns:
        the equivalent Spark type
        Throws:
        java.lang.IllegalArgumentException - if the type cannot be converted to Spark
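
        For example, a sketch with a small two-column schema:

          import org.apache.iceberg.Schema;
          import org.apache.iceberg.spark.SparkSchemaUtil;
          import org.apache.iceberg.types.Types;
          import org.apache.spark.sql.types.StructType;

          Schema schema = new Schema(
              Types.NestedField.required(1, "id", Types.LongType.get()),
              Types.NestedField.optional(2, "data", Types.StringType.get()));

          // Produces the Spark struct: id LONG NOT NULL, data STRING
          StructType sparkType = SparkSchemaUtil.convert(schema);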
      • convert

        public static org.apache.spark.sql.types.DataType convert​(Type type)
        Convert a Type to a Spark type.
        Parameters:
        type - a Type
        Returns:
        the equivalent Spark type
        Throws:
        java.lang.IllegalArgumentException - if the type cannot be converted to Spark
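
        For example, a one-line sketch (Iceberg's timestamp with time zone maps to Spark's TimestampType):

          import org.apache.iceberg.spark.SparkSchemaUtil;
          import org.apache.iceberg.types.Types;
          import org.apache.spark.sql.types.DataType;

          // timestamptz -> org.apache.spark.sql.types.TimestampType
          DataType ts = SparkSchemaUtil.convert(Types.TimestampType.withZone());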
      • convert

        public static Schema convert​(org.apache.spark.sql.types.StructType sparkType)
        Convert a Spark struct to a Schema with new field ids.

        This conversion assigns fresh ids.

        Some Iceberg data types are represented by the same Spark type. When converting back, these ambiguous Spark types are mapped to a default Iceberg type.

        To convert using a reference schema for field ids and ambiguous types, use convert(Schema, StructType).

        Parameters:
        sparkType - a Spark StructType
        Returns:
        the equivalent Schema
        Throws:
        java.lang.IllegalArgumentException - if the type cannot be converted
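
        For example, a sketch that builds the Spark struct directly:

          import org.apache.iceberg.Schema;
          import org.apache.iceberg.spark.SparkSchemaUtil;
          import org.apache.spark.sql.types.DataTypes;
          import org.apache.spark.sql.types.Metadata;
          import org.apache.spark.sql.types.StructField;
          import org.apache.spark.sql.types.StructType;

          StructType sparkType = new StructType(new StructField[] {
              new StructField("id", DataTypes.LongType, false, Metadata.empty()),
              new StructField("data", DataTypes.StringType, true, Metadata.empty())
          });

          // Fresh, unique field ids are assigned to every field.
          Schema schema = SparkSchemaUtil.convert(sparkType);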
      • convert

        public static Schema convert​(org.apache.spark.sql.types.StructType sparkType,
                                     boolean useTimestampWithoutZone)
        Convert a Spark struct to a Schema with new field ids.

        This conversion assigns fresh ids.

        Some Iceberg data types are represented by the same Spark type. When converting back, these ambiguous Spark types are mapped to a default Iceberg type.

        To convert using a reference schema for field ids and ambiguous types, use convert(Schema, StructType).

        Parameters:
        sparkType - a Spark StructType
        useTimestampWithoutZone - when true, timestamps are converted to Iceberg's timestamp without time zone
        Returns:
        the equivalent Schema
        Throws:
        java.lang.IllegalArgumentException - if the type cannot be converted
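
        For example, a sketch (assuming, per the parameter description, that the flag makes Spark's TimestampType convert to Iceberg's timestamp without time zone):

          import org.apache.iceberg.Schema;
          import org.apache.iceberg.spark.SparkSchemaUtil;
          import org.apache.spark.sql.types.DataTypes;
          import org.apache.spark.sql.types.Metadata;
          import org.apache.spark.sql.types.StructField;
          import org.apache.spark.sql.types.StructType;

          StructType sparkType = new StructType(new StructField[] {
              new StructField("ts", DataTypes.TimestampType, true, Metadata.empty())
          });

          // With the flag set, "ts" becomes timestamp (without zone) rather than
          // the default timestamptz.
          Schema schema = SparkSchemaUtil.convert(sparkType, true);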
      • convert

        public static Type convert​(org.apache.spark.sql.types.DataType sparkType)
        Convert a Spark DataType to a Type with new field ids.

        This conversion assigns fresh ids.

        Some Iceberg data types are represented by the same Spark type. When converting back, these ambiguous Spark types are mapped to a default Iceberg type.

        To convert using a reference schema for field ids and ambiguous types, use convert(Schema, StructType).

        Parameters:
        sparkType - a Spark DataType
        Returns:
        the equivalent Type
        Throws:
        java.lang.IllegalArgumentException - if the type cannot be converted
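
        For example, a one-line sketch:

          import org.apache.iceberg.spark.SparkSchemaUtil;
          import org.apache.iceberg.types.Type;
          import org.apache.spark.sql.types.DataTypes;

          // Spark's IntegerType maps to Iceberg's int type; fresh ids apply only
          // to nested fields, so a primitive converts directly.
          Type intType = SparkSchemaUtil.convert(DataTypes.IntegerType);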
      • convert

        public static Schema convert​(Schema baseSchema,
                                     org.apache.spark.sql.types.StructType sparkType)
        Convert a Spark struct to a Schema based on the given schema.

        This conversion does not assign new ids; it uses ids from the base schema.

        Data types, field order, and nullability will match the Spark type. This conversion may return a schema that is not compatible with the base schema.

        Parameters:
        baseSchema - a Schema on which conversion is based
        sparkType - a Spark StructType
        Returns:
        the equivalent Schema
        Throws:
        java.lang.IllegalArgumentException - if the type cannot be converted or there are missing ids
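
        For example, a sketch that round-trips a schema through its Spark representation, preserving the original field ids:

          import org.apache.iceberg.Schema;
          import org.apache.iceberg.spark.SparkSchemaUtil;
          import org.apache.iceberg.types.Types;
          import org.apache.spark.sql.types.StructType;

          Schema base = new Schema(
              Types.NestedField.required(1, "id", Types.LongType.get()),
              Types.NestedField.optional(2, "data", Types.StringType.get()));

          StructType sparkType = SparkSchemaUtil.convert(base);

          // Field names are matched against the base schema, so "id" and "data"
          // keep ids 1 and 2 instead of receiving fresh ids.
          Schema converted = SparkSchemaUtil.convert(base, sparkType);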
      • convert

        public static Schema convert​(Schema baseSchema,
                                     org.apache.spark.sql.types.StructType sparkType,
                                     boolean caseSensitive)
        Convert a Spark struct to a Schema based on the given schema.

        This conversion does not assign new ids; it uses ids from the base schema.

        Data types, field order, and nullability will match the Spark type. This conversion may return a schema that is not compatible with the base schema.

        Parameters:
        baseSchema - a Schema on which conversion is based
        sparkType - a Spark StructType
        caseSensitive - when false, the case of field names is ignored
        Returns:
        the equivalent Schema
        Throws:
        java.lang.IllegalArgumentException - if the type cannot be converted or there are missing ids
      • convertWithFreshIds

        public static Schema convertWithFreshIds​(Schema baseSchema,
                                                 org.apache.spark.sql.types.StructType sparkType)
        Convert a Spark struct to a Schema based on the given schema.

        This conversion will assign new ids for fields that are not found in the base schema.

        Data types, field order, and nullability will match the Spark type. This conversion may return a schema that is not compatible with the base schema.

        Parameters:
        baseSchema - a Schema on which conversion is based
        sparkType - a Spark StructType
        Returns:
        the equivalent Schema
        Throws:
        java.lang.IllegalArgumentException - if the type cannot be converted or there are missing ids
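
        For example, a sketch in which the Spark struct has one column ("extra", an arbitrary name) that is missing from the base schema:

          import org.apache.iceberg.Schema;
          import org.apache.iceberg.spark.SparkSchemaUtil;
          import org.apache.iceberg.types.Types;
          import org.apache.spark.sql.types.DataTypes;
          import org.apache.spark.sql.types.StructType;

          Schema base = new Schema(
              Types.NestedField.required(1, "id", Types.LongType.get()));

          StructType sparkType = SparkSchemaUtil.convert(base)
              .add("extra", DataTypes.StringType);

          // "id" keeps id 1 from the base schema; "extra" is assigned a fresh id
          // rather than triggering a missing-id failure.
          Schema converted = SparkSchemaUtil.convertWithFreshIds(base, sparkType);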
      • convertWithFreshIds

        public static Schema convertWithFreshIds​(Schema baseSchema,
                                                 org.apache.spark.sql.types.StructType sparkType,
                                                 boolean caseSensitive)
        Convert a Spark struct to a Schema based on the given schema.

        This conversion will assign new ids for fields that are not found in the base schema.

        Data types, field order, and nullability will match the Spark type. This conversion may return a schema that is not compatible with the base schema.

        Parameters:
        baseSchema - a Schema on which conversion is based
        sparkType - a Spark StructType
        caseSensitive - when false, the case of field names is ignored
        Returns:
        the equivalent Schema
        Throws:
        java.lang.IllegalArgumentException - if the type cannot be converted or there are missing ids
      • prune

        public static Schema prune​(Schema schema,
                                   org.apache.spark.sql.types.StructType requestedType)
        Prune columns from a Schema using a Spark type projection.

        This requires that the Spark type is a projection of the Schema. Nullability and types must match.

        Parameters:
        schema - a Schema
        requestedType - a projection of the Spark representation of the Schema
        Returns:
        a Schema corresponding to the Spark projection
        Throws:
        java.lang.IllegalArgumentException - if the Spark type does not match the Schema
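
        For example, a sketch that projects a single column:

          import org.apache.iceberg.Schema;
          import org.apache.iceberg.spark.SparkSchemaUtil;
          import org.apache.iceberg.types.Types;
          import org.apache.spark.sql.types.DataTypes;
          import org.apache.spark.sql.types.Metadata;
          import org.apache.spark.sql.types.StructField;
          import org.apache.spark.sql.types.StructType;

          Schema schema = new Schema(
              Types.NestedField.required(1, "id", Types.LongType.get()),
              Types.NestedField.optional(2, "data", Types.StringType.get()));

          // The projection must match the schema's types and nullability:
          // "id" is required, so its Spark field is non-nullable.
          StructType requested = new StructType(new StructField[] {
              new StructField("id", DataTypes.LongType, false, Metadata.empty())
          });

          // Returns a Schema containing only "id", with its original field id.
          Schema pruned = SparkSchemaUtil.prune(schema, requested);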
      • prune

        public static Schema prune​(Schema schema,
                                   org.apache.spark.sql.types.StructType requestedType,
                                   java.util.List<Expression> filters)
        Prune columns from a Schema using a Spark type projection.

        This requires that the Spark type is a projection of the Schema. Nullability and types must match.

        The list of filter Expressions is used to ensure that columns referenced by the filters are projected.

        Parameters:
        schema - a Schema
        requestedType - a projection of the Spark representation of the Schema
        filters - a list of filters
        Returns:
        a Schema corresponding to the Spark projection
        Throws:
        java.lang.IllegalArgumentException - if the Spark type does not match the Schema
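
        For example, a sketch in which the filter references a column that the projection omits:

          import java.util.Collections;
          import java.util.List;
          import org.apache.iceberg.Schema;
          import org.apache.iceberg.expressions.Expression;
          import org.apache.iceberg.expressions.Expressions;
          import org.apache.iceberg.spark.SparkSchemaUtil;
          import org.apache.iceberg.types.Types;
          import org.apache.spark.sql.types.DataTypes;
          import org.apache.spark.sql.types.Metadata;
          import org.apache.spark.sql.types.StructField;
          import org.apache.spark.sql.types.StructType;

          Schema schema = new Schema(
              Types.NestedField.required(1, "id", Types.LongType.get()),
              Types.NestedField.optional(2, "data", Types.StringType.get()));

          StructType requested = new StructType(new StructField[] {
              new StructField("id", DataTypes.LongType, false, Metadata.empty())
          });

          // The projection omits "data", but the filter references it, so the
          // pruned schema keeps "data" as well as "id".
          List<Expression> filters = Collections.singletonList(Expressions.equal("data", "a"));
          Schema pruned = SparkSchemaUtil.prune(schema, requested, filters);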
      • prune

        public static Schema prune​(Schema schema,
                                   org.apache.spark.sql.types.StructType requestedType,
                                   Expression filter,
                                   boolean caseSensitive)
        Prune columns from a Schema using a Spark type projection.

        This requires that the Spark type is a projection of the Schema. Nullability and types must match.

        The filter Expression is used to ensure that columns referenced by the filter are projected.

        Parameters:
        schema - a Schema
        requestedType - a projection of the Spark representation of the Schema
        filter - a filter Expression
        caseSensitive - when false, the case of field names is ignored
        Returns:
        a Schema corresponding to the Spark projection
        Throws:
        java.lang.IllegalArgumentException - if the Spark type does not match the Schema
      • estimateSize

        public static long estimateSize​(org.apache.spark.sql.types.StructType tableSchema,
                                        long totalRecords)
        Estimate approximate table size based on Spark schema and total records.
        Parameters:
        tableSchema - Spark schema
        totalRecords - total records in the table
        Returns:
        approximate size based on table schema
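
        For example, a sketch (the estimate is derived from the schema, as noted above, not measured from data files):

          import org.apache.iceberg.spark.SparkSchemaUtil;
          import org.apache.spark.sql.types.DataTypes;
          import org.apache.spark.sql.types.Metadata;
          import org.apache.spark.sql.types.StructField;
          import org.apache.spark.sql.types.StructType;

          StructType tableSchema = new StructType(new StructField[] {
              new StructField("id", DataTypes.LongType, false, Metadata.empty()),
              new StructField("data", DataTypes.StringType, true, Metadata.empty())
          });

          // Approximate size in bytes for one million rows.
          long bytes = SparkSchemaUtil.estimateSize(tableSchema, 1_000_000L);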
      • validateMetadataColumnReferences

        public static void validateMetadataColumnReferences​(Schema tableSchema,
                                                            Schema readSchema)
      • indexQuotedNameById

        public static java.util.Map<java.lang.Integer,​java.lang.String> indexQuotedNameById​(Schema schema)