Class BaseDeleteOrphanFilesSparkAction

  • All Implemented Interfaces:
    Action<DeleteOrphanFiles,​DeleteOrphanFiles.Result>, DeleteOrphanFiles

    public class BaseDeleteOrphanFilesSparkAction
    extends java.lang.Object
    implements DeleteOrphanFiles
    An action that removes orphan metadata and data files by listing a given location and comparing the actual files in that location with data and metadata files referenced by all valid snapshots. The location must be accessible for listing via the Hadoop FileSystem.

    By default, this action cleans up the table location returned by Table.location() and removes unreachable files that are older than 3 days using Table.io(). The behavior can be modified by passing a custom location to location and a custom timestamp to olderThan(long). For example, someone might point this action to the data folder to clean up only orphan data files. In addition, there is a way to configure an alternative delete method via deleteWith(Consumer).

    Note: It is dangerous to call this action with a short retention interval as it might corrupt the state of the table if another operation is writing at the same time.

    • Constructor Detail

      • BaseDeleteOrphanFilesSparkAction

        public BaseDeleteOrphanFilesSparkAction​(org.apache.spark.sql.SparkSession spark,
                                                Table table)
    • Method Detail

      • executeDeleteWith

        public BaseDeleteOrphanFilesSparkAction executeDeleteWith​(java.util.concurrent.ExecutorService executorService)
        Description copied from interface: DeleteOrphanFiles
        Passes an alternative executor service that will be used for removing orphaned files.

        If this method is not called, orphaned manifests and data files will still be deleted in the current thread.

        Specified by:
        executeDeleteWith in interface DeleteOrphanFiles
        Parameters:
        executorService - the service to use
        Returns:
        this for method chaining
      • location

        public BaseDeleteOrphanFilesSparkAction location​(java.lang.String newLocation)
        Description copied from interface: DeleteOrphanFiles
        Passes a location which should be scanned for orphan files.

        If not set, the root table location will be scanned potentially removing both orphan data and metadata files.

        Specified by:
        location in interface DeleteOrphanFiles
        Parameters:
        newLocation - the location where to look for orphan files
        Returns:
        this for method chaining
      • olderThan

        public BaseDeleteOrphanFilesSparkAction olderThan​(long newOlderThanTimestamp)
        Description copied from interface: DeleteOrphanFiles
        Removes orphan files only if they are older than the given timestamp.

        This is a safety measure to avoid removing files that are being added to the table. For example, there may be a concurrent operation adding new files while this action searches for orphan files. New files may not be referenced by the metadata yet but they are not orphan.

        If not set, defaults to a timestamp 3 days ago.

        Specified by:
        olderThan in interface DeleteOrphanFiles
        Parameters:
        newOlderThanTimestamp - a long timestamp, as returned by System.currentTimeMillis()
        Returns:
        this for method chaining
      • deleteWith

        public BaseDeleteOrphanFilesSparkAction deleteWith​(java.util.function.Consumer<java.lang.String> newDeleteFunc)
        Description copied from interface: DeleteOrphanFiles
        Passes an alternative delete implementation that will be used for orphan files.

        This method allows users to customize the delete func. For example, one may set a custom delete func and collect all orphan files into a set instead of physically removing them.

        If not set, defaults to using the table's io implementation.

        Specified by:
        deleteWith in interface DeleteOrphanFiles
        Parameters:
        newDeleteFunc - a function that will be called to delete files
        Returns:
        this for method chaining
      • spark

        protected org.apache.spark.sql.SparkSession spark()
      • sparkContext

        protected org.apache.spark.api.java.JavaSparkContext sparkContext()
      • option

        public ThisT option​(java.lang.String name,
                            java.lang.String value)
        Description copied from interface: Action
        Configures this action with an extra option.

        Certain actions allow users to control internal details of their execution via options.

        Specified by:
        option in interface Action<ThisT,​R>
        Parameters:
        name - an option name
        value - an option value
        Returns:
        this for method chaining
      • options

        public ThisT options​(java.util.Map<java.lang.String,​java.lang.String> newOptions)
        Description copied from interface: Action
        Configures this action with extra options.

        Certain actions allow users to control internal details of their execution via options.

        Specified by:
        options in interface Action<ThisT,​R>
        Parameters:
        newOptions - a map of extra options
        Returns:
        this for method chaining
      • options

        protected java.util.Map<java.lang.String,​java.lang.String> options()
      • withJobGroupInfo

        protected <T> T withJobGroupInfo​(JobGroupInfo info,
                                         java.util.function.Supplier<T> supplier)
      • newJobGroupInfo

        protected JobGroupInfo newJobGroupInfo​(java.lang.String groupId,
                                               java.lang.String desc)
      • buildValidDataFileDF

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidDataFileDF​(Table table)
      • buildManifestFileDF

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestFileDF​(Table table)
      • buildManifestListDF

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestListDF​(Table table)
      • buildOtherMetadataFileDF

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildOtherMetadataFileDF​(Table table)
      • buildValidMetadataFileDF

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidMetadataFileDF​(Table table)
      • loadMetadataTable

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable​(Table table,
                                                                                           MetadataTableType type)