Class DeleteOrphanFilesSparkAction

    All Implemented Interfaces: Action<DeleteOrphanFiles, DeleteOrphanFiles.Result>, DeleteOrphanFiles

    public class DeleteOrphanFilesSparkAction
    extends java.lang.Object
    implements DeleteOrphanFiles
    An action that removes orphan metadata, data and delete files by listing a given location and comparing the actual files in that location with content and metadata files referenced by all valid snapshots. The location must be accessible for listing via the Hadoop FileSystem.

    By default, this action cleans up the table location returned by Table.location() and removes unreachable files that are older than 3 days using Table.io(). The behavior can be modified by passing a custom location to location(String) and a custom timestamp to olderThan(long). For example, you can point this action at the data folder to clean up only orphan data files.
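
    The selection rule described above can be sketched in plain Java. This is only an illustration of the idea, not the action's actual Spark implementation: the class name OrphanSelectionSketch, the method findOrphans, and the representation of the listing as a map from path to last-modified time are all assumptions made for this example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: models "listed files minus referenced files, older than a cutoff".
final class OrphanSelectionSketch {

  // listedFiles: path -> last-modified time in epoch millis (assumed representation).
  // referencedPaths: paths reachable from valid snapshots and table metadata.
  static Set<String> findOrphans(
      Map<String, Long> listedFiles, Set<String> referencedPaths, long olderThanMillis) {
    Set<String> orphans = new TreeSet<>();
    for (Map.Entry<String, Long> entry : listedFiles.entrySet()) {
      boolean unreferenced = !referencedPaths.contains(entry.getKey());
      boolean oldEnough = entry.getValue() < olderThanMillis;
      // Only files that are both unreferenced and old enough are treated as orphans.
      if (unreferenced && oldEnough) {
        orphans.add(entry.getKey());
      }
    }
    return orphans;
  }
}
```

    In this sketch, a recently written but not yet committed file is left alone because it fails the age check, which mirrors the safety role of olderThan(long).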

    Configure an alternative delete method using deleteWith(Consumer).

    For full control over the set of files being evaluated, use the compareToFileList(Dataset) method. This skips the directory listing: any files in the provided dataset that are not found in table metadata will be deleted, subject to the same Table.location() and olderThan(long) filtering as above.
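
    As a rough illustration of the filtering applied to a caller-supplied file list, the sketch below keeps only candidates under the configured location that are older than the cutoff. The names FileListFilterSketch, Candidate and eligiblePaths are hypothetical, and the use of a simple prefix match for the location check is an assumption; the real action operates on a Spark Dataset.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the pre-filter applied to a user-provided file list.
final class FileListFilterSketch {

  // Assumed shape of one entry in the supplied list.
  static final class Candidate {
    final String path;
    final long lastModifiedMillis;

    Candidate(String path, long lastModifiedMillis) {
      this.path = path;
      this.lastModifiedMillis = lastModifiedMillis;
    }
  }

  // Returns paths that would still be compared against table metadata.
  static List<String> eligiblePaths(
      List<Candidate> candidates, String location, long olderThanMillis) {
    List<String> result = new ArrayList<>();
    for (Candidate candidate : candidates) {
      boolean underLocation = candidate.path.startsWith(location); // assumption: prefix match
      boolean oldEnough = candidate.lastModifiedMillis < olderThanMillis;
      if (underLocation && oldEnough) {
        result.add(candidate.path);
      }
    }
    return result;
  }
}
```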

    Note: It is dangerous to call this action with a short retention interval as it might corrupt the state of the table if another operation is writing at the same time.

        public DeleteOrphanFilesSparkAction executeDeleteWith(java.util.concurrent.ExecutorService executorService)
        Passes an alternative executor service that will be used for removing orphaned files.

        If this method is not called, orphaned manifests and data files will still be deleted in the current thread.
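
        The difference between deleting in the current thread and deleting through an executor service can be sketched as follows. ParallelDeleteSketch and deleteAll are hypothetical names used only for this example; they show the general pattern and are not the action's internal code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical helper showing inline versus executor-based deletion.
final class ParallelDeleteSketch {

  static void deleteAll(List<String> paths, Consumer<String> deleteFunc, ExecutorService service) {
    if (service == null) {
      // No executor supplied: delete sequentially in the calling thread.
      paths.forEach(deleteFunc);
      return;
    }
    List<Future<?>> futures = new ArrayList<>();
    for (String path : paths) {
      futures.add(service.submit(() -> deleteFunc.accept(path)));
    }
    // Wait for every delete task so failures surface to the caller.
    for (Future<?> future : futures) {
      try {
        future.get();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException(e);
      } catch (ExecutionException e) {
        throw new RuntimeException(e.getCause());
      }
    }
  }
}
```

        A caller would typically create the pool (for example with Executors.newFixedThreadPool), pass it in, and shut it down after the action completes.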

        public DeleteOrphanFilesSparkAction location(java.lang.String newLocation)
        Passes a location which should be scanned for orphan files.

        If not set, the root table location will be scanned, potentially removing both orphan data and metadata files.

        public DeleteOrphanFilesSparkAction olderThan(long newOlderThanTimestamp)
        Removes orphan files only if they are older than the given timestamp.

        This is a safety measure to avoid removing files that are being added to the table. For example, there may be a concurrent operation adding new files while this action searches for orphan files. New files may not be referenced by the metadata yet, but they are not orphans.

        If not set, defaults to a timestamp 3 days ago.
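
        A cutoff such as "3 days ago" could be computed as shown below before being passed to olderThan(long). This sketch assumes the timestamp is expressed in milliseconds since the epoch, which the text above does not state; OlderThanSketch and cutoffMillis are hypothetical names.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for deriving a retention cutoff.
final class OlderThanSketch {

  // Returns the instant (epoch millis, by assumption) that lies retentionDays before nowMillis.
  static long cutoffMillis(long nowMillis, long retentionDays) {
    return nowMillis - TimeUnit.DAYS.toMillis(retentionDays);
  }
}
```

        For example, cutoffMillis(System.currentTimeMillis(), 3) yields a value equivalent to the documented default. Choosing a much shorter retention narrows the safety margin for concurrent writers.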

        public DeleteOrphanFilesSparkAction deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc)
        Passes an alternative delete implementation that will be used for orphan files.

        This method allows users to customize the delete function. For example, one may set a custom delete function that collects all orphan files into a set instead of physically removing them.

        If not set, defaults to using the table's io implementation.
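
        The "collect instead of delete" idea mentioned above might look like the following sketch. DryRunDeleteSketch is a hypothetical class; only java.util.function.Consumer<String> comes from the method signature. A concurrent set is used on the assumption that the delete function may be invoked from several threads when an executor service is configured.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical dry-run collector: records orphan paths without removing anything.
final class DryRunDeleteSketch {

  // Thread-safe set, in case the consumer is called concurrently.
  private final Set<String> collected = ConcurrentHashMap.newKeySet();

  // The consumer that would be handed to deleteWith(Consumer).
  Consumer<String> asDeleteFunc() {
    return collected::add;
  }

  // Sorted snapshot of everything that would have been deleted.
  Set<String> collectedPaths() {
    return new TreeSet<>(collected);
  }
}
```

        After the action runs with this consumer, collectedPaths() can be reviewed before deciding whether to run a real delete.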

        public DeleteOrphanFilesSparkAction compareToFileList(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> files)
        protected org.apache.spark.sql.SparkSession spark()
        protected sparkContext()
        public ThisT option(java.lang.String name,
                            java.lang.String value)
        public ThisT options(java.util.Map<java.lang.String,java.lang.String> newOptions)
        protected java.util.Map<java.lang.String,java.lang.String> options()
        protected <T> T withJobGroupInfo(JobGroupInfo info,
                                         java.util.function.Supplier<T> supplier)
        protected JobGroupInfo newJobGroupInfo(java.lang.String groupId,
                                               java.lang.String desc)
        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidContentFileWithTypeDF(Table table)
        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidContentFileDF(Table table)
        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestFileDF(Table table)
        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestListDF(Table table)
        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildOtherMetadataFileDF(Table table)
        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildAllReachableOtherMetadataFileDF(Table table)
        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidMetadataFileDF(Table table)
        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> withFileType(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> ds,
                                                                                      java.lang.String type)
        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table,
                                                                                           MetadataTableType type)