Class DeleteOrphanFilesSparkAction

java.lang.Object
org.apache.iceberg.spark.actions.DeleteOrphanFilesSparkAction
All Implemented Interfaces:
Action<DeleteOrphanFiles,DeleteOrphanFiles.Result>, DeleteOrphanFiles

public class DeleteOrphanFilesSparkAction extends Object implements DeleteOrphanFiles
An action that removes orphan metadata, data and delete files by listing a given location and comparing the actual files in that location with content and metadata files referenced by all valid snapshots. The location must be accessible for listing via the Hadoop FileSystem.

By default, this action cleans up the table location returned by Table.location() and removes unreachable files that are older than 3 days using Table.io(). The behavior can be modified by passing a custom location to location and a custom timestamp to olderThan(long). For example, someone might point this action to the data folder to clean up only orphan data files.

Configure an alternative delete method using deleteWith(Consumer).

For full control of the set of files being evaluated, use the compareToFileList(Dataset) argument. This skips the directory listing - any files in the dataset provided which are not found in table metadata will be deleted, using the same Table.location() and olderThan(long) filtering as above.

Note: It is dangerous to call this action with a short retention interval as it might corrupt the state of the table if another operation is writing at the same time.

  • Field Details Link icon

  • Method Details Link icon

    • self Link icon

      protected DeleteOrphanFilesSparkAction self()
    • executeDeleteWith Link icon

      public DeleteOrphanFilesSparkAction executeDeleteWith(ExecutorService executorService)
      Description copied from interface: DeleteOrphanFiles
      Passes an alternative executor service that will be used for removing orphaned files. This service will only be used if a custom delete function is provided by DeleteOrphanFiles.deleteWith(Consumer) or if the FileIO does not support bulk deletes. Otherwise, parallelism should be controlled by the IO specific deleteFiles method.

      If this method is not called and bulk deletes are not supported, orphaned manifests and data files will still be deleted in the current thread.

      Specified by:
      executeDeleteWith in interface DeleteOrphanFiles
      Parameters:
      executorService - the service to use
      Returns:
      this for method chaining
    • prefixMismatchMode Link icon

      public DeleteOrphanFilesSparkAction prefixMismatchMode(DeleteOrphanFiles.PrefixMismatchMode newPrefixMismatchMode)
      Description copied from interface: DeleteOrphanFiles
      Passes a prefix mismatch mode that determines how this action should handle situations when the metadata references files that match listed/provided files except for authority/scheme.

      Possible values are "ERROR", "IGNORE", "DELETE". The default mismatch mode is "ERROR", which means an exception is thrown whenever there is a mismatch in authority/scheme. It's the recommended mismatch mode and should be changed only in some rare circumstances. If there is a mismatch, use DeleteOrphanFiles.equalSchemes(Map) and DeleteOrphanFiles.equalAuthorities(Map) to resolve conflicts by providing equivalent schemes and authorities. If it is impossible to determine whether the conflicting authorities/schemes are equal, set the prefix mismatch mode to "IGNORE" to skip files with mismatches. If you have manually inspected all conflicting authorities/schemes, provided equivalent schemes/authorities and are absolutely confident the remaining ones are different, set the prefix mismatch mode to "DELETE" to consider files with mismatches as orphan. It will be impossible to recover files after deletion, so the "DELETE" prefix mismatch mode must be used with extreme caution.

      Specified by:
      prefixMismatchMode in interface DeleteOrphanFiles
      Parameters:
      newPrefixMismatchMode - mode for handling prefix mismatches
      Returns:
      this for method chaining
    • equalSchemes Link icon

      public DeleteOrphanFilesSparkAction equalSchemes(Map<String,String> newEqualSchemes)
      Description copied from interface: DeleteOrphanFiles
      Passes schemes that should be considered equal.

      The key may include a comma-separated list of schemes. For instance, Map("s3a,s3,s3n", "s3").

      Specified by:
      equalSchemes in interface DeleteOrphanFiles
      Parameters:
      newEqualSchemes - list of equal schemes
      Returns:
      this for method chaining
    • equalAuthorities Link icon

      public DeleteOrphanFilesSparkAction equalAuthorities(Map<String,String> newEqualAuthorities)
      Description copied from interface: DeleteOrphanFiles
      Passes authorities that should be considered equal.

      The key may include a comma-separate list of authorities. For instance, Map("s1name,s2name", "servicename").

      Specified by:
      equalAuthorities in interface DeleteOrphanFiles
      Parameters:
      newEqualAuthorities - list of equal authorities
      Returns:
      this for method chaining
    • location Link icon

      public DeleteOrphanFilesSparkAction location(String newLocation)
      Description copied from interface: DeleteOrphanFiles
      Passes a location which should be scanned for orphan files.

      If not set, the root table location will be scanned potentially removing both orphan data and metadata files.

      Specified by:
      location in interface DeleteOrphanFiles
      Parameters:
      newLocation - the location where to look for orphan files
      Returns:
      this for method chaining
    • olderThan Link icon

      public DeleteOrphanFilesSparkAction olderThan(long newOlderThanTimestamp)
      Description copied from interface: DeleteOrphanFiles
      Removes orphan files only if they are older than the given timestamp.

      This is a safety measure to avoid removing files that are being added to the table. For example, there may be a concurrent operation adding new files while this action searches for orphan files. New files may not be referenced by the metadata yet but they are not orphan.

      If not set, defaults to a timestamp 3 days ago.

      Specified by:
      olderThan in interface DeleteOrphanFiles
      Parameters:
      newOlderThanTimestamp - a long timestamp, as returned by System.currentTimeMillis()
      Returns:
      this for method chaining
    • deleteWith Link icon

      public DeleteOrphanFilesSparkAction deleteWith(Consumer<String> newDeleteFunc)
      Description copied from interface: DeleteOrphanFiles
      Passes an alternative delete implementation that will be used for orphan files.

      This method allows users to customize the delete function. For example, one may set a custom delete func and collect all orphan files into a set instead of physically removing them.

      If not set, defaults to using the table's io implementation.

      Specified by:
      deleteWith in interface DeleteOrphanFiles
      Parameters:
      newDeleteFunc - a function that will be called to delete files
      Returns:
      this for method chaining
    • compareToFileList Link icon

      public DeleteOrphanFilesSparkAction compareToFileList(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> files)
    • execute Link icon

      public DeleteOrphanFiles.Result execute()
      Description copied from interface: Action
      Executes this action.
      Specified by:
      execute in interface Action<DeleteOrphanFiles,DeleteOrphanFiles.Result>
      Returns:
      the result of this action
    • spark Link icon

      protected org.apache.spark.sql.SparkSession spark()
    • sparkContext Link icon

      protected org.apache.spark.api.java.JavaSparkContext sparkContext()
    • option Link icon

      public DeleteOrphanFilesSparkAction option(String name, String value)
    • options Link icon

      public DeleteOrphanFilesSparkAction options(Map<String,String> newOptions)
    • options Link icon

      protected Map<String,String> options()
    • withJobGroupInfo Link icon

      protected <T> T withJobGroupInfo(JobGroupInfo info, Supplier<T> supplier)
    • newJobGroupInfo Link icon

      protected JobGroupInfo newJobGroupInfo(String groupId, String desc)
    • newStaticTable Link icon

      protected Table newStaticTable(TableMetadata metadata, FileIO io)
    • newStaticTable Link icon

      protected Table newStaticTable(String metadataFileLocation, FileIO io)
    • contentFileDS Link icon

      protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table)
    • contentFileDS Link icon

      protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table, Set<Long> snapshotIds)
    • manifestDS Link icon

      protected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table)
    • manifestDS Link icon

      protected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table, Set<Long> snapshotIds)
    • manifestListDS Link icon

      protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table)
    • manifestListDS Link icon

      protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table, Set<Long> snapshotIds)
    • statisticsFileDS Link icon

      protected org.apache.spark.sql.Dataset<FileInfo> statisticsFileDS(Table table, Set<Long> snapshotIds)
    • otherMetadataFileDS Link icon

      protected org.apache.spark.sql.Dataset<FileInfo> otherMetadataFileDS(Table table)
    • allReachableOtherMetadataFileDS Link icon

      protected org.apache.spark.sql.Dataset<FileInfo> allReachableOtherMetadataFileDS(Table table)
    • loadMetadataTable Link icon

      protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type)
    • deleteFiles Link icon

      protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(ExecutorService executorService, Consumer<String> deleteFunc, Iterator<FileInfo> files)
      Deletes files and keeps track of how many files were removed for each file type.
      Parameters:
      executorService - an executor service to use for parallel deletes
      deleteFunc - a delete func
      files - an iterator of Spark rows of the structure (path: String, type: String)
      Returns:
      stats on which files were deleted
    • deleteFiles Link icon

      protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(SupportsBulkOperations io, Iterator<FileInfo> files)