Class DeleteOrphanFilesSparkAction
- All Implemented Interfaces:
Action<DeleteOrphanFiles,
,DeleteOrphanFiles.Result> DeleteOrphanFiles
FileSystem
.
By default, this action cleans up the table location returned by Table.location()
and
removes unreachable files that are older than 3 days using Table.io()
. The behavior can
be modified by passing a custom location to location
and a custom timestamp to olderThan(long)
. For example, someone might point this action to the data folder to clean up
only orphan data files.
Configure an alternative delete method using deleteWith(Consumer)
.
For full control of the set of files being evaluated, use the compareToFileList(Dataset)
argument. This skips the directory listing - any files in the
dataset provided which are not found in table metadata will be deleted, using the same Table.location()
and olderThan(long)
filtering as above.
Note: It is dangerous to call this action with a short retention interval as it might corrupt the state of the table if another operation is writing at the same time.
-
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.iceberg.actions.DeleteOrphanFiles
DeleteOrphanFiles.PrefixMismatchMode, DeleteOrphanFiles.Result
-
Field Summary
Modifier and TypeFieldDescriptionprotected static final org.apache.iceberg.relocated.com.google.common.base.Joiner
protected static final org.apache.iceberg.relocated.com.google.common.base.Splitter
protected static final String
protected static final String
protected static final String
protected static final String
protected static final String
protected static final String
-
Method Summary
Modifier and TypeMethodDescriptionprotected org.apache.spark.sql.Dataset<FileInfo>
compareToFileList
(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> files) protected org.apache.spark.sql.Dataset<FileInfo>
contentFileDS
(Table table) protected org.apache.spark.sql.Dataset<FileInfo>
contentFileDS
(Table table, Set<Long> snapshotIds) protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary
deleteFiles
(ExecutorService executorService, Consumer<String> deleteFunc, Iterator<FileInfo> files) Deletes files and keeps track of how many files were removed for each file type.protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary
deleteFiles
(SupportsBulkOperations io, Iterator<FileInfo> files) deleteWith
(Consumer<String> newDeleteFunc) Passes an alternative delete implementation that will be used for orphan files.equalAuthorities
(Map<String, String> newEqualAuthorities) Passes authorities that should be considered equal.equalSchemes
(Map<String, String> newEqualSchemes) Passes schemes that should be considered equal.execute()
Executes this action.executeDeleteWith
(ExecutorService executorService) Passes an alternative executor service that will be used for removing orphaned files.protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>
loadMetadataTable
(Table table, MetadataTableType type) Passes a location which should be scanned for orphan files.protected org.apache.spark.sql.Dataset<FileInfo>
manifestDS
(Table table) protected org.apache.spark.sql.Dataset<FileInfo>
manifestDS
(Table table, Set<Long> snapshotIds) protected org.apache.spark.sql.Dataset<FileInfo>
manifestListDS
(Table table) protected org.apache.spark.sql.Dataset<FileInfo>
manifestListDS
(Table table, Set<Long> snapshotIds) protected JobGroupInfo
newJobGroupInfo
(String groupId, String desc) protected Table
newStaticTable
(TableMetadata metadata, FileIO io) olderThan
(long newOlderThanTimestamp) Removes orphan files only if they are older than the given timestamp.options()
protected org.apache.spark.sql.Dataset<FileInfo>
otherMetadataFileDS
(Table table) prefixMismatchMode
(DeleteOrphanFiles.PrefixMismatchMode newPrefixMismatchMode) Passes a prefix mismatch mode that determines how this action should handle situations when the metadata references files that match listed/provided files except for authority/scheme.protected DeleteOrphanFilesSparkAction
self()
protected org.apache.spark.sql.SparkSession
spark()
protected org.apache.spark.api.java.JavaSparkContext
protected org.apache.spark.sql.Dataset<FileInfo>
statisticsFileDS
(Table table, Set<Long> snapshotIds) protected <T> T
withJobGroupInfo
(JobGroupInfo info, Supplier<T> supplier)
-
Field Details
-
MANIFEST
- See Also:
-
MANIFEST_LIST
- See Also:
-
STATISTICS_FILES
- See Also:
-
OTHERS
- See Also:
-
FILE_PATH
- See Also:
-
LAST_MODIFIED
- See Also:
-
COMMA_SPLITTER
protected static final org.apache.iceberg.relocated.com.google.common.base.Splitter COMMA_SPLITTER -
COMMA_JOINER
protected static final org.apache.iceberg.relocated.com.google.common.base.Joiner COMMA_JOINER
-
-
Method Details
-
self
-
executeDeleteWith
Description copied from interface:DeleteOrphanFiles
Passes an alternative executor service that will be used for removing orphaned files. This service will only be used if a custom delete function is provided byDeleteOrphanFiles.deleteWith(Consumer)
or if the FileIO does notsupport bulk deletes
. Otherwise, parallelism should be controlled by the IO specificdeleteFiles
method.If this method is not called and bulk deletes are not supported, orphaned manifests and data files will still be deleted in the current thread.
- Specified by:
executeDeleteWith
in interfaceDeleteOrphanFiles
- Parameters:
executorService
- the service to use- Returns:
- this for method chaining
-
prefixMismatchMode
public DeleteOrphanFilesSparkAction prefixMismatchMode(DeleteOrphanFiles.PrefixMismatchMode newPrefixMismatchMode) Description copied from interface:DeleteOrphanFiles
Passes a prefix mismatch mode that determines how this action should handle situations when the metadata references files that match listed/provided files except for authority/scheme.Possible values are "ERROR", "IGNORE", "DELETE". The default mismatch mode is "ERROR", which means an exception is thrown whenever there is a mismatch in authority/scheme. It's the recommended mismatch mode and should be changed only in some rare circumstances. If there is a mismatch, use
DeleteOrphanFiles.equalSchemes(Map)
andDeleteOrphanFiles.equalAuthorities(Map)
to resolve conflicts by providing equivalent schemes and authorities. If it is impossible to determine whether the conflicting authorities/schemes are equal, set the prefix mismatch mode to "IGNORE" to skip files with mismatches. If you have manually inspected all conflicting authorities/schemes, provided equivalent schemes/authorities and are absolutely confident the remaining ones are different, set the prefix mismatch mode to "DELETE" to consider files with mismatches as orphan. It will be impossible to recover files after deletion, so the "DELETE" prefix mismatch mode must be used with extreme caution.- Specified by:
prefixMismatchMode
in interfaceDeleteOrphanFiles
- Parameters:
newPrefixMismatchMode
- mode for handling prefix mismatches- Returns:
- this for method chaining
-
equalSchemes
Description copied from interface:DeleteOrphanFiles
Passes schemes that should be considered equal.The key may include a comma-separated list of schemes. For instance, Map("s3a,s3,s3n", "s3").
- Specified by:
equalSchemes
in interfaceDeleteOrphanFiles
- Parameters:
newEqualSchemes
- list of equal schemes- Returns:
- this for method chaining
-
equalAuthorities
Description copied from interface:DeleteOrphanFiles
Passes authorities that should be considered equal.The key may include a comma-separate list of authorities. For instance, Map("s1name,s2name", "servicename").
- Specified by:
equalAuthorities
in interfaceDeleteOrphanFiles
- Parameters:
newEqualAuthorities
- list of equal authorities- Returns:
- this for method chaining
-
location
Description copied from interface:DeleteOrphanFiles
Passes a location which should be scanned for orphan files.If not set, the root table location will be scanned potentially removing both orphan data and metadata files.
- Specified by:
location
in interfaceDeleteOrphanFiles
- Parameters:
newLocation
- the location where to look for orphan files- Returns:
- this for method chaining
-
olderThan
Description copied from interface:DeleteOrphanFiles
Removes orphan files only if they are older than the given timestamp.This is a safety measure to avoid removing files that are being added to the table. For example, there may be a concurrent operation adding new files while this action searches for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
If not set, defaults to a timestamp 3 days ago.
- Specified by:
olderThan
in interfaceDeleteOrphanFiles
- Parameters:
newOlderThanTimestamp
- a long timestamp, as returned bySystem.currentTimeMillis()
- Returns:
- this for method chaining
-
deleteWith
Description copied from interface:DeleteOrphanFiles
Passes an alternative delete implementation that will be used for orphan files.This method allows users to customize the delete function. For example, one may set a custom delete func and collect all orphan files into a set instead of physically removing them.
If not set, defaults to using the table's
io
implementation.- Specified by:
deleteWith
in interfaceDeleteOrphanFiles
- Parameters:
newDeleteFunc
- a function that will be called to delete files- Returns:
- this for method chaining
-
compareToFileList
public DeleteOrphanFilesSparkAction compareToFileList(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> files) -
execute
Description copied from interface:Action
Executes this action.- Specified by:
execute
in interfaceAction<DeleteOrphanFiles,
DeleteOrphanFiles.Result> - Returns:
- the result of this action
-
spark
protected org.apache.spark.sql.SparkSession spark() -
sparkContext
protected org.apache.spark.api.java.JavaSparkContext sparkContext() -
option
-
options
-
options
-
withJobGroupInfo
-
newJobGroupInfo
-
newStaticTable
-
contentFileDS
-
contentFileDS
-
manifestDS
-
manifestDS
-
manifestListDS
-
manifestListDS
-
statisticsFileDS
-
otherMetadataFileDS
-
allReachableOtherMetadataFileDS
-
loadMetadataTable
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type) -
deleteFiles
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(ExecutorService executorService, Consumer<String> deleteFunc, Iterator<FileInfo> files) Deletes files and keeps track of how many files were removed for each file type.- Parameters:
executorService
- an executor service to use for parallel deletesdeleteFunc
- a delete funcfiles
- an iterator of Spark rows of the structure (path: String, type: String)- Returns:
- stats on which files were deleted
-
deleteFiles
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(SupportsBulkOperations io, Iterator<FileInfo> files)
-