Class DeleteOrphanFilesSparkAction
- java.lang.Object
-
- org.apache.iceberg.spark.actions.DeleteOrphanFilesSparkAction
-
- All Implemented Interfaces:
Action<DeleteOrphanFiles,DeleteOrphanFiles.Result>, DeleteOrphanFiles
public class DeleteOrphanFilesSparkAction extends java.lang.Object implements DeleteOrphanFiles
An action that removes orphan metadata, data and delete files by listing a given location and comparing the actual files in that location with content and metadata files referenced by all valid snapshots. The location must be accessible for listing via the Hadoop FileSystem.
By default, this action cleans up the table location returned by Table.location() and removes unreachable files that are older than 3 days using Table.io(). The behavior can be modified by passing a custom location to location(String) and a custom timestamp to olderThan(long). For example, someone might point this action to the data folder to clean up only orphan data files.
Configure an alternative delete method using deleteWith(Consumer).
For full control of the set of files being evaluated, use the compareToFileList(Dataset) method. This skips the directory listing: any files in the provided dataset that are not found in table metadata will be deleted, using the same Table.location() and olderThan(long) filtering as above.
Note: It is dangerous to call this action with a short retention interval, as it might corrupt the state of the table if another operation is writing at the same time.
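The selection rule described above can be modeled in plain Java. This is only a sketch under stated assumptions, not the action's implementation: the `ListedFile` type and the `selectOrphans` helper are hypothetical names, paths are compared as exact strings, and the location filter is a simple prefix check. The real action runs on Spark datasets and may normalize paths differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Plain-Java model of the orphan selection rule; not Iceberg's implementation. */
final class OrphanSelectionSketch {

  /** Hypothetical holder for one listed file: its path and last-modified epoch millis. */
  static final class ListedFile {
    final String path;
    final long lastModifiedMillis;

    ListedFile(String path, long lastModifiedMillis) {
      this.path = path;
      this.lastModifiedMillis = lastModifiedMillis;
    }
  }

  /**
   * Returns the paths that are under {@code location}, were last modified before
   * {@code olderThanMillis}, and do not appear in {@code referencedPaths}.
   */
  static List<String> selectOrphans(
      List<ListedFile> listed,
      Set<String> referencedPaths,
      String location,
      long olderThanMillis) {
    List<String> orphans = new ArrayList<>();
    for (ListedFile file : listed) {
      boolean underLocation = file.path.startsWith(location);
      boolean oldEnough = file.lastModifiedMillis < olderThanMillis;
      boolean referenced = referencedPaths.contains(file.path);
      if (underLocation && oldEnough && !referenced) {
        orphans.add(file.path);
      }
    }
    return orphans;
  }
}
```

In this model, a file written by a concurrent job stays safe only because its modification time is newer than the cutoff, which is why a short retention interval is risky.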
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from interface org.apache.iceberg.actions.DeleteOrphanFiles
DeleteOrphanFiles.Result
-
-
Field Summary
protected static java.lang.String CONTENT_FILE
protected static java.lang.String FILE_PATH
protected static java.lang.String FILE_TYPE
protected static java.lang.String LAST_MODIFIED
protected static java.lang.String MANIFEST
protected static java.lang.String MANIFEST_LIST
protected static java.lang.String OTHERS
-
Method Summary
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildAllReachableOtherMetadataFileDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestFileDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestListDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildOtherMetadataFileDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidContentFileDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidContentFileWithTypeDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidMetadataFileDF(Table table)
DeleteOrphanFilesSparkAction compareToFileList(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> files)
DeleteOrphanFilesSparkAction deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc)
    Passes an alternative delete implementation that will be used for orphan files.
DeleteOrphanFiles.Result execute()
    Executes this action.
DeleteOrphanFilesSparkAction executeDeleteWith(java.util.concurrent.ExecutorService executorService)
    Passes an alternative executor service that will be used for removing orphaned files.
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type)
DeleteOrphanFilesSparkAction location(java.lang.String newLocation)
    Passes a location which should be scanned for orphan files.
protected JobGroupInfo newJobGroupInfo(java.lang.String groupId, java.lang.String desc)
protected Table newStaticTable(TableMetadata metadata, FileIO io)
DeleteOrphanFilesSparkAction olderThan(long newOlderThanTimestamp)
    Removes orphan files only if they are older than the given timestamp.
ThisT option(java.lang.String name, java.lang.String value)
protected java.util.Map<java.lang.String,java.lang.String> options()
ThisT options(java.util.Map<java.lang.String,java.lang.String> newOptions)
protected DeleteOrphanFilesSparkAction self()
protected org.apache.spark.sql.SparkSession spark()
protected org.apache.spark.api.java.JavaSparkContext sparkContext()
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> withFileType(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> ds, java.lang.String type)
protected <T> T withJobGroupInfo(JobGroupInfo info, java.util.function.Supplier<T> supplier)
-
-
-
Field Detail
-
CONTENT_FILE
protected static final java.lang.String CONTENT_FILE
- See Also:
- Constant Field Values
-
MANIFEST
protected static final java.lang.String MANIFEST
- See Also:
- Constant Field Values
-
MANIFEST_LIST
protected static final java.lang.String MANIFEST_LIST
- See Also:
- Constant Field Values
-
OTHERS
protected static final java.lang.String OTHERS
- See Also:
- Constant Field Values
-
FILE_PATH
protected static final java.lang.String FILE_PATH
- See Also:
- Constant Field Values
-
FILE_TYPE
protected static final java.lang.String FILE_TYPE
- See Also:
- Constant Field Values
-
LAST_MODIFIED
protected static final java.lang.String LAST_MODIFIED
- See Also:
- Constant Field Values
-
-
Method Detail
-
self
protected DeleteOrphanFilesSparkAction self()
-
executeDeleteWith
public DeleteOrphanFilesSparkAction executeDeleteWith(java.util.concurrent.ExecutorService executorService)
Description copied from interface: DeleteOrphanFiles
Passes an alternative executor service that will be used for removing orphaned files.
If this method is not called, orphaned manifests and data files will still be deleted in the current thread.
- Specified by:
- executeDeleteWith in interface DeleteOrphanFiles
- Parameters:
- executorService - the service to use
- Returns:
- this for method chaining
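As a rough illustration of what could be passed to executeDeleteWith, the sketch below builds a fixed-size pool of named daemon threads using only the JDK. The class name, method name, and thread-name prefix are assumptions made for this example. Whether the action shuts the pool down is not stated here, so the caller is assumed to own its lifecycle.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper that builds an executor for parallel orphan-file deletes. */
final class DeleteExecutorSketch {

  /** Returns a fixed-size pool whose threads are daemons named "orphan-delete-N". */
  static ExecutorService newDeletePool(int threads) {
    AtomicInteger counter = new AtomicInteger();
    ThreadFactory factory = runnable -> {
      Thread thread = new Thread(runnable, "orphan-delete-" + counter.getAndIncrement());
      // Daemon threads do not keep the JVM alive if the pool is never shut down.
      thread.setDaemon(true);
      return thread;
    };
    return Executors.newFixedThreadPool(threads, factory);
  }
}
```

A caller would typically create the pool, pass it to executeDeleteWith before calling execute(), and call shutdown() on it once the action has returned.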
-
location
public DeleteOrphanFilesSparkAction location(java.lang.String newLocation)
Description copied from interface: DeleteOrphanFiles
Passes a location which should be scanned for orphan files.
If not set, the root table location will be scanned, potentially removing both orphan data and metadata files.
- Specified by:
- location in interface DeleteOrphanFiles
- Parameters:
- newLocation - the location where to look for orphan files
- Returns:
- this for method chaining
-
olderThan
public DeleteOrphanFilesSparkAction olderThan(long newOlderThanTimestamp)
Description copied from interface: DeleteOrphanFiles
Removes orphan files only if they are older than the given timestamp.
This is a safety measure to avoid removing files that are being added to the table. For example, there may be a concurrent operation adding new files while this action searches for orphan files. New files may not be referenced by the metadata yet, but they are not orphans.
If not set, defaults to a timestamp 3 days ago.
- Specified by:
- olderThan in interface DeleteOrphanFiles
- Parameters:
- newOlderThanTimestamp - a long timestamp, as returned by System.currentTimeMillis()
- Returns:
- this for method chaining
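Because the parameter is an absolute epoch-millisecond timestamp rather than a duration, callers usually derive it from the current time. A minimal sketch, assuming a hypothetical `cutoffMillis` helper and a seven-day retention chosen purely for illustration:

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for computing the value passed to olderThan(long). */
final class RetentionCutoffSketch {

  /** Returns the epoch-millisecond cutoff that lies {@code days} days before {@code nowMillis}. */
  static long cutoffMillis(long nowMillis, long days) {
    if (days < 0) {
      throw new IllegalArgumentException("days must be non-negative: " + days);
    }
    return nowMillis - TimeUnit.DAYS.toMillis(days);
  }

  public static void main(String[] args) {
    // Seven-day retention measured from the current wall-clock time.
    long cutoff = cutoffMillis(System.currentTimeMillis(), 7);
    System.out.println("olderThan cutoff (epoch ms): " + cutoff);
    // With an action instance in scope (assumed, not constructed here):
    // action.olderThan(cutoff);
  }
}
```

Keeping the retention at or above the 3-day default leaves room for long-running writers whose files are not yet referenced by table metadata.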
-
deleteWith
public DeleteOrphanFilesSparkAction deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc)
Description copied from interface: DeleteOrphanFiles
Passes an alternative delete implementation that will be used for orphan files.
This method allows users to customize the delete function. For example, one may set a custom delete function that collects all orphan files into a set instead of physically removing them.
If not set, defaults to using the table's io implementation.
- Specified by:
- deleteWith in interface DeleteOrphanFiles
- Parameters:
- newDeleteFunc - a function that will be called to delete files
- Returns:
- this for method chaining
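The "collect into a set" idea mentioned above could look like the following dry-run sketch. The class and method names are hypothetical. A concurrent set is used on the assumption that the delete function may be invoked from more than one thread, for example when a custom executor service is also configured.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical dry-run collector: records orphan paths instead of deleting them. */
final class DryRunDeleteSketch {

  // Thread-safe set, since the delete function may be called concurrently.
  private final Set<String> collected = ConcurrentHashMap.newKeySet();

  /** Returns a delete function that only remembers the path it was given. */
  Consumer<String> deleteFunc() {
    return collected::add;
  }

  /** Returns a read-only view of the paths seen so far. */
  Set<String> collectedPaths() {
    return Collections.unmodifiableSet(collected);
  }
}
```

The consumer returned by `deleteFunc()` has the `Consumer<String>` shape that deleteWith expects, so the collected paths can be reviewed before a real delete run.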
-
compareToFileList
public DeleteOrphanFilesSparkAction compareToFileList(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> files)
-
execute
public DeleteOrphanFiles.Result execute()
Description copied from interface: Action
Executes this action.
- Specified by:
- execute in interface Action<DeleteOrphanFiles,DeleteOrphanFiles.Result>
- Returns:
- the result of this action
-
spark
protected org.apache.spark.sql.SparkSession spark()
-
sparkContext
protected org.apache.spark.api.java.JavaSparkContext sparkContext()
-
option
public ThisT option(java.lang.String name, java.lang.String value)
-
options
public ThisT options(java.util.Map<java.lang.String,java.lang.String> newOptions)
-
options
protected java.util.Map<java.lang.String,java.lang.String> options()
-
withJobGroupInfo
protected <T> T withJobGroupInfo(JobGroupInfo info, java.util.function.Supplier<T> supplier)
-
newJobGroupInfo
protected JobGroupInfo newJobGroupInfo(java.lang.String groupId, java.lang.String desc)
-
newStaticTable
protected Table newStaticTable(TableMetadata metadata, FileIO io)
-
buildValidContentFileWithTypeDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidContentFileWithTypeDF(Table table)
-
buildValidContentFileDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidContentFileDF(Table table)
-
buildManifestFileDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestFileDF(Table table)
-
buildManifestListDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestListDF(Table table)
-
buildOtherMetadataFileDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildOtherMetadataFileDF(Table table)
-
buildAllReachableOtherMetadataFileDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildAllReachableOtherMetadataFileDF(Table table)
-
buildValidMetadataFileDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidMetadataFileDF(Table table)
-
withFileType
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> withFileType(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> ds, java.lang.String type)
-
loadMetadataTable
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type)
-
-