Class BaseDeleteOrphanFilesSparkAction
- java.lang.Object
-
- org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction
-
- All Implemented Interfaces:
Action<DeleteOrphanFiles,DeleteOrphanFiles.Result>
,DeleteOrphanFiles
public class BaseDeleteOrphanFilesSparkAction extends java.lang.Object implements DeleteOrphanFiles
An action that removes orphan metadata and data files by listing a given location and comparing the actual files in that location with data and metadata files referenced by all valid snapshots. The location must be accessible for listing via the HadoopFileSystem
.By default, this action cleans up the table location returned by
Table.location()
and removes unreachable files that are older than 3 days usingTable.io()
. The behavior can be modified by passing a custom location tolocation
and a custom timestamp toolderThan(long)
. For example, someone might point this action to the data folder to clean up only orphan data files. In addition, there is a way to configure an alternative delete method viadeleteWith(Consumer)
.Note: It is dangerous to call this action with a short retention interval as it might corrupt the state of the table if another operation is writing at the same time.
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from interface org.apache.iceberg.actions.DeleteOrphanFiles
DeleteOrphanFiles.Result
-
-
Constructor Summary
Constructors Constructor Description BaseDeleteOrphanFilesSparkAction(org.apache.spark.sql.SparkSession spark, Table table)
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>
buildManifestFileDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>
buildManifestListDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>
buildOtherMetadataFileDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>
buildValidDataFileDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>
buildValidMetadataFileDF(Table table)
BaseDeleteOrphanFilesSparkAction
deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc)
Passes an alternative delete implementation that will be used for orphan files.DeleteOrphanFiles.Result
execute()
Executes this action.BaseDeleteOrphanFilesSparkAction
executeDeleteWith(java.util.concurrent.ExecutorService executorService)
Passes an alternative executor service that will be used for removing orphaned files.protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>
loadMetadataTable(Table table, MetadataTableType type)
BaseDeleteOrphanFilesSparkAction
location(java.lang.String newLocation)
Passes a location which should be scanned for orphan files.protected JobGroupInfo
newJobGroupInfo(java.lang.String groupId, java.lang.String desc)
protected Table
newStaticTable(TableMetadata metadata, FileIO io)
BaseDeleteOrphanFilesSparkAction
olderThan(long newOlderThanTimestamp)
Removes orphan files only if they are older than the given timestamp.ThisT
option(java.lang.String name, java.lang.String value)
Configures this action with an extra option.protected java.util.Map<java.lang.String,java.lang.String>
options()
ThisT
options(java.util.Map<java.lang.String,java.lang.String> newOptions)
Configures this action with extra options.protected DeleteOrphanFiles
self()
protected org.apache.spark.sql.SparkSession
spark()
protected org.apache.spark.api.java.JavaSparkContext
sparkContext()
protected <T> T
withJobGroupInfo(JobGroupInfo info, java.util.function.Supplier<T> supplier)
-
-
-
Constructor Detail
-
BaseDeleteOrphanFilesSparkAction
public BaseDeleteOrphanFilesSparkAction(org.apache.spark.sql.SparkSession spark, Table table)
-
-
Method Detail
-
self
protected DeleteOrphanFiles self()
-
executeDeleteWith
public BaseDeleteOrphanFilesSparkAction executeDeleteWith(java.util.concurrent.ExecutorService executorService)
Description copied from interface:DeleteOrphanFiles
Passes an alternative executor service that will be used for removing orphaned files.If this method is not called, orphaned manifests and data files will still be deleted in the current thread.
- Specified by:
executeDeleteWith
in interfaceDeleteOrphanFiles
- Parameters:
executorService
- the service to use- Returns:
- this for method chaining
-
location
public BaseDeleteOrphanFilesSparkAction location(java.lang.String newLocation)
Description copied from interface:DeleteOrphanFiles
Passes a location which should be scanned for orphan files.If not set, the root table location will be scanned potentially removing both orphan data and metadata files.
- Specified by:
location
in interfaceDeleteOrphanFiles
- Parameters:
newLocation
- the location where to look for orphan files- Returns:
- this for method chaining
-
olderThan
public BaseDeleteOrphanFilesSparkAction olderThan(long newOlderThanTimestamp)
Description copied from interface:DeleteOrphanFiles
Removes orphan files only if they are older than the given timestamp.This is a safety measure to avoid removing files that are being added to the table. For example, there may be a concurrent operation adding new files while this action searches for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
If not set, defaults to a timestamp 3 days ago.
- Specified by:
olderThan
in interfaceDeleteOrphanFiles
- Parameters:
newOlderThanTimestamp
- a long timestamp, as returned bySystem.currentTimeMillis()
- Returns:
- this for method chaining
-
deleteWith
public BaseDeleteOrphanFilesSparkAction deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc)
Description copied from interface:DeleteOrphanFiles
Passes an alternative delete implementation that will be used for orphan files.This method allows users to customize the delete func. For example, one may set a custom delete func and collect all orphan files into a set instead of physically removing them.
If not set, defaults to using the table's
io
implementation.- Specified by:
deleteWith
in interfaceDeleteOrphanFiles
- Parameters:
newDeleteFunc
- a function that will be called to delete files- Returns:
- this for method chaining
-
execute
public DeleteOrphanFiles.Result execute()
Description copied from interface:Action
Executes this action.- Specified by:
execute
in interfaceAction<DeleteOrphanFiles,DeleteOrphanFiles.Result>
- Returns:
- the result of this action
-
spark
protected org.apache.spark.sql.SparkSession spark()
-
sparkContext
protected org.apache.spark.api.java.JavaSparkContext sparkContext()
-
option
public ThisT option(java.lang.String name, java.lang.String value)
Description copied from interface:Action
Configures this action with an extra option.Certain actions allow users to control internal details of their execution via options.
-
options
public ThisT options(java.util.Map<java.lang.String,java.lang.String> newOptions)
Description copied from interface:Action
Configures this action with extra options.Certain actions allow users to control internal details of their execution via options.
-
options
protected java.util.Map<java.lang.String,java.lang.String> options()
-
withJobGroupInfo
protected <T> T withJobGroupInfo(JobGroupInfo info, java.util.function.Supplier<T> supplier)
-
newJobGroupInfo
protected JobGroupInfo newJobGroupInfo(java.lang.String groupId, java.lang.String desc)
-
newStaticTable
protected Table newStaticTable(TableMetadata metadata, FileIO io)
-
buildValidDataFileDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidDataFileDF(Table table)
-
buildManifestFileDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestFileDF(Table table)
-
buildManifestListDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestListDF(Table table)
-
buildOtherMetadataFileDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildOtherMetadataFileDF(Table table)
-
buildValidMetadataFileDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidMetadataFileDF(Table table)
-
loadMetadataTable
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type)
-
-