public class BaseDeleteOrphanFilesSparkAction extends java.lang.Object implements DeleteOrphanFiles

An action that removes orphan metadata and data files by listing a given location and comparing the actual files in that location with the files referenced by all valid snapshots. The location must be accessible for listing via the Hadoop FileSystem.
By default, this action cleans up the table location returned by Table.location() and removes unreachable files that are older than 3 days using Table.io(). The behavior can be modified by passing a custom location to location(String) and a custom timestamp to olderThan(long). For example, someone might point this action to the data folder to clean up only orphan data files. In addition, there is a way to configure an alternative delete method via deleteWith(Consumer).

Note: It is dangerous to call this action with a short retention interval, as it might corrupt the state of the table if another operation is writing at the same time.
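To make the retention default concrete, here is a minimal sketch of how a caller might compute an explicit `olderThan` cutoff matching the documented 3-day default. The cutoff arithmetic is plain JDK code; the fluent invocation of the action is shown only as a comment because it requires a live SparkSession and an Iceberg Table, both of which are assumed here.

```java
import java.util.concurrent.TimeUnit;

public class OrphanCleanupSketch {
    public static void main(String[] args) {
        // The documented default keeps files newer than 3 days; an explicit
        // cutoff matching that default is just epoch millis minus 3 days.
        long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3);
        System.out.println(cutoff < System.currentTimeMillis()); // true

        // Hypothetical wiring (needs a real SparkSession `spark` and Table `table`):
        // DeleteOrphanFiles.Result result =
        //     new BaseDeleteOrphanFilesSparkAction(spark, table)
        //         .location(table.location() + "/data") // scan only the data folder
        //         .olderThan(cutoff)
        //         .execute();
    }
}
```

Passing the cutoff explicitly documents the retention interval at the call site, which matters given the warning above about short retention intervals.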
Nested classes/interfaces inherited from interface DeleteOrphanFiles: DeleteOrphanFiles.Result
| Constructor and Description |
|---|
| `BaseDeleteOrphanFilesSparkAction(org.apache.spark.sql.SparkSession spark, Table table)` |
| Modifier and Type | Method and Description |
|---|---|
| `protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>` | `buildManifestFileDF(Table table)` |
| `protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>` | `buildManifestListDF(Table table)` |
| `protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>` | `buildOtherMetadataFileDF(Table table)` |
| `protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>` | `buildValidDataFileDF(Table table)` |
| `protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>` | `buildValidMetadataFileDF(Table table)` |
| `BaseDeleteOrphanFilesSparkAction` | `deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc)` Passes an alternative delete implementation that will be used for orphan files. |
| `DeleteOrphanFiles.Result` | `execute()` Executes this action. |
| `BaseDeleteOrphanFilesSparkAction` | `executeDeleteWith(java.util.concurrent.ExecutorService executorService)` Passes an alternative executor service that will be used for removing orphaned files. |
| `protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>` | `loadMetadataTable(Table table, MetadataTableType type)` |
| `BaseDeleteOrphanFilesSparkAction` | `location(java.lang.String newLocation)` Passes a location which should be scanned for orphan files. |
| `protected JobGroupInfo` | `newJobGroupInfo(java.lang.String groupId, java.lang.String desc)` |
| `protected Table` | `newStaticTable(TableMetadata metadata, FileIO io)` |
| `BaseDeleteOrphanFilesSparkAction` | `olderThan(long newOlderThanTimestamp)` Removes orphan files only if they are older than the given timestamp. |
| `ThisT` | `option(java.lang.String name, java.lang.String value)` Configures this action with an extra option. |
| `protected java.util.Map<java.lang.String,java.lang.String>` | `options()` |
| `ThisT` | `options(java.util.Map<java.lang.String,java.lang.String> newOptions)` Configures this action with extra options. |
| `protected DeleteOrphanFiles` | `self()` |
| `protected org.apache.spark.sql.SparkSession` | `spark()` |
| `protected org.apache.spark.api.java.JavaSparkContext` | `sparkContext()` |
| `protected <T> T` | `withJobGroupInfo(JobGroupInfo info, java.util.function.Supplier<T> supplier)` |
public BaseDeleteOrphanFilesSparkAction(org.apache.spark.sql.SparkSession spark, Table table)
protected DeleteOrphanFiles self()
public BaseDeleteOrphanFilesSparkAction executeDeleteWith(java.util.concurrent.ExecutorService executorService)
Description copied from interface: DeleteOrphanFiles
Passes an alternative executor service that will be used for removing orphaned files. If this method is not called, orphaned manifests and data files will still be deleted in the current thread.
Specified by: executeDeleteWith in interface DeleteOrphanFiles
Parameters: executorService - the service to use

public BaseDeleteOrphanFilesSparkAction location(java.lang.String newLocation)
Description copied from interface: DeleteOrphanFiles
Passes a location which should be scanned for orphan files. If not set, the root table location will be scanned, potentially removing both orphan data and metadata files.
Specified by: location in interface DeleteOrphanFiles
Parameters: newLocation - the location where to look for orphan files

public BaseDeleteOrphanFilesSparkAction olderThan(long newOlderThanTimestamp)
Description copied from interface: DeleteOrphanFiles
Removes orphan files only if they are older than the given timestamp. This is a safety measure to avoid removing files that are being added to the table. For example, there may be a concurrent operation adding new files while this action searches for orphan files. New files may not be referenced by the metadata yet, but they are not orphans.
If not set, defaults to a timestamp 3 days ago.
Specified by: olderThan in interface DeleteOrphanFiles
Parameters: newOlderThanTimestamp - a long timestamp, as returned by System.currentTimeMillis()
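Since `executeDeleteWith` accepts any `java.util.concurrent.ExecutorService`, a bounded fixed-size pool is a natural choice. The sketch below is pure JDK code: it builds such a pool and exercises it with a stand-in delete task that merely records paths. The bucket paths are hypothetical, and handing the pool to the action (commented out) assumes a configured `BaseDeleteOrphanFilesSparkAction`.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelDeleteSketch {
    public static void main(String[] args) throws InterruptedException {
        // A bounded pool like this could be handed to executeDeleteWith(...)
        // so deletes run in parallel instead of in the calling thread.
        ExecutorService deletePool = Executors.newFixedThreadPool(4);

        // Stand-in for real file deletion: record which paths were "deleted".
        Set<String> deleted = ConcurrentHashMap.newKeySet();
        for (int i = 0; i < 10; i++) {
            String path = "s3://bucket/table/data/orphan-" + i + ".parquet"; // hypothetical
            deletePool.submit(() -> deleted.add(path));
        }

        deletePool.shutdown();
        deletePool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(deleted.size()); // 10

        // Hypothetical wiring: action.executeDeleteWith(deletePool).execute();
    }
}
```

A thread-safe collection matters here because the tasks run concurrently on the pool's worker threads.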
public BaseDeleteOrphanFilesSparkAction deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc)
Description copied from interface: DeleteOrphanFiles
Passes an alternative delete implementation that will be used for orphan files. This method allows users to customize the delete function. For example, one may set a custom delete function and collect all orphan files into a set instead of physically removing them. If not set, defaults to using the table's io implementation.
Specified by: deleteWith in interface DeleteOrphanFiles
Parameters: newDeleteFunc - a function that will be called to delete files

public DeleteOrphanFiles.Result execute()
Description copied from interface: Action
Executes this action.
Specified by: execute in interface Action<DeleteOrphanFiles,DeleteOrphanFiles.Result>
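The `deleteWith` description above suggests collecting orphan files into a set instead of physically removing them, which amounts to a dry run. Here is a pure-JDK sketch of such a `Consumer<String>`; the file paths are hypothetical, and the commented-out line assumes a configured action.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

public class DryRunDeleteSketch {
    public static void main(String[] args) {
        // Collect paths instead of deleting them, as the deleteWith docs suggest.
        // A concurrent set is used because deletes may run on an executor service.
        Set<String> orphaned = ConcurrentHashMap.newKeySet();
        Consumer<String> collectOnly = orphaned::add;

        // The action would invoke the consumer once per orphan file; simulate that:
        collectOnly.accept("s3://bucket/table/data/orphan-0.parquet"); // hypothetical path
        collectOnly.accept("s3://bucket/table/data/orphan-1.parquet");

        System.out.println(orphaned.size()); // 2

        // Hypothetical wiring: action.deleteWith(collectOnly).execute();
    }
}
```

Inspecting the collected set before a second, real run is a cheap way to respect the retention warning in the class description.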
protected org.apache.spark.sql.SparkSession spark()
protected org.apache.spark.api.java.JavaSparkContext sparkContext()
public ThisT option(java.lang.String name, java.lang.String value)
Description copied from interface: Action
Configures this action with an extra option. Certain actions allow users to control internal details of their execution via options.
public ThisT options(java.util.Map<java.lang.String,java.lang.String> newOptions)
Description copied from interface: Action
Configures this action with extra options. Certain actions allow users to control internal details of their execution via options.
protected java.util.Map<java.lang.String,java.lang.String> options()
protected <T> T withJobGroupInfo(JobGroupInfo info, java.util.function.Supplier<T> supplier)
protected JobGroupInfo newJobGroupInfo(java.lang.String groupId, java.lang.String desc)
protected Table newStaticTable(TableMetadata metadata, FileIO io)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidDataFileDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestFileDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestListDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildOtherMetadataFileDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidMetadataFileDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type)