public class RemoveOrphanFilesAction
extends java.lang.Object

An action that removes orphan metadata and data files by listing a given location and comparing the actual files in that location with the content and metadata files referenced by all valid snapshots. The location must be accessible for listing via the Hadoop FileSystem.

By default, this action cleans up the table location returned by Table.location() and removes unreachable files that are older than 3 days using Table.io(). The behavior can be modified by passing a custom location to location(String) and a custom timestamp to olderThan(long). For example, someone might point this action to the data folder to clean up only orphan data files. In addition, there is a way to configure an alternative delete method via deleteWith(Consumer).
Note: It is dangerous to call this action with a short retention interval as it might corrupt the state of the table if another operation is writing at the same time.
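The 3-day default above corresponds to a cutoff timestamp in milliseconds passed to olderThan(long). A minimal sketch of computing an explicit cutoff; the commented action invocation is illustrative only (the Actions factory name is an assumption, not verbatim API):

```java
import java.util.concurrent.TimeUnit;

// Sketch: the default retention used by this action is 3 days, i.e. the
// cutoff timestamp equals "now minus 3 days" in epoch milliseconds.
class OrphanCutoff {
    static final long THREE_DAYS_MS = TimeUnit.DAYS.toMillis(3);

    // Returns the cutoff timestamp for a given "now" in epoch milliseconds.
    static long defaultCutoff(long nowMillis) {
        return nowMillis - THREE_DAYS_MS;
    }

    public static void main(String[] args) {
        long cutoff = defaultCutoff(System.currentTimeMillis());
        System.out.println(cutoff);
        // Hypothetical usage (entry-point name is an assumption):
        // List<String> removed = Actions.forTable(table)
        //     .removeOrphanFiles()
        //     .olderThan(cutoff)
        //     .execute();
    }
}
```

Passing a cutoff shorter than 3 days increases the corruption risk described in the note above.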
Modifier and Type | Method and Description |
---|---|
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | buildManifestFileDF(org.apache.spark.sql.SparkSession spark, java.lang.String tableName) |
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | buildManifestListDF(org.apache.spark.sql.SparkSession spark, java.lang.String metadataFileLocation) |
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | buildManifestListDF(org.apache.spark.sql.SparkSession spark, Table table) |
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | buildOtherMetadataFileDF(org.apache.spark.sql.SparkSession spark, TableOperations ops) |
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | buildValidDataFileDF(org.apache.spark.sql.SparkSession spark) |
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | buildValidDataFileDF(org.apache.spark.sql.SparkSession spark, java.lang.String tableName) |
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | buildValidMetadataFileDF(org.apache.spark.sql.SparkSession spark, Table table, TableOperations ops) |
RemoveOrphanFilesAction | deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc) — Passes an alternative delete implementation that will be used to delete orphan files. |
java.util.List<java.lang.String> | execute() — Executes this action. |
protected java.util.List<java.lang.String> | getManifestListPaths(java.lang.Iterable<Snapshot> snapshots) — Returns the paths of all manifest lists for a given list of snapshots. |
protected java.util.List<java.lang.String> | getOtherMetadataFilePaths(TableOperations ops) — Returns all metadata file paths which may not be in the current metadata. |
RemoveOrphanFilesAction | location(java.lang.String newLocation) — Removes orphan files in the given location. |
protected java.lang.String | metadataTableName(MetadataTableType type) |
protected java.lang.String | metadataTableName(java.lang.String tableName, MetadataTableType type) |
RemoveOrphanFilesAction | olderThan(long newOlderThanTimestamp) — Removes orphan files that are older than the given timestamp. |
protected Table | table() |
protected Table table()

public RemoveOrphanFilesAction location(java.lang.String newLocation)

Removes orphan files in the given location.

Parameters:
newLocation - a location

public RemoveOrphanFilesAction olderThan(long newOlderThanTimestamp)

Removes orphan files that are older than the given timestamp.

Parameters:
newOlderThanTimestamp - a timestamp in milliseconds

public RemoveOrphanFilesAction deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc)

Passes an alternative delete implementation that will be used to delete orphan files.

Parameters:
newDeleteFunc - a delete function

public java.util.List<java.lang.String> execute()

Executes this action.

Specified by:
execute in interface Action
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidDataFileDF(org.apache.spark.sql.SparkSession spark)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidDataFileDF(org.apache.spark.sql.SparkSession spark, java.lang.String tableName)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestFileDF(org.apache.spark.sql.SparkSession spark, java.lang.String tableName)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestListDF(org.apache.spark.sql.SparkSession spark, Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestListDF(org.apache.spark.sql.SparkSession spark, java.lang.String metadataFileLocation)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildOtherMetadataFileDF(org.apache.spark.sql.SparkSession spark, TableOperations ops)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidMetadataFileDF(org.apache.spark.sql.SparkSession spark, Table table, TableOperations ops)
protected java.lang.String metadataTableName(MetadataTableType type)
protected java.lang.String metadataTableName(java.lang.String tableName, MetadataTableType type)
protected java.util.List<java.lang.String> getManifestListPaths(java.lang.Iterable<Snapshot> snapshots)

Returns the paths of all manifest lists for a given list of snapshots.

Parameters:
snapshots - snapshots

protected java.util.List<java.lang.String> getOtherMetadataFilePaths(TableOperations ops)

Returns all metadata file paths which may not be in the current metadata.

Parameters:
ops - TableOperations for the table we will be getting paths from
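The delete hook set via deleteWith(Consumer) accepts any java.util.function.Consumer<java.lang.String>. A minimal sketch of a delete function that collects orphan file paths instead of deleting them, useful as a dry run; the action invocation itself is omitted and the path shown is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch: a Consumer<String> that records each orphan file path rather than
// deleting it, so the candidate list can be reviewed before a real cleanup.
class CollectingDelete {
    static final List<String> collected = new ArrayList<>();

    // Returns a delete function suitable for deleteWith(Consumer).
    static Consumer<String> dryRunDelete() {
        return collected::add;
    }

    public static void main(String[] args) {
        Consumer<String> deleteFunc = dryRunDelete();
        // In real use: action.deleteWith(deleteFunc).execute();
        deleteFunc.accept("s3://bucket/table/data/orphan-0001.parquet"); // hypothetical path
        System.out.println(collected);
    }
}
```

Because execute() also returns the list of removed file locations, a collecting delete function like this and the returned list should agree one-to-one.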