public class ExpireSnapshotsAction
extends java.lang.Object

An action that performs the same operation as ExpireSnapshots but uses Spark to determine the delta in files between the pre- and post-expiration table metadata. All of the same restrictions of Remove Snapshots also apply to this action.
This implementation uses the metadata tables of the table being expired to list all manifests and data files. This listing is made into a DataFrame, which is anti-joined with the same list read after the expiration. This operation requires a shuffle, so parallelism can be controlled through spark.sql.shuffle.partitions.

The expiration itself is done locally using a direct call to RemoveSnapshots. The snapshot expiration is fully committed before any deletes are issued. Deletes are still performed locally after retrieving the results from the Spark executors.
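The anti-join at the heart of this implementation can be illustrated with plain collections. This is only a sketch of the idea; the real action performs the anti-join as a distributed Spark operation over metadata-table DataFrames, and the file names below are made up:

```java
import java.util.HashSet;
import java.util.Set;

public class AntiJoinSketch {
    // Returns the paths present before expiration but absent afterwards:
    // exactly the files the action must delete.
    static Set<String> filesToDelete(Set<String> before, Set<String> after) {
        Set<String> delta = new HashSet<>(before);
        delta.removeAll(after); // left anti-join: keep rows with no match on the right
        return delta;
    }

    public static void main(String[] args) {
        Set<String> before = new HashSet<>(Set.of("m1.avro", "m2.avro", "d1.parquet", "d2.parquet"));
        Set<String> after = new HashSet<>(Set.of("m2.avro", "d2.parquet"));
        // The two files only referenced by the pre-expiration metadata remain.
        System.out.println(AntiJoinSketch.filesToDelete(before, after));
    }
}
```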
Modifier and Type | Method and Description
---|---
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | buildManifestFileDF(org.apache.spark.sql.SparkSession spark, java.lang.String tableName)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | buildManifestListDF(org.apache.spark.sql.SparkSession spark, java.lang.String metadataFileLocation)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | buildManifestListDF(org.apache.spark.sql.SparkSession spark, Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | buildOtherMetadataFileDF(org.apache.spark.sql.SparkSession spark, TableOperations ops)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | buildValidDataFileDF(org.apache.spark.sql.SparkSession spark)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | buildValidDataFileDF(org.apache.spark.sql.SparkSession spark, java.lang.String tableName)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | buildValidMetadataFileDF(org.apache.spark.sql.SparkSession spark, Table table, TableOperations ops)
ExpireSnapshotsAction | deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc) - The Consumer used on files which have been determined to be expired.
ExpireSnapshotsActionResult | execute() - Executes this action.
ExpireSnapshotsAction | executeDeleteWith(java.util.concurrent.ExecutorService executorService) - An executor service used when deleting files.
org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | expire() - Expires snapshots and commits the changes to the table, returning a Dataset of files to delete.
ExpireSnapshotsAction | expireOlderThan(long timestampMillis) - Expire all snapshots older than a given timestamp.
ExpireSnapshotsAction | expireSnapshotId(long expireSnapshotId) - A specific snapshot to expire.
protected static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | loadMetadataTable(org.apache.spark.sql.SparkSession spark, java.lang.String tableName, java.lang.String tableLocation, MetadataTableType type)
ExpireSnapshotsAction | retainLast(int numSnapshots) - Retain at least numSnapshots snapshots when expiring. Identical to ExpireSnapshots.retainLast(int).
ExpireSnapshotsAction | streamDeleteResults(boolean stream) - By default, all files to delete are brought to the driver at once, which may be an issue with very long file lists.
protected Table | table()
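Among the builder methods above, expireOlderThan and retainLast interact: snapshots older than the timestamp become expiration candidates, but the newest numSnapshots are always kept. A plain-Java sketch of that documented behavior follows; it is an illustration of the semantics, not the action's internal code, and the timestamps are made up:

```java
import java.util.ArrayList;
import java.util.List;

public class RetentionSketch {
    // Given snapshot timestamps sorted oldest-first, return those to expire:
    // older than the cutoff, but always keeping the newest retainLast snapshots.
    static List<Long> toExpire(List<Long> timestamps, long cutoff, int retainLast) {
        List<Long> expired = new ArrayList<>();
        int mustKeepFrom = Math.max(0, timestamps.size() - retainLast);
        for (int i = 0; i < mustKeepFrom; i++) {
            if (timestamps.get(i) < cutoff) {
                expired.add(timestamps.get(i));
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        // Four snapshots; the cutoff would expire the first three,
        // but retainLast = 2 protects the newest two.
        System.out.println(toExpire(List.of(100L, 200L, 300L, 400L), 350L, 2)); // [100, 200]
    }
}
```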
protected Table table()

public ExpireSnapshotsAction streamDeleteResults(boolean stream)
Parameters:
stream - whether to use toLocalIterator to stream results instead of collect

public ExpireSnapshotsAction executeDeleteWith(java.util.concurrent.ExecutorService executorService)
Specified by: ExpireSnapshots.executeDeleteWith(ExecutorService)
Parameters:
executorService - the service to use

public ExpireSnapshotsAction expireSnapshotId(long expireSnapshotId)
Specified by: ExpireSnapshots.expireSnapshotId(long)
Parameters:
expireSnapshotId - id of the snapshot to expire

public ExpireSnapshotsAction expireOlderThan(long timestampMillis)
Specified by: ExpireSnapshots.expireOlderThan(long)
Parameters:
timestampMillis - all snapshots before this time will be expired

public ExpireSnapshotsAction retainLast(int numSnapshots)
Specified by: ExpireSnapshots.retainLast(int)
Parameters:
numSnapshots - number of snapshots to leave

public ExpireSnapshotsAction deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc)
Specified by: ExpireSnapshots.deleteWith(Consumer)
Parameters:
newDeleteFunc - Consumer which takes a path and deletes it

public org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> expire()
Expires snapshots and commits the changes to the table, returning a Dataset of files to delete. This does not delete data files. To delete data files, run execute(). This may be called before or after execute() is called to return the expired file list.

public ExpireSnapshotsActionResult execute()
Specified by: Action.execute()
Executes this action.
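The ExecutorService passed to executeDeleteWith simply runs the per-file delete tasks. A self-contained sketch of that pattern follows; the pool size is illustrative, and recording paths into a queue stands in for the real file deletes:

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelDeleteSketch {
    // Submits one delete task per file to the given pool and waits for completion.
    // Returns the paths "deleted" (here just recorded, not actually removed).
    static Queue<String> deleteAll(List<String> files, ExecutorService pool) throws InterruptedException {
        Queue<String> deleted = new ConcurrentLinkedQueue<>();
        for (String path : files) {
            pool.submit(() -> deleted.add(path)); // stand-in for an actual file delete
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return deleted;
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService deletePool = Executors.newFixedThreadPool(4); // size is illustrative
        Queue<String> deleted =
            deleteAll(List.of("s3://bucket/m1.avro", "s3://bucket/d1.parquet"), deletePool);
        System.out.println(deleted.size() + " files deleted");
    }
}
```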
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidDataFileDF(org.apache.spark.sql.SparkSession spark)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidDataFileDF(org.apache.spark.sql.SparkSession spark, java.lang.String tableName)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestFileDF(org.apache.spark.sql.SparkSession spark, java.lang.String tableName)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestListDF(org.apache.spark.sql.SparkSession spark, Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestListDF(org.apache.spark.sql.SparkSession spark, java.lang.String metadataFileLocation)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildOtherMetadataFileDF(org.apache.spark.sql.SparkSession spark, TableOperations ops)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidMetadataFileDF(org.apache.spark.sql.SparkSession spark, Table table, TableOperations ops)
protected static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(org.apache.spark.sql.SparkSession spark, java.lang.String tableName, java.lang.String tableLocation, MetadataTableType type)
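deleteWith replaces the default delete function; one common use is a dry run that records paths instead of deleting them. A minimal sketch under that assumption (the paths are made up; the action would invoke the consumer once per expired file):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class DryRunDelete {
    // Builds a Consumer that records each path into the given list instead of
    // deleting it, suitable as a stand-in delete function for deleteWith(...).
    static Consumer<String> recordingDeleteFunc(List<String> sink) {
        return sink::add;
    }

    public static void main(String[] args) {
        List<String> wouldDelete = new ArrayList<>();
        Consumer<String> dryRun = recordingDeleteFunc(wouldDelete);
        dryRun.accept("s3://bucket/metadata/old-manifest.avro"); // illustrative paths
        dryRun.accept("s3://bucket/data/old-file.parquet");
        System.out.println(wouldDelete.size() + " files would be deleted");
    }
}
```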