public class BaseExpireSnapshotsSparkAction extends java.lang.Object implements ExpireSnapshots
ExpireSnapshots
but uses Spark
to determine the delta in files between the pre and post-expiration table metadata. All of the same
restrictions of ExpireSnapshots
also apply to this action.
This action first leverages ExpireSnapshots
to expire snapshots and then
uses metadata tables to find files that can be safely deleted. This is done by anti-joining two Datasets
that contain all manifest and data files before and after the expiration. The snapshot expiration
will be fully committed before any deletes are issued.
This operation performs a shuffle so the parallelism can be controlled through 'spark.sql.shuffle.partitions'.
Deletes are still performed locally after retrieving the results from the Spark executors.
ExpireSnapshots.Result
Constructor and Description |
---|
BaseExpireSnapshotsSparkAction(org.apache.spark.sql.SparkSession spark,
Table table) |
Modifier and Type | Method and Description |
---|---|
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> |
buildManifestFileDF(Table table) |
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> |
buildManifestListDF(Table table) |
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> |
buildOtherMetadataFileDF(Table table) |
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> |
buildValidDataFileDF(Table table) |
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> |
buildValidMetadataFileDF(Table table) |
BaseExpireSnapshotsSparkAction |
deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc)
Passes an alternative delete implementation that will be used for manifests and data files.
|
ExpireSnapshots.Result |
execute()
Executes this action.
|
BaseExpireSnapshotsSparkAction |
executeDeleteWith(java.util.concurrent.ExecutorService executorService)
Passes an alternative executor service that will be used for manifests and data files deletion.
|
org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> |
expire()
Expires snapshots and commits the changes to the table, returning a Dataset of files to delete.
|
BaseExpireSnapshotsSparkAction |
expireOlderThan(long timestampMillis)
Expires all snapshots older than the given timestamp.
|
BaseExpireSnapshotsSparkAction |
expireSnapshotId(long snapshotId)
Expires a specific
Snapshot identified by id. |
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> |
loadMetadataTable(Table table,
MetadataTableType type) |
protected JobGroupInfo |
newJobGroupInfo(java.lang.String groupId,
java.lang.String desc) |
protected Table |
newStaticTable(TableMetadata metadata,
FileIO io) |
ThisT |
option(java.lang.String name,
java.lang.String value)
Configures this action with an extra option.
|
protected java.util.Map<java.lang.String,java.lang.String> |
options() |
ThisT |
options(java.util.Map<java.lang.String,java.lang.String> newOptions)
Configures this action with extra options.
|
BaseExpireSnapshotsSparkAction |
retainLast(int numSnapshots)
Retains the most recent ancestors of the current snapshot.
|
protected ExpireSnapshots |
self() |
protected org.apache.spark.sql.SparkSession |
spark() |
protected org.apache.spark.api.java.JavaSparkContext |
sparkContext() |
protected <T> T |
withJobGroupInfo(JobGroupInfo info,
java.util.function.Supplier<T> supplier) |
public BaseExpireSnapshotsSparkAction(org.apache.spark.sql.SparkSession spark, Table table)
protected ExpireSnapshots self()
public BaseExpireSnapshotsSparkAction executeDeleteWith(java.util.concurrent.ExecutorService executorService)
ExpireSnapshots
If this method is not called, unnecessary manifests and data files will still be deleted in the current thread.
Identical to ExpireSnapshots.executeDeleteWith(ExecutorService)
executeDeleteWith
in interface ExpireSnapshots
executorService
- the service to usepublic BaseExpireSnapshotsSparkAction expireSnapshotId(long snapshotId)
ExpireSnapshots
Snapshot
identified by id.
Identical to ExpireSnapshots.expireSnapshotId(long)
expireSnapshotId
in interface ExpireSnapshots
snapshotId
- id of the snapshot to expirepublic BaseExpireSnapshotsSparkAction expireOlderThan(long timestampMillis)
ExpireSnapshots
Identical to ExpireSnapshots.expireOlderThan(long)
expireOlderThan
in interface ExpireSnapshots
timestampMillis
- a long timestamp, as returned by System.currentTimeMillis()
public BaseExpireSnapshotsSparkAction retainLast(int numSnapshots)
ExpireSnapshots
If a snapshot would be expired because it is older than the expiration timestamp, but is one of
the numSnapshots
most recent ancestors of the current state, it will be retained. This
will not cause snapshots explicitly identified by id from expiring.
Identical to ExpireSnapshots.retainLast(int)
retainLast
in interface ExpireSnapshots
numSnapshots
- the number of snapshots to retainpublic BaseExpireSnapshotsSparkAction deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc)
ExpireSnapshots
Manifest files that are no longer used by valid snapshots will be deleted. Data files that were deleted by snapshots that are expired will be deleted.
If this method is not called, unnecessary manifests and data files will still be deleted.
Identical to ExpireSnapshots.deleteWith(Consumer)
deleteWith
in interface ExpireSnapshots
newDeleteFunc
- a function that will be called to delete manifests and data filespublic org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> expire()
This does not delete data files. To delete data files, run execute()
.
This may be called before or after execute()
is called to return the expired file list.
public ExpireSnapshots.Result execute()
Action
execute
in interface Action<ExpireSnapshots,ExpireSnapshots.Result>
protected org.apache.spark.sql.SparkSession spark()
protected org.apache.spark.api.java.JavaSparkContext sparkContext()
public ThisT option(java.lang.String name, java.lang.String value)
Action
Certain actions allow users to control internal details of their execution via options.
public ThisT options(java.util.Map<java.lang.String,java.lang.String> newOptions)
Action
Certain actions allow users to control internal details of their execution via options.
protected java.util.Map<java.lang.String,java.lang.String> options()
protected <T> T withJobGroupInfo(JobGroupInfo info, java.util.function.Supplier<T> supplier)
protected JobGroupInfo newJobGroupInfo(java.lang.String groupId, java.lang.String desc)
protected Table newStaticTable(TableMetadata metadata, FileIO io)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidDataFileDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestFileDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestListDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildOtherMetadataFileDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidMetadataFileDF(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type)