public class ExpireSnapshotsSparkAction extends java.lang.Object implements ExpireSnapshots

An action that performs the same operation as ExpireSnapshots but uses Spark to determine the delta in files between the pre- and post-expiration table metadata. All of the same restrictions of ExpireSnapshots also apply to this action.

This action first leverages ExpireSnapshots to expire snapshots and then uses metadata tables to find files that can be safely deleted. This is done by anti-joining two Datasets that contain all manifest and content files before and after the expiration. The snapshot expiration will be fully committed before any deletes are issued.

This operation performs a shuffle, so the parallelism can be controlled through 'spark.sql.shuffle.partitions'.

Deletes are still performed locally after retrieving the results from the Spark executors.
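The anti-join step above can be pictured with plain sets: the files reachable before expiration, minus the files still reachable after, are the candidates for deletion. A minimal sketch of that semantics (the file names are made up for illustration; the real action does this with Spark Datasets, not in-memory sets):

```java
import java.util.HashSet;
import java.util.Set;

public class AntiJoinSketch {
  // Left anti-join semantics on file paths: keep every entry of `before`
  // that has no match in `after`.
  static Set<String> safeToDelete(Set<String> before, Set<String> after) {
    Set<String> result = new HashSet<>(before);
    result.removeAll(after);
    return result;
  }
}
```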
Nested classes/interfaces inherited from interface org.apache.iceberg.actions.ExpireSnapshots: ExpireSnapshots.Result
Modifier and Type | Field and Description |
---|---|
protected static org.apache.iceberg.relocated.com.google.common.base.Joiner | COMMA_JOINER |
protected static org.apache.iceberg.relocated.com.google.common.base.Splitter | COMMA_SPLITTER |
protected static java.lang.String | FILE_PATH |
protected static java.lang.String | LAST_MODIFIED |
protected static java.lang.String | MANIFEST |
protected static java.lang.String | MANIFEST_LIST |
protected static java.lang.String | OTHERS |
protected static java.lang.String | STATISTICS_FILES |
static java.lang.String | STREAM_RESULTS |
static boolean | STREAM_RESULTS_DEFAULT |
Modifier and Type | Method and Description |
---|---|
protected org.apache.spark.sql.Dataset<FileInfo> | allReachableOtherMetadataFileDS(Table table) |
protected org.apache.spark.sql.Dataset<FileInfo> | contentFileDS(Table table) |
protected org.apache.spark.sql.Dataset<FileInfo> | contentFileDS(Table table, java.util.Set<java.lang.Long> snapshotIds) |
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary | deleteFiles(java.util.concurrent.ExecutorService executorService, java.util.function.Consumer<java.lang.String> deleteFunc, java.util.Iterator<FileInfo> files) Deletes files and keeps track of how many files were removed for each file type. |
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary | deleteFiles(SupportsBulkOperations io, java.util.Iterator<FileInfo> files) |
ExpireSnapshotsSparkAction | deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc) Passes an alternative delete implementation that will be used for manifests, data and delete files. |
ExpireSnapshots.Result | execute() Executes this action. |
ExpireSnapshotsSparkAction | executeDeleteWith(java.util.concurrent.ExecutorService executorService) Passes an alternative executor service that will be used for files removal. |
org.apache.spark.sql.Dataset<FileInfo> | expireFiles() Expires snapshots and commits the changes to the table, returning a Dataset of files to delete. |
ExpireSnapshotsSparkAction | expireOlderThan(long timestampMillis) Expires all snapshots older than the given timestamp. |
ExpireSnapshotsSparkAction | expireSnapshotId(long snapshotId) Expires a specific Snapshot identified by id. |
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> | loadMetadataTable(Table table, MetadataTableType type) |
protected org.apache.spark.sql.Dataset<FileInfo> | manifestDS(Table table) |
protected org.apache.spark.sql.Dataset<FileInfo> | manifestDS(Table table, java.util.Set<java.lang.Long> snapshotIds) |
protected org.apache.spark.sql.Dataset<FileInfo> | manifestListDS(Table table) |
protected org.apache.spark.sql.Dataset<FileInfo> | manifestListDS(Table table, java.util.Set<java.lang.Long> snapshotIds) |
protected JobGroupInfo | newJobGroupInfo(java.lang.String groupId, java.lang.String desc) |
protected Table | newStaticTable(TableMetadata metadata, FileIO io) |
ThisT | option(java.lang.String name, java.lang.String value) |
protected java.util.Map<java.lang.String,java.lang.String> | options() |
ThisT | options(java.util.Map<java.lang.String,java.lang.String> newOptions) |
protected org.apache.spark.sql.Dataset<FileInfo> | otherMetadataFileDS(Table table) |
ExpireSnapshotsSparkAction | retainLast(int numSnapshots) Retains the most recent ancestors of the current snapshot. |
protected ExpireSnapshotsSparkAction | self() |
protected org.apache.spark.sql.SparkSession | spark() |
protected org.apache.spark.api.java.JavaSparkContext | sparkContext() |
protected org.apache.spark.sql.Dataset<FileInfo> | statisticsFileDS(Table table, java.util.Set<java.lang.Long> snapshotIds) |
protected <T> T | withJobGroupInfo(JobGroupInfo info, java.util.function.Supplier<T> supplier) |
public static final java.lang.String STREAM_RESULTS
public static final boolean STREAM_RESULTS_DEFAULT
protected static final java.lang.String MANIFEST
protected static final java.lang.String MANIFEST_LIST
protected static final java.lang.String STATISTICS_FILES
protected static final java.lang.String OTHERS
protected static final java.lang.String FILE_PATH
protected static final java.lang.String LAST_MODIFIED
protected static final org.apache.iceberg.relocated.com.google.common.base.Splitter COMMA_SPLITTER
protected static final org.apache.iceberg.relocated.com.google.common.base.Joiner COMMA_JOINER
protected ExpireSnapshotsSparkAction self()
public ExpireSnapshotsSparkAction executeDeleteWith(java.util.concurrent.ExecutorService executorService)
Description copied from interface: ExpireSnapshots

Passes an alternative executor service that will be used for files removal. This service will only be used if a custom delete function is provided by ExpireSnapshots.deleteWith(Consumer) or if the FileIO does not support bulk deletes. Otherwise, parallelism should be controlled by the IO-specific deleteFiles method.

If this method is not called and bulk deletes are not supported, unnecessary manifests and content files will still be deleted in the current thread.

Identical to ExpireSnapshots.executeDeleteWith(ExecutorService)

Specified by: executeDeleteWith in interface ExpireSnapshots

Parameters: executorService - the service to use

public ExpireSnapshotsSparkAction expireSnapshotId(long snapshotId)
Description copied from interface: ExpireSnapshots

Expires a specific Snapshot identified by id.

Identical to ExpireSnapshots.expireSnapshotId(long)

Specified by: expireSnapshotId in interface ExpireSnapshots

Parameters: snapshotId - id of the snapshot to expire

public ExpireSnapshotsSparkAction expireOlderThan(long timestampMillis)
Description copied from interface: ExpireSnapshots

Expires all snapshots older than the given timestamp.

Identical to ExpireSnapshots.expireOlderThan(long)

Specified by: expireOlderThan in interface ExpireSnapshots

Parameters: timestampMillis - a long timestamp, as returned by System.currentTimeMillis()

public ExpireSnapshotsSparkAction retainLast(int numSnapshots)
Description copied from interface: ExpireSnapshots

Retains the most recent ancestors of the current snapshot.

If a snapshot would be expired because it is older than the expiration timestamp, but is one of the numSnapshots most recent ancestors of the current state, it will be retained. This will not prevent snapshots explicitly identified by id from expiring.

Identical to ExpireSnapshots.retainLast(int)

Specified by: retainLast in interface ExpireSnapshots

Parameters: numSnapshots - the number of snapshots to retain

public ExpireSnapshotsSparkAction deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc)
Description copied from interface: ExpireSnapshots

Passes an alternative delete implementation that will be used for manifests, data and delete files.

Manifest files that are no longer used by valid snapshots will be deleted. Content files that were marked as logically deleted by snapshots that are expired will be deleted as well.

If this method is not called, unnecessary manifests and content files will still be deleted.

Identical to ExpireSnapshots.deleteWith(Consumer)

Specified by: deleteWith in interface ExpireSnapshots

Parameters: newDeleteFunc - a function that will be called to delete manifests and data files

public org.apache.spark.sql.Dataset<FileInfo> expireFiles()
Expires snapshots and commits the changes to the table, returning a Dataset of files to delete.

This does not delete data files. To delete data files, run execute().

This may be called before or after execute() to return the expired files.
public ExpireSnapshots.Result execute()
Description copied from interface: Action

Executes this action.

Specified by: execute in interface Action<ExpireSnapshots,ExpireSnapshots.Result>
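Putting the builder methods together, a typical invocation might look like the sketch below. This assumes an active SparkSession, a loaded Iceberg Table, and iceberg-spark-runtime on the classpath; the 7-day window and 10-snapshot floor are arbitrary illustrations, not recommendations. The cutoffMillis helper is pure and included so the timestamp arithmetic is explicit:

```java
import java.util.concurrent.TimeUnit;

public class ExpireUsageSketch {
  // Pure helper: the cutoff timestamp `days` before the supplied clock value,
  // suitable for expireOlderThan(long).
  static long cutoffMillis(long nowMillis, long days) {
    return nowMillis - TimeUnit.DAYS.toMillis(days);
  }

  /*
  // Illustrative wiring (requires Iceberg and Spark dependencies):
  ExpireSnapshots.Result result = SparkActions.get()
      .expireSnapshots(table)
      .expireOlderThan(cutoffMillis(System.currentTimeMillis(), 7))
      .retainLast(10)   // always keep the 10 most recent ancestors
      .execute();
  */
}
```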
protected org.apache.spark.sql.SparkSession spark()
protected org.apache.spark.api.java.JavaSparkContext sparkContext()
public ThisT option(java.lang.String name, java.lang.String value)
public ThisT options(java.util.Map<java.lang.String,java.lang.String> newOptions)
protected java.util.Map<java.lang.String,java.lang.String> options()
protected <T> T withJobGroupInfo(JobGroupInfo info, java.util.function.Supplier<T> supplier)
protected JobGroupInfo newJobGroupInfo(java.lang.String groupId, java.lang.String desc)
protected Table newStaticTable(TableMetadata metadata, FileIO io)
protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
protected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
protected org.apache.spark.sql.Dataset<FileInfo> statisticsFileDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
protected org.apache.spark.sql.Dataset<FileInfo> otherMetadataFileDS(Table table)
protected org.apache.spark.sql.Dataset<FileInfo> allReachableOtherMetadataFileDS(Table table)
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type)
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(java.util.concurrent.ExecutorService executorService, java.util.function.Consumer<java.lang.String> deleteFunc, java.util.Iterator<FileInfo> files)
Parameters:

executorService - an executor service to use for parallel deletes

deleteFunc - a delete function

files - an iterator of Spark rows of the structure (path: String, type: String)

protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(SupportsBulkOperations io, java.util.Iterator<FileInfo> files)
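As a sketch of the deleteWith/executeDeleteWith combination described above: the custom delete function is just a Consumer<String> of file paths. Here it only counts invocations (a real implementation might move files to a trash location or call an external service), and the 4-thread pool size is an arbitrary assumption:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

public class CountingDelete {
  // Returns a delete function that records how many paths it was asked to
  // remove, instead of actually deleting anything.
  static Consumer<String> counting(AtomicLong counter) {
    return path -> counter.incrementAndGet();
  }

  /*
  // Illustrative wiring (requires Iceberg and Spark dependencies):
  AtomicLong deleted = new AtomicLong();
  ExecutorService pool = Executors.newFixedThreadPool(4);
  SparkActions.get()
      .expireSnapshots(table)
      .deleteWith(counting(deleted))   // replaces the default FileIO deletes
      .executeDeleteWith(pool)         // parallelizes the custom delete calls
      .execute();
  pool.shutdown();
  */
}
```

Note that per the executeDeleteWith documentation, the executor service only takes effect when a custom delete function is supplied or the FileIO lacks bulk-delete support.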