Class ExpireSnapshotsSparkAction
- java.lang.Object
-
- org.apache.iceberg.spark.actions.ExpireSnapshotsSparkAction
-
- All Implemented Interfaces:
Action<ExpireSnapshots,ExpireSnapshots.Result>,ExpireSnapshots
public class ExpireSnapshotsSparkAction extends java.lang.Object implements ExpireSnapshots
An action that performs the same operation asExpireSnapshotsbut uses Spark to determine the delta in files between the pre and post-expiration table metadata. All of the same restrictions ofExpireSnapshotsalso apply to this action.This action first leverages
ExpireSnapshotsto expire snapshots and then uses metadata tables to find files that can be safely deleted. This is done by anti-joining two Datasets that contain all manifest and content files before and after the expiration. The snapshot expiration will be fully committed before any deletes are issued.This operation performs a shuffle so the parallelism can be controlled through 'spark.sql.shuffle.partitions'.
Deletes are still performed locally after retrieving the results from the Spark executors.
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from interface org.apache.iceberg.actions.ExpireSnapshots
ExpireSnapshots.Result
-
-
Field Summary
Fields Modifier and Type Field Description protected static java.lang.StringCONTENT_FILEprotected static java.lang.StringFILE_PATHprotected static java.lang.StringFILE_TYPEprotected static java.lang.StringLAST_MODIFIEDprotected static java.lang.StringMANIFESTprotected static java.lang.StringMANIFEST_LISTprotected static java.lang.StringOTHERSstatic java.lang.StringSTREAM_RESULTSstatic booleanSTREAM_RESULTS_DEFAULT
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>buildAllReachableOtherMetadataFileDF(Table table)protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>buildManifestFileDF(Table table)protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>buildManifestListDF(Table table)protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>buildOtherMetadataFileDF(Table table)protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>buildValidContentFileDF(Table table)protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>buildValidContentFileWithTypeDF(Table table)protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>buildValidMetadataFileDF(Table table)ExpireSnapshotsSparkActiondeleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc)Passes an alternative delete implementation that will be used for manifests, data and delete files.ExpireSnapshots.Resultexecute()Executes this action.ExpireSnapshotsSparkActionexecuteDeleteWith(java.util.concurrent.ExecutorService executorService)Passes an alternative executor service that will be used for manifests, data and delete files deletion.org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>expire()Expires snapshots and commits the changes to the table, returning a Dataset of files to delete.ExpireSnapshotsSparkActionexpireOlderThan(long timestampMillis)Expires all snapshots older than the given timestamp.ExpireSnapshotsSparkActionexpireSnapshotId(long snapshotId)Expires a specificSnapshotidentified by id.protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>loadMetadataTable(Table table, MetadataTableType type)protected JobGroupInfonewJobGroupInfo(java.lang.String groupId, java.lang.String desc)protected TablenewStaticTable(TableMetadata metadata, FileIO io)ThisToption(java.lang.String name, java.lang.String value)protected java.util.Map<java.lang.String,java.lang.String>options()ThisToptions(java.util.Map<java.lang.String,java.lang.String> newOptions)ExpireSnapshotsSparkActionretainLast(int numSnapshots)Retains the most recent ancestors of the current snapshot.protected ExpireSnapshotsSparkActionself()protected org.apache.spark.sql.SparkSessionspark()protected org.apache.spark.api.java.JavaSparkContextsparkContext()protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>withFileType(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> ds, java.lang.String type)protected <T> TwithJobGroupInfo(JobGroupInfo info, java.util.function.Supplier<T> supplier)
-
-
-
Field Detail
-
STREAM_RESULTS
public static final java.lang.String STREAM_RESULTS
- See Also:
- Constant Field Values
-
STREAM_RESULTS_DEFAULT
public static final boolean STREAM_RESULTS_DEFAULT
- See Also:
- Constant Field Values
-
CONTENT_FILE
protected static final java.lang.String CONTENT_FILE
- See Also:
- Constant Field Values
-
MANIFEST
protected static final java.lang.String MANIFEST
- See Also:
- Constant Field Values
-
MANIFEST_LIST
protected static final java.lang.String MANIFEST_LIST
- See Also:
- Constant Field Values
-
OTHERS
protected static final java.lang.String OTHERS
- See Also:
- Constant Field Values
-
FILE_PATH
protected static final java.lang.String FILE_PATH
- See Also:
- Constant Field Values
-
FILE_TYPE
protected static final java.lang.String FILE_TYPE
- See Also:
- Constant Field Values
-
LAST_MODIFIED
protected static final java.lang.String LAST_MODIFIED
- See Also:
- Constant Field Values
-
-
Method Detail
-
self
protected ExpireSnapshotsSparkAction self()
-
executeDeleteWith
public ExpireSnapshotsSparkAction executeDeleteWith(java.util.concurrent.ExecutorService executorService)
Description copied from interface:ExpireSnapshotsPasses an alternative executor service that will be used for manifests, data and delete files deletion.If this method is not called, unnecessary manifests and content files will still be deleted in the current thread.
Identical to
ExpireSnapshots.executeDeleteWith(ExecutorService)- Specified by:
executeDeleteWithin interfaceExpireSnapshots- Parameters:
executorService- the service to use- Returns:
- this for method chaining
-
expireSnapshotId
public ExpireSnapshotsSparkAction expireSnapshotId(long snapshotId)
Description copied from interface:ExpireSnapshotsExpires a specificSnapshotidentified by id.Identical to
ExpireSnapshots.expireSnapshotId(long)- Specified by:
expireSnapshotIdin interfaceExpireSnapshots- Parameters:
snapshotId- id of the snapshot to expire- Returns:
- this for method chaining
-
expireOlderThan
public ExpireSnapshotsSparkAction expireOlderThan(long timestampMillis)
Description copied from interface:ExpireSnapshotsExpires all snapshots older than the given timestamp.Identical to
ExpireSnapshots.expireOlderThan(long)- Specified by:
expireOlderThanin interfaceExpireSnapshots- Parameters:
timestampMillis- a long timestamp, as returned bySystem.currentTimeMillis()- Returns:
- this for method chaining
-
retainLast
public ExpireSnapshotsSparkAction retainLast(int numSnapshots)
Description copied from interface:ExpireSnapshotsRetains the most recent ancestors of the current snapshot.If a snapshot would be expired because it is older than the expiration timestamp, but is one of the
numSnapshotsmost recent ancestors of the current state, it will be retained. This will not cause snapshots explicitly identified by id from expiring.Identical to
ExpireSnapshots.retainLast(int)- Specified by:
retainLastin interfaceExpireSnapshots- Parameters:
numSnapshots- the number of snapshots to retain- Returns:
- this for method chaining
-
deleteWith
public ExpireSnapshotsSparkAction deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc)
Description copied from interface:ExpireSnapshotsPasses an alternative delete implementation that will be used for manifests, data and delete files.Manifest files that are no longer used by valid snapshots will be deleted. Content files that were marked as logically deleted by snapshots that are expired will be deleted as well.
If this method is not called, unnecessary manifests and content files will still be deleted.
Identical to
ExpireSnapshots.deleteWith(Consumer)- Specified by:
deleteWithin interfaceExpireSnapshots- Parameters:
newDeleteFunc- a function that will be called to delete manifests and data files- Returns:
- this for method chaining
-
expire
public org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> expire()
Expires snapshots and commits the changes to the table, returning a Dataset of files to delete.This does not delete data files. To delete data files, run
execute().This may be called before or after
execute()is called to return the expired file list.- Returns:
- a Dataset of files that are no longer referenced by the table
-
execute
public ExpireSnapshots.Result execute()
Description copied from interface:ActionExecutes this action.- Specified by:
executein interfaceAction<ExpireSnapshots,ExpireSnapshots.Result>- Returns:
- the result of this action
-
spark
protected org.apache.spark.sql.SparkSession spark()
-
sparkContext
protected org.apache.spark.api.java.JavaSparkContext sparkContext()
-
option
public ThisT option(java.lang.String name, java.lang.String value)
-
options
public ThisT options(java.util.Map<java.lang.String,java.lang.String> newOptions)
-
options
protected java.util.Map<java.lang.String,java.lang.String> options()
-
withJobGroupInfo
protected <T> T withJobGroupInfo(JobGroupInfo info, java.util.function.Supplier<T> supplier)
-
newJobGroupInfo
protected JobGroupInfo newJobGroupInfo(java.lang.String groupId, java.lang.String desc)
-
newStaticTable
protected Table newStaticTable(TableMetadata metadata, FileIO io)
-
buildValidContentFileWithTypeDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidContentFileWithTypeDF(Table table)
-
buildValidContentFileDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidContentFileDF(Table table)
-
buildManifestFileDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestFileDF(Table table)
-
buildManifestListDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildManifestListDF(Table table)
-
buildOtherMetadataFileDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildOtherMetadataFileDF(Table table)
-
buildAllReachableOtherMetadataFileDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildAllReachableOtherMetadataFileDF(Table table)
-
buildValidMetadataFileDF
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> buildValidMetadataFileDF(Table table)
-
withFileType
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> withFileType(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> ds, java.lang.String type)
-
loadMetadataTable
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type)
-
-