Class ExpireSnapshotsSparkAction
- All Implemented Interfaces:
Action<ExpireSnapshots,,ExpireSnapshots.Result> ExpireSnapshots
ExpireSnapshots but uses
Spark to determine the delta in files between the pre and post-expiration table metadata. All of
the same restrictions of ExpireSnapshots also apply to this action.
This action first leverages ExpireSnapshots to expire snapshots and
then uses metadata tables to find files that can be safely deleted. This is done by anti-joining
two Datasets that contain all manifest and content files before and after the expiration. The
snapshot expiration will be fully committed before any deletes are issued.
This operation performs a shuffle so the parallelism can be controlled through 'spark.sql.shuffle.partitions'.
Deletes are still performed locally after retrieving the results from the Spark executors.
-
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.iceberg.actions.ExpireSnapshots
ExpireSnapshots.Result -
Field Summary
FieldsModifier and TypeFieldDescriptionprotected static final org.apache.iceberg.relocated.com.google.common.base.Joinerprotected static final org.apache.iceberg.relocated.com.google.common.base.Splitterprotected static final Stringprotected static final Stringprotected static final Stringprotected static final Stringprotected static final Stringprotected static final Stringstatic final Stringstatic final boolean -
Method Summary
Modifier and TypeMethodDescriptionprotected org.apache.spark.sql.Dataset<FileInfo> cleanExpiredMetadata(boolean clean) Expires unused table metadata such as partition specs and schemas.protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table) protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table, Set<Long> snapshotIds) protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummarydeleteFiles(ExecutorService executorService, Consumer<String> deleteFunc, Iterator<FileInfo> files) Deletes files and keeps track of how many files were removed for each file type.protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummarydeleteFiles(SupportsBulkOperations io, Iterator<FileInfo> files) deleteWith(Consumer<String> newDeleteFunc) Passes an alternative delete implementation that will be used for manifests, data and delete files.execute()Executes this action.executeDeleteWith(ExecutorService executorService) Passes an alternative executor service that will be used for files removal.org.apache.spark.sql.Dataset<FileInfo> Expires snapshots and commits the changes to the table, returning a Dataset of files to delete.expireOlderThan(long timestampMillis) Expires all snapshots older than the given timestamp.expireSnapshotId(long snapshotId) Expires a specificSnapshotidentified by id.protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type) protected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table) protected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table, Set<Long> snapshotIds) protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table) protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table, Set<Long> snapshotIds) protected JobGroupInfonewJobGroupInfo(String groupId, String desc) protected TablenewStaticTable(String metadataFileLocation, FileIO io) protected TablenewStaticTable(TableMetadata metadata, FileIO io) options()protected org.apache.spark.sql.Dataset<FileInfo> otherMetadataFileDS(Table table) retainLast(int numSnapshots) Retains the most recent ancestors of the current snapshot.protected ExpireSnapshotsSparkActionself()protected org.apache.spark.sql.SparkSessionspark()protected org.apache.spark.api.java.JavaSparkContextprotected org.apache.spark.sql.Dataset<FileInfo> statisticsFileDS(Table table, Set<Long> snapshotIds) protected <T> TwithJobGroupInfo(JobGroupInfo info, Supplier<T> supplier)
-
Field Details
-
STREAM_RESULTS
- See Also:
-
STREAM_RESULTS_DEFAULT
public static final boolean STREAM_RESULTS_DEFAULT- See Also:
-
MANIFEST
- See Also:
-
MANIFEST_LIST
- See Also:
-
STATISTICS_FILES
- See Also:
-
OTHERS
- See Also:
-
FILE_PATH
- See Also:
-
LAST_MODIFIED
- See Also:
-
COMMA_SPLITTER
protected static final org.apache.iceberg.relocated.com.google.common.base.Splitter COMMA_SPLITTER -
COMMA_JOINER
protected static final org.apache.iceberg.relocated.com.google.common.base.Joiner COMMA_JOINER
-
-
Method Details
-
self
-
executeDeleteWith
Description copied from interface:ExpireSnapshotsPasses an alternative executor service that will be used for files removal. This service will only be used if a custom delete function is provided byExpireSnapshots.deleteWith(Consumer)or if the FileIO does notsupport bulk deletes. Otherwise, parallelism should be controlled by the IO specificdeleteFilesmethod.If this method is not called and bulk deletes are not supported, unnecessary manifests and content files will still be deleted in the current thread.
Identical to
ExpireSnapshots.executeDeleteWith(ExecutorService)- Specified by:
executeDeleteWithin interfaceExpireSnapshots- Parameters:
executorService- the service to use- Returns:
- this for method chaining
-
expireSnapshotId
Description copied from interface:ExpireSnapshotsExpires a specificSnapshotidentified by id.Identical to
ExpireSnapshots.expireSnapshotId(long)- Specified by:
expireSnapshotIdin interfaceExpireSnapshots- Parameters:
snapshotId- id of the snapshot to expire- Returns:
- this for method chaining
-
expireOlderThan
Description copied from interface:ExpireSnapshotsExpires all snapshots older than the given timestamp.Identical to
ExpireSnapshots.expireOlderThan(long)- Specified by:
expireOlderThanin interfaceExpireSnapshots- Parameters:
timestampMillis- a long timestamp, as returned bySystem.currentTimeMillis()- Returns:
- this for method chaining
-
retainLast
Description copied from interface:ExpireSnapshotsRetains the most recent ancestors of the current snapshot.If a snapshot would be expired because it is older than the expiration timestamp, but is one of the
numSnapshotsmost recent ancestors of the current state, it will be retained. This will not cause snapshots explicitly identified by id from expiring.Identical to
ExpireSnapshots.retainLast(int)- Specified by:
retainLastin interfaceExpireSnapshots- Parameters:
numSnapshots- the number of snapshots to retain- Returns:
- this for method chaining
-
deleteWith
Description copied from interface:ExpireSnapshotsPasses an alternative delete implementation that will be used for manifests, data and delete files.Manifest files that are no longer used by valid snapshots will be deleted. Content files that were marked as logically deleted by snapshots that are expired will be deleted as well.
If this method is not called, unnecessary manifests and content files will still be deleted.
Identical to
ExpireSnapshots.deleteWith(Consumer)- Specified by:
deleteWithin interfaceExpireSnapshots- Parameters:
newDeleteFunc- a function that will be called to delete manifests and data files- Returns:
- this for method chaining
-
cleanExpiredMetadata
Description copied from interface:ExpireSnapshotsExpires unused table metadata such as partition specs and schemas.Metadata such as partition specs or schemas that are no longer referenced by snapshots will be removed.
Identical to
ExpireSnapshots.cleanExpiredMetadata(boolean)- Specified by:
cleanExpiredMetadatain interfaceExpireSnapshots- Parameters:
clean- remove unused partition specs, schemas, or other metadata when true- Returns:
- this for method chaining
-
expireFiles
Expires snapshots and commits the changes to the table, returning a Dataset of files to delete.This does not delete data files. To delete data files, run
execute().This may be called before or after
execute()to return the expired files.- Returns:
- a Dataset of files that are no longer referenced by the table
-
execute
Description copied from interface:ActionExecutes this action.- Specified by:
executein interfaceAction<ExpireSnapshots,ExpireSnapshots.Result> - Returns:
- the result of this action
-
spark
protected org.apache.spark.sql.SparkSession spark() -
sparkContext
protected org.apache.spark.api.java.JavaSparkContext sparkContext() -
option
-
options
-
options
-
withJobGroupInfo
-
newJobGroupInfo
-
newStaticTable
-
newStaticTable
-
contentFileDS
-
contentFileDS
-
manifestDS
-
manifestDS
-
manifestListDS
-
manifestListDS
-
statisticsFileDS
-
otherMetadataFileDS
-
allReachableOtherMetadataFileDS
-
loadMetadataTable
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type) -
deleteFiles
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(ExecutorService executorService, Consumer<String> deleteFunc, Iterator<FileInfo> files) Deletes files and keeps track of how many files were removed for each file type.- Parameters:
executorService- an executor service to use for parallel deletesdeleteFunc- a delete funcfiles- an iterator of Spark rows of the structure (path: String, type: String)- Returns:
- stats on which files were deleted
-
deleteFiles
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(SupportsBulkOperations io, Iterator<FileInfo> files)
-