Package org.apache.iceberg.spark.actions
Class RewriteManifestsSparkAction
java.lang.Object
org.apache.iceberg.spark.actions.RewriteManifestsSparkAction
- All Implemented Interfaces:
Action<RewriteManifests,
,RewriteManifests.Result> RewriteManifests
,SnapshotUpdate<RewriteManifests,
RewriteManifests.Result>
An action that rewrites manifests in a distributed manner and co-locates metadata for partitions.
By default, this action rewrites all manifests for the current partition spec and writes the
result to the metadata folder. The behavior can be modified by passing a custom predicate to
rewriteIf(Predicate)
and a custom spec ID to specId(int)
. In addition, there is
a way to configure a custom location for staged manifests via stagingLocation(String)
.
The provided staging location will be ignored if snapshot ID inheritance is enabled. In such
cases, the manifests are always written to the metadata folder and committed without staging.
-
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.iceberg.actions.RewriteManifests
RewriteManifests.Result
-
Field Summary
Modifier and TypeFieldDescriptionprotected static final org.apache.iceberg.relocated.com.google.common.base.Joiner
protected static final org.apache.iceberg.relocated.com.google.common.base.Splitter
protected static final String
protected static final String
protected static final String
protected static final String
protected static final String
protected static final String
static final String
static final boolean
-
Method Summary
Modifier and TypeMethodDescriptionprotected org.apache.spark.sql.Dataset
<FileInfo> protected void
commit
(SnapshotUpdate<?> update) protected org.apache.spark.sql.Dataset
<FileInfo> contentFileDS
(Table table) protected org.apache.spark.sql.Dataset
<FileInfo> contentFileDS
(Table table, Set<Long> snapshotIds) protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary
deleteFiles
(ExecutorService executorService, Consumer<String> deleteFunc, Iterator<FileInfo> files) Deletes files and keeps track of how many files were removed for each file type.protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary
deleteFiles
(SupportsBulkOperations io, Iterator<FileInfo> files) execute()
Executes this action.protected org.apache.spark.sql.Dataset
<org.apache.spark.sql.Row> loadMetadataTable
(Table table, MetadataTableType type) protected org.apache.spark.sql.Dataset
<FileInfo> manifestDS
(Table table) protected org.apache.spark.sql.Dataset
<FileInfo> manifestDS
(Table table, Set<Long> snapshotIds) protected org.apache.spark.sql.Dataset
<FileInfo> manifestListDS
(Table table) protected org.apache.spark.sql.Dataset
<FileInfo> manifestListDS
(Table table, Set<Long> snapshotIds) protected JobGroupInfo
newJobGroupInfo
(String groupId, String desc) protected Table
newStaticTable
(TableMetadata metadata, FileIO io) options()
protected org.apache.spark.sql.Dataset
<FileInfo> otherMetadataFileDS
(Table table) rewriteIf
(Predicate<ManifestFile> newPredicate) Rewrites only manifests that match the given predicate.protected RewriteManifestsSparkAction
self()
snapshotProperty
(String property, String value) protected org.apache.spark.sql.SparkSession
spark()
protected org.apache.spark.api.java.JavaSparkContext
specId
(int specId) Rewrites manifests for a given spec id.stagingLocation
(String newStagingLocation) Passes a location where the staged manifests should be written.protected org.apache.spark.sql.Dataset
<FileInfo> statisticsFileDS
(Table table, Set<Long> snapshotIds) protected <T> T
withJobGroupInfo
(JobGroupInfo info, Supplier<T> supplier) Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.iceberg.actions.SnapshotUpdate
snapshotProperty
-
Field Details
-
USE_CACHING
- See Also:
-
USE_CACHING_DEFAULT
public static final boolean USE_CACHING_DEFAULT- See Also:
-
MANIFEST
- See Also:
-
MANIFEST_LIST
- See Also:
-
STATISTICS_FILES
- See Also:
-
OTHERS
- See Also:
-
FILE_PATH
- See Also:
-
LAST_MODIFIED
- See Also:
-
COMMA_SPLITTER
protected static final org.apache.iceberg.relocated.com.google.common.base.Splitter COMMA_SPLITTER -
COMMA_JOINER
protected static final org.apache.iceberg.relocated.com.google.common.base.Joiner COMMA_JOINER
-
-
Method Details
-
self
-
specId
Description copied from interface:RewriteManifests
Rewrites manifests for a given spec id.If not set, defaults to the table's default spec ID.
- Specified by:
specId
in interfaceRewriteManifests
- Parameters:
specId
- a spec id- Returns:
- this for method chaining
-
rewriteIf
Description copied from interface:RewriteManifests
Rewrites only manifests that match the given predicate.If not set, all manifests will be rewritten.
- Specified by:
rewriteIf
in interfaceRewriteManifests
- Parameters:
newPredicate
- a predicate- Returns:
- this for method chaining
-
stagingLocation
Description copied from interface:RewriteManifests
Passes a location where the staged manifests should be written.If not set, defaults to the table's metadata location.
- Specified by:
stagingLocation
in interfaceRewriteManifests
- Parameters:
newStagingLocation
- a staging location- Returns:
- this for method chaining
-
execute
Description copied from interface:Action
Executes this action.- Specified by:
execute
in interfaceAction<RewriteManifests,
RewriteManifests.Result> - Returns:
- the result of this action
-
snapshotProperty
-
commit
-
commitSummary
-
spark
protected org.apache.spark.sql.SparkSession spark() -
sparkContext
protected org.apache.spark.api.java.JavaSparkContext sparkContext() -
option
-
options
-
options
-
withJobGroupInfo
-
newJobGroupInfo
-
newStaticTable
-
contentFileDS
-
contentFileDS
-
manifestDS
-
manifestDS
-
manifestListDS
-
manifestListDS
-
statisticsFileDS
-
otherMetadataFileDS
-
allReachableOtherMetadataFileDS
-
loadMetadataTable
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type) -
deleteFiles
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(ExecutorService executorService, Consumer<String> deleteFunc, Iterator<FileInfo> files) Deletes files and keeps track of how many files were removed for each file type.- Parameters:
executorService
- an executor service to use for parallel deletesdeleteFunc
- a delete funcfiles
- an iterator of Spark rows of the structure (path: String, type: String)- Returns:
- stats on which files were deleted
-
deleteFiles
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(SupportsBulkOperations io, Iterator<FileInfo> files)
-