Package org.apache.iceberg.spark.actions
Class RewriteDataFilesSparkAction
java.lang.Object
org.apache.iceberg.spark.actions.RewriteDataFilesSparkAction
- All Implemented Interfaces:
Action<RewriteDataFiles,
,RewriteDataFiles.Result> RewriteDataFiles
,SnapshotUpdate<RewriteDataFiles,
RewriteDataFiles.Result>
-
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.iceberg.actions.RewriteDataFiles
RewriteDataFiles.FileGroupFailureResult, RewriteDataFiles.FileGroupInfo, RewriteDataFiles.FileGroupRewriteResult, RewriteDataFiles.Result
-
Field Summary
Modifier and TypeFieldDescriptionprotected static final org.apache.iceberg.relocated.com.google.common.base.Joiner
protected static final org.apache.iceberg.relocated.com.google.common.base.Splitter
protected static final String
protected static final String
protected static final String
protected static final String
protected static final String
protected static final String
Fields inherited from interface org.apache.iceberg.actions.RewriteDataFiles
MAX_CONCURRENT_FILE_GROUP_REWRITES, MAX_CONCURRENT_FILE_GROUP_REWRITES_DEFAULT, MAX_FILE_GROUP_SIZE_BYTES, MAX_FILE_GROUP_SIZE_BYTES_DEFAULT, OUTPUT_SPEC_ID, PARTIAL_PROGRESS_ENABLED, PARTIAL_PROGRESS_ENABLED_DEFAULT, PARTIAL_PROGRESS_MAX_COMMITS, PARTIAL_PROGRESS_MAX_COMMITS_DEFAULT, PARTIAL_PROGRESS_MAX_FAILED_COMMITS, REMOVE_DANGLING_DELETES, REMOVE_DANGLING_DELETES_DEFAULT, REWRITE_JOB_ORDER, REWRITE_JOB_ORDER_DEFAULT, TARGET_FILE_SIZE_BYTES, USE_STARTING_SEQUENCE_NUMBER, USE_STARTING_SEQUENCE_NUMBER_DEFAULT
-
Method Summary
Modifier and TypeMethodDescriptionprotected org.apache.spark.sql.Dataset
<FileInfo> binPack()
Choose BINPACK as a strategy for this rewrite operationprotected void
commit
(SnapshotUpdate<?> update) protected org.apache.spark.sql.Dataset
<FileInfo> contentFileDS
(Table table) protected org.apache.spark.sql.Dataset
<FileInfo> contentFileDS
(Table table, Set<Long> snapshotIds) protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary
deleteFiles
(ExecutorService executorService, Consumer<String> deleteFunc, Iterator<FileInfo> files) Deletes files and keeps track of how many files were removed for each file type.protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary
deleteFiles
(SupportsBulkOperations io, Iterator<FileInfo> files) execute()
Executes this action.filter
(Expression expression) A user provided filter for determining which files will be considered by the rewrite strategy.protected org.apache.spark.sql.Dataset
<org.apache.spark.sql.Row> loadMetadataTable
(Table table, MetadataTableType type) protected org.apache.spark.sql.Dataset
<FileInfo> manifestDS
(Table table) protected org.apache.spark.sql.Dataset
<FileInfo> manifestDS
(Table table, Set<Long> snapshotIds) protected org.apache.spark.sql.Dataset
<FileInfo> manifestListDS
(Table table) protected org.apache.spark.sql.Dataset
<FileInfo> manifestListDS
(Table table, Set<Long> snapshotIds) protected JobGroupInfo
newJobGroupInfo
(String groupId, String desc) protected Table
newStaticTable
(TableMetadata metadata, FileIO io) options()
protected org.apache.spark.sql.Dataset
<FileInfo> otherMetadataFileDS
(Table table) protected RewriteDataFilesSparkAction
self()
snapshotProperty
(String property, String value) sort()
Choose SORT as a strategy for this rewrite operation using the table's sortOrderChoose SORT as a strategy for this rewrite operation and manually specify the sortOrder to useprotected org.apache.spark.sql.SparkSession
spark()
protected org.apache.spark.api.java.JavaSparkContext
protected org.apache.spark.sql.Dataset
<FileInfo> statisticsFileDS
(Table table, Set<Long> snapshotIds) protected <T> T
withJobGroupInfo
(JobGroupInfo info, Supplier<T> supplier) Choose Z-ORDER as a strategy for this rewrite operation with a specified list of columns to useMethods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.iceberg.actions.SnapshotUpdate
snapshotProperty
-
Field Details
-
MANIFEST
- See Also:
-
MANIFEST_LIST
- See Also:
-
STATISTICS_FILES
- See Also:
-
OTHERS
- See Also:
-
FILE_PATH
- See Also:
-
LAST_MODIFIED
- See Also:
-
COMMA_SPLITTER
protected static final org.apache.iceberg.relocated.com.google.common.base.Splitter COMMA_SPLITTER -
COMMA_JOINER
protected static final org.apache.iceberg.relocated.com.google.common.base.Joiner COMMA_JOINER
-
-
Method Details
-
self
-
binPack
Description copied from interface:RewriteDataFiles
Choose BINPACK as a strategy for this rewrite operation- Specified by:
binPack
in interfaceRewriteDataFiles
- Returns:
- this for method chaining
-
sort
Description copied from interface:RewriteDataFiles
Choose SORT as a strategy for this rewrite operation and manually specify the sortOrder to use- Specified by:
sort
in interfaceRewriteDataFiles
- Parameters:
sortOrder
- user defined sortOrder- Returns:
- this for method chaining
-
sort
Description copied from interface:RewriteDataFiles
Choose SORT as a strategy for this rewrite operation using the table's sortOrder- Specified by:
sort
in interfaceRewriteDataFiles
- Returns:
- this for method chaining
-
zOrder
Description copied from interface:RewriteDataFiles
Choose Z-ORDER as a strategy for this rewrite operation with a specified list of columns to use- Specified by:
zOrder
in interfaceRewriteDataFiles
- Parameters:
columnNames
- Columns to be used to generate Z-Values- Returns:
- this for method chaining
-
filter
Description copied from interface:RewriteDataFiles
A user provided filter for determining which files will be considered by the rewrite strategy. This will be used in addition to whatever rules the rewrite strategy generates. For example this would be used for providing a restriction to only run rewrite on a specific partition.- Specified by:
filter
in interfaceRewriteDataFiles
- Parameters:
expression
- An iceberg expression used to determine which files will be considered for rewriting- Returns:
- this for chaining
-
execute
Description copied from interface:Action
Executes this action.- Specified by:
execute
in interfaceAction<RewriteDataFiles,
RewriteDataFiles.Result> - Returns:
- the result of this action
-
snapshotProperty
-
commit
-
commitSummary
-
spark
protected org.apache.spark.sql.SparkSession spark() -
sparkContext
protected org.apache.spark.api.java.JavaSparkContext sparkContext() -
option
-
options
-
options
-
withJobGroupInfo
-
newJobGroupInfo
-
newStaticTable
-
contentFileDS
-
contentFileDS
-
manifestDS
-
manifestDS
-
manifestListDS
-
manifestListDS
-
statisticsFileDS
-
otherMetadataFileDS
-
allReachableOtherMetadataFileDS
-
loadMetadataTable
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type) -
deleteFiles
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(ExecutorService executorService, Consumer<String> deleteFunc, Iterator<FileInfo> files) Deletes files and keeps track of how many files were removed for each file type.- Parameters:
executorService
- an executor service to use for parallel deletesdeleteFunc
- a delete funcfiles
- an iterator of Spark rows of the structure (path: String, type: String)- Returns:
- stats on which files were deleted
-
deleteFiles
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(SupportsBulkOperations io, Iterator<FileInfo> files)
-