Package org.apache.iceberg.spark.actions
Class RewriteDataFilesSparkAction
- java.lang.Object
-
- org.apache.iceberg.spark.actions.RewriteDataFilesSparkAction
-
- All Implemented Interfaces:
Action<RewriteDataFiles,RewriteDataFiles.Result>
,RewriteDataFiles
,SnapshotUpdate<RewriteDataFiles,RewriteDataFiles.Result>
public class RewriteDataFilesSparkAction extends java.lang.Object implements RewriteDataFiles
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from interface org.apache.iceberg.actions.RewriteDataFiles
RewriteDataFiles.FileGroupFailureResult, RewriteDataFiles.FileGroupInfo, RewriteDataFiles.FileGroupRewriteResult, RewriteDataFiles.Result
-
-
Field Summary
Fields Modifier and Type Field Description protected static org.apache.iceberg.relocated.com.google.common.base.Joiner
COMMA_JOINER
protected static org.apache.iceberg.relocated.com.google.common.base.Splitter
COMMA_SPLITTER
protected static java.lang.String
FILE_PATH
protected static java.lang.String
LAST_MODIFIED
protected static java.lang.String
MANIFEST
protected static java.lang.String
MANIFEST_LIST
protected static java.lang.String
OTHERS
protected static java.lang.String
STATISTICS_FILES
-
Fields inherited from interface org.apache.iceberg.actions.RewriteDataFiles
MAX_CONCURRENT_FILE_GROUP_REWRITES, MAX_CONCURRENT_FILE_GROUP_REWRITES_DEFAULT, MAX_FILE_GROUP_SIZE_BYTES, MAX_FILE_GROUP_SIZE_BYTES_DEFAULT, OUTPUT_SPEC_ID, PARTIAL_PROGRESS_ENABLED, PARTIAL_PROGRESS_ENABLED_DEFAULT, PARTIAL_PROGRESS_MAX_COMMITS, PARTIAL_PROGRESS_MAX_COMMITS_DEFAULT, PARTIAL_PROGRESS_MAX_FAILED_COMMITS, REMOVE_DANGLING_DELETES, REMOVE_DANGLING_DELETES_DEFAULT, REWRITE_JOB_ORDER, REWRITE_JOB_ORDER_DEFAULT, TARGET_FILE_SIZE_BYTES, USE_STARTING_SEQUENCE_NUMBER, USE_STARTING_SEQUENCE_NUMBER_DEFAULT
-
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description protected org.apache.spark.sql.Dataset<FileInfo>
allReachableOtherMetadataFileDS(Table table)
RewriteDataFilesSparkAction
binPack()
Choose BINPACK as a strategy for this rewrite operationprotected void
commit(SnapshotUpdate<?> update)
protected java.util.Map<java.lang.String,java.lang.String>
commitSummary()
protected org.apache.spark.sql.Dataset<FileInfo>
contentFileDS(Table table)
protected org.apache.spark.sql.Dataset<FileInfo>
contentFileDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary
deleteFiles(java.util.concurrent.ExecutorService executorService, java.util.function.Consumer<java.lang.String> deleteFunc, java.util.Iterator<FileInfo> files)
Deletes files and keeps track of how many files were removed for each file type.protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary
deleteFiles(SupportsBulkOperations io, java.util.Iterator<FileInfo> files)
RewriteDataFiles.Result
execute()
Executes this action.RewriteDataFilesSparkAction
filter(Expression expression)
A user provided filter for determining which files will be considered by the rewrite strategy.protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>
loadMetadataTable(Table table, MetadataTableType type)
protected org.apache.spark.sql.Dataset<FileInfo>
manifestDS(Table table)
protected org.apache.spark.sql.Dataset<FileInfo>
manifestDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
protected org.apache.spark.sql.Dataset<FileInfo>
manifestListDS(Table table)
protected org.apache.spark.sql.Dataset<FileInfo>
manifestListDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
protected JobGroupInfo
newJobGroupInfo(java.lang.String groupId, java.lang.String desc)
protected Table
newStaticTable(TableMetadata metadata, FileIO io)
ThisT
option(java.lang.String name, java.lang.String value)
protected java.util.Map<java.lang.String,java.lang.String>
options()
ThisT
options(java.util.Map<java.lang.String,java.lang.String> newOptions)
protected org.apache.spark.sql.Dataset<FileInfo>
otherMetadataFileDS(Table table)
protected RewriteDataFilesSparkAction
self()
ThisT
snapshotProperty(java.lang.String property, java.lang.String value)
RewriteDataFilesSparkAction
sort()
Choose SORT as a strategy for this rewrite operation using the table's sortOrderRewriteDataFilesSparkAction
sort(SortOrder sortOrder)
Choose SORT as a strategy for this rewrite operation and manually specify the sortOrder to useprotected org.apache.spark.sql.SparkSession
spark()
protected org.apache.spark.api.java.JavaSparkContext
sparkContext()
protected org.apache.spark.sql.Dataset<FileInfo>
statisticsFileDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
protected <T> T
withJobGroupInfo(JobGroupInfo info, java.util.function.Supplier<T> supplier)
RewriteDataFilesSparkAction
zOrder(java.lang.String... columnNames)
Choose Z-ORDER as a strategy for this rewrite operation with a specified list of columns to use-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.iceberg.actions.SnapshotUpdate
snapshotProperty
-
-
-
-
Field Detail
-
MANIFEST
protected static final java.lang.String MANIFEST
- See Also:
- Constant Field Values
-
MANIFEST_LIST
protected static final java.lang.String MANIFEST_LIST
- See Also:
- Constant Field Values
-
STATISTICS_FILES
protected static final java.lang.String STATISTICS_FILES
- See Also:
- Constant Field Values
-
OTHERS
protected static final java.lang.String OTHERS
- See Also:
- Constant Field Values
-
FILE_PATH
protected static final java.lang.String FILE_PATH
- See Also:
- Constant Field Values
-
LAST_MODIFIED
protected static final java.lang.String LAST_MODIFIED
- See Also:
- Constant Field Values
-
COMMA_SPLITTER
protected static final org.apache.iceberg.relocated.com.google.common.base.Splitter COMMA_SPLITTER
-
COMMA_JOINER
protected static final org.apache.iceberg.relocated.com.google.common.base.Joiner COMMA_JOINER
-
-
Method Detail
-
self
protected RewriteDataFilesSparkAction self()
-
binPack
public RewriteDataFilesSparkAction binPack()
Description copied from interface:RewriteDataFiles
Choose BINPACK as a strategy for this rewrite operation- Specified by:
binPack
in interfaceRewriteDataFiles
- Returns:
- this for method chaining
-
sort
public RewriteDataFilesSparkAction sort(SortOrder sortOrder)
Description copied from interface:RewriteDataFiles
Choose SORT as a strategy for this rewrite operation and manually specify the sortOrder to use- Specified by:
sort
in interfaceRewriteDataFiles
- Parameters:
sortOrder
- user defined sortOrder- Returns:
- this for method chaining
-
sort
public RewriteDataFilesSparkAction sort()
Description copied from interface:RewriteDataFiles
Choose SORT as a strategy for this rewrite operation using the table's sortOrder- Specified by:
sort
in interfaceRewriteDataFiles
- Returns:
- this for method chaining
-
zOrder
public RewriteDataFilesSparkAction zOrder(java.lang.String... columnNames)
Description copied from interface:RewriteDataFiles
Choose Z-ORDER as a strategy for this rewrite operation with a specified list of columns to use- Specified by:
zOrder
in interfaceRewriteDataFiles
- Parameters:
columnNames
- Columns to be used to generate Z-Values- Returns:
- this for method chaining
-
filter
public RewriteDataFilesSparkAction filter(Expression expression)
Description copied from interface:RewriteDataFiles
A user provided filter for determining which files will be considered by the rewrite strategy. This will be used in addition to whatever rules the rewrite strategy generates. For example this would be used for providing a restriction to only run rewrite on a specific partition.- Specified by:
filter
in interfaceRewriteDataFiles
- Parameters:
expression
- An iceberg expression used to determine which files will be considered for rewriting- Returns:
- this for chaining
-
execute
public RewriteDataFiles.Result execute()
Description copied from interface:Action
Executes this action.- Specified by:
execute
in interfaceAction<RewriteDataFiles,RewriteDataFiles.Result>
- Returns:
- the result of this action
-
snapshotProperty
public ThisT snapshotProperty(java.lang.String property, java.lang.String value)
-
commit
protected void commit(SnapshotUpdate<?> update)
-
commitSummary
protected java.util.Map<java.lang.String,java.lang.String> commitSummary()
-
spark
protected org.apache.spark.sql.SparkSession spark()
-
sparkContext
protected org.apache.spark.api.java.JavaSparkContext sparkContext()
-
option
public ThisT option(java.lang.String name, java.lang.String value)
-
options
public ThisT options(java.util.Map<java.lang.String,java.lang.String> newOptions)
-
options
protected java.util.Map<java.lang.String,java.lang.String> options()
-
withJobGroupInfo
protected <T> T withJobGroupInfo(JobGroupInfo info, java.util.function.Supplier<T> supplier)
-
newJobGroupInfo
protected JobGroupInfo newJobGroupInfo(java.lang.String groupId, java.lang.String desc)
-
newStaticTable
protected Table newStaticTable(TableMetadata metadata, FileIO io)
-
contentFileDS
protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
-
manifestDS
protected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
-
manifestListDS
protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
-
statisticsFileDS
protected org.apache.spark.sql.Dataset<FileInfo> statisticsFileDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
-
otherMetadataFileDS
protected org.apache.spark.sql.Dataset<FileInfo> otherMetadataFileDS(Table table)
-
allReachableOtherMetadataFileDS
protected org.apache.spark.sql.Dataset<FileInfo> allReachableOtherMetadataFileDS(Table table)
-
loadMetadataTable
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type)
-
deleteFiles
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(java.util.concurrent.ExecutorService executorService, java.util.function.Consumer<java.lang.String> deleteFunc, java.util.Iterator<FileInfo> files)
Deletes files and keeps track of how many files were removed for each file type.- Parameters:
executorService
- an executor service to use for parallel deletesdeleteFunc
- a delete funcfiles
- an iterator of Spark rows of the structure (path: String, type: String)- Returns:
- stats on which files were deleted
-
deleteFiles
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(SupportsBulkOperations io, java.util.Iterator<FileInfo> files)
-
-