Package org.apache.iceberg.spark.actions
Class ComputeTableStatsSparkAction
- java.lang.Object
-
- org.apache.iceberg.spark.actions.ComputeTableStatsSparkAction
-
- All Implemented Interfaces:
Action<ComputeTableStats,ComputeTableStats.Result>, ComputeTableStats
public class ComputeTableStatsSparkAction extends java.lang.Object implements ComputeTableStats
Computes the statistics of the given columns and stores them as Puffin files.
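A typical invocation goes through SparkActions, the Spark entry point for Iceberg actions. The sketch below is a minimal, hedged example: the application name and table identifier are placeholders, and the statisticsFile() accessor on the result is assumed from the ComputeTableStats.Result interface.

import org.apache.iceberg.StatisticsFile;
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.ComputeTableStats;
import org.apache.iceberg.spark.Spark3Util;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class ComputeStatsExample {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("compute-table-stats") // placeholder application name
        .getOrCreate();

    // Placeholder identifier; resolved through the catalog configured for this session.
    Table table = Spark3Util.loadIcebergTable(spark, "db.events");

    // Compute stats for all columns of the current snapshot and store them as a Puffin file.
    ComputeTableStats.Result result = SparkActions.get(spark)
        .computeTableStats(table)
        .execute();

    // statisticsFile() is assumed here to expose the registered Puffin file.
    StatisticsFile statsFile = result.statisticsFile();
    System.out.println("Statistics file: " + (statsFile == null ? "none" : statsFile.path()));
  }
}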
-
Nested Class Summary
-
Nested classes/interfaces inherited from interface org.apache.iceberg.actions.ComputeTableStats
ComputeTableStats.Result
-
Field Summary
Fields
- protected static org.apache.iceberg.relocated.com.google.common.base.Joiner COMMA_JOINER
- protected static org.apache.iceberg.relocated.com.google.common.base.Splitter COMMA_SPLITTER
- protected static java.lang.String FILE_PATH
- protected static java.lang.String LAST_MODIFIED
- protected static java.lang.String MANIFEST
- protected static java.lang.String MANIFEST_LIST
- protected static java.lang.String OTHERS
- protected static java.lang.String STATISTICS_FILES
-
Method Summary
All Methods Instance Methods Concrete Methods
- protected org.apache.spark.sql.Dataset<FileInfo> allReachableOtherMetadataFileDS(Table table)
- ComputeTableStats columns(java.lang.String... newColumns) - Choose the set of columns to collect stats, by default all columns are chosen.
- protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table)
- protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
- protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(java.util.concurrent.ExecutorService executorService, java.util.function.Consumer<java.lang.String> deleteFunc, java.util.Iterator<FileInfo> files) - Deletes files and keeps track of how many files were removed for each file type.
- protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(SupportsBulkOperations io, java.util.Iterator<FileInfo> files)
- ComputeTableStats.Result execute() - Executes this action.
- protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type)
- protected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table)
- protected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
- protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table)
- protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
- protected JobGroupInfo newJobGroupInfo(java.lang.String groupId, java.lang.String desc)
- protected Table newStaticTable(TableMetadata metadata, FileIO io)
- ThisT option(java.lang.String name, java.lang.String value)
- protected java.util.Map<java.lang.String,java.lang.String> options()
- ThisT options(java.util.Map<java.lang.String,java.lang.String> newOptions)
- protected org.apache.spark.sql.Dataset<FileInfo> otherMetadataFileDS(Table table)
- protected ComputeTableStatsSparkAction self()
- ComputeTableStats snapshot(long newSnapshotId) - Choose the table snapshot to compute stats, by default the current snapshot is used.
- protected org.apache.spark.sql.SparkSession spark()
- protected org.apache.spark.api.java.JavaSparkContext sparkContext()
- protected org.apache.spark.sql.Dataset<FileInfo> statisticsFileDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
- protected <T> T withJobGroupInfo(JobGroupInfo info, java.util.function.Supplier<T> supplier)
-
Field Detail
-
MANIFEST
protected static final java.lang.String MANIFEST
- See Also:
- Constant Field Values
-
MANIFEST_LIST
protected static final java.lang.String MANIFEST_LIST
- See Also:
- Constant Field Values
-
STATISTICS_FILES
protected static final java.lang.String STATISTICS_FILES
- See Also:
- Constant Field Values
-
OTHERS
protected static final java.lang.String OTHERS
- See Also:
- Constant Field Values
-
FILE_PATH
protected static final java.lang.String FILE_PATH
- See Also:
- Constant Field Values
-
LAST_MODIFIED
protected static final java.lang.String LAST_MODIFIED
- See Also:
- Constant Field Values
-
COMMA_SPLITTER
protected static final org.apache.iceberg.relocated.com.google.common.base.Splitter COMMA_SPLITTER
-
COMMA_JOINER
protected static final org.apache.iceberg.relocated.com.google.common.base.Joiner COMMA_JOINER
-
Method Detail
-
self
protected ComputeTableStatsSparkAction self()
-
columns
public ComputeTableStats columns(java.lang.String... newColumns)
Description copied from interface: ComputeTableStats
Choose the set of columns to collect stats; by default all columns are chosen.
- Specified by:
  columns in interface ComputeTableStats
- Parameters:
  newColumns - a set of column names to be analyzed
- Returns:
  this for method chaining
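For instance, to limit collection to a few columns (the column names below are placeholders, and spark and table are assumed to be set up as in the sketch near the top of this page):

// Collect stats only for the listed columns; "id" and "category" are placeholder names.
SparkActions.get(spark)
    .computeTableStats(table)
    .columns("id", "category")
    .execute();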
-
snapshot
public ComputeTableStats snapshot(long newSnapshotId)
Description copied from interface: ComputeTableStats
Choose the table snapshot to compute stats; by default the current snapshot is used.
- Specified by:
  snapshot in interface ComputeTableStats
- Parameters:
  newSnapshotId - long ID of the snapshot for which stats need to be computed
- Returns:
  this for method chaining
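As a hedged sketch, reusing spark and table from the earlier example, stats can be computed for an older snapshot by passing its ID; here the parent of the current snapshot is used purely for illustration:

// parentId() may be null if the current snapshot has no parent.
Long parentSnapshotId = table.currentSnapshot().parentId();
if (parentSnapshotId != null) {
  SparkActions.get(spark)
      .computeTableStats(table)
      .snapshot(parentSnapshotId)
      .execute();
}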
-
execute
public ComputeTableStats.Result execute()
Description copied from interface: Action
Executes this action.
- Specified by:
  execute in interface Action<ComputeTableStats,ComputeTableStats.Result>
- Returns:
  the result of this action
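Since the action is expected to commit the produced statistics file to the table metadata, a quick way to verify the outcome is to refresh the table and list its statistics files; a minimal sketch, again reusing spark and table from the example above:

ComputeTableStats.Result result = SparkActions.get(spark)
    .computeTableStats(table)
    .execute();

// After the commit, the new Puffin file should show up in the table metadata.
table.refresh();
for (StatisticsFile stats : table.statisticsFiles()) {
  System.out.println(stats.snapshotId() + " -> " + stats.path());
}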
-
spark
protected org.apache.spark.sql.SparkSession spark()
-
sparkContext
protected org.apache.spark.api.java.JavaSparkContext sparkContext()
-
option
public ThisT option(java.lang.String name, java.lang.String value)
-
options
public ThisT options(java.util.Map<java.lang.String,java.lang.String> newOptions)
-
options
protected java.util.Map<java.lang.String,java.lang.String> options()
-
withJobGroupInfo
protected <T> T withJobGroupInfo(JobGroupInfo info, java.util.function.Supplier<T> supplier)
-
newJobGroupInfo
protected JobGroupInfo newJobGroupInfo(java.lang.String groupId, java.lang.String desc)
-
newStaticTable
protected Table newStaticTable(TableMetadata metadata, FileIO io)
-
contentFileDS
protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
-
manifestDS
protected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
-
manifestListDS
protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
-
statisticsFileDS
protected org.apache.spark.sql.Dataset<FileInfo> statisticsFileDS(Table table, java.util.Set<java.lang.Long> snapshotIds)
-
otherMetadataFileDS
protected org.apache.spark.sql.Dataset<FileInfo> otherMetadataFileDS(Table table)
-
allReachableOtherMetadataFileDS
protected org.apache.spark.sql.Dataset<FileInfo> allReachableOtherMetadataFileDS(Table table)
-
loadMetadataTable
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type)
-
deleteFiles
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(java.util.concurrent.ExecutorService executorService, java.util.function.Consumer<java.lang.String> deleteFunc, java.util.Iterator<FileInfo> files)
Deletes files and keeps track of how many files were removed for each file type.
- Parameters:
  executorService - an executor service to use for parallel deletes
  deleteFunc - a delete function
  files - an iterator of Spark rows of the structure (path: String, type: String)
- Returns:
  stats on which files were deleted
-
deleteFiles
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(SupportsBulkOperations io, java.util.Iterator<FileInfo> files)