Package org.apache.iceberg.spark.actions
Class ComputeTableStatsSparkAction
java.lang.Object
org.apache.iceberg.spark.actions.ComputeTableStatsSparkAction
- All Implemented Interfaces:
Action<ComputeTableStats,
,ComputeTableStats.Result> ComputeTableStats
Computes the statistics of the given columns and stores it as Puffin files.
-
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.iceberg.actions.ComputeTableStats
ComputeTableStats.Result
-
Field Summary
Modifier and TypeFieldDescriptionprotected static final org.apache.iceberg.relocated.com.google.common.base.Joiner
protected static final org.apache.iceberg.relocated.com.google.common.base.Splitter
protected static final String
protected static final String
protected static final String
protected static final String
protected static final String
protected static final String
-
Method Summary
Modifier and TypeMethodDescriptionprotected org.apache.spark.sql.Dataset
<FileInfo> Choose the set of columns to collect stats, by default all columns are chosen.protected org.apache.spark.sql.Dataset
<FileInfo> contentFileDS
(Table table) protected org.apache.spark.sql.Dataset
<FileInfo> contentFileDS
(Table table, Set<Long> snapshotIds) protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary
deleteFiles
(ExecutorService executorService, Consumer<String> deleteFunc, Iterator<FileInfo> files) Deletes files and keeps track of how many files were removed for each file type.protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary
deleteFiles
(SupportsBulkOperations io, Iterator<FileInfo> files) execute()
Executes this action.protected org.apache.spark.sql.Dataset
<org.apache.spark.sql.Row> loadMetadataTable
(Table table, MetadataTableType type) protected org.apache.spark.sql.Dataset
<FileInfo> manifestDS
(Table table) protected org.apache.spark.sql.Dataset
<FileInfo> manifestDS
(Table table, Set<Long> snapshotIds) protected org.apache.spark.sql.Dataset
<FileInfo> manifestListDS
(Table table) protected org.apache.spark.sql.Dataset
<FileInfo> manifestListDS
(Table table, Set<Long> snapshotIds) protected JobGroupInfo
newJobGroupInfo
(String groupId, String desc) protected Table
newStaticTable
(TableMetadata metadata, FileIO io) options()
protected org.apache.spark.sql.Dataset
<FileInfo> otherMetadataFileDS
(Table table) protected ComputeTableStatsSparkAction
self()
snapshot
(long newSnapshotId) Choose the table snapshot to compute stats, by default the current snapshot is used.protected org.apache.spark.sql.SparkSession
spark()
protected org.apache.spark.api.java.JavaSparkContext
protected org.apache.spark.sql.Dataset
<FileInfo> statisticsFileDS
(Table table, Set<Long> snapshotIds) protected <T> T
withJobGroupInfo
(JobGroupInfo info, Supplier<T> supplier)
-
Field Details
-
MANIFEST
- See Also:
-
MANIFEST_LIST
- See Also:
-
STATISTICS_FILES
- See Also:
-
OTHERS
- See Also:
-
FILE_PATH
- See Also:
-
LAST_MODIFIED
- See Also:
-
COMMA_SPLITTER
protected static final org.apache.iceberg.relocated.com.google.common.base.Splitter COMMA_SPLITTER -
COMMA_JOINER
protected static final org.apache.iceberg.relocated.com.google.common.base.Joiner COMMA_JOINER
-
-
Method Details
-
self
-
columns
Description copied from interface:ComputeTableStats
Choose the set of columns to collect stats, by default all columns are chosen.- Specified by:
columns
in interfaceComputeTableStats
- Parameters:
newColumns
- a set of column names to be analyzed- Returns:
- this for method chaining
-
snapshot
Description copied from interface:ComputeTableStats
Choose the table snapshot to compute stats, by default the current snapshot is used.- Specified by:
snapshot
in interfaceComputeTableStats
- Parameters:
newSnapshotId
- long ID of the snapshot for which stats need to be computed- Returns:
- this for method chaining
-
execute
Description copied from interface:Action
Executes this action.- Specified by:
execute
in interfaceAction<ComputeTableStats,
ComputeTableStats.Result> - Returns:
- the result of this action
-
spark
protected org.apache.spark.sql.SparkSession spark() -
sparkContext
protected org.apache.spark.api.java.JavaSparkContext sparkContext() -
option
-
options
-
options
-
withJobGroupInfo
-
newJobGroupInfo
-
newStaticTable
-
contentFileDS
-
contentFileDS
-
manifestDS
-
manifestDS
-
manifestListDS
-
manifestListDS
-
statisticsFileDS
-
otherMetadataFileDS
-
allReachableOtherMetadataFileDS
-
loadMetadataTable
protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type) -
deleteFiles
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(ExecutorService executorService, Consumer<String> deleteFunc, Iterator<FileInfo> files) Deletes files and keeps track of how many files were removed for each file type.- Parameters:
executorService
- an executor service to use for parallel deletesdeleteFunc
- a delete funcfiles
- an iterator of Spark rows of the structure (path: String, type: String)- Returns:
- stats on which files were deleted
-
deleteFiles
protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(SupportsBulkOperations io, Iterator<FileInfo> files)
-