Class SparkDistributedDataScan
- java.lang.Object
  - org.apache.iceberg.SnapshotScan<ThisT,T,G>
    - org.apache.iceberg.SparkDistributedDataScan

public class SparkDistributedDataScan extends SnapshotScan<ThisT,T,G>

A batch data scan that can utilize Spark cluster resources for planning.

This scan remotely filters manifests, fetching only the relevant data and delete files to the driver. The delete file assignment is done locally after the remote filtering step. This approach is beneficial if the remote parallelism is much higher than the number of driver cores.

This scan is best suited for queries with selective filters on lower/upper bounds across all partitions, or against poorly clustered metadata. This allows job planning to benefit from highly concurrent remote filtering while not incurring high serialization and data transfer costs. This class is also useful for full table scans over large tables, but the cost of bringing data and delete file details to the driver may become noticeable. Make sure to follow the performance tips below in such cases.

Ensure the filtered metadata size doesn't exceed the driver's max result size. For large table scans, consider increasing `spark.driver.maxResultSize` to avoid job failures.

Performance tips:
- Enable Kryo serialization (`spark.serializer`)
- Increase the number of driver cores (`spark.driver.cores`)
- Tune the number of threads used to fetch task results (`spark.resultGetter.threads`)
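
The properties named in the class description can be collected in one place before a job is submitted. The following is a minimal sketch, not part of this class: the class name `PlanningConfSketch`, its methods, and the values `8` and `4g` are illustrative assumptions, and suitable values depend on the cluster and table size. Only the property keys and the Kryo serializer class name come from Spark itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class PlanningConfSketch {

  // Returns the Spark properties mentioned in the class description.
  // All values except the Kryo serializer class are illustrative assumptions.
  static Map<String, String> planningConf() {
    Map<String, String> conf = new LinkedHashMap<>();
    conf.put("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    conf.put("spark.driver.cores", "8");            // assumed value
    conf.put("spark.driver.maxResultSize", "4g");   // assumed value
    conf.put("spark.resultGetter.threads", "8");    // assumed value
    return conf;
  }

  // Renders the properties as spark-submit style "--conf key=value" arguments.
  static String toSubmitArgs(Map<String, String> conf) {
    return conf.entrySet().stream()
        .map(e -> "--conf " + e.getKey() + "=" + e.getValue())
        .collect(Collectors.joining(" "));
  }

  public static void main(String[] args) {
    System.out.println(toSubmitArgs(planningConf()));
  }
}
```

The same key/value pairs could instead be passed to a `SparkSession` builder; the map form is used here only to keep the sketch free of Spark dependencies.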
 
- 
- 
Field Summary
Fields (Modifier and Type, Field)
- protected static java.util.List<java.lang.String> DELETE_SCAN_COLUMNS
- protected static java.util.List<java.lang.String> DELETE_SCAN_WITH_STATS_COLUMNS
- protected static boolean PLAN_SCANS_WITH_WORKER_POOL
- protected static java.util.List<java.lang.String> SCAN_COLUMNS
- protected static java.util.List<java.lang.String> SCAN_WITH_STATS_COLUMNS
 - 
Constructor Summary
Constructors
- SparkDistributedDataScan(org.apache.spark.sql.SparkSession spark, Table table, SparkReadConf readConf)
 - 
Method Summary
Methods (Modifier and Type, Method, Description)
- ThisT caseSensitive(boolean caseSensitive)
  Create a new scan from this that, if data columns were selected via Scan.select(java.util.Collection), controls whether the match to the schema will be done with case sensitivity.
- protected java.util.Set<java.lang.Integer> columnsToKeepStats()
- protected org.apache.iceberg.TableScanContext context()
- protected PlanningMode dataPlanningMode()
  Returns which planning mode to use for data.
- protected PlanningMode deletePlanningMode()
  Returns which planning mode to use for deletes.
- protected CloseableIterable<ScanTask> doPlanFiles()
- Expression filter()
  Returns this scan's filter Expression.
- ThisT filter(Expression expr)
  Create a new scan from the results of this filtered by the Expression.
- ThisT ignoreResiduals()
  Create a new scan from this that applies data filtering to files but not to rows in those files.
- ThisT includeColumnStats()
  Create a new scan from this that loads the column stats with each data file.
- ThisT includeColumnStats(java.util.Collection<java.lang.String> requestedColumns)
  Create a new scan from this that loads the column stats for the specific columns with each data file.
- protected FileIO io()
- boolean isCaseSensitive()
  Returns whether this scan is case-sensitive with respect to column names.
- ThisT metricsReporter(MetricsReporter reporter)
  Create a new scan that will report scan metrics to the provided reporter in addition to reporters maintained by the scan.
- protected org.apache.iceberg.ManifestGroup newManifestGroup(java.util.List<ManifestFile> dataManifests, boolean withColumnStats)
- protected org.apache.iceberg.ManifestGroup newManifestGroup(java.util.List<ManifestFile> dataManifests, java.util.List<ManifestFile> deleteManifests)
- protected org.apache.iceberg.ManifestGroup newManifestGroup(java.util.List<ManifestFile> dataManifests, java.util.List<ManifestFile> deleteManifests, boolean withColumnStats)
- protected BatchScan newRefinedScan(Table newTable, Schema newSchema, org.apache.iceberg.TableScanContext newContext)
- ThisT option(java.lang.String property, java.lang.String value)
  Create a new scan from this scan's configuration that will override the Table's behavior based on the incoming pair.
- protected java.util.Map<java.lang.String,java.lang.String> options()
- protected java.lang.Iterable<CloseableIterable<DataFile>> planDataRemotely(java.util.List<ManifestFile> dataManifests, boolean withColumnStats)
  Plans data remotely.
- protected org.apache.iceberg.DeleteFileIndex planDeletesRemotely(java.util.List<ManifestFile> deleteManifests)
  Plans deletes remotely.
- protected java.util.concurrent.ExecutorService planExecutor()
- CloseableIterable<ScanTaskGroup<ScanTask>> planTasks()
  Plan balanced task groups for this scan by splitting large and combining small tasks.
- ThisT planWith(java.util.concurrent.ExecutorService executorService)
  Create a new scan to use a particular executor to plan.
- ThisT project(Schema projectedSchema)
  Create a new scan from this with the schema as its projection.
- protected int remoteParallelism()
  Returns the cluster parallelism.
- protected Expression residualFilter()
- protected java.util.List<java.lang.String> scanColumns()
- Schema schema()
  Returns this scan's projection Schema.
- ThisT select(java.util.Collection<java.lang.String> columns)
  Create a new scan from this that will read the given data columns.
- protected boolean shouldCopyRemotelyPlannedDataFiles()
  Controls whether defensive copies are created for remotely planned data files.
- protected boolean shouldIgnoreResiduals()
- protected boolean shouldPlanWithExecutor()
- protected boolean shouldReturnColumnStats()
- int splitLookback()
  Returns the split lookback for this scan.
- long splitOpenFileCost()
  Returns the split open file cost for this scan.
- Table table()
- protected Schema tableSchema()
- long targetSplitSize()
  Returns the target split size for this scan.
- protected boolean useSnapshotSchema()
Methods inherited from class org.apache.iceberg.SnapshotScan
asOfTime, planFiles, scanMetrics, snapshot, snapshotId, toString, useRef, useSnapshot
 - 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 - 
Methods inherited from interface org.apache.iceberg.BatchScan
asOfTime, snapshot, table, useRef, useSnapshot
 - 
Methods inherited from interface org.apache.iceberg.Scan
caseSensitive, filter, filter, ignoreResiduals, includeColumnStats, includeColumnStats, isCaseSensitive, metricsReporter, option, planFiles, planWith, project, schema, select, select, splitLookback, splitOpenFileCost, targetSplitSize
 
- 
 
- 
- 
- 
Field Detail
SCAN_COLUMNS
protected static final java.util.List<java.lang.String> SCAN_COLUMNS
 - 
SCAN_WITH_STATS_COLUMNS
protected static final java.util.List<java.lang.String> SCAN_WITH_STATS_COLUMNS
 - 
DELETE_SCAN_COLUMNS
protected static final java.util.List<java.lang.String> DELETE_SCAN_COLUMNS
 - 
DELETE_SCAN_WITH_STATS_COLUMNS
protected static final java.util.List<java.lang.String> DELETE_SCAN_WITH_STATS_COLUMNS
 - 
PLAN_SCANS_WITH_WORKER_POOL
protected static final boolean PLAN_SCANS_WITH_WORKER_POOL
 
- 
 - 
Constructor Detail
SparkDistributedDataScan
public SparkDistributedDataScan(org.apache.spark.sql.SparkSession spark, Table table, SparkReadConf readConf)
 
- 
 - 
Method Detail
newRefinedScan
protected BatchScan newRefinedScan(Table newTable, Schema newSchema, org.apache.iceberg.TableScanContext newContext)
 - 
remoteParallelism
protected int remoteParallelism()
Returns the cluster parallelism.

This value indicates the maximum number of manifests that can be processed concurrently by the cluster. Implementations should take into account both the currently available processing slots and potential dynamic allocation, if applicable.

The remote parallelism is compared against the size of the thread pool available locally to determine the feasibility of remote planning. This value is ignored if the planning mode is set explicitly as local or distributed.
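
The interaction between the planning mode and the remote parallelism can be sketched as a small decision function. This is an illustration only: the `Mode` enum is a hypothetical stand-in for the planning mode, and the rule "plan remotely when the remote parallelism exceeds the local pool size" is an assumption, since the description above says the two values are compared but does not state the exact threshold.

```java
public class PlanningModeSketch {

  // Hypothetical stand-in for the planning mode values referenced in the text.
  enum Mode { AUTO, LOCAL, DISTRIBUTED }

  // Decides whether planning should run on the cluster.
  // Explicit LOCAL or DISTRIBUTED modes ignore the remote parallelism.
  // The AUTO rule below is an assumed comparison, not the documented one.
  static boolean planRemotely(Mode mode, int remoteParallelism, int localPoolSize) {
    switch (mode) {
      case LOCAL:
        return false;
      case DISTRIBUTED:
        return true;
      default:
        return remoteParallelism > localPoolSize;
    }
  }

  public static void main(String[] args) {
    System.out.println(planRemotely(Mode.AUTO, 200, 8));
    System.out.println(planRemotely(Mode.LOCAL, 200, 8));
  }
}
```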
 - 
dataPlanningMode
protected PlanningMode dataPlanningMode()
Returns which planning mode to use for data.
 - 
shouldCopyRemotelyPlannedDataFiles
protected boolean shouldCopyRemotelyPlannedDataFiles()
Controls whether defensive copies are created for remotely planned data files.

By default, this class creates defensive copies for each data file that is planned remotely, assuming the provided iterable can be lazy and may reuse objects. If unnecessary and data file objects can be safely added into a collection, implementations can override this behavior.
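
Why a defensive copy matters when an iterable reuses objects can be shown with plain Java. The sketch below does not use Iceberg types: `FileInfo` is a hypothetical mutable record standing in for a data file, and `reusingIterable` is an assumed example of a lazy iterable that hands back the same instance on every step.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class DefensiveCopySketch {

  // Hypothetical mutable record standing in for a data file object.
  static final class FileInfo {
    String path;

    FileInfo(String path) {
      this.path = path;
    }

    FileInfo copy() {
      return new FileInfo(path);
    }
  }

  // A lazy iterable that mutates and returns one shared FileInfo instance.
  static Iterable<FileInfo> reusingIterable(List<String> paths) {
    return () -> new Iterator<FileInfo>() {
      private final FileInfo reused = new FileInfo(null);
      private int index = 0;

      @Override
      public boolean hasNext() {
        return index < paths.size();
      }

      @Override
      public FileInfo next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        reused.path = paths.get(index++);
        return reused;
      }
    };
  }

  // Collects all elements first, then reads their paths, optionally copying each element.
  static List<String> collect(Iterable<FileInfo> files, boolean copy) {
    List<FileInfo> collected = new ArrayList<>();
    for (FileInfo file : files) {
      collected.add(copy ? file.copy() : file);
    }
    List<String> paths = new ArrayList<>();
    for (FileInfo file : collected) {
      paths.add(file.path);
    }
    return paths;
  }
}
```

Without the copy, every collected element is the same object and reports the last path seen; with the copy, each element keeps its own path.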
 - 
planDataRemotely
protected java.lang.Iterable<CloseableIterable<DataFile>> planDataRemotely(java.util.List<ManifestFile> dataManifests, boolean withColumnStats)
Plans data remotely.

Implementations are encouraged to return groups of matching data files, enabling this class to process multiple groups concurrently to speed up the remaining work. This is particularly useful when dealing with equality deletes, as delete index lookups with such delete files require comparing bounds and typically benefit from parallelization.

If the result iterable reuses objects, shouldCopyRemotelyPlannedDataFiles() must return true.

The input data manifests have already been filtered to include only potential matches based on the scan filter. Implementations are expected to further filter these manifests and only return files that may hold data matching the scan filter.
- Parameters:
- dataManifests - data manifests that may contain files matching the scan filter
- withColumnStats - a flag indicating whether to load column stats
- Returns:
- groups of data files planned remotely
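
The benefit of returning groups rather than one flat sequence is that each group can be handed to a separate worker. A minimal sketch of that idea, using only `java.util.concurrent`: file names are plain strings, the `"task:"` prefix is a placeholder for the per-file work (such as a delete index lookup), and the class and method names are hypothetical rather than part of this API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class GroupedPlanningSketch {

  // Processes each group in its own task and concatenates the results in group order.
  static List<String> processGroups(List<List<String>> groups, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<List<String>>> jobs = new ArrayList<>();
      for (List<String> group : groups) {
        jobs.add(() -> {
          List<String> tasks = new ArrayList<>();
          for (String file : group) {
            tasks.add("task:" + file); // placeholder for per-file planning work
          }
          return tasks;
        });
      }
      List<String> all = new ArrayList<>();
      // invokeAll returns futures in the same order as the submitted jobs.
      for (Future<List<String>> future : pool.invokeAll(jobs)) {
        all.addAll(future.get());
      }
      return all;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while planning", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Planning failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
```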
 
 - 
deletePlanningMode
protected PlanningMode deletePlanningMode()
Returns which planning mode to use for deletes.
 - 
planDeletesRemotely
protected org.apache.iceberg.DeleteFileIndex planDeletesRemotely(java.util.List<ManifestFile> deleteManifests)
Plans deletes remotely.

The input delete manifests have already been filtered to include only potential matches based on the scan filter. Implementations are expected to further filter these manifests and return files that may hold deletes matching the scan filter.
- Parameters:
- deleteManifests - delete manifests that may contain files matching the scan filter
- Returns:
- a delete file index planned remotely
 
 - 
doPlanFiles
protected CloseableIterable<ScanTask> doPlanFiles()
- Specified by:
- doPlanFiles in class SnapshotScan<BatchScan,ScanTask,ScanTaskGroup<ScanTask>>
 
 - 
planTasks
public CloseableIterable<ScanTaskGroup<ScanTask>> planTasks()
Description copied from interface: Scan
Plan balanced task groups for this scan by splitting large and combining small tasks.

Task groups created by this method may read partial input files, multiple input files or both.
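
The idea of splitting large tasks and combining small ones can be illustrated with file sizes alone. The sketch below is a simplified greedy packing, not the algorithm this method uses: it ignores split lookback, treats the open file cost as a minimum weight per split (an assumption), and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitAndCombineSketch {

  // Cuts each file into pieces of at most targetSize bytes.
  static List<Long> split(List<Long> fileSizes, long targetSize) {
    List<Long> splits = new ArrayList<>();
    for (long size : fileSizes) {
      long remaining = size;
      while (remaining > targetSize) {
        splits.add(targetSize);
        remaining -= targetSize;
      }
      if (remaining > 0) {
        splits.add(remaining);
      }
    }
    return splits;
  }

  // Packs consecutive splits into groups whose total weight stays within targetSize.
  // Each split weighs at least openFileCost (assumed weighting).
  static List<List<Long>> combine(List<Long> splits, long targetSize, long openFileCost) {
    List<List<Long>> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentWeight = 0;
    for (long split : splits) {
      long weight = Math.max(split, openFileCost);
      if (!current.isEmpty() && currentWeight + weight > targetSize) {
        groups.add(current);
        current = new ArrayList<>();
        currentWeight = 0;
      }
      current.add(split);
      currentWeight += weight;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }
}
```

With a target of 128, a 300-byte file is read by three groups (a partial-file read each), while files of 40 and 30 bytes can share one group with the 44-byte tail of the large file, which matches the "partial input files, multiple input files or both" behavior described above.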
 - 
useSnapshotSchema
protected boolean useSnapshotSchema()
- Overrides:
- useSnapshotSchema in class SnapshotScan<ThisT,T extends ScanTask,G extends ScanTaskGroup<T>>
 
 - 
newManifestGroup
protected org.apache.iceberg.ManifestGroup newManifestGroup(java.util.List<ManifestFile> dataManifests, java.util.List<ManifestFile> deleteManifests)
 - 
newManifestGroup
protected org.apache.iceberg.ManifestGroup newManifestGroup(java.util.List<ManifestFile> dataManifests, boolean withColumnStats)
 - 
newManifestGroup
protected org.apache.iceberg.ManifestGroup newManifestGroup(java.util.List<ManifestFile> dataManifests, java.util.List<ManifestFile> deleteManifests, boolean withColumnStats)
 - 
table
public Table table()
 - 
io
protected FileIO io()
 - 
tableSchema
protected Schema tableSchema()
 - 
context
protected org.apache.iceberg.TableScanContext context()
 - 
options
protected java.util.Map<java.lang.String,java.lang.String> options()
 - 
scanColumns
protected java.util.List<java.lang.String> scanColumns()
 - 
shouldReturnColumnStats
protected boolean shouldReturnColumnStats()
 - 
columnsToKeepStats
protected java.util.Set<java.lang.Integer> columnsToKeepStats()
 - 
shouldIgnoreResiduals
protected boolean shouldIgnoreResiduals()
 - 
residualFilter
protected Expression residualFilter()
 - 
shouldPlanWithExecutor
protected boolean shouldPlanWithExecutor()
 - 
planExecutor
protected java.util.concurrent.ExecutorService planExecutor()
 - 
option
public ThisT option(java.lang.String property, java.lang.String value)
Description copied from interface: Scan
Create a new scan from this scan's configuration that will override the Table's behavior based on the incoming pair. Unknown properties will be ignored.
- Specified by:
- option in interface Scan<ThisT,T extends ScanTask,G extends ScanTaskGroup<T>>
- Parameters:
- property - name of the table property to be overridden
- value - value to override with
- Returns:
- a new scan based on this with overridden behavior
 
 - 
project
public ThisT project(Schema projectedSchema)
Description copied from interface: Scan
Create a new scan from this with the schema as its projection.
- Specified by:
- project in interface Scan<ThisT,T extends ScanTask,G extends ScanTaskGroup<T>>
- Parameters:
- projectedSchema - a projection schema
- Returns:
- a new scan based on this with the given projection
 
 - 
caseSensitive
public ThisT caseSensitive(boolean caseSensitive)
Description copied from interface: Scan
Create a new scan from this that, if data columns were selected via Scan.select(java.util.Collection), controls whether the match to the schema will be done with case sensitivity. Default is true.
- Specified by:
- caseSensitive in interface Scan<ThisT,T extends ScanTask,G extends ScanTaskGroup<T>>
- Returns:
- a new scan based on this with case sensitivity as stated
 
 - 
isCaseSensitive
public boolean isCaseSensitive()
Description copied from interface: Scan
Returns whether this scan is case-sensitive with respect to column names.
- Specified by:
- isCaseSensitive in interface Scan<ThisT,T extends ScanTask,G extends ScanTaskGroup<T>>
- Returns:
- true if case-sensitive, false otherwise.
 
 - 
includeColumnStats
public ThisT includeColumnStats()
Description copied from interface: Scan
Create a new scan from this that loads the column stats with each data file.

Column stats include: value count, null value count, lower bounds, and upper bounds.
- Specified by:
- includeColumnStats in interface Scan<ThisT,T extends ScanTask,G extends ScanTaskGroup<T>>
- Returns:
- a new scan based on this that loads column stats.
 
 - 
includeColumnStats
public ThisT includeColumnStats(java.util.Collection<java.lang.String> requestedColumns)
Description copied from interface: Scan
Create a new scan from this that loads the column stats for the specific columns with each data file.

Column stats include: value count, null value count, lower bounds, and upper bounds.
- Specified by:
- includeColumnStats in interface Scan<ThisT,T extends ScanTask,G extends ScanTaskGroup<T>>
- Parameters:
- requestedColumns - column names for which to keep the stats.
- Returns:
- a new scan based on this that loads column stats for specific columns.
 
 - 
select
public ThisT select(java.util.Collection<java.lang.String> columns)
Description copied from interface: Scan
Create a new scan from this that will read the given data columns. This produces an expected schema that includes all fields that are either selected or used by this scan's filter expression.
- Specified by:
- select in interface Scan<ThisT,T extends ScanTask,G extends ScanTaskGroup<T>>
- Parameters:
- columns - column names from the table's schema
- Returns:
- a new scan based on this with the given projection columns
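
The rule that the expected schema contains every field that is either selected or referenced by the filter amounts to a set union over the table's columns. A minimal sketch with column names as strings: the class and method are hypothetical, and keeping the table's column order in the result is an assumption made for readability.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SelectProjectionSketch {

  // Returns table columns that are selected or used by the filter, in table order.
  static Set<String> expectedColumns(
      List<String> tableColumns, Collection<String> selected, Collection<String> filterColumns) {
    Set<String> result = new LinkedHashSet<>();
    for (String column : tableColumns) {
      if (selected.contains(column) || filterColumns.contains(column)) {
        result.add(column);
      }
    }
    return result;
  }
}
```

For example, selecting `name` while filtering on `ts` yields a projection that contains both `ts` and `name`, even though only `name` was requested.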
 
 - 
filter
public ThisT filter(Expression expr)
Description copied from interface: Scan
Create a new scan from the results of this filtered by the Expression.
- Specified by:
- filter in interface Scan<ThisT,T extends ScanTask,G extends ScanTaskGroup<T>>
- Parameters:
- expr - a filter expression
- Returns:
- a new scan based on this with results filtered by the expression
 
 - 
filter
public Expression filter()
Description copied from interface: Scan
Returns this scan's filter Expression.
- Specified by:
- filter in interface Scan<ThisT,T extends ScanTask,G extends ScanTaskGroup<T>>
- Returns:
- this scan's filter expression
 
 - 
ignoreResiduals
public ThisT ignoreResiduals()
Description copied from interface: Scan
Create a new scan from this that applies data filtering to files but not to rows in those files.
- Specified by:
- ignoreResiduals in interface Scan<ThisT,T extends ScanTask,G extends ScanTaskGroup<T>>
- Returns:
- a new scan based on this that does not filter rows in files.
 
 - 
planWith
public ThisT planWith(java.util.concurrent.ExecutorService executorService)
Description copied from interface: Scan
Create a new scan to use a particular executor to plan. If no executor is provided, the default worker pool is used.
- Specified by:
- planWith in interface Scan<ThisT,T extends ScanTask,G extends ScanTaskGroup<T>>
- Parameters:
- executorService - the provided executor
- Returns:
- a table scan that uses the provided executor to access manifests
 
 - 
schema
public Schema schema()
Description copied from interface: Scan
Returns this scan's projection Schema.

If the projection schema was set directly using Scan.project(Schema), returns that schema.

If the projection schema was set by calling Scan.select(Collection), returns a projection schema that includes the selected data fields and any fields used in the filter expression.
- Specified by:
- schema in interface Scan<ThisT,T extends ScanTask,G extends ScanTaskGroup<T>>
- Returns:
- this scan's projection schema
 
 - 
targetSplitSize
public long targetSplitSize()
Description copied from interface: Scan
Returns the target split size for this scan.
- Specified by:
- targetSplitSize in interface Scan<ThisT,T extends ScanTask,G extends ScanTaskGroup<T>>
 
 - 
splitLookback
public int splitLookback()
Description copied from interface: Scan
Returns the split lookback for this scan.
- Specified by:
- splitLookback in interface Scan<ThisT,T extends ScanTask,G extends ScanTaskGroup<T>>
 
 - 
splitOpenFileCost
public long splitOpenFileCost()
Description copied from interface: Scan
Returns the split open file cost for this scan.
- Specified by:
- splitOpenFileCost in interface Scan<ThisT,T extends ScanTask,G extends ScanTaskGroup<T>>
 
 - 
metricsReporter
public ThisT metricsReporter(MetricsReporter reporter)
Description copied from interface: Scan
Create a new scan that will report scan metrics to the provided reporter in addition to reporters maintained by the scan.
- Specified by:
- metricsReporter in interface Scan<ThisT,T extends ScanTask,G extends ScanTaskGroup<T>>
 
 
- 
 
-