Class BinPackStrategy
- java.lang.Object
- 
- org.apache.iceberg.actions.BinPackStrategy
 
- 
- All Implemented Interfaces:
- java.io.Serializable, RewriteStrategy
 - Direct Known Subclasses:
- SortStrategy, Spark3BinPackStrategy
 
 public abstract class BinPackStrategy extends java.lang.Object implements RewriteStrategy
A rewrite strategy for data files which determines which files to rewrite based on their size. If files are either smaller than the MIN_FILE_SIZE_BYTES threshold or larger than the MAX_FILE_SIZE_BYTES threshold, they are considered targets for being rewritten.
Once selected, files are grouped by BinPacking into groups defined by RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES. Groups will be considered for rewriting if they contain more files than MIN_INPUT_FILES or would produce at least one file of RewriteDataFiles.TARGET_FILE_SIZE_BYTES.
- See Also:
- Serialized Form
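The size-based selection described above can be illustrated with a minimal sketch. It is not the library's implementation: plain `long` sizes stand in for `FileScanTask` objects, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SizeBasedSelectionSketch {

    /**
     * Returns the file sizes that fall outside [minFileSize, maxFileSize].
     * Assumption: each long stands in for the length of one data file.
     */
    public static List<Long> selectFilesToRewrite(
            List<Long> fileSizes, long minFileSize, long maxFileSize) {
        List<Long> selected = new ArrayList<>();
        for (long size : fileSizes) {
            // A file is a rewrite target when it is too small or too large.
            if (size < minFileSize || size > maxFileSize) {
                selected.add(size);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        System.out.println(selectFilesToRewrite(List.of(10L, 100L, 500L), 50L, 200L));
    }
}
```

Files whose size lies between the two thresholds are left alone, which is why the two options can be tuned independently.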
 
- 
- 
Field Summary
Fields (Modifier and Type, Field: Description)
- static java.lang.String DELETE_FILE_THRESHOLD: The minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting.
- static int DELETE_FILE_THRESHOLD_DEFAULT
- static java.lang.String MAX_FILE_SIZE_BYTES: Adjusts files which will be considered for rewriting.
- static double MAX_FILE_SIZE_DEFAULT_RATIO
- static java.lang.String MIN_FILE_SIZE_BYTES: Adjusts files which will be considered for rewriting.
- static double MIN_FILE_SIZE_DEFAULT_RATIO
- static java.lang.String MIN_INPUT_FILES: The minimum number of files that need to be in a file group for it to be considered for compaction if the total size of that group is less than the RewriteDataFiles.TARGET_FILE_SIZE_BYTES.
- static int MIN_INPUT_FILES_DEFAULT
 - 
Constructor Summary
Constructors
- BinPackStrategy()
 - 
Method Summary
All Methods (Modifier and Type, Method: Description)
- protected long inputFileSize(java.util.List<FileScanTask> fileToRewrite)
- protected long maxGroupSize()
- java.lang.String name(): Returns the name of this rewrite strategy
- protected long numOutputFiles(long totalSizeInBytes): Determine how many output files to create when rewriting.
- RewriteStrategy options(java.util.Map<java.lang.String,java.lang.String> options): Sets options to be used with this strategy
- java.lang.Iterable<java.util.List<FileScanTask>> planFileGroups(java.lang.Iterable<FileScanTask> dataFiles): Groups file scans into lists which will be processed in a single executable unit.
- java.lang.Iterable<FileScanTask> selectFilesToRewrite(java.lang.Iterable<FileScanTask> dataFiles): Selects files which this strategy believes are valid targets to be rewritten.
- protected long splitSize(long totalSizeInBytes): Returns the smallest of our max write file threshold, and our estimated split size based on the number of output files we want to generate.
- protected long targetFileSize()
- java.util.Set<java.lang.String> validOptions(): Returns a set of options which this rewrite strategy can use.
- protected long writeMaxFileSize(): Estimates a larger max target file size than our target size used in task creation to avoid tasks which are predicted to have a certain size, but exceed that target size when serde is complete, creating tiny remainder files.
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 - 
Methods inherited from interface org.apache.iceberg.actions.RewriteStrategy: rewriteFiles, table
 
- 
 
- 
- 
- 
Field Detail
MIN_INPUT_FILES
public static final java.lang.String MIN_INPUT_FILES
The minimum number of files that need to be in a file group for it to be considered for compaction if the total size of that group is less than the RewriteDataFiles.TARGET_FILE_SIZE_BYTES. This can also be thought of as the maximum number of non-target-size files that could remain in a file group (partition) after rewriting.
- See Also:
- Constant Field Values
 
 - 
MIN_INPUT_FILES_DEFAULT
public static final int MIN_INPUT_FILES_DEFAULT
- See Also:
- Constant Field Values
 
 - 
MIN_FILE_SIZE_BYTES
public static final java.lang.String MIN_FILE_SIZE_BYTES
Adjusts files which will be considered for rewriting. Files smaller than MIN_FILE_SIZE_BYTES will be considered for rewriting. This functions independently of MAX_FILE_SIZE_BYTES.
Defaults to 75% of the target file size.
- See Also:
- Constant Field Values
 
 - 
MIN_FILE_SIZE_DEFAULT_RATIO
public static final double MIN_FILE_SIZE_DEFAULT_RATIO
- See Also:
- Constant Field Values
 
 - 
MAX_FILE_SIZE_BYTES
public static final java.lang.String MAX_FILE_SIZE_BYTES
Adjusts files which will be considered for rewriting. Files larger than MAX_FILE_SIZE_BYTES will be considered for rewriting. This functions independently of MIN_FILE_SIZE_BYTES.
Defaults to 180% of the target file size.
- See Also:
- Constant Field Values
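As a worked illustration of the two defaults, the sketch below derives both thresholds from a target file size. The ratios 0.75 and 1.80 are taken from the "75%" and "180%" wording in the descriptions; the class and method names are hypothetical and the real constants are MIN_FILE_SIZE_DEFAULT_RATIO and MAX_FILE_SIZE_DEFAULT_RATIO.

```java
public class DefaultThresholdSketch {

    // Assumed from the documented defaults: 75% and 180% of the target size.
    static final double MIN_RATIO = 0.75;
    static final double MAX_RATIO = 1.80;

    /** Default lower bound: files smaller than this are rewrite candidates. */
    public static long defaultMinFileSize(long targetFileSize) {
        return (long) (targetFileSize * MIN_RATIO);
    }

    /** Default upper bound: files larger than this are rewrite candidates. */
    public static long defaultMaxFileSize(long targetFileSize) {
        return (long) (targetFileSize * MAX_RATIO);
    }

    public static void main(String[] args) {
        long target = 512L * 1024 * 1024; // a 512 MB target, chosen for illustration
        System.out.println("min = " + defaultMinFileSize(target));
        System.out.println("max = " + defaultMaxFileSize(target));
    }
}
```

With a 512 MB target, files between roughly 384 MB and 921 MB would be left untouched under these defaults.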
 
 - 
MAX_FILE_SIZE_DEFAULT_RATIO
public static final double MAX_FILE_SIZE_DEFAULT_RATIO
- See Also:
- Constant Field Values
 
 - 
DELETE_FILE_THRESHOLD
public static final java.lang.String DELETE_FILE_THRESHOLD
The minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting. If a data file has this number of deletes or more, it will be rewritten regardless of its file size determined by MIN_FILE_SIZE_BYTES and MAX_FILE_SIZE_BYTES. If a file group contains a file that satisfies this condition, the file group will be rewritten regardless of the number of files in the file group determined by MIN_INPUT_FILES.
Defaults to Integer.MAX_VALUE, which means this feature is not enabled by default.
- See Also:
- Constant Field Values
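The group-level conditions can be combined into one predicate, sketched below under stated assumptions. `FileInfo` is a hypothetical stand-in for a data file with its delete count, and the "at least MIN_INPUT_FILES files" comparison follows the MIN_INPUT_FILES field description; the actual library code may apply further conditions.

```java
import java.util.List;

public class GroupFilterSketch {

    /** Hypothetical stand-in for a data file: its size and number of associated deletes. */
    public static final class FileInfo {
        final long sizeBytes;
        final int deleteCount;

        public FileInfo(long sizeBytes, int deleteCount) {
            this.sizeBytes = sizeBytes;
            this.deleteCount = deleteCount;
        }
    }

    /** Decides whether a planned file group is worth rewriting. */
    public static boolean shouldRewriteGroup(
            List<FileInfo> group, int minInputFiles, long targetFileSize, int deleteFileThreshold) {
        long totalSize = 0;
        boolean hasTooManyDeletes = false;
        for (FileInfo file : group) {
            totalSize += file.sizeBytes;
            // One file at or above the delete threshold forces the whole group.
            if (file.deleteCount >= deleteFileThreshold) {
                hasTooManyDeletes = true;
            }
        }
        return group.size() >= minInputFiles   // enough small inputs to compact
            || totalSize >= targetFileSize     // would yield at least one full-size file
            || hasTooManyDeletes;
    }
}
```

Leaving the threshold at `Integer.MAX_VALUE` makes the delete check effectively unreachable, which matches the "not enabled by default" behavior described above.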
 
 - 
DELETE_FILE_THRESHOLD_DEFAULT
public static final int DELETE_FILE_THRESHOLD_DEFAULT
- See Also:
- Constant Field Values
 
 
- 
 - 
Method Detail
name
public java.lang.String name()
Description copied from interface: RewriteStrategy
Returns the name of this rewrite strategy
- Specified by:
- name in interface RewriteStrategy
 
 - 
validOptions
public java.util.Set<java.lang.String> validOptions()
Description copied from interface: RewriteStrategy
Returns a set of options which this rewrite strategy can use. This is an allowed-list and any options not specified here will be rejected at runtime.
- Specified by:
- validOptions in interface RewriteStrategy
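The allowed-list behavior can be pictured with a small validation sketch. It is an illustration only: the class, the method names, and the option key strings used in `main` are assumptions, and the real key values are the ones listed under Constant Field Values.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class OptionValidationSketch {

    /** Returns every supplied key that is not in the allowed-list. */
    public static Set<String> invalidKeys(Map<String, String> options, Set<String> validOptions) {
        Set<String> invalid = new HashSet<>(options.keySet());
        invalid.removeAll(validOptions);
        return invalid;
    }

    /** Rejects the whole option map if any key is unknown. */
    public static void validate(Map<String, String> options, Set<String> validOptions) {
        Set<String> invalid = invalidKeys(options, validOptions);
        if (!invalid.isEmpty()) {
            throw new IllegalArgumentException("Unsupported options: " + invalid);
        }
    }

    public static void main(String[] args) {
        // Placeholder keys, assumed for illustration.
        Set<String> valid = Set.of("min-input-files", "min-file-size-bytes");
        validate(Map.of("min-input-files", "5"), valid); // passes silently
        System.out.println(invalidKeys(Map.of("bogus", "1"), valid));
    }
}
```

Failing fast on unknown keys surfaces typos in option names before any rewrite work is planned.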
 
 - 
options
public RewriteStrategy options(java.util.Map<java.lang.String,java.lang.String> options)
Description copied from interface: RewriteStrategy
Sets options to be used with this strategy
- Specified by:
- options in interface RewriteStrategy
 
 - 
selectFilesToRewrite
public java.lang.Iterable<FileScanTask> selectFilesToRewrite(java.lang.Iterable<FileScanTask> dataFiles)
Description copied from interface: RewriteStrategy
Selects files which this strategy believes are valid targets to be rewritten.
- Specified by:
- selectFilesToRewrite in interface RewriteStrategy
- Parameters:
- dataFiles - iterable of FileScanTasks for files in a given partition
- Returns:
- iterable containing only FileScanTasks to be rewritten
 
 - 
planFileGroups
public java.lang.Iterable<java.util.List<FileScanTask>> planFileGroups(java.lang.Iterable<FileScanTask> dataFiles)
Description copied from interface: RewriteStrategy
Groups file scans into lists which will be processed in a single executable unit. Each group will end up being committed as an independent set of changes. This creates the jobs which will eventually be run by the underlying Action.
- Specified by:
- planFileGroups in interface RewriteStrategy
- Parameters:
- dataFiles - iterable of FileScanTasks to be rewritten
- Returns:
- iterable of lists of FileScanTasks which will be processed together
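To make the grouping step concrete, here is a minimal sketch of packing files into groups capped by a maximum group size. It uses a simple next-fit greedy pass as a stand-in and is not the library's BinPacking utility; plain `long` sizes replace `FileScanTask` objects and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class GroupPlanningSketch {

    /**
     * Packs file sizes, in order, into groups whose total size stays at or
     * below maxGroupSize. A single file larger than the cap gets its own group.
     */
    public static List<List<Long>> planFileGroups(List<Long> fileSizes, long maxGroupSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            // Close the current group when the next file would overflow it.
            if (!current.isEmpty() && currentSize + size > maxGroupSize) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(planFileGroups(List.of(40L, 40L, 40L, 90L, 10L), 100L));
    }
}
```

Each inner list corresponds to one unit of work that would be rewritten and committed independently.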
 
 - 
targetFileSize
protected long targetFileSize()
 - 
numOutputFiles
protected long numOutputFiles(long totalSizeInBytes)
Determine how many output files to create when rewriting. We use this to determine the split size we want to use when actually writing files, to avoid the following situation.
If we are writing 10.1 GB of data with a target file size of 1 GB, we would end up with 11 files, one of which would only have 0.1 GB. This would most likely be less preferable to 10 files, each of which was 1.01 GB. So here we decide whether to round up or round down based on what the estimated average file size will be if we ignore the remainder (0.1 GB). If the new file size is less than 10% greater than the target file size, then we will round down when determining the number of output files.
- Parameters:
- totalSizeInBytes - total data size for a file group
- Returns:
- the number of files this strategy should create
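The rounding rule can be sketched as follows. This covers only the documented 10% rule; the early return for inputs smaller than the target is a guard added here to avoid dividing by zero, and the real method may consider other thresholds as well. The class name and the two-argument signature are hypothetical.

```java
public class OutputFileCountSketch {

    /** Chooses between rounding the file count down or up, per the 10% rule. */
    public static long numOutputFiles(long totalSizeInBytes, long targetFileSize) {
        if (totalSizeInBytes < targetFileSize) {
            return 1; // guard added for this sketch: everything fits in one file
        }
        long roundedDown = totalSizeInBytes / targetFileSize;
        long remainder = totalSizeInBytes % targetFileSize;
        if (remainder == 0) {
            return roundedDown;
        }
        // Average size if the remainder is spread across the rounded-down files.
        long avgSizeIfRoundedDown = totalSizeInBytes / roundedDown;
        if (avgSizeIfRoundedDown < 1.1 * targetFileSize) {
            return roundedDown;     // e.g. 10 files of 1.01 GB
        }
        return roundedDown + 1;     // remainder is large enough to keep its own file
    }

    public static void main(String[] args) {
        // The documented example, in MB: 10.1 GB of data, 1 GB target.
        System.out.println(numOutputFiles(10_100, 1_000));
    }
}
```

With 1,500 units of data and a 1,000-unit target, rounding down would give a single file 50% over target, so the sketch rounds up to two files instead.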
 
 - 
splitSize
protected long splitSize(long totalSizeInBytes)
Returns the smaller of our max write file threshold and our estimated split size based on the number of output files we want to generate. An overhead is added to the estimated split size to try to avoid small errors in size creating brand-new files.
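A minimal sketch of that calculation follows, assuming the output file count and the write max are already known. The explicit `overheadBytes` parameter and the four-argument signature are hypothetical; the documentation does not state the overhead value.

```java
public class SplitSizeSketch {

    /**
     * Estimated split size: total size divided evenly across the output
     * files, plus a small overhead, capped at the max write file size.
     */
    public static long splitSize(
            long totalSizeInBytes, long numOutputFiles, long writeMaxFileSize, long overheadBytes) {
        long estimatedSplitSize = totalSizeInBytes / numOutputFiles + overheadBytes;
        return Math.min(estimatedSplitSize, writeMaxFileSize);
    }

    public static void main(String[] args) {
        System.out.println(splitSize(10_100, 10, 1_400, 5));
    }
}
```

The overhead gives each split a little slack, so a task that comes out slightly larger than estimated does not spill into an extra tiny file.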
 - 
inputFileSize
protected long inputFileSize(java.util.List<FileScanTask> fileToRewrite)
 - 
maxGroupSize
protected long maxGroupSize()
 - 
writeMaxFileSize
protected long writeMaxFileSize()
Estimates a larger max target file size than our target size used in task creation to avoid tasks which are predicted to have a certain size, but exceed that target size when serde is complete, creating tiny remainder files.
While we create tasks that should all be smaller than our target size, there is a chance that the actual data will end up being larger than our target size due to compression, serialization and other factors outside our control. If this occurs, instead of making a single file that is close in size to our target, we would end up producing one file of the target size, and then a small extra file with the remaining data. For example, if our target is 512 MB we may generate a rewrite task that should be 500 MB. When we write the data we may find we actually have to write out 530 MB. If we use the target size while writing, we would produce a 512 MB file and an 18 MB file. If instead we use a larger size estimated by this method, then we end up writing a single file.
- Returns:
- the target size plus one half of the distance between max and target
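The documented return value, the target size plus half of the distance between max and target, can be worked through in a short sketch. The class name and the two-argument signature are hypothetical, and the 920 MB max in `main` is an illustrative value close to 180% of a 512 MB target.

```java
public class WriteMaxFileSizeSketch {

    /** Target size plus one half of the distance between max and target. */
    public static long writeMaxFileSize(long targetFileSize, long maxFileSize) {
        return targetFileSize + (maxFileSize - targetFileSize) / 2;
    }

    public static void main(String[] args) {
        long writeMax = writeMaxFileSize(512, 920); // sizes in MB for readability
        long actualTaskSize = 530;                  // the 530 MB case from the description
        // The task fits under the relaxed limit, so it is written as one file.
        System.out.println(actualTaskSize <= writeMax);
    }
}
```

Under these numbers the relaxed limit is well above 530 MB, so the 18 MB remainder file from the example is avoided.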
 
 
- 
 
-