Class BinPackStrategy
- java.lang.Object
-
- org.apache.iceberg.actions.BinPackStrategy
-
- All Implemented Interfaces:
java.io.Serializable, RewriteStrategy
- Direct Known Subclasses:
SortStrategy, SparkBinPackStrategy
public abstract class BinPackStrategy extends java.lang.Object implements RewriteStrategy
A rewrite strategy for data files which determines which files to rewrite based on their size. If files are either smaller than the MIN_FILE_SIZE_BYTES threshold or larger than the MAX_FILE_SIZE_BYTES threshold, they are considered targets for being rewritten.
Once selected, files are grouped based on a BinPacking into groups defined by RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES. Groups will be considered for rewriting if they contain more files than MIN_INPUT_FILES or would produce at least one file of RewriteDataFiles.TARGET_FILE_SIZE_BYTES.
- See Also:
- Serialized Form
-
-
Field Summary
Fields
Modifier and Type | Field | Description
static java.lang.String | DELETE_FILE_THRESHOLD | The minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting.
static int | DELETE_FILE_THRESHOLD_DEFAULT |
static java.lang.String | MAX_FILE_SIZE_BYTES | Adjusts files which will be considered for rewriting.
static double | MAX_FILE_SIZE_DEFAULT_RATIO |
static java.lang.String | MIN_FILE_SIZE_BYTES | Adjusts files which will be considered for rewriting.
static double | MIN_FILE_SIZE_DEFAULT_RATIO |
static java.lang.String | MIN_INPUT_FILES | The minimum number of files that need to be in a file group for it to be considered for compaction if the total size of that group is less than the RewriteDataFiles.TARGET_FILE_SIZE_BYTES.
static int | MIN_INPUT_FILES_DEFAULT |
static java.lang.String | REWRITE_ALL | Rewrites all files, regardless of their size.
static boolean | REWRITE_ALL_DEFAULT |
-
Constructor Summary
Constructors
Constructor | Description
BinPackStrategy() |
-
Method Summary
Modifier and Type | Method | Description
protected long | inputFileSize(java.util.List<FileScanTask> fileToRewrite) |
java.lang.String | name() | Returns the name of this rewrite strategy.
protected long | numOutputFiles(long totalSizeInBytes) | Determine how many output files to create when rewriting.
RewriteStrategy | options(java.util.Map<java.lang.String,java.lang.String> options) | Sets options to be used with this strategy.
java.lang.Iterable<java.util.List<FileScanTask>> | planFileGroups(java.lang.Iterable<FileScanTask> dataFiles) | Groups file scans into lists which will be processed in a single executable unit.
java.lang.Iterable<FileScanTask> | selectFilesToRewrite(java.lang.Iterable<FileScanTask> dataFiles) | Selects files which this strategy believes are valid targets to be rewritten.
protected long | splitSize(long totalSizeInBytes) | Returns the smaller of our max write file threshold and our estimated split size based on the number of output files we want to generate.
protected long | targetFileSize() |
java.util.Set<java.lang.String> | validOptions() | Returns a set of options which this rewrite strategy can use.
protected long | writeMaxFileSize() | Estimates a larger max target file size than our target size used in task creation to avoid tasks which are predicted to have a certain size, but exceed that target size when serde is complete, creating tiny remainder files.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.iceberg.actions.RewriteStrategy
rewriteFiles, table
-
-
-
-
Field Detail
-
MIN_INPUT_FILES
public static final java.lang.String MIN_INPUT_FILES
The minimum number of files that need to be in a file group for it to be considered for compaction if the total size of that group is less than the RewriteDataFiles.TARGET_FILE_SIZE_BYTES. This can also be thought of as the maximum number of non-target-size files that could remain in a file group (partition) after rewriting.
- See Also:
- Constant Field Values
-
MIN_INPUT_FILES_DEFAULT
public static final int MIN_INPUT_FILES_DEFAULT
- See Also:
- Constant Field Values
-
MIN_FILE_SIZE_BYTES
public static final java.lang.String MIN_FILE_SIZE_BYTES
Adjusts files which will be considered for rewriting. Files smaller than MIN_FILE_SIZE_BYTES will be considered for rewriting. This functions independently of MAX_FILE_SIZE_BYTES.
Defaults to 75% of the target file size.
- See Also:
- Constant Field Values
-
MIN_FILE_SIZE_DEFAULT_RATIO
public static final double MIN_FILE_SIZE_DEFAULT_RATIO
- See Also:
- Constant Field Values
-
MAX_FILE_SIZE_BYTES
public static final java.lang.String MAX_FILE_SIZE_BYTES
Adjusts files which will be considered for rewriting. Files larger than MAX_FILE_SIZE_BYTES will be considered for rewriting. This functions independently of MIN_FILE_SIZE_BYTES.
Defaults to 180% of the target file size.
- See Also:
- Constant Field Values
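The default thresholds described for MIN_FILE_SIZE_BYTES and MAX_FILE_SIZE_BYTES can be illustrated with a small sketch. It assumes only the 75% and 180% ratios quoted in the field descriptions; the class and method names are hypothetical and are not part of the Iceberg API.

```java
// Hypothetical helper, not part of org.apache.iceberg: derives the default
// size thresholds from a target file size using the documented ratios.
final class DefaultSizeThresholdsSketch {
  // Ratios taken from the field descriptions (75% and 180% of the target size).
  static final double MIN_RATIO = 0.75;
  static final double MAX_RATIO = 1.80;

  // Files smaller than this would be candidates for rewriting.
  static long defaultMinFileSize(long targetFileSizeBytes) {
    return (long) (targetFileSizeBytes * MIN_RATIO);
  }

  // Files larger than this would be candidates for rewriting.
  static long defaultMaxFileSize(long targetFileSizeBytes) {
    return (long) (targetFileSizeBytes * MAX_RATIO);
  }
}
```

Under these assumptions, a 512 MB target gives a lower threshold of 384 MB and an upper threshold of roughly 921 MB.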
-
MAX_FILE_SIZE_DEFAULT_RATIO
public static final double MAX_FILE_SIZE_DEFAULT_RATIO
- See Also:
- Constant Field Values
-
DELETE_FILE_THRESHOLD
public static final java.lang.String DELETE_FILE_THRESHOLD
The minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting. If a data file has this number of deletes or more, it will be rewritten regardless of its file size determined by MIN_FILE_SIZE_BYTES and MAX_FILE_SIZE_BYTES. If a file group contains a file that satisfies this condition, the file group will be rewritten regardless of the number of files in the file group determined by MIN_INPUT_FILES.
Defaults to Integer.MAX_VALUE, which means this feature is not enabled by default.
- See Also:
- Constant Field Values
-
DELETE_FILE_THRESHOLD_DEFAULT
public static final int DELETE_FILE_THRESHOLD_DEFAULT
- See Also:
- Constant Field Values
-
REWRITE_ALL
public static final java.lang.String REWRITE_ALL
Rewrites all files, regardless of their size. Defaults to false, rewriting only mis-sized files.
- See Also:
- Constant Field Values
-
REWRITE_ALL_DEFAULT
public static final boolean REWRITE_ALL_DEFAULT
- See Also:
- Constant Field Values
-
-
Method Detail
-
name
public java.lang.String name()
Description copied from interface: RewriteStrategy
Returns the name of this rewrite strategy.
- Specified by:
name in interface RewriteStrategy
-
validOptions
public java.util.Set<java.lang.String> validOptions()
Description copied from interface: RewriteStrategy
Returns a set of options which this rewrite strategy can use. This is an allowed-list and any options not specified here will be rejected at runtime.
- Specified by:
validOptions in interface RewriteStrategy
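The allowed-list behaviour can be pictured with the sketch below. It is only an illustration of rejecting unknown keys; the class name, method name, and exception type are assumptions and do not describe how the library performs the check.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of org.apache.iceberg: rejects any option
// key that is missing from the allowed-list.
final class OptionValidationSketch {
  static void validate(Map<String, String> options, Set<String> validOptions) {
    // Collect every supplied key that the strategy does not recognise.
    Set<String> invalid = new HashSet<>(options.keySet());
    invalid.removeAll(validOptions);
    if (!invalid.isEmpty()) {
      // The exception type is an assumption made for this sketch.
      throw new IllegalArgumentException("Unsupported options: " + invalid);
    }
  }
}
```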
-
options
public RewriteStrategy options(java.util.Map<java.lang.String,java.lang.String> options)
Description copied from interface: RewriteStrategy
Sets options to be used with this strategy.
- Specified by:
options in interface RewriteStrategy
-
selectFilesToRewrite
public java.lang.Iterable<FileScanTask> selectFilesToRewrite(java.lang.Iterable<FileScanTask> dataFiles)
Description copied from interface: RewriteStrategy
Selects files which this strategy believes are valid targets to be rewritten.
- Specified by:
selectFilesToRewrite in interface RewriteStrategy
- Parameters:
dataFiles - iterable of FileScanTasks for files in a given partition
- Returns:
- iterable containing only FileScanTasks to be rewritten
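For this strategy, the selection rule described in the field documentation (size thresholds, delete threshold, and the rewrite-all switch) might be sketched as a single predicate. The names below are hypothetical and the method works on plain numbers instead of FileScanTask objects, so this is an illustration of the documented rule, not the library's code.

```java
// Hypothetical helper, not part of org.apache.iceberg: decides whether a
// single data file is a rewrite candidate under the documented rules.
final class FileSelectionSketch {
  static boolean shouldRewrite(
      long fileSizeBytes,
      int deleteCount,
      long minFileSizeBytes,
      long maxFileSizeBytes,
      int deleteFileThreshold,
      boolean rewriteAll) {
    if (rewriteAll) {
      return true; // REWRITE_ALL: every file is selected regardless of size
    }
    boolean tooSmall = fileSizeBytes < minFileSizeBytes;
    boolean tooLarge = fileSizeBytes > maxFileSizeBytes;
    // "this number of deletes or more" per the DELETE_FILE_THRESHOLD description
    boolean tooManyDeletes = deleteCount >= deleteFileThreshold;
    return tooSmall || tooLarge || tooManyDeletes;
  }
}
```

With the delete threshold left at Integer.MAX_VALUE, the last condition effectively never fires, which matches the note that the feature is disabled by default.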
-
planFileGroups
public java.lang.Iterable<java.util.List<FileScanTask>> planFileGroups(java.lang.Iterable<FileScanTask> dataFiles)
Description copied from interface: RewriteStrategy
Groups file scans into lists which will be processed in a single executable unit. Each group will end up being committed as an independent set of changes. This creates the jobs which will eventually be run by the underlying Action.
- Specified by:
planFileGroups in interface RewriteStrategy
- Parameters:
dataFiles - iterable of FileScanTasks to be rewritten
- Returns:
- iterable of lists of FileScanTasks which will be processed together
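The grouping and filtering described in the class summary can be approximated with the sketch below. It uses a simple sequential packing over file sizes, which is a stand-in chosen for brevity and is not the BinPacking utility the strategy uses; the class and method names are hypothetical. The group filter treats MIN_INPUT_FILES as an inclusive minimum and the target size as an inclusive bound, both of which are assumptions about the exact comparison.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of org.apache.iceberg.
final class FileGroupPlanningSketch {
  // Packs file sizes in order, starting a new group whenever adding the next
  // file would push the current group past maxGroupSizeBytes.
  static List<List<Long>> pack(List<Long> fileSizes, long maxGroupSizeBytes) {
    List<List<Long>> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentSize = 0;
    for (long size : fileSizes) {
      if (!current.isEmpty() && currentSize + size > maxGroupSizeBytes) {
        groups.add(current);
        current = new ArrayList<>();
        currentSize = 0;
      }
      current.add(size);
      currentSize += size;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }

  // A group is kept if it has enough input files or enough data to produce
  // at least one file of the target size (assumed inclusive comparisons).
  static boolean worthRewriting(List<Long> group, int minInputFiles, long targetFileSizeBytes) {
    long total = 0;
    for (long size : group) {
      total += size;
    }
    return group.size() >= minInputFiles || total >= targetFileSizeBytes;
  }
}
```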
-
targetFileSize
protected long targetFileSize()
-
numOutputFiles
protected long numOutputFiles(long totalSizeInBytes)
Determine how many output files to create when rewriting. We use this to determine the split size we want to use when actually writing files to avoid the following situation.
If we are writing 10.1 GB of data with a target file size of 1 GB, we would end up with 11 files, one of which would only have 0.1 GB. This would most likely be less preferable to 10 files, each of which was 1.01 GB. So here we decide whether to round up or round down based on what the estimated average file size will be if we ignore the remainder (0.1 GB). If the new file size is less than 10% greater than the target file size, then we will round down when determining the number of output files.
- Parameters:
totalSizeInBytes - total data size for a file group
- Returns:
- the number of files this strategy should create
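The rounding decision can be worked through with a short sketch. It implements only the rule stated in the description (round down when the average size without the remainder stays under 110% of the target) and takes the target size as a parameter; the class name is hypothetical and the library method may apply further checks not shown here.

```java
// Hypothetical helper, not part of org.apache.iceberg: a simplified version
// of the round-up / round-down decision described for numOutputFiles.
final class OutputFileCountSketch {
  static long numOutputFiles(long totalSizeInBytes, long targetFileSizeBytes) {
    if (totalSizeInBytes < targetFileSizeBytes) {
      return 1; // everything fits in a single file
    }
    long withoutRemainder = totalSizeInBytes / targetFileSizeBytes;
    long withRemainder =
        (totalSizeInBytes % targetFileSizeBytes == 0) ? withoutRemainder : withoutRemainder + 1;
    // Average file size if the remainder is spread across the full-size files.
    long avgWithoutRemainder = totalSizeInBytes / withoutRemainder;
    // Round down only if that average stays below 110% of the target.
    return avgWithoutRemainder < 1.1 * targetFileSizeBytes ? withoutRemainder : withRemainder;
  }
}
```

For the 10.1 GB example with a 1 GB target, the average without the remainder is about 1.01 GB, so this sketch rounds down to 10 files. For 1.5 GB of data the average would be 1.5 GB, so it keeps the remainder and returns 2.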
-
splitSize
protected long splitSize(long totalSizeInBytes)
Returns the smaller of our max write file threshold and our estimated split size based on the number of output files we want to generate. An overhead is added to the estimated split size to try to avoid small errors in size creating brand-new files.
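A minimal sketch of this calculation follows. The output file count, the overhead, and the max write size are passed in as parameters because their actual values are not stated here; the class and parameter names are hypothetical.

```java
// Hypothetical helper, not part of org.apache.iceberg: estimated split size
// plus a small overhead, capped at the max write file size.
final class SplitSizeSketch {
  static long splitSize(
      long totalSizeInBytes, long numOutputFiles, long overheadBytes, long writeMaxFileSizeBytes) {
    // Even share of the data per output file, padded so that small size
    // estimation errors do not spill into an extra file.
    long estimatedSplitSize = totalSizeInBytes / numOutputFiles + overheadBytes;
    return Math.min(estimatedSplitSize, writeMaxFileSizeBytes);
  }
}
```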
-
inputFileSize
protected long inputFileSize(java.util.List<FileScanTask> fileToRewrite)
-
writeMaxFileSize
protected long writeMaxFileSize()
Estimates a larger max target file size than our target size used in task creation to avoid tasks which are predicted to have a certain size, but exceed that target size when serde is complete, creating tiny remainder files.
While we create tasks that should all be smaller than our target size, there is a chance that the actual data will end up being larger than our target size due to compression, serialization, and other factors outside our control. If this occurs, instead of making a single file that is close in size to our target, we would end up producing one file of the target size and then a small extra file with the remaining data. For example, if our target is 512 MB, we may generate a rewrite task that should be 500 MB. When we write the data we may find we actually have to write out 530 MB. If we use the target size while writing, we would produce a 512 MB file and an 18 MB file. If instead we use a larger size estimated by this method, then we end up writing a single file.
- Returns:
- the target size plus one half of the distance between max and target
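The documented return value can be written out as a one-line formula. The sketch below takes the target and max sizes as parameters; the class name is hypothetical.

```java
// Hypothetical helper, not part of org.apache.iceberg: the target size plus
// half of the distance between the max file size and the target.
final class WriteMaxFileSizeSketch {
  static long writeMaxFileSize(long targetFileSizeBytes, long maxFileSizeBytes) {
    return targetFileSizeBytes + (maxFileSizeBytes - targetFileSizeBytes) / 2;
  }
}
```

Assuming the max size sits at the documented default of 180% of a 512 MB target, the result is about 716 MB, so the 530 MB task from the example above would be written as a single file.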
-
-