public abstract class BinPackStrategy extends java.lang.Object implements RewriteStrategy
Files that are smaller than the MIN_FILE_SIZE_BYTES threshold or larger than the MAX_FILE_SIZE_BYTES threshold are considered targets for being rewritten.
Once selected, files are grouped by BinPacking into groups whose size is bounded by RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES. Groups will be considered for rewriting if they contain more files than MIN_INPUT_FILES or would produce at least one file of RewriteDataFiles.TARGET_FILE_SIZE_BYTES.
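The grouping and group-filtering steps described above can be sketched roughly as follows. This is a minimal illustration, not Iceberg's implementation: the class `GroupPlanningSketch` and its methods are hypothetical, files are represented only by their sizes, and a simple in-order packing stands in for the real BinPacking utility. The sketch also assumes a group qualifies once it holds at least `minInputFiles` files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of grouping selected files and filtering groups; not Iceberg code. */
final class GroupPlanningSketch {

  /** Packs file sizes, in order, into groups whose total stays at or below maxGroupSize. */
  static List<List<Long>> planGroups(List<Long> fileSizes, long maxGroupSize) {
    List<List<Long>> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentTotal = 0;
    for (long size : fileSizes) {
      // Close the current group when adding this file would exceed the cap.
      if (!current.isEmpty() && currentTotal + size > maxGroupSize) {
        groups.add(current);
        current = new ArrayList<>();
        currentTotal = 0;
      }
      current.add(size);
      currentTotal += size;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }

  /** Assumed rule: enough input files, or enough data to produce one target-size file. */
  static boolean shouldRewrite(List<Long> group, int minInputFiles, long targetFileSize) {
    long total = 0;
    for (long size : group) {
      total += size;
    }
    return group.size() >= minInputFiles || total >= targetFileSize;
  }

  public static void main(String[] args) {
    // Sizes are in arbitrary units; these files are assumed to be already selected.
    List<Long> selected = Arrays.asList(10L, 20L, 200L, 30L);
    for (List<Long> group : planGroups(selected, 220L)) {
      System.out.println(group + " rewrite=" + shouldRewrite(group, 2, 100L));
    }
  }
}
```

In this sketch a lone small file that ends up in its own group is left alone, because it neither meets the file-count rule nor holds enough data for a target-size file.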
Modifier and Type | Field and Description |
---|---|
static java.lang.String | DELETE_FILE_THRESHOLD: The minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting. |
static int | DELETE_FILE_THRESHOLD_DEFAULT |
static java.lang.String | MAX_FILE_SIZE_BYTES: Adjusts files which will be considered for rewriting. |
static double | MAX_FILE_SIZE_DEFAULT_RATIO |
static java.lang.String | MIN_FILE_SIZE_BYTES: Adjusts files which will be considered for rewriting. |
static double | MIN_FILE_SIZE_DEFAULT_RATIO |
static java.lang.String | MIN_INPUT_FILES: The minimum number of files that need to be in a file group for it to be considered for compaction if the total size of that group is less than the RewriteDataFiles.TARGET_FILE_SIZE_BYTES. |
static int | MIN_INPUT_FILES_DEFAULT |
Constructor and Description |
---|
BinPackStrategy() |
Modifier and Type | Method and Description |
---|---|
protected long | inputFileSize(java.util.List<FileScanTask> fileToRewrite) |
protected long | maxGroupSize() |
java.lang.String | name(): Returns the name of this rewrite strategy. |
protected long | numOutputFiles(long totalSizeInBytes): Determine how many output files to create when rewriting. |
RewriteStrategy | options(java.util.Map<java.lang.String,java.lang.String> options): Sets options to be used with this strategy. |
java.lang.Iterable<java.util.List<FileScanTask>> | planFileGroups(java.lang.Iterable<FileScanTask> dataFiles): Groups file scans into lists which will be processed in a single executable unit. |
java.lang.Iterable<FileScanTask> | selectFilesToRewrite(java.lang.Iterable<FileScanTask> dataFiles): Selects files which this strategy believes are valid targets to be rewritten. |
protected long | splitSize(long totalSizeInBytes): Returns the smaller of our max write file threshold and our estimated split size based on the number of output files we want to generate. |
protected long | targetFileSize() |
java.util.Set<java.lang.String> | validOptions(): Returns a set of options which this rewrite strategy can use. |
protected long | writeMaxFileSize(): Estimates a larger max target file size than the target size used in task creation, to avoid tasks that are predicted to have a certain size but exceed that target size when serde is complete, creating tiny remainder files. |
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface RewriteStrategy: rewriteFiles, table
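public static final java.lang.String MIN_INPUT_FILES
The minimum number of files that need to be in a file group for it to be considered for compaction if the total size of that group is less than the RewriteDataFiles.TARGET_FILE_SIZE_BYTES. This can also be thought of as the maximum number of non-target-size files that could remain in a file group (partition) after rewriting.
public static final int MIN_INPUT_FILES_DEFAULT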
public static final java.lang.String MIN_FILE_SIZE_BYTES
Adjusts files which will be considered for rewriting. Files smaller than MIN_FILE_SIZE_BYTES will be considered for rewriting. This functions independently of MAX_FILE_SIZE_BYTES.
Defaults to 75% of the target file size.
public static final double MIN_FILE_SIZE_DEFAULT_RATIO
public static final java.lang.String MAX_FILE_SIZE_BYTES
Adjusts files which will be considered for rewriting. Files larger than MAX_FILE_SIZE_BYTES will be considered for rewriting. This functions independently of MIN_FILE_SIZE_BYTES.
Defaults to 180% of the target file size.
public static final double MAX_FILE_SIZE_DEFAULT_RATIO
public static final java.lang.String DELETE_FILE_THRESHOLD
The minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting. A data file with at least this many deletes is rewritten regardless of its size relative to MIN_FILE_SIZE_BYTES and MAX_FILE_SIZE_BYTES.
If a file group contains a file that satisfies this condition, the file group will be rewritten regardless of the number of files in the file group determined by MIN_INPUT_FILES.
Defaults to Integer.MAX_VALUE, which means this feature is not enabled by default.
public static final int DELETE_FILE_THRESHOLD_DEFAULT
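As a rough illustration of how the size thresholds and the delete threshold combine for a single file, consider the sketch below. The class `FileSelectionSketch`, its method names, and the way the ratios are applied are assumptions made for illustration; only the 75% and 180% defaults and the Integer.MAX_VALUE default come from the field descriptions above.

```java
/** Hypothetical sketch of the per-file selection rule; not Iceberg's implementation. */
final class FileSelectionSketch {
  static final double MIN_RATIO = 0.75; // "75% of the target file size"
  static final double MAX_RATIO = 1.80; // "180% of the target file size"

  /** Assumed default lower bound, derived from the target file size. */
  static long defaultMinFileSize(long targetFileSize) {
    return (long) (targetFileSize * MIN_RATIO);
  }

  /** Assumed default upper bound, derived from the target file size. */
  static long defaultMaxFileSize(long targetFileSize) {
    return (long) (targetFileSize * MAX_RATIO);
  }

  /** A file is a candidate if it is too small, too large, or carries enough deletes. */
  static boolean isRewriteCandidate(
      long fileSize, int deleteCount, long minFileSize, long maxFileSize, int deleteFileThreshold) {
    boolean outsideSizeRange = fileSize < minFileSize || fileSize > maxFileSize;
    boolean tooManyDeletes = deleteCount >= deleteFileThreshold;
    return outsideSizeRange || tooManyDeletes;
  }

  public static void main(String[] args) {
    long target = 512L * 1024 * 1024; // 512 MB target, chosen for the example
    long min = defaultMinFileSize(target);
    long max = defaultMaxFileSize(target);
    // A well-sized file with no deletes is left alone under the default delete threshold.
    System.out.println(isRewriteCandidate(450_000_000L, 0, min, max, Integer.MAX_VALUE));
    // The same file becomes a candidate once its delete count reaches a configured threshold.
    System.out.println(isRewriteCandidate(450_000_000L, 5, min, max, 3));
  }
}
```

With the default threshold of Integer.MAX_VALUE the delete check effectively never fires, which matches the note that the feature is not enabled by default.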
public java.lang.String name()
Description copied from interface: RewriteStrategy
Returns the name of this rewrite strategy.
Specified by: name in interface RewriteStrategy
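public java.util.Set<java.lang.String> validOptions()
Description copied from interface: RewriteStrategy
Returns a set of options which this rewrite strategy can use.
Specified by: validOptions in interface RewriteStrategy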
public RewriteStrategy options(java.util.Map<java.lang.String,java.lang.String> options)
Description copied from interface: RewriteStrategy
Sets options to be used with this strategy.
Specified by: options in interface RewriteStrategy
public java.lang.Iterable<FileScanTask> selectFilesToRewrite(java.lang.Iterable<FileScanTask> dataFiles)
Description copied from interface: RewriteStrategy
Selects files which this strategy believes are valid targets to be rewritten.
Specified by: selectFilesToRewrite in interface RewriteStrategy
Parameters: dataFiles - iterable of FileScanTasks for files in a given partition
public java.lang.Iterable<java.util.List<FileScanTask>> planFileGroups(java.lang.Iterable<FileScanTask> dataFiles)
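Description copied from interface: RewriteStrategy
Groups file scans into lists which will be processed in a single executable unit.
Specified by: planFileGroups in interface RewriteStrategy
Parameters: dataFiles - iterable of FileScanTasks to be rewritten
protected long targetFileSize()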
protected long numOutputFiles(long totalSizeInBytes)
Determine how many output files to create when rewriting. If we are writing 10.1 GB of data with a target file size of 1 GB, we would end up with 11 files, one of which would hold only 0.1 GB. This would most likely be less preferable to 10 files of 1.01 GB each. So here we decide whether to round up or round down based on what the estimated average file size will be if we ignore the remainder (0.1 GB). If the new average file size is less than 10% greater than the target file size, we round down when determining the number of output files.
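The rounding rule described above can be sketched as follows. This is a simplified reading of the description only, not the library's method body: the class `OutputFileCountSketch` is hypothetical, the target file size is passed in as a parameter, and any additional checks the real method may perform are left out.

```java
/** Hypothetical sketch of the round-up or round-down decision; not Iceberg's implementation. */
final class OutputFileCountSketch {

  static long numOutputFiles(long totalSizeInBytes, long targetFileSize) {
    if (targetFileSize <= 0) {
      throw new IllegalArgumentException("targetFileSize must be positive");
    }
    if (totalSizeInBytes <= targetFileSize) {
      return 1; // everything fits in a single file
    }
    long roundedDown = totalSizeInBytes / targetFileSize;
    long remainder = totalSizeInBytes % targetFileSize;
    if (remainder == 0) {
      return roundedDown; // data divides evenly, nothing to decide
    }
    // Average file size if the remainder is spread over the rounded-down file count.
    long avgIfRoundedDown = totalSizeInBytes / roundedDown;
    // avg < 1.1 * target, written with integers to avoid floating point.
    boolean withinTenPercent = 10 * avgIfRoundedDown < 11 * targetFileSize;
    return withinTenPercent ? roundedDown : roundedDown + 1;
  }

  public static void main(String[] args) {
    long oneGb = 1_000_000_000L; // decimal GB keeps the arithmetic easy to follow
    // 10.1 GB with a 1 GB target: 10 files of about 1.01 GB each.
    System.out.println(numOutputFiles(10_100_000_000L, oneGb));
    // 1.5 GB with a 1 GB target: a single 1.5 GB file would be too far over target.
    System.out.println(numOutputFiles(1_500_000_000L, oneGb));
  }
}
```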
Parameters: totalSizeInBytes - total data size for a file group
protected long splitSize(long totalSizeInBytes)
protected long inputFileSize(java.util.List<FileScanTask> fileToRewrite)
protected long maxGroupSize()
protected long writeMaxFileSize()
While we create tasks that should all be smaller than our target size, there is a chance that the actual data will end up being larger than our target size due to compression, serialization, and other factors outside our control. If this occurs, instead of making a single file that is close in size to our target, we would end up producing one file of the target size and then a small extra file with the remaining data. For example, if our target is 512 MB, we may generate a rewrite task that should be 500 MB. When we write the data, we may find we actually have to write out 530 MB. If we use the target size while writing, we would produce a 512 MB file and an 18 MB file. If instead we use a larger size estimated by this method, then we end up writing a single file.
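The 512 MB example above can be reproduced with a small sketch of a writer that rolls over to a new file whenever a size limit is reached. The class `RolloverSketch` is hypothetical, sizes are in MB, and the 640 MB limit is an arbitrary stand-in for "a larger size"; how the real method derives its estimate is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of file rollover at a maximum file size; not Iceberg's implementation. */
final class RolloverSketch {

  /** Returns the sizes of the files written when rolling over at maxFileSize. */
  static List<Long> writtenFileSizes(long dataSize, long maxFileSize) {
    if (maxFileSize <= 0) {
      throw new IllegalArgumentException("maxFileSize must be positive");
    }
    List<Long> files = new ArrayList<>();
    long remaining = dataSize;
    while (remaining > 0) {
      long next = Math.min(remaining, maxFileSize);
      files.add(next);
      remaining -= next;
    }
    return files;
  }

  public static void main(String[] args) {
    // Rolling over at the 512 MB target leaves an 18 MB remainder file.
    System.out.println(writtenFileSizes(530L, 512L));
    // Rolling over at an assumed larger limit keeps all 530 MB in one file.
    System.out.println(writtenFileSizes(530L, 640L));
  }
}
```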