public abstract class SizeBasedFileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>> extends java.lang.Object implements FileRewriter<T,F>
If files are smaller than the MIN_FILE_SIZE_BYTES threshold or larger than the MAX_FILE_SIZE_BYTES threshold, they are considered targets for being rewritten. Once selected, files are grouped based on the bin-packing algorithm into groups of no more than MAX_FILE_GROUP_SIZE_BYTES. Groups will be actually rewritten if they contain more than MIN_INPUT_FILES or if they would produce at least one file of TARGET_FILE_SIZE_BYTES.
Note that implementations may add extra conditions for selecting files or filtering groups.
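The selection rules above can be sketched as follows. This is a minimal illustration that works on plain file sizes rather than scan tasks; the class `SizeBasedSelectionSketch` and the method `shouldRewrite` are hypothetical names, and the comparisons follow the wording of the description rather than the library source, so the real boundary handling may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the library.
final class SizeBasedSelectionSketch {

  // A file is a rewrite candidate when its size falls outside [minFileSize, maxFileSize].
  static boolean wronglySized(long fileSize, long minFileSize, long maxFileSize) {
    return fileSize < minFileSize || fileSize > maxFileSize;
  }

  // A group qualifies when it holds more than minInputFiles files, or when its
  // total size could fill at least one file of the target size.
  // Assumption: strict ">" and ">=" mirror the prose; the real checks may differ.
  static boolean shouldRewrite(List<Long> fileSizes, int minInputFiles, long targetFileSize) {
    long totalSize = 0L;
    for (long size : fileSizes) {
      totalSize += size;
    }
    boolean enoughInputFiles = fileSizes.size() > minInputFiles;
    boolean enoughContent = totalSize >= targetFileSize;
    return enoughInputFiles || enoughContent;
  }

  public static void main(String[] args) {
    // Sizes are in MB here purely to keep the numbers readable.
    System.out.println(wronglySized(100L, 384L, 921L));
    System.out.println(shouldRewrite(Arrays.asList(300L, 300L), 5, 512L));
  }
}
```

Subclasses can narrow either decision further through `filterFiles` and `filterFileGroups`, which this sketch leaves out.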
Modifier and Type | Field and Description |
---|---|
static java.lang.String | MAX_FILE_GROUP_SIZE_BYTES: This option controls the largest amount of data that should be rewritten in a single file group. |
static long | MAX_FILE_GROUP_SIZE_BYTES_DEFAULT |
static java.lang.String | MAX_FILE_SIZE_BYTES: Controls which files will be considered for rewriting. |
static double | MAX_FILE_SIZE_DEFAULT_RATIO |
static java.lang.String | MIN_FILE_SIZE_BYTES: Controls which files will be considered for rewriting. |
static double | MIN_FILE_SIZE_DEFAULT_RATIO |
static java.lang.String | MIN_INPUT_FILES: Any file group exceeding this number of files will be rewritten regardless of other criteria. |
static int | MIN_INPUT_FILES_DEFAULT |
static java.lang.String | REWRITE_ALL: Overrides other options and forces rewriting of all provided files. |
static boolean | REWRITE_ALL_DEFAULT |
static java.lang.String | TARGET_FILE_SIZE_BYTES: The target output file size that this file rewriter will attempt to generate. |
Modifier | Constructor and Description |
---|---|
protected | SizeBasedFileRewriter(Table table) |
Modifier and Type | Method and Description |
---|---|
protected abstract long | defaultTargetFileSize() |
protected boolean | enoughContent(java.util.List<T> group) |
protected boolean | enoughInputFiles(java.util.List<T> group) |
protected abstract java.lang.Iterable<java.util.List<T>> | filterFileGroups(java.util.List<java.util.List<T>> groups) |
protected abstract java.lang.Iterable<T> | filterFiles(java.lang.Iterable<T> tasks) |
void | init(java.util.Map<java.lang.String,java.lang.String> options): Initializes this rewriter using provided options. |
protected long | inputSize(java.util.List<T> group) |
protected long | numOutputFiles(long inputSize): Determines the preferable number of output files when rewriting a particular file group. |
java.lang.Iterable<java.util.List<T>> | planFileGroups(java.lang.Iterable<T> tasks): Selects files which this rewriter believes are valid targets to be rewritten based on their scan tasks and groups those scan tasks into file groups. |
protected long | splitSize(long inputSize): Calculates the split size to use in bin-packing rewrites. |
protected Table | table() |
protected boolean | tooMuchContent(java.util.List<T> group) |
java.util.Set<java.lang.String> | validOptions(): Returns a set of supported options for this rewriter. |
protected long | writeMaxFileSize(): Estimates a larger max target file size than the target size used in task creation to avoid creating tiny remainder files. |
protected boolean | wronglySized(T task) |
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface FileRewriter: description, rewrite
public static final java.lang.String TARGET_FILE_SIZE_BYTES
public static final java.lang.String MIN_FILE_SIZE_BYTES
Defaults to 75% of the target file size.
public static final double MIN_FILE_SIZE_DEFAULT_RATIO
public static final java.lang.String MAX_FILE_SIZE_BYTES
Defaults to 180% of the target file size.
public static final double MAX_FILE_SIZE_DEFAULT_RATIO
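As a worked example of the two defaults, the sketch below derives the thresholds from a target file size. It assumes the ratio constants hold 0.75 and 1.80, as implied by the percentages stated above; the class `DefaultThresholdsSketch` and its method names are hypothetical, and the truncation to `long` is an assumption about rounding.

```java
// Hypothetical helper for illustration only; not part of the library.
final class DefaultThresholdsSketch {

  // Assumed values, taken from the "75%" and "180%" stated in the field descriptions.
  static final double MIN_RATIO = 0.75;
  static final double MAX_RATIO = 1.80;

  // Default lower threshold: files smaller than this are rewrite candidates.
  static long defaultMinFileSize(long targetFileSize) {
    return (long) (targetFileSize * MIN_RATIO);
  }

  // Default upper threshold: files larger than this are rewrite candidates.
  static long defaultMaxFileSize(long targetFileSize) {
    return (long) (targetFileSize * MAX_RATIO);
  }

  public static void main(String[] args) {
    long target = 512L * 1024 * 1024; // 512 MB target
    System.out.println("min = " + defaultMinFileSize(target));
    System.out.println("max = " + defaultMaxFileSize(target));
  }
}
```

With a 512 MB target, this gives a lower threshold of 384 MB and an upper threshold of roughly 921.6 MB.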
public static final java.lang.String MIN_INPUT_FILES
public static final int MIN_INPUT_FILES_DEFAULT
public static final java.lang.String REWRITE_ALL
public static final boolean REWRITE_ALL_DEFAULT
public static final java.lang.String MAX_FILE_GROUP_SIZE_BYTES
public static final long MAX_FILE_GROUP_SIZE_BYTES_DEFAULT
protected SizeBasedFileRewriter(Table table)
protected abstract long defaultTargetFileSize()
protected abstract java.lang.Iterable<java.util.List<T>> filterFileGroups(java.util.List<java.util.List<T>> groups)
protected Table table()
public java.util.Set<java.lang.String> validOptions()
Specified by: validOptions in interface FileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>>
public void init(java.util.Map<java.lang.String,java.lang.String> options)
Specified by: init in interface FileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>>
Parameters: options - options to initialize this rewriter
protected boolean wronglySized(T task)
public java.lang.Iterable<java.util.List<T>> planFileGroups(java.lang.Iterable<T> tasks)
Specified by: planFileGroups in interface FileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>>
Parameters: tasks - an iterable of scan tasks for files in a partition
protected boolean enoughInputFiles(java.util.List<T> group)
protected boolean enoughContent(java.util.List<T> group)
protected boolean tooMuchContent(java.util.List<T> group)
protected long inputSize(java.util.List<T> group)
protected long splitSize(long inputSize)
This method determines the target split size as the input size divided by the desired number of output files. The final split size is adjusted to be at least as big as the target file size but less than the max write file size.
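The clamping described above can be sketched as below. This is an illustration under stated assumptions, not the library's code: the number of output files and the max write size are passed in as plain arguments, the class `SplitSizeSketch` is a hypothetical name, and any extra per-split overhead the real method may add is omitted.

```java
// Hypothetical helper for illustration only; not part of the library.
final class SplitSizeSketch {

  // Divides the input evenly across the desired number of output files, then
  // clamps the result to the range [targetFileSize, writeMaxFileSize].
  static long splitSize(
      long inputSize, long numOutputFiles, long targetFileSize, long writeMaxFileSize) {
    long estimatedSplitSize = inputSize / numOutputFiles;
    return Math.min(Math.max(estimatedSplitSize, targetFileSize), writeMaxFileSize);
  }

  public static void main(String[] args) {
    // Sizes in MB for readability: 5300 MB across 10 files, target 512, max write size 716.
    System.out.println(splitSize(5300L, 10L, 512L, 716L));
  }
}
```

Raising the split size to the target avoids planning many undersized splits, while the upper bound keeps a single split from exceeding what the writer will place in one file.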
protected long numOutputFiles(long inputSize)
If the rewriter is handling 10.1 GB of data with a target file size of 1 GB, it could produce 11 files, one of which would only have 0.1 GB. This would most likely be less preferable to 10 files with 1.01 GB each. So this method decides whether to round up or round down based on what the estimated average file size will be if the remainder (0.1 GB) is distributed amongst other files. If the new average file size is no more than 10% greater than the target file size, then this method will round down when determining the number of output files. Otherwise, the remainder will be written into a separate file.
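The rounding decision can be sketched as follows. This is a simplified reading of the description above, with the class `OutputFileCountSketch` as a hypothetical name; the real method may apply further checks (for example against the minimum file size or the max write size) that are not modeled here.

```java
// Hypothetical helper for illustration only; not part of the library.
final class OutputFileCountSketch {

  // Rounds down when spreading the remainder over the other files keeps the
  // average file size within 10% of the target; otherwise rounds up.
  static long numOutputFiles(long inputSize, long targetFileSize) {
    if (inputSize < targetFileSize) {
      return 1L; // everything fits in a single file
    }
    long countWithoutRemainder = inputSize / targetFileSize;
    if (inputSize % targetFileSize == 0) {
      return countWithoutRemainder; // no remainder to place
    }
    long avgSizeWithoutRemainder = inputSize / countWithoutRemainder;
    if (avgSizeWithoutRemainder <= 1.1 * targetFileSize) {
      return countWithoutRemainder; // absorb the remainder into the other files
    }
    return countWithoutRemainder + 1; // write the remainder to its own file
  }

  public static void main(String[] args) {
    long gb = 1L << 30;
    // About 10.1 GB of input with a 1 GB target, as in the example above.
    System.out.println(numOutputFiles(10 * gb + gb / 10, gb));
  }
}
```

For the 10.1 GB example the average after rounding down is about 1.01 GB, which is within 10% of the target, so the sketch returns 10 rather than 11.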
Parameters: inputSize - the total input size for a file group
protected long writeMaxFileSize()
While we create tasks that should all be smaller than our target size, the actual data may end up larger than the target because of factors outside our control, such as compression and serialization. If this occurs and we write with the target size as the limit, then instead of a single file close in size to our target we would produce one file of the target size and a small extra file with the remaining data.
For example, if our target is 512 MB, we may generate a rewrite task that should be 500 MB. When we write the data we may find we actually have to write out 530 MB. If we use the target size while writing, we would produce a 512 MB file and an 18 MB file. If instead we use a larger size estimated by this method, then we end up writing a single file.
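The example above can be checked with a short sketch. The text does not state how the larger size is estimated, so the formula here (halfway between the target and the max file size) is an assumption made for illustration, and the class `WriteMaxFileSizeSketch` and the method `filesWritten` are hypothetical names.

```java
// Hypothetical helper for illustration only; not part of the library.
final class WriteMaxFileSizeSketch {

  // Assumption: the write limit sits halfway between the target and max file sizes.
  static long writeMaxFileSize(long targetFileSize, long maxFileSize) {
    return (long) (targetFileSize + (maxFileSize - targetFileSize) * 0.5);
  }

  // Number of files produced when the writer rolls to a new file at rollSize bytes.
  static long filesWritten(long dataSize, long rollSize) {
    return (dataSize + rollSize - 1) / rollSize; // ceiling division
  }

  public static void main(String[] args) {
    long mb = 1L << 20;
    long target = 512 * mb;
    long max = (long) (target * 1.8); // default max of 180% of the target
    long actual = 530 * mb;           // the task was planned at 500 MB but grew

    System.out.println(filesWritten(actual, target));
    System.out.println(filesWritten(actual, writeMaxFileSize(target, max)));
  }
}
```

Under these assumptions, rolling at the 512 MB target splits the 530 MB of data into two files, while rolling at the larger estimate keeps it in one.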