Class SizeBasedFileRewriter<T extends ContentScanTask<F>, F extends ContentFile<F>>
- All Implemented Interfaces:
FileRewriter<T, F>
- Direct Known Subclasses:
SizeBasedDataRewriter, SizeBasedPositionDeletesRewriter
If files are smaller than the MIN_FILE_SIZE_BYTES threshold or larger than the MAX_FILE_SIZE_BYTES threshold, they are considered targets for rewriting. Once selected, files are grouped using a bin-packing algorithm into groups of no more than MAX_FILE_GROUP_SIZE_BYTES. A group is actually rewritten if it contains more than MIN_INPUT_FILES files or if it would produce at least one file of TARGET_FILE_SIZE_BYTES.
Note that implementations may add extra conditions for selecting files or filtering groups.
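The selection and filtering rules described above can be sketched in a few lines. This is a minimal illustration that follows the wording of the description, not the Iceberg implementation: the class `SizeBasedSelectionSketch`, its constructor, and the method `shouldRewrite` are hypothetical, and files are represented only by their sizes in bytes.

```java
import java.util.List;

// Illustrative sketch only; class and method names are hypothetical.
final class SizeBasedSelectionSketch {
  private final long minFileSize;
  private final long maxFileSize;
  private final long targetFileSize;
  private final int minInputFiles;

  SizeBasedSelectionSketch(
      long minFileSize, long maxFileSize, long targetFileSize, int minInputFiles) {
    this.minFileSize = minFileSize;
    this.maxFileSize = maxFileSize;
    this.targetFileSize = targetFileSize;
    this.minInputFiles = minInputFiles;
  }

  // A file is a rewrite candidate when it falls outside [minFileSize, maxFileSize].
  boolean wronglySized(long fileSize) {
    return fileSize < minFileSize || fileSize > maxFileSize;
  }

  // A group is rewritten when it has more than minInputFiles files, or when
  // its total size is enough to fill at least one target-sized file.
  boolean shouldRewrite(List<Long> groupFileSizes) {
    long total = 0;
    for (long size : groupFileSizes) {
      total += size;
    }
    return groupFileSizes.size() > minInputFiles || total >= targetFileSize;
  }
}
```

Concrete subclasses may apply additional conditions on top of these two checks, as the note above says.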
-
Field Summary
Modifier and Type | Field | Description
static final String | MAX_FILE_GROUP_SIZE_BYTES | This option controls the largest amount of data that should be rewritten in a single file group.
static final long | MAX_FILE_GROUP_SIZE_BYTES_DEFAULT |
static final String | MAX_FILE_SIZE_BYTES | Controls which files will be considered for rewriting.
static final double | MAX_FILE_SIZE_DEFAULT_RATIO |
static final String | MIN_FILE_SIZE_BYTES | Controls which files will be considered for rewriting.
static final double | MIN_FILE_SIZE_DEFAULT_RATIO |
static final String | MIN_INPUT_FILES | Any file group exceeding this number of files will be rewritten regardless of other criteria.
static final int | MIN_INPUT_FILES_DEFAULT |
static final String | REWRITE_ALL | Overrides other options and forces rewriting of all provided files.
static final boolean | REWRITE_ALL_DEFAULT |
static final String | TARGET_FILE_SIZE_BYTES | The target output file size that this file rewriter will attempt to generate.
-
Constructor Summary
-
Method Summary
Modifier and Type | Method | Description
protected abstract long | defaultTargetFileSize() |
protected boolean | enoughContent(List<T> group) |
protected boolean | enoughInputFiles(List<T> group) |
 | filterFileGroups(List<List<T>> groups) |
 | filterFiles(Iterable<T> tasks) |
void | init | Initializes this rewriter using provided options.
protected long | inputSize |
protected long | numOutputFiles(long inputSize) | Determines the preferable number of output files when rewriting a particular file group.
protected PartitionSpec | outputSpec |
protected int | outputSpecId() |
 | planFileGroups(Iterable<T> tasks) | Selects files which this rewriter believes are valid targets to be rewritten based on their scan tasks and groups those scan tasks into file groups.
protected long | splitSize(long inputSize) | Calculates the split size to use in bin-packing rewrites.
protected Table | table() |
protected boolean | tooMuchContent(List<T> group) |
 | validOptions | Returns a set of supported options for this rewriter.
protected long | writeMaxFileSize() | Estimates a larger max target file size than the target size used in task creation to avoid creating tiny remainder files.
protected boolean | wronglySized(T task) |
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.iceberg.actions.FileRewriter
description, rewrite
-
Field Details
-
TARGET_FILE_SIZE_BYTES
The target output file size that this file rewriter will attempt to generate.
-
MIN_FILE_SIZE_BYTES
Controls which files will be considered for rewriting. Files with sizes under this threshold will be considered for rewriting regardless of any other criteria.
Defaults to 75% of the target file size.
-
MIN_FILE_SIZE_DEFAULT_RATIO
public static final double MIN_FILE_SIZE_DEFAULT_RATIO
-
MAX_FILE_SIZE_BYTES
Controls which files will be considered for rewriting. Files with sizes above this threshold will be considered for rewriting regardless of any other criteria.
Defaults to 180% of the target file size.
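As a worked example of the two defaults (75% and 180% of the target file size), the sketch below derives both thresholds from a target size. The class `DefaultThresholdSketch`, its constant names, and the truncating cast are assumptions made for illustration; only the two percentages come from the descriptions above.

```java
// Illustrative sketch only; names and rounding behavior are assumptions.
final class DefaultThresholdSketch {
  static final double MIN_RATIO = 0.75; // "75% of the target file size"
  static final double MAX_RATIO = 1.80; // "180% of the target file size"

  // Files smaller than this are candidates for rewriting.
  static long defaultMinFileSize(long targetFileSize) {
    return (long) (targetFileSize * MIN_RATIO);
  }

  // Files larger than this are candidates for rewriting.
  static long defaultMaxFileSize(long targetFileSize) {
    return (long) (targetFileSize * MAX_RATIO);
  }

  public static void main(String[] args) {
    long target = 512L * 1024 * 1024; // 512 MB
    System.out.println("min = " + defaultMinFileSize(target));
    System.out.println("max = " + defaultMaxFileSize(target));
  }
}
```

With a 512 MB target, files under 384 MB or over roughly 921.6 MB would be selected under these defaults.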
-
MAX_FILE_SIZE_DEFAULT_RATIO
public static final double MAX_FILE_SIZE_DEFAULT_RATIO
-
MIN_INPUT_FILES
Any file group exceeding this number of files will be rewritten regardless of other criteria. This config ensures file groups that contain many files are compacted even if the total size of that group is less than the target file size. This can also be thought of as the maximum number of wrongly sized files that could remain in a partition after rewriting.
-
MIN_INPUT_FILES_DEFAULT
public static final int MIN_INPUT_FILES_DEFAULT
-
REWRITE_ALL
Overrides other options and forces rewriting of all provided files.
-
REWRITE_ALL_DEFAULT
public static final boolean REWRITE_ALL_DEFAULT
-
MAX_FILE_GROUP_SIZE_BYTES
This option controls the largest amount of data that should be rewritten in a single file group. It helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. For example, a sort-based rewrite may not scale to TB-sized partitions, and those partitions need to be worked on in small subsections to avoid exhaustion of resources.
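To illustrate how a group size cap breaks a large partition into smaller units of work, here is a simplified sequential packer. It is a sketch under stated assumptions rather than the bin-packing routine the library uses: the class `GroupPackingSketch` and the method `pack` are hypothetical, files are represented by their sizes, and a file larger than the cap is placed in a group of its own.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; a simplified stand-in for a bin-packing step.
final class GroupPackingSketch {
  // Packs file sizes, in order, into groups whose total stays within maxGroupSize.
  static List<List<Long>> pack(List<Long> fileSizes, long maxGroupSize) {
    List<List<Long>> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentSize = 0;
    for (long size : fileSizes) {
      // Close the current group when adding this file would exceed the cap.
      if (!current.isEmpty() && currentSize + size > maxGroupSize) {
        groups.add(current);
        current = new ArrayList<>();
        currentSize = 0;
      }
      current.add(size);
      currentSize += size;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }
}
```

Each resulting group can then be rewritten independently, which keeps the resources needed by any single rewrite bounded by the cap.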
-
MAX_FILE_GROUP_SIZE_BYTES_DEFAULT
public static final long MAX_FILE_GROUP_SIZE_BYTES_DEFAULT
-
-
Constructor Details
-
SizeBasedFileRewriter
-
-
Method Details
-
defaultTargetFileSize
protected abstract long defaultTargetFileSize()
-
filterFiles
-
filterFileGroups
-
table
-
validOptions
Description copied from interface: FileRewriter
Returns a set of supported options for this rewriter. Only options specified in this set will be accepted at runtime. Any other options will be rejected.
- Specified by:
validOptions in interface FileRewriter<T extends ContentScanTask<F>, F extends ContentFile<F>>
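The accept-or-reject behavior described here can be sketched as a simple membership check. Everything in this block is hypothetical: the class `OptionValidationSketch`, the method `validate`, the exception type, and in particular the key strings, which are placeholders because this page does not list the values of the option constants.

```java
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; the option keys below are placeholders.
final class OptionValidationSketch {
  static final Set<String> VALID_OPTIONS =
      Set.of("target-file-size-bytes", "min-input-files", "rewrite-all");

  // Rejects any option whose key is not in the supported set.
  static void validate(Map<String, String> options) {
    for (String key : options.keySet()) {
      if (!VALID_OPTIONS.contains(key)) {
        throw new IllegalArgumentException("Unsupported option: " + key);
      }
    }
  }
}
```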
-
init
Description copied from interface: FileRewriter
Initializes this rewriter using provided options.
- Specified by:
init in interface FileRewriter<T extends ContentScanTask<F>, F extends ContentFile<F>>
- Parameters:
options - options to initialize this rewriter
-
wronglySized
-
planFileGroups
Description copied from interface: FileRewriter
Selects files which this rewriter believes are valid targets to be rewritten based on their scan tasks and groups those scan tasks into file groups. The file groups are then rewritten in a single executable unit, such as a Spark job.
- Specified by:
planFileGroups in interface FileRewriter<T extends ContentScanTask<F>, F extends ContentFile<F>>
- Parameters:
tasks - an iterable of scan tasks for files in a partition
- Returns:
groups of scan tasks for files to be rewritten in a single executable unit
-
enoughInputFiles
-
enoughContent
-
tooMuchContent
-
inputSize
-
splitSize
protected long splitSize(long inputSize)
Calculates the split size to use in bin-packing rewrites.
This method determines the target split size as the input size divided by the desired number of output files. The final split size is adjusted to be at least as big as the target file size but less than the max write file size.
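The calculation described for splitSize amounts to a division followed by a clamp. The sketch below assumes the number of output files, the target file size, and the max write file size are passed in explicitly; the class `SplitSizeSketch` is hypothetical, and the upper bound is treated as inclusive for simplicity. Any extra per-split overhead the real method may add is not modeled.

```java
// Illustrative sketch only; parameters are passed explicitly for clarity.
final class SplitSizeSketch {
  static long splitSize(
      long inputSize, long numOutputFiles, long targetFileSize, long writeMaxFileSize) {
    // Target split size: input size divided by the desired number of output files.
    long estimated = inputSize / numOutputFiles;
    // Clamp to the range [targetFileSize, writeMaxFileSize].
    return Math.min(Math.max(estimated, targetFileSize), writeMaxFileSize);
  }
}
```

For example, 10,100 units of input split across 10 files with a target of 1,000 and a write maximum of 1,400 gives a split size of 1,010.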
-
numOutputFiles
protected long numOutputFiles(long inputSize)
Determines the preferable number of output files when rewriting a particular file group.
If the rewriter is handling 10.1 GB of data with a target file size of 1 GB, it could produce 11 files, one of which would only have 0.1 GB. This would most likely be less preferable to 10 files with 1.01 GB each. So this method decides whether to round up or round down based on what the estimated average file size will be if the remainder (0.1 GB) is distributed amongst other files. If the new average file size is no more than 10% greater than the target file size, then this method will round down when determining the number of output files. Otherwise, the remainder will be written into a separate file.
- Parameters:
inputSize - a total input size for a file group
- Returns:
the number of files this rewriter should create
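The rounding rule in this description can be written out as follows. This is a simplified sketch of the 10% rule only, with a hypothetical class `NumOutputFilesSketch` that takes the target file size as a parameter; the actual method may apply further checks that this page does not spell out.

```java
// Illustrative sketch only; implements just the 10% rounding rule from the text.
final class NumOutputFilesSketch {
  static long numOutputFiles(long inputSize, long targetFileSize) {
    if (inputSize <= targetFileSize) {
      return 1;
    }
    long roundedDown = inputSize / targetFileSize;
    if (inputSize % targetFileSize == 0) {
      return roundedDown;
    }
    // Average size if the remainder is spread over roundedDown files is
    // inputSize / roundedDown. Round down when that average is at most
    // 1.1 * targetFileSize; integer form avoids floating-point error
    // (it can overflow for inputs near Long.MAX_VALUE / 10).
    boolean withinTenPercent = inputSize * 10 <= roundedDown * targetFileSize * 11;
    return withinTenPercent ? roundedDown : roundedDown + 1;
  }
}
```

Using the numbers from the description scaled to MB, an input of 10,100 with a target of 1,000 yields 10 files averaging 1,010 each, while an input of 1,500 yields 2 files because a single 1,500 file would exceed the target by more than 10%.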
-
writeMaxFileSize
protected long writeMaxFileSize()
Estimates a larger max target file size than the target size used in task creation to avoid creating tiny remainder files.
While we create tasks that should all be smaller than our target size, there is a chance that the actual data will end up being larger than our target size due to factors such as compression and serialization, which are outside our control. If this occurs, instead of making a single file that is close in size to our target, we would end up producing one file of the target size, and then a small extra file with the remaining data.
For example, if our target is 512 MB, we may generate a rewrite task that should be 500 MB. When we write the data we may find we actually have to write out 530 MB. If we use the target size while writing, we would produce a 512 MB file and an 18 MB file. If instead we use a larger size estimated by this method, then we end up writing a single file.
- Returns:
the target size plus one half of the distance between max and target
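The returned value is the midpoint between the target and max file sizes. A minimal sketch, assuming both sizes are passed in explicitly and using integer division; the class `WriteMaxFileSizeSketch` and the helper `fitsInSingleFile` are hypothetical.

```java
// Illustrative sketch only; sizes are passed explicitly.
final class WriteMaxFileSizeSketch {
  // The target size plus one half of the distance between max and target.
  static long writeMaxFileSize(long targetFileSize, long maxFileSize) {
    return targetFileSize + (maxFileSize - targetFileSize) / 2;
  }

  // True when the written data still fits in a single file under the relaxed limit.
  static boolean fitsInSingleFile(long writtenSize, long targetFileSize, long maxFileSize) {
    return writtenSize <= writeMaxFileSize(targetFileSize, maxFileSize);
  }
}
```

With a 512 MB target and a max of about 920 MB, the relaxed limit is about 716 MB, so the 530 MB write in the example above lands in one file instead of a 512 MB file plus an 18 MB remainder.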
-
outputSpec
-
outputSpecId
protected int outputSpecId()
-