Class SizeBasedFileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>>
- java.lang.Object
-
- org.apache.iceberg.actions.SizeBasedFileRewriter<T,F>
-
- All Implemented Interfaces:
FileRewriter<T,F>
- Direct Known Subclasses:
SizeBasedDataRewriter, SizeBasedPositionDeletesRewriter
public abstract class SizeBasedFileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>> extends java.lang.Object implements FileRewriter<T,F>
A file rewriter that determines which files to rewrite based on their size.
If files are smaller than the MIN_FILE_SIZE_BYTES threshold or larger than the MAX_FILE_SIZE_BYTES threshold, they are considered targets for being rewritten.
Once selected, files are grouped based on the bin-packing algorithm into groups of no more than MAX_FILE_GROUP_SIZE_BYTES. Groups are actually rewritten only if they contain more than MIN_INPUT_FILES files or if they would produce at least one file of TARGET_FILE_SIZE_BYTES.
Note that implementations may add extra conditions for selecting files or filtering groups.
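The selection rules above can be sketched in plain Java. This is a simplified illustration, not the Iceberg implementation: files are modeled only by their sizes in bytes, and the class `SizeSelectionSketch`, its fields, and its methods are hypothetical names. The comparisons follow the wording above; whether the real checks are strict or inclusive is an assumption.

```java
import java.util.List;

// Hypothetical sketch: files are represented by their sizes in bytes only.
final class SizeSelectionSketch {
  private final long minFileSize;
  private final long maxFileSize;
  private final long targetFileSize;
  private final int minInputFiles;

  SizeSelectionSketch(long minFileSize, long maxFileSize, long targetFileSize, int minInputFiles) {
    this.minFileSize = minFileSize;
    this.maxFileSize = maxFileSize;
    this.targetFileSize = targetFileSize;
    this.minInputFiles = minInputFiles;
  }

  // A file is a rewrite candidate if it is smaller than the minimum or larger than the maximum.
  boolean wronglySized(long fileSize) {
    return fileSize < minFileSize || fileSize > maxFileSize;
  }

  // A group is rewritten if it has more than minInputFiles files (as worded above)
  // or holds enough bytes to produce at least one file of the target size.
  boolean shouldRewrite(List<Long> group) {
    long totalSize = 0L;
    for (long size : group) {
      totalSize += size;
    }
    return group.size() > minInputFiles || totalSize >= targetFileSize;
  }
}
```

Concrete subclasses may layer extra conditions on top of these two checks, which is why the sketch keeps them as separate methods.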
-
-
Field Summary
Fields (Modifier and Type, Field, Description)
static java.lang.String MAX_FILE_GROUP_SIZE_BYTES
This option controls the largest amount of data that should be rewritten in a single file group.
static long MAX_FILE_GROUP_SIZE_BYTES_DEFAULT
static java.lang.String MAX_FILE_SIZE_BYTES
Controls which files will be considered for rewriting.
static double MAX_FILE_SIZE_DEFAULT_RATIO
static java.lang.String MIN_FILE_SIZE_BYTES
Controls which files will be considered for rewriting.
static double MIN_FILE_SIZE_DEFAULT_RATIO
static java.lang.String MIN_INPUT_FILES
Any file group exceeding this number of files will be rewritten regardless of other criteria.
static int MIN_INPUT_FILES_DEFAULT
static java.lang.String REWRITE_ALL
Overrides other options and forces rewriting of all provided files.
static boolean REWRITE_ALL_DEFAULT
static java.lang.String TARGET_FILE_SIZE_BYTES
The target output file size that this file rewriter will attempt to generate.
-
Constructor Summary
Constructors (Modifier, Constructor, Description)
protected SizeBasedFileRewriter(Table table)
-
Method Summary
Methods (Modifier and Type, Method, Description)
protected abstract long defaultTargetFileSize()
protected boolean enoughContent(java.util.List<T> group)
protected boolean enoughInputFiles(java.util.List<T> group)
protected abstract java.lang.Iterable<java.util.List<T>> filterFileGroups(java.util.List<java.util.List<T>> groups)
protected abstract java.lang.Iterable<T> filterFiles(java.lang.Iterable<T> tasks)
void init(java.util.Map<java.lang.String,java.lang.String> options)
Initializes this rewriter using provided options.
protected long inputSize(java.util.List<T> group)
protected long numOutputFiles(long inputSize)
Determines the preferable number of output files when rewriting a particular file group.
java.lang.Iterable<java.util.List<T>> planFileGroups(java.lang.Iterable<T> tasks)
Selects files which this rewriter believes are valid targets to be rewritten based on their scan tasks and groups those scan tasks into file groups.
protected long splitSize(long inputSize)
Returns the smallest of our max write file threshold and our estimated split size based on the number of output files we want to generate.
protected Table table()
protected boolean tooMuchContent(java.util.List<T> group)
java.util.Set<java.lang.String> validOptions()
Returns a set of supported options for this rewriter.
protected long writeMaxFileSize()
Estimates a larger max target file size than the target size used in task creation to avoid creating tiny remainder files.
protected boolean wronglySized(T task)
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.iceberg.actions.FileRewriter
description, rewrite
-
-
-
-
Field Detail
-
TARGET_FILE_SIZE_BYTES
public static final java.lang.String TARGET_FILE_SIZE_BYTES
The target output file size that this file rewriter will attempt to generate.
- See Also:
- Constant Field Values
-
MIN_FILE_SIZE_BYTES
public static final java.lang.String MIN_FILE_SIZE_BYTES
Controls which files will be considered for rewriting. Files with sizes under this threshold will be considered for rewriting regardless of any other criteria.
Defaults to 75% of the target file size.
- See Also:
- Constant Field Values
-
MIN_FILE_SIZE_DEFAULT_RATIO
public static final double MIN_FILE_SIZE_DEFAULT_RATIO
- See Also:
- Constant Field Values
-
MAX_FILE_SIZE_BYTES
public static final java.lang.String MAX_FILE_SIZE_BYTES
Controls which files will be considered for rewriting. Files with sizes above this threshold will be considered for rewriting regardless of any other criteria.
Defaults to 180% of the target file size.
- See Also:
- Constant Field Values
-
MAX_FILE_SIZE_DEFAULT_RATIO
public static final double MAX_FILE_SIZE_DEFAULT_RATIO
- See Also:
- Constant Field Values
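As a worked example of the defaults described above, the two thresholds can be derived from the target size using the 75% and 180% ratios. The sketch hard-codes those ratios from the field descriptions; the class and method names are hypothetical, and truncating the result to `long` is an assumption about rounding.

```java
// Hypothetical helper that derives default size thresholds from a target file size.
final class DefaultThresholdsSketch {
  // Ratios taken from the field descriptions: 75% and 180% of the target size.
  static final double MIN_RATIO = 0.75;
  static final double MAX_RATIO = 1.80;

  static long defaultMinFileSize(long targetFileSize) {
    return (long) (targetFileSize * MIN_RATIO);
  }

  static long defaultMaxFileSize(long targetFileSize) {
    return (long) (targetFileSize * MAX_RATIO);
  }

  public static void main(String[] args) {
    long target = 512L * 1024 * 1024; // a 512 MB target, chosen for illustration
    System.out.println("min threshold bytes = " + defaultMinFileSize(target));
    System.out.println("max threshold bytes = " + defaultMaxFileSize(target));
  }
}
```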
-
MIN_INPUT_FILES
public static final java.lang.String MIN_INPUT_FILES
Any file group exceeding this number of files will be rewritten regardless of other criteria. This config ensures file groups that contain many files are compacted even if the total size of that group is less than the target file size. This can also be thought of as the maximum number of wrongly sized files that could remain in a partition after rewriting.
- See Also:
- Constant Field Values
-
MIN_INPUT_FILES_DEFAULT
public static final int MIN_INPUT_FILES_DEFAULT
- See Also:
- Constant Field Values
-
REWRITE_ALL
public static final java.lang.String REWRITE_ALL
Overrides other options and forces rewriting of all provided files.
- See Also:
- Constant Field Values
-
REWRITE_ALL_DEFAULT
public static final boolean REWRITE_ALL_DEFAULT
- See Also:
- Constant Field Values
-
MAX_FILE_GROUP_SIZE_BYTES
public static final java.lang.String MAX_FILE_GROUP_SIZE_BYTES
This option controls the largest amount of data that should be rewritten in a single file group. It helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. For example, a sort-based rewrite may not scale to TB-sized partitions, and those partitions need to be worked on in small subsections to avoid exhaustion of resources.
- See Also:
- Constant Field Values
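To illustrate how a group size cap breaks a large partition into smaller units of work, here is a simplified greedy packing of file sizes into groups. It is only a sketch: the bin-packing algorithm actually used by the rewriter may order and place files differently, and the names `GroupPackingSketch` and `pack` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class GroupPackingSketch {
  // Packs file sizes, in order, into groups whose total size stays within maxGroupSize.
  // A single file larger than maxGroupSize ends up alone in its own group.
  static List<List<Long>> pack(List<Long> fileSizes, long maxGroupSize) {
    List<List<Long>> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentSize = 0L;
    for (long size : fileSizes) {
      // Close the current group when adding this file would exceed the cap.
      if (!current.isEmpty() && currentSize + size > maxGroupSize) {
        groups.add(current);
        current = new ArrayList<>();
        currentSize = 0L;
      }
      current.add(size);
      currentSize += size;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }
}
```

Each resulting group can then be rewritten independently, which keeps the resources needed by any one rewrite bounded by the cap.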
-
MAX_FILE_GROUP_SIZE_BYTES_DEFAULT
public static final long MAX_FILE_GROUP_SIZE_BYTES_DEFAULT
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
SizeBasedFileRewriter
protected SizeBasedFileRewriter(Table table)
-
-
Method Detail
-
defaultTargetFileSize
protected abstract long defaultTargetFileSize()
-
filterFileGroups
protected abstract java.lang.Iterable<java.util.List<T>> filterFileGroups(java.util.List<java.util.List<T>> groups)
-
table
protected Table table()
-
validOptions
public java.util.Set<java.lang.String> validOptions()
Description copied from interface: FileRewriter
Returns a set of supported options for this rewriter. Only options specified in this list will be accepted at runtime. Any other options will be rejected.
- Specified by:
validOptions
in interface FileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>>
-
init
public void init(java.util.Map<java.lang.String,java.lang.String> options)
Description copied from interface: FileRewriter
Initializes this rewriter using provided options.
- Specified by:
init
in interface FileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>>
- Parameters:
options
- options to initialize this rewriter
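A sketch of how an options map might be prepared and checked against the supported set before being handed to `init`. Everything here is illustrative: the key strings are assumed values for the corresponding constants (the Constant Field Values page is the authority), and `checkOptions` is a hypothetical stand-in for the rejection of unsupported options, not a method of this class.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class RewriterOptionsSketch {
  // Assumed string values of TARGET_FILE_SIZE_BYTES, MIN_INPUT_FILES and REWRITE_ALL.
  static final String TARGET_FILE_SIZE_BYTES = "target-file-size-bytes";
  static final String MIN_INPUT_FILES = "min-input-files";
  static final String REWRITE_ALL = "rewrite-all";

  // Throws if any provided option is not in the supported set.
  static void checkOptions(Map<String, String> options, Set<String> validOptions) {
    for (String key : options.keySet()) {
      if (!validOptions.contains(key)) {
        throw new IllegalArgumentException("Unsupported option: " + key);
      }
    }
  }

  public static void main(String[] args) {
    Set<String> valid = Set.of(TARGET_FILE_SIZE_BYTES, MIN_INPUT_FILES, REWRITE_ALL);
    Map<String, String> options = new HashMap<>();
    options.put(TARGET_FILE_SIZE_BYTES, String.valueOf(512L * 1024 * 1024));
    options.put(MIN_INPUT_FILES, "10");
    checkOptions(options, valid); // passes; a real rewriter would then receive init(options)
    System.out.println("options accepted: " + options.keySet());
  }
}
```

Note that option values are passed as strings, matching the `Map<String,String>` parameter of `init`.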
-
wronglySized
protected boolean wronglySized(T task)
-
planFileGroups
public java.lang.Iterable<java.util.List<T>> planFileGroups(java.lang.Iterable<T> tasks)
Description copied from interface: FileRewriter
Selects files which this rewriter believes are valid targets to be rewritten based on their scan tasks and groups those scan tasks into file groups. The file groups are then rewritten in a single executable unit, such as a Spark job.
- Specified by:
planFileGroups
in interface FileRewriter<T extends ContentScanTask<F>,F extends ContentFile<F>>
- Parameters:
tasks
- an iterable of scan tasks for files in a partition
- Returns:
- groups of scan tasks for files to be rewritten in a single executable unit
-
enoughInputFiles
protected boolean enoughInputFiles(java.util.List<T> group)
-
enoughContent
protected boolean enoughContent(java.util.List<T> group)
-
tooMuchContent
protected boolean tooMuchContent(java.util.List<T> group)
-
inputSize
protected long inputSize(java.util.List<T> group)
-
splitSize
protected long splitSize(long inputSize)
Returns the smaller of our max write file threshold and our estimated split size based on the number of output files we want to generate. An overhead is added to the estimated split size so that small errors in size do not create brand-new files.
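The calculation described above can be sketched as follows. This is an assumption-laden illustration: the real method takes only `inputSize`, whereas this hypothetical version receives the number of output files and the max write size as parameters, and the overhead constant is a placeholder value that the description does not specify.

```java
final class SplitSizeSketch {
  // Placeholder overhead in bytes; the value used by the rewriter is not stated in the description.
  static final long SPLIT_OVERHEAD = 5L * 1024;

  // Estimated split size plus overhead, capped at the max write file size.
  static long splitSize(long inputSize, long numOutputFiles, long writeMaxFileSize) {
    if (numOutputFiles <= 0) {
      throw new IllegalArgumentException("numOutputFiles must be positive");
    }
    long estimatedSplitSize = inputSize / numOutputFiles + SPLIT_OVERHEAD;
    return Math.min(estimatedSplitSize, writeMaxFileSize);
  }
}
```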
-
numOutputFiles
protected long numOutputFiles(long inputSize)
Determines the preferable number of output files when rewriting a particular file group.
If the rewriter is handling 10.1 GB of data with a target file size of 1 GB, it could produce 11 files, one of which would only have 0.1 GB. This would most likely be less preferable to 10 files with 1.01 GB each. So this method decides whether to round up or round down based on what the estimated average file size will be if the remainder (0.1 GB) is distributed amongst other files. If the new average file size is no more than 10% greater than the target file size, then this method will round down when determining the number of output files. Otherwise, the remainder will be written into a separate file.
- Parameters:
inputSize
- a total input size for a file group
- Returns:
- the number of files this rewriter should create
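The rounding rule described above can be written out as a small sketch. It follows only the 10% rule from the description; the real method may apply additional conditions, and the class name and the extra `targetFileSize` parameter are hypothetical.

```java
final class NumOutputFilesSketch {
  static long numOutputFiles(long inputSize, long targetFileSize) {
    if (inputSize < targetFileSize) {
      return 1; // everything fits in a single file
    }
    long roundedDown = inputSize / targetFileSize;
    long roundedUp = (inputSize % targetFileSize == 0) ? roundedDown : roundedDown + 1;
    // Average file size if the remainder is spread across the rounded-down number of files.
    long avgSizeWithoutRemainder = inputSize / roundedDown;
    // Round down only if that average is at most 10% above the target (integer form of avg <= 1.1 * target).
    boolean withinTenPercent = avgSizeWithoutRemainder * 10 <= targetFileSize * 11;
    return withinTenPercent ? roundedDown : roundedUp;
  }

  public static void main(String[] args) {
    // The example from the description, in MB: 10.1 GB of input with a 1 GB target.
    System.out.println(numOutputFiles(10_100L, 1_000L));
  }
}
```

With the numbers from the description, the average without a remainder file is 1.01 GB, which is within 10% of the target, so the sketch rounds down to 10 files.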
-
writeMaxFileSize
protected long writeMaxFileSize()
Estimates a larger max target file size than the target size used in task creation to avoid creating tiny remainder files.
While we create tasks that should all be smaller than our target size, there is a chance that the actual data will end up being larger than our target size due to factors such as compression and serialization, which are outside our control. If this occurs, instead of making a single file that is close in size to our target, we would end up producing one file of the target size, and then a small extra file with the remaining data.
For example, if our target is 512 MB, we may generate a rewrite task that should be 500 MB. When we write the data we may find we actually have to write out 530 MB. If we use the target size while writing, we would produce a 512 MB file and an 18 MB file. If instead we use a larger size estimated by this method, then we end up writing a single file.
- Returns:
- the target size plus one half of the distance between max and target
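The return value described above is the midpoint between the target and max sizes. A minimal sketch, assuming both sizes are passed in as parameters (the real method takes none) and using a hypothetical class name:

```java
final class WriteMaxFileSizeSketch {
  // Target size plus one half of the distance between max and target.
  static long writeMaxFileSize(long targetFileSize, long maxFileSize) {
    return targetFileSize + (maxFileSize - targetFileSize) / 2;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024;
    long target = 512 * mb;
    long max = (long) (target * 1.80); // assumes the default max of 180% of the target
    long writeMax = writeMaxFileSize(target, max);
    // A 530 MB write fits under the larger limit, so no small remainder file is produced.
    System.out.println("530 MB fits in one file: " + (530 * mb <= writeMax));
  }
}
```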
-
-