Puffin file format🔗
This is a specification for Puffin, a file format designed to store information such as indexes and statistics about data managed in an Iceberg table that cannot be stored directly within the Iceberg manifest. A Puffin file contains arbitrary pieces of information (here called "blobs"), along with metadata necessary to interpret them. The blobs supported by Iceberg are documented at Blob types.
Format specification🔗
A file conforming to the Puffin file format specification should have the structure as described below.
Versions🔗
Currently, there is a single version of the Puffin file format, described below.
File structure🔗
The Puffin file has the following structure
where
Magic
is four bytes 0x50, 0x46, 0x41, 0x31 (short for: Puffin Fratercula arctica, version 1),Blobᵢ
is i-th blob contained in the file, to be interpreted by application according to the footer,Footer
is defined below.
Footer structure🔗
Footer has the following structure
where
Magic
: four bytes, same as at the beginning of the fileFooterPayload
: optionally compressed, UTF-8 encoded JSON payload describing the blobs in the file, with the structure described belowFooterPayloadSize
: a length in bytes of theFooterPayload
(after compression, if compressed), stored as 4 byte integerFlags
: 4 bytes for boolean flags- byte 0 (first)
- bit 0 (lowest bit): whether
FooterPayload
is compressed - all other bits are reserved for future use and should be set to 0 on write
- bit 0 (lowest bit): whether
- all other bytes are reserved for future use and should be set to 0 on write
A 4 byte integer is always signed, in a two's complement representation, stored little-endian.
Footer Payload🔗
Footer payload bytes is either uncompressed or LZ4-compressed (as a single
LZ4 compression frame
with content size present), UTF-8 encoded JSON payload representing a single
FileMetadata
object.
FileMetadata🔗
FileMetadata
has the following fields
Field Name | Field Type | Required | Description |
---|---|---|---|
blobs | list of BlobMetadata objects | yes | |
properties | JSON object with string property values | no | storage for arbitrary meta-information, like writer identification/version. See Common properties for properties that are recommended to be set by a writer. |
BlobMetadata🔗
BlobMetadata
has the following fields
Field Name | Field Type | Required | Description |
---|---|---|---|
type | JSON string | yes | See Blob types |
fields | JSON list of ints | yes | List of field IDs the blob was computed for; the order of items is used to compute sketches stored in the blob. |
snapshot-id | JSON long | yes | ID of the Iceberg table's snapshot the blob was computed from. |
sequence-number | JSON long | yes | Sequence number of the Iceberg table's snapshot the blob was computed from. |
offset | JSON long | yes | The offset in the file where the blob contents start |
length | JSON long | yes | The length of the blob stored in the file (after compression, if compressed) |
compression-codec | JSON string | no | See Compression codecs. If omitted, the data is assumed to be uncompressed. |
properties | JSON object with string property values | no | storage for arbitrary meta-information about the blob |
Blob types🔗
The blobs can be of a type listed below
apache-datasketches-theta-v1
blob type🔗
A serialized form of a "compact" Theta sketch produced by the Apache DataSketches library. The sketch is obtained by constructing Alpha family sketch with default seed, and feeding it with individual distinct values converted to bytes using Iceberg's single-value serialization.
The blob metadata for this blob may include following properties:
ndv
: estimate of number of distinct values, derived from the sketch.
Compression codecs🔗
The data can also be uncompressed. If it is compressed the codec should be one of codecs listed below. For maximal interoperability, other codecs are not supported.
Codec name | Description |
---|---|
lz4 | Single LZ4 compression frame, with content size present |
zstd | Single Zstandard compression frame, with content size present |
__ |
Common properties🔗
When writing a Puffin file it is recommended to set the following fields in the
FileMetadata's properties
field.
created-by
- human-readable identification of the application writing the file, along with its version. Example "Trino version 381".