Data partitioning in S3 (Yandex Object Storage)

In S3 (Yandex Object Storage), it is possible to store very large volumes of data. At the same time, queries to this data may not need to touch all the data but only a part of it. If you describe the rules for marking the storage structure of your data in YDB, then data that is not needed for the query can even be skipped from being read from S3 (Yandex Object Storage). This mechanism significantly speeds up query execution without affecting the result.

For example, data is stored in the following directory structure:

year=2021
    month=01
    month=02
    month=03
year=2022
    month=01

The query below explicitly implies that only the data for February 2021 needs to be processed, and other data is not needed.

SELECT
    *
FROM
    objectstorage.'/'
WITH
(
    SCHEMA =
    (
        data String,
        year Int32,
        month Int32
    )
)
WHERE
    year=2021
    AND month=02

If the data partitioning scheme is not specified, then all stored data will be read from S3 (Yandex Object Storage), but as a result of processing, data for all other dates will be discarded.

If you explicitly describe the storage structure, specifying that the data in S3 (Yandex Object Storage) is placed in directories by years and months

SELECT
    *
FROM
    objectstorage.'/'
WITH
(
    SCHEMA =
    (
        data String,
        year Int32,
        month Int32
    ),
    PARTITIONED_BY = "['year', 'month']"
)
WHERE
    year=2021
    AND month=02

then during the query execution, not all data will be read from S3 (Yandex Object Storage), but only the data for February 2021. This will significantly reduce the volume of data processed and speed up processing, while the results of both queries will be identical.

Note

The example above shows working with data at the level of connections. This example is chosen for illustrative purposes only. We strongly recommend using "data bindings" to work with data and not using direct work with connections.

Syntax

When working at the connection level, partitioning is set using the partitioned_by parameter, where the list of columns is specified in JSON format.

SELECT
    *
FROM
    <connection>.<path>
WITH
(
    SCHEMA=(<field1>, <field2>, <field3>),
    PARTITIONED_BY="['field2', 'field3']"
)

In the partitioned_by parameter, the columns of the data schema by which the data stored in S3 (Yandex Object Storage) are partitioned are listed. The order of specifying fields in the partitioned_by parameter determines the nesting of S3 (Yandex Object Storage) directories within each other.

For example, PARTITIONED_BY=['year', 'month'] defines the directory structure

year=2021
    month=01
    month=02
    month=03
year=2022
    month=01

And partitioned_by=['month', 'year'] defines another directory structure

month=01
    year=2021
    year=2022
month=02
    year=2021
month=03
    year=2021

Supported data types

Partitioning is possible only with the following set of YQL data types:

  • Uint16, Uint32, Uint64
  • Int16, Int32, Int64
  • String, Utf8

When using other types for specifying partitioning, an error is returned.

Supported storage path formats

The storage path format, where the name of each directory explicitly specifies the column name, is called the "Hive-Metastore format" or simply the "Hive format."

This format looks as follows:

month=01
    year=2021
    year=2022
month=02
    year=2021
month=03
    year=2021

Warning

The basic partitioning mode in YDB supports only the Hive format.

Use the Extended Data Partitioning mode to specify arbitrary storage paths.