Data formats and compression algorithms

This section describes the data formats supported in YDB for storage in S3 and the supported compression algorithms.

Supported data formats

The table below lists the data formats supported in YDB.

Format Read Write
csv_with_names
tsv_with_names
json_list
json_each_row
json_as_string
parquet
raw

Format csv_with_names

This format is based on the CSV format. Data is placed in columns separated by commas, with column names in the file's first row.

Example data:

Year,Manufacturer,Model,Price
1997,Man_1,Model_1,3000.00
1999,Man_2,Model_2,4900.00
Example query
SELECT
    AVG(Price)
FROM `connection`.`path`
WITH
(
    FORMAT = "csv_with_names",
    SCHEMA =
    (
        Year Int32,
        Manufacturer Utf8,
        Model Utf8,
        Price Double
    )
)

Query result:

# Manufacturer Model Price Year
1 Man_1 Model_1 3000 1997
2 Man_2 Model_2 4900 1999

Format tsv_with_names

This format is based on the TSV format. Data is placed in columns separated by tab characters (code 0x9), with column names in the file's first row.

Example data:

Year    Manufacturer    Model   Price
1997    Man_1   Model_1    3000.00
1999    Man_2   Model_2    4900.00
Example query
SELECT
    AVG(Price)
FROM `connection`.`path`
WITH
(
    FORMAT = "tsv_with_names",
    SCHEMA =
    (
        Year Int32,
        Manufacturer Utf8,
        Model Utf8,
        Price Double
    )
)

Query result:

# Manufacturer Model Price Year
1 Man_1 Model_1 3000 1997
2 Man_2 Model_2 4900 1999

Format json_list

This format is based on the JSON representation of data. Each file must contain an array of JSON objects.

Example of valid data (data presented as a list of JSON objects):

[
    { "Year": 1997, "Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000.0 },
    { "Year": 1999, "Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900.00 }
]

Example of INVALID data (each line contains a separate JSON object, but they are not combined into a list):

{ "Year": 1997, "Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000.0 }
{ "Year": 1999, "Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900.00 }

Format json_each_row

This format is based on the JSON representation of data. Each file must contain a JSON object on each line without combining them into a JSON array. This format is used for data streaming systems like Apache Kafka or YDB Topics.

Example of valid data (each line contains a separate JSON object):

{ "Year": 1997, "Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000.0 }
{ "Year": 1999, "Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900.00 }
Example query
SELECT
    AVG(Price)
FROM `connection`.`path`
WITH
(
    FORMAT = "json_each_row",
    SCHEMA =
    (
        Year Int32,
        Manufacturer Utf8,
        Model Utf8,
        Price Double
    )
)

Query result:

# Manufacturer Model Price Year
1 Man_1 Model_1 3000 1997
2 Man_2 Model_2 4900 1999

Format json_as_string

This format is based on the JSON representation of data. The json_as_string format does not split the input JSON document into fields but represents each file line as a single JSON object (or string). This format is helpful when the list of fields is not the same in all rows and may vary.

Each file must contain:

  • a JSON object on each line, or
  • JSON objects combined into an array.

Example of valid data (data presented as a list of JSON objects):

{ "Year": 1997, "Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000.0 }
{ "Year": 1999, "Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900.00 }
Example query
SELECT
    *
FROM `connection`.`path`
WITH
(
    FORMAT = "json_as_string",
    SCHEMA =
    (
        Data Json
    )
)

Query result:

# Data
1 {"Manufacturer": "Man_1", "Model": "Model_1", "Price": 3000, "Year": 1997}
2 {"Manufacturer": "Man_2", "Model": "Model_2", "Price": 4900, "Year": 1999}

Format parquet

This format allows reading the contents of files in Apache Parquet format.

Supported data compression algorithms for Parquet files:

  • Uncompressed
  • SNAPPY
  • GZIP
  • LZO
  • BROTLI
  • LZ4
  • ZSTD
  • LZ4_RAW
Example query
SELECT
    AVG(Price)
FROM `connection`.`path`
WITH
(
    FORMAT = "parquet",
    SCHEMA =
    (
        Year Int32,
        Manufacturer Utf8,
        Model Utf8,
        Price Double
    )
)

Query result:

# Manufacturer Model Price Year
1 Man_1 Model_1 3000 1997
2 Man_2 Model_2 4900 1999

Format raw

This format allows reading the contents of files as is, in raw form. The data read in this way can be processed using YQL tools, splitting into rows and columns.

This format should be used if the built-in parsing capabilities in YDB are insufficient.

Example query
SELECT
    *
FROM `connection`.`path`
WITH
(
    FORMAT = "raw",
    SCHEMA =
    (
        Data String
    )
)

Query result:

Year,Manufacturer,Model,Price
1997,Man_1,Model_1,3000.00
1999,Man_2,Model_2,4900.00

Supported compression algorithms

The use of compression algorithms depends on the file formats. For all file formats except Parquet, the following compression algorithms can be used:

Algorithm Name in YDB Read Write
Gzip gzip
Zstd zstd
LZ4 lz4
Brotli brotli
Bzip2 bzip2
Xz xz

For Parquet file format, the following internal compression algorithms are supported:

Compression format Name in YDB Read Write
Raw raw
Snappy snappy

YDB does not support working with externally compressed Parquet files, such as files named <myfile>.parquet.gz or similar. All files in Parquet format must be without external compression.