ClickBench load

This workload is based on the data and queries from the https://github.com/ClickHouse/ClickBench repository, with the queries and table layout adapted to YDB.

The benchmark generates a typical workload in the following areas: clickstream and traffic analysis, web analytics, machine-generated data, structured logs, and event data. It covers typical queries in analytics and real-time dashboards.

The dataset for this benchmark was obtained from an actual traffic recording of one of the world's largest web analytics platforms. It has been anonymized while keeping all the essential data distributions. The query set was improvised to reflect realistic workloads; the queries are not taken directly from production.

Common command options

All commands support the common option --path, which specifies the path to a table in the database:

ydb workload clickbench --path clickbench/hits ...

Available options

| Name | Description | Default value |
| --- | --- | --- |
| --path or -p | Specifies the table path. | clickbench/hits |

Initializing a load test

Before running the benchmark, create a table:

ydb workload clickbench --path clickbench/hits init

To see the description of the init command, run:

ydb workload clickbench init --help

Available parameters

| Name | Description | Default value |
| --- | --- | --- |
| --store <value> | Table storage type. Possible values: row, column, external-s3. | row |
| --external-s3-prefix <value> | Only relevant for external tables. Root path to the dataset in S3 storage. | |
| --external-s3-endpoint <value> or -e <value> | Only relevant for external tables. Link to the S3 bucket with the data. | |
| --string | Use the String type for text fields. Utf8 is used by default. | |
| --datetime | Use the Date, Datetime, and Timestamp types for time-related fields. Date32, Datetime64, and Timestamp64 are used by default. | |
| --clear | If the table at the specified path has already been created, it will be deleted. | |
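
As an illustration of the parameters above, the following sketch initializes the table with column-oriented storage and removes a previously created table first. The run helper is a hypothetical wrapper and not part of the ydb CLI: by default it only prints the command, and it executes the command when DRY_RUN=0 is set. Connection options are omitted, as in the other examples on this page.

```shell
# Hypothetical helper (not part of the ydb CLI): prints the command by default,
# executes it only when DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Delete the table at this path if it already exists (--clear),
# then initialize it with column storage (--store column).
run ydb workload clickbench --path clickbench/hits init --store column --clear
```

Printing the command first makes it easy to check the option order before anything is deleted.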

Loading data into a table

Download the data archive, then load the data into the table:

wget https://datasets.clickhouse.com/hits_compatible/hits.csv.gz
ydb workload clickbench --path clickbench/hits import files --input hits.csv.gz

The source can be CSV or TSV files, either compressed or uncompressed, or directories containing such files.

Available parameters

| Name | Description | Default value |
| --- | --- | --- |
| --input <path> or -i <path> | Path to the source data files. Both uncompressed and compressed CSV and TSV files, as well as directories containing such files, are supported. Data can be downloaded from the official ClickBench website: csv.gz, tsv.gz. To speed up the process, these files can be split into smaller parts so that they can be loaded in parallel. | |
| --state <path> | Path to the load state file. If loading is interrupted, it resumes from the same point when restarted. | |
| --clear-state | Relevant if the --state parameter is specified. Clears the state file and restarts loading from the beginning. | |

Common parameters of the import command

| Name | Description | Default value |
| --- | --- | --- |
| --upload-threads <value> or -t <value> | The number of execution threads for data preparation. | The number of available cores on the client. |
| --bulk-size <value> | The size of the chunk for sending data, in rows. | 10000 |
| --max-in-flight <value> | The maximum number of data chunks that can be processed simultaneously. | 128 |
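
The --input description mentions splitting the source file into smaller parts. One possible way to do that is sketched below with the standard split utility. The split_csv helper, the directory name hits_parts, the state file name import.state, and the part size are all hypothetical. The sketch assumes that the CSV file has no header row, that no record spans several lines, and that a .csv extension on each part is what the import expects.

```shell
# Hypothetical helper: split a line-oriented CSV file into parts of at most
# <lines> rows each, written to <outdir> as part_aa.csv, part_ab.csv, ...
# Assumes no header row and no record spanning several lines.
split_csv() {
  input="$1"; outdir="$2"; lines="$3"
  mkdir -p "$outdir"
  split -l "$lines" "$input" "$outdir/part_"
  # split does not add an extension, so append .csv to each new part.
  for part in "$outdir"/part_*; do
    case "$part" in
      *.csv) continue ;;   # already renamed on an earlier run
    esac
    mv "$part" "$part.csv"
  done
}

# Example (commented out: needs the downloaded archive and a configured ydb CLI):
#   gunzip -c hits.csv.gz | split_csv - hits_parts 1000000
#   ydb workload clickbench --path clickbench/hits import files --input hits_parts --state import.state
```

Passing - as the input makes split read from standard input, so the archive does not have to be unpacked to disk first.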

Run a load test

Run the load:

ydb workload clickbench --path clickbench/hits run

During the test, load statistics are displayed for each query.

To see the description of the run command, run:

ydb workload clickbench run --help

Common parameters for all load types

| Name | Description | Default value |
| --- | --- | --- |
| --output <value> | The name of the file where the query execution results will be saved. | results.out |
| --iterations <value> | The number of times each load query will be executed. | 1 |
| --json <name> | The name of the file where query execution statistics will be saved in json format. | Not saved by default |
| --ministat <name> | The name of the file where query execution statistics will be saved in ministat format. | Not saved by default |
| --plan <name> | The name of the file to save the query plan. Files like <name>.<query number>.explain and <name>.<query number>.<iteration number> will be saved in formats: ast, json, svg. | Not saved by default |
| --query-settings <setting> | Query execution settings. Each setting is added as a separate line at the beginning of each query. Use multiple times for multiple settings. | Not specified by default |
| --include | Query numbers or segments to be executed as part of the load. | All queries executed |
| --exclude | Query numbers or segments to be excluded from the load. | None excluded by default |
| --executer | Query execution engine. Available values: scan, generic. | generic |
| --verbose or -v | Print additional information to the screen during query execution. | |
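
As a sketch of how these parameters combine, the following runs the load once per execution engine and writes the statistics of each run to its own JSON file. The run and compare_engines helpers and the stats_<engine>.json file names are hypothetical. The run wrapper only prints the commands unless DRY_RUN=0 is set.

```shell
# Hypothetical helper (not part of the ydb CLI): prints the command by default,
# executes it only when DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Execute every query 3 times with each engine and keep per-engine statistics.
compare_engines() {
  for engine in generic scan; do
    run ydb workload clickbench --path clickbench/hits run \
      --executer "$engine" --iterations 3 --json "stats_$engine.json"
  done
}

compare_engines
```

Both runs write query results to the default --output file, so the second run may replace the results of the first; pass a separate --output name per engine if both are needed.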

ClickBench-specific options

| Name | Description | Default value |
| --- | --- | --- |
| --ext-queries <queries> or -q <queries> | External queries to execute during the load, separated by semicolons. | |
| --ext-queries-file <name> | Name of the file containing external queries to execute during the load, separated by semicolons. | |
| --ext-query-dir <name> | Directory containing external queries for the load. Queries should be in files named q[0-42].sql. | |
| --ext-results-dir <name> | Directory containing external query results for comparison. Results should be in files named q[0-42].sql. | |
| --check-canonical or -c | Use special deterministic internal queries and compare the results against canonical ones. | |
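
Before passing a directory to --ext-query-dir, it can help to check that it follows the naming scheme. The sketch below assumes that q[0-42].sql means the 43 files q0.sql through q42.sql. The missing_queries helper and the my_queries directory name are hypothetical.

```shell
# Hypothetical helper: print the names of query files missing from a directory.
# Assumes "q[0-42].sql" means the 43 files q0.sql through q42.sql.
missing_queries() {
  dir="$1"
  i=0
  while [ "$i" -le 42 ]; do
    [ -f "$dir/q$i.sql" ] || echo "q$i.sql"
    i=$((i + 1))
  done
}

# Example (commented out): prints nothing when the directory is complete.
#   missing_queries my_queries
```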

Clean up test data

Run cleanup:

ydb workload clickbench --path clickbench/hits clean

The command has no parameters.