
· 3 min read

How to install and test ClickHouse on Microsoft Windows

When installing ClickHouse on Windows 10, you may receive errors when inserting data, for example:

DB::Exception: std::__1::__fs::filesystem::filesystem_error: filesystem error: in rename: Permission denied ["./store/711/71144174-d098-4056-8976-6ad1204205ec/tmp_insert_all_1_1_0/"] ["./store/711/71144174-d098-4056-8976-6ad1204205ec/all_1_1_0/"]. Stack trace:

On Windows 10, WSL needs to be upgraded to WSL 2.
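
One way to check from inside the Linux shell whether you are already on WSL 2 is to look at the kernel release string. The sketch below is only a heuristic, and it rests on an assumption: that WSL 2 kernels report a release containing "microsoft-standard" or "WSL2" (as in the server log later in this article), while WSL 1 reports something like "4.4.0-19041-Microsoft". The function name looks_like_wsl2 is hypothetical.

```python
import platform


def looks_like_wsl2(kernel_release: str) -> bool:
    """Heuristic check (an assumption, not an official API):
    WSL 2 kernel releases usually contain 'microsoft-standard' or 'WSL2'."""
    release = kernel_release.lower()
    return "wsl2" in release or "microsoft-standard" in release


# Inspect the kernel release of the machine this script runs on.
print(platform.release(), "->", looks_like_wsl2(platform.release()))
```

If this prints False inside WSL, the distribution is probably still on WSL 1 and needs to be upgraded before continuing.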

wsl
  • To test, follow these steps; your output should be similar. Since this is only a test, I logged in as root to avoid permission issues:
sudo -i
  • Create a ClickHouse directory:
root@marspc2:~# mkdir /clickhouse
  • From the new directory, download clickhouse:
root@marspc2:/# cd clickhouse

root@marspc2:/clickhouse# curl https://clickhouse.com | sh
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2739 0 2739 0 0 5515 0 --:--:-- --:--:-- --:--:-- 5511

Will download https://builds.clickhouse.com/master/amd64/clickhouse into clickhouse

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 530M 100 530M 0 0 8859k 0 0:01:01 0:01:01 --:--:-- 8549k

Successfully downloaded the ClickHouse binary, you can run it as:
./clickhouse

You can also install it:
sudo ./clickhouse install
  • Start the clickhouse server:
root@marspc2:/clickhouse# ./clickhouse server
Processing configuration file 'config.xml'.
There is no file 'config.xml', will use embedded config.
Cannot set max size of core file to 1073741824
2023.04.17 19:19:23.155323 [ 500 ] {} <Information> SentryWriter: Sending crash reports is disabled
2023.04.17 19:19:23.165447 [ 500 ] {} <Trace> Pipe: Pipe capacity is 1.00 MiB
2023.04.17 19:19:23.271147 [ 500 ] {} <Information> Application: Starting ClickHouse 23.4.1.1222 (revision: 54473, git hash: 3993aef8e281815ac4269d44e27bb1dcdcff21cb, build id: AF16AA59B689841860F39ACDBED30AC8F9AB70FA), PID 500
2023.04.17 19:19:23.271208 [ 500 ] {} <Information> Application: starting up
2023.04.17 19:19:23.271237 [ 500 ] {} <Information> Application: OS name: Linux, version: 5.15.90.1-microsoft-standard-WSL2, architecture: x86_64
...
  • In another WSL window, start the client:
root@marspc2:/clickhouse# ./clickhouse client
ClickHouse client version 23.4.1.1222 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 23.4.1 revision 54462.

Warnings:
* Linux transparent hugepages are set to "always". Check /sys/kernel/mm/transparent_hugepage/enabled

marspc2. :)
  • Create the database and table:
marspc2. :) create database db1;

CREATE DATABASE db1

Query id: 688f79e2-8132-44ed-98d6-0581abe9903a

Ok.

0 rows in set. Elapsed: 0.007 sec.

marspc2. :) create table db1.table1 (id Int64, string_column String) engine = MergeTree() order by id;

CREATE TABLE db1.table1
(
`id` Int64,
`string_column` String
)
ENGINE = MergeTree
ORDER BY id

Query id: d91a93b4-e13f-4e17-8201-f329223287d0

Ok.

0 rows in set. Elapsed: 0.010 sec.
  • Insert sample rows:
marspc2. :) insert into db1.table1 (id, string_column) values (1, 'a'), (2,'b');

INSERT INTO db1.table1 (id, string_column) FORMAT Values

Query id: 2b274eef-09af-434b-88e0-c25799649910

Ok.

2 rows in set. Elapsed: 0.003 sec.
  • View the rows:
marspc2. :) select * from db1.table1;

SELECT *
FROM db1.table1

Query id: 74c76bf1-d944-4b21-a384-cc0b5e6aa579

┌─id─┬─string_column─┐
│ 1 │ a │
│ 2 │ b │
└────┴───────────────┘

2 rows in set. Elapsed: 0.002 sec.

· 6 min read

The query_log table in the system database keeps track of all your queries, including:

  • how much memory the query consumed, and
  • how much CPU time was needed

The following query returns the top 10 queries, where "top" means the queries that used the most memory:

SELECT
type,
event_time,
initial_query_id,
query_id,
formatReadableSize(memory_usage) AS memory,
ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')] AS userCPU,
ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')] AS systemCPU,
normalizedQueryHash(query) AS normalized_query_hash
FROM clusterAllReplicas(default, system.query_log)
ORDER BY memory_usage DESC
LIMIT 10;

The response looks like:

┌─type────────┬──────────event_time─┬─initial_query_id─────────────────────┬─memory─────┬─────userCPU─┬──systemCPU─┬─normalized_query_hash─┐
│ QueryFinish │ 2023-03-26 21:36:07 │ 7fc488a5-838f-410d-88ee-2f492825a26b │ 3.45 GiB │ 28147128901 │ 8590897697 │ 178963678599600243 │
│ QueryFinish │ 2023-03-26 21:36:04 │ 7fc488a5-838f-410d-88ee-2f492825a26b │ 1.18 GiB │ 10194162387 │ 1183376457 │ 4121209451971717712 │
│ QueryFinish │ 2023-03-26 21:36:06 │ 7fc488a5-838f-410d-88ee-2f492825a26b │ 1.16 GiB │ 10516510952 │ 1484303318 │ 4121209451971717712 │
│ QueryFinish │ 2023-03-26 21:35:59 │ 7fc488a5-838f-410d-88ee-2f492825a26b │ 1.14 GiB │ 11484580963 │ 1464145099 │ 4121209451971717712 │
│ QueryFinish │ 2023-03-26 21:47:01 │ 8119e682-a343-4847-96e7-d34ad8a748a1 │ 455.29 MiB │ 123340498 │ 8234304 │ 10687606311941357470 │
│ QueryFinish │ 2023-03-26 22:07:05 │ f2690e48-fe1e-4367-ae9d-435d962003a5 │ 377.94 MiB │ 2358130001 │ 668098391 │ 5988812223780974416 │
│ QueryFinish │ 2023-03-26 20:45:42 │ 04618222-40a1-4299-8c3d-9f050a82d849 │ 18.48 MiB │ 24676 │ 16620 │ 3205198713665290475 │
│ QueryFinish │ 2023-03-26 22:14:37 │ badf1097-5f8f-4486-88e9-3a5ac2e4734c │ 17.41 MiB │ 186234 │ 148739 │ 1910846996890686559 │
│ QueryFinish │ 2023-03-26 21:39:42 │ 8d373327-f566-4cd5-9f2c-cec75f534751 │ 16.19 MiB │ 23169 │ 12365 │ 3205198713665290475 │
│ QueryFinish │ 2023-03-26 21:35:42 │ ea672dba-7c10-4dd4-b819-cad9dccbf5d0 │ 13.97 MiB │ 20696 │ 8001 │ 3205198713665290475 │
└─────────────┴─────────────────────┴──────────────────────────────────────┴────────────┴─────────────┴────────────┴───────────────────────┘
note

The initial_query_id represents the ID of the initial query for distributed query execution launched from the node receiving the request. The query_id contains the ID of the child query executed on a different node. See this article for more details.
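
The ProfileEvents.Values[indexOf(ProfileEvents.Names, ...)] expressions in the query above look up a value in one array using the position of a name in a parallel array. Here is a minimal Python sketch of that lookup. The helper name profile_event and the sample arrays are assumptions made for illustration (the two CPU values are copied from a row shown later in this article), and returning 0 for a missing name is a choice of this sketch rather than a statement about ClickHouse behavior.

```python
def profile_event(names, values, event, default=0):
    """Return the value paired with `event` in two parallel lists.

    `default` is returned when the name is absent (a choice made for this sketch).
    """
    try:
        return values[names.index(event)]
    except ValueError:
        return default


# Illustrative parallel arrays, shaped like ProfileEvents.Names / ProfileEvents.Values.
names = ["Query", "UserTimeMicroseconds", "SystemTimeMicroseconds"]
values = [1, 1754290, 133344]

user_cpu = profile_event(names, values, "UserTimeMicroseconds")
system_cpu = profile_event(names, values, "SystemTimeMicroseconds")
print(user_cpu, system_cpu)
```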

You can use the query ID to extract more details about the query. Let's investigate the query that used the most memory above (the first one):

SELECT query
FROM clusterAllReplicas(default, system.query_log)
WHERE initial_query_id = '7fc488a5-838f-410d-88ee-2f492825a26b'

It turns out to be the query we used to insert a few billion rows of data into a table named youtube (see the YouTube dislikes dataset):

INSERT INTO youtube
SETTINGS input_format_null_as_default = 1
SELECT
id,
parseDateTimeBestEffortUS(toString(fetch_date)) AS fetch_date,
upload_date,
ifNull(title, '') AS title,
uploader_id,
ifNull(uploader, '') AS uploader,
uploader_sub_count,
is_age_limit,
view_count,
like_count,
dislike_count,
is_crawlable,
has_subtitles,
is_ads_enabled,
is_comments_enabled,
ifNull(description, '') AS description,
rich_metadata,
super_titles,
ifNull(uploader_badges, '') AS uploader_badges,
ifNull(video_badges, '') AS video_badges
FROM s3Cluster('default','https://clickhouse-public-datasets.s3.amazonaws.com/youtube/original/files/*.zst', 'JSONLines')

initial_query_id VS query_id

Note that in a clustered ClickHouse environment (like ClickHouse Cloud), initial_query_id represents the ID of the initial query for distributed query execution, launched from the node receiving the request; the query_id field then contains the ID of the child query executed on a different node.

If we add hostname() to the query above and narrow the search to a single initial_query_id (here a7262fa2-bd8b-4b51-a359-621ccf282417):

SELECT
hostname(),
type,
event_time,
initial_query_id,
query_id,
formatReadableSize(memory_usage) AS memory,
ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')] AS userCPU,
ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')] AS systemCPU,
normalizedQueryHash(query) AS normalized_query_hash
FROM clusterAllReplicas(default, system.query_log)
WHERE initial_query_id = 'a7262fa2-bd8b-4b51-a359-621ccf282417'
ORDER BY memory_usage DESC
LIMIT 10 FORMAT Pretty;

we will get:

┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ hostname() ┃ type ┃ event_time ┃ initial_query_id ┃ query_id ┃ memory ┃ userCPU ┃ systemCPU ┃ normalized_query_hash ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ server-0 │ QueryFinish │ 2023-04-26 06:25:53 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 1f810b3c-b3cb-4a7b-bc6c-8c8cc1e52515 │ 125.13 MiB │ 1754290 │ 133344 │ 17604798521132779336 │
├────────────────────────┼─────────────┼─────────────────────┼──────────────────────────────────────┼──────────────────────────────────────┼────────────┼─────────┼───────────┼───────────────────────┤
│ server-2 │ QueryFinish │ 2023-04-26 06:25:53 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 123.08 MiB │ 1849115 │ 123412 │ 4258439895846105173 │
└────────────────────────┴─────────────┴─────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┴────────────┴─────────┴───────────┴───────────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ hostname() ┃ type ┃ event_time ┃ initial_query_id ┃ query_id ┃ memory ┃ userCPU ┃ systemCPU ┃ normalized_query_hash ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ server-1 │ QueryFinish │ 2023-04-26 06:25:53 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 7dfd9297-5173-4be7-a866-d7cbe1e1abab │ 93.77 MiB │ 1890981 │ 101724 │ 17604798521132779336 │
├────────────────────────┼─────────────┼─────────────────────┼──────────────────────────────────────┼──────────────────────────────────────┼───────────┼─────────┼───────────┼───────────────────────┤
│ server-1 │ QueryStart │ 2023-04-26 06:25:52 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 7dfd9297-5173-4be7-a866-d7cbe1e1abab │ 0.00 B │ 0 │ 0 │ 17604798521132779336 │
└────────────────────────┴─────────────┴─────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┴───────────┴─────────┴───────────┴───────────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ hostname() ┃ type ┃ event_time ┃ initial_query_id ┃ query_id ┃ memory ┃ userCPU ┃ systemCPU ┃ normalized_query_hash ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ server-0 │ QueryStart │ 2023-04-26 06:25:52 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 1f810b3c-b3cb-4a7b-bc6c-8c8cc1e52515 │ 0.00 B │ 0 │ 0 │ 17604798521132779336 │
├────────────────────────┼────────────┼─────────────────────┼──────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────┼───────────┼───────────────────────┤
│ server-2 │ QueryStart │ 2023-04-26 06:25:52 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 0.00 B │ 0 │ 0 │ 4258439895846105173 │
└────────────────────────┴────────────┴─────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┴────────┴─────────┴───────────┴───────────────────────┘

Note we have several results from several hosts (the different cluster nodes).

To refine further and get only the child queries we could also add the query_id != initial_query_id condition to the WHERE clause:

SELECT
hostname(),
type,
event_time,
initial_query_id,
query_id,
formatReadableSize(memory_usage) AS memory,
ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')] AS userCPU,
ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')] AS systemCPU,
normalizedQueryHash(query) AS normalized_query_hash
FROM clusterAllReplicas(default, system.query_log)
WHERE (query_id != initial_query_id) AND (initial_query_id = 'a7262fa2-bd8b-4b51-a359-621ccf282417')
ORDER BY memory_usage DESC
LIMIT 10 FORMAT Pretty;

This returns all the child queries executed on the remote nodes (remote relative to the node that first received the query):

┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ hostname() ┃ type ┃ event_time ┃ initial_query_id ┃ query_id ┃ memory ┃ userCPU ┃ systemCPU ┃ normalized_query_hash ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ server-1 │ QueryFinish │ 2023-04-26 06:25:53 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 7dfd9297-5173-4be7-a866-d7cbe1e1abab │ 93.77 MiB │ 1890981 │ 101724 │ 17604798521132779336 │
├────────────────────────┼─────────────┼─────────────────────┼──────────────────────────────────────┼──────────────────────────────────────┼────────────┼─────────┼───────────┼───────────────────────┤
│ server-0 │ QueryFinish │ 2023-04-26 06:25:53 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 1f810b3c-b3cb-4a7b-bc6c-8c8cc1e52515 │ 125.13 MiB │ 1754290 │ 133344 │ 17604798521132779336 │
├────────────────────────┼─────────────┼─────────────────────┼──────────────────────────────────────┼──────────────────────────────────────┼────────────┼─────────┼───────────┼───────────────────────┤
│ server-1 │ QueryStart │ 2023-04-26 06:25:52 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 7dfd9297-5173-4be7-a866-d7cbe1e1abab │ 0.00 B │ 0 │ 0 │ 17604798521132779336 │
├────────────────────────┼─────────────┼─────────────────────┼──────────────────────────────────────┼──────────────────────────────────────┼────────────┼─────────┼───────────┼───────────────────────┤
│ server-0 │ QueryStart │ 2023-04-26 06:25:52 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 1f810b3c-b3cb-4a7b-bc6c-8c8cc1e52515 │ 0.00 B │ 0 │ 0 │ 17604798521132779336 │
└────────────────────────┴─────────────┴─────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┴────────────┴─────────┴───────────┴───────────────────────┘

Conversely, query_id = initial_query_id returns only the queries executed on the local node that first received the distributed query:

SELECT
hostname(),
type,
event_time,
initial_query_id,
query_id,
formatReadableSize(memory_usage) AS memory,
ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')] AS userCPU,
ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')] AS systemCPU,
normalizedQueryHash(query) AS normalized_query_hash
FROM clusterAllReplicas(default, system.query_log)
WHERE (query_id = initial_query_id) AND (initial_query_id = 'a7262fa2-bd8b-4b51-a359-621ccf282417')
ORDER BY memory_usage DESC
LIMIT 10 FORMAT Pretty;

giving:

┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ hostname() ┃ type ┃ event_time ┃ initial_query_id ┃ query_id ┃ memory ┃ userCPU ┃ systemCPU ┃ normalized_query_hash ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ server-2 │ QueryFinish │ 2023-04-26 06:25:53 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 123.08 MiB │ 1849115 │ 123412 │ 4258439895846105173 │
├────────────────────────┼─────────────┼─────────────────────┼──────────────────────────────────────┼──────────────────────────────────────┼────────────┼─────────┼───────────┼───────────────────────┤
│ server-2 │ QueryStart │ 2023-04-26 06:25:52 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 0.00 B │ 0 │ 0 │ 4258439895846105173 │
└────────────────────────┴─────────────┴─────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┴────────────┴─────────┴───────────┴───────────────────────┘
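
The distinction between the two filters can be sketched in a few lines of Python. The rows below are copied from the result tables above (hostname, type and the two IDs only); the LogRow type is a hypothetical container introduced for this illustration.

```python
from typing import NamedTuple


class LogRow(NamedTuple):
    hostname: str
    type: str
    initial_query_id: str
    query_id: str


INITIAL = "a7262fa2-bd8b-4b51-a359-621ccf282417"
CHILD_0 = "1f810b3c-b3cb-4a7b-bc6c-8c8cc1e52515"
CHILD_1 = "7dfd9297-5173-4be7-a866-d7cbe1e1abab"

rows = [
    LogRow("server-0", "QueryFinish", INITIAL, CHILD_0),
    LogRow("server-2", "QueryFinish", INITIAL, INITIAL),
    LogRow("server-1", "QueryFinish", INITIAL, CHILD_1),
    LogRow("server-1", "QueryStart", INITIAL, CHILD_1),
    LogRow("server-0", "QueryStart", INITIAL, CHILD_0),
    LogRow("server-2", "QueryStart", INITIAL, INITIAL),
]

# query_id = initial_query_id: entries logged on the node that received the query.
initiator = [r for r in rows if r.query_id == r.initial_query_id]
# query_id != initial_query_id: child queries executed on the other nodes.
children = [r for r in rows if r.query_id != r.initial_query_id]

print(sorted({r.hostname for r in initiator}))
print(sorted({r.hostname for r in children}))
```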

As with other system tables, you can find more details about the meaning of each field in our docs.

· 2 min read

Suppose you create a table that uses the File table engine with the Parquet format:

CREATE TABLE parquet_test
(
`x` UInt32,
`y` String
)
ENGINE = File(Parquet)

You can write to the table once:

INSERT INTO parquet_test VALUES
(1, 'Hello'),
(2, 'Hi')

This creates a file named data.Parquet in the data/default/parquet_test folder. If you try to insert another batch:

INSERT INTO parquet_test VALUES
(3, 'World'),
(4, 'Bye')

...you get the following error:

Code: 641. DB::Exception: Received from localhost:9000. DB::Exception: Cannot append data in format Parquet to file, because this format doesn't support appends. You can allow to create a new file on each insert by enabling setting engine_file_allow_create_multiple_files. (CANNOT_APPEND_TO_FILE)

You cannot append to Parquet files in ClickHouse. But you can tell ClickHouse to create a new file for every INSERT by enabling the engine_file_allow_create_multiple_files setting. If enabled, each insert creates a new file with a name following this pattern:

`data.Parquet` -> `data.1.Parquet` -> `data.2.Parquet`, etc.
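
As a quick illustration of that naming pattern only (not of how ClickHouse implements it), the following Python sketch generates the file name for the n-th insert. The function name file_name_for_insert and its parameters are hypothetical.

```python
def file_name_for_insert(n: int, base: str = "data", ext: str = "Parquet") -> str:
    """Name of the file written by the n-th insert (0-based), following
    the pattern data.Parquet -> data.1.Parquet -> data.2.Parquet."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return f"{base}.{ext}" if n == 0 else f"{base}.{n}.{ext}"


first_three = [file_name_for_insert(i) for i in range(3)]
print(first_three)
```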

Let's give it a try. We will put our two commands into a single file named parquet.sql:

SET engine_file_allow_create_multiple_files = 1;

INSERT INTO default.parquet_test VALUES (3, 'World'), (4, 'Bye');

Run it using clickhouse-client:

./clickhouse client --queries-file parquet.sql

And now you will see two files in data/default/parquet_test (and a new file for each subsequent insert).

note

The engine_file_allow_create_multiple_files setting applies to other data formats that are not appendable, like JSON and ORC.

· 2 min read

Question: How do I show all queries involving materialized views in the last 60m?

Answer:

This query displays all queries directed at materialized views, based on two observations:

  • we can use the create_table_query field in the system.tables table to identify which tables are explicit (TO) recipients of MVs;
  • we can trace back (using the uuid and the naming convention .inner_id.<uuid>) which tables are implicit recipients of MVs.

We can also configure how far back in time to look by changing the value (60 minutes by default) in the initial CTE of the query:
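
To make the two naming rules concrete, here is a small Python sketch. It applies the same regular expression that the query uses to two hypothetical CREATE statements (invented for illustration, not taken from a real system.tables), and it builds an implicit target name from a UUID using the same concatenation. Python's re module is used as a stand-in for ClickHouse's regex engine, which is an assumption that happens to hold for this simple pattern. As the comment in the query notes, ClickHouse's extract returns only the first capturing group, whereas this sketch shows both.

```python
import re

# Same pattern as in the query: group 1 is the MV name, group 2 the optional TO target.
MV_PATTERN = re.compile(r"^CREATE MATERIALIZED VIEW\s(\w+\.\w+)\s(?:TO\s(\S+))?")


def parse_mv(create_table_query: str):
    """Return (mv_name, explicit_target or None), or None if the text is not an MV."""
    match = MV_PATTERN.match(create_table_query)
    if match is None:
        return None
    return match.group(1), match.group(2)


def implicit_target_name(uuid: str, database: str = "default") -> str:
    """Build the hidden target table name using the .inner_id.<uuid> convention."""
    return f"{database}.`.inner_id.{uuid}`"


# Hypothetical statements, for illustration only.
explicit = "CREATE MATERIALIZED VIEW default.big_changes_mv TO default.big_changes AS SELECT 1"
implicit = "CREATE MATERIALIZED VIEW default.sum_of_volumes_mv ENGINE = MergeTree ORDER BY tuple() AS SELECT 1"

print(parse_mv(explicit))
print(parse_mv(implicit))
print(implicit_target_name("71144174-d098-4056-8976-6ad1204205ec"))
```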

WITH(60) -- default 60m
AS timeRange,
(
--prepare names of possible implicit MV hidden target tables for *any* table with NON NULL uuid
SELECT groupArray(
concat('default.`.inner_id.', toString(uuid), '`')
)
FROM clusterAllReplicas(default, system.tables)
WHERE notEmpty(uuid)
) AS MV_implicit_possible_hidden_target_tables_names_array,
(
--captures MV name and target tables (if TO is specified)
--TODO it seems that extract will return just the first capturing group :( replace with regexpExtract once available
SELECT arrayFilter(
x->x != '',
--remove empty captures
groupArray(
extract(
create_table_query,
'^CREATE MATERIALIZED VIEW\s(\w+\.\w+)\s(?:TO\s(\S+))?'
)
)
)
FROM clusterAllReplicas(default, system.tables)
WHERE engine = 'MaterializedView'
) AS MV_explicit_target_tables_names_array
SELECT event_time,
query,
tables as "MVs tables"
FROM clusterAllReplicas(default, system.query_log)
WHERE (
-- only SELECT within 60m
event_time > now() - toIntervalMinute(timeRange)
AND startsWith(query, 'SELECT')
) -- check either that query involves implicit MV target table names
AND (
hasAny(
tables,
MV_implicit_possible_hidden_target_tables_names_array
)
OR -- check that query involves explicit MV target table
hasAny(tables, MV_explicit_target_tables_names_array)
)
ORDER BY event_time DESC;

Expected output:

| event_time          | query                                                                                          | MVs tables                                                            |
| ------------------- | ---------------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| 2023-02-23 08:14:14 | SELECT rand(),* FROM default.sum_of_volumes, default.big_changes, system.users | ["default.big_changes_mv","default.sum_of_volumes_mv","system.users"] |
| 2023-02-23 08:04:47 | SELECT price,* FROM default.sum_of_volumes, default.big_changes | ["default.big_changes_mv","default.sum_of_volumes_mv"] |

In the example results above, default.big_changes_mv and default.sum_of_volumes_mv are both materialized views.

· 6 min read

Question: How do I import JSON arrays and how can I query the inner objects?

Answer:

Save this one-line JSON array to sample.json:

{"_id":"1","channel":"help","events":[{"eventType":"open","time":"2021-06-18T09:42:39.527Z"},{"eventType":"close","time":"2021-06-18T09:48:05.646Z"}]},{"_id":"2","channel":"help","events":[{"eventType":"open","time":"2021-06-18T09:42:39.535Z"},{"eventType":"edit","time":"2021-06-18T09:42:41.317Z"}]},{"_id":"3","channel":"questions","events":[{"eventType":"close","time":"2021-06-18T09:42:39.543Z"},{"eventType":"create","time":"2021-06-18T09:52:51.299Z"}]},{"_id":"4","channel":"general","events":[{"eventType":"create","time":"2021-06-18T09:42:39.552Z"},{"eventType":"edit","time":"2021-06-18T09:47:29.109Z"}]},{"_id":"5","channel":"general","events":[{"eventType":"edit","time":"2021-06-18T09:42:39.560Z"},{"eventType":"open","time":"2021-06-18T09:42:39.680Z"},{"eventType":"close","time":"2021-06-18T09:42:41.207Z"},{"eventType":"edit","time":"2021-06-18T09:42:43.372Z"},{"eventType":"edit","time":"2021-06-18T09:42:45.642Z"}]}

Check the data:

clickhousebook.local :) SELECT * FROM file('/path/to/sample.json','JSONEachRow');

SELECT *
FROM file('/path/to/sample.json', 'JSONEachRow')

Query id: 0bbfa09f-ac7f-4a1e-9227-2961b5ffc2d4

┌─_id─┬─channel───┬─events─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 1 │ help │ [{'eventType':'open','time':'2021-06-18T09:42:39.527Z'},{'eventType':'close','time':'2021-06-18T09:48:05.646Z'}] │
│ 2 │ help │ [{'eventType':'open','time':'2021-06-18T09:42:39.535Z'},{'eventType':'edit','time':'2021-06-18T09:42:41.317Z'}] │
│ 3 │ questions │ [{'eventType':'close','time':'2021-06-18T09:42:39.543Z'},{'eventType':'create','time':'2021-06-18T09:52:51.299Z'}] │
│ 4 │ general │ [{'eventType':'create','time':'2021-06-18T09:42:39.552Z'},{'eventType':'edit','time':'2021-06-18T09:47:29.109Z'}] │
│ 5 │ general │ [{'eventType':'edit','time':'2021-06-18T09:42:39.560Z'},{'eventType':'open','time':'2021-06-18T09:42:39.680Z'},{'eventType':'close','time':'2021-06-18T09:42:41.207Z'},{'eventType':'edit','time':'2021-06-18T09:42:43.372Z'},{'eventType':'edit','time':'2021-06-18T09:42:45.642Z'}] │
└─────┴───────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

5 rows in set. Elapsed: 0.001 sec.

Create a table to receive the JSON rows:

clickhousebook.local :) CREATE TABLE IF NOT EXISTS sample_json_objects_array (
`rawJSON` String EPHEMERAL,
`_id` String DEFAULT JSONExtractString(rawJSON, '_id'),
`channel` String DEFAULT JSONExtractString(rawJSON, 'channel'),
`events` Array(JSON) DEFAULT JSONExtractArrayRaw(rawJSON, 'events')
) ENGINE = MergeTree
ORDER BY
channel

CREATE TABLE IF NOT EXISTS sample_json_objects_array
(
`rawJSON` String EPHEMERAL,
`_id` String DEFAULT JSONExtractString(rawJSON, '_id'),
`channel` String DEFAULT JSONExtractString(rawJSON, 'channel'),
`events` Array(JSON) DEFAULT JSONExtractArrayRaw(rawJSON, 'events')
)
ENGINE = MergeTree
ORDER BY channel

Query id: d02696dd-3f9f-4863-be2a-b2c9a1ae922d


0 rows in set. Elapsed: 0.173 sec.

Insert the data:

clickhousebook.local :) INSERT INTO
sample_json_objects_array
SELECT
*
FROM
file(
'/opt/cases/000000/sample.json',
'JSONEachRow'
);

INSERT INTO sample_json_objects_array SELECT *
FROM file('/opt/cases/000000/sample.json', 'JSONEachRow')

Query id: 60c4beab-3c2c-40c1-9c6f-bbbd7118dde3

Ok.

0 rows in set. Elapsed: 0.002 sec.

Check how type inference handled the JSON object type:

clickhousebook.local :) DESCRIBE TABLE sample_json_objects_array SETTINGS describe_extend_object_types = 1;

DESCRIBE TABLE sample_json_objects_array
SETTINGS describe_extend_object_types = 1

Query id: 302c0c84-1b63-4f60-ad95-d91c0267b0d4

┌─name────┬─type────────────────────────────────────────┬─default_type─┬─default_expression─────────────────────┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ rawJSON │ String │ EPHEMERAL │ defaultValueOfTypeName('String') │ │ │ │
│ _id │ String │ DEFAULT │ JSONExtractString(rawJSON, '_id') │ │ │ │
│ channel │ String │ DEFAULT │ JSONExtractString(rawJSON, 'channel') │ │ │ │
│ events │ Array(Tuple(eventType String, time String)) │ DEFAULT │ JSONExtractArrayRaw(rawJSON, 'events') │ │ │ │
└─────────┴─────────────────────────────────────────────┴──────────────┴────────────────────────────────────────┴─────────┴──────────────────┴────────────────┘

events is an Array of Tuple, each containing an eventType String field and a time String field. The latter type is suboptimal (we'd want DateTime instead).

Let's see the data:

clickhousebook.local :) SELECT
_id,
channel,
events.eventType,
events.time
FROM sample_json_objects_array
WHERE has(events.eventType, 'close')

SELECT
_id,
channel,
events.eventType,
events.time
FROM sample_json_objects_array
WHERE has(events.eventType, 'close')

Query id: 3ddd6843-5206-4f52-971f-1699f0ba1728

┌─_id─┬─channel───┬─events.eventType──────────────────────┬─events.time──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 5 │ general │ ['edit','open','close','edit','edit'] │ ['2021-06-18T09:42:39.560Z','2021-06-18T09:42:39.680Z','2021-06-18T09:42:41.207Z','2021-06-18T09:42:43.372Z','2021-06-18T09:42:45.642Z'] │
│ 1 │ help │ ['open','close'] │ ['2021-06-18T09:42:39.527Z','2021-06-18T09:48:05.646Z'] │
│ 3 │ questions │ ['close','create'] │ ['2021-06-18T09:42:39.543Z','2021-06-18T09:52:51.299Z'] │
└─────┴───────────┴───────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

3 rows in set. Elapsed: 0.001 sec.

Let's run a few queries:

Get the _id and channel of the rows whose events include an eventType of close:

clickhousebook.local :) SELECT
_id,
channel,
events.eventType
FROM
sample_json_objects_array
WHERE
has(events.eventType,'close')

SELECT
_id,
channel,
events.eventType
FROM sample_json_objects_array
WHERE has(events.eventType, 'close')

Query id: 033a0c56-7bfa-4261-a334-7323bdc40f87

┌─_id─┬─channel───┬─events.eventType──────────────────────┐
│ 5 │ general │ ['edit','open','close','edit','edit'] │
│ 1 │ help │ ['open','close'] │
│ 3 │ questions │ ['close','create'] │
└─────┴───────────┴───────────────────────────────────────┘
┌─_id─┬─channel───┬─events.eventType──────────────────────┐
│ 5 │ general │ ['edit','open','close','edit','edit'] │
│ 1 │ help │ ['open','close'] │
│ 3 │ questions │ ['close','create'] │
└─────┴───────────┴───────────────────────────────────────┘

6 rows in set. Elapsed: 0.001 sec.

We want to query by time, for example to find all events within a given time range, but we notice that it was imported as String:

clickhousebook.local :) SELECT toTypeName(events.time) FROM sample_json_objects_array;

SELECT toTypeName(events.time)
FROM sample_json_objects_array

Query id: 27f07f02-66cd-420d-8623-eeed7d501014

┌─toTypeName(events.time)─┐
│ Array(String) │
│ Array(String) │
│ Array(String) │
│ Array(String) │
│ Array(String) │
└─────────────────────────┘

5 rows in set. Elapsed: 0.001 sec.

So, in order to handle these as dates, first we want to convert to DateTime. To convert an array we use a map function:

clickhousebook.local :) 
SELECT
_id,
channel,
arrayMap(x->parseDateTimeBestEffort(x), events.time)
FROM
sample_json_objects_array

SELECT
_id,
channel,
arrayMap(x -> parseDateTimeBestEffort(x), events.time)
FROM sample_json_objects_array

Query id: f3c7881e-b41c-4872-9c67-5c25966599a1

┌─_id─┬─channel───┬─arrayMap(lambda(tuple(x), parseDateTimeBestEffort(x)), events.time)─────────────────────────────────────────────┐
│ 4 │ general │ ['2021-06-18 11:42:39','2021-06-18 11:47:29'] │
│ 5 │ general │ ['2021-06-18 11:42:39','2021-06-18 11:42:39','2021-06-18 11:42:41','2021-06-18 11:42:43','2021-06-18 11:42:45'] │
│ 1 │ help │ ['2021-06-18 11:42:39','2021-06-18 11:48:05'] │
│ 2 │ help │ ['2021-06-18 11:42:39','2021-06-18 11:42:41'] │
│ 3 │ questions │ ['2021-06-18 11:42:39','2021-06-18 11:52:51'] │
└─────┴───────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

5 rows in set. Elapsed: 0.001 sec.

We can see the difference by applying toTypeName to both arrays:

clickhousebook.local :) SELECT
_id,
channel,
toTypeName(events.time) as events_as_strings,
toTypeName(arrayMap(x->parseDateTimeBestEffort(x), events.time)) as events_as_datetime
FROM
sample_json_objects_array

SELECT
_id,
channel,
toTypeName(events.time) AS events_as_strings,
toTypeName(arrayMap(x -> parseDateTimeBestEffort(x), events.time)) AS events_as_datetime
FROM sample_json_objects_array

Query id: 1af54994-b756-472f-88d7-8b5cdca0e54e

┌─_id─┬─channel───┬─events_as_strings─┬─events_as_datetime─┐
│ 4 │ general │ Array(String) │ Array(DateTime) │
│ 5 │ general │ Array(String) │ Array(DateTime) │
│ 1 │ help │ Array(String) │ Array(DateTime) │
│ 2 │ help │ Array(String) │ Array(DateTime) │
│ 3 │ questions │ Array(String) │ Array(DateTime) │
└─────┴───────────┴───────────────────┴────────────────────┘

5 rows in set. Elapsed: 0.001 sec.

Now let's get the _id of the rows where time falls within a given interval.

We use arrayCount to check whether more than 0 items in the array returned by the map function match the condition x BETWEEN toDateTime('2021-06-18 11:46:00', 'Europe/Rome') AND toDateTime('2021-06-18 11:50:00', 'Europe/Rome'):

clickhousebook.local :) SELECT
_id,
arrayMap(x -> parseDateTimeBestEffort(x), events.time)
FROM
sample_json_objects_array
WHERE
arrayCount(
x -> x BETWEEN toDateTime('2021-06-18 11:46:00', 'Europe/Rome')
AND toDateTime('2021-06-18 11:50:00', 'Europe/Rome'),
arrayMap(x -> parseDateTimeBestEffort(x), events.time)
) > 0;

SELECT
_id,
arrayMap(x -> parseDateTimeBestEffort(x), events.time)
FROM sample_json_objects_array
WHERE arrayCount(x -> ((x >= toDateTime('2021-06-18 11:46:00', 'Europe/Rome')) AND (x <= toDateTime('2021-06-18 11:50:00', 'Europe/Rome'))), arrayMap(x -> parseDateTimeBestEffort(x), events.time)) > 0

Query id: d4882fc3-9f99-4e87-9f89-47683f10656d

┌─_id─┬─arrayMap(lambda(tuple(x), parseDateTimeBestEffort(x)), events.time)─┐
│ 4 │ ['2021-06-18 11:42:39','2021-06-18 11:47:29'] │
│ 1 │ ['2021-06-18 11:42:39','2021-06-18 11:48:05'] │
└─────┴─────────────────────────────────────────────────────────────────────┘

2 rows in set. Elapsed: 0.002 sec.
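To make the filter logic concrete, here is a minimal Python sketch of what the WHERE clause does: parse each string, count the values inside the interval, and keep the row when the count is greater than 0. It is only an illustration, not how ClickHouse evaluates the query. The first two rows reuse the already parsed values shown in the output above; the third row (_id 9) is hypothetical, and time zones are ignored for simplicity.

```python
from datetime import datetime

def parse(ts: str) -> datetime:
    # Stand-in for parseDateTimeBestEffort; only handles 'YYYY-MM-DD HH:MM:SS'.
    return datetime.fromisoformat(ts)

def rows_with_event_between(rows, start: datetime, end: datetime):
    """Return the ids of rows with at least one event time in [start, end]."""
    matching = []
    for row_id, times in rows:
        # arrayCount(x -> x BETWEEN start AND end, arrayMap(parse, times))
        count = sum(1 for t in times if start <= parse(t) <= end)
        if count > 0:
            matching.append(row_id)
    return matching

rows = [
    (4, ["2021-06-18 11:42:39", "2021-06-18 11:47:29"]),
    (1, ["2021-06-18 11:42:39", "2021-06-18 11:48:05"]),
    (9, ["2021-06-18 11:42:39"]),  # hypothetical row with no event in the interval
]

start = datetime(2021, 6, 18, 11, 46, 0)
end = datetime(2021, 6, 18, 11, 50, 0)
matching_ids = rows_with_event_between(rows, start, end)
```

Each row is kept or dropped as a whole: one matching element is enough, which is why the query returns the full arrays including the 11:42:39 entries.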

⚠️

Please remember, at the time of writing this article the current implementation of JSON is experimental and not suited for production.

This example shows how to quickly import JSON and start querying it. It trades schema rigor for ease of use: we import the JSON objects as the JSON type, with no need to specify the schema upfront. That is convenient for a quick test, but for long-term use we would want to store the data using the most appropriate types. In this example, that means using DateTime instead of String for the time field, which avoids the post-ingestion conversion illustrated above. Please refer to the documentation for more about handling JSON.

· 2 min read

Question:

How can I quickly recreate a table and its data using just copy/paste across different terminals?

Answer:

This is NOT a recommended practice to migrate data from one database to another and it should NOT be used for production data migration.

It is simply intended as a quick and dirty way to recreate a small amount of data when developing across multiple environments.

  1. Get the CREATE TABLE statement with SHOW CREATE TABLE:
SHOW CREATE TABLE cookies;

SHOW CREATE TABLE cookies

Query id: 248ec8e2-5bce-45b3-97d9-ed68edf445a5

┌─statement────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
CREATE TABLE default.cookies
(
`id` String,
`timestamp` DateTime
)
ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 8192
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

  2. Get the data export using FORMAT SQLInsert:
SELECT * FROM cookies FORMAT SQLInsert;

SELECT *
FROM cookies
FORMAT SQLInsert

Query id: 383759b8-69c0-4561-ab95-f8224abc0071

INSERT INTO table (`id`, `timestamp`) VALUES ('4', '2023-03-15 16:28:46')
, ('2', '2023-03-15 16:28:41')
, ('1', '2023-03-15 16:11:02'), ('1', '2023-03-15 16:11:40'), ('1', '2023-03-15 16:11:48'), ('1', '2023-03-15 16:16:05'), ('2', '2023-03-15 16:11:06'), ('3', '2023-03-15 16:11:12'), ('3', '2023-03-15 16:11:45'), ('3', '2023-03-15 16:16:08'), ('4', '2023-03-15 16:11:14'), ('4', '2023-03-15 16:11:50'), ('4', '2023-03-15 16:16:01'), ('5', '2023-03-15 16:11:18'), ('5', '2023-03-15 16:16:11')
;

15 rows in set. Elapsed: 0.023 sec.

Note that the INSERT statement generated in step 2 uses the placeholder name table. Replace it with the actual table name (cookies in this example) before pasting the statement into the other terminal.
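If you do this often, the rename can be scripted. The following is a minimal Python sketch, assuming the exported text starts with the literal prefix INSERT INTO table as in the output above; the helper name retarget_insert is hypothetical.

```python
def retarget_insert(sql: str, table_name: str) -> str:
    """Swap the placeholder table name in a SQLInsert export for a real one."""
    prefix = "INSERT INTO table "
    if not sql.startswith(prefix):
        raise ValueError("expected the export to start with 'INSERT INTO table '")
    # Only the leading placeholder is touched; the VALUES payload is left as-is.
    return "INSERT INTO " + table_name + " " + sql[len(prefix):]

exported = "INSERT INTO table (`id`, `timestamp`) VALUES ('4', '2023-03-15 16:28:46')"
fixed = retarget_insert(exported, "cookies")
```

Replacing only the leading prefix avoids accidentally rewriting the word table if it appears inside the data. ClickHouse also has an output_format_sql_insert_table_name setting for the SQLInsert format, which should let you skip the manual rename; check the formats documentation for your version before relying on it.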

· One min read

I want to export data to an S3 bucket, splitting the files across paths that follow a structure like:

  • 2022
    • 1
    • 2
    • ...
    • 12
  • 2021
    • 1
    • 2
    • ...
    • 12

and so on ...

Answer

Considering the ClickHouse table:

CREATE TABLE sample_data (
`name` String,
`age` Int,
`time` DateTime
) ENGINE = MergeTree
ORDER BY
name

Add 10000 entries:

INSERT INTO
sample_data
SELECT
*
FROM
generateRandom(
'name String, age Int, time DateTime',
10,
10,
10
)
LIMIT
10000;

Run this to create the desired structure in the S3 bucket my_bucket (note that this example writes files in Parquet format):

INSERT INTO
FUNCTION s3(
'https://s3-host:4321/my_bucket/{_partition_id}/file.parquet.gz',
's3-access-key',
's3-secret-access-key',
Parquet,
'name String, age Int, time DateTime'
) PARTITION BY concat(
formatDateTime(time, '%Y'),
'/',
formatDateTime(time, '%m')
)
SELECT
name,
age,
time
FROM
sample_data

Query id: 55adcf22-f6af-491e-b697-d09694bbcc56

Ok.

0 rows in set. Elapsed: 15.579 sec. Processed 10.00 thousand rows, 219.93 KB (641.87 rows/s., 14.12 KB/s.)
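To see which object key a given row ends up in, here is a small Python sketch that mirrors the PARTITION BY expression and the {_partition_id} substitution in the query above. It is only an illustration of the path logic, not ClickHouse code; the function names are hypothetical. Python's %Y and %m are used as stand-ins for the same formatDateTime specifiers.

```python
from datetime import datetime

def partition_id(ts: datetime) -> str:
    # Mirrors concat(formatDateTime(time, '%Y'), '/', formatDateTime(time, '%m')).
    return ts.strftime("%Y") + "/" + ts.strftime("%m")

def object_key(ts: datetime) -> str:
    # Mirrors the {_partition_id} placeholder in the s3() URL path.
    template = "my_bucket/{_partition_id}/file.parquet.gz"
    return template.replace("{_partition_id}", partition_id(ts))

key = object_key(datetime(2022, 1, 15, 10, 30, 0))
```

Note that %m yields a zero-padded month, so January lands under 2022/01 rather than 2022/1 as drawn in the outline above.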

· 3 min read

ClickHouse has built-in debugging and introspection capabilities. For example, you can get the stack trace of every server thread at runtime by querying the system.stack_trace table:

SELECT
count(),
arrayStringConcat(arrayMap(x -> concat(demangle(addressToSymbol(x)), '\n ', addressToLine(x)), trace), '\n') AS sym
FROM system.stack_trace
GROUP BY trace
ORDER BY count() DESC
LIMIT 10
FORMAT Vertical
SETTINGS allow_introspection_functions = 1;

The query result will show the locations in the ClickHouse source code where the threads are running or waiting. (You will need to set allow_introspection_functions to 1 to enable the introspection functions.) The response looks like:

Row 1:
──────
count(): 144
sym: pthread_cond_wait

DB::BackgroundSchedulePool::threadFunction()
/usr/bin/clickhouse

/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 2:
──────
count(): 80
sym: pthread_cond_wait

std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)
/usr/bin/clickhouse
DB::MergeTreeBackgroundExecutor<DB::OrdinaryRuntimeQueue>::threadFunction()
/usr/bin/clickhouse
ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__1::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>)
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 3:
──────
count(): 55
sym: pthread_cond_wait

ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__1::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>)
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 4:
──────
count(): 16
sym:

DB::AsynchronousInsertQueue::processBatchDeadlines(unsigned long)
/usr/bin/clickhouse

/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 5:
──────
count(): 16
sym: pthread_cond_wait

std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)
/usr/bin/clickhouse
DB::MergeTreeBackgroundExecutor<DB::MergeMutateRuntimeQueue>::threadFunction()
/usr/bin/clickhouse
ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__1::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>)
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 6:
──────
count(): 10
sym: poll

Poco::Net::SocketImpl::pollImpl(Poco::Timespan&, int)
/usr/bin/clickhouse
Poco::Net::SocketImpl::poll(Poco::Timespan const&, int)
/usr/bin/clickhouse
Poco::Net::TCPServer::run()
/usr/bin/clickhouse
Poco::ThreadImpl::runnableEntry(void*)
/usr/bin/clickhouse


clone


Row 7:
──────
count(): 9
sym: pthread_cond_wait

ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 8:
──────
count(): 7
sym: poll

Poco::Net::SocketImpl::pollImpl(Poco::Timespan&, int)
/usr/bin/clickhouse
Poco::Net::SocketImpl::poll(Poco::Timespan const&, int)
/usr/bin/clickhouse
DB::ReadBufferFromPocoSocket::poll(unsigned long) const
/usr/bin/clickhouse
DB::TCPHandler::runImpl()
/usr/bin/clickhouse
DB::TCPHandler::run()
/usr/bin/clickhouse

/usr/bin/clickhouse
Poco::Net::TCPServerConnection::start()
/usr/bin/clickhouse
Poco::Net::TCPServerDispatcher::run()
/usr/bin/clickhouse
Poco::PooledThread::run()
/usr/bin/clickhouse
Poco::ThreadImpl::runnableEntry(void*)
/usr/bin/clickhouse


clone


Row 9:
───────
count(): 3
sym: pthread_cond_wait

Poco::EventImpl::waitImpl()
/usr/bin/clickhouse
DB::DDLWorker::runCleanupThread()
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 10:
───────
count(): 3
sym: pthread_cond_wait

Poco::EventImpl::waitImpl()
/usr/bin/clickhouse
DB::DDLWorker::runMainThread()
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


10 rows in set. Elapsed: 0.026 sec.
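The aggregation in the query is simple: identical stack traces are grouped together, counted, and the ten most common are shown. A minimal Python sketch of that grouping, using made-up address tuples in place of real traces (the addresses and the helper name are hypothetical):

```python
from collections import Counter

def top_traces(traces, limit=10):
    """Group identical traces and return the most common ones with their counts."""
    # Equivalent in spirit to GROUP BY trace ORDER BY count() DESC LIMIT n.
    return Counter(traces).most_common(limit)

# Each trace is a tuple of (made-up) return addresses, one tuple per thread.
sample = [(0x10, 0x20), (0x10, 0x20), (0x30,), (0x10, 0x20), (0x30,), (0x40, 0x50)]
summary = top_traces(sample, limit=2)
```

A high count for one trace usually just means many threads of the same pool are idle in the same place, as with the pthread_cond_wait rows above.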
note

If you installed ClickHouse from a .deb/.rpm/.tgz you can also install the package with the debug info to see the line numbers from the source code:

sudo apt install clickhouse-common-static-dbg

If you've installed ClickHouse as a single binary, it already contains the debug info.

tip

For more high-level information, check out the other system tables; many of them contain handy information as well.

· 2 min read

Using INTO OUTFILE Clause

Add an INTO OUTFILE clause to your query.

For example:

SELECT * FROM table INTO OUTFILE 'file'

By default, ClickHouse uses the file extension of the filename to determine the output format and compression. For example, all of the rows in nyc_taxi will be exported to taxi_rides.parquet using the Parquet format:

SELECT *
FROM nyc_taxi
INTO OUTFILE 'taxi_rides.parquet'

And the following file will be a compressed, tab-separated file:

SELECT *
FROM nyc_taxi
INTO OUTFILE 'taxi_rides.tsv.gz'

If ClickHouse cannot determine the format from the file extension, the output format defaults to TabSeparated. To specify the output format explicitly, use the FORMAT clause.

For example:

SELECT *
FROM nyc_taxi
INTO OUTFILE 'taxi_rides.txt'
FORMAT CSV
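The selection rules described above can be summarized in a short Python sketch. This is not ClickHouse's implementation: the two lookup tables are a small illustrative subset chosen for this example (an assumption), and the function name infer_output is hypothetical. It only shows the order of decisions: compression suffix first, then format extension, then the TabSeparated fallback, with an explicit FORMAT always winning.

```python
import os

# Illustrative subsets only; ClickHouse recognizes many more formats and codecs.
COMPRESSION_BY_EXT = {".gz": "gzip", ".zst": "zstd", ".xz": "xz"}
FORMAT_BY_EXT = {".parquet": "Parquet", ".csv": "CSV", ".tsv": "TabSeparated"}

def infer_output(filename, explicit_format=None):
    """Return (format, compression) for an INTO OUTFILE target."""
    base, ext = os.path.splitext(filename)
    compression = COMPRESSION_BY_EXT.get(ext.lower())
    if compression is not None:
        # Strip the compression suffix and look at the extension before it.
        base, ext = os.path.splitext(base)
    # An explicit FORMAT clause overrides whatever the extension suggests.
    fmt = explicit_format or FORMAT_BY_EXT.get(ext.lower(), "TabSeparated")
    return fmt, compression

examples = {
    name: infer_output(name)
    for name in ("taxi_rides.parquet", "taxi_rides.tsv.gz", "taxi_rides.txt")
}
```

With this sketch, taxi_rides.txt falls back to TabSeparated unless a format is passed explicitly, which corresponds to the FORMAT CSV example above.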

Using the File table engine

Another option is to use the File table engine, where ClickHouse uses the file to store the data. You can perform queries and inserts directly on the file.

For example:

CREATE TABLE my_table (
x UInt32,
y String,
z DateTime
)
ENGINE = File(Parquet)

Insert a few rows:

INSERT INTO my_table VALUES
(1, 'Hello', now()),
(2, 'World', now()),
(3, 'Goodbye', now())

The file is stored in the data folder of your ClickHouse server - specifically in /data/default/my_table in a file named data.Parquet.

note

Using the File table engine is incredibly handy for creating and querying files on your file system, but keep in mind that File tables are not MergeTree tables, so you don't get all the benefits that come with MergeTree. Use File for convenience when exporting data out of ClickHouse in convenient formats.

Using Command-Line Redirection

$ clickhouse-client --query "SELECT * from table" --format FormatName > result.txt

See clickhouse-client.
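If you want to drive the same redirection from a script, a minimal Python sketch could look like the following. It assumes clickhouse-client is on the PATH and can connect with its default settings; the helper names are hypothetical, and only the --query and --format options shown above are used.

```python
import subprocess

def build_export_command(query: str, fmt: str) -> list:
    """Build the argument list for clickhouse-client (no shell quoting needed)."""
    return ["clickhouse-client", "--query", query, "--format", fmt]

def export_to_file(query: str, fmt: str, path: str) -> None:
    # Equivalent to: clickhouse-client --query "..." --format Fmt > path
    with open(path, "wb") as out:
        subprocess.run(build_export_command(query, fmt), stdout=out, check=True)

command = build_export_command("SELECT * FROM nyc_taxi", "CSV")
```

Passing the arguments as a list avoids shell quoting problems when the query itself contains quotes.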

· 3 min read

Below are some basics of using the s3 table function to read Parquet files.

  • Create access and secret keys for an IAM service user. Normal login users usually don't work, since they may have been configured with an MFA policy.

  • Set the permissions on the policy to allow the service user to access the bucket and folders.

The following is a very simple example that you can use to test the mechanics of accessing your parquet files successfully prior to applying to your actual data.

If you need an example of creating a user and bucket, you can follow the first two sections (create user and create bucket): https://clickhouse.com/docs/en/guides/sre/configuring-s3-for-clickhouse-use/

I used this sample file: https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet and uploaded it to my test bucket.

You can set a policy like this on the bucket. Adjust it as needed: this one is fairly open in terms of privileges, which helps in testing, and you can narrow the permissions as necessary.

{
  "Version": "2012-10-17",
  "Id": "Policy123456",
  "Statement": [
    {
      "Sid": "abc123",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::1234567890:user/mars-s3-user"
        ]
      },
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::mars-doc-test",
        "arn:aws:s3:::mars-doc-test/*"
      ]
    }
  ]
}
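Once the test works, you may want something narrower than s3:*. The following Python sketch builds a read-only variant of the policy above. It is only a starting point under the assumption that ClickHouse just needs to list the bucket and read objects; the function name and the Sid values are made up for this example, and your setup may need additional actions.

```python
import json

def read_only_bucket_policy(principal_arn: str, bucket: str) -> dict:
    """Build a bucket policy that allows listing the bucket and reading objects."""
    principal = {"AWS": [principal_arn]}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBucket",  # hypothetical statement id
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:ListBucket",  # applies to the bucket itself
                "Resource": "arn:aws:s3:::" + bucket,
            },
            {
                "Sid": "ReadObjects",  # hypothetical statement id
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetObject",  # applies to the objects in the bucket
                "Resource": "arn:aws:s3:::" + bucket + "/*",
            },
        ],
    }

policy = read_only_bucket_policy(
    "arn:aws:iam::1234567890:user/mars-s3-user", "mars-doc-test"
)
policy_json = json.dumps(policy, indent=2)
```

Listing is granted on the bucket ARN and reading on the object ARNs, which is why the two statements use different Resource values.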

You can run queries with this type of syntax using the s3 table function: https://clickhouse.com/docs/en/sql-reference/table-functions/s3/

clickhouse-cloud :)  select count(*) from s3('https://mars-doc-test.s3.amazonaws.com/s3-parquet-test/userdata1.parquet','ABC123', 'abc+123', 'Parquet', 'first_name String');

SELECT count(*)
FROM s3('https://mars-doc-test.s3.amazonaws.com/s3-parquet-test/userdata1.parquet', 'ABC123', 'abc+123', 'Parquet', 'first_name String')

Query id: fd4f1193-d604-4ac0-9a46-bdd2d5e14727

┌─count()─┐
│ 1000 │
└─────────┘

1 row in set. Elapsed: 1.274 sec. Processed 1.00 thousand rows, 14.64 KB (784.81 rows/s., 11.49 KB/s.)

The data types reference for the Parquet format is here: https://clickhouse.com/docs/en/interfaces/formats/#data-format-parquet

To bring the data into a native ClickHouse table:

Create the table, something like this (I just chose a couple of the columns in the Parquet file):

clickhouse-cloud :) CREATE TABLE my_parquet_table (id UInt64, first_name String) ENGINE = MergeTree ORDER BY id;

CREATE TABLE my_parquet_table
(
`id` UInt64,
`first_name` String
)
ENGINE = MergeTree
ORDER BY id

Query id: 412e3994-bf8e-444e-ac43-a7c82642b7da

Ok.

0 rows in set. Elapsed: 0.600 sec.

Select the data from the S3 bucket to insert into the new table:

clickhouse-cloud :) INSERT INTO my_parquet_table (id, first_name) SELECT id, first_name FROM s3('https://mars-doc-test.s3.amazonaws.com/s3-parquet-test/userdata1.parquet', 'ABC123','abc+123', 'Parquet', 'id UInt64, first_name String') FORMAT Parquet

INSERT INTO my_parquet_table (id, first_name) SELECT
id,
first_name
FROM s3('https://mars-doc-test.s3.amazonaws.com/s3-parquet-test/userdata1.parquet', 'ABC123', 'abc+123', 'Parquet', 'id UInt64, first_name String')

Query id: c3cdc871-f338-462d-8797-6751b45a0b58

Ok.

0 rows in set. Elapsed: 1.220 sec. Processed 1.00 thousand rows, 22.64 KB (819.61 rows/s., 18.56 KB/s.)

Verify the import:

clickhouse-cloud :) SELECT * FROM my_parquet_table LIMIT 10;

SELECT *
FROM my_parquet_table
LIMIT 10

Query id: 1ccf59dd-d804-46a9-aadd-ed5c57b9e1a0

┌─id─┬─first_name─┐
│ 1 │ Amanda │
│ 2 │ Albert │
│ 3 │ Evelyn │
│ 4 │ Denise │
│ 5 │ Carlos │
│ 6 │ Kathryn │
│ 7 │ Samuel │
│ 8 │ Harry │
│ 9 │ Jose │
│ 10 │ Emily │
└────┴────────────┘

When you are ready to import your real data, you can use special syntax such as wildcards and ranges to specify the folders, subfolders and files in your bucket. I recommend filtering on a few directories and files to test the import first, for example a certain year, a couple of months and some date range.

Besides the path options described here, the newly released ** syntax specifies all subdirectories recursively: https://clickhouse.com/docs/en/sql-reference/table-functions/s3/

For example, assuming the paths and bucket structure look something like this:
https://your_s3_bucket.s3.amazonaws.com/<your_folder>/<year>/<month>/<day>/<filename>.parquet
https://mars-doc-test.s3.amazonaws.com/system_logs/2022/11/01/my-app-logs-0001.parquet

This would get all files for the 1st day of every month in 2021-2022 (numeric ranges are written as {N..M}):
https://mars-doc-test.s3.amazonaws.com/system_logs/{2021..2022}/**/01/*.parquet
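Before pointing a wildcard path at a large bucket, it can help to check which keys a pattern is expected to select. The Python sketch below approximates three of the wildcards documented for the s3 table function: * (any characters except /), ** (any characters including /), and the {N..M} numeric range or {a,b,c} alternatives. It is an approximation for dry runs only, not ClickHouse's matcher, and the function name is hypothetical.

```python
import re

def glob_to_regex(pattern: str) -> "re.Pattern":
    """Translate a simplified s3-style glob into a compiled regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")          # any characters, including '/'
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")       # any characters within one path segment
            i += 1
        elif pattern[i] == "{":
            end = pattern.index("}", i)
            body = pattern[i + 1:end]
            rng = re.fullmatch(r"(\d+)\.\.(\d+)", body)
            if rng:                   # {N..M}: every integer in the range
                options = [str(n) for n in range(int(rng.group(1)), int(rng.group(2)) + 1)]
            else:                     # {a,b,c}: explicit alternatives
                options = body.split(",")
            out.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
            i = end + 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))

matcher = glob_to_regex("system_logs/{2021..2022}/**/01/*.parquet")
keys = [
    "system_logs/2022/11/01/my-app-logs-0001.parquet",
    "system_logs/2022/11/02/my-app-logs-0001.parquet",
    "system_logs/2020/11/01/my-app-logs-0001.parquet",
]
selected = [k for k in keys if matcher.fullmatch(k)]
```

Here only the first key is selected: the second is not the 1st day of the month, and the third falls outside the year range.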