Skip to main content

· One min read

This is just a short example that illustrates the use of JSONExtract functions.

Create a table:

CREATE TABLE default.json_extract_example
(
`rawJSON` String EPHEMERAL,
`a1` String DEFAULT JSONExtractString(rawJSON, 'a1'),
`a2` Boolean DEFAULT JSONExtractBool(rawJSON, 'a2'),
`a3.aa1` Float DEFAULT JSONExtractFloat(JSONExtractRaw(rawJSON, 'a3'), 'aa1'),
`a3.aa2` UInt8 DEFAULT JSONExtractUInt(JSONExtractRaw(rawJSON, 'a3'), 'aa2')
)
ENGINE = MergeTree
ORDER BY (a1, a2)

Add your JSON raw string:

INSERT INTO default.json_extract_example (rawJSON) VALUES ('{"a1": "XX", "a2": true, "a3":{"aa1":23.11,"aa2":12}}');

Query your data:

SELECT *
FROM json_extract_example
FORMAT Pretty

Yields:

┏━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ a1 ┃ a2 ┃ a3.aa1 ┃ a3.aa2 ┃
┡━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ XX │ true │ 23.11 │ 12 │
└────┴──────┴────────┴────────┘

Each stored as the original JSON type:

SELECT
toTypeName(a1),
toTypeName(a2),
toTypeName(a3.aa1),
toTypeName(a3.aa2)
FROM default.json_extract_example
FORMAT Pretty
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ toTypeName(a1) ┃ toTypeName(a2) ┃ toTypeName(a3.aa1) ┃ toTypeName(a3.aa2) ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ String │ Bool │ Float32 │ UInt8 │
└────────────────┴────────────────┴────────────────────┴────────────────────┘

· 2 min read

The short answer is “no”. The key-value workload is among top positions in the list of cases when NOT to use ClickHouse. It’s an OLAP system after all, while there are many excellent key-value storage systems out there.

However, there might be situations where it still makes sense to use ClickHouse for key-value-like queries. Usually, it’s some low-budget products where the main workload is analytical in nature and fits ClickHouse well, but there’s also some secondary process that needs a key-value pattern with not so high request throughput and without strict latency requirements. If you had an unlimited budget, you would have installed a secondary key-value database for this secondary workload, but in reality, there’s an additional cost of maintaining one more storage system (monitoring, backups, etc.) which might be desirable to avoid.

If you decide to go against recommendations and run some key-value-like queries against ClickHouse, here are some tips:

  • The key reason why point queries are expensive in ClickHouse is its sparse primary index of main MergeTree table engine family. This index can’t point to each specific row of data, instead, it points to each N-th and the system has to scan from the neighboring N-th row to the desired one, reading excessive data along the way. In a key-value scenario, it might be useful to reduce the value of N with the index_granularity setting.
  • ClickHouse keeps each column in a separate set of files, so to assemble one complete row it needs to go through each of those files. Their count increases linearly with the number of columns, so in the key-value scenario, it might be worth avoiding using many columns and put all your payload in a single String column encoded in some serialization format like JSON, Protobuf, or whatever makes sense.
  • There’s an alternative approach that uses Join table engine instead of normal MergeTree tables and joinGet function to retrieve the data. It can provide better query performance but might have some usability and reliability issues. Here’s an usage example.

· 2 min read

We can refer to systems like MapReduce as distributed computing systems in which the reduce operation is based on distributed sorting. The most common open-source solution in this class is Apache Hadoop.

These systems aren’t appropriate for online queries due to their high latency. In other words, they can’t be used as the back-end for a web interface. These types of systems aren’t useful for real-time data updates. Distributed sorting isn’t the best way to perform reduce operations if the result of the operation and all the intermediate results (if there are any) are located in the RAM of a single server, which is usually the case for online queries. In such a case, a hash table is an optimal way to perform reduce operations. A common approach to optimizing map-reduce tasks is pre-aggregation (partial reduce) using a hash table in RAM. The user performs this optimization manually. Distributed sorting is one of the main causes of reduced performance when running simple map-reduce tasks.

Most MapReduce implementations allow you to execute arbitrary code on a cluster. But a declarative query language is better suited to OLAP to run experiments quickly. For example, Hadoop has Hive and Pig. Also consider Cloudera Impala or Shark (outdated) for Spark, as well as Spark SQL, Presto, and Apache Drill. Performance when running such tasks is highly sub-optimal compared to specialized systems, but relatively high latency makes it unrealistic to use these systems as the backend for a web interface.

· One min read

We recommend having a maximum of 1000 databases and 5000 tables, 50000 partitions, and 100000 parts across all databases for a service. Databases in ClickHouse are more of a namespace and have no performance impact; 1000 databases is a loose guideline. However, the number of tables would affect the service startup time, so we recommend limiting the number of tables or partitions. ClickHouse provides a warning if these thresholds are hit.

These limits are also applicable to ClickHouse Cloud.

· 2 min read

This question usually arises when people see official ClickHouse t-shirts. They have large words “ClickHouse не тормозит” on the front.

Before ClickHouse became open-source, it has been developed as an in-house storage system by the largest Russian IT company, Yandex. That’s why it initially got its slogan in Russian, which is “не тормозит” (pronounced as “ne tormozit”). After the open-source release we first produced some of those t-shirts for events in Russia and it was a no-brainer to use the slogan as-is.

One of the following batches of those t-shirts was supposed to be given away on events outside of Russia and we tried to make the English version of the slogan. Unfortunately, the Russian language is kind of elegant in terms of expressing stuff and there was a restriction of limited space on a t-shirt, so we failed to come up with good enough translation (most options appeared to be either long or inaccurate) and decided to keep the slogan in Russian even on t-shirts produced for international events. It appeared to be a great decision because people all over the world get positively surprised and curious when they see it.

So, what does it mean? Here are some ways to translate “не тормозит”:

  • If you translate it literally, it’d be something like “ClickHouse does not press the brake pedal”.
  • If you’d want to express it as close to how it sounds to a Russian person with IT background, it’d be something like “If your larger system lags, it’s not because it uses ClickHouse”.
  • Shorter, but not so precise versions could be “ClickHouse is not slow”, “ClickHouse does not lag” or just “ClickHouse is fast”.

If you haven’t seen one of those t-shirts in person, you can check them out online in many ClickHouse-related videos. For example, this one:

P.S. These t-shirts are not for sale, they are given away for free on most ClickHouse Meetups, usually for best questions or other forms of active participation.

· 3 min read

OLAP stands for Online Analytical Processing. It is a broad term that can be looked at from two perspectives: technical and business. But at the very high level, you can just read these words backward:

Processing : Some source data is processed…

Analytical : …to produce some analytical reports and insights…

Online : …in real-time.

OLAP from the Business Perspective

In recent years, business people started to realize the value of data. Companies who make their decisions blindly, more often than not fail to keep up with the competition. The data-driven approach of successful companies forces them to collect all data that might be remotely useful for making business decisions and need mechanisms to timely analyze them. Here’s where OLAP database management systems (DBMS) come in.

In a business sense, OLAP allows companies to continuously plan, analyze, and report operational activities, thus maximizing efficiency, reducing expenses, and ultimately conquering the market share. It could be done either in an in-house system or outsourced to SaaS providers like web/mobile analytics services, CRM services, etc. OLAP is the technology behind many BI applications (Business Intelligence).

ClickHouse is an OLAP database management system that is pretty often used as a backend for those SaaS solutions for analyzing domain-specific data. However, some businesses are still reluctant to share their data with third-party providers and an in-house data warehouse scenario is also viable.

OLAP from the Technical Perspective

All database management systems could be classified into two groups: OLAP (Online Analytical Processing) and OLTP (Online Transactional Processing). Former focuses on building reports, each based on large volumes of historical data, but doing it not so frequently. While the latter usually handle a continuous stream of transactions, constantly modifying the current state of data.

In practice OLAP and OLTP are not categories, it’s more like a spectrum. Most real systems usually focus on one of them but provide some solutions or workarounds if the opposite kind of workload is also desired. This situation often forces businesses to operate multiple storage systems integrated, which might be not so big deal but having more systems make it more expensive to maintain. So the trend of recent years is HTAP (Hybrid Transactional/Analytical Processing) when both kinds of the workload are handled equally well by a single database management system.

Even if a DBMS started as a pure OLAP or pure OLTP, they are forced to move towards that HTAP direction to keep up with their competition. And ClickHouse is no exception, initially, it has been designed as fast-as-possible OLAP system and it still does not have full-fledged transaction support, but some features like consistent read/writes and mutations for updating/deleting data had to be added.

The fundamental trade-off between OLAP and OLTP systems remains:

  • To build analytical reports efficiently it’s crucial to be able to read columns separately, thus most OLAP databases are columnar,
  • While storing columns separately increases costs of operations on rows, like append or in-place modification, proportionally to the number of columns (which can be huge if the systems try to collect all details of an event just in case). Thus, most OLTP systems store data arranged by rows.

· 6 min read

First of all, let’s discuss why people ask this question in the first place. There are two key reasons:

  1. ClickHouse is developed with pretty high velocity, and usually there are 10+ stable releases per year. That makes a wide range of releases to choose from, which is not so trivial of a choice.
  2. Some users want to avoid spending time figuring out which version works best for their use case and just follow someone else’s advice.

The second reason is more fundamental, so we’ll start with that one and then get back to navigating through various ClickHouse releases.

Which ClickHouse Version Do You Recommend?

It’s tempting to hire consultants or trust some known experts to get rid of responsibility for your production environment. You install some specific ClickHouse version that someone else recommended; if there’s some issue with it - it’s not your fault, it’s someone else’s. This line of reasoning is a big trap. No external person knows better than you what’s going on in your company’s production environment.

So how do you properly choose which ClickHouse version to upgrade to? Or how do you choose your first ClickHouse version? First of all, you need to invest in setting up a realistic pre-production environment. In an ideal world, it could be a completely identical shadow copy, but that’s usually expensive.

Here are some key points to get reasonable fidelity in a pre-production environment with not-so-high costs:

  • Pre-production environment needs to run an as close of a set of queries as you intend to run in production:
    • Don’t make it read-only with some frozen data.
    • Don’t make it write-only with just copying data without building some typical reports.
    • Don’t wipe it clean instead of applying schema migrations.
  • Use a sample of real production data and queries. Try to choose a sample that’s still representative and makes SELECT queries return reasonable results. Use obfuscation if your data is sensitive and internal policies do not allow it to leave the production environment.
  • Make sure that pre-production is covered by your monitoring and alerting software the same way as your production environment does.
  • If your production spans across multiple datacenters or regions, make your pre-production do the same.
  • If your production uses complex features like replication, distributed tables and cascading materialized views, make sure they are configured similarly in pre-production.
  • There’s a trade-off on using the roughly same number of servers or VMs in pre-production as in production but of smaller size, or much less of them but of the same size. The first option might catch extra network-related issues, while the latter is easier to manage.

The second area to invest in is automated testing infrastructure. Don’t assume that if some kind of query has executed successfully once, it’ll continue to do so forever. It’s OK to have some unit tests where ClickHouse is mocked, but make sure your product has a reasonable set of automated tests that are run against real ClickHouse and check that all important use cases are still working as expected.

An extra step forward could be contributing those automated tests to ClickHouse’s open-source test infrastructure that are continuously used in its day-to-day development. It definitely will take some additional time and effort to learn how to run it and then how to adapt your tests to this framework, but it’ll pay off by ensuring that ClickHouse releases are already tested against them when they are announced stable, instead of repeatedly losing time on reporting the issue after the fact and then waiting for a bugfix to be implemented, backported and released. Some companies even have such test contributions to infrastructure by its use as an internal policy, (called Beyonce’s Rule at Google).

When you have your pre-production environment and testing infrastructure in place, choosing the best version is straightforward:

  1. Routinely run your automated tests against new ClickHouse releases. You can do it even for ClickHouse releases that are marked as testing, but going forward to the next steps with them is not recommended.
  2. Deploy the ClickHouse release that passed the tests to pre-production and check that all processes are running as expected.
  3. Report any issues you discovered to ClickHouse GitHub Issues.
  4. If there were no major issues, it should be safe to start deploying ClickHouse release to your production environment. Investing in gradual release automation that implements an approach similar to canary releases or green-blue deployments might further reduce the risk of issues in production.

As you might have noticed, there’s nothing specific to ClickHouse in the approach described above - people do that for any piece of infrastructure they rely on if they take their production environment seriously.

How to Choose Between ClickHouse Releases?

If you look into the contents of the ClickHouse package repository, you’ll see two kinds of packages:

  1. stable
  2. lts (long-term support)

Here is some guidance on how to choose between them:

  • stable is the kind of package we recommend by default. They are released roughly monthly (and thus provide new features with reasonable delay) and three latest stable releases are supported in terms of diagnostics and backporting of bugfixes.
  • lts are released twice a year and are supported for a year after their initial release. You might prefer them over stable in the following cases:
    • Your company has some internal policies that do not allow for frequent upgrades or using non-LTS software.
    • You are using ClickHouse in some secondary products that either do not require any complex ClickHouse features or do not have enough resources to keep it updated.

Many teams who initially think that lts is the way to go often switch to stable anyway because of some recent feature that’s important for their product.

danger

One more thing to keep in mind when upgrading ClickHouse: we’re always keeping an eye on compatibility across releases, but sometimes it’s not reasonable to keep and some minor details might change. So make sure you check the changelog before upgrading to see if there are any notes about backward-incompatible changes.

· 2 min read

Note: Please see the blog Working with Time series data in ClickHouse for additional examples of using ClickHouse for time series analysis.

ClickHouse is a generic data storage solution for OLAP workloads, while there are many specialized time-series database management systems. Nevertheless, ClickHouse’s focus on query execution speed allows it to outperform specialized systems in many cases. There are many independent benchmarks on this topic out there, so we’re not going to conduct one here. Instead, let’s focus on ClickHouse features that are important to use if that’s your use case.

First of all, there are specialized codecs which make typical time-series. Either common algorithms like DoubleDelta and Gorilla or specific to ClickHouse like T64.

Second, time-series queries often hit only recent data, like one day or one week old. It makes sense to use servers that have both fast nVME/SSD drives and high-capacity HDD drives. ClickHouse TTL feature allows to configure keeping fresh hot data on fast drives and gradually move it to slower drives as it ages. Rollup or removal of even older data is also possible if your requirements demand it.

Even though it’s against ClickHouse philosophy of storing and processing raw data, you can use materialized views to fit into even tighter latency or costs requirements.

· 2 min read

Being an open-source product makes this question not so straightforward to answer. You do not have to tell anyone if you want to start using ClickHouse, you just go grab source code or pre-compiled packages. There’s no contract to sign and the Apache 2.0 license allows for unconstrained software distribution.

Also, the technology stack is often in a grey zone of what’s covered by an NDA. Some companies consider technologies they use as a competitive advantage even if they are open-source and do not allow employees to share any details publicly. Some see some PR risks and allow employees to share implementation details only with their PR department approval.

So how to tell who is using ClickHouse?

One way is to ask around. If it’s not in writing, people are much more willing to share what technologies are used in their companies, what the use cases are, what kind of hardware is used, data volumes, etc. We’re talking with users regularly on ClickHouse Meetups all over the world and have heard stories about 1000+ companies that use ClickHouse. Unfortunately, that’s not reproducible and we try to treat such stories as if they were told under NDA to avoid any potential troubles. But you can come to any of our future meetups and talk with other users on your own. There are multiple ways how meetups are announced, for example, you can subscribe to our Twitter.

The second way is to look for companies publicly saying that they use ClickHouse. It’s more substantial because there’s usually some hard evidence like a blog post, talk video recording, slide deck, etc. We collect the collection of links to such evidence on our Adopters page. Feel free to contribute the story of your employer or just some links you’ve stumbled upon (but try not to violate your NDA in the process).

You can find names of very large companies in the adopters list, like Bloomberg, Cisco, China Telecom, Tencent, or Uber, but with the first approach, we found that there are many more. For example, if you take the list of largest IT companies by Forbes (2020) over half of them are using ClickHouse in some way. Also, it would be unfair not to mention Yandex, the company which initially open-sourced ClickHouse in 2016 and happens to be one of the largest IT companies in Europe.

· 4 min read

It was designed to be fast. Query execution performance has always been a top priority during the development process, but other important characteristics like user-friendliness, scalability, and security were also considered so ClickHouse could become a real production system.

"Building for Fast", Alexey Milovidov (CTO, ClickHouse)

"Building for Fast" talk from ClickHouse Meetup Amsterdam, June 2022.

"Secrets of ClickHouse Performance Optimizations" talk from Big Data Technology Conference, December 2019, offers a more technical take on the same topic.

What Makes ClickHouse so Fast?

Architecture choices

ClickHouse was initially built as a prototype to do just a single task well: to filter and aggregate data as fast as possible. That’s what needs to be done to build a typical analytical report, and that’s what a typical GROUP BY query does. The ClickHouse team has made several high-level decisions that, when combined, made achieving this task possible:

Column-oriented storage: Source data often contain hundreds or even thousands of columns, while a report can use just a few of them. The system needs to avoid reading unnecessary columns to avoid expensive disk read operations.

Indexes: Memory resident ClickHouse data structures allow the reading of only the necessary columns, and only the necessary row ranges of those columns.

Data compression: Storing different values of the same column together often leads to better compression ratios (compared to row-oriented systems) because in real data a column often has the same, or not so many different, values for neighboring rows. In addition to general-purpose compression, ClickHouse supports specialized codecs that can make data even more compact.

Vectorized query execution: ClickHouse not only stores data in columns but also processes data in columns. This leads to better CPU cache utilization and allows for SIMD CPU instructions usage.

Scalability: ClickHouse can leverage all available CPU cores and disks to execute even a single query. Not only on a single server but all CPU cores and disks of a cluster as well.

Attention to Low-Level Details

But many other database management systems use similar techniques. What really makes ClickHouse stand out is attention to low-level details. Most programming languages provide implementations for most common algorithms and data structures, but they tend to be too generic to be effective. Every task can be considered as a landscape with various characteristics, instead of just throwing in random implementation. For example, if you need a hash table, here are some key questions to consider:

  • Which hash function to choose?
  • Collision resolution algorithm: open addressing vs chaining?
  • Memory layout: one array for keys and values or separate arrays? Will it store small or large values?
  • Fill factor: when and how to resize? How to move values around on resize?
  • Will values be removed and which algorithm will work better if they will?
  • Will we need fast probing with bitmaps, inline placement of string keys, support for non-movable values, prefetch, and batching?

Hash table is a key data structure for GROUP BY implementation and ClickHouse automatically chooses one of 30+ variations for each specific query.

The same goes for algorithms, for example, in sorting you might consider:

  • What will be sorted: an array of numbers, tuples, strings, or structures?
  • Is all data available completely in RAM?
  • Do we need a stable sort?
  • Do we need a full sort? Maybe partial sort or n-th element will suffice?
  • How to implement comparisons?
  • Are we sorting data that has already been partially sorted?

Algorithms that they rely on characteristics of data they are working with can often do better than their generic counterparts. If it is not really known in advance, the system can try various implementations and choose the one that works best in runtime. For example, see an article on how LZ4 decompression is implemented in ClickHouse.

Last but not least, the ClickHouse team always monitors the Internet on people claiming that they came up with the best implementation, algorithm, or data structure to do something and tries it out. Those claims mostly appear to be false, but from time to time you’ll indeed find a gem.

Tips for building your own high-performance software
  • Keep in mind low-level details when designing your system.
  • Design based on hardware capabilities.
  • Choose data structures and abstractions based on the needs of the task.
  • Provide specializations for special cases.
  • Try new, "best" algorithms, that you read about yesterday.
  • Choose an algorithm in runtime based on statistics.
  • Benchmark on real datasets.
  • Test for performance regressions in CI.
  • Measure and observe everything.