As the popularity of ClickHouse has increased over the years, more and more people have come to know its core features, like its incredible performance, its high compression ratio, or its ability to read and write nearly all data formats there are.
But ClickHouse also has a lot of hidden gems that can help in your day-to-day work, which many people don't know about. Even though most of them are well documented in the official documentation, if you don't know you are looking for them, you will not find them.
In this blog post, I’ll highlight some of my favorite small features that more people should know about.
Select modifiers
In column-based DBMSs like ClickHouse, queries like SELECT * FROM table should generally be avoided. At least that's true for regular queries issued by your application. The typical day-to-day work of a DBA or database developer, however, often includes these types of queries, as it would otherwise take a huge amount of time to type out all X columns of a table manually.
But what if you want to run a lot of operations on some columns? Imagine you have the following table:
CREATE TABLE customer (
    customerId UInt64,
    custom_num_1 UInt64,
    custom_num_2 UInt64,
    ...
    custom_num_50 UInt64,
    custom_string_1 String,
    custom_string_2 String,
    ...
    custom_string_50 String
) ENGINE = MergeTree
ORDER BY customerId;
When optimizing your table, you might be interested in the average length of the String columns in your table, as well as the maximum value of your number columns. If you wanted to handwrite a query to collect this information, you would have to write the correct function for each of the 100 columns.
In some databases (like MySQL, but also ClickHouse) you can utilize the INFORMATION_SCHEMA.COLUMNS table to build your query. This can be convenient for a lot of DBAs, as they might already be used to it, but ClickHouse provides an even faster way to achieve your goal: select modifiers.
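For comparison, here is a minimal sketch of the INFORMATION_SCHEMA approach; the use of arrayStringConcat and groupArray to assemble the generated expressions into a single select list is just one way to do it:

SELECT arrayStringConcat(groupArray(
    if(data_type = 'String',
       'max(length(' || column_name || '))',
       'max(' || column_name || ')')), ', ') AS select_list
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'customer'

The result is a comma-separated list of expressions that you would then paste into your actual query.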
Utilizing a combination of modifiers, our task comes down to a simple query:
SELECT
    COLUMNS('.*num.*') APPLY max,
    COLUMNS('.*string.*') APPLY length APPLY max
FROM customer
We are utilizing the COLUMNS modifier to apply a regex to the column names, selecting only the columns whose names contain num or string. On all those columns we apply the function max if the column was in the set of number columns, or we first apply the function length and afterwards the function max. This gives us the wanted results and takes way less time than building a query via the information schema, or writing down 100 columns manually.
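If you want to verify what the modifiers expand to, you can prepend EXPLAIN SYNTAX to the query, and ClickHouse will show you the fully expanded column list:

EXPLAIN SYNTAX
SELECT
    COLUMNS('.*num.*') APPLY max,
    COLUMNS('.*string.*') APPLY length APPLY max
FROM customer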
Manual test data in various formats
Often when I want to help customers or users of ClickHouse in the public Slack channel, there are cases where you need to provide examples in a test table. Sometimes writing the table definition for such a test table and filling it with test data can be even more code than what you want to show in the first place.
As you saw in the previous section, the table I used for explaining select modifiers is longer than the actual code I wanted to present (and it was abbreviated already).
But ClickHouse offers a way to work directly on data as if it were stored in a table: the format table function.
With this function you can specify which format you want to use (like JSONEachRow) and then directly enter the data, instead of inserting it into a table first:
SELECT *
FROM FORMAT(JSONEachRow, '{"customerId":1,"type":1,"custom_num1":4711}\n{"customerId":2, "type":2,"custom_ips":["127.0.0.1","127.0.0.2"]}')
+-customerId-+-type-+-custom_num1-+-custom_ips----------------+
|          1 |    1 |        4711 | []                        |
|          2 |    2 |        NULL | ['127.0.0.1','127.0.0.2'] |
+------------+------+-------------+---------------------------+
You can see it generates a result set with two rows, and even complex types like Arrays work. And you can choose from a large variety of data formats.
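The same trick works with other input formats; for example, a quick sketch with CSV data:

SELECT *
FROM FORMAT(CSV, '1,"alpha"\n2,"beta"')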
Generate series with numbers table function
Some DBMSs can generate number sequences using table functions. ClickHouse is no exception, providing the numbers() table function for this purpose. But did you know that you can just as easily generate date or time sequences with this table function as well?
ClickHouse supports simple arithmetic on Date and DateTime data types, which lets you easily generate sequences of dates or timestamps:
SELECT
    number,
    now() - number AS previousTimes,
    toDate(now()) + number AS nextDays
FROM numbers(10)
+-number-+-------previousTimes-+---nextDays-+
|      0 | 2024-04-25 13:04:30 | 2024-04-25 |
|      1 | 2024-04-25 13:04:29 | 2024-04-26 |
|      2 | 2024-04-25 13:04:28 | 2024-04-27 |
|      3 | 2024-04-25 13:04:27 | 2024-04-28 |
|      4 | 2024-04-25 13:04:26 | 2024-04-29 |
|      5 | 2024-04-25 13:04:25 | 2024-04-30 |
|      6 | 2024-04-25 13:04:24 | 2024-05-01 |
|      7 | 2024-04-25 13:04:23 | 2024-05-02 |
|      8 | 2024-04-25 13:04:22 | 2024-05-03 |
|      9 | 2024-04-25 13:04:21 | 2024-05-04 |
+--------+---------------------+------------+
By applying multiplication, you can also introduce step sizes, etc. As the final data type of the columns will still be Date or DateTime, you can be sure that only valid dates are generated.
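For example, here is a minimal sketch of a weekly sequence, using a step size of 7 days:

SELECT toDate(now()) + (number * 7) AS nextWeeks
FROM numbers(4)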
Data formatting in custom columns
Sometimes you need to partially format your data on output, for example when you want to push data into a streaming service like Kafka. Typically, you need some columns as direct columns, but you also want to combine the data of other columns into a payload column in a given format (typically JSON).
Of course, you can do a lot of this in other DBMSs as well: use string concatenation to build your JSON manually, use specific JSON functions like toJSONString, or manually create JSON objects, etc.
But ClickHouse also has you covered here, with the function formatRowNoNewline(). This function allows you to format an arbitrary number of columns into any of the output formats ClickHouse has to offer.
And of course you can also use Select Modifiers to specify which columns to format:
SELECT
    customerId,
    formatRowNoNewline('JSONEachRow', COLUMNS('.*num.*')) AS payload
FROM customer
LIMIT 10
+-customerId-+-payload------------------------------------------------------------------------------+
|         20 | {"custom_num_1":"4503644724578621668","custom_num_2":"156","custom_num_50":"32624"}  |
|        111 | {"custom_num_1":"9395348731023764754","custom_num_2":"4","custom_num_50":"8919"}     |
|        187 | {"custom_num_1":"4410745110154105282","custom_num_2":"67","custom_num_50":"19015"}   |
|        231 | {"custom_num_1":"8206799308850873985","custom_num_2":"151","custom_num_50":"43260"}  |
|        262 | {"custom_num_1":"14904510309594397590","custom_num_2":"83","custom_num_50":"2802"}   |
|        375 | {"custom_num_1":"14468162691130275987","custom_num_2":"13","custom_num_50":"6910"}   |
|        388 | {"custom_num_1":"15191470301382236130","custom_num_2":"110","custom_num_50":"39490"} |
|        434 | {"custom_num_1":"11648353314185268442","custom_num_2":"211","custom_num_50":"52725"} |
|        439 | {"custom_num_1":"8430391284904487000","custom_num_2":"31","custom_num_50":"43376"}   |
|        468 | {"custom_num_1":"11265143749859112447","custom_num_2":"41","custom_num_50":"58748"}  |
+------------+--------------------------------------------------------------------------------------+
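And since the format is just a parameter, producing, say, CSV payloads instead is a one-word change (a quick sketch):

SELECT
    customerId,
    formatRowNoNewline('CSV', COLUMNS('.*num.*')) AS payload
FROM customer
LIMIT 10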
Querying the whole cluster
Sometimes querying a single node is not enough. Imagine you are looking for queries that run longer than a specific threshold. You can find this information in ClickHouse's system.query_log table, but you would have to check every host separately.
But again, ClickHouse has you covered. The table function clusterAllReplicas allows you to execute a query on all nodes of a cluster and get the combined result, as if it came from a single local table:
SELECT
    user,
    substring(query, 1, 15) AS query_part,
    query_duration_ms
FROM clusterAllReplicas('mytestcluster', system, query_log)
WHERE query_duration_ms > 50
LIMIT 3
+-user----+-query_part------+-query_duration_ms-+
| default | INSERT INTO col |                52 |
| default | INSERT INTO col |                55 |
| default | INSERT INTO col |                51 |
+---------+-----------------+-------------------+
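If you also want to know which host each entry came from, you can add the hostName() function; a small sketch, assuming the same cluster name:

SELECT
    hostName() AS host,
    count() AS slow_queries
FROM clusterAllReplicas('mytestcluster', system, query_log)
WHERE query_duration_ms > 50
GROUP BY host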
Bonus: Working with AggregateStates
Working with materialized views and the possibilities of aggregate functions could fill multiple blog posts on their own (for example, this one about the performance impact of materialized views). Therefore, I only want to briefly mention some functionality that not everyone knows about, but which can come in handy.
Let's assume we have the following two tables, which simply count unique customers per hour and per day:
CREATE TABLE customers_hourly (
    eventDate Date,
    eventHour UInt8,
    uniqueCustomers AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree
ORDER BY (eventDate, eventHour);

CREATE TABLE customers_daily (
    eventDate Date,
    uniqueCustomers AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree
ORDER BY (eventDate);
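Filling such tables typically happens via materialized views. Here is a hedged sketch of an hourly one, assuming a hypothetical source table events with customerId UInt64 and eventTime DateTime columns:

CREATE MATERIALIZED VIEW customers_hourly_mv TO customers_hourly AS
SELECT
    toDate(eventTime) AS eventDate,
    toHour(eventTime) AS eventHour,
    uniqState(customerId) AS uniqueCustomers
FROM events
GROUP BY eventDate, eventHour;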
Initialize aggregation
As the sketch above shows, filling those tables is quite easy with materialized views. But what if you want to insert a row manually? For example, for testing purposes, you want to insert the test customerId 4711 at 3 different hours of the same day.
As uniqueCustomers is an AggregateFunction column, you cannot insert the customerId into it directly, so something like this doesn't work:
INSERT INTO customers_hourly
SELECT eventDate, eventHour, customerId
FROM FORMAT(JSONEachRow,
$${"customerId":4711,"eventDate":"2024-04-26","eventHour":1}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":2}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":3}$$)
Of course it’s possible to generate an aggregation state by using window functions:
INSERT INTO customers_hourly
SELECT
    eventDate,
    eventHour,
    uniqState(toUInt64(assumeNotNull(customerId))) OVER ()
FROM FORMAT(JSONEachRow,
$${"customerId":4711,"eventDate":"2024-04-26","eventHour":1}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":2}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":3}$$)
But if you pass multiple arguments to the function, that can become hard to read, so you can also just use the initializeAggregation function to insert directly into the table:
INSERT INTO customers_hourly
SELECT
    eventDate,
    eventHour,
    initializeAggregation('uniqState', toUInt64(assumeNotNull(customerId)))
FROM FORMAT(JSONEachRow,
$${"customerId":4711,"eventDate":"2024-04-26","eventHour":1}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":2}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":3}$$)
Finalize aggregation
Now that we know how to write directly into an aggregate function column, how can we read from it? Of course, we can use uniqMerge to get the final result, but as that is an aggregate function, we would need a GROUP BY etc. to get it. And if we want to see the individual intermediate results we added to the table, we would again need to use a window function to prevent the rows from collapsing. Or we just use finalizeAggregation to make it easier:
SELECT
    eventDate,
    eventHour,
    finalizeAggregation(uniqueCustomers)
FROM customers_hourly
┌──eventDate─┬─eventHour─┬─finalizeAggregation(uniqueCustomers)─┐
1. │ 2024-04-26 │ 1 │ 1 │
2. │ 2024-04-26 │ 2 │ 1 │
3. │ 2024-04-26 │ 3 │ 1 │
└────────────┴───────────┴──────────────────────────────────────┘
┌──eventDate─┬─eventHour─┬─finalizeAggregation(uniqueCustomers)─┐
4. │ 2024-04-26 │ 1 │ 1 │
5. │ 2024-04-26 │ 2 │ 1 │
6. │ 2024-04-26 │ 3 │ 1 │
└────────────┴───────────┴──────────────────────────────────────┘
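(You get six rows in two blocks because each of the two inserts from the previous section created its own part, and AggregatingMergeTree only collapses rows with the same key once those parts are merged.)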
MergeState
How do we get from the hourly aggregation to the daily aggregation? In cases where you can just sum up the results from the hourly table, that's quite easy to achieve. Unfortunately, with the uniq function you cannot simply sum up the intermediate results, as the same user could have been active during multiple hours, so the daily result is not the sum of all hourly results. I've seen customers resort to recalculating the daily table from the base table, instead of just continuing the aggregation by using uniqMergeState:
INSERT INTO customers_daily
SELECT eventDate, uniqMergeState(uniqueCustomers)
FROM customers_hourly
GROUP BY eventDate
The logic is as easy as it sounds: uniqMergeState merges the intermediate results, but instead of returning the merged result, it returns the new internal state, which can then be stored in the daily aggregation table.
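Reading the final daily numbers then works the same way as with the hourly table; a minimal sketch using uniqMerge:

SELECT
    eventDate,
    uniqMerge(uniqueCustomers) AS uniqueCustomers
FROM customers_daily
GROUP BY eventDate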
Those were 5 (+3 bonus) small features that I think are good to know. If you have any more neat features or topics you want to read about, feel free to contact me via Slack or LinkedIn.
Open source thrives on its community!
Featured image credit: Kevin Ku/Unsplash