What is clickhouse-local?
Sometimes we have to work with files, like CSV or Parquet, resident locally on our computers, readily accessible in S3, or easily exportable from MySQL or Postgres databases. Wouldn’t it be nice to have a tool to analyze and transform the data in those files using the power of SQL, and all of the ClickHouse functions, but without having to deploy a whole database server or write custom Python code?
Fortunately, this is precisely why clickhouse-local was created! The name “local” indicates that it is designed and optimized for data analysis using the local compute resources on your laptop or workstation. In this blog post, we’ll give you an overview of the capabilities of clickhouse-local and how it can increase the productivity of data scientists and engineers working with data in these scenarios.
Installation
curl https://clickhouse.com/ | sh
Now we can use the tool:
./clickhouse local --version
ClickHouse local version 22.13.1.530 (official build).
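Running the binary without a query is also worth knowing about: it starts an interactive clickhouse-local session (whose state is discarded on exit), which is convenient for exploratory work:

./clickhouse local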
Quick example
Suppose we have a simple CSV file we want to query:
./clickhouse local -q "SELECT * FROM file(sample.csv) LIMIT 2"
This will print the first two rows from the given sample.csv file:
1	story	pg	2006-10-09 21:21:51.000000000
2	story	phyllis	2006-10-09 21:30:28.000000000
The file() function, which is used to load data, takes a file path as the first argument and file format as an optional second argument.
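If automatic format detection ever guesses wrong, we can pin the format explicitly. A minimal sketch using the same sample.csv:

./clickhouse local -q "SELECT * FROM file(sample.csv, CSV) LIMIT 2"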
Working with CSV files
Let's now introduce a more realistic dataset. A sample of the Hacker News dataset containing only posts concerning ClickHouse is available here for download. This CSV has a header row. In such cases, we can additionally pass the CSVWithNames format as a second argument to the file function:
./clickhouse local -q "SELECT id, type, time, by, url FROM file(hackernews.csv, CSVWithNames) LIMIT 5"
Note how we can now refer to columns by their names in this case:
18346787	comment	2018-10-31 15:56:39.000000000	RobAtticus
18355652	comment	2018-11-01 16:29:16.000000000	jeroensoeters
18362819	comment	2018-11-02 13:26:59.000000000	arespredator
21938521	comment	2020-01-02 19:01:23.000000000	lykr0n
21942826	story	2020-01-03 03:25:46.000000000	phatak-dev	http://blog.madhukaraphatak.com/clickouse-clustering-spark-developer/
In cases where we are dealing with CSVs without a header row, we can simply use the CSV format (or even omit it, since ClickHouse can automatically detect formats):
./clickhouse local -q "SELECT * FROM file(hackernews.csv, CSV)"
In these cases, we can refer to specific columns using c followed by the column index (c1 for the first column, c2 for the second, and so on). The column types are still automatically inferred from the data. To select the first and third columns:
./clickhouse local -q "SELECT c1, c3 FROM file(file.csv)"
Using SQL to query data from files
We can use any SQL query to fetch and transform data from files. Let’s query for the most popular linked domain in Hacker News posts:
./clickhouse local -q "SELECT id, type, time, by, url FROM file(hackernews.csv, CSVWithNames) LIMIT 1"
Which returns:
┌─d──────────┬──t─┐
│ github.com │ 14 │
└────────────┴────┘
Or we can build the hourly distribution of posts to understand the most and least popular hours for posting:
./clickhouse local -q "SELECT toHour(time) h, count(*) t, bar(t, 0, 100, 25) as c FROM file(hackernews.csv, CSVWithNames) GROUP BY h ORDER BY h"
4am seems to be the least popular hour to post, while 7pm is the most popular:
┌──h─┬───t─┬─c─────────────────────────┐
│  0 │  38 │ █████████▌                │
│  1 │  36 │ █████████                 │
│  2 │  29 │ ███████▏                  │
│  3 │  41 │ ██████████▎               │
│  4 │  25 │ ██████▎                   │
│  5 │  33 │ ████████▎                 │
│  6 │  36 │ █████████                 │
│  7 │  37 │ █████████▎                │
│  8 │  44 │ ███████████               │
│  9 │  38 │ █████████▌                │
│ 10 │  43 │ ██████████▋               │
│ 11 │  40 │ ██████████                │
│ 12 │  32 │ ████████                  │
│ 13 │  59 │ ██████████████▋           │
│ 14 │  56 │ ██████████████            │
│ 15 │  68 │ █████████████████         │
│ 16 │  70 │ █████████████████▌        │
│ 17 │  92 │ ███████████████████████   │
│ 18 │  95 │ ███████████████████████▋  │
│ 19 │ 102 │ █████████████████████████ │
│ 20 │  75 │ ██████████████████▋       │
│ 21 │  69 │ █████████████████▎        │
│ 22 │  64 │ ████████████████          │
│ 23 │  58 │ ██████████████▍           │
└────┴─────┴───────────────────────────┘
To understand the file structure, we can use the DESCRIBE query:
./clickhouse local -q "DESCRIBE file(hackernews.csv, CSVWithNames)"
Which will print the columns with their types:
┌─name────────┬─type────────────────────┐
│ id          │ Nullable(Int64)         │
│ deleted     │ Nullable(Int64)         │
│ type        │ Nullable(String)        │
│ by          │ Nullable(String)        │
│ time        │ Nullable(DateTime64(9)) │
│ text        │ Nullable(String)        │
│ dead        │ Nullable(Int64)         │
│ parent      │ Nullable(Int64)         │
│ poll        │ Nullable(Int64)         │
│ kids        │ Array(Nullable(Int64))  │
│ url         │ Nullable(String)        │
│ score       │ Nullable(Int64)         │
│ title       │ Nullable(String)        │
│ parts       │ Nullable(String)        │
│ descendants │ Nullable(Int64)         │
└─────────────┴─────────────────────────┘
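Schema inference wraps every column in Nullable by default to stay safe on missing values. If we know the data is dense, a sketch like the following, using the schema_inference_make_columns_nullable setting (available in recent ClickHouse versions), turns that off:

./clickhouse local -q "DESCRIBE file(hackernews.csv, CSVWithNames) SETTINGS schema_inference_make_columns_nullable = 0"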
Output formatting
By default, clickhouse-local will output everything in TSV format, but we can use any of the many available output formats instead:
./clickhouse local -q "SELECT event, value FROM file(events.csv, CSVWithNames) WHERE value < 1e5 FORMAT SQLInsert"
This will output results in a standard SQL format, which can then be used to feed data to SQL databases, like MySQL or Postgres:
INSERT INTO table (`event`, `value`) VALUES ('click', 71364)...
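The FORMAT clause accepts any other output format as well. For instance, JSONEachRow emits one JSON object per row, which is convenient for piping into other tools; a sketch against the same hypothetical events.csv:

./clickhouse local -q "SELECT event, value FROM file(events.csv, CSVWithNames) LIMIT 3 FORMAT JSONEachRow"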
Saving output to file
We can save the output to a file by using the ‘INTO OUTFILE’ clause:
./clickhouse local -q "SELECT id, url, time FROM file(hackernews.csv, CSVWithNames) INTO OUTFILE 'urls.tsv'"
This will create a urls.tsv file (the TSV format is inferred from the file extension):
~% head urls.tsv
18346787		2018-10-31 15:56:39.000000000
18355652		2018-11-01 16:29:16.000000000
18362819		2018-11-02 13:26:59.000000000
21938521		2020-01-02 19:01:23.000000000
21942826	http://blog.madhukaraphatak.com/clickouse-clustering-spark-developer/	2020-01-03 03:25:46.000000000
21953967		2020-01-04 09:56:48.000000000
21966741		2020-01-06 05:31:48.000000000
18404015		2018-11-08 02:44:50.000000000
18404089		2018-11-08 03:05:27.000000000
18404090		2018-11-08 03:06:14.000000000
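Since INTO OUTFILE infers the output format from the file extension, the same clause doubles as a file converter. For example, a sketch that rewrites the CSV as Parquet:

./clickhouse local -q "SELECT * FROM file(hackernews.csv, CSVWithNames) INTO OUTFILE 'hackernews.parquet'"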
Deleting data from CSV and other files
We can delete data from local files by combining query filtering and saving results to files. Let's delete rows from the hackernews.csv file that have an empty url. To do this, we just need to filter the rows we want to keep and save the result to a new file, here called clean.csv:
./clickhouse local -q "SELECT * FROM file(hackernews.csv