What is clickhouse-local?
Sometimes we have to work with files, like CSV or Parquet, resident locally on our computers, readily accessible in S3, or easily exportable from MySQL or Postgres databases. Wouldn’t it be nice to have a tool to analyze and transform the data in those files using the power of SQL, and all of the ClickHouse functions, but without having to deploy a whole database server or write custom Python code?
Fortunately, this is precisely why clickhouse-local was created! The name “local” indicates that it is designed and optimized for data analysis using the local compute resources on your laptop or workstation. In this blog post, we’ll give you an overview of the capabilities of clickhouse-local and how it can increase the productivity of data scientists and engineers working with data in these scenarios.
Installation
curl https://clickhouse.com/ | sh
Now we can use the tool:
./clickhouse local --version
ClickHouse local version 22.13.1.530 (official build).
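Running the binary without a query is also worth knowing about: it starts an interactive clickhouse-local session (whose state is discarded on exit), which is convenient for exploratory work:

./clickhouse local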
Quick example
Suppose we have a simple CSV file we want to query:
./clickhouse local -q "SELECT * FROM file(sample.csv) LIMIT 2"
This will print the first two rows from the given sample.csv file:
1	story	pg	2006-10-09 21:21:51.000000000
2	story	phyllis	2006-10-09 21:30:28.000000000
The file() function, which is used to load data, takes a file path as the first argument and file format as an optional second argument.
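If automatic format detection ever guesses wrong, we can pin the format explicitly. A minimal sketch using the same sample.csv:

./clickhouse local -q "SELECT * FROM file(sample.csv, CSV) LIMIT 2"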
Working with CSV files
Let's now introduce a more realistic dataset. A sample of the Hacker News dataset containing only posts concerning ClickHouse is available here for download. This CSV has a header row. In such cases, we can additionally pass the CSVWithNames format as a second argument to the file function:
./clickhouse local -q "SELECT id, type, time, by, url FROM file(hackernews.csv, CSVWithNames) LIMIT 5"
Note how we can now refer to columns by their names in this case:
18346787	comment	2018-10-31 15:56:39.000000000	RobAtticus
18355652	comment	2018-11-01 16:29:16.000000000	jeroensoeters
18362819	comment	2018-11-02 13:26:59.000000000	arespredator
21938521	comment	2020-01-02 19:01:23.000000000	lykr0n
21942826	story	2020-01-03 03:25:46.000000000	phatak-dev	http://blog.madhukaraphatak.com/clickouse-clustering-spark-developer/
In cases where we are dealing with CSVs without a header row, we can simply use the CSV format (or even omit it, since ClickHouse can automatically detect formats):
./clickhouse local -q "SELECT * FROM file(hackernews.csv, CSV)"
In these cases, we can refer to specific columns using c followed by the column index (c1 for the first column, c2 for the second, and so on). The column types are still automatically inferred from the data. To select the first and third columns:
./clickhouse local -q "SELECT c1, c3 FROM file(file.csv)"
Using SQL to query data from files
We can use any SQL query to fetch and transform data from files. Let’s query for the most popular linked domain in Hacker News posts:
./clickhouse local -q "SELECT id, type, time, by, url FROM file(hackernews.csv, CSVWithNames) LIMIT 1"
Which returns:
┌─d──────────┬──t─┐
│ github.com │ 14 │
└────────────┴────┘
Or we can build the hourly distribution of posts to understand the most and least popular hours for posting:
./clickhouse local -q "SELECT toHour(time) h, count(*) t, bar(t, 0, 100, 25) as c FROM file(hackernews.csv, CSVWithNames) GROUP BY h ORDER BY h"
4am seems to be the least popular hour to post, while 7pm is the most popular:
┌──h─┬───t─┬─c─────────────────────────┐
│  0 │  38 │ █████████▌                │
│  1 │  36 │ █████████                 │
│  2 │  29 │ ███████▏                  │
│  3 │  41 │ ██████████▎               │
│  4 │  25 │ ██████▎                   │
│  5 │  33 │ ████████▎                 │
│  6 │  36 │ █████████                 │
│  7 │  37 │ █████████▎                │
│  8 │  44 │ ███████████               │
│  9 │  38 │ █████████▌                │
│ 10 │  43 │ ██████████▋               │
│ 11 │  40 │ ██████████                │
│ 12 │  32 │ ████████                  │
│ 13 │  59 │ ██████████████▋           │
│ 14 │  56 │ ██████████████            │
│ 15 │  68 │ █████████████████         │
│ 16 │  70 │ █████████████████▌        │
│ 17 │  92 │ ███████████████████████   │
│ 18 │  95 │ ███████████████████████▋  │
│ 19 │ 102 │ █████████████████████████ │
│ 20 │  75 │ ██████████████████▋       │
│ 21 │  69 │ █████████████████▎        │
│ 22 │  64 │ ████████████████          │
│ 23 │  58 │ ██████████████▍           │
└────┴─────┴───────────────────────────┘
To understand the file structure, we can use the DESCRIBE query:
./clickhouse local -q "DESCRIBE file(hackernews.csv, CSVWithNames)"
Which will print the columns with their types:
┌─name────────┬─type────────────────────┐
│ id          │ Nullable(Int64)         │
│ deleted     │ Nullable(Int64)         │
│ type        │ Nullable(String)        │
│ by          │ Nullable(String)        │
│ time        │ Nullable(DateTime64(9)) │
│ text        │ Nullable(String)        │
│ dead        │ Nullable(Int64)         │
│ parent      │ Nullable(Int64)         │
│ poll        │ Nullable(Int64)         │
│ kids        │ Array(Nullable(Int64))  │
│ url         │ Nullable(String)        │
│ score       │ Nullable(Int64)         │
│ title       │ Nullable(String)        │
│ parts       │ Nullable(String)        │
│ descendants │ Nullable(Int64)         │
└─────────────┴─────────────────────────┘
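Schema inference wraps every column in Nullable by default to stay safe on missing values. If we know the data is dense, a sketch like the following, using the schema_inference_make_columns_nullable setting (available in recent ClickHouse versions), turns that off:

./clickhouse local -q "DESCRIBE file(hackernews.csv, CSVWithNames) SETTINGS schema_inference_make_columns_nullable = 0"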
Output formatting
By default, clickhouse-local will output everything in TSV format, but we can use any of the many available output formats instead:
./clickhouse local -q "SELECT event, value FROM file(events.csv, CSVWithNames) WHERE value < 1e5 FORMAT SQLInsert"
This will output results in a standard SQL format, which can then be used to feed data to SQL databases, like MySQL or Postgres:
INSERT INTO table (`event`, `value`) VALUES ('click', 71364)...
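The FORMAT clause accepts any other output format as well. For instance, JSONEachRow emits one JSON object per row, which is convenient for piping into other tools; a sketch against the same hypothetical events.csv:

./clickhouse local -q "SELECT event, value FROM file(events.csv, CSVWithNames) LIMIT 3 FORMAT JSONEachRow"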
Saving output to file
We can save the output to a file by using the ‘INTO OUTFILE’ clause:
./clickhouse local -q "SELECT id, url, time FROM file(hackernews.csv, CSVWithNames) INTO OUTFILE 'urls.tsv'"
This will create a urls.tsv file (the TSV format is inferred from the file extension):
~% head urls.tsv
18346787		2018-10-31 15:56:39.000000000
18355652		2018-11-01 16:29:16.000000000
18362819		2018-11-02 13:26:59.000000000
21938521		2020-01-02 19:01:23.000000000
21942826	http://blog.madhukaraphatak.com/clickouse-clustering-spark-developer/	2020-01-03 03:25:46.000000000
21953967		2020-01-04 09:56:48.000000000
21966741		2020-01-06 05:31:48.000000000
18404015		2018-11-08 02:44:50.000000000
18404089		2018-11-08 03:05:27.000000000
18404090		2018-11-08 03:06:14.000000000
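Since INTO OUTFILE infers the output format from the file extension, the same clause doubles as a file converter. For example, a sketch that rewrites the CSV as Parquet:

./clickhouse local -q "SELECT * FROM file(hackernews.csv, CSVWithNames) INTO OUTFILE 'hackernews.parquet'"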
Deleting data from CSV and other files
We can delete data from local files by combining query filtering and saving results to files. Let's delete rows from the hackernews.csv file that have an empty url. To do this, we just need to filter the rows we want to keep and save the result to a new file, here called clean.csv:
./clickhouse local -q "SELECT * FROM file(hackernews.csv