How to Ingest Data from Kafka into ClickHouse
Learn how to ingest data from a Kafka topic into ClickHouse using the Kafka table engine, materialized views, and MergeTree tables.
Overview
This article walks through the process of sending data from a Kafka topic to a ClickHouse table. We’ll use the Wiki recent changes feed, which provides a stream of events that represent changes made to various Wikimedia properties. The steps include:
- How to setup Kafka on Ubuntu
 - Ingest a stream of data into a Kakfa topic
 - Create a ClickHouse table that subscribes to the topic
 
1. Setup Kafka on Ubuntu
- Create an Ubuntu ec2 instance and SSH on to it:
 
- Install Kafka (based on the instructions here: https://www.linode.com/docs/guides/how-to-install-apache-kafka-on-ubuntu/):
 
- Start ZooKeeper:
 
- Open a new console and launch Kafka:
 
- Open a third console and create a topic named wikimedia:
 
- You can verify it was created successfully by:
 
2. Ingest the Wikimedia Stream into Kafka
- We need some utilities first:
 
- The data is sent to Kafka using a clever curl command that grabs the latest Wikimedia events, parses out the JSON data and sends that to the Kafka topic:
 
- You can "describe" the topic:
 
- Let's verify everything is working by consuming some events:
 
- Hit Ctrl+c to kill the previous command.
 
3. Ingest the Data into ClickHouse
- Here is what the incoming data looks like:
 
- We will need the Kafka table engine to pull the data from the Kafka topic:
 
- For some reason the Kafka table engine seems to take the public ec2 URL and convert it to the private DNS name, so I had to add that to my local 
/etc/hostsfile: 
- You can read from a Kafka table, you just have to enable a setting:
 
The rows should come back nicely parsed based on the columns defined in the wikiQueue table:
- We need a MergeTree table to store these incoming events:
 
- Let's define a materialized view that gets triggered when an insert occurs on the Kafka table and sends the data to our rawEvents table:
 
- You should start seeing data going into rawEvents almost immediately:
 
- Let's view some of the rows:
 
- Let's see what types of events are coming in:
 
Let's define a materialized view chained to our current materialized view. We will keep track of some aggregated stats per minute:
- We will need -Merge functions to view the results:
 
 · 6 мин. чтения