Introduction
If you're dealing with analytics workloads, log processing, or any scenario that requires fast aggregations over billions of rows, you've probably heard of ClickHouse. Originally developed at Yandex, ClickHouse is a column-oriented database management system that can process analytical queries at incredible speeds.
However, setting up a production-ready ClickHouse cluster isn't trivial. You need to consider:
- High availability: What happens when a node fails?
- Data replication: How do you prevent data loss?
- Coordination: How do nodes communicate and elect leaders?
- Monitoring: How do you know when something goes wrong?
- Security: How do you protect your data?
- Cost: Can you do all this without breaking the bank?
In this guide, I'll walk you through building a complete production-ready ClickHouse cluster on Hetzner Cloud for approximately €37/month. We'll use:
- Terraform for infrastructure provisioning
- Ansible for configuration management
- ClickHouse Keeper for distributed coordination
- Cloudflare for SSL and DDoS protection
- Grafana + Prometheus for monitoring
- GitHub Actions for CI/CD
By the end, you'll have a fully automated, scalable, and secure analytics infrastructure.
Why This Stack?
Why ClickHouse?
ClickHouse excels at:
- Speed: Queries that take minutes in traditional databases complete in seconds
- Compression: Column storage achieves 10-20x compression ratios
- Real-time ingestion: Handle millions of rows per second
- SQL compatibility: Familiar query language with powerful extensions
Real-world example: At scale, a query like "count all events from last month grouped by country" might scan 50 billion rows in under 2 seconds.
Why Hetzner Cloud?
While AWS, GCP, and Azure dominate the cloud market, Hetzner offers compelling advantages:
| Provider | 4 vCPU, 8GB RAM | Egress (1TB) | Total/Month |
|---|---|---|---|
| AWS (EC2) | ~$120 | $90 | ~$210 |
| GCP | ~$100 | $120 | ~$220 |
| Hetzner | €8.99 | Free (20TB) | ~€9 |
For a 3-node cluster, Hetzner saves you ~€500/month compared to hyperscalers. That's significant for startups and side projects.
Why Terraform + Ansible?
This combination follows the separation of concerns principle:
- Terraform: "What infrastructure exists" (declarative)
- Ansible: "How infrastructure is configured" (procedural)
Terraform creates servers, networks, and load balancers. Ansible installs software, manages configs, and handles services. Each tool does what it's best at.
Why ClickHouse Keeper instead of ZooKeeper?
Historically, ClickHouse used Apache ZooKeeper for coordination. ClickHouse Keeper is a drop-in replacement that:
- Runs as a single binary (simpler operations)
- Has the same protocol (compatible with existing configs)
- Uses less memory
- Is maintained by the ClickHouse team
Architecture Overview
Here's what we're building:
Internet
│
▼
┌───────────────────────┐
│ Cloudflare │
│ (SSL Termination) │
│ grafana.domain.com │
│ clickhouse.domain.com │
└───────────┬───────────┘
│ HTTPS
▼
┌───────────────────────────────────────────────────────────────────────────────┐
│ Hetzner Cloud │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Hetzner Load Balancer (LB11) │ │
│ │ Health checks: /ping on port 8123 │ │
│ └─────────────────────────────────┬───────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────┼──────────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ CH Node 1 │ │ CH Node 2 │ │ CH Node 3 │ │
│ ├───────────┤ ├───────────┤ ├───────────┤ │
│ │ClickHouse │ │ClickHouse │ │ClickHouse │ │
│ │ Server │◄────────────►│ Server │◄────────────►│ Server │ │
│ ├───────────┤ Replication ├───────────┤ Replication ├───────────┤ │
│ │ Keeper │ │ Keeper │ │ Keeper │ │
│ │ (Raft) │◄────────────►│ (Raft) │◄────────────►│ (Raft) │ │
│ └─────┬─────┘ Consensus └─────┬─────┘ Consensus └─────┬─────┘ │
│ │ │ │ │
│ └──────────────────────────┼──────────────────────────┘ │
│ │ │
│ Private Network │
│ 10.0.1.0/24 │
│ │ │
│ ┌─────────┴─────────┐ │
│ │ Monitoring │ │
│ │ ┌─────────────┐ │ │
│ │ │ Grafana │ │ │
│ │ │ Prometheus │ │ │
│ │ │ Nginx │ │ │
│ │ └─────────────┘ │ │
│ └───────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────────────────┘
Component Breakdown
| Component | Purpose | Why We Need It |
|---|---|---|
| ClickHouse Server | Analytics database | The core of our system |
| ClickHouse Keeper | Distributed coordination | Leader election, distributed locks, replication coordination |
| Hetzner Load Balancer | Traffic distribution | Round-robin queries across healthy nodes |
| Private Network | Secure communication | Cluster traffic never leaves Hetzner's network |
| Prometheus | Metrics collection | Time-series data for monitoring |
| Grafana | Visualization | Dashboards and alerting |
| Nginx | Reverse proxy | SSL termination for Cloudflare |
| Cloudflare | DNS, SSL, DDoS protection | Free SSL and protection |
Data Flow
- Writes: Client → Load Balancer → ClickHouse Node → Replicated to other nodes via Keeper
- Reads: Client → Load Balancer → Any healthy ClickHouse Node
- Metrics: All nodes → Prometheus → Grafana
Why 3 Nodes?
The number 3 isn't arbitrary. For consensus algorithms like Raft (used by Keeper):
- 3 nodes: Tolerates 1 failure (need majority: 2)
- 5 nodes: Tolerates 2 failures (need majority: 3)
With 3 nodes, we get high availability while keeping costs reasonable.
Prerequisites
Before we begin, ensure you have:
Accounts
- Hetzner Cloud - Sign up here
- Cloudflare - Sign up here (free tier works)
- Terraform Cloud - Sign up here (free tier)
- GitHub - For CI/CD
Local Tools
# macOS
brew install terraform ansible
# Ubuntu/Debian (Terraform isn't in the default repos; add HashiCorp's apt repo first)
wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update
sudo apt install -y terraform ansible
# Verify installations
terraform version # Should be 1.6+
ansible --version # Should be 2.15+
API Tokens
You'll need to generate:
| Token | Where | Permissions |
|---|---|---|
| Hetzner API Token | Console → Security → API Tokens | Read & Write |
| Cloudflare API Token | Profile → API Tokens | Zone:DNS:Edit, Zone:Settings:Edit |
| Terraform Cloud Token | User Settings → Tokens | N/A |
Domain Name
You need a domain managed by Cloudflare. We'll create:
- clickhouse.yourdomain.com
- grafana.yourdomain.com
Part 1: Infrastructure with Terraform
Let's start by creating the infrastructure. Create a new project directory:
mkdir clickhouse-cluster && cd clickhouse-cluster
mkdir -p terraform ansible secrets
Understanding the Terraform Structure
terraform/
├── main.tf # Core resources (servers, network, LB)
├── variables.tf # Input variables
├── outputs.tf # Output values
├── cloudflare.tf # DNS and SSL settings
├── backend.tf # Terraform Cloud configuration
└── templates/
└── inventory.tpl # Ansible inventory template
Step 1: Provider Configuration
Create terraform/main.tf:
# terraform/main.tf
terraform {
required_version = ">= 1.6.0"
required_providers {
hcloud = {
source = "hetznercloud/hcloud"
version = "~> 1.45"
}
cloudflare = {
source = "cloudflare/cloudflare"
version = "~> 4.0"
}
}
}
provider "hcloud" {
token = var.hcloud_token
}
provider "cloudflare" {
api_token = var.cloudflare_api_token
}
What's happening here?
- We declare our required Terraform version (1.6+)
- We specify two providers: hcloud for Hetzner and cloudflare for DNS
- Providers are authenticated via variables (we'll define those next)
Step 2: Variables
Create terraform/variables.tf:
# terraform/variables.tf
# =============================================================================
# REQUIRED VARIABLES
# =============================================================================
variable "hcloud_token" {
description = "Hetzner Cloud API token"
type = string
sensitive = true
}
variable "cloudflare_api_token" {
description = "Cloudflare API token"
type = string
sensitive = true
}
variable "cloudflare_zone_name" {
description = "Your domain name in Cloudflare"
type = string
}
# =============================================================================
# CLUSTER CONFIGURATION
# =============================================================================
variable "cluster_name" {
description = "Name prefix for all resources"
type = string
default = "clickhouse"
}
variable "location" {
description = "Hetzner datacenter location"
type = string
default = "nbg1" # Nuremberg, Germany
validation {
condition = contains(["fsn1", "nbg1", "hel1", "ash", "hil"], var.location)
error_message = "Location must be: fsn1, nbg1, hel1, ash, or hil"
}
}
variable "clickhouse_node_count" {
description = "Number of ClickHouse nodes"
type = number
default = 3
validation {
condition = var.clickhouse_node_count >= 1
error_message = "Must have at least 1 node"
}
}
variable "clickhouse_server_type" {
description = "Server type for ClickHouse nodes"
type = string
default = "cx33" # 4 vCPU, 8GB RAM
}
variable "monitoring_server_type" {
description = "Server type for monitoring"
type = string
default = "cx23" # 2 vCPU, 4GB RAM
}
variable "ssh_allowed_ips" {
description = "IPs allowed to SSH (CIDR notation)"
type = list(string)
default = ["0.0.0.0/0"] # Restrict in production!
}
Key decisions explained:
- Location: nbg1 (Nuremberg) is often the cheapest and has good availability
- Server types: cx33 gives us 4 vCPU and 8GB RAM, enough for moderate workloads
- SSH access: Defaults to open, but you should restrict it to your IP
Step 3: SSH Key Generation
We need SSH access to our servers. Terraform can generate keys for us:
# Add to terraform/main.tf
# =============================================================================
# SSH KEY
# =============================================================================
resource "tls_private_key" "cluster_key" {
algorithm = "ED25519"
}
resource "hcloud_ssh_key" "cluster" {
name = "${var.cluster_name}-key"
public_key = tls_private_key.cluster_key.public_key_openssh
}
resource "local_file" "private_key" {
content = tls_private_key.cluster_key.private_key_openssh
filename = "${path.module}/../secrets/${var.cluster_name}-key"
file_permission = "0600"
}
Why ED25519?
- Shorter keys (256-bit vs 4096-bit RSA)
- Faster operations
- Modern and secure
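One caveat of generating the key with the tls provider is that the private key ends up in Terraform state. If you'd rather keep it out of state, a minimal alternative is to generate the key yourself; the path below assumes you run this from the project root:
# Generate an ED25519 keypair locally instead of via the tls provider
ssh-keygen -t ed25519 -f secrets/clickhouse-key -C "clickhouse-cluster" -N ""
You'd then point the hcloud_ssh_key resource at file("${path.module}/../secrets/clickhouse-key.pub") and drop the tls_private_key and local_file resources.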
Step 4: Network Infrastructure
Now let's create the private network:
# Add to terraform/main.tf
# =============================================================================
# NETWORK
# =============================================================================
resource "hcloud_network" "cluster" {
name = "${var.cluster_name}-network"
ip_range = "10.0.0.0/16"
}
resource "hcloud_network_subnet" "cluster" {
network_id = hcloud_network.cluster.id
type = "cloud"
network_zone = "eu-central"
ip_range = "10.0.1.0/24"
}
Why a private network?
- Security: Inter-node traffic stays private
- Performance: Lower latency than public internet
- Cost: Private traffic is free
Step 5: Firewall Rules
Security first! Let's define what traffic is allowed:
# Add to terraform/main.tf
# =============================================================================
# FIREWALL
# =============================================================================
resource "hcloud_firewall" "cluster" {
name = "${var.cluster_name}-firewall"
# SSH access (restrict to your IP in production!)
dynamic "rule" {
for_each = var.ssh_allowed_ips
content {
direction = "in"
protocol = "tcp"
port = "22"
source_ips = [rule.value]
}
}
# HTTP for Cloudflare proxy
rule {
direction = "in"
protocol = "tcp"
port = "80"
source_ips = ["0.0.0.0/0", "::/0"]
}
# HTTPS for Cloudflare proxy
rule {
direction = "in"
protocol = "tcp"
port = "443"
source_ips = ["0.0.0.0/0", "::/0"]
}
# Grafana
rule {
direction = "in"
protocol = "tcp"
port = "3000"
source_ips = ["0.0.0.0/0", "::/0"]
}
# ClickHouse HTTP API
rule {
direction = "in"
protocol = "tcp"
port = "8123"
source_ips = ["0.0.0.0/0", "::/0"]
}
# ClickHouse Native Protocol
rule {
direction = "in"
protocol = "tcp"
port = "9000"
source_ips = ["0.0.0.0/0", "::/0"]
}
# Allow all outbound
rule {
direction = "out"
protocol = "tcp"
port = "1-65535"
destination_ips = ["0.0.0.0/0", "::/0"]
}
rule {
direction = "out"
protocol = "udp"
port = "1-65535"
destination_ips = ["0.0.0.0/0", "::/0"]
}
}
Step 6: ClickHouse Servers
Now the main event—creating our ClickHouse nodes:
# Add to terraform/main.tf
# =============================================================================
# CLICKHOUSE NODES
# =============================================================================
resource "hcloud_server" "clickhouse" {
count = var.clickhouse_node_count
name = "${var.cluster_name}-ch-${count.index + 1}"
image = "ubuntu-24.04"
server_type = var.clickhouse_server_type
location = var.location
ssh_keys = [hcloud_ssh_key.cluster.id]
labels = {
cluster = var.cluster_name
role = "clickhouse"
}
# Cloud-init: disable unattended-upgrades (prevents apt lock issues)
user_data = <<-EOF
#cloud-config
package_update: false
package_upgrade: false
runcmd:
- systemctl stop unattended-upgrades || true
- systemctl disable unattended-upgrades || true
- apt-get remove -y unattended-upgrades || true
- echo "vm.max_map_count=262144" >> /etc/sysctl.conf
- sysctl -p
EOF
lifecycle {
ignore_changes = [user_data]
}
}
# Attach to private network
resource "hcloud_server_network" "clickhouse" {
count = var.clickhouse_node_count
server_id = hcloud_server.clickhouse[count.index].id
subnet_id = hcloud_network_subnet.cluster.id
ip = "10.0.1.${count.index + 2}" # .2, .3, .4
}
# Attach firewall
resource "hcloud_firewall_attachment" "clickhouse" {
firewall_id = hcloud_firewall.cluster.id
server_ids = hcloud_server.clickhouse[*].id
}
Important cloud-init settings:
- Disable unattended-upgrades: Ubuntu's automatic updates can lock apt during Ansible runs, causing failures
- vm.max_map_count: ClickHouse needs this increased for memory-mapped files
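Once a node boots, you can spot-check that cloud-init applied these settings; the IP below is a placeholder for one of your nodes' public addresses, and the key path assumes you run this from the project root:
# Verify the sysctl tweak and that unattended-upgrades is gone
ssh -i secrets/clickhouse-key root@<NODE_PUBLIC_IP> \
  "sysctl vm.max_map_count; systemctl is-enabled unattended-upgrades || true"
Expect vm.max_map_count = 262144; the second command should report the unit as disabled or not found, since cloud-init removes the package.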
Step 7: Monitoring Server
# Add to terraform/main.tf
# =============================================================================
# MONITORING NODE
# =============================================================================
resource "hcloud_server" "monitoring" {
name = "${var.cluster_name}-monitoring"
image = "ubuntu-24.04"
server_type = var.monitoring_server_type
location = var.location
ssh_keys = [hcloud_ssh_key.cluster.id]
labels = {
cluster = var.cluster_name
role = "monitoring"
}
user_data = <<-EOF
#cloud-config
package_update: false
package_upgrade: false
runcmd:
- systemctl stop unattended-upgrades || true
- systemctl disable unattended-upgrades || true
- apt-get remove -y unattended-upgrades || true
EOF
lifecycle {
ignore_changes = [user_data]
}
}
resource "hcloud_server_network" "monitoring" {
server_id = hcloud_server.monitoring.id
subnet_id = hcloud_network_subnet.cluster.id
ip = "10.0.1.10"
}
resource "hcloud_firewall_attachment" "monitoring" {
firewall_id = hcloud_firewall.cluster.id
server_ids = [hcloud_server.monitoring.id]
}
Step 8: Load Balancer
The load balancer distributes traffic across healthy nodes:
# Add to terraform/main.tf
# =============================================================================
# LOAD BALANCER
# =============================================================================
resource "hcloud_load_balancer" "cluster" {
name = "${var.cluster_name}-lb"
load_balancer_type = "lb11"
location = var.location
}
resource "hcloud_load_balancer_network" "cluster" {
load_balancer_id = hcloud_load_balancer.cluster.id
subnet_id = hcloud_network_subnet.cluster.id
ip = "10.0.1.100"
}
# HTTP API service
resource "hcloud_load_balancer_service" "http" {
load_balancer_id = hcloud_load_balancer.cluster.id
protocol = "tcp"
listen_port = 8123
destination_port = 8123
health_check {
protocol = "http"
port = 8123
interval = 5
timeout = 3
retries = 3
http {
path = "/ping"
status_codes = ["200"]
}
}
}
# Native protocol service
resource "hcloud_load_balancer_service" "native" {
load_balancer_id = hcloud_load_balancer.cluster.id
protocol = "tcp"
listen_port = 9000
destination_port = 9000
health_check {
protocol = "tcp"
port = 9000
interval = 5
timeout = 3
retries = 3
}
}
# Attach ClickHouse nodes
resource "hcloud_load_balancer_target" "clickhouse" {
count = var.clickhouse_node_count
type = "server"
load_balancer_id = hcloud_load_balancer.cluster.id
server_id = hcloud_server.clickhouse[count.index].id
use_private_ip = true
depends_on = [
hcloud_load_balancer_network.cluster,
hcloud_server_network.clickhouse
]
}
Health check explained:
ClickHouse exposes a /ping endpoint that returns "Ok." when healthy. The load balancer checks it every 5 seconds and removes unhealthy nodes from rotation.
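Once ClickHouse is installed in Part 2, you can run the same probe yourself, either against a node directly or through the load balancer (IPs are placeholders):
# The same check the load balancer performs; a healthy node answers "Ok."
curl -s http://<NODE_PUBLIC_IP>:8123/ping
curl -s http://<LOAD_BALANCER_IP>:8123/ping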
Step 9: Cloudflare DNS
Create terraform/cloudflare.tf:
# terraform/cloudflare.tf
data "cloudflare_zone" "domain" {
name = var.cloudflare_zone_name
}
# Grafana DNS record
resource "cloudflare_record" "grafana" {
zone_id = data.cloudflare_zone.domain.id
name = "grafana"
content = hcloud_server.monitoring.ipv4_address
type = "A"
ttl = 1 # Auto TTL when proxied
proxied = true
}
# ClickHouse DNS record
resource "cloudflare_record" "clickhouse" {
zone_id = data.cloudflare_zone.domain.id
name = "clickhouse"
content = hcloud_server.monitoring.ipv4_address # Goes through Nginx
type = "A"
ttl = 1
proxied = true
}
# SSL settings
resource "cloudflare_zone_settings_override" "ssl_settings" {
zone_id = data.cloudflare_zone.domain.id
settings {
ssl = "full"
always_use_https = "on"
min_tls_version = "1.2"
automatic_https_rewrites = "on"
}
}
Why route through the monitoring node?
Cloudflare's proxy only covers standard HTTP(S) ports such as 80 and 443, not ClickHouse's 8123 or 9000. Our Nginx on the monitoring node:
- Terminates SSL
- Proxies to ClickHouse via private network
- Provides a single entry point
Step 10: Outputs
Create terraform/outputs.tf:
# terraform/outputs.tf
output "load_balancer_ip" {
description = "Public IP of the load balancer"
value = hcloud_load_balancer.cluster.ipv4
}
output "clickhouse_https_url" {
description = "ClickHouse HTTPS URL"
value = "https://clickhouse.${var.cloudflare_zone_name}"
}
output "grafana_https_url" {
description = "Grafana HTTPS URL"
value = "https://grafana.${var.cloudflare_zone_name}"
}
output "monitoring_public_ip" {
description = "Public IP of monitoring node"
value = hcloud_server.monitoring.ipv4_address
}
output "ansible_inventory" {
description = "Generated Ansible inventory"
value = templatefile("${path.module}/templates/inventory.tpl", {
cluster_name = var.cluster_name
clickhouse_nodes = hcloud_server.clickhouse
clickhouse_ips = hcloud_server_network.clickhouse[*].ip
monitoring_node = hcloud_server.monitoring
monitoring_ip = hcloud_server_network.monitoring.ip
})
}
Step 11: Inventory Template
Create terraform/templates/inventory.tpl:
---
# Auto-generated by Terraform
all:
vars:
ansible_user: root
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
cluster_name: ${cluster_name}
children:
clickhouse:
hosts:
%{ for i, node in clickhouse_nodes ~}
${node.name}:
ansible_host: ${node.ipv4_address}
private_ip: ${clickhouse_ips[i]}
keeper_id: ${i + 1}
shard: ${(i % 2) + 1}
replica: ${floor(i / 2) + 1}
%{ endfor ~}
keeper:
hosts:
%{ for i, node in clickhouse_nodes ~}
${node.name}:
ansible_host: ${node.ipv4_address}
private_ip: ${clickhouse_ips[i]}
keeper_id: ${i + 1}
%{ endfor ~}
monitoring:
hosts:
${monitoring_node.name}:
ansible_host: ${monitoring_node.ipv4_address}
private_ip: ${monitoring_ip}
Step 12: Backend Configuration
Create terraform/backend.tf:
# terraform/backend.tf
terraform {
cloud {
organization = "your-org-name" # Change this
workspaces {
name = "clickhouse-cluster"
}
}
}
Step 13: Deploy Infrastructure
Now let's deploy:
cd terraform
# Create your variables file
cat > terraform.tfvars << 'EOF'
hcloud_token = "your-hetzner-token"
cloudflare_api_token = "your-cloudflare-token"
cloudflare_zone_name = "yourdomain.com"
cluster_name = "clickhouse"
location = "nbg1"
clickhouse_server_type = "cx33"
monitoring_server_type = "cx23"
EOF
# Login to Terraform Cloud
terraform login
# Initialize
terraform init
# Plan (review changes)
terraform plan
# Apply
terraform apply
After ~2 minutes, you'll have:
- 3 ClickHouse servers
- 1 Monitoring server
- 1 Load balancer
- Private network
- Firewall rules
- DNS records
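A quick smoke test before moving on; the output names match outputs.tf above, and the key path assumes you're still in the terraform/ directory:
# Confirm outputs resolved and SSH works with the generated key
terraform output load_balancer_ip
ssh -i ../secrets/clickhouse-key \
  root@"$(terraform output -raw monitoring_public_ip)" hostname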
Generate the Ansible inventory:
mkdir -p ../ansible/inventory
terraform output -raw ansible_inventory > ../ansible/inventory/hosts.yml
Part 2: Configuration with Ansible
Now we configure our servers. Ansible will:
- Install system dependencies
- Configure ClickHouse Keeper (coordination)
- Configure ClickHouse Server
- Set up Prometheus + Grafana
- Configure Nginx for SSL
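Before writing any roles, it's worth confirming Ansible can parse the generated inventory and reach every host over SSH (run from the ansible/ directory; flags are explicit here because ansible.cfg doesn't exist yet):
# Show the host groups Terraform generated, then ping every host
ansible-inventory -i inventory/hosts.yml --graph
ansible -i inventory/hosts.yml all -m ping --private-key ../secrets/clickhouse-key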
Ansible Structure
ansible/
├── ansible.cfg # Configuration
├── site.yml # Main playbook
├── group_vars/
│ └── all.yml # Global variables
├── inventory/
│ └── hosts.yml # From Terraform
└── roles/
├── common/ # Base configuration
├── clickhouse-keeper/
├── clickhouse/
├── grafana/
└── nginx/
Step 1: Ansible Configuration
Create ansible/ansible.cfg:
[defaults]
inventory = inventory/hosts.yml
host_key_checking = False
private_key_file = ../secrets/clickhouse-key
retry_files_enabled = False
stdout_callback = yaml
deprecation_warnings = False
[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no
Step 2: Global Variables
Create ansible/group_vars/all.yml:
---
# Cluster configuration
cluster_name: production_cluster
# ClickHouse users
clickhouse_users:
- name: default
password: ""
networks:
- "127.0.0.1"
- "10.0.0.0/8"
profile: default
quota: default
- name: admin
password: "{{ lookup('env', 'CLICKHOUSE_ADMIN_PASSWORD') | default('ChangeMe123!', true) }}"
networks:
- "::/0"
profile: default
quota: default
access_management: true
- name: readonly
password: "{{ lookup('env', 'CLICKHOUSE_READONLY_PASSWORD') | default('ReadOnly123!', true) }}"
networks:
- "::/0"
profile: readonly
quota: default
access_management: false
# Keeper configuration
keeper_port: 9181
keeper_raft_port: 9234
# Monitoring
prometheus_retention: "15d"
grafana_admin_password: "{{ lookup('env', 'GRAFANA_ADMIN_PASSWORD') | default('admin', true) }}"
node_exporter_version: "1.7.0"
# System tuning
system_settings:
vm_max_map_count: 262144
net_core_somaxconn: 65535
fs_file_max: 2097152
# Domain
domain_name: yourdomain.com
# Python interpreter
ansible_python_interpreter: /usr/bin/python3
User roles explained:
| User | Purpose | Permissions |
|---|---|---|
| default | Internal cluster communication | No external access |
| admin | Application writes, DDL | Full access |
| readonly | Dashboards, reporting | SELECT only |
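After the cluster is deployed (end of Part 2), a quick way to confirm the roles behave as intended is the HTTP interface. This is a sketch: it assumes the readonly profile in your users.xml.j2 enforces read-only mode, uses the password variables exported later in Step 7, and the load balancer IP is a placeholder:
# readonly can read...
curl -s -u readonly:"$CLICKHOUSE_READONLY_PASSWORD" "http://<LOAD_BALANCER_IP>:8123/" -d "SELECT currentUser()"
# ...but writes should be rejected with a readonly error
curl -s -u readonly:"$CLICKHOUSE_READONLY_PASSWORD" "http://<LOAD_BALANCER_IP>:8123/" -d "CREATE TABLE t (x UInt8) ENGINE = Memory"
# admin has full access
curl -s -u admin:"$CLICKHOUSE_ADMIN_PASSWORD" "http://<LOAD_BALANCER_IP>:8123/" -d "SELECT version()"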
Step 3: Main Playbook
Create ansible/site.yml:
---
- name: Apply common configuration
hosts: all
become: true
roles:
- common
- name: Configure ClickHouse Keeper
hosts: keeper
become: true
roles:
- clickhouse-keeper
- name: Configure ClickHouse servers
hosts: clickhouse
become: true
roles:
- clickhouse
- name: Configure monitoring
hosts: monitoring
become: true
roles:
- grafana
- nginx
- name: Verify cluster
hosts: clickhouse[0]
become: true
tasks:
- name: Wait for ClickHouse
wait_for:
port: 8123
delay: 5
timeout: 60
- name: Check cluster status
shell: clickhouse-client --query "SELECT cluster, count() FROM system.clusters WHERE cluster='{{ cluster_name }}' GROUP BY cluster"
register: cluster_status
- name: Display status
debug:
msg: "{{ cluster_status.stdout }}"Step 4: Common Role
Create ansible/roles/common/tasks/main.yml:
---
- name: Wait for apt locks
shell: |
TIMEOUT=60
COUNTER=0
while fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do
if [ $COUNTER -ge $TIMEOUT ]; then
pkill -9 unattended-upgr || true
pkill -9 apt || true
rm -f /var/lib/dpkg/lock-frontend
break
fi
sleep 2
COUNTER=$((COUNTER + 2))
done
dpkg --configure -a || true
changed_when: false
- name: Disable unattended-upgrades
systemd:
name: unattended-upgrades
state: stopped
enabled: false
ignore_errors: true
- name: Install common packages
apt:
name:
- apt-transport-https
- ca-certificates
- curl
- gnupg
- htop
- net-tools
- vim
- wget
- jq
state: present
update_cache: true
- name: Configure sysctl
sysctl:
name: "{{ item.name }}"
value: "{{ item.value }}"
state: present
reload: true
loop:
- { name: "vm.max_map_count", value: "{{ system_settings.vm_max_map_count }}" }
- { name: "net.core.somaxconn", value: "{{ system_settings.net_core_somaxconn }}" }
- { name: "fs.file-max", value: "{{ system_settings.fs_file_max }}" }
- name: Configure /etc/hosts
lineinfile:
path: /etc/hosts
line: "{{ hostvars[item]['private_ip'] }} {{ item }}"
state: present
loop: "{{ groups['all'] }}"
- name: Install Node Exporter
include_tasks: node_exporter.yml
Step 5: ClickHouse Keeper Role
Create ansible/roles/clickhouse-keeper/tasks/main.yml:
---
- name: Add ClickHouse repository key
apt_key:
keyserver: keyserver.ubuntu.com
id: 8919F6BD2B48D754
- name: Add ClickHouse repository
apt_repository:
repo: "deb https://packages.clickhouse.com/deb lts main"
state: present
- name: Install ClickHouse Keeper
apt:
name: clickhouse-keeper
state: present
update_cache: true
- name: Create directories
file:
path: "{{ item }}"
state: directory
owner: clickhouse
group: clickhouse
mode: '0750'
loop:
- /var/lib/clickhouse-keeper
- /var/log/clickhouse-keeper
- name: Configure Keeper
template:
src: keeper_config.xml.j2
dest: /etc/clickhouse-keeper/keeper_config.xml
owner: clickhouse
group: clickhouse
notify: Restart clickhouse-keeper
- name: Start Keeper
systemd:
name: clickhouse-keeper
state: started
enabled: true
Create ansible/roles/clickhouse-keeper/templates/keeper_config.xml.j2:
<?xml version="1.0"?>
<clickhouse>
<logger>
<level>information</level>
<log>/var/log/clickhouse-keeper/clickhouse-keeper.log</log>
<errorlog>/var/log/clickhouse-keeper/clickhouse-keeper.err.log</errorlog>
<size>100M</size>
<count>3</count>
</logger>
<listen_host>0.0.0.0</listen_host>
<keeper_server>
<tcp_port>{{ keeper_port }}</tcp_port>
<server_id>{{ keeper_id }}</server_id>
<log_storage_path>/var/lib/clickhouse-keeper/log</log_storage_path>
<snapshot_storage_path>/var/lib/clickhouse-keeper/snapshots</snapshot_storage_path>
<coordination_settings>
<operation_timeout_ms>10000</operation_timeout_ms>
<session_timeout_ms>30000</session_timeout_ms>
<raft_logs_level>warning</raft_logs_level>
</coordination_settings>
<raft_configuration>
{% for host in groups['keeper'] %}
<server>
<id>{{ hostvars[host]['keeper_id'] }}</id>
<hostname>{{ hostvars[host]['private_ip'] }}</hostname>
<port>{{ keeper_raft_port }}</port>
</server>
{% endfor %}
</raft_configuration>
</keeper_server>
<prometheus>
<endpoint>/metrics</endpoint>
<port>9363</port>
<metrics>true</metrics>
</prometheus>
</clickhouse>
Key configuration points:
- server_id: Unique ID for each Keeper node (1, 2, 3)
- raft_configuration: Lists all Keeper nodes for consensus
- prometheus: Exposes metrics on port 9363
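After Ansible runs (Part 2, Step 7), each Keeper instance can be probed with ZooKeeper-style four-letter commands on port 9181. A minimal check, assuming nc is available on the node (install netcat-openbsd if it isn't):
# Liveness: a healthy Keeper answers "imok"
echo ruok | nc localhost 9181
# Role and quorum state: look for zk_server_state (leader or follower)
echo mntr | nc localhost 9181 | grep zk_server_state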
Step 6: ClickHouse Server Role
Create ansible/roles/clickhouse/tasks/main.yml:
---
- name: Add ClickHouse repository
apt_repository:
repo: "deb https://packages.clickhouse.com/deb lts main"
state: present
- name: Install ClickHouse
apt:
name:
- clickhouse-server
- clickhouse-client
state: present
update_cache: true
- name: Configure cluster
template:
src: config.xml.j2
dest: /etc/clickhouse-server/config.d/cluster.xml
owner: clickhouse
group: clickhouse
notify: Restart clickhouse-server
- name: Configure users
template:
src: users.xml.j2
dest: /etc/clickhouse-server/users.d/custom_users.xml
owner: clickhouse
group: clickhouse
notify: Restart clickhouse-server
- name: Configure macros
template:
src: macros.xml.j2
dest: /etc/clickhouse-server/config.d/macros.xml
owner: clickhouse
group: clickhouse
notify: Restart clickhouse-server
- name: Configure Keeper connection
template:
src: zookeeper.xml.j2
dest: /etc/clickhouse-server/config.d/zookeeper.xml
owner: clickhouse
group: clickhouse
notify: Restart clickhouse-server
- name: Enable ClickHouse
systemd:
name: clickhouse-server
enabled: true
- name: Start ClickHouse (async)
shell: systemctl start clickhouse-server &
async: 30
poll: 0
- name: Wait for ClickHouse
wait_for:
port: 8123
delay: 5
timeout: 120
Create ansible/roles/clickhouse/templates/config.xml.j2:
<?xml version="1.0"?>
<clickhouse>
<listen_host>0.0.0.0</listen_host>
<http_port>8123</http_port>
<tcp_port>9000</tcp_port>
<interserver_http_port>9009</interserver_http_port>
<remote_servers>
<{{ cluster_name }}>
<shard>
<internal_replication>true</internal_replication>
{% for host in groups['clickhouse'] if hostvars[host]['shard'] == 1 %}
<replica>
<host>{{ hostvars[host]['private_ip'] }}</host>
<port>9000</port>
</replica>
{% endfor %}
</shard>
<shard>
<internal_replication>true</internal_replication>
{% for host in groups['clickhouse'] if hostvars[host]['shard'] == 2 %}
<replica>
<host>{{ hostvars[host]['private_ip'] }}</host>
<port>9000</port>
</replica>
{% endfor %}
</shard>
</{{ cluster_name }}>
</remote_servers>
<prometheus>
<endpoint>/metrics</endpoint>
<port>9364</port>
<metrics>true</metrics>
</prometheus>
<distributed_ddl>
<path>/clickhouse/task_queue/ddl</path>
</distributed_ddl>
</clickhouse>
Cluster topology:
- 2 shards: Data is distributed across shards
- Replicas: Each shard can have multiple replicas for redundancy
- With 3 nodes: Shard 1 has 2 replicas, Shard 2 has 1 replica
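After Part 2 completes, you can confirm ClickHouse sees exactly this layout by querying system.clusters on any node:
# Shard and replica layout as ClickHouse sees it
clickhouse-client --query "
  SELECT cluster, shard_num, replica_num, host_address
  FROM system.clusters
  WHERE cluster = 'production_cluster'"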
Create ansible/roles/clickhouse/templates/macros.xml.j2:
<?xml version="1.0"?>
<clickhouse>
<macros>
<cluster>{{ cluster_name }}</cluster>
<shard>{{ shard }}</shard>
<replica>{{ inventory_hostname }}</replica>
</macros>
</clickhouse>
Macros explained:
Macros are variables used in ReplicatedMergeTree table definitions:
CREATE TABLE events ON CLUSTER 'production_cluster'
(
event_id UUID,
timestamp DateTime
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY timestamp;
The {shard} and {replica} placeholders are automatically replaced with the values from the macros file.
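You can inspect the values each node will substitute by querying system.macros after deployment:
# Per-node macro substitutions used in ReplicatedMergeTree paths
clickhouse-client --query "SELECT macro, substitution FROM system.macros"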
Step 7: Run Ansible
Set your passwords and run:
cd ansible
# Create .env file
cat > .env << 'EOF'
export CLICKHOUSE_ADMIN_PASSWORD="YourSecureAdminPass123"
export CLICKHOUSE_READONLY_PASSWORD="YourSecureReadOnlyPass123"
export GRAFANA_ADMIN_PASSWORD="YourGrafanaPass123"
EOF
# Source it
source .env
# Run playbook
ansible-playbook site.yml
This takes about 5-10 minutes. You'll see each task execute across all hosts.
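A quick ad-hoc check that all three nodes came up (run from the ansible/ directory):
# Every ClickHouse node should answer with its version
ansible -i inventory/hosts.yml clickhouse -m shell \
  -a "clickhouse-client --query 'SELECT version()'"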
Part 3: SSL with Cloudflare
Cloudflare provides free SSL, but it only proxies ports 80/443. We need Nginx to:
- Listen on port 80/443
- Terminate SSL (with self-signed cert—Cloudflare handles public SSL)
- Proxy requests to internal services
Nginx Role
Create ansible/roles/nginx/tasks/main.yml:
---
- name: Install Nginx
apt:
name: nginx
state: present
update_cache: true
- name: Create SSL directory
file:
path: /etc/nginx/ssl
state: directory
mode: '0700'
- name: Generate self-signed certificate
command: >
openssl req -x509 -nodes -days 365 -newkey rsa:2048
-keyout /etc/nginx/ssl/private.key
-out /etc/nginx/ssl/certificate.crt
-subj "/C=US/ST=State/L=City/O=Org/CN={{ domain_name }}"
args:
creates: /etc/nginx/ssl/certificate.crt
- name: Configure Grafana proxy
template:
src: grafana.conf.j2
dest: /etc/nginx/sites-available/grafana.conf
notify: Reload nginx
- name: Configure ClickHouse proxy
template:
src: clickhouse.conf.j2
dest: /etc/nginx/sites-available/clickhouse.conf
notify: Reload nginx
- name: Enable sites
file:
src: "/etc/nginx/sites-available/{{ item }}"
dest: "/etc/nginx/sites-enabled/{{ item }}"
state: link
loop:
- grafana.conf
- clickhouse.conf
notify: Reload nginx
- name: Remove default site
file:
path: /etc/nginx/sites-enabled/default
state: absent
- name: Start Nginx
systemd:
name: nginx
state: started
enabled: true
Create ansible/roles/nginx/templates/clickhouse.conf.j2:
server {
listen 80;
server_name clickhouse.{{ domain_name }};
return 301 https://$host$request_uri;
}
server {
listen 443 ssl;
server_name clickhouse.{{ domain_name }};
ssl_certificate /etc/nginx/ssl/certificate.crt;
ssl_certificate_key /etc/nginx/ssl/private.key;
ssl_protocols TLSv1.2 TLSv1.3;
location / {
proxy_pass http://{{ hostvars[groups['clickhouse'][0]]['private_ip'] }}:8123;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-Proto https;
proxy_buffering off;
proxy_read_timeout 300s;
}
}
Why self-signed certificates?
Cloudflare's "Full" SSL mode:
User → Cloudflare (valid public SSL) → Your Server (self-signed OK)
Cloudflare encrypts traffic to your origin, but doesn't verify the certificate. This is secure enough for most use cases and avoids the complexity of Let's Encrypt.
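You can test the origin path end-to-end before relying on Cloudflare by pinning the hostname to the monitoring node's IP; -k skips verification of the self-signed certificate, which mirrors what Cloudflare's Full (non-strict) mode does. The IP is a placeholder:
# Talk to Nginx directly, bypassing Cloudflare
curl -sk --resolve clickhouse.yourdomain.com:443:<MONITORING_PUBLIC_IP> \
  https://clickhouse.yourdomain.com/ping
A healthy setup returns "Ok." from ClickHouse via the Nginx proxy.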
Part 4: CI/CD with GitHub Actions
Automate deployments with GitHub Actions:
Create .github/workflows/deploy.yml:
name: Deploy ClickHouse Cluster
on:
push:
branches: [main]
paths:
- 'terraform/**'
- 'ansible/**'
- '.github/workflows/**'
workflow_dispatch:
inputs:
action:
description: 'Action'
required: true
default: 'plan'
type: choice
options: [plan, apply, destroy]
env:
TF_VERSION: '1.6.0'
HCLOUD_TOKEN: ${{ secrets.HCLOUD_TOKEN }}
CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
jobs:
terraform:
name: Terraform
runs-on: ubuntu-latest
environment: production
steps:
- uses: actions/checkout@v4
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ env.TF_VERSION }}
cli_config_credentials_token: ${{ secrets.TF_API_TOKEN }}
terraform_wrapper: false
- name: Terraform Init
working-directory: terraform
run: terraform init
- name: Terraform Plan
working-directory: terraform
run: |
terraform plan \
-var="hcloud_token=$HCLOUD_TOKEN" \
-var="cloudflare_api_token=$CLOUDFLARE_API_TOKEN"
- name: Terraform Apply
if: github.event_name == 'push' || github.event.inputs.action == 'apply'
working-directory: terraform
run: |
terraform apply -auto-approve \
-var="hcloud_token=$HCLOUD_TOKEN" \
-var="cloudflare_api_token=$CLOUDFLARE_API_TOKEN"
- name: Generate Inventory
if: github.event_name == 'push' || github.event.inputs.action == 'apply'
working-directory: terraform
run: |
mkdir -p ../ansible/inventory
terraform output -raw ansible_inventory > ../ansible/inventory/hosts.yml
- name: Upload Inventory
if: github.event_name == 'push' || github.event.inputs.action == 'apply'
uses: actions/upload-artifact@v4
with:
name: inventory
path: ansible/inventory/hosts.yml
ansible:
name: Ansible
runs-on: ubuntu-latest
needs: terraform
if: github.event_name == 'push' || github.event.inputs.action == 'apply'
environment: production
steps:
- uses: actions/checkout@v4
- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install Ansible
run: pip install ansible
- name: Download Inventory
uses: actions/download-artifact@v4
with:
name: inventory
path: ansible/inventory
- name: Setup SSH
run: |
mkdir -p ~/.ssh
echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/cluster_key
chmod 600 ~/.ssh/cluster_key
- name: Wait for servers
run: sleep 120
- name: Run Ansible
working-directory: ansible
env:
ANSIBLE_HOST_KEY_CHECKING: 'false'
CLICKHOUSE_ADMIN_PASSWORD: ${{ secrets.CLICKHOUSE_ADMIN_PASSWORD }}
CLICKHOUSE_READONLY_PASSWORD: ${{ secrets.CLICKHOUSE_READONLY_PASSWORD }}
GRAFANA_ADMIN_PASSWORD: ${{ secrets.GRAFANA_ADMIN_PASSWORD }}
run: |
ansible-playbook \
-i inventory/hosts.yml \
--private-key ~/.ssh/cluster_key \
site.yml
Required GitHub Secrets
| Secret | Description |
|---|---|
| HCLOUD_TOKEN | Hetzner API token |
| CLOUDFLARE_API_TOKEN | Cloudflare API token |
| TF_API_TOKEN | Terraform Cloud token |
| SSH_PRIVATE_KEY | Contents of secrets/clickhouse-key |
| CLICKHOUSE_ADMIN_PASSWORD | Admin password |
| CLICKHOUSE_READONLY_PASSWORD | Readonly password |
| GRAFANA_ADMIN_PASSWORD | Grafana password |
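If you use the GitHub CLI, the secrets can be loaded from the shell instead of the web UI; a sketch with placeholder values (add --env production if you store them as environment secrets, since the workflow above targets a production environment):
# Set repository secrets for the deploy workflow
gh secret set HCLOUD_TOKEN --body "your-hetzner-token"
gh secret set CLOUDFLARE_API_TOKEN --body "your-cloudflare-token"
gh secret set SSH_PRIVATE_KEY < secrets/clickhouse-key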
Part 5: Monitoring with Grafana
The monitoring stack includes:
- Prometheus: Scrapes metrics from all nodes
- Grafana: Visualizes metrics
- Node Exporter: System metrics (CPU, RAM, disk)
Prometheus Configuration
Create ansible/roles/grafana/templates/prometheus.yml.j2:
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node'
static_configs:
- targets:
{% for host in groups['all'] %}
- '{{ hostvars[host]["private_ip"] }}:9100'
{% endfor %}
- job_name: 'clickhouse'
static_configs:
- targets:
{% for host in groups['clickhouse'] %}
- '{{ hostvars[host]["private_ip"] }}:9364'
{% endfor %}
- job_name: 'clickhouse-keeper'
static_configs:
- targets:
{% for host in groups['keeper'] %}
- '{{ hostvars[host]["private_ip"] }}:9363'
{% endfor %}
Key Metrics to Monitor
| Metric | Description | Alert Threshold |
|---|---|---|
| ClickHouseMetrics_Query | Active queries | > 100 |
| ClickHouseMetrics_MemoryTracking | Memory usage | > 80% |
| ClickHouseProfileEvents_FailedQuery | Failed queries | > 0 |
| node_cpu_seconds_total | CPU usage | > 80% |
| node_memory_MemAvailable_bytes | Available RAM | < 1GB |
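Once Prometheus is scraping, you can sanity-check any of these series from the monitoring node via the Prometheus HTTP API (metric name shown as an example; exact names depend on your ClickHouse version):
# Current number of running queries, per scraped node
curl -s 'http://localhost:9090/api/v1/query?query=ClickHouseMetrics_Query' | jq '.data.result'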
Security Hardening
1. Restrict SSH Access
# terraform.tfvars
ssh_allowed_ips = ["YOUR.IP.ADDRESS/32"]
2. Use Strong Passwords
# Generate secure passwords
openssl rand -base64 24
3. Network Segmentation
All inter-node traffic uses the private network (10.0.1.0/24). Only necessary ports are exposed publicly.
4. Firewall Rules
Only these ports are accessible from the internet:
- 22 (SSH) - restricted to your IP
- 80/443 (HTTP/HTTPS) - for Cloudflare proxy
- 8123/9000 (ClickHouse) - consider restricting to known IPs
5. User Permissions
The readonly user cannot modify data:
-- This will fail for readonly user
INSERT INTO events VALUES (...);
-- Error: readonly: Cannot execute query in readonly mode
Connecting to Your Cluster
HTTP API (via Cloudflare)
# Simple query
curl "https://clickhouse.yourdomain.com/" \
-u admin:YourAdminPassword \
-d "SELECT version()"
# JSON format
curl "https://clickhouse.yourdomain.com/?default_format=JSON" \
-u admin:YourAdminPassword \
-d "SELECT * FROM system.clusters"Native Protocol (direct)
clickhouse-client \
--host <LOAD_BALANCER_IP> \
--user admin \
--password 'YourAdminPassword'
Creating Tables
-- Create a replicated table
CREATE TABLE events ON CLUSTER 'production_cluster'
(
event_id UUID DEFAULT generateUUIDv4(),
timestamp DateTime DEFAULT now(),
user_id UInt64,
event_type String,
properties String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id);
-- Create distributed table for querying across shards
CREATE TABLE events_distributed ON CLUSTER 'production_cluster'
AS events
ENGINE = Distributed('production_cluster', 'default', 'events', rand());
-- Insert data (goes to one shard)
INSERT INTO events (user_id, event_type, properties)
VALUES (1, 'page_view', '{"page": "/home"}');
-- Query all shards
SELECT count() FROM events_distributed;
Troubleshooting Guide
Issue: Ansible hangs on apt install
Cause: Ubuntu's unattended-upgrades is locking apt.
Solution:
ansible -i inventory/hosts.yml all -m shell -a "pkill -9 unattended-upgr; systemctl disable unattended-upgrades"
Issue: ClickHouse won't start
Cause: Usually port conflict or config error.
Solution:
# Check logs
ssh root@<node-ip> "cat /var/log/clickhouse-server/clickhouse-server.err.log | tail -50"
# Common fix: port 9363 conflict between Keeper and Server
# Server should use 9364, Keeper uses 9363
Issue: Cloudflare 521 error
Cause: Origin server not responding on port 80/443.
Solution:
# Check Nginx
ansible -i inventory/hosts.yml monitoring -m shell -a "systemctl status nginx"
# Check port 80
ansible -i inventory/hosts.yml monitoring -m shell -a "ss -tlnp | grep :80"
Issue: Authentication failed
Cause: Passwords don't match or env vars weren't set.
Solution:
source .env
ansible-playbook site.yml --limit clickhouse
ansible -i inventory/hosts.yml clickhouse -m shell -a "systemctl restart clickhouse-server &"
Cost Analysis
Monthly Breakdown
| Resource | Type | Cost |
|---|---|---|
| 3x ClickHouse Nodes | CX33 (4 vCPU, 8GB) | €26.97 |
| 1x Monitoring | CX23 (2 vCPU, 4GB) | €4.49 |
| Load Balancer | LB11 | €5.39 |
| Network | Private | Free |
| Traffic | 20TB included | Free |
| Total | | €36.85 |
Scaling Costs
| Configuration | Nodes | Monthly Cost |
|---|---|---|
| Small (this guide) | 3x CX33 | €37 |
| Medium | 3x CX43 | €59 |
| Large | 5x CX43 | €95 |
| Enterprise | 5x CCX33 | €270 |
Conclusion
You now have a production-ready ClickHouse cluster with:
✅ High Availability: 3-node Keeper quorum
✅ Data Replication: Automatic via ReplicatedMergeTree
✅ Load Balancing: Health-checked traffic distribution
✅ SSL/TLS: Free via Cloudflare
✅ Monitoring: Prometheus + Grafana dashboards
✅ CI/CD: Automated deployments via GitHub Actions
✅ Security: Firewall, private network, user roles
✅ Cost: ~€37/month
Next Steps
- Add alerting: Configure Grafana alerts for critical metrics
- Backup strategy: Set up S3-compatible backup with Hetzner's Object Storage
- Scale horizontally: Add more nodes as data grows
- Tune performance: Adjust memory settings based on workload