
Building a Production-Ready ClickHouse Cluster on Hetzner Cloud

A comprehensive guide to deploying a scalable, cost-effective ClickHouse analytics cluster using Terraform, Ansible, Cloudflare SSL, and GitHub Actions CI/CD. Complete with high availability, monitoring, and security hardening.

Capturable Team
December 29, 2025
22 min read

Introduction

If you're dealing with analytics workloads, log processing, or any scenario that requires fast aggregations over billions of rows, you've probably heard of ClickHouse. Originally developed at Yandex, ClickHouse is a column-oriented database management system that can process analytical queries at incredible speeds.

However, setting up a production-ready ClickHouse cluster isn't trivial. You need to consider:

  • High availability: What happens when a node fails?
  • Data replication: How do you prevent data loss?
  • Coordination: How do nodes communicate and elect leaders?
  • Monitoring: How do you know when something goes wrong?
  • Security: How do you protect your data?
  • Cost: Can you do all this without breaking the bank?

In this guide, I'll walk you through building a complete production-ready ClickHouse cluster on Hetzner Cloud for approximately €37/month. We'll use:

  • Terraform for infrastructure provisioning
  • Ansible for configuration management
  • ClickHouse Keeper for distributed coordination
  • Cloudflare for SSL and DDoS protection
  • Grafana + Prometheus for monitoring
  • GitHub Actions for CI/CD

By the end, you'll have a fully automated, scalable, and secure analytics infrastructure.


Why This Stack?

Why ClickHouse?

ClickHouse excels at:

  • Speed: Queries that take minutes in traditional databases complete in seconds
  • Compression: Column storage achieves 10-20x compression ratios
  • Real-time ingestion: Handle millions of rows per second
  • SQL compatibility: Familiar query language with powerful extensions

Real-world example: At scale, a query like "count all events from last month grouped by country" might scan 50 billion rows in under 2 seconds.

Why Hetzner Cloud?

While AWS, GCP, and Azure dominate the cloud market, Hetzner offers compelling advantages:

Provider   | 4 vCPU, 8GB RAM | Egress (1TB) | Total/Month
AWS (EC2)  | ~$120           | $90          | ~$210
GCP        | ~$100           | $120         | ~$220
Hetzner    | €8.99           | Free (20TB)  | ~€9

For a 3-node cluster, Hetzner saves you ~€500/month compared to hyperscalers. That's significant for startups and side projects.

Why Terraform + Ansible?

This combination follows the separation of concerns principle:

  • Terraform: "What infrastructure exists" (declarative)
  • Ansible: "How infrastructure is configured" (procedural)

Terraform creates servers, networks, and load balancers. Ansible installs software, manages configs, and handles services. Each tool does what it's best at.

Why ClickHouse Keeper instead of ZooKeeper?

Historically, ClickHouse used Apache ZooKeeper for coordination. ClickHouse Keeper is a drop-in replacement that:

  • Runs as a single binary (simpler operations)
  • Has the same protocol (compatible with existing configs)
  • Uses less memory
  • Is maintained by the ClickHouse team

Architecture Overview

Here's what we're building:

                                    Internet
                                        │
                                        ▼
                            ┌───────────────────────┐
                            │      Cloudflare       │
                            │   (SSL Termination)   │
                            │   grafana.domain.com  │
                            │ clickhouse.domain.com │
                            └───────────┬───────────┘
                                        │ HTTPS
                                        ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│                               Hetzner Cloud                                    │
│                                                                                │
│  ┌─────────────────────────────────────────────────────────────────────────┐  │
│  │                    Hetzner Load Balancer (LB11)                          │  │
│  │                    Health checks: /ping on port 8123                     │  │
│  └─────────────────────────────────┬───────────────────────────────────────┘  │
│                                    │                                           │
│         ┌──────────────────────────┼──────────────────────────┐               │
│         │                          │                          │               │
│         ▼                          ▼                          ▼               │
│   ┌───────────┐              ┌───────────┐              ┌───────────┐         │
│   │ CH Node 1 │              │ CH Node 2 │              │ CH Node 3 │         │
│   ├───────────┤              ├───────────┤              ├───────────┤         │
│   │ClickHouse │              │ClickHouse │              │ClickHouse │         │
│   │  Server   │◄────────────►│  Server   │◄────────────►│  Server   │         │
│   ├───────────┤  Replication ├───────────┤  Replication ├───────────┤         │
│   │  Keeper   │              │  Keeper   │              │  Keeper   │         │
│   │ (Raft)    │◄────────────►│ (Raft)    │◄────────────►│ (Raft)    │         │
│   └─────┬─────┘   Consensus  └─────┬─────┘   Consensus  └─────┬─────┘         │
│         │                          │                          │               │
│         └──────────────────────────┼──────────────────────────┘               │
│                                    │                                           │
│                          Private Network                                       │
│                           10.0.1.0/24                                          │
│                                    │                                           │
│                          ┌─────────┴─────────┐                                │
│                          │    Monitoring     │                                │
│                          │  ┌─────────────┐  │                                │
│                          │  │   Grafana   │  │                                │
│                          │  │  Prometheus │  │                                │
│                          │  │    Nginx    │  │                                │
│                          │  └─────────────┘  │                                │
│                          └───────────────────┘                                │
│                                                                                │
└───────────────────────────────────────────────────────────────────────────────┘

Component Breakdown

Component             | Purpose                    | Why We Need It
ClickHouse Server     | Analytics database         | The core of our system
ClickHouse Keeper     | Distributed coordination   | Leader election, distributed locks, replication coordination
Hetzner Load Balancer | Traffic distribution       | Round-robin queries across healthy nodes
Private Network       | Secure communication       | Cluster traffic never leaves Hetzner's network
Prometheus            | Metrics collection         | Time-series data for monitoring
Grafana               | Visualization              | Dashboards and alerting
Nginx                 | Reverse proxy              | SSL termination for Cloudflare
Cloudflare            | DNS, SSL, DDoS protection  | Free SSL and protection

Data Flow

  1. Writes: Client → Load Balancer → ClickHouse Node → Replicated to other nodes via Keeper
  2. Reads: Client → Load Balancer → Any healthy ClickHouse Node
  3. Metrics: All nodes → Prometheus → Grafana

Why 3 Nodes?

The number 3 isn't arbitrary. For consensus algorithms like Raft (used by Keeper):

  • 3 nodes: Tolerates 1 failure (need majority: 2)
  • 5 nodes: Tolerates 2 failures (need majority: 3)

With 3 nodes, we get high availability while keeping costs reasonable.


Prerequisites

Before we begin, ensure you have:

Accounts

  1. Hetzner Cloud - Sign up here
  2. Cloudflare - Sign up here (free tier works)
  3. Terraform Cloud - Sign up here (free tier)
  4. GitHub - For CI/CD

Local Tools

# macOS
brew install terraform ansible
 
# Ubuntu/Debian
sudo apt update
sudo apt install -y terraform ansible
 
# Verify installations
terraform version  # Should be 1.6+
ansible --version  # Should be 2.15+

API Tokens

You'll need to generate:

Token                 | Where                            | Permissions
Hetzner API Token     | Console → Security → API Tokens  | Read & Write
Cloudflare API Token  | Profile → API Tokens             | Zone:DNS:Edit, Zone:Settings:Edit
Terraform Cloud Token | User Settings → Tokens           | N/A

Domain Name

You need a domain managed by Cloudflare. We'll create:

  • clickhouse.yourdomain.com
  • grafana.yourdomain.com

Part 1: Infrastructure with Terraform

Let's start by creating the infrastructure. Create a new project directory:

mkdir clickhouse-cluster && cd clickhouse-cluster
mkdir -p terraform ansible secrets

Understanding the Terraform Structure

terraform/
├── main.tf           # Core resources (servers, network, LB)
├── variables.tf      # Input variables
├── outputs.tf        # Output values
├── cloudflare.tf     # DNS and SSL settings
├── backend.tf        # Terraform Cloud configuration
└── templates/
    └── inventory.tpl # Ansible inventory template

Step 1: Provider Configuration

Create terraform/main.tf:

# terraform/main.tf
 
terraform {
  required_version = ">= 1.6.0"
 
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = "~> 1.45"
    }
    cloudflare = {
      source  = "cloudflare/cloudflare"
      version = "~> 4.0"
    }
  }
}
 
provider "hcloud" {
  token = var.hcloud_token
}
 
provider "cloudflare" {
  api_token = var.cloudflare_api_token
}

What's happening here?

  • We declare our required Terraform version (1.6+)
  • We specify two providers: hcloud for Hetzner and cloudflare for DNS
  • Providers are authenticated via variables (we'll define those next)

Step 2: Variables

Create terraform/variables.tf:

# terraform/variables.tf
 
# =============================================================================
# REQUIRED VARIABLES
# =============================================================================
 
variable "hcloud_token" {
  description = "Hetzner Cloud API token"
  type        = string
  sensitive   = true
}
 
variable "cloudflare_api_token" {
  description = "Cloudflare API token"
  type        = string
  sensitive   = true
}
 
variable "cloudflare_zone_name" {
  description = "Your domain name in Cloudflare"
  type        = string
}
 
# =============================================================================
# CLUSTER CONFIGURATION
# =============================================================================
 
variable "cluster_name" {
  description = "Name prefix for all resources"
  type        = string
  default     = "clickhouse"
}
 
variable "location" {
  description = "Hetzner datacenter location"
  type        = string
  default     = "nbg1"  # Nuremberg, Germany
  
  validation {
    condition     = contains(["fsn1", "nbg1", "hel1", "ash", "hil"], var.location)
    error_message = "Location must be: fsn1, nbg1, hel1, ash, or hil"
  }
}
 
variable "clickhouse_node_count" {
  description = "Number of ClickHouse nodes"
  type        = number
  default     = 3
  
  validation {
    condition     = var.clickhouse_node_count >= 1
    error_message = "Must have at least 1 node"
  }
}
 
variable "clickhouse_server_type" {
  description = "Server type for ClickHouse nodes"
  type        = string
  default     = "cx33"  # 4 vCPU, 8GB RAM
}
 
variable "monitoring_server_type" {
  description = "Server type for monitoring"
  type        = string
  default     = "cx23"  # 2 vCPU, 4GB RAM
}
 
variable "ssh_allowed_ips" {
  description = "IPs allowed to SSH (CIDR notation)"
  type        = list(string)
  default     = ["0.0.0.0/0"]  # Restrict in production!
}

Key decisions explained:

  • Location: nbg1 (Nuremberg) is often the cheapest and has good availability
  • Server types: cx33 gives us 4 vCPU and 8GB RAM—enough for moderate workloads
  • SSH access: Defaults to open, but you should restrict it to your own IP (see the snippet below)
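
A quick way to fill in ssh_allowed_ips is to look up your current public IPv4 address and drop it into terraform.tfvars as a /32. The service below (ifconfig.me) is just one of several that echo your address back; the IP shown is a placeholder:

# Find your public IPv4 address
curl -4 -s https://ifconfig.me

# Then set it in terraform.tfvars, e.g.:
# ssh_allowed_ips = ["203.0.113.42/32"]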

Step 3: SSH Key Generation

We need SSH access to our servers. Terraform can generate keys for us:

# Add to terraform/main.tf
 
# =============================================================================
# SSH KEY
# =============================================================================
 
resource "tls_private_key" "cluster_key" {
  algorithm = "ED25519"
}
 
resource "hcloud_ssh_key" "cluster" {
  name       = "${var.cluster_name}-key"
  public_key = tls_private_key.cluster_key.public_key_openssh
}
 
resource "local_file" "private_key" {
  content         = tls_private_key.cluster_key.private_key_openssh
  filename        = "${path.module}/../secrets/${var.cluster_name}-key"
  file_permission = "0600"
}

Why ED25519?

  • Shorter keys (256-bit vs 4096-bit RSA)
  • Faster operations
  • Modern and secure

Step 4: Network Infrastructure

Now let's create the private network:

# Add to terraform/main.tf
 
# =============================================================================
# NETWORK
# =============================================================================
 
resource "hcloud_network" "cluster" {
  name     = "${var.cluster_name}-network"
  ip_range = "10.0.0.0/16"
}
 
resource "hcloud_network_subnet" "cluster" {
  network_id   = hcloud_network.cluster.id
  type         = "cloud"
  network_zone = "eu-central"
  ip_range     = "10.0.1.0/24"
}

Why a private network?

  • Security: Inter-node traffic stays private
  • Performance: Lower latency than public internet
  • Cost: Private traffic is free

Step 5: Firewall Rules

Security first! Let's define what traffic is allowed:

# Add to terraform/main.tf
 
# =============================================================================
# FIREWALL
# =============================================================================
 
resource "hcloud_firewall" "cluster" {
  name = "${var.cluster_name}-firewall"
 
  # SSH access (restrict to your IP in production!)
  dynamic "rule" {
    for_each = var.ssh_allowed_ips
    content {
      direction  = "in"
      protocol   = "tcp"
      port       = "22"
      source_ips = [rule.value]
    }
  }
 
  # HTTP for Cloudflare proxy
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "80"
    source_ips = ["0.0.0.0/0", "::/0"]
  }
 
  # HTTPS for Cloudflare proxy
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "443"
    source_ips = ["0.0.0.0/0", "::/0"]
  }
 
  # Grafana
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "3000"
    source_ips = ["0.0.0.0/0", "::/0"]
  }
 
  # ClickHouse HTTP API
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "8123"
    source_ips = ["0.0.0.0/0", "::/0"]
  }
 
  # ClickHouse Native Protocol
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "9000"
    source_ips = ["0.0.0.0/0", "::/0"]
  }
 
  # Allow all outbound
  rule {
    direction       = "out"
    protocol        = "tcp"
    port            = "1-65535"
    destination_ips = ["0.0.0.0/0", "::/0"]
  }
 
  rule {
    direction       = "out"
    protocol        = "udp"
    port            = "1-65535"
    destination_ips = ["0.0.0.0/0", "::/0"]
  }
}

Step 6: ClickHouse Servers

Now the main event—creating our ClickHouse nodes:

# Add to terraform/main.tf
 
# =============================================================================
# CLICKHOUSE NODES
# =============================================================================
 
resource "hcloud_server" "clickhouse" {
  count       = var.clickhouse_node_count
  name        = "${var.cluster_name}-ch-${count.index + 1}"
  image       = "ubuntu-24.04"
  server_type = var.clickhouse_server_type
  location    = var.location
  ssh_keys    = [hcloud_ssh_key.cluster.id]
 
  labels = {
    cluster = var.cluster_name
    role    = "clickhouse"
  }
 
  # Cloud-init: disable unattended-upgrades (prevents apt lock issues)
  user_data = <<-EOF
    #cloud-config
    package_update: false
    package_upgrade: false
    runcmd:
      - systemctl stop unattended-upgrades || true
      - systemctl disable unattended-upgrades || true
      - apt-get remove -y unattended-upgrades || true
      - echo "vm.max_map_count=262144" >> /etc/sysctl.conf
      - sysctl -p
  EOF
 
  lifecycle {
    ignore_changes = [user_data]
  }
}
 
# Attach to private network
resource "hcloud_server_network" "clickhouse" {
  count     = var.clickhouse_node_count
  server_id = hcloud_server.clickhouse[count.index].id
  subnet_id = hcloud_network_subnet.cluster.id
  ip        = "10.0.1.${count.index + 2}"  # .2, .3, .4
}
 
# Attach firewall
resource "hcloud_firewall_attachment" "clickhouse" {
  firewall_id = hcloud_firewall.cluster.id
  server_ids  = hcloud_server.clickhouse[*].id
}

Important cloud-init settings:

  1. Disable unattended-upgrades: Ubuntu's automatic updates can lock apt during Ansible runs, causing failures
  2. vm.max_map_count: ClickHouse needs this increased for memory-mapped files
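
Once a node has booted, you can spot-check that cloud-init applied the kernel setting using the key Terraform wrote to secrets/ in Step 3 (replace <node-ip> with the node's public IP):

# Should print: vm.max_map_count = 262144
ssh -i secrets/clickhouse-key root@<node-ip> "sysctl vm.max_map_count"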

Step 7: Monitoring Server

# Add to terraform/main.tf
 
# =============================================================================
# MONITORING NODE
# =============================================================================
 
resource "hcloud_server" "monitoring" {
  name        = "${var.cluster_name}-monitoring"
  image       = "ubuntu-24.04"
  server_type = var.monitoring_server_type
  location    = var.location
  ssh_keys    = [hcloud_ssh_key.cluster.id]
 
  labels = {
    cluster = var.cluster_name
    role    = "monitoring"
  }
 
  user_data = <<-EOF
    #cloud-config
    package_update: false
    package_upgrade: false
    runcmd:
      - systemctl stop unattended-upgrades || true
      - systemctl disable unattended-upgrades || true
      - apt-get remove -y unattended-upgrades || true
  EOF
 
  lifecycle {
    ignore_changes = [user_data]
  }
}
 
resource "hcloud_server_network" "monitoring" {
  server_id = hcloud_server.monitoring.id
  subnet_id = hcloud_network_subnet.cluster.id
  ip        = "10.0.1.10"
}
 
resource "hcloud_firewall_attachment" "monitoring" {
  firewall_id = hcloud_firewall.cluster.id
  server_ids  = [hcloud_server.monitoring.id]
}

Step 8: Load Balancer

The load balancer distributes traffic across healthy nodes:

# Add to terraform/main.tf
 
# =============================================================================
# LOAD BALANCER
# =============================================================================
 
resource "hcloud_load_balancer" "cluster" {
  name               = "${var.cluster_name}-lb"
  load_balancer_type = "lb11"
  location           = var.location
}
 
resource "hcloud_load_balancer_network" "cluster" {
  load_balancer_id = hcloud_load_balancer.cluster.id
  subnet_id        = hcloud_network_subnet.cluster.id
  ip               = "10.0.1.100"
}
 
# HTTP API service
resource "hcloud_load_balancer_service" "http" {
  load_balancer_id = hcloud_load_balancer.cluster.id
  protocol         = "tcp"
  listen_port      = 8123
  destination_port = 8123
 
  health_check {
    protocol = "http"
    port     = 8123
    interval = 5
    timeout  = 3
    retries  = 3
 
    http {
      path         = "/ping"
      status_codes = ["200"]
    }
  }
}
 
# Native protocol service
resource "hcloud_load_balancer_service" "native" {
  load_balancer_id = hcloud_load_balancer.cluster.id
  protocol         = "tcp"
  listen_port      = 9000
  destination_port = 9000
 
  health_check {
    protocol = "tcp"
    port     = 9000
    interval = 5
    timeout  = 3
    retries  = 3
  }
}
 
# Attach ClickHouse nodes
resource "hcloud_load_balancer_target" "clickhouse" {
  count            = var.clickhouse_node_count
  type             = "server"
  load_balancer_id = hcloud_load_balancer.cluster.id
  server_id        = hcloud_server.clickhouse[count.index].id
  use_private_ip   = true
 
  depends_on = [
    hcloud_load_balancer_network.cluster,
    hcloud_server_network.clickhouse
  ]
}

Health check explained:

ClickHouse exposes a /ping endpoint that returns "Ok." when the server is healthy. The load balancer checks it every 5 seconds and removes unhealthy nodes from rotation.
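
You can hit the same endpoint yourself once a node is up, either directly or through the load balancer (substitute the IPs from your own deployment):

# Directly against a node
curl http://<node-ip>:8123/ping
# Ok.

# Through the load balancer
curl http://<LOAD_BALANCER_IP>:8123/ping
# Ok.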

Step 9: Cloudflare DNS

Create terraform/cloudflare.tf:

# terraform/cloudflare.tf
 
data "cloudflare_zone" "domain" {
  name = var.cloudflare_zone_name
}
 
# Grafana DNS record
resource "cloudflare_record" "grafana" {
  zone_id = data.cloudflare_zone.domain.id
  name    = "grafana"
  content = hcloud_server.monitoring.ipv4_address
  type    = "A"
  ttl     = 1  # Auto TTL when proxied
  proxied = true
}
 
# ClickHouse DNS record
resource "cloudflare_record" "clickhouse" {
  zone_id = data.cloudflare_zone.domain.id
  name    = "clickhouse"
  content = hcloud_server.monitoring.ipv4_address  # Goes through Nginx
  type    = "A"
  ttl     = 1
  proxied = true
}
 
# SSL settings
resource "cloudflare_zone_settings_override" "ssl_settings" {
  zone_id = data.cloudflare_zone.domain.id
 
  settings {
    ssl                      = "full"
    always_use_https         = "on"
    min_tls_version         = "1.2"
    automatic_https_rewrites = "on"
  }
}

Why route through monitoring node?

Cloudflare only proxies HTTP (ports 80/443). Our Nginx on the monitoring node:

  1. Terminates SSL
  2. Proxies to ClickHouse via private network
  3. Provides a single entry point

Step 10: Outputs

Create terraform/outputs.tf:

# terraform/outputs.tf
 
output "load_balancer_ip" {
  description = "Public IP of the load balancer"
  value       = hcloud_load_balancer.cluster.ipv4
}
 
output "clickhouse_https_url" {
  description = "ClickHouse HTTPS URL"
  value       = "https://clickhouse.${var.cloudflare_zone_name}"
}
 
output "grafana_https_url" {
  description = "Grafana HTTPS URL"
  value       = "https://grafana.${var.cloudflare_zone_name}"
}
 
output "monitoring_public_ip" {
  description = "Public IP of monitoring node"
  value       = hcloud_server.monitoring.ipv4_address
}
 
output "ansible_inventory" {
  description = "Generated Ansible inventory"
  value = templatefile("${path.module}/templates/inventory.tpl", {
    cluster_name     = var.cluster_name
    clickhouse_nodes = hcloud_server.clickhouse
    clickhouse_ips   = hcloud_server_network.clickhouse[*].ip
    monitoring_node  = hcloud_server.monitoring
    monitoring_ip    = hcloud_server_network.monitoring.ip
  })
}

Step 11: Inventory Template

Create terraform/templates/inventory.tpl:

---
# Auto-generated by Terraform
 
all:
  vars:
    ansible_user: root
    ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
    cluster_name: ${cluster_name}
 
  children:
    clickhouse:
      hosts:
%{ for i, node in clickhouse_nodes ~}
        ${node.name}:
          ansible_host: ${node.ipv4_address}
          private_ip: ${clickhouse_ips[i]}
          keeper_id: ${i + 1}
          shard: ${(i % 2) + 1}
          replica: ${floor(i / 2) + 1}
%{ endfor ~}
 
    keeper:
      hosts:
%{ for i, node in clickhouse_nodes ~}
        ${node.name}:
          ansible_host: ${node.ipv4_address}
          private_ip: ${clickhouse_ips[i]}
          keeper_id: ${i + 1}
%{ endfor ~}
 
    monitoring:
      hosts:
        ${monitoring_node.name}:
          ansible_host: ${monitoring_node.ipv4_address}
          private_ip: ${monitoring_ip}

Step 12: Backend Configuration

Create terraform/backend.tf:

# terraform/backend.tf
 
terraform {
  cloud {
    organization = "your-org-name"  # Change this
    
    workspaces {
      name = "clickhouse-cluster"
    }
  }
}

Step 13: Deploy Infrastructure

Now let's deploy:

cd terraform
 
# Create your variables file
cat > terraform.tfvars << 'EOF'
hcloud_token           = "your-hetzner-token"
cloudflare_api_token   = "your-cloudflare-token"
cloudflare_zone_name   = "yourdomain.com"
cluster_name           = "clickhouse"
location               = "nbg1"
clickhouse_server_type = "cx33"
monitoring_server_type = "cx23"
EOF
 
# Login to Terraform Cloud
terraform login
 
# Initialize
terraform init
 
# Plan (review changes)
terraform plan
 
# Apply
terraform apply

After ~2 minutes, you'll have:

  • 3 ClickHouse servers
  • 1 Monitoring server
  • 1 Load balancer
  • Private network
  • Firewall rules
  • DNS records

Generate the Ansible inventory:

terraform output -raw ansible_inventory > ../ansible/inventory/hosts.yml
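
Before moving on, it's worth confirming that Ansible can reach every host in the generated inventory. A quick connectivity check using the key from Step 3:

cd ../ansible
ansible -i inventory/hosts.yml all -m ping --private-key ../secrets/clickhouse-key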

Part 2: Configuration with Ansible

Now we configure our servers. Ansible will:

  1. Install system dependencies
  2. Configure ClickHouse Keeper (coordination)
  3. Configure ClickHouse Server
  4. Set up Prometheus + Grafana
  5. Configure Nginx for SSL

Ansible Structure

ansible/
├── ansible.cfg          # Configuration
├── site.yml             # Main playbook
├── group_vars/
│   └── all.yml          # Global variables
├── inventory/
│   └── hosts.yml        # From Terraform
└── roles/
    ├── common/          # Base configuration
    ├── clickhouse-keeper/
    ├── clickhouse/
    ├── grafana/
    └── nginx/

Step 1: Ansible Configuration

Create ansible/ansible.cfg:

[defaults]
inventory = inventory/hosts.yml
host_key_checking = False
private_key_file = ../secrets/clickhouse-key
retry_files_enabled = False
stdout_callback = yaml
deprecation_warnings = False
 
[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no

Step 2: Global Variables

Create ansible/group_vars/all.yml:

---
# Cluster configuration
cluster_name: production_cluster
 
# ClickHouse users
clickhouse_users:
  - name: default
    password: ""
    networks:
      - "127.0.0.1"
      - "10.0.0.0/8"
    profile: default
    quota: default
  - name: admin
    password: "{{ lookup('env', 'CLICKHOUSE_ADMIN_PASSWORD') | default('ChangeMe123!', true) }}"
    networks:
      - "::/0"
    profile: default
    quota: default
    access_management: true
  - name: readonly
    password: "{{ lookup('env', 'CLICKHOUSE_READONLY_PASSWORD') | default('ReadOnly123!', true) }}"
    networks:
      - "::/0"
    profile: readonly
    quota: default
    access_management: false
 
# Keeper configuration
keeper_port: 9181
keeper_raft_port: 9234
 
# Monitoring
prometheus_retention: "15d"
grafana_admin_password: "{{ lookup('env', 'GRAFANA_ADMIN_PASSWORD') | default('admin', true) }}"
node_exporter_version: "1.7.0"
 
# System tuning
system_settings:
  vm_max_map_count: 262144
  net_core_somaxconn: 65535
  fs_file_max: 2097152
 
# Domain
domain_name: yourdomain.com
 
# Python interpreter
ansible_python_interpreter: /usr/bin/python3

User roles explained:

User     | Purpose                        | Permissions
default  | Internal cluster communication | No external access
admin    | Application writes, DDL        | Full access
readonly | Dashboards, reporting          | SELECT only

Step 3: Main Playbook

Create ansible/site.yml:

---
- name: Apply common configuration
  hosts: all
  become: true
  roles:
    - common
 
- name: Configure ClickHouse Keeper
  hosts: keeper
  become: true
  roles:
    - clickhouse-keeper
 
- name: Configure ClickHouse servers
  hosts: clickhouse
  become: true
  roles:
    - clickhouse
 
- name: Configure monitoring
  hosts: monitoring
  become: true
  roles:
    - grafana
    - nginx
 
- name: Verify cluster
  hosts: clickhouse[0]
  become: true
  tasks:
    - name: Wait for ClickHouse
      wait_for:
        port: 8123
        delay: 5
        timeout: 60
 
    - name: Check cluster status
      shell: clickhouse-client --query "SELECT cluster, count() FROM system.clusters WHERE cluster='{{ cluster_name }}' GROUP BY cluster"
      register: cluster_status
 
    - name: Display status
      debug:
        msg: "{{ cluster_status.stdout }}"

Step 4: Common Role

Create ansible/roles/common/tasks/main.yml:

---
- name: Wait for apt locks
  shell: |
    TIMEOUT=60
    COUNTER=0
    while fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do
      if [ $COUNTER -ge $TIMEOUT ]; then
        pkill -9 unattended-upgr || true
        pkill -9 apt || true
        rm -f /var/lib/dpkg/lock-frontend
        break
      fi
      sleep 2
      COUNTER=$((COUNTER + 2))
    done
    dpkg --configure -a || true
  changed_when: false
 
- name: Disable unattended-upgrades
  systemd:
    name: unattended-upgrades
    state: stopped
    enabled: false
  ignore_errors: true
 
- name: Install common packages
  apt:
    name:
      - apt-transport-https
      - ca-certificates
      - curl
      - gnupg
      - htop
      - net-tools
      - vim
      - wget
      - jq
    state: present
    update_cache: true
 
- name: Configure sysctl
  sysctl:
    name: "{{ item.name }}"
    value: "{{ item.value }}"
    state: present
    reload: true
  loop:
    - { name: "vm.max_map_count", value: "{{ system_settings.vm_max_map_count }}" }
    - { name: "net.core.somaxconn", value: "{{ system_settings.net_core_somaxconn }}" }
    - { name: "fs.file-max", value: "{{ system_settings.fs_file_max }}" }
 
- name: Configure /etc/hosts
  lineinfile:
    path: /etc/hosts
    line: "{{ hostvars[item]['private_ip'] }} {{ item }}"
    state: present
  loop: "{{ groups['all'] }}"
 
- name: Install Node Exporter
  include_tasks: node_exporter.yml

Step 5: ClickHouse Keeper Role

Create ansible/roles/clickhouse-keeper/tasks/main.yml:

---
- name: Add ClickHouse repository key
  apt_key:
    keyserver: keyserver.ubuntu.com
    id: 8919F6BD2B48D754
 
- name: Add ClickHouse repository
  apt_repository:
    repo: "deb https://packages.clickhouse.com/deb lts main"
    state: present
 
- name: Install ClickHouse Keeper
  apt:
    name: clickhouse-keeper
    state: present
    update_cache: true
 
- name: Create directories
  file:
    path: "{{ item }}"
    state: directory
    owner: clickhouse
    group: clickhouse
    mode: '0750'
  loop:
    - /var/lib/clickhouse-keeper
    - /var/log/clickhouse-keeper
 
- name: Configure Keeper
  template:
    src: keeper_config.xml.j2
    dest: /etc/clickhouse-keeper/keeper_config.xml
    owner: clickhouse
    group: clickhouse
  notify: Restart clickhouse-keeper
 
- name: Start Keeper
  systemd:
    name: clickhouse-keeper
    state: started
    enabled: true

Create ansible/roles/clickhouse-keeper/templates/keeper_config.xml.j2:

<?xml version="1.0"?>
<clickhouse>
    <logger>
        <level>information</level>
        <log>/var/log/clickhouse-keeper/clickhouse-keeper.log</log>
        <errorlog>/var/log/clickhouse-keeper/clickhouse-keeper.err.log</errorlog>
        <size>100M</size>
        <count>3</count>
    </logger>
 
    <listen_host>0.0.0.0</listen_host>
 
    <keeper_server>
        <tcp_port>{{ keeper_port }}</tcp_port>
        <server_id>{{ keeper_id }}</server_id>
 
        <log_storage_path>/var/lib/clickhouse-keeper/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse-keeper/snapshots</snapshot_storage_path>
 
        <coordination_settings>
            <operation_timeout_ms>10000</operation_timeout_ms>
            <session_timeout_ms>30000</session_timeout_ms>
            <raft_logs_level>warning</raft_logs_level>
        </coordination_settings>
 
        <raft_configuration>
{% for host in groups['keeper'] %}
            <server>
                <id>{{ hostvars[host]['keeper_id'] }}</id>
                <hostname>{{ hostvars[host]['private_ip'] }}</hostname>
                <port>{{ keeper_raft_port }}</port>
            </server>
{% endfor %}
        </raft_configuration>
    </keeper_server>
 
    <prometheus>
        <endpoint>/metrics</endpoint>
        <port>9363</port>
        <metrics>true</metrics>
    </prometheus>
</clickhouse>

Key configuration points:

  • server_id: Unique ID for each Keeper node (1, 2, 3)
  • raft_configuration: Lists all Keeper nodes for consensus
  • prometheus: Exposes metrics on port 9363
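
Once the playbook has run (Step 7), you can sanity-check the quorum with the ZooKeeper-style four-letter-word commands Keeper exposes on its client port. A minimal check from any node, assuming netcat is installed and the default four-letter-word whitelist:

# "imok" means this Keeper instance is alive
echo ruok | nc 10.0.1.2 9181

# mntr reports the node's role (leader/follower) and quorum details
echo mntr | nc 10.0.1.2 9181 | grep zk_server_state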

Step 6: ClickHouse Server Role

Create ansible/roles/clickhouse/tasks/main.yml:

---
- name: Add ClickHouse repository
  apt_repository:
    repo: "deb https://packages.clickhouse.com/deb lts main"
    state: present
 
- name: Install ClickHouse
  apt:
    name:
      - clickhouse-server
      - clickhouse-client
    state: present
    update_cache: true
 
- name: Configure cluster
  template:
    src: config.xml.j2
    dest: /etc/clickhouse-server/config.d/cluster.xml
    owner: clickhouse
    group: clickhouse
  notify: Restart clickhouse-server
 
- name: Configure users
  template:
    src: users.xml.j2
    dest: /etc/clickhouse-server/users.d/custom_users.xml
    owner: clickhouse
    group: clickhouse
  notify: Restart clickhouse-server
 
- name: Configure macros
  template:
    src: macros.xml.j2
    dest: /etc/clickhouse-server/config.d/macros.xml
    owner: clickhouse
    group: clickhouse
  notify: Restart clickhouse-server
 
- name: Configure Keeper connection
  template:
    src: zookeeper.xml.j2
    dest: /etc/clickhouse-server/config.d/zookeeper.xml
    owner: clickhouse
    group: clickhouse
  notify: Restart clickhouse-server
 
- name: Enable ClickHouse
  systemd:
    name: clickhouse-server
    enabled: true
 
- name: Start ClickHouse (async)
  shell: systemctl start clickhouse-server &
  async: 30
  poll: 0
 
- name: Wait for ClickHouse
  wait_for:
    port: 8123
    delay: 5
    timeout: 120

Create ansible/roles/clickhouse/templates/config.xml.j2:

<?xml version="1.0"?>
<clickhouse>
    <listen_host>0.0.0.0</listen_host>
    <http_port>8123</http_port>
    <tcp_port>9000</tcp_port>
    <interserver_http_port>9009</interserver_http_port>
 
    <remote_servers>
        <{{ cluster_name }}>
            <shard>
                <internal_replication>true</internal_replication>
{% for host in groups['clickhouse'] if hostvars[host]['shard'] == 1 %}
                <replica>
                    <host>{{ hostvars[host]['private_ip'] }}</host>
                    <port>9000</port>
                </replica>
{% endfor %}
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
{% for host in groups['clickhouse'] if hostvars[host]['shard'] == 2 %}
                <replica>
                    <host>{{ hostvars[host]['private_ip'] }}</host>
                    <port>9000</port>
                </replica>
{% endfor %}
            </shard>
        </{{ cluster_name }}>
    </remote_servers>
 
    <prometheus>
        <endpoint>/metrics</endpoint>
        <port>9364</port>
        <metrics>true</metrics>
    </prometheus>
 
    <distributed_ddl>
        <path>/clickhouse/task_queue/ddl</path>
    </distributed_ddl>
</clickhouse>

Cluster topology:

  • 2 shards: Data is distributed across shards
  • Replicas: Each shard can have multiple replicas for redundancy
  • With 3 nodes: Shard 1 has 2 replicas, Shard 2 has 1 replica
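
Once the cluster is up, you can confirm this topology was picked up by querying system.clusters from any node (the default user connects locally without a password):

clickhouse-client --query "
  SELECT cluster, shard_num, replica_num, host_address
  FROM system.clusters
  WHERE cluster = 'production_cluster'"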

Create ansible/roles/clickhouse/templates/macros.xml.j2:

<?xml version="1.0"?>
<clickhouse>
    <macros>
        <cluster>{{ cluster_name }}</cluster>
        <shard>{{ shard }}</shard>
        <replica>{{ inventory_hostname }}</replica>
    </macros>
</clickhouse>

Macros explained:

Macros are variables used in ReplicatedMergeTree table definitions:

CREATE TABLE events ON CLUSTER 'production_cluster'
(
    event_id UUID,
    timestamp DateTime
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY timestamp;

The {shard} and {replica} are automatically replaced with values from macros.
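
You can inspect what each node will substitute by querying system.macros on that node. On node 3, for example, the output should look roughly like this:

clickhouse-client --query "SELECT * FROM system.macros"
# macro    substitution
# cluster  production_cluster
# shard    1
# replica  clickhouse-ch-3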

Step 7: Run Ansible

Set your passwords and run:

cd ansible
 
# Create .env file
cat > .env << 'EOF'
export CLICKHOUSE_ADMIN_PASSWORD="YourSecureAdminPass123"
export CLICKHOUSE_READONLY_PASSWORD="YourSecureReadOnlyPass123"
export GRAFANA_ADMIN_PASSWORD="YourGrafanaPass123"
EOF
 
# Source it
source .env
 
# Run playbook
ansible-playbook site.yml

This takes about 5-10 minutes. You'll see each task execute across all hosts.
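
When the playbook finishes, a quick check from your workstation confirms the cluster is healthy (substitute your load balancer IP; the admin password is the one you set in .env above):

# Should return 3 — one row per replica registered in the cluster
curl "http://<LOAD_BALANCER_IP>:8123/" \
  -u admin:YourSecureAdminPass123 \
  -d "SELECT count() FROM system.clusters WHERE cluster = 'production_cluster'"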


Part 3: SSL with Cloudflare

Cloudflare provides free SSL, but it only proxies ports 80/443. We need Nginx to:

  1. Listen on port 80/443
  2. Terminate SSL (with self-signed cert—Cloudflare handles public SSL)
  3. Proxy requests to internal services

Nginx Role

Create ansible/roles/nginx/tasks/main.yml:

---
- name: Install Nginx
  apt:
    name: nginx
    state: present
    update_cache: true
 
- name: Create SSL directory
  file:
    path: /etc/nginx/ssl
    state: directory
    mode: '0700'
 
- name: Generate self-signed certificate
  command: >
    openssl req -x509 -nodes -days 365 -newkey rsa:2048
    -keyout /etc/nginx/ssl/private.key
    -out /etc/nginx/ssl/certificate.crt
    -subj "/C=US/ST=State/L=City/O=Org/CN={{ domain_name }}"
  args:
    creates: /etc/nginx/ssl/certificate.crt
 
- name: Configure Grafana proxy
  template:
    src: grafana.conf.j2
    dest: /etc/nginx/sites-available/grafana.conf
  notify: Reload nginx
 
- name: Configure ClickHouse proxy
  template:
    src: clickhouse.conf.j2
    dest: /etc/nginx/sites-available/clickhouse.conf
  notify: Reload nginx
 
- name: Enable sites
  file:
    src: "/etc/nginx/sites-available/{{ item }}"
    dest: "/etc/nginx/sites-enabled/{{ item }}"
    state: link
  loop:
    - grafana.conf
    - clickhouse.conf
  notify: Reload nginx
 
- name: Remove default site
  file:
    path: /etc/nginx/sites-enabled/default
    state: absent
 
- name: Start Nginx
  systemd:
    name: nginx
    state: started
    enabled: true

Create ansible/roles/nginx/templates/clickhouse.conf.j2:

server {
    listen 80;
    server_name clickhouse.{{ domain_name }};
    return 301 https://$host$request_uri;
}
 
server {
    listen 443 ssl;
    server_name clickhouse.{{ domain_name }};
 
    ssl_certificate /etc/nginx/ssl/certificate.crt;
    ssl_certificate_key /etc/nginx/ssl/private.key;
    ssl_protocols TLSv1.2 TLSv1.3;
 
    location / {
        proxy_pass http://{{ hostvars[groups['clickhouse'][0]]['private_ip'] }}:8123;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-Proto https;
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}

Why self-signed certificates?

Cloudflare's "Full" SSL mode:

User → Cloudflare (valid public SSL) → Your Server (self-signed OK)

Cloudflare encrypts traffic to your origin, but doesn't verify the certificate. This is secure enough for most use cases and avoids the complexity of Let's Encrypt.
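
Before relying on the Cloudflare-proxied hostname, you can test the Nginx vhost directly against the monitoring node, skipping certificate verification the same way "Full" mode does (substitute your monitoring node's public IP and your domain):

curl -k --resolve clickhouse.yourdomain.com:443:<monitoring-ip> \
  "https://clickhouse.yourdomain.com/ping"
# Ok.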


Part 4: CI/CD with GitHub Actions

Automate deployments with GitHub Actions:

Create .github/workflows/deploy.yml:

name: Deploy ClickHouse Cluster
 
on:
  push:
    branches: [main]
    paths:
      - 'terraform/**'
      - 'ansible/**'
      - '.github/workflows/**'
  workflow_dispatch:
    inputs:
      action:
        description: 'Action'
        required: true
        default: 'plan'
        type: choice
        options: [plan, apply, destroy]
 
env:
  TF_VERSION: '1.6.0'
  HCLOUD_TOKEN: ${{ secrets.HCLOUD_TOKEN }}
  CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
 
jobs:
  terraform:
    name: Terraform
    runs-on: ubuntu-latest
    environment: production
 
    steps:
      - uses: actions/checkout@v4
 
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
          cli_config_credentials_token: ${{ secrets.TF_API_TOKEN }}
          terraform_wrapper: false
 
      - name: Terraform Init
        working-directory: terraform
        run: terraform init
 
      - name: Terraform Plan
        working-directory: terraform
        run: |
          terraform plan \
            -var="hcloud_token=$HCLOUD_TOKEN" \
            -var="cloudflare_api_token=$CLOUDFLARE_API_TOKEN"
 
      - name: Terraform Apply
        if: github.event_name == 'push' || github.event.inputs.action == 'apply'
        working-directory: terraform
        run: |
          terraform apply -auto-approve \
            -var="hcloud_token=$HCLOUD_TOKEN" \
            -var="cloudflare_api_token=$CLOUDFLARE_API_TOKEN"
 
      - name: Generate Inventory
        if: github.event_name == 'push' || github.event.inputs.action == 'apply'
        working-directory: terraform
        run: |
          mkdir -p ../ansible/inventory
          terraform output -raw ansible_inventory > ../ansible/inventory/hosts.yml
 
      - name: Upload Inventory
        if: github.event_name == 'push' || github.event.inputs.action == 'apply'
        uses: actions/upload-artifact@v4
        with:
          name: inventory
          path: ansible/inventory/hosts.yml
 
  ansible:
    name: Ansible
    runs-on: ubuntu-latest
    needs: terraform
    if: github.event_name == 'push' || github.event.inputs.action == 'apply'
    environment: production
 
    steps:
      - uses: actions/checkout@v4
 
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
 
      - name: Install Ansible
        run: pip install ansible
 
      - name: Download Inventory
        uses: actions/download-artifact@v4
        with:
          name: inventory
          path: ansible/inventory
 
      - name: Setup SSH
        run: |
          mkdir -p ~/.ssh
          echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/cluster_key
          chmod 600 ~/.ssh/cluster_key
 
      - name: Wait for servers
        run: sleep 120
 
      - name: Run Ansible
        working-directory: ansible
        env:
          ANSIBLE_HOST_KEY_CHECKING: 'false'
          CLICKHOUSE_ADMIN_PASSWORD: ${{ secrets.CLICKHOUSE_ADMIN_PASSWORD }}
          CLICKHOUSE_READONLY_PASSWORD: ${{ secrets.CLICKHOUSE_READONLY_PASSWORD }}
          GRAFANA_ADMIN_PASSWORD: ${{ secrets.GRAFANA_ADMIN_PASSWORD }}
        run: |
          ansible-playbook \
            -i inventory/hosts.yml \
            --private-key ~/.ssh/cluster_key \
            site.yml

Required GitHub Secrets

Secret                       | Description
HCLOUD_TOKEN                 | Hetzner API token
CLOUDFLARE_API_TOKEN         | Cloudflare API token
TF_API_TOKEN                 | Terraform Cloud token
SSH_PRIVATE_KEY              | Contents of secrets/clickhouse-key
CLICKHOUSE_ADMIN_PASSWORD    | Admin password
CLICKHOUSE_READONLY_PASSWORD | Readonly password
GRAFANA_ADMIN_PASSWORD       | Grafana password
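
If you use the GitHub CLI, these secrets can be set from the terminal instead of the web UI. For example (names as in the table above, values from your own environment):

gh secret set HCLOUD_TOKEN --body "your-hetzner-token"
gh secret set SSH_PRIVATE_KEY < secrets/clickhouse-key
gh secret set CLICKHOUSE_ADMIN_PASSWORD --body "YourSecureAdminPass123"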

Part 5: Monitoring with Grafana

The monitoring stack includes:

  • Prometheus: Scrapes metrics from all nodes
  • Grafana: Visualizes metrics
  • Node Exporter: System metrics (CPU, RAM, disk)

Prometheus Configuration

Create ansible/roles/grafana/templates/prometheus.yml.j2:

global:
  scrape_interval: 15s
 
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
 
  - job_name: 'node'
    static_configs:
      - targets:
{% for host in groups['all'] %}
          - '{{ hostvars[host]["private_ip"] }}:9100'
{% endfor %}
 
  - job_name: 'clickhouse'
    static_configs:
      - targets:
{% for host in groups['clickhouse'] %}
          - '{{ hostvars[host]["private_ip"] }}:9364'
{% endfor %}
 
  - job_name: 'clickhouse-keeper'
    static_configs:
      - targets:
{% for host in groups['keeper'] %}
          - '{{ hostvars[host]["private_ip"] }}:9363'
{% endfor %}

Key Metrics to Monitor

Metric                              | Description     | Alert Threshold
ClickHouseMetrics_Query             | Active queries  | > 100
ClickHouseMetrics_MemoryTracking    | Memory usage    | > 80%
ClickHouseProfileEvents_FailedQuery | Failed queries  | > 0
node_cpu_seconds_total              | CPU usage       | > 80%
node_memory_MemAvailable_bytes      | Available RAM   | < 1GB
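
These metrics are available through Prometheus's HTTP API, so you can sanity-check that scraping works before building dashboards. From the monitoring node (or any host on the private network), assuming Prometheus is listening on its default port 9090:

# Current number of active queries, one result per ClickHouse node
curl -s 'http://10.0.1.10:9090/api/v1/query?query=ClickHouseMetrics_Query' | jq .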

Security Hardening

1. Restrict SSH Access

# terraform.tfvars
ssh_allowed_ips = ["YOUR.IP.ADDRESS/32"]

2. Use Strong Passwords

# Generate secure passwords
openssl rand -base64 24

3. Network Segmentation

All inter-node traffic uses the private network (10.0.1.0/24). Only necessary ports are exposed publicly.

4. Firewall Rules

Only these ports are accessible from the internet:

  • 22 (SSH) - restricted to your IP
  • 80/443 (HTTP/HTTPS) - for Cloudflare proxy
  • 8123/9000 (ClickHouse) - consider restricting to known IPs

5. User Permissions

The readonly user cannot modify data:

-- This will fail for readonly user
INSERT INTO events VALUES (...);
-- Error: readonly: Cannot execute query in readonly mode

Connecting to Your Cluster

HTTP API (via Cloudflare)

# Simple query
curl "https://clickhouse.yourdomain.com/" \
  -u admin:YourAdminPassword \
  -d "SELECT version()"
 
# JSON format
curl "https://clickhouse.yourdomain.com/?default_format=JSON" \
  -u admin:YourAdminPassword \
  -d "SELECT * FROM system.clusters"

Native Protocol (direct)

clickhouse-client \
  --host <LOAD_BALANCER_IP> \
  --user admin \
  --password 'YourAdminPassword'

Creating Tables

-- Create a replicated table
CREATE TABLE events ON CLUSTER 'production_cluster'
(
    event_id UUID DEFAULT generateUUIDv4(),
    timestamp DateTime DEFAULT now(),
    user_id UInt64,
    event_type String,
    properties String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id);
 
-- Create distributed table for querying across shards
CREATE TABLE events_distributed ON CLUSTER 'production_cluster'
AS events
ENGINE = Distributed('production_cluster', 'default', 'events', rand());
 
-- Insert data (goes to one shard)
INSERT INTO events (user_id, event_type, properties)
VALUES (1, 'page_view', '{"page": "/home"}');
 
-- Query all shards
SELECT count() FROM events_distributed;
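
To verify that replication is actually working after creating the table, check system.replicas; a healthy setup reports all replicas as active (run against the load balancer or any node with the admin user):

clickhouse-client --host <LOAD_BALANCER_IP> --user admin --password 'YourAdminPassword' --query "
  SELECT database, table, is_leader, total_replicas, active_replicas
  FROM system.replicas
  WHERE table = 'events'"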

Troubleshooting Guide

Issue: Ansible hangs on apt install

Cause: Ubuntu's unattended-upgrades is locking apt.

Solution:

ansible -i inventory/hosts.yml all -m shell -a "pkill -9 unattended-upgr; systemctl disable unattended-upgrades"

Issue: ClickHouse won't start

Cause: Usually port conflict or config error.

Solution:

# Check logs
ssh root@<node-ip> "cat /var/log/clickhouse-server/clickhouse-server.err.log | tail -50"
 
# Common fix: port 9363 conflict between Keeper and Server
# Server should use 9364, Keeper uses 9363

Issue: Cloudflare 521 error

Cause: Origin server not responding on port 80/443.

Solution:

# Check Nginx
ansible -i inventory/hosts.yml monitoring -m shell -a "systemctl status nginx"
 
# Check port 80
ansible -i inventory/hosts.yml monitoring -m shell -a "ss -tlnp | grep :80"

Issue: Authentication failed

Cause: Passwords don't match or env vars weren't set.

Solution:

source .env
ansible-playbook site.yml --limit clickhouse
ansible -i inventory/hosts.yml clickhouse -m shell -a "systemctl restart clickhouse-server &"

Cost Analysis

Monthly Breakdown

Resource             | Type                | Cost
3x ClickHouse Nodes  | CX33 (4 vCPU, 8GB)  | €26.97
1x Monitoring        | CX23 (2 vCPU, 4GB)  | €4.49
Load Balancer        | LB11                | €5.39
Network              | Private             | Free
Traffic              | 20TB included       | Free
Total                |                     | €36.85

Scaling Costs

Configuration      | Nodes    | Monthly Cost
Small (this guide) | 3x CX33  | €37
Medium             | 3x CX43  | €59
Large              | 5x CX43  | €95
Enterprise         | 5x CCX33 | €270

Conclusion

You now have a production-ready ClickHouse cluster with:

  • High availability: 3-node Keeper quorum
  • Data replication: Automatic via ReplicatedMergeTree
  • Load balancing: Health-checked traffic distribution
  • SSL/TLS: Free via Cloudflare
  • Monitoring: Prometheus + Grafana dashboards
  • CI/CD: Automated deployments via GitHub Actions
  • Security: Firewall, private network, user roles
  • Cost: ~€37/month

Next Steps

  1. Add alerting: Configure Grafana alerts for critical metrics
  2. Backup strategy: Set up S3-compatible backups with Hetzner's Object Storage (see the sketch after this list)
  3. Scale horizontally: Add more nodes as data grows
  4. Tune performance: Adjust memory settings based on workload
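
For the backup item, ClickHouse's built-in BACKUP command can target any S3-compatible endpoint, which includes Hetzner Object Storage. A rough sketch; the endpoint, bucket, and credentials below are placeholders you would replace with your own:

clickhouse-client --user admin --password 'YourAdminPassword' --query "
  BACKUP TABLE default.events
  TO S3('https://<s3-endpoint>/<bucket>/backups/events', '<access-key>', '<secret-key>')"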


