Intro
Mission: deploy a new production-ready Ceph cluster on 4 new hardware nodes.
Overview
I had zero experience with the current cephadm orchestrator for Ceph. But time flies and ceph-ansible is being deprecated, so let the old dog learn new tricks.
Requirements
Nothing new here:
- JBOD for OSDs
- RAID1 for 2 OS disks (mdraid is fine)
- minimum 3 nodes (and maybe around 15 maximum for the sake of possible rebalance)
- at least 2 network cards with 2×10 Gbps ports (better 25 Gbps) on each node
- configured LACP on the network switches
- at least 2 vCPUs and 4 GB of RAM per OSD on each OSD node
- a proxy Docker registry to quay.io accessible from all nodes
- 2 separate networks – ceph-internal / ceph-cluster (the cluster network with jumbo frames and MTU 9000)
- accessible corporate NTP/DNS servers, IPMI access if something goes wrong
Preparation
Step 1.
- provision all hosts, install Ubuntu 22.04
- configure hostnames and local /etc/hosts records
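A minimal sketch of such records, reusing the example hostnames and IPs that appear later in this post:
cat <<EOF | sudo tee -a /etc/hosts
10.10.10.50 ceph-01.mycompany.cloud ceph-01
10.10.10.51 ceph-02.mycompany.cloud ceph-02
10.10.10.52 ceph-03.mycompany.cloud ceph-03
EOF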
- configure chronyd and systemd-resolved. Here we configure resolv.conf to follow changes from netplan:
sudo rm -f /etc/resolv.conf
sudo ln -s /run/resolvconf/resolv.conf /etc/resolv.conf
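For chrony, a minimal sketch is to append the corporate NTP servers and restart the service (the server names below are placeholders):
echo "server ntp1.mycompany.cloud iburst" | sudo tee -a /etc/chrony/chrony.conf
echo "server ntp2.mycompany.cloud iburst" | sudo tee -a /etc/chrony/chrony.conf
sudo systemctl restart chrony
chronyc sources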
- configure netplan. Pay attention to the DHCP DNS overrides, MTU and other configuration options:
network:
  bonds:
    bond0:
      interfaces:
        - enp180s0f0np0
        - enp180s0f1np1
      parameters:
        lacp-rate: fast
        mode: 802.3ad
        transmit-hash-policy: layer2+3
    bond1:
      interfaces:
        - enp179s0f0np0
        - enp179s0f1np1
      parameters:
        lacp-rate: fast
        mode: 802.3ad
        transmit-hash-policy: layer2+3
  ethernets:
    enp179s0f0np0:
      dhcp4: true
      mtu: 9000
    enp179s0f1np1:
      dhcp4: true
      mtu: 9000
    enp180s0f0np0:
      dhcp4: true
      dhcp4-overrides:
        use-dns: false
    enp180s0f1np1:
      dhcp4: true
      dhcp4-overrides:
        use-dns: false
  version: 2
  vlans:
    bond0.10:
      id: 10
      link: bond0
      addresses:
        - 10.10.10.10/24
      dhcp4-overrides:
        use-dns: false
      nameservers:
        addresses:
          - 10.10.10.200
          - 10.10.10.220
        search:
          - "mycompany.cloud"
      routes:
        - to: default
          via: 10.10.10.1
    bond1.20:
      id: 20
      link: bond1
      addresses:
        - 10.10.20.11/24
      mtu: 9000
- apply netplan configuration, verify that everything is fine
netplan apply
ip -4 a; ip l;
timedatectl status
dig ceph-02.mycompany.cloud
OR
nslookup ceph-02
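Since the cluster network relies on jumbo frames, it is also worth verifying that MTU 9000 really passes between nodes. A quick check (the target is another node's cluster-network address, as an example):
# 8972 bytes of payload = 9000 MTU minus 20 (IP header) and 8 (ICMP header); -M do forbids fragmentation
ping -M do -s 8972 -c 3 10.10.20.12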
- generate an ssh key (id_rsa) on the first host and append its public part to authorized_keys on all the others
ssh ceph-01
ssh-keygen
echo "ssh-rsa ....." >> ~/.ssh/authorized_keys
ssh ceph-02 'echo "ssh-rsa ....." >> ~/.ssh/authorized_keys'
ssh ceph-03 'echo "ssh-rsa ....." >> ~/.ssh/authorized_keys'
- configure correct apt repositories on all nodes
vi /etc/apt/sources.list
apt update
- configure the correct Docker apt repository (add the Docker GPG key even if you use a proxy repo)
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
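If you point apt at the upstream Docker repository (or a proxy of it) rather than relying only on Ubuntu's own packages, the repository entry could look like this sketch:
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu jammy stable" | sudo tee /etc/apt/sources.list.d/docker.list
sudo apt update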
- install docker.io and other dependencies
apt install -y docker.io lvm2 python3
Bootstrapping the cluster
We are going to bootstrap a new cluster starting from the first node. This node will be our “admin” node. So on the first Ceph node do:
apt install -y cephadm ceph-common
cephadm bootstrap --mon-ip 10.10.10.50 --log-to-file --registry-url 10.10.10.70:5002 --registry-username docker --registry-password password --cluster-network 10.10.20.0/24
Wait for some time and check the status of the cluster:
ceph -s
ceph orch host ls
Adding new hosts
First of all, let’s make new mons unmanaged. I prefer to know on which nodes my mons are located.
ssh ceph-01
ceph orch apply mon --unmanaged
Next, let’s add the other hosts.
First, copy the ssh key of our admin node to them:
ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph-01
ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph-02
ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph-03
Then add the new hosts via the orchestrator. I recommend waiting a little before proceeding to the next host if you use a single proxy registry. Note: it’s crucial to use the correct, existing hostnames of the new nodes here – they could be in uppercase, contain special characters, etc.
ceph orch host add ceph-02 10.10.10.51
ceph orch host add ceph-03 10.10.10.52
And now add the monitors:
ceph orch daemon add mon ceph-02:10.10.10.51
ceph orch daemon add mon ceph-03:10.10.10.52
Check the status:
ceph -s
ceph orch host ls
Adding OSDs
It’s recommended to take the easy path – just add all available devices as is. It works, sure, especially with a more or less homogeneous setup.
All your future OSDs should be listed here as available:
ceph orch device ls
Apply with --dry-run first:
ceph orch apply osd --all-available-devices --dry-run
ceph orch apply osd --all-available-devices
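Once the apply has gone through, a quick check that the OSDs actually came up:
ceph osd tree
ceph -s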
Alternatively, you can add all devices manually, with some advanced configuration:
ceph orch daemon add osd ceph-01:data_devices=/dev/sda,/dev/sdb,db_devices=/dev/sdc,osds_per_device=2
Or even use a YAML configuration for that purpose: docs.ceph.com/drivegroups
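A minimal sketch of such a drive-group spec, assuming rotational data disks with flash DB devices (the service_id, host pattern and file name are illustrative):
cat > osd_spec.yml <<EOF
service_type: osd
service_id: default_drives
placement:
  host_pattern: 'ceph-*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
EOF
ceph orch apply -i osd_spec.yml --dry-run
ceph orch apply -i osd_spec.yml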
Disable auto-provisioning of OSDs
The same thing as with the monitors – the Ceph orchestrator keeps creating new OSDs on every occasion: you wipe a disk – it creates a new OSD; you add a new drive to a host – it creates a new OSD.
I don’t know why, but I believe that quite a few system administrators are NOT comfortable with this behavior. So let’s disable it:
ceph orch apply osd --all-available-devices --unmanaged=true
After that, if you want to set up new OSDs, you will need to do:
ceph orch daemon add osd <host>:<path-to-device>
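For example (the hostname and device path are illustrative):
ceph orch daemon add osd ceph-02:/dev/sdd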
Removing an OSD
To remove an OSD, issue these commands:
ceph orch osd rm <osd_id(s)> [--replace] --force --zap
ceph orch osd rm status
Also, you can manually zap the device if you forgot to provide the --zap flag:
- determine the LVs/VGs of the drive to zap
cephadm shell --fsid <fsid> -- ceph-volume lvm list
- zap device via orch
ceph orch device zap my_hostname /dev/sdx --force
- OR via ceph-volume
cephadm shell --fsid <fsid> -- ceph-volume lvm zap \
ceph-vgid/osd-block-lvid --destroy
It’s possible that you’ll also need to delete the OSD manually:
- check if the osd is still there
ceph node ls
- remove osd
ceph osd rm osd.ID
- if that’s not sufficient – you can try to delete it from the crush map manually
ceph osd crush rm osd.31
What else
Stray daemons
Sometimes ceph orch gets stuck – not sure why; some stray daemons that it couldn’t find, etc. The only solution that I’ve found is:
ceph orch restart mgr
If you cannot perform it because you have only 1 mgr, deploy another one (even temporarily) through ceph orch daemon add mgr HOST:IP. After restarting the mgrs, all duplicates should be gone.
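As a sketch, using the example hostnames and IPs from the cluster above:
# deploy a temporary second mgr, then bounce all mgrs
ceph orch daemon add mgr ceph-02:10.10.10.51
ceph orch restart mgr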
Auto-memory tuning
By default, cephadm sets osd_memory_target_autotune=true, which is highly unsuitable for heterogeneous or hyperconverged infrastructures. You can check current memory consumption and limits with:
ceph orch ps
You can either place a label on the node to prevent memory autotuning, or set the config options per OSD:
ceph orch host label add HOSTNAME _no_autotune_memory
OR
ceph config set osd.123 osd_memory_target_autotune false
ceph config set osd.123 osd_memory_target 16G
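To verify what actually ended up in the config (the OSD id is an example):
ceph config get osd.123 osd_memory_target_autotune
ceph config get osd.123 osd_memory_target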
Getting logs
Get logs from daemons:
cephadm logs --name osd.34
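Under the hood this is just journalctl against the daemon's systemd unit, so you can also follow the logs live (replace <fsid> with your cluster fsid):
journalctl -f -u ceph-<fsid>@osd.34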
Removing crash messages
ceph crash ls
ceph crash archive-all
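If you want to inspect a particular crash before archiving, or archive just one (the crash id comes from the ls output):
ceph crash info <crash_id>
ceph crash archive <crash_id>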