When building a Ceph cluster, it was important for us to plan ahead. Not only does one usually start out with a minimum of ~5 servers, but one should also expect the cluster to grow. Running the cluster also means patching the operating system and Ceph itself, and with Ceph being a crucial infrastructure component, it’s also desirable to have a proper rollback procedure.

Using CI to maintain the image

We’ve grown really fond of ramdisk nodes. Using a Jenkins instance, we run a build service that boots a VM with an updated version of CentOS 7, connects to Puppet, and provisions all the basic components that we want the server to have. This image is converted into a root filesystem and uploaded to our redundant service hosts for availability.

Covering this build process could be an entire Sysadvent calendar on its own, but this is the gist of it.

Booting it up

There’s a lovely piece of software that’s incredibly powerful for booting things on the network, and it’s called iPXE. We chainload iPXE in the PXE/UEFI boot process, and it lets us run a small script before we decide what image to boot. In this script, we pass information to the kernel command line: which Puppet environment the server belongs to, how to configure bonding, and other pieces of information that are unique to that server.

We then boot up the kernel/rootfs of the image, parse /proc/cmdline, configure networking (change a dynamic lease on a single interface to a bonded interface with a static IP address), download Puppet certificates, and then run Puppet.
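
To illustrate the parsing step, here is a minimal sketch of how a boot script might pull a single key=value parameter out of the kernel command line. The function name cmdline_get and the parameter names in the usage note (puppet_env, bond_slaves) are hypothetical; the actual keys passed from the iPXE script are not shown in this post.

```shell
#!/bin/bash
# Sketch only: cmdline_get is a hypothetical helper, not part of the image.
#
# cmdline_get KEY [FILE]
# Prints the value of the first KEY=VALUE word found in FILE
# (default: /proc/cmdline). Returns non-zero if KEY is not present.
cmdline_get() {
  local key="$1" file="${2:-/proc/cmdline}"
  awk -v key="$key" '
    {
      # Walk the whitespace-separated words of the command line
      for (i = 1; i <= NF; i++) {
        if (index($i, key "=") == 1) {
          # Print everything after "KEY="
          print substr($i, length(key) + 2)
          found = 1
          exit
        }
      }
    }
    END { exit !found }
  ' "$file"
}
```

With a command line containing puppet_env=production, calling cmdline_get puppet_env would print production, and a missing key gives a non-zero exit status that the boot script can act on.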

Puppet puts the last pieces in place

The image will have all the basic software in place, but it may or may not have the keys it needs to be allowed to talk to the cluster monitors. Puppet will ensure that ceph.conf and the keys are placed where they should be in a secure manner. Puppet will also take care of day-to-day configuration changes, such as adding new monitoring probes and tools as we build upon and improve them.

We continuously make changes on the servers based on our best practices through Puppet, and we try not to update the image unless it’s an issue of security or availability.

What about OSD state?

And this is the point I’m trying to drive home with this post: I guess you’re wondering how we know which drive is which OSD in our cluster. There’s a really simple trick that we figured out:

  • For each filesystem (we use xfs in production), query the monitors for the OSD ID by using the UUID of said filesystem
  • If the UUID is unknown, a new OSD ID is created - but for existing OSDs, the correct ID is returned.

From that point on, we have all the information we need to mount the OSDs and start the systemd unit for each of them.

Something along these lines will get us up and running:

#!/bin/bash
# Mount every XFS filesystem as an OSD data directory and start its daemon.
# Filesystems the monitors have not seen before are initialized first.

for dev in /dev/disk/by-uuid/*; do
  FS=$(blkid -o value -s TYPE "$dev")
  if [ "$FS" = "xfs" ]; then
    # Get OSD id from monitors, create if missing
    UUID=$(basename "$dev")
    if ! OSDID=$(ceph osd create "$UUID"); then
      echo "Could not get an OSD id for $UUID, skipping." >&2
      continue
    fi
    OSDDIR="/var/lib/ceph/osd/ceph-$OSDID"
    mkdir -p "$OSDDIR"

    if mountpoint -q "$OSDDIR"; then
      echo "Already mounted, skipping. (OSD: $OSDID, UUID: $UUID)"
    else
      if grep -q "$UUID" /etc/fstab; then
        echo "Mountpoint already in fstab, not adding"
      else
        echo "UUID=$UUID $OSDDIR xfs defaults 0 0" >> /etc/fstab
      fi
      mount "$dev" "$OSDDIR" || continue
    fi

    # A data directory without a keyring has never been initialized
    if [ ! -e "$OSDDIR/keyring" ]; then
      ceph-osd -i "$OSDID" --mkfs --mkkey --osd-uuid "$UUID"
      ceph auth add "osd.$OSDID" osd 'allow *' mon 'allow profile osd' -i "$OSDDIR/keyring"
      ceph osd crush add "osd.$OSDID" 1.0 host="$(facter hostname)"
    else
      echo "Already initialized, skipping. (OSD: $OSDID, UUID: $UUID)"
    fi

    # Ask systemd whether the unit for this OSD is already running
    if systemctl is-active --quiet "ceph-osd@${OSDID}.service"; then
      echo "Already running, not starting."
    else
      systemctl enable "ceph-osd@${OSDID}.service"
      systemctl start "ceph-osd@${OSDID}.service"
    fi
  fi
done

This code has the unintended side effect that it can also be used to easily add new OSDs to the cluster: just insert a new disk, run mkfs.xfs on it, run start_osds (the script above), and it will be added to the cluster. As always, be careful about the performance impact of the backfill operations that adding new OSDs may trigger.

For journals, we use raw partitions. To ensure that journals work across boots, even if the hardware comes up in a different order than before (turning sda into sdb, or the other way around), we always use /dev/disk/by-wwn/ to look up the correct partition.
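
As an illustration of why the by-wwn names help, here is a small sketch that resolves a stable name to whatever kernel device it points to on the current boot. The function name journal_device and the WWN in the usage note are made up for the example; the only thing taken from the setup described here is the /dev/disk/by-wwn/ directory.

```shell
#!/bin/bash
# Sketch only: journal_device is a hypothetical helper.
#
# journal_device NAME [BASE_DIR]
# Resolves a stable name under BASE_DIR (default: /dev/disk/by-wwn) to the
# device node it currently points to, e.g. /dev/sdb1 on this boot.
journal_device() {
  local name="$1" base="${2:-/dev/disk/by-wwn}"
  local link="$base/$name"
  if [ ! -e "$link" ]; then
    echo "journal partition $name not found in $base" >&2
    return 1
  fi
  # udev maintains these symlinks; follow them to the real device node
  readlink -f "$link"
}
```

Calling journal_device wwn-0x5000c500a1b2c3d4-part1 (a hypothetical WWN) might print /dev/sdb1 on one boot and /dev/sda1 on the next, while the name used in the configuration stays the same.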

What did we really achieve?

This simplifies scaling up a lot.

  • When hardware has arrived, and the physical part of the job is taken care of - initializing a new node is done by booting it, formatting the drives, and starting the OSDs.
  • Adding new drives is solved by formatting the drives and starting the OSDs.

It also ensures that as long as all nodes have booted the same version of the image, we can expect them to contain the same software versions and identical configuration.

Upgrading the image is just a matter of running ceph osd set noout and performing a rolling reboot of the cluster (followed by ceph osd unset noout once every node is back). Don’t like what we just patched in? Perform a rollback! We keep the old image, and we know exactly what to expect from it!
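
A rolling reboot along those lines could be sketched roughly as follows. This is an outline under stated assumptions, not the procedure we run in production: reboot_node and wait_until_recovered are hypothetical helpers, the SSH access is an assumption, and the recovery check is deliberately left as a manual prompt because the right condition depends on the cluster.

```shell
#!/bin/bash
# Sketch only: reboot_node and wait_until_recovered are hypothetical helpers.

reboot_node() {
  # Assumption: nodes are reachable over SSH as root.
  # The exit status is ignored since the connection drops during reboot.
  ssh "root@$1" systemctl reboot
}

wait_until_recovered() {
  # Placeholder: replace with an automated check that suits your cluster.
  echo "Inspect the cluster (for example with 'ceph -s')."
  echo "Press Enter once the OSDs on $1 have rejoined."
  read -r _
}

# rolling_reboot HOST...
# Reboots the given hosts one at a time with the noout flag set, so the
# cluster does not start rebalancing while a node is briefly away.
rolling_reboot() {
  local host
  ceph osd set noout || return 1
  for host in "$@"; do
    echo "Rebooting $host"
    reboot_node "$host"
    if ! wait_until_recovered "$host"; then
      # Leave noout set so the operator can decide what happens next
      echo "$host did not recover; noout is still set" >&2
      return 1
    fi
  done
  ceph osd unset noout
}
```

Invoked as rolling_reboot node1 node2 node3 (hypothetical host names), this takes the nodes down one by one and only clears the flag after the last one has been confirmed healthy.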