As mentioned in the previous ansible post, we use ansible quite a lot for day-to-day operations. While we prefer Puppet for configuration management, ansible is excellent for automating maintenance procedures.

One such procedure is gracefully applying package upgrades to application servers, including any required reboot. In this post we’ll take a look at upgrading a cluster of web application servers defined in the ansible hostgroup “webservers”. They’re located behind a redundant pair of haproxy load balancers in the hostgroup “loadbalancers”. The webservers in this example run Ubuntu 16.04.

The process

In short, we want to:

  1. Verify that the cluster is working as it should (we don’t want to bring down anything for maintenance if other parts of the cluster are already broken).
  2. For each webserver, one at a time:
    • Bring it out of the loadbalanced cluster
    • Upgrade packages and reboot if needed
    • Add it back into the loadbalanced cluster

Prerequisites

This playbook needs something like the unix “cut” program for massaging list output in a jinja2 template. For that we create a new filter plugin and tell ansible where to find it. Create a directory, and point ansible at it in your ansible.cfg:

  # in ansible.cfg:
  filter_plugins = /some/path/filter_plugins/

Now put the following filter plugin into the file “splitpart.py” in the above directory:

# splitpart.py: a jinja2 filter that works like the unix "cut" program
# (fields are zero-indexed)
def splitpart(value, index, char=','):
    if isinstance(value, (list, tuple)):
        # apply the split to every element of the list
        return [v.split(char)[index] for v in value]
    # plain string: split it and return the requested field
    return value.split(char)[index]

class FilterModule(object):
    def filters(self):
        return {'splitpart': splitpart}
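
A quick illustration of how the filter behaves in a jinja2 template (the values are just examples; the trailing comments show the results):

  {{ 'web01.example.com' | splitpart(0, '.') }}   # -> web01
  {{ ['a,b', 'c,d'] | splitpart(1) }}             # -> ['b', 'd']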

The playbook

Let’s do a breakdown of the full playbook.

Before we make any modifications we want to verify that everything is already working as it should. The monitoring system keeps an eye on this as well, but sysadmins are paranoid. First the tasklist header…


- name: ensure services are up before doing anything
  hosts: webservers
  any_errors_fatal: true # stop if anything is wrong
  serial: 1              # one server at a time
  become: false          # …no need for root

  tasks:

Let’s say we have two virtualhosts that need probing (site1.example.com and site2.example.com).


    - name: verify that site1.example.com is up
      uri:
        url: http://localhost/
        status_code: 200
        follow_redirects: none
        headers:
          Host: site1.example.com

    - name: verify that site2.example.com is up
      uri:
        url: http://localhost/
        status_code: 200
        follow_redirects: none
        headers:
          Host: site2.example.com

This uses the ansible uri module to fetch the front page of the two local sites on all the webservers (the Host header selects the virtualhost), and verify that they yield a 200 response code.

The default of the “status_code” attribute is 200, but I included it for easy tuning.

Next, we’ll also make sure that all the webservers are in the loadbalancing cluster. This will enable any webservers that were out of the cluster.


    - name: make sure all nodes are enabled and up in haproxy
      delegate_to: "{{ item }}"
      become: true # this part needs root
      haproxy:
        state: enabled
        host: "{{ inventory_hostname | splitpart(0, '.') }}"
        socket: /var/run/haproxy.sock
        wait: yes
      with_items: "{{ groups.loadbalancers }}"

Using with_items makes this whole task a loop that is executed once for each host in the “loadbalancers” hostgroup. For each iteration the variable “item” is set to the current loadbalancer, and delegate_to tells ansible to carry out the task on that loadbalancer. Since the tasklist containing this task is performed once for every server in the “webservers” hostgroup, this task is in effect done for every webserver on every loadbalancer.

On the loadbalancers, the ansible haproxy module enables us to ensure that each webserver is enabled and “UP”. The wait: yes ensures that the task doesn’t finish before the server is actually in an “UP” state according to the loadbalancer probes, as opposed to just enabling it if it was in maintenance state.

The host attribute takes the inventory_hostname (the FQDN of the webserver in this case) and picks out the first element (the shortname of the host), since that’s the name of the server in our haproxy definition. The {{ … }} is a jinja2 template, which opens up a lot of options when customisation is required.

The haproxy socket needs to have the “admin” flag in haproxy.cfg on the loadbalancer servers. E.g.

global
    stats socket /var/run/haproxy.sock user haproxy group haproxy mode 440 level admin
    # […etc]

In addition to checking state, this authorises disabling/enabling webservers through the socket.
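
For reference, the haproxy module essentially issues the same commands you could send by hand over the socket; a quick sketch using socat (“webfarm” and “web01” are example backend/server names):

# show status of all backends and servers
echo "show stat" | socat stdio /var/run/haproxy.sock

# roughly what "state: disabled" / "state: enabled" translate to
echo "disable server webfarm/web01" | socat stdio /var/run/haproxy.sock
echo "enable server webfarm/web01" | socat stdio /var/run/haproxy.sock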

At this point we have confirmed that both websites are up on all the webservers, and that all the webservers are active on all the loadbalancers. It’s time to start doing the actual work. We need to start off a new tasklist:


- name: upgrade packages and reboot (if necessary)
  hosts: webservers
  serial: 1    # one host at a time
  become: true # as root
  any_errors_fatal: true
  max_fail_percentage: 0
  vars:        # used by nagios-downtime/undowntime tasks
    icinga_server: monitoring.example.com

  tasks:

This tasklist loops through the webservers, one after the other, as root, and will abort the whole playbook run if anything goes wrong at any point in the tasklist.

The var “icinga_server” is used for setting/removing downtime in our icinga monitoring system. If you haven’t got one, just remove that bit, along with the downtime tasks further down.

At this point we initially jumped straight to the apt-get upgrade part. But over time, the cumulative effect of “wouldn’t it be handy if the automated package update also did X and Y?” has evolved the task list into something more complex and even more useful. We see this effect in other ansible playbooks as well.

Let’s first figure out what we want to upgrade…


    # do an "apt-get update", to ensure latest package lists
    - name: apt-get update
      apt:
        update-cache: yes
      changed_when: 0

    # get a list of packages that have updates
    - name: get list of pending upgrades
      command: apt-get --simulate dist-upgrade
      args:
        warn: false # don't warn us about apt having its own plugin
      register: apt_simulate
      changed_when: 0

    # pick out list of pending updates from command output. This essentially
    # takes the above output from "apt-get --simulate dist-upgrade", and
    # pipes it through "cut -f2 -d' ' | sort"
    - name: parse apt-get output to get list of changed packages
      set_fact:
        updates: '{{ apt_simulate.stdout_lines | select("match", "^Inst ") | list | splitpart(1, " ") | list | sort }}'
      changed_when: 0

    # tell user about packages being updated
    - name: show pending updates
      debug:
        var: updates
      when: updates.0 is defined

…that was a handful. We first do an apt-get update through the ansible apt module. Even though this changes files in /var/lib/apt/ we don’t really care – we only want ansible to mark a webserver as changed if it actually upgraded any packages. We therefore force the change flag to never be set by setting the changed_when meta parameter. We do this in many tasks throughout this playbook for the same reason.

Next we run an apt-get --simulate dist-upgrade and store the command output in a variable called “apt_simulate” for use by later tasks. We do this through the ansible command module since the apt module does not have support for --simulate. The command module will notice that we’re running apt-get directly and warn us that we might want to use the apt module instead. We tell it to skip that warning through the warn option.

The next task picks out the lines of stdout that start with “Inst”, splits each line on spaces and keeps the second field (the package name), then sorts the result into a full list of all the packages that will be upgraded.
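
For illustration, the matching lines in the simulated output look something like this (package names and versions are made up):

Inst libssl1.0.0 [1.0.2g-1ubuntu4.8] (1.0.2g-1ubuntu4.10 Ubuntu:16.04/xenial-updates [amd64])
Inst linux-image-generic [4.4.0.109.114] (4.4.0.112.118 Ubuntu:16.04/xenial-updates [amd64])

The set_fact task above reduces these to ['libssl1.0.0', 'linux-image-generic'].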

The list of packages is useful for the sysadmin to know, so we print it using the ansible debug module.

When starting to use ansible playbooks for routines like this, it can be quite useful to ask for sysadmin confirmation before doing any actual changes. If you want to request such a confirmation, this is a good place to do it.


    # request manual ack before proceeding with package upgrade
    - pause:
      when: updates.0 is defined

We now know what will be updated (if anything), and we’ve got sysadmin confirmation if we’re about to do any changes. Let’s get to work!


    # if a new kernel is incoming, remove old ones to avoid full /boot
    - name: apt-get autoremove
      command: apt-get -y autoremove
      args:
        warn: false
      when: '"Inst linux-image-" in apt_simulate.stdout'
      changed_when: 0

Most Debian/Ubuntu admins have at some time ended up with a full /boot when upgrading kernels because of old kernel packages staying around. While there are other ways to avoid this (especially in newer distro versions), it doesn’t hurt to make sure to get rid of any old kernel packages that are no longer needed.


    # do the actual apt-get dist-upgrade
    - name: apt-get dist-upgrade
      apt:
        upgrade: dist # upgrade all packages to latest version

Finally the actual command we set out to run! This is pretty self-explanatory.

…but… what did we do? Did we upgrade libc? Systemd? The kernel? Something else that needs a reboot? Newer Debian-based systems create the file /var/run/reboot-required if a reboot is necessary after a package upgrade. Let’s look at that…


    # check if we need a reboot
    - name: check if reboot needed
      stat: path=/var/run/reboot-required
      register: file_reboot_required

Using the ansible stat module, the result of a stat of the file /var/run/reboot-required has now been stored in the variable “file_reboot_required”.
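
As a side note, these systems also write the list of packages that triggered the reboot requirement to /var/run/reboot-required.pkgs. If you want that in the playbook output too, an optional addition could look like this:

    # optional: show which packages triggered the reboot requirement
    - name: list packages requiring the reboot
      command: cat /var/run/reboot-required.pkgs
      register: reboot_pkgs
      changed_when: 0
      when: file_reboot_required.stat.exists

    - debug:
        var: reboot_pkgs.stdout_lines
      when: file_reboot_required.stat.exists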

We could now gate each of the remaining tasks (the server reboot and so on) on the “exists” flag, but that would be quite a lot of clutter. There is a more elegant way of skipping the rest of the tasklist for the current webserver and moving straight to the next one.


    # "meta: end_play" aborts the rest of the tasks in the current «tasks:»
    # section, for the current webserver
    # "when:" clause ensures that the "meta: end_play" only triggers if the
    # current webserver does _not_ need a reboot
    - meta: end_play
      when: not file_reboot_required.stat.exists

In other words, we stop the current tasklist for the current webserver unless the file /var/run/reboot-required exists. If the file exists we need a reboot; if not, we can skip the reboot and continue with the next webserver.

This means that the rest of the tasklist will only be executed if the current webserver needs a reboot, so let’s start prepping just that.


    # add nagios downtime for the webserver
    - name: set nagios downtime for host
      delegate_to: "{{ icinga_server }}" # do this on the monitoring server
      nagios:
        action: downtime
        comment: OS Upgrades
        service: all
        minutes: 30
        host: '{{ inventory_hostname }}'
        author: "{{ lookup('env','USER') }}"

False positives in the monitoring system are bad, so we use the ansible nagios module to ssh to the icinga server and set downtime for the webserver we’re about to reboot, as well as for all services on it.

Next we take the webserver out of the loadbalancer cluster.


    - name: disable haproxy backend {{ inventory_hostname }}
      delegate_to: "{{ item }}"
      haproxy:
        state: disabled
        host: "{{ inventory_hostname | splitpart(0, '.') }}"
        socket: /var/run/haproxy.sock
        wait: yes
        #drain: yes # requires ansible 2.4
      with_items: "{{ groups.loadbalancers }}"

Using the same haproxy module that we earlier used to ensure that all haproxy backend servers were enabled, we now disable the webserver we’re about to reboot on all the loadbalancer servers. state: disabled means we want the server to end up in “MAINT” mode. Ideally we’d use the drain parameter as well, since the combination of the drain and wait flags ensures that all active connections to the webserver finish gracefully before we proceed to the reboot. The drain option was added in ansible 2.4, and some of our management nodes don’t have new enough ansible versions to support that parameter. Use it if you can.
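
On management nodes with ansible 2.4 or newer, a sketch of the same task with drain enabled:

    - name: drain and disable haproxy backend {{ inventory_hostname }}
      delegate_to: "{{ item }}"
      haproxy:
        state: disabled
        host: "{{ inventory_hostname | splitpart(0, '.') }}"
        socket: /var/run/haproxy.sock
        wait: yes
        drain: yes # let active sessions finish before entering MAINT
      with_items: "{{ groups.loadbalancers }}"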

Since ansible re-uses ssh connections to servers for consecutive tasks, we need to jump through a couple of hoops when rebooting.


    - name: reboot node
      shell: sleep 2 && shutdown -r now "Reboot triggered by ansible"
      async: 1
      poll: 0
      ignore_errors: true

    # poll ssh port until we get a tcp connect
    - name: wait for node to finish booting
      become: false
      local_action: wait_for host={{ ansible_ssh_host }}
          port={{ ansible_port }}
          state=started
          delay=5
          timeout=600

    # give sshd time to start fully
    - name: wait for ssh to start fully
      pause:
        seconds: 15


We first do a reboot through the ansible shell module with a sleep and some flags to avoid getting an ansible connection error.

The second block waits until the ssh port on the webserver starts accepting connections. Before Ubuntu 16.04 this was enough, but in 16.04 ssh accepts connections before it properly accepts logins during boot, so we do an extra wait to ensure that we can log into the webserver.

Ansible 2.3 has a wait_for_connection module which can probably replace the second and third block, but again some of our management nodes have older versions.
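
If all your management nodes are recent enough, a sketch of that simplification:

    # wait until ansible itself can connect and log in again
    - name: wait for node to come back up
      wait_for_connection:
        delay: 5
        timeout: 600

Since wait_for_connection verifies a full ansible connection rather than a bare tcp connect, it also sidesteps the 16.04 ssh quirk described above.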

We’ve now rebooted the server. Before we re-add it to the loadbalancing cluster, we need to make sure that the applications work as they should.


    # verify that services are back up
    - name: verify that site1.example.com is up
      uri:
        url: http://localhost/
        status_code: 200
        follow_redirects: none
        headers:
          Host: site1.example.com
      retries: 60
      delay: 2
      register: probe
      until: probe.status == 200

    - name: verify that site2.example.com is up
      uri:
        url: http://localhost/
        status_code: 200
        follow_redirects: none
        headers:
          Host: site2.example.com
      retries: 60
      delay: 2
      register: probe
      until: probe.status == 200

This is essentially the same check we ran before the package upgrade, except that it keeps retrying if it does not get an immediate 200 response: the uri task is retried up to 60 times with a 2 second delay between attempts, i.e. for about 2 minutes, until it returns a 200 response code. It’s not uncommon for web applications to take a while to start.

Now we’re pretty much done. We’ve upgraded and rebooted the webserver, and have confirmed that the virtualhosts respond with 200. It’s time to clean up.


    # reenable disabled services
    - name: re-enable haproxy backend {{ inventory_hostname }}
      delegate_to: "{{ item }}"
      haproxy:
        state: enabled
        host: "{{ inventory_hostname | splitpart(0, '.') }}"
        socket: /var/run/haproxy.sock
        wait: yes
      with_items: "{{ groups.loadbalancers }}"

    # remove nagios downtime for the host
    - name: remove nagios downtime for host
      delegate_to: "{{ icinga_server }}" # do this on the monitoring server
      nagios:
        action: delete_downtime
        host: '{{ inventory_hostname }}'
        service: all

These just undo what we did before the reboot. Note that the wait flag on the haproxy module asserts that the webserver actually ends up in an “UP” state in the loadbalancers after it is brought out of maintenance mode. In other words we’ll notice (and the ansible playbook will abort) if the haproxy probe thinks the webserver is unhealthy.

Heavily loaded webservers often need a bit of time to get “warm”. To ensure stability we wait a few minutes before we proceed to the next webserver.


    # wait a few minutes between hosts, unless we're on the last
    - name: waiting between hosts
      pause:
        minutes: 10
      when: inventory_hostname != ansible_play_hosts[-1]

Result

The end result is a playbook that we can trust to do its own thing without much oversight. If anything fails it’ll stop in its tracks, meaning that at most one webserver should end up in a failed state.
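
Running it is then just a matter of pointing ansible-playbook at it (the filename here is hypothetical):

  ansible-playbook upgrade-webservers.yml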
