As mentioned in the previous ansible post, we use ansible quite a lot for day-to-day operations. While we prefer Puppet for configuration management, ansible is excellent for automating maintenance procedures.

One such procedure is gracefully applying package upgrades to application servers, including any required reboot. In this post we’ll take a look at upgrading a cluster of web application servers defined in the ansible hostgroup “webservers”. They’re located behind a redundant pair of haproxy load balancers in the hostgroup “loadbalancers”. The webservers in this example run Ubuntu 16.04.

The process

In short, we want to:

  1. Verify that the cluster is working as it should (we don’t want to bring down anything for maintenance if other parts of the cluster are already broken).
  2. For each webserver, one at a time:
    • Bring it out of the loadbalanced cluster
    • Upgrade packages and reboot if needed
    • Add it back into the loadbalanced cluster

Prerequisites

This playbook needs something like the unix “cut” program for massaging list output in a jinja2 template. For that we create a new filter plugin and tell ansible where to find it. Create a directory, and point ansible at it in your ansible.cfg:

  # in ansible.cfg:
  filter_plugins = /some/path/filter_plugins/

Now put the following filter plugin into the file “splitpart.py” in the above directory:

# splitpart.py: a jinja2 filter that works like the unix "cut" program
# (fields are zero-indexed)
def splitpart(value, index, char=','):
    if isinstance(value, (list, tuple)):
        # apply the split to every element of the list
        return [v.split(char)[index] for v in value]
    # plain string: split it and return the requested field
    return value.split(char)[index]

class FilterModule(object):
    def filters(self):
        return {'splitpart': splitpart}
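
A quick illustration of how the filter behaves in a jinja2 template (the values are just examples; the trailing comments show the results):

  {{ 'web01.example.com' | splitpart(0, '.') }}   # -> web01
  {{ ['a,b', 'c,d'] | splitpart(1) }}             # -> ['b', 'd']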

The playbook

Let’s do a breakdown of the full playbook.

Before we make any modifications we want to verify that everything is already working as it should. The monitoring system keeps an eye on this as well, but sysadmins are paranoid. First the tasklist header…


- name: ensure services are up before doing anything
  hosts: webservers
  any_errors_fatal: true # stop if anything is wrong
  serial: 1              # one server at a time
  become: false          # …no need for root

  tasks:

Let’s say we have two virtualhosts that need probing (site1.example.com and site2.example.com).


    - name: verify that site1.example.com is up
      uri:
        url: http://localhost/
        status_code: 200
        follow_redirects: none
        headers:
          Host: site1.example.com

    - name: verify that site2.example.com is up
      uri:
        url: http://localhost/
        status_code: 200
        follow_redirects: none
        headers:
          Host: site2.example.com

This uses the ansible uri module to fetch the front page of the two local sites on all the webservers (the Host header selects the virtualhost), and verify that they yield a 200 response code.

The default of the “status_code” attribute is 200, but I included it for easy tuning.

Next, we’ll also make sure that all the webservers are in the loadbalancing cluster. This will enable any webservers that were out of the cluster.


    - name: make sure all nodes are enabled and up in haproxy
      delegate_to: "{{ item }}"
      become: true # this part needs root
      haproxy:
        state: enabled
        host: "{{ inventory_hostname | splitpart(0, '.') }}"
        socket: /var/run/haproxy.sock
        wait: yes
      with_items: "{{ groups.loadbalancers }}"

Using with_items makes this whole task a loop that is executed once for each host in the “loadbalancers” hostgroup. For each iteration the variable “item” is set to the current loadbalancer, and delegate_to tells ansible to carry out the task on that loadbalancer. Since the tasklist containing this task is performed once for every server in the “webservers” hostgroup, this task is in effect done for every webserver on every loadbalancer.

On the loadbalancers, the ansible haproxy module enables us to ensure that each webserver is enabled and “UP”. The wait: yes ensures that the task doesn’t finish before the server is actually in an “UP” state according to the loadbalancer probes, as opposed to just enabling it if it was in maintenance state.

The host attribute takes the inventory_hostname (the FQDN of the webserver in this case) and picks out the first element (the shortname of the host), since that’s the name of the server in our haproxy definition. The {{ … }} is a jinja2 template, which opens up a lot of options when customisation is required.

The haproxy socket needs to have the “admin” flag in haproxy.cfg on the loadbalancer servers. E.g.

global
    stats socket /var/run/haproxy.sock user haproxy group haproxy mode 440 level admin
    # […etc]

In addition to checking state, this authorises disabling/enabling webservers through the socket.
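
For reference, the haproxy module essentially issues the same commands you could send by hand over the socket; a quick sketch using socat (“webfarm” and “web01” are example backend/server names):

# show status of all backends and servers
echo "show stat" | socat stdio /var/run/haproxy.sock

# roughly what "state: disabled" / "state: enabled" translate to
echo "disable server webfarm/web01" | socat stdio /var/run/haproxy.sock
echo "enable server webfarm/web01" | socat stdio /var/run/haproxy.sock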

At this point we have confirmed that both websites are up on all the webservers, and that all the webservers are active on all the loadbalancers. It’s time to start doing the actual work. We need to start off a new tasklist:


- name: upgrade packages and reboot (if necessary)
  hosts: webservers
  serial: 1    # one host at a time
  become: true # as root
  any_errors_fatal: true
  max_fail_percentage: 0
  vars:        # used by nagios-downtime/undowntime tasks
    icinga_server: monitoring.example.com

  tasks:

This tasklist loops through the webservers, one after the other, as root, and will abort the whole playbook run if anything goes wrong at any point in the tasklist.

The var “icinga_server” is used for setting/removing downtime in our icinga monitoring system. If you haven’t got one, just remove that bit, along with the downtime tasks further down.

At this point we initially jumped straight to the apt-get upgrade part. But over time, the cumulative effect of “wouldn’t it be handy if the automated package update also did X and Y?” has evolved the task list into something more complex and even more useful. We see this effect in other ansible playbooks as well.

Let’s first figure out what we want to upgrade…


    # do an "apt-get update", to ensure latest package lists
    - name: apt-get update
      apt:
        update-cache: yes
      changed_when: 0

    # get a list of packages that have updates
    - name: get list of pending upgrades
      command: apt-get --simulate dist-upgrade
      args:
        warn: false # don't warn us about apt having its own plugin
      register: apt_simulate
      changed_when: 0

    # pick out list of pending updates from command output. This essentially
    # takes the above output from "apt-get --simulate dist-upgrade", and
    # pipes it through "cut -f2 -d' ' | sort"
    - name: parse apt-get output to get list of changed packages
      set_fact:
        updates: '{{ apt_simulate.stdout_lines | select("match", "^Inst ") | list | splitpart(1, " ") | list | sort }}'
      changed_when: 0

    # tell user about packages being updated
    - name: show pending updates
      debug:
        var: updates
      when: updates.0 is defined

…that was a handful. We first do an apt-get update through the ansible apt module. Even though this changes files in /var/lib/apt/ we don’t really care – we only want ansible to mark a webserver as changed if it actually upgraded any packages. We therefore force the change flag to never be set by setting the changed_when meta parameter. We do this in many tasks throughout this playbook for the same reason.

Next we run an apt-get --simulate dist-upgrade and store the command output in a variable called “apt_simulate” for use by later tasks. We do this through the ansible command module since the apt module does not have support for --simulate. The command module will notice that we’re running apt-get directly and warn us that we might want to use the apt module instead. We tell it to skip that warning through the warn option.

The next task picks out the lines of stdout that start with “Inst”, splits each line on spaces and keeps the second field (the package name), then sorts the result into a full list of all the packages that will be upgraded.
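
For illustration, the matching lines in the simulated output look something like this (package names and versions are made up):

Inst libssl1.0.0 [1.0.2g-1ubuntu4.8] (1.0.2g-1ubuntu4.10 Ubuntu:16.04/xenial-updates [amd64])
Inst linux-image-generic [4.4.0.109.114] (4.4.0.112.118 Ubuntu:16.04/xenial-updates [amd64])

The set_fact task above reduces these to ['libssl1.0.0', 'linux-image-generic'].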

The list of packages is useful for the sysadmin to know, so we print it using the ansible debug module.

When starting to use ansible playbooks for routines like this, it can be quite useful to ask for sysadmin confirmation before doing any actual changes. If you want to request such a confirmation, this is a good place to do it.


    # request manual ack before proceeding with package upgrade
    - pause:
      when: updates.0 is defined

We now know what will be updated (if anything), and we’ve got sysadmin confirmation if we’re about to do any changes. Let’s get to work!


    # if a new kernel is incoming, remove old ones to avoid full /boot
    - name: apt-get autoremove
      command: apt-get -y autoremove
      args:
        warn: false
      when: '"Inst linux-image-" in apt_simulate.stdout'
      changed_when: 0

Most Debian/Ubuntu admins have at some time ended up with a full /boot when upgrading kernels because of old kernel packages staying around. While there are other ways to avoid this (especially in newer distro versions), it doesn’t hurt to make sure to get rid of any old kernel packages that are no longer needed.


    # do the actual apt-get dist-upgrade
    - name: apt-get dist-upgrade
      apt:
        upgrade: dist # upgrade all packages to latest version

Finally the actual command we set out to run! This is pretty self-explanatory.

…but… what did we do? Did we upgrade libc? Systemd? The kernel? Something else that needs a reboot? Newer Debian-based systems create the file /var/run/reboot-required if a reboot is necessary after a package upgrade. Let’s look at that…


    # check if we need a reboot
    - name: check if reboot needed
      stat: path=/var/run/reboot-required
      register: file_reboot_required

Using the ansible stat module, the result of a stat of the file /var/run/reboot-required has now been stored in the variable “file_reboot_required”.
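
As a side note, these systems also write the list of packages that triggered the reboot requirement to /var/run/reboot-required.pkgs. If you want that in the playbook output too, an optional addition could look like this:

    # optional: show which packages triggered the reboot requirement
    - name: list packages requiring the reboot
      command: cat /var/run/reboot-required.pkgs
      register: reboot_pkgs
      changed_when: 0
      when: file_reboot_required.stat.exists

    - debug:
        var: reboot_pkgs.stdout_lines
      when: file_reboot_required.stat.exists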

We could now gate each of the remaining tasks (the server reboot and so on) on the “exists” flag, but that would be quite a lot of clutter. There is a more elegant way of skipping the rest of the tasklist for the current webserver and moving straight to the next one.


    # "meta: end_play" aborts the rest of the tasks in the current «tasks:»
    # section, for the current webserver
    # "when:" clause ensures that the "meta: end_play" only triggers if the
    # current webserver does _not_ need a reboot
    - meta: end_play
      when: not file_reboot_required.stat.exists

In other words, we stop the current tasklist for the current webserver unless the file /var/run/reboot-required exists. If the file exists we need a reboot; if not, we can skip the reboot and continue with the next webserver.

This means that the rest of the tasklist will only be executed if the current webserver needs a reboot, so let’s start prepping just that.


    # add nagios downtime for the webserver
    - name: set nagios downtime for host
      delegate_to: "{{ icinga_server }}" # do this on the monitoring server
      nagios:
        action: downtime
        comment: OS Upgrades
        service: all
        minutes: 30
        host: '{{ inventory_hostname }}'
        author: "{{ lookup('env','USER') }}"

False positives in the monitoring system are bad, so we use the ansible nagios module to ssh to the icinga server and set downtime for the webserver we’re about to reboot, as well as for all services on it.

Next we take the webserver out of the loadbalancer cluster.


    - name: disable haproxy backend {{ inventory_hostname }}
      delegate_to: "{{ item }}"
      haproxy:
        state: disabled
        host: "{{ inventory_hostname | splitpart(0, '.') }}"
        socket: /var/run/haproxy.sock
        wait: yes
        #drain: yes # requires ansible 2.4
      with_items: "{{ groups.loadbalancers }}"

Using the same haproxy module that we earlier used to ensure that all haproxy backend servers were enabled, we now disable the webserver we’re about to reboot on all the loadbalancer servers. state: disabled means we want the server to end up in “MAINT” mode. Ideally we’d use the drain parameter as well, since the combination of the drain and wait flags ensures that all active connections to the webserver finish gracefully before we proceed to the reboot. The drain option was added in ansible 2.4, and some of our management nodes don’t have new enough ansible versions to support that parameter. Use it if you can.
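
On management nodes with ansible 2.4 or newer, a sketch of the same task with drain enabled:

    - name: drain and disable haproxy backend {{ inventory_hostname }}
      delegate_to: "{{ item }}"
      haproxy:
        state: disabled
        host: "{{ inventory_hostname | splitpart(0, '.') }}"
        socket: /var/run/haproxy.sock
        wait: yes
        drain: yes # let active sessions finish before entering MAINT
      with_items: "{{ groups.loadbalancers }}"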

Since ansible re-uses ssh connections to servers for consecutive tasks, we need to jump through a couple of hoops when rebooting.


    - name: reboot node
      shell: sleep 2 && shutdown -r now "Reboot triggered by ansible"
      async: 1
      poll: 0
      ignore_errors: true

    # poll ssh port until we get a tcp connect
    - name: wait for node to finish booting
      become: false
      local_action: wait_for host={{ ansible_ssh_host }}
          port={{ ansible_port }}
          state=started
          delay=5
          timeout=600

    # give sshd time to start fully
    - name: wait for ssh to start fully
      pause:
        seconds: 15


We first do a reboot through the ansible shell module with a sleep and some flags to avoid getting an ansible connection error.

The second block waits until the ssh port on the webserver starts accepting connections. Before Ubuntu 16.04 this was enough, but in 16.04 ssh accepts connections before it properly accepts logins during boot, so we do an extra wait to ensure that we can log into the webserver.

Ansible 2.3 has a wait_for_connection module which can probably replace the second and third block, but again some of our management nodes have older versions.
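
If all your management nodes are recent enough, a sketch of that simplification:

    # wait until ansible itself can connect and log in again
    - name: wait for node to come back up
      wait_for_connection:
        delay: 5
        timeout: 600

Since wait_for_connection verifies a full ansible connection rather than a bare tcp connect, it also sidesteps the 16.04 ssh quirk described above.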

We’ve now rebooted the server. Before we re-add it to the loadbalancing cluster, we need to make sure that the applications work as they should.


    # verify that services are back up
    - name: verify that site1.example.com is up
      uri:
        url: http://localhost/
        status_code: 200
        follow_redirects: none
        headers:
          Host: site1.example.com
      retries: 60
      delay: 2
      register: probe
      until: probe.status == 200

    - name: verify that site2.example.com is up
      uri:
        url: http://localhost/
        status_code: 200
        follow_redirects: none
        headers:
          Host: site2.example.com
      retries: 60
      delay: 2
      register: probe
      until: probe.status == 200

This is essentially the same check we ran before the package upgrade, except that it keeps retrying if it does not get an immediate 200 response: the uri task is retried up to 60 times with a 2 second delay between attempts, i.e. for about 2 minutes, until it returns a 200 response code. It’s not uncommon for web applications to take a while to start.

Now we’re pretty much done. We’ve upgraded and rebooted the webserver, and have confirmed that the virtualhosts respond with 200. It’s time to clean up.


    # reenable disabled services
    - name: re-enable haproxy backend {{ inventory_hostname }}
      delegate_to: "{{ item }}"
      haproxy:
        state: enabled
        host: "{{ inventory_hostname | splitpart(0, '.') }}"
        socket: /var/run/haproxy.sock
        wait: yes
      with_items: "{{ groups.loadbalancers }}"

    # remove nagios downtime for the host
    - name: remove nagios downtime for host
      delegate_to: "{{ icinga_server }}" # do this on the monitoring server
      nagios:
        action: delete_downtime
        host: '{{ inventory_hostname }}'
        service: all

These just undo what we did before the reboot. Note that the wait flag on the haproxy module asserts that the webserver actually ends up in an “UP” state in the loadbalancers after it is brought out of maintenance mode. In other words we’ll notice (and the ansible playbook will abort) if the haproxy probe thinks the webserver is unhealthy.

Heavily loaded webservers often need a bit of time to get “warm”. To ensure stability we wait a few minutes before we proceed to the next webserver.


    # wait a few minutes between hosts, unless we're on the last
    - name: waiting between hosts
      pause:
        minutes: 10
      when: inventory_hostname != ansible_play_hosts[-1]

Result

The end result is a playbook that we can trust to do its own thing without much oversight. If anything fails it’ll stop in its tracks, meaning that at most one webserver should end up in a failed state.
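
Running it is then just a matter of pointing ansible-playbook at it (the filename here is hypothetical):

  ansible-playbook upgrade-webservers.yml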
