Backing up the Rados Object Gateway

This post originally appeared in our sysadvent series and has been moved here following the discontinuation of the sysadvent microsite.

Amazon S3 has been around for a while, and it has become increasingly popular to use S3 or S3-like solutions as an object store. In many cases S3 replaces NFS as the chosen type of file system.

And with good reason. Separating the application instance from the application state is almost always a good decision. And by changing from an architecture that requires low-level access to the host running the application to one that uses a REST interface, we can now deploy the application almost anywhere. A bunch of guests might be started in OpenStack, in an external cloud provider, or maybe in docker containers within a CoreOS cluster. It doesn’t really matter: you just need access to the HTTPS endpoint.

So at some point someone figured out that Amazon was on to something, and several open source implementations have appeared: Ceph/RadosGW and Riak/Riak-CS offer S3 implementations, and OpenStack Swift is another alternative. Other cloud providers have jumped on the bandwagon as well, with both Google and Microsoft offering their own take on this.

We saw demand for this from our own customers, and launched a service we call ‘Situla’. It is based on Ceph/Rados Gateway, but all in all it’s an S3 endpoint in our data-centers, in close proximity to our customers. To put some sugar on top, we built it to be site-redundant across three availability zones - so even if one of our data-centers burns to the ground, you should be home free as long as you got 200 OK on your PUT.

However, we’ve had a hard time finding backup solutions tailored for this - so we had to roll our own solution.

Failed attempts

We initially tried our luck with FUSE. However, the S3 API isn’t made for being mounted in a POSIX-like manner, and things slow down to something that looks like a halt pretty fast.

Next, we provisioned the disk space needed for the backups on a separate server, used s3cmd to periodically synchronize the buckets that needed backing up over to that server, and then used LVM snapshots to get something similar to incremental backup. From a functionality standpoint, s3cmd has us covered - but alas, it uses a lot of resources. Being sure that the backup is intact is also a challenge: reading the s3cmd logs and looking for error messages is one way to do this, but they don’t really give you enough information to draw a conclusion about the end result.
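To give an idea of what that second attempt amounts to, here is a rough sketch that only builds the two commands for one backup round. The bucket name, paths, volume name and snapshot size are all made up for illustration, and this is not the script that was actually used.

```python
import datetime


def backup_commands(bucket, dest_dir, volume, day, snapshot_size="10G"):
    """Build the commands for one backup round (all names are hypothetical).

    bucket:   name of the bucket to mirror
    dest_dir: local directory on the backup server, stored on `volume`
    volume:   path of the LVM logical volume holding dest_dir
    day:      date used to name the snapshot
    """
    # Mirror the bucket to local disk.
    sync = ["s3cmd", "sync", "s3://%s/" % bucket, dest_dir.rstrip("/") + "/"]
    # A dated snapshot freezes the state of the volume after the sync,
    # which is what stands in for an incremental backup here.
    snapshot = [
        "lvcreate", "--snapshot",
        "--name", "s3backup-%s" % day.strftime("%Y%m%d"),
        "--size", snapshot_size,
        volume,
    ]
    return [sync, snapshot]


if __name__ == "__main__":
    for cmd in backup_commands("customer-bucket", "/srv/s3backup/customer-bucket",
                               "/dev/vg0/s3backup", datetime.date(2015, 12, 11)):
        print(" ".join(cmd))
```

Each command list could be handed to subprocess.check_call. Note that a zero exit code from the sync still says little about whether every object made it across, which is the verification problem described above.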

All good things come in threes: Bareos plugin development

We use Bareos, a fork of Bacula, for doing most of our backups, and already have a functioning installation moving files to tape drives. All hosts that are being backed up are running either bareos-fd or bacula-fd, and full/incremental backups are handled by Bareos. A pretty conventional method of backing up data.

Bacula has had support for plugins since forever. This support comes as a C API, and it makes it possible for a backup agent to provide virtual files to Bacula. This is fine and all, but requires knowledge of Bacula development that takes some time to get into. However, the Bareos project has fairly recently added a wrapper plugin for the same API, exposing a Python interface - making it possible for a sysadmin with some coding experience to do something with this.

The dump server

I set up a VM for backup purposes. This server has the task of running an instance of the Bareos agent with my plugin, plus a local installation of Rados Gateway, so that backups don’t put load on the same service instances our customers use. It’s also nice to be able to configure the server as I wish for debugging purposes - watching the access logs as I watch Bareos work. I also found that Bareos can pause a running job while it moves files from a spool area to the tape drives. This pause can happen in the middle of backing up a file, which in turn means the dump server has to be set up with vastly different timeout values than we use for our production servers.

The plugin API

Writing a plugin for Bareos basically means writing a script that mimics file operations. This may sound like implementing a FUSE file system, but you only need to implement a small subset.

Bareos will ask the plugin for a file to back up, along with its name, ctime, atime and mtime. Depending on whether this is a full or an incremental backup, it will then either ask for the file to be opened or just move on and ask for the next file. This approach works pretty well together with the object list API in S3, which returns up to 1000 objects per call.
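The paging pattern can be sketched roughly like this. The list_page callable and the S3Object tuple are hypothetical stand-ins for the real listing call and its result, and changed_since only imitates the kind of timestamp comparison that decides whether an object has to be read again; it is not how the plugin or Bareos is actually implemented.

```python
from collections import namedtuple

# Hypothetical record for one listed object.
S3Object = namedtuple("S3Object", ["key", "mtime", "size"])


def iter_objects(list_page, page_size=1000):
    """Yield every object in a bucket, using one listing call per page.

    list_page(marker, max_keys) is a hypothetical stand-in for the S3 list
    call: it returns at most max_keys objects whose key sorts after marker,
    in key order.
    """
    marker = ""
    while True:
        page = list_page(marker, page_size)
        for obj in page:
            yield obj
        if len(page) < page_size:
            return  # a short page means the listing is exhausted
        marker = page[-1].key  # continue after the last key we saw


def changed_since(objects, last_backup):
    """Keep only the objects modified after the previous backup."""
    return (obj for obj in objects if obj.mtime > last_backup)


def fake_bucket(objects):
    """In-memory replacement for the listing call, for demonstration only."""
    ordered = sorted(objects)

    def list_page(marker, max_keys):
        return [o for o in ordered if o.key > marker][:max_keys]

    return list_page


if __name__ == "__main__":
    bucket = fake_bucket([S3Object("key-%05d" % i, i, 1) for i in range(2500)])
    print(sum(1 for _ in iter_objects(bucket)))
```

A real client would rely on the truncation flag in the listing response rather than on the page length, but the loop has the same shape: the plugin never needs more than one page of object names in memory.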

When Bareos has asked for a file to be opened, we make the request to the S3 endpoint. If the response status is 200 OK, we know we can proceed. If it’s 404 Not Found, we return an ordinary “file not found” error code. If it’s 503 Service Unavailable for one reason or another, we sleep for 10 seconds and try again a few times before we give up.
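As a standalone illustration of that policy, here is a minimal sketch. HTTPError, open_with_retry and the fetch callable are hypothetical names invented for the example, and treating every non-404 failure as retryable is a simplification.

```python
import errno
import time


class HTTPError(Exception):
    """Hypothetical stand-in for the S3 client's error type."""

    def __init__(self, status):
        Exception.__init__(self, "HTTP %d" % status)
        self.status = status


def open_with_retry(fetch, retries=3, delay=10, sleep=time.sleep):
    """Call fetch() until it succeeds and return (response, errno).

    Returns (response, 0) on success, (None, errno.ENOENT) for a 404 and
    (None, errno.EIO) once the retries are used up.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(), 0
        except HTTPError as e:
            if e.status == 404:
                # The object is gone, so retrying is pointless.
                return None, errno.ENOENT
            if attempt < retries:
                sleep(delay)  # give the gateway some time to recover
    return None, errno.EIO
```

Passing sleep in as a parameter is only there so the waiting can be replaced in tests. The errno values are what a caller would hand back to the backup agent as the reason the open failed.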

When a file has been successfully opened, Bareos will ask to get chunks of the file. We do this by calling .read(chunkSize) on the response object, and copy the buffer to Bareos. This in turn means that we don’t buffer up the entire file in memory, but leave the IO to be done by Bareos/RadosGW. This lets us run this VM with fairly low specs.
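The idea of reading in chunks can be shown on its own with any file-like object. This is a generic sketch, with a made-up function name and chunk size, rather than anything taken from the plugin:

```python
import io


def stream(source, sink, chunk_size=64 * 1024):
    """Copy source to sink one chunk at a time and return the byte count."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break  # an empty read means end of stream
        sink.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    src = io.BytesIO(b"x" * 200000)
    dst = io.BytesIO()
    print(stream(src, dst))
```

Memory use is bounded by chunk_size no matter how large the object is, which is why the size of the objects in a bucket has little bearing on how much RAM the dump server needs.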

This is an excerpt of the function that performs the core IO. Note that we’ve gutted the s3cmd source code to achieve this:

# Excerpt from the plugin class. It assumes that time, bareosfd, the Bareos
# constant tables (bRCs, bIOPS, bJobMessageType) and s3cmd's S3Error,
# S3DownloadError and ConnMan are imported at module level.

def try_open(self, s3bucket, s3object):
    # Start a streamed GET; the response stays open so it can be read in chunks.
    request = self.s3.create_request('OBJECT_GET', bucket=s3bucket, object=s3object)
    self.req = self.s3.recv_file_streamed(request)

def plugin_io(self, context, IOP):
    if IOP.func == bIOPS['IO_OPEN']:
        self.FNAME = IOP.fname
        self.retry = 0
        # The third path component is the bucket, the rest is the object key.
        sp = str(self.FNAME).split('/', 3)
        s3bucket = sp[2]
        s3object = sp[3]
        try:
            self.try_open(s3bucket=s3bucket, s3object=s3object)

        except S3Error as e:
            if e.status == 404:
                IOP.io_errno = 2  # ENOENT: report an ordinary "file not found"
                bareosfd.JobMessage(context, bJobMessageType['M_INFO'], "Attempt to open %s failed (404)\n" % (s3object))
            elif self.retry == 0:
                # Was "retry = 1", which only set a local variable.
                self.retry = 1
                bareosfd.JobMessage(context, bJobMessageType['M_INFO'], "Attempt to open %s failed (%s), retrying once in 10 seconds\n" % (s3object, e.status))
                time.sleep(10)
                # If this second attempt fails, the exception propagates.
                self.try_open(s3bucket=s3bucket, s3object=s3object)
                return bRCs['bRC_OK']
            IOP.status = -1
            return bRCs['bRC_Error']

        except S3DownloadError as e:
            if self.retry == 0:
                self.retry = 1
                bareosfd.JobMessage(context, bJobMessageType['M_INFO'], "Attempt to open %s failed (HTTP Conn. err.), retrying once in 10 seconds\n" % (s3object))
                time.sleep(10)
                self.try_open(s3bucket=s3bucket, s3object=s3object)
            else:
                return bRCs['bRC_Error']
        return bRCs['bRC_OK']

    elif IOP.func == bIOPS['IO_CLOSE']:
        # Hand the connection back to the pool for reuse.
        ConnMan.put(self.req['conn'])
        return bRCs['bRC_OK']

    elif IOP.func == bIOPS['IO_READ']:
        IOP.buf = bytearray(IOP.count)
        IOP.io_errno = 0
        try:
            # Read at most IOP.count bytes; a short read shrinks the buffer.
            buf = self.req['resp'].read(IOP.count)
            IOP.buf[:] = buf
            IOP.status = len(buf)
        except Exception:
            IOP.status = -1
            return bRCs['bRC_Error']

        return bRCs['bRC_OK']

The end result

We can back up files stored in an S3 endpoint mostly as if they were files in an ordinary file system. If a bucket contains a lot of objects, we shard it into multiple jobs - aiming for no job bigger than 250GB. This makes it easy for us to scale the solution as our data grows.
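One simple way to picture the sharding is to walk the object listing in order and start a new job whenever the next object would push the current one past the limit. The sketch below assumes exactly that; the function name, the use of decimal gigabytes and the grouping strategy are illustrative assumptions, not a description of how the jobs are really split.

```python
def shard_by_size(objects, limit=250 * 10**9):
    """Group (key, size) pairs, in listing order, into shards of at most limit bytes.

    An object that is larger than the limit ends up in a shard of its own.
    """
    shards, current, current_size = [], [], 0
    for key, size in objects:
        # Close the current shard if this object would push it past the limit.
        if current and current_size + size > limit:
            shards.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        shards.append(current)
    return shards


if __name__ == "__main__":
    listing = [("a", 100), ("b", 100), ("c", 100), ("d", 300)]
    for shard in shard_by_size(listing, limit=250):
        print(shard)
```

Because the keys stay in listing order, each shard covers a contiguous key range, which keeps it easy to describe a job by the first and last key it contains.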

This has proven to be quite reliable and it performs pretty well - and it scales, which is very important.

We intend to release a complete version of this Bareos-plugin to the public, eventually. Feel free to contact me if you want to know more!

December 11, 2015

Trygve Vea
