Upgrading OpenStack: A Best Practices Guide

Jesse Keating, Dustin Lundquist, Leslie Lundquist

At IBM Blue Box, our team has spent thousands of hours running Private Cloud based on OpenStack. This experience has given our engineers a few ideas about how to do things better! With major OpenStack releases arriving every six months, cloud operators like us must continue to upgrade to each new release, and we must strive to do so with as little interruption of service to our customer base as possible.

This document summarizes the best practices we have learned from our experience. After outlining the important distinction in the OpenStack Architecture between the Control Plane and the Data Plane, the document goes on to describe a preferred order of operations for stopping and restarting OpenStack modules for an upgrade while live customers go on operating without service interruption.

After reading this document, you will benefit from our hard-earned experience: you will understand why this recommended order of operations is so effective in maintaining cloud stability throughout the upgrade process, and you will know how to work with OpenStack more effectively.

If you have not performed an OpenStack upgrade before, we hope that the sample Ansible code included in this document will give you a running start at a smooth upgrade process, using the best practices available.

Concepts: Control Plane vs. Data Plane

One of the most important distinctions to keep in mind when working with OpenStack is the distinction between the Control Plane and the Data Plane. A good way to imagine the difference would be by analogy to a banking ATM:

If an ATM fails and you can’t see your account balance, it is a lot less upsetting than if an ATM fails, and you can’t withdraw any cash at that time. But the most upsetting scenario would arise if an ATM fails and the bank loses your money as a result!

Another way to consider the distinction relies upon the knowledge that the Control Plane/Data Plane architecture began with the original ARPANET networking architecture. On the Internet, the Data Plane is responsible for delivering data packets, based on routing, while the Control Plane is responsible for computing the packet routing, and especially for managing changes in network topology that otherwise might result in significant latency or packet loss. This networking metaphor is also helpful for understanding the complexities of OpenStack.

Now let’s consider how this concept applies to the OpenStack cloud:

Essentially, one could think of OpenStack as a control plane developed for the many existing Data Plane products and architectures already on the market, such as the KVM and Xen hypervisors (and now Docker, too). These hypervisors are in some sense “actually” providing the “service” to the customers, meaning that the hypervisors run the virtual machines that modify and store the data; they operate on the data itself. OpenStack works in conjunction with the available hardware and these virtualization products to create a fully functional cloud.

More specifically, for cloud operations, the Control Plane handles requests for user account logins, as well as CRUD (Create, Read, Update, Delete) requests for various resources such as instances, ssh keys, security groups, block volumes, Glance images, Swift objects, and so forth. If the Control Plane should crash in an IBM Blue Box cloud, for example, a user might not be able to log in or see certain VM instance information in Horizon.

To help clarify the portions of the upgrade you are about to undertake, the next figure is a general illustration of parts of the Control Plane and Data Plane, and of how they communicate using the RabbitMQ message broker and the various agents, as listed.

[Figure: Control Plane and Data Plane]

In contrast to the Control Plane, the Data Plane handles tasks such as updating the underlying database, file access, and other operations that used to be referred to as I/O or file-system tasks. If the Data Plane should crash, valuable data may be lost. The Data Plane informs the Control Plane when situations occur such as running out of file space, so that action can be taken by control software, by users, or by administrators to remedy the situation. It is also important to know that, if the Data Plane crashes, a customer loses access to their running compute instances. This loss creates an outage for whatever services the customer is running in their instances. In contrast, a Control Plane outage does not disrupt network access to the customer’s instances.

The Relevant History of OpenStack

To help clarify the relationships among the modules of OpenStack, it is important to recall that, historically, OpenStack began with Nova. (Originally there were only two projects in OpenStack: Nova and Swift.) Nova could still be considered the “kitchen sink” of OpenStack, since it still includes many functions that overlap with other OpenStack modules. Glance and Cinder were cleanly extracted from Nova in the Bexar and Folsom releases, respectively. For example, Nova still contains the nova-network functionality, which has been deprecated and is being phased out in favor of Neutron.

With this historical knowledge in mind, it makes sense that when performing an OpenStack cloud upgrade, the mutual dependencies among, for example, Glance, Cinder, and Nova would dictate the order of their upgrading procedure. Many software engineers are new to the world of OpenStack, and they may not possess this historical knowledge to help them track the dependencies. This document aims to give new engineers enough information to make good decisions when working with OpenStack in the context of high-availability Cloud services.

Minimizing Cloud Disruption When Updating OpenStack

When updating OpenStack, stability for users is key. You can minimize disruption to your OpenStack cloud by restarting a few services at a time during the update process. If only a few services at a time are restarting, the cloud as a whole will remain stable for users, because each individual service is down for as short a time as possible. A specific order of operations (orchestration), given in the next section, makes it possible to upgrade smoothly.

Another idea that we consider a best practice is not trying to perform every necessary upgrade at the same time. We separated our total cloud upgrade process into a few different areas, a shortcut that made our specific OpenStack upgrade much easier.

Best Practices Tip #1: Take a few shortcuts. Life will be much easier for your customers if you do these types of upgrades as separate upgrades from your OpenStack upgrade:

  • Neutron OVS Plugin to ML2 Plugin
  • Nova-network to Neutron (part of the deprecation mentioned previously)
  • ML2 OVS Mechanism Driver to Linux Bridge Mechanism Driver
  • Newer Kernel (we did it for newer customers as they came onto Linux Bridge and ML2)
  • Qemu

General Strategies for Smooth Cloud Operation and Updates

Generally speaking, one of the best techniques for creating a smooth upgrade process is to avoid inter-project version dependencies when upgrading OpenStack. How to accomplish it? Break down the problem! Load all your new code, be sure everything is working correctly with the new code, then start turning on the new features.

For example, if you introduce a new functionality in Nova that depends on a new functionality in Neutron, don’t turn on the new feature in Nova until all the new code has been installed for Neutron as well, because the new features will not work until their related functionalities are available.

We consider this particular approach—avoiding dependencies—to be one of the all-time “best practices” for minimizing cloud disruption at any time. It is always embodied in our regular deployment playbooks. You will see it illustrated in the code examples throughout the following sections.

Best Practice #2: Practice extreme re-use of everyday code. We insert the update code into our regular deployment playbooks, but conditionalized, so that the updates are only executed in the upgrade scenario. Using the same code helps us avoid maintaining separate code bases for deployment and updates.
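To make this concrete, here is a minimal, hypothetical sketch of such a conditionalized task. The force_sync and database_create variables mirror the ones used in the playbook excerpts later in this document; the service name and command are placeholders only.

# Hypothetical sketch: a migration task that only runs in an upgrade scenario
- name: sync example service database
  command: example-manage db_sync
  run_once: true
  when: database_create.changed or force_sync|default('false')|bool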

Orchestrating Your Updates

Four popular open source configuration management tools (Chef, Ansible, Puppet, and Salt) can all make it easier to configure and maintain thousands of OpenStack servers and software deployments. In general, all four are great tools and suitable for the task, but each has special strengths and distinctive assets to consider. The choice of deployment methodology should align with your functional criteria and, to a certain extent, your role in the organization.

When upgrading OpenStack, the process requires that steps be executed and completed in a specific order. This means that tools like Puppet, which rely on eventual consistency, won’t work without explicitly tracking state variables on the Puppet Master, which is a lot more work.

At IBM Blue Box, we use a tool called Ursula to manage our playbooks; Ursula is a collection of Ansible playbooks and roles. Samples from our playbooks are included in the next several sections of this document for illustrative purposes. All of these Ansible playbooks are available on GitHub:

Ursula https://github.com/blueboxgroup/ursula

The file called upgrade.yml at the root level is our upgrade playbook. All of the code is open source, and you are free to use it.

Because Ansible allows for speed and accuracy of deployment across our entire data center, we are able to push out small, frequent updates to each of our customers’ private clouds as well as major upgrades (such as the upgrade from Havana to Juno) that are the primary focus of this document. An update could be made in response to a security patch, a small bug that we found, or just a tweak to Horizon. Our ability to perform continuous deployment to production OpenStack clusters has been critical to architecting our product for our customers’ advantage. We prefer the “push” methodology of Ansible, in which a centralized node pushes updates out to all the servers simultaneously. For us, the scale of a cloud is closer to hundreds of nodes than thousands of nodes, so your preferences may vary, depending on the size of your cloud. We do recommend using Ansible, however, with the few additions that we have made in Ursula.

Best Practices Tip #3: Use a tool that allows for precise orchestration of actions across hosts. We use our own tool, called Ursula, which is backed by Ansible, as an alternative to Puppet.

Larger Orchestration: The Order of OpenStack Upgrades

The order in which you update the OpenStack services definitely matters if you want to avoid service outages for your customers: keep each service up and interoperating as long as possible, thereby minimizing disruption throughout the cloud as a whole. Here is the order we found most successful:

  • Glance
  • Cinder
  • Nova
  • Neutron
  • Swift
  • Keystone
  • Horizon

Although OpenStack primarily handles the control plane, Cinder and Neutron also interact with the data plane. For this reason, the configuration for these services has been split into control and data parts. To avoid downtime or potential data loss, we'll take care to upgrade both parts of these services, as shown later in this document.

More About Orchestration: Patterning Your Upgrades

Here are more general practices we have found most helpful, which will be covered in the following sections:

  • Notice the repeating patterns in OpenStack updates. Our Ansible code in the following sections will show you what we mean.
  • During update, delay the restarts of each service. You will see this principle illustrated in the sample code that follows.
  • Fail immediately if anything goes wrong (zero tolerance for failure). Again, you will see this principle illustrated in the sample code.
  • Non-destructive (idempotent) re-runs for each and every task.

Utilize the Repeating Pattern in OpenStack Upgrades

Every OpenStack upgrade requires a similar sequence of actions, as follows:

  • Put new code and config in place.
  • Stop the affected services.
  • Run the database migrations.
  • Restart the services.

We modeled this pattern into our orchestration for the update; the update pattern repeats for service after service within OpenStack.
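As a preview, here is a hedged sketch of that repeating pattern, using a hypothetical “example” role; the real plays for Glance, Cinder, Nova, and the other services shown later in this document all follow this same shape.

# Hypothetical sketch of the repeating upgrade pattern
- name: upgrade example service
  hosts: controller
  max_fail_percentage: 1

  roles:
    - role: example        # put the new code and config in place...
      restart: False       # ...but delay the service restarts
      force_sync: true     # force the database migration to run
  # the role then stops the services, runs the migration, and restarts them,
  # as shown concretely in the Glance tasks below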

Specifics: Upgrading from Havana to Juno

The material in this section was presented by Jesse Keating at the OpenStack Summit in Vancouver, B.C. (2015). It can be generalized to create an approach for OpenStack upgrades to later versions, as discussed in the later sections of this document.

Note: Before we could upgrade from Havana to Juno, we had to upgrade our database. At Blue Box, we use Percona XtraDB Cluster, based on MySQL 5.5, in a two-node cluster with one arbiter. MySQL 5.5 cannot handle the Neutron migrations; therefore, to get to a new version of Neutron, one must first get to a new version of MySQL.

First, Upgrade the Database

Make sure you have a good backup of your database before upgrading, of course, just in case the unexpected occurs. To accomplish the database upgrade, we used Ansible across the two DB hosts and the Arbiter. On each DB host, we performed the following actions:

  • Stop the DB.
  • Remove the packages.
  • Put in the updated configuration for the package.
  • Modify the compatibility settings. (The package will start as soon as you install it, and if it starts without this, it will corrupt your database.)
  • Turn off replication.
  • Install new packages.
  • Run upgrade migration.
  • Restart DB (again).
  • Repeat on other host.
  • Remove compatibility settings.
  • Restart DB (again again).

Best Practice Tip #4: Use an Ansible playbook. These DB upgrade actions were combined into an Ansible playbook. On our database servers, we are not tolerant of failure at all; having a playbook helps eliminate the kind of errors introduced when commands must be re-entered by hand for a special case such as an upgrade scenario.
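For illustration, here is a heavily abbreviated, hypothetical sketch of such a playbook for one DB host. The host group, package names, paths, and settings are placeholders, not the exact values from our Ursula playbook, which also handles replication, the arbiter, and idempotent re-runs.

# Hypothetical sketch of the per-host DB upgrade steps; names are placeholders
- name: upgrade the database on one host
  hosts: db[0]
  any_errors_fatal: true

  tasks:
    - name: stop the database
      service: name=mysql state=stopped

    - name: remove the old packages
      apt: name=percona-xtradb-cluster-server-5.5 state=absent

    - name: put in the updated configuration, including compatibility settings
      template: src=my.cnf.j2 dest=/etc/mysql/my.cnf
      # the new package starts as soon as it is installed; without the
      # compatibility settings already in place, it could corrupt the database

    - name: install the new packages
      apt: name=percona-xtradb-cluster-server-5.6 state=present

    - name: run the upgrade migration
      command: mysql_upgrade

    - name: restart the database
      service: name=mysql state=restarted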

Notice that in the list of steps given for upgrading the database, several restarts are required. Re-executing certain code sequences could overwrite information previously loaded, so it is important to ensure that everything that needs to be preserved through these restarts actually is preserved as the process moves forward.

Best Practice Tip #5: Design your playbook to be non-destructive if it could possibly be executed more than once in an upgrade scenario. Ideally, your code will detect whether a specific portion has been executed and skip a part that needs not be done again. Because our code operated in this way (it is idempotent), we could run this playbook over and over, as needed, without fear of missing a step.

Next, Upgrade the Arbiter

When we are running our cloud, we use two database hosts and an arbiter host, running on another compute node, to help maintain the necessary quorum. That way, we avoid the “split-brain” problem that can occur when you only have the two database hosts, and each one wants to believe that it is the Master.

When upgrading the database hosts, we also used an Ansible playbook to accomplish the upgrade on the arbiter. On the arbiter, we performed these actions (a sketch of these steps follows the list):

  • Purge all previous contents of the old package/configuration file.
  • Fix the filesystem permissions.
  • Run the role as if new.
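For illustration, here is a minimal, hypothetical sketch of those arbiter steps in Ansible; the package name, path, ownership, and role name are assumptions rather than the exact ones from our playbook.

# Hypothetical sketch of the arbiter upgrade; names and paths are placeholders
- name: upgrade the arbiter
  hosts: arbiter
  any_errors_fatal: true

  pre_tasks:
    - name: purge the old package and its configuration
      apt: name=percona-xtradb-cluster-garbd-2.x state=absent purge=yes

    - name: fix the filesystem permissions
      file: path=/var/lib/garbd state=directory owner=mysql group=mysql recurse=yes

  roles:
    - role: arbiter    # run the role as if this were a fresh deployment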

Don’t Touch the Rabbit!

For our messaging between services in our clouds, we use RabbitMQ, an open source message broker software (sometimes called message-oriented middleware) that implements Advanced Message Queuing Protocol (AMQP). We do not use the clustered RabbitMQ implementation. Instead, we rely on an active/passive setup managed by a floating IP address. While it may be tempting to use your OpenStack upgrade maintenance window to perform a RabbitMQ upgrade, don't do it. Any change to the RabbitMQ service could leave OpenStack services unable to send or receive messages until they are restarted, thus adding unwanted control-plane downtime.

Best Practice Tip #6: Learn the limitations of your particular OpenStack release. For example, don’t update RabbitMQ, especially under the Havana code base. The services that use Rabbit will stop talking until you restart all the services, which will create a negative customer service impact. It is not necessary.

NOTE: Under the Kilo release, the behavior of OpenStack services with respect to RabbitMQ is greatly improved.

Re-use Deployment Code Wherever Possible

As mentioned previously, our strategy is to re-use deployment code. In fact, we fold the upgrade steps into our regular deployment code, as shown below. You will see how we conditionalized the code so that the database migrations only ever run in an upgrade scenario. Let’s look at this process for each OpenStack module, one by one.

Glance

Glance has no real surprises. In our upgrade playbook, all we really do is run the Glance role, passing in a few extra things. Here is a section of our playbook that shows how we do a portion of this upgrade:


- name: upgrade glance
  hosts: controller
  max_fail_percentage: 1
  tags: glance

  roles:
    - role: glance
      force_sync: true
      restart: False
      database_create:
        changed: false

Notice that we do not restart the Glance control services right away; we wait until all the new source code has been updated, and we force a database sync. The services are restarted at the end: we let the database synchronization code take care of the restart after the configuration changes are made.


- name: stop glance services before db sync
  service: name={{ item }} state=stopped
  with_items:
    - glance-api
    - glance-registry
  when: database_create.changed or force_sync|default('false')|bool

- name: sync glance database
  command: glance-manage db_sync
  when: database_create.changed or force_sync|default('false')|bool
  run_once: true
  changed_when: true
  notify:
    - restart glance services
  # we want this to always be changed so it can notify the service

- meta: flush_handlers

- name: start glance services
  service: name={{ item }} state=started
  with_items:
    - glance-api
    - glance-registry

Notice that we stop the Glance services before we migrate the database. In a regular deployment, this final Glance service start will be a no-op because the services will have been restarted already.

Cinder

Cinder is more complicated than Glance because it is separated into Cinder data and Cinder control roles, whereas Glance has only a single role. Some plays are targeted at the volume hosts where Cinder is running. There, we run the Cinder data deployment role without restarting the services, and we explicitly stop the Cinder volume service.


# Cinder block
- name: stage cinder data software
  hosts: cinder_volume
  max_fail_percentage: 1
  tags:
    - cinder
    - cinder-volume

  roles:
    - role: cinder-data
      restart: False

    - role: stop-services
      services:
        - cinder-volume

Best Practices Tip #7: Find even more ways to reuse code! Notice that we are calling a new role, the stop-services role. This is a good example of code reuse, because this role can be used over and over.
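Although the real role lives in the Ursula repository, conceptually its tasks boil down to something like the following sketch (an illustration, not the verbatim role):

# Hedged sketch of a stop-services style role's main task
- name: stop listed services
  service: name={{ item }} state=stopped
  with_items: services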

Next, we go through our controllers and, just as with Glance, force a resynchronization of the database, because the database lives on the controllers. Also, because we are introducing Cinder v2 with our upgrade, we want to add the Keystone service entry for Cinder v2 to the catalog. Finally, we explicitly start the Cinder volume services again after the database migration has been completed. Here is some sample code for how we did that:


- name: stage cinder control software and stop services
  hosts: controller
  max_fail_percentage: 1
  tags:
    - cinder
    - cinder-control

  roles:
    - role: cinder-control
      force_sync: true
      restart: False
      database_create:
        changed: false

- name: start cinder data services
  hosts: cinder_volume
  max_fail_percentage: 1
  tags:
    - cinder
    - cinder-volume

  tasks:
    - name: start cinder data services
      service: name=cinder-volume state=started

- name: ensure cinder v2 endpoint
  hosts: controller[0]
  max_fail_percentage: 1
  tags:
    - cinder
    - cinder-endpoint

  tasks:
    - name: cinder v2 endpoint
      keystone_service: name={{ item.name }}
                        type={{ item.type }}
                        description='{{ item.description }}'
                        public_url={{ item.public_url }}
                        internal_url={{ item.internal_url }}
                        admin_url={{ item.admin_url }}
                        region=RegionOne
                        auth_url={{ endpoints.auth_url }}
                        tenant_name=admin
                        login_user=provider_admin
                        login_password={{ example.prov_admin_password }}
      with_items: keystone.services
      when: endpoints[item.name] is defined and endpoints[item.name]
            and item.name == 'cinderv2'

Nova

The Nova update is relatively straightforward. In these plays, “stage” means installing the new files onto the servers so that they are ready to go.


# Nova block
- name: stage nova compute
  hosts: compute
  max_fail_percentage: 1
  tags:
    - nova
    - nova-data

  roles:
    - role: nova-data
      restart: False
      when: ironic.enabled == false

    - role: stop-services
      services:
        - nova-compute
      when: ironic.enabled == False

- name: stage nova control and stop services
  hosts: controller
  max_fail_percentage: 1
  tags:
    - nova
    - nova-control

  roles:
    - role: nova-control
      force_sync: true
      restart: False
      database_create:
        changed: false

- name: start nova compute
  hosts: compute
  max_fail_percentage: 1
  tags:
    - nova
    - nova-data

  tasks:
    - name: start nova compute
      service: name=nova-compute state=started
      when: ironic.enabled == False

The previous code captures all the actions that are really needed to perform the Nova upgrade.

Neutron

When moving from Havana to any later release, you need to stamp the database as part of the upgrade. Prior to Icehouse there were no Neutron data migrations, but now there are, so you must give the migration tooling a starting point. This means that, as you can see in the code that follows, outside of our normal pattern we first need to stamp the database with the Havana version as a starting point (after making sure we have not already done so).

Here’s the normal pattern:


# Neutron block
- name: stage neutron core data
  hosts: compute:network
  max_fail_percentage: 1
  tags:
    - neutron
    - neutron-data

  roles:
    - role: neutron-data
      restart: False

- name: stage neutron network
  hosts: network
  max_fail_percentage: 1
  tags:
    - neutron
    - neutron-network

  roles:
    - role: neutron-data-network
      restart: False

- name: stage neutron control plane
  hosts: controller
  max_fail_percentage: 1
  tags:
    - neutron
    - neutron-control

Here is the database stamping portion:


  pre_tasks:
    - name: check db version
      command: neutron-db-manage --config-file /etc/neutron/neutron.conf
               --config-file /etc/neutron/plugins/ml2/ml2_plugin.ini
               current
      register: neutron_db_ver
      run_once: True

    - name: stamp neutron to havana
      command: neutron-db-manage --config-file /etc/neutron/neutron.conf
               --config-file /etc/neutron/plugins/ml2/ml2_plugin.ini
               stamp havana
      when: not neutron_db_ver.stdout|search('juno')
      run_once: True

After stamping the database, you can run the control role, which will upgrade the database to the Juno version.


  roles:
    - role: neutron-control
      force_sync: true
      restart: False
      database_create:
        changed: false

Then we can restart all of the Neutron services, including all of the agents. Notice that we have addressed the Control Plane first by updating the server API before the agents, which contact the Data Plane. (This pattern holds for updating Cinder and Nova as well.)


- name: restart neutron data service
  hosts: compute:network
  max_fail_percentage: 1
  tags:
    - neutron
    - neutron-data

  tasks:
    - name: restart neutron data service
      service: name=neutron-linuxbridge-agent state=restarted

- name: restart neutron data network service
  hosts: compute:network
  max_fail_percentage: 1
  tags:
    - neutron
    - neutron-network

  tasks:
    - name: restart neutron data network agent services
      service: name={{ item }} state=restarted
      with_items:
        - neutron-l3-agent
        - neutron-dhcp-agent
        - neutron-metadata-agent

This portion of the upgrade process is the only time in which some of the network connectivity might go down; however, the downtime is very short. Customers would probably see the outage as just a short “network blip.”

Swift

In upgrading Swift, all you really need to do is run the roles as if performing a deployment.


- name: upgrade swift
  hosts: swiftnode
  any_errors_fatal: true
  tags: swift

  roles:
    - role: haproxy
      haproxy_type: swift
      tags: ['openstack', 'swift', 'control']

    - role: swift-object
      tags: ['openstack', 'swift', 'data']

    - role: swift-account
      tags: ['openstack', 'swift', 'data']

    - role: swift-container
      tags: ['openstack', 'swift', 'data']

    - role: swift-proxy
      tags: ['openstack', 'swift', 'control']

Notice the clear distinction between Control Plane roles and Data Plane roles in the Swift play, shown in the tags of the previous sample code.

Keystone

Keystone is essentially entirely control plane, so there is only one role: run the role, and all is well. Just remember to run the role in a way that forces the database sync, as shown in the accompanying code.


- name: upgrade keystone
  hosts: controller
  max_fail_percentage: 1
  tags: keystone

  roles:
    - role: keystone
      force_sync: true
      restart: False
      database_create:
        changed: false

Horizon

Horizon is just a Web application, so to upgrade it, just run the role; that is all. There are no databases to worry about here.


- name: upgrade horizon
  hosts: controller
  max_fail_percentage: 1
  tags: horizon

  roles:
    - role: horizon

Potential Upgrading Pitfalls

This section covers several additional factors that could arise in your upgrade, as potential pitfalls. These include: Keystone PKI tokens, Virtual interface plugging between Neutron and Nova, and deleted Nova instances.

  • Keystone PKI tokens were introduced in the Icehouse version of OpenStack. When you are using PKI tokens, they can invalidate all of your other tokens until you restart your services. Essentially, PKI tokens can break your Keystone services when you restart Keystone unless you are very aware of this fact. As a best operating practice, we do not recommend these tokens, because they are not actually any faster.
  • Neutron/Nova “vif_plugging_is_fatal” version dependency. Starting with Icehouse or Juno, when you are interacting with Neutron, you can tell Nova to wait for Neutron to plug the virtual interface. This approach addresses a whole range of failure modes that can occur when an instance boots before the network is ready. If this feature is turned on for one side of the Nova/Neutron interaction but you have not upgraded the other side, you might not be able to finish booting your instances. In other words, this situation will cause builds to break until both sides of the code are upgraded. There are two workaround options:
    1. You can use an intermediate Nova config state that does not turn on this feature, in which case Nova will assume that if the call to Neutron is successful, the interface has come up. Eventually, the interface probably will come up … probably. (A nova.conf sketch of this intermediate state appears after this list.)
    2. You can accept the fact that, for a small period of time, the two services will not interoperate correctly. This is the option that we preferred.
  • Database content regarding deleted Nova instances is not actually deleted, just marked as deleted. On a long-running database or a heavily used OpenStack deployment, this behavior can mean that all of this deleted data is migrated along with everything else, and the database migration can take hours and hours (meaning Control Plane downtime for your customers, because you cannot have services running while you are migrating your database). There is no supported tool for trimming the database. Again, there are some options available for how to proceed:
    1. You can create your own tool to trim the database.
    2. You can just accept that you are going to incur this downtime.
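For workaround option 1 above, the intermediate Nova configuration state could look something like this nova.conf fragment (an illustration only; check the option names and defaults for your release):

# nova.conf fragment for the intermediate state (illustrative)
[DEFAULT]
# do not treat a missing "network-vif-plugged" event from Neutron as fatal,
# and do not wait for it, so instance builds can proceed
vif_plugging_is_fatal = False
vif_plugging_timeout = 0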

We ran a database migration with deleted data in a test environment and measured how long it was taking to purge the database of deleted data versus how long it was taking just to migrate all the deleted data. We found that the time spent is roughly the same, whichever way we went, because of how we use our cloud environment. Your results may differ, based on the time your cloud has been in operation and how your customers are using the database.

Best Practices Tip #8: Run some tests! Know ahead of time what your migration will be like, and inform your customers accordingly!

Tips for Upgrading from Juno to Kilo, and Beyond

Note that in this Havana to Juno upgrade scenario, we started with Glance because it is a relatively simple service, and it tested our model well without having interdependencies.

When upgrading from Juno to Kilo, your order of upgrade might not be exactly the same as ours. Your order may vary according to your configuration. You will need to do some testing.

For example, in a cloud with about 6000 compute nodes, it is important to be able to upgrade all of your other infrastructure (including the Nova conductor, API, scheduler, and so on) before you upgrade your nova-compute infrastructure, to minimize your compute downtime. The compute nodes are still running, doing any long-running tasks they have to do, but of course they cannot talk to the database. You can migrate the database and bring all of those services back up without touching the compute nodes.

As a result, there will be a brief time in which there is a version mismatch: you will be running new Nova-control with old Nova-compute infrastructure. The use of the Nova conductor, an optional service, makes this scenario workable because it acts as a translator between the compute nodes and the database, using message-object versioning.

At this point, you can roll through your compute nodes and slowly restart them, upgrading them as they gracefully shut down: they will restart with the new code. Then you will be able to turn on any Nova-compute features that depend on the compute nodes having that new code.
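For illustration, a rolling restart like this can be expressed in Ansible with a serialized play; the batch size and role name below are assumptions, not our exact playbook.

# Hypothetical sketch: roll through compute nodes a few at a time
- name: rolling upgrade of nova-compute
  hosts: compute
  serial: 5                # example batch size
  max_fail_percentage: 1

  roles:
    - role: nova-data      # lay down the new code and config
      restart: False

  tasks:
    - name: restart nova-compute with the new code
      service: name=nova-compute state=restarted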

In other words, when your cloud’s compute set is very large, this is a good method. In a cloud with fewer compute nodes, the amount of time needed to set this up is better spent just restarting everything, as we did in these examples.

The size threshold for using this method of allowing nova-compute version mismatches is reached when your orchestration takes more than a few minutes to shut down all the nova-compute services (shutting down nova-compute leaves all the VMs running, but the customer cannot launch new instances or manipulate existing ones), or when your SLA allows less than a few minutes of compute Control Plane downtime. If your SLA allows enough downtime, it is better to use the method we have shown and upgrade everything at once.

Note: In Juno, only Nova supports object versioning. With Kilo, Cinder also supports object versioning, which makes upgrades easier. Kilo is one of the last upgrades in which you have to stop the services. In Kilo and Liberty, upgrades will be smoother because the database can be migrated live: the migration creates the new tables where the data will live, and the new code looks in both places and slowly migrates the data over time.

Regarding preparation time, for a small-to-medium-size cloud operation, it took about a month to write and prove the upgrade playbook and the Method of Operation, so we could understand where the playbook might fail and how to restart it as needed. For each customer cloud site, we pulled down the Nova database from its latest backup, restored it onto a VM test machine, and ran the procedures to figure out how long the migration would take; we then informed each customer about the potential downtime they might experience.

For upgrading from Juno to Kilo, most of these scripts could remain the same. You don’t have to do the database upgrade or stamp the database as we did for this update, and the scripts already test for those scenarios and would not perform those actions. What to look at when updating from Juno to Kilo is whether you need to do a coordinated shutdown of the Nova services before performing the database migrations. When moving from Kilo to Liberty, however, the scripts should look a lot different.

Best Practices Tip #9: Every release of OpenStack comes with release notes, which include upgrade concerns for that release. Be sure to read the release notes.

Note: Our operating system is Ubuntu, and we build our own packages of OpenStack so that we can lay them down independently of old OpenStack versions. We did not update our operating system as a part of these updates; these scripts would probably work for that with modification. You would need to figure out the best time to incur the overhead for your OS upgrade. Our best guess would be: plan to upgrade your operating system first.

Data Plane, Control Plane, and Service Level Agreements

Another aspect to consider is that SLAs are not usually created with the distinction between a Control Plane outage and a Data Plane outage in mind. The consequences to the customer are much less drastic for a Control Plane outage than for a Data Plane outage, so perhaps cloud providers could give themselves some room for success by including this distinction in future SLAs.

Best Practices Tip #10: Always take note of the distinction between the Control Plane and the Data Plane in your code and operations practices. Whenever your program must interact with the Data Plane, remember that valuable data is kept there, and treat it with respect.
