Managing Rolling Updates

Objectives

  • Tune the behavior of the serial directive when batching hosts for execution, abort the play if it fails for too many hosts, and create tasks that run only once for each batch or for all hosts in the inventory.

Overview

Ansible has several features that enable rolling updates, a strategy that staggers deployments to batches of servers. With this strategy, infrastructure deployment can complete with zero downtime.

Ansible can halt the deployment and limit the errors to servers in a particular batch when an unforeseen problem occurs. With tests and monitoring in place, you can configure playbook tasks to perform the following actions:

  • Roll back configuration for hosts in the affected batch.

  • Quarantine affected hosts to enable analysis of the failed deployment.

  • Send notifications of the deployment to stakeholders.

Controlling Batch Size

By default, Ansible runs a play by running one task for all hosts before running the next task. Imagine that your play has a task that cannot succeed due to some error. If your play runs on all hosts in parallel, then every host fails when Ansible reaches that task, the play aborts, and no host completes the play. As a result, it is likely that none of the hosts are working correctly, which could lead to an outage.

To avoid this, you could process some of the hosts through the play before starting the next batch of hosts. If too many hosts fail, then you could have Ansible abort the entire play before all of your hosts attempt to run it.

The following sections describe how to configure your play to abort if too many hosts in a batch fail.

Setting a Fixed Batch Size

To process hosts through a play in batches, use the serial directive in your play. The serial directive specifies how many hosts should be in each batch. Ansible processes each batch of hosts through the play before starting the next batch. If all hosts in the current batch fail, the entire play is aborted, and Ansible does not start the next batch.

Consider the beginning portion of this play:

---
- name: Update Webservers
  hosts: web_servers
  serial: 2

In this example, the serial directive instructs Ansible to process hosts in the web_servers host group in batches of two hosts. When a batch finishes, Ansible repeats the play with a new batch, unless every host in the finished batch failed.

This process continues until all the hosts in the play are processed. If the total number of hosts in the play is not divisible by the batch size, then the last batch might contain fewer hosts than the indicated value of the serial directive. In the previous example, the last batch contains one host if the total number of web servers is an odd value.

Remember, if you use an integer with the serial directive, then as the number of hosts in the play increases, the number of batches needed to complete the play also increases. With a serial value of 2, a host group with 200 hosts requires 100 batches to complete, but a host group with 20 hosts requires ten batches.

Setting Batch Size as a Percentage

You can also specify a percentage for the value of the serial directive:

---
- name: Update Webservers
  hosts: web_servers
  serial: 25%

Each batch contains the percentage of the hosts that you specify.

Ansible applies the percentage to the total number of hosts in the host group. If the resulting value is not an integer number of hosts, then the value is truncated (rounded down) to the nearest integer. The batch size cannot be zero hosts. If the truncated value is zero, Ansible changes the batch size to one host.

The following table illustrates the use of a percentage and the resulting size and number of batches:

Description             Host group 1   Host group 2   Host group 3
Total number of hosts   3              13             19
serial value            25%            25%            25%
Exact number of hosts   0.75           3.25           4.75
Rounded down            0              3              4
Resulting batch size    1              3              4
Number of batches       3              5              5
Size of last batch      1              1              3
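
The arithmetic behind the table can be sketched as a small play that runs on the control node. This is only an illustration of the truncation and minimum-of-one rules described above, not the code that Ansible itself uses; the serial_percent and group_sizes variable names are assumptions made for this example.

```yaml
---
- name: Estimate batch sizes for a percentage serial value
  hosts: localhost
  gather_facts: false
  vars:
    serial_percent: 25          # hypothetical variable: the serial percentage
    group_sizes: [3, 13, 19]    # host counts taken from the table
  tasks:
    - name: Show the batch size and number of batches for each group
      ansible.builtin.debug:
        msg: >-
          {{ item }} hosts: batch size {{ batch_size }},
          {{ (item / (batch_size | int)) | round(0, 'ceil') | int }} batches
      vars:
        # Truncate the exact value, then enforce the minimum of one host.
        batch_size: "{{ [(item * serial_percent / 100) | int, 1] | max }}"
      loop: "{{ group_sizes }}"
```

Running this sketch should report batch sizes of 1, 3, and 4 hosts, which matches the "Resulting batch size" row.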

Setting Batch Sizes to Change During the Play

You can change the batch size as the play runs. For example, you could test a play on a batch of one host; if that host fails, the entire play aborts. However, if the play is successful on one host, you could increase the batch size to 10 percent of your hosts, then 50 percent of the managed hosts, and then the remainder.

You can gradually change the batch size by setting the serial directive to a list of values. This list can contain any combination of integers and percentages and resets the size of each batch in sequence. If the value is a percentage, then Ansible computes the batch size based on the total size of the host group and not the size of the group of unprocessed hosts.

Consider the following example:

---
- name: Update Webservers
  hosts: web_servers
  serial:
    - 1
    - 10%
    - 100%

The first batch contains a single host.

The second batch contains 10% of the total hosts in the web_servers host group. Ansible computes the actual value according to the previously discussed rules.

The third batch contains all the remaining unprocessed hosts in the play. This setting enables Ansible to process all the remaining hosts efficiently.

If unprocessed hosts remain after the batch that corresponds to the last entry in the serial list, Ansible keeps using that last batch size until all hosts are processed. Consider the following play, which runs against a web_servers host group with 100 hosts:

---
- name: Update Webservers
  hosts: web_servers
  serial:
    - 1
    - 10%
    - 25%

The first batch contains one host and the second batch contains ten hosts (10% of 100). The third batch processes 25 hosts (25% of 100), leaving 64 unprocessed hosts (1 + 10 + 25 processed). Ansible continues executing the play in batch sizes of 25 hosts (25% of 100) until fewer than 25 unprocessed hosts remain. In this example, the final batch processes the remaining 14 hosts (1 + 10 + 25 + 25 + 25 + 14 = 100).
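
One way to observe this sequence is to have each batch report its own size. The following is a minimal sketch that assumes the same 100-host web_servers group; it relies on the ansible_play_batch variable, which lists the active hosts in the current batch, and on the run_once directive so that only one message appears per batch.

```yaml
---
- name: Report the size of each batch
  hosts: web_servers
  gather_facts: false
  serial:
    - 1
    - 10%
    - 25%
  tasks:
    - name: Print how many hosts are in the current batch
      ansible.builtin.debug:
        msg: "This batch contains {{ ansible_play_batch | length }} hosts"
      run_once: true
```

With 100 hosts and no failures, the messages should report batches of 1, 10, 25, 25, 25, and 14 hosts.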

Aborting the Play

By default, Ansible tries to get as many hosts to complete a play as possible. If a task fails for a host, the host is dropped from the play, but Ansible continues to run the remaining tasks for other hosts. The play only stops if all hosts fail.

However, if you use the serial directive to organize hosts into batches and all hosts in the current batch fail, then Ansible stops the play for all remaining hosts, not just the remaining hosts in the current batch: the play aborts and the next batch does not start.

Ansible keeps a list of the active servers for each batch in the ansible_play_batch variable. Ansible removes a host that fails a task from the ansible_play_batch list, and updates this list after every task.

Consider the following hypothetical playbook, which runs against a web_servers host group with 100 hosts:

---
- name: Update Webservers
  hosts: web_servers
  tasks:
    - name: Step One
      ansible.builtin.shell: /usr/bin/some_command
    - name: Step Two
      ansible.builtin.shell: /usr/bin/some_other_command

If 99 web servers fail the first task, but one host succeeds, Ansible continues to the second task. When Ansible runs the second task, Ansible only runs that task for the one host that previously succeeded.

If you use the serial directive, playbook execution continues, provided that hosts remain in the current batch with no failures. Consider this modification to the hypothetical playbook:

---
- name: Update Webservers
  hosts: web_servers
  serial: 2
  tasks:
    - name: Step One
      ansible.builtin.shell: /usr/bin/some_command
    - name: Step Two
      ansible.builtin.shell: /usr/bin/some_other_command

If the first batch of two contains a host that succeeds and a host that fails, then the batch completes, and Ansible moves on to the second batch of two. If both hosts in the second batch fail on a task in the play, Ansible aborts the entire play and does not start any more batches.

In this scenario, running the playbook can produce the following states:

  • One host completes the play.

  • Three hosts are in an error state.

  • The rest of the hosts remain unaltered.
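
To reproduce a scenario like this one in a test environment, you could simulate failures rather than wait for a real one. The following sketch assumes a hypothetical simulate_failure variable that you set to true in the inventory for the hosts that should fail; it is not part of Ansible.

```yaml
---
- name: Demonstrate batch failure handling
  hosts: web_servers
  serial: 2
  gather_facts: false
  tasks:
    - name: Fail on hosts flagged by the hypothetical simulate_failure variable
      ansible.builtin.fail:
        msg: "Simulated failure on {{ inventory_hostname }}"
      when: simulate_failure | default(false) | bool

    - name: Report the hosts that are still active in this batch
      ansible.builtin.debug:
        msg: "Active in this batch: {{ ansible_play_batch | join(', ') }}"
```

If one host in the first batch and both hosts in the second batch set simulate_failure to true, then the second task should run only on the surviving host of the first batch, and the play should abort after the second batch.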

Specifying Failure Tolerance

By default, Ansible only halts play execution when all hosts in a batch fail. However, you might want a play to abort if more than a certain percentage of hosts in a batch fail, even if the entire batch does not fail. It is also possible to "fail fast" and abort the play as soon as any host fails a task.

You can add the max_fail_percentage directive to a play to alter the default Ansible failure behavior. When the number of failed hosts in a batch exceeds this percentage, Ansible halts playbook execution.

Consider the following hypothetical playbook that runs against the web_servers host group, which contains 100 hosts:

---
- name: Update Webservers
  hosts: web_servers
  max_fail_percentage: 30%
  serial:
    - 2
    - 10%
    - 100%
  tasks:
    - name: Step One
      ansible.builtin.shell: /usr/bin/some_command
    - name: Step Two
      ansible.builtin.shell: /usr/bin/some_other_command

The first batch contains two hosts. Because 30% of 2 is 0.6, a single host failure causes execution to stop.

If both hosts in the first batch succeed, then Ansible continues with the second batch of 10 hosts. Because 30% of 10 is 3, more than three host failures must occur for Ansible to stop playbook execution. If three or fewer hosts experience errors in the second batch, Ansible continues with the third batch.

To implement a "fail fast" strategy, set the max_fail_percentage to zero.
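
As a minimal sketch, a fail-fast version of the earlier hypothetical play might look like the following. The batch size of 10% is an arbitrary choice for this example, and the command path is the same placeholder used in the previous plays.

```yaml
---
- name: Update Webservers
  hosts: web_servers
  max_fail_percentage: 0    # any failure exceeds 0%, so the play aborts
  serial: 10%
  tasks:
    - name: Step One
      ansible.builtin.shell: /usr/bin/some_command
```

Because the failure percentage only has to exceed zero, a single failed host in any batch is enough to stop the play before the next batch starts.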

Important

To summarize the Ansible failure behavior:

  • If the serial directive and the max_fail_percentage values are not defined, all hosts are run through the play in one batch. If all hosts fail, then the play fails.

  • If the serial directive is defined, then hosts are run through the play in multiple batches, and the play fails if all hosts in any one batch fail.

  • If the max_fail_percentage directive is defined, the play fails if more than that percentage of hosts in a batch fail.

If a play fails, Ansible aborts all remaining plays in the playbook.

Running a Task Once

In certain scenarios, you might only need to run a task once for an entire batch of hosts rather than once for each host in the batch. To accommodate this scenario, add the run_once directive to the task with a Boolean true (or yes) value.

Consider the following hypothetical task:

- name: Reactivate Hosts
  ansible.builtin.shell: /sbin/activate.sh {{ active_hosts_string }}
  run_once: true
  delegate_to: monitor.example.com
  vars:
    active_hosts_string: "{{ ansible_play_batch | join(' ') }}"

This task runs once and runs on the monitor.example.com host. The task uses the active_hosts_string variable to pass a list of active hosts as command-line arguments to an activation script. The variable contains only those hosts in the current batch that have succeeded for all previous tasks.

Important

Setting the run_once: true directive causes a task to run once for each batch. If you only need to run a task once for all hosts in a play, and the play has multiple batches, then you can add the following conditional statement to the task:

when: inventory_hostname == ansible_play_hosts[0]

This conditional statement runs the task only for the first host in the play.
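
In context, the conditional might be used as in the following sketch. The task itself is a hypothetical example; it relies on the ansible_play_hosts variable, which lists the active hosts in the whole play and is not limited by the serial directive.

```yaml
---
- name: Update Webservers
  hosts: web_servers
  serial: 2
  gather_facts: false
  tasks:
    - name: Announce the rollout once for the whole play
      ansible.builtin.debug:
        msg: "Starting rollout to {{ ansible_play_hosts | length }} hosts"
      when: inventory_hostname == ansible_play_hosts[0]
```

The task should run in the first batch only, because later batches do not contain the first host in the play, so the task is skipped for every host in them.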

Revision: do374-2.2-82dc0d7