Performing Rolling Configuration Updates

Objectives

  • Perform rolling network configuration updates, and automatically recover from errors.

Batch Node Configuration Strategy

Ansible supports rolling updates, a strategy that staggers deployments to batches of managed nodes. With this strategy, infrastructure deployment can complete with zero downtime.

Ansible can halt the deployment and limit the errors to managed nodes in a particular batch if an unforeseen problem occurs. With tests and monitoring in place, you can configure playbook tasks to perform the following actions:

  • Roll back configuration for managed nodes in the affected batch.

  • Quarantine affected managed nodes to enable analysis of the failed deployment.

  • Send notifications of the deployment to stakeholders.
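
Recovery actions like these are often implemented with block and rescue sections in a task list. The following minimal sketch assumes hypothetical configuration file names and uses a debug message in place of a real notification module:

---
- name: Update Cisco IOS with automatic rollback
  hosts: ios
  tasks:
    - name: Apply the candidate configuration
      block:
        - name: Push candidate configuration
          cisco.ios.ios_config:
            src: "{{ inventory_hostname }}.cfg"
      rescue:
        - name: Roll back to the saved configuration
          cisco.ios.ios_config:
            src: "{{ inventory_hostname }}.rollback.cfg"

        - name: Notify stakeholders of the failed deployment
          ansible.builtin.debug:
            msg: "Deployment failed on {{ inventory_hostname }}; configuration rolled back."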

Configuring Parallelism in Ansible by Using Forks

When Ansible processes a playbook, it runs each play in order. After determining the list of managed nodes for the play, Ansible runs through each task in order. Normally, all managed nodes must complete a task before any managed node starts the next task in the play.

In theory, Ansible could connect to all managed nodes in the play simultaneously for each task. This approach works fine for small lists of managed nodes, but if the play targets hundreds of managed nodes, then these connections can put a heavy load on the control node and the automation execution environment.

The maximum number of simultaneous connections that Ansible makes is controlled by the forks parameter in the Ansible configuration file. This parameter is set to 5 by default, which you can verify by running the following command:

[user@host demo]$ ansible-navigator config list -m stdout
...output omitted...
DEFAULT_FORKS:
  default: 5
  description: Maximum number of forks Ansible will use to execute tasks on target
    hosts.
  env:
  - name: ANSIBLE_FORKS
  ini:
  - key: forks
    section: defaults
  name: Number of task forks
  type: integer
...output omitted...

The previous output indicates that you can modify the forks parameter by using the forks key within the defaults section of an Ansible configuration file:

[defaults]
forks=5

Consider the following scenario. Assume that Ansible is configured with the default value of five forks and that the play has ten managed nodes. Ansible runs the first task in the play on the first five managed nodes, followed by a second round of execution of the first task on the next five managed nodes. After running the first task on all the managed nodes, Ansible runs the next task across all the managed nodes in groups of five nodes at a time. Ansible does this with each task in turn until the play ends.

Most Ansible modules that you use to manage network routers and switches run in the execution environment and not on the network device. Because of the increased load these modules place on the execution environment, you should be careful with increasing the forks parameter.

The ansible-navigator run command offers the -f or --forks option, which you can use to override the current value of the forks parameter.
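
For example, the following command allows up to ten simultaneous connections for a single run (the playbook name is hypothetical):

[user@host demo]$ ansible-navigator run update_ios.yml -m stdout -f 10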

Controlling Batch Size

By default, Ansible runs a play by running one task for all managed nodes before running the next task. Imagine that your play has a task that cannot succeed due to some error. If your play runs against all managed nodes in parallel, they all fail and abort the play when Ansible reaches that task, and no node completes the play. As a result, it is likely that none of the managed nodes are working correctly, which could lead to an outage.

To avoid this, you could process some managed nodes through the play before starting the next batch of managed nodes. If too many managed nodes fail, then you could have Ansible abort the entire play before all managed nodes attempt to run it.

The following sections describe how to configure your play to abort if too many managed nodes in a batch fail.

Setting a Fixed Batch Size

To process managed nodes through a play in batches, use the serial directive in your play. The serial directive specifies how many managed nodes should be in each batch. Ansible processes each batch of managed nodes through the play before starting the next batch. If all managed nodes in the current batch fail, the entire play is aborted, and Ansible does not start the next batch.

Consider the beginning part of this play:

---
- name: Update Cisco IOS
  hosts: ios
  serial: 2

In this example, the serial directive instructs Ansible to process the managed nodes in the ios host group in batches of two. If a batch runs through the play without error, Ansible starts the play again with the next batch.

This process continues until all the managed nodes in the play are processed. If the total number of managed nodes in the play is not divisible by the batch size, then the last batch might contain fewer managed nodes than the indicated value of the serial directive. In the previous example, the last batch contains one host if the total number of managed nodes is an odd value.

Remember, if you use an integer with the serial directive, then as the number of managed nodes in the play increases, the number of batches needed to complete the play also increases. With a serial value of 2, a host group with 200 managed nodes requires 100 batches to complete, but a host group with 20 managed nodes requires only 10 batches.

Setting Batch Size as a Percentage

You can also specify a percentage for the value of the serial directive:

---
- name: Update Cisco IOS
  hosts: ios
  serial: 25%

With this setting, each batch contains the specified percentage of the managed nodes in the play.

Ansible applies the percentage to the total number of managed nodes in the host group. If the resulting value is not an integer number of managed nodes, then the value is truncated (rounded down) to the nearest integer. The batch size cannot be zero managed nodes. If the truncated value is zero, then Ansible changes the batch size to one managed node.

The following table illustrates the use of a percentage and the resulting size and number of batches:

Description                        Host group 1   Host group 2   Host group 3
Total number of managed nodes      3              13             19
serial value                       25%            25%            25%
Exact number of managed nodes      0.75           3.25           4.75
Rounded down                       0              3              4
Resulting batch size               1              3              4
Number of batches                  3              5              5
Size of last batch                 1              1              3

Setting Batch Sizes to Change During the Play

You can gradually change the batch size as the play runs by setting the serial directive to a list of values. This list can contain any combination of integers and percentages and resets the size of each batch in sequence. If the value is a percentage, then Ansible computes the batch size based on the total size of the host group and not the size of the group of unprocessed managed nodes.

Consider the following example:

---
- name: Update Cisco IOS
  hosts: ios
  serial:
    - 1
    - 10%
    - 100%

Assume that the ios host group consists of 20 managed nodes. The first batch contains a single managed node.

The second batch contains 10% of the total managed nodes in the ios host group. Ansible computes the actual value according to the previously discussed rules (20 × 10% = 2).

The third batch contains all the remaining unprocessed managed nodes in the play (20 - 1 - 2 = 17). This setting enables Ansible to process all the remaining managed nodes efficiently.

If unprocessed managed nodes remain after the batch that corresponds to the last entry in the serial list, then Ansible repeats that last batch size until all managed nodes are processed. Consider the following play, which runs against the ios host group with 100 managed nodes:

---
- name: Update Cisco IOS
  hosts: ios
  serial:
    - 1
    - 10%
    - 25%

The first batch contains one managed node and the second batch contains ten managed nodes (100 × 10% = 10). The third batch processes 25 managed nodes (100 × 25% = 25), leaving 64 unprocessed managed nodes (100 - 1 - 10 - 25 = 64). Ansible continues executing the play in batch sizes of 25 managed nodes (25% of 100) until fewer than 25 unprocessed managed nodes remain. In this example, the final batch processes the remaining 14 managed nodes (100 - 1 - 10 - 25 - 25 - 25 = 14).

Aborting the Play

By default, Ansible tries to get as many managed nodes to complete a play as possible. If a task fails for a managed node, then the node is dropped from the play, but Ansible continues to run the remaining tasks for other managed nodes. The play only stops if all managed nodes fail.

However, if you use the serial directive to organize managed nodes into batches, and all managed nodes in the current batch fail, then Ansible aborts the entire play: the next batch does not start, and the remaining managed nodes never run the play.

Ansible keeps a list of the active managed nodes for each batch in the ansible_play_batch variable. Ansible removes a managed node that fails a task from the ansible_play_batch list, and updates this list after every task.
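
For example, a task like the following sketch could display the managed nodes that are still active in the current batch:

- name: Show active managed nodes in this batch
  ansible.builtin.debug:
    var: ansible_play_batch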

Consider the following hypothetical playbook, which runs against the ios host group with 100 managed nodes:

---
- name: Update Cisco IOS
  hosts: ios
  tasks:
    - name: Step One
      cisco.ios.ios_command:
        commands:
          - SOME COMMAND

    - name: Step Two
      cisco.ios.ios_config:
        src: "{{ inventory_hostname }}.cfg"

If 99 managed nodes fail the first task, but one node succeeds, Ansible continues to the second task. When Ansible runs the second task, Ansible only runs that task for the one node that previously succeeded.

If you use the serial directive, then playbook execution continues as long as at least one managed node in the current batch has not failed. Consider this modification to the hypothetical playbook:

---
- name: Update Cisco IOS
  hosts: ios
  serial: 2
  tasks:
    - name: Step One
      cisco.ios.ios_command:
        commands:
          - SOME COMMAND

    - name: Step Two
      cisco.ios.ios_config:
        src: "{{ inventory_hostname }}.cfg"

If the first batch of two contains one managed node that succeeds and one that fails, then the batch completes, and Ansible moves on to the second batch of two. If both managed nodes in the second batch fail on a task in the play, Ansible aborts the entire play and does not start any more batches.

In this scenario, running the playbook can produce the following states:

  • One managed node completes the play.

  • Three managed nodes are in an error state.

  • The rest of the managed nodes remain unaltered.

Specifying Failure Tolerance

By default, Ansible only halts play execution when all managed nodes in a batch fail. However, you might want a play to abort if more than a certain percentage of managed nodes in the inventory fail, even if no entire batch fails. It is also possible to "fail fast" and abort the play if any tasks fail.

You can add the max_fail_percentage directive to a play to alter the default Ansible failure behavior. When the percentage of failed managed nodes in a batch exceeds this value, Ansible halts playbook execution.

Consider the following hypothetical playbook that runs against the ios host group, which contains 100 managed nodes:

---
- name: Update Cisco IOS
  hosts: ios
  max_fail_percentage: 30%
  serial:
    - 2
    - 10%
    - 100%
  tasks:
    - name: Step One
      cisco.ios.ios_command:
        commands:
          - SOME COMMAND

    - name: Step Two
      cisco.ios.ios_config:
        src: "{{ inventory_hostname }}.cfg"

The first batch contains two managed nodes. Because 30% of 2 is 0.6, a single node failure causes execution to stop.

If both managed nodes in the first batch succeed, then Ansible continues with the second batch of 10 managed nodes. Because 30% of 10 is 3, more than three node failures must occur for Ansible to stop playbook execution. If three or fewer managed nodes experience errors in the second batch, then Ansible continues with the third batch.

To implement a "fail fast" strategy, set the max_fail_percentage to zero.
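
For example, the following play header (a minimal sketch) aborts the play as soon as any managed node in a batch fails a task, because any failure exceeds the zero-percent threshold:

---
- name: Update Cisco IOS
  hosts: ios
  max_fail_percentage: 0
  serial: 2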

Important

To summarize the Ansible failure behavior:

  • If the serial directive and the max_fail_percentage values are not defined, then all managed nodes are run through the play in one batch. If all managed nodes fail, then the play fails.

  • If the serial directive is defined, then managed nodes are run through the play in multiple batches, and the play fails if all managed nodes in any one batch fail.

  • If the max_fail_percentage directive is defined, then the play fails if more than that percentage of managed nodes in a batch fail.

If a play fails, then Ansible aborts all remaining plays in the playbook.

Running a Task One Time

In certain scenarios, you might only need to run a task one time for an entire batch of managed nodes rather than running the task for each node in the batch. To accommodate this scenario, add the run_once directive to the task with a Boolean true (or yes) value.

Consider the following hypothetical task:

- name: Pause 30 seconds
  ansible.builtin.pause:
    seconds: 30
  run_once: true

Important

Setting the run_once: true directive causes a task to run only one time for each batch. If you only need to run a task one time for the entire play, and the play has multiple batches, then you can add the following conditional statement to the task:

- name: Pause 30 seconds
  ansible.builtin.pause:
    seconds: 30
  run_once: true
  when: inventory_hostname == ansible_play_hosts_all[0]

This conditional statement matches only the first managed node targeted by the play. The ansible_play_hosts_all variable lists all of the managed nodes in the play, not just the current batch, so the task runs only one time even when the play runs in multiple batches.

Revision: do457-2.3-7cfa22a