Chapter 11.  Building a Large-scale Red Hat Ansible Automation Platform Deployment

Abstract

Goal

Use high availability techniques and automation mesh to scale up your Red Hat Ansible Automation Platform deployment.

Objectives
  • Design a set of distributed Ansible Automation Platform servers to operate at a large scale and with improved reliability and redundancy.

  • Distribute execution of Ansible Playbooks from automation controller control or hybrid nodes to remote execution nodes, communicating with them using automation mesh.

  • Monitor execution of jobs on different execution nodes and maintain and adjust the automation mesh.

Sections
  • Designing a Clustered Ansible Automation Platform Implementation (and Quiz)

  • Deploying Distributed Execution with Automation Mesh (and Guided Exercise)

  • Managing Distributed Execution with Automation Mesh (and Guided Exercise) (and Quiz)

Designing a Clustered Ansible Automation Platform Implementation

Objectives

  • Design a set of distributed Ansible Automation Platform servers to operate at a large scale and with improved reliability and redundancy.

Running Red Hat Ansible Automation Platform at Scale

The new architectural design and features in Red Hat Ansible Automation Platform 2 enable you to run Ansible automation on a large scale and with improved reliability, redundancy, and efficiency.

One of the key changes to Ansible Automation Platform is the new decoupled architecture, which splits the control plane from the execution plane, and the new automation mesh, which is used by nodes in the control plane to delegate automation jobs to nodes in the execution plane.

In earlier versions of Red Hat Ansible Automation Platform, Red Hat Ansible Tower (the predecessor of automation controller) ran in a hybrid mode, in which single servers provided both a web UI and API to control automation jobs, and the execution environment used to run automation jobs.

You can configure automation controller to separate these two modes. This separation means that you can have some nodes that provide only the web UI and API and that control automation jobs, and many other nodes, positioned closer to your managed hosts, that only execute automation jobs.

You can have multiple control nodes that schedule jobs cooperatively on execution nodes, spreading out the load across multiple systems. You can also add additional execution nodes to scale up your capacity to run jobs.

The control nodes communicate with the execution nodes by using automation mesh.

Note

Automation mesh and the new automation architecture replace the less powerful isolated nodes feature used by Red Hat Ansible Automation Platform 1.

Automation Mesh

Automation mesh is an overlay network used to distribute work from machines running the automation controller web UI to dedicated, dispersed execution nodes that run jobs. Nodes in the automation mesh establish peer-to-peer connections with each other over the existing network.

Automation mesh provides the following benefits:

  • Automation mesh offers a simple, flexible, and reliable way to independently scale up your control and execution capacity, expanding your automation close to the endpoints with minimal or no downtime.

  • Automation mesh uses an end-to-end encrypted, authenticated network protocol over TCP. By separating automation mesh communication from SSH communication, you can more easily control automation mesh traffic through firewalls.

  • Automation mesh is a multidirectional, multihopped overlay network that enables communication across constrained networks, to access endpoints not directly connected to automation controller. Traffic for a set of execution nodes that cannot be directly reached by the control node can be relayed through one or more hop nodes that can reach both the controller and the execution nodes.

  • Automation mesh provides the ability to reconfigure how traffic is routed across the mesh if one or more nodes becomes unresponsive or unreachable. This helps make Ansible Automation Platform resilient to network disruptions and latency issues.

  • Automation mesh enables operating a single distributed platform. This single distributed platform helps reduce the overhead of managing multiple, isolated Ansible Automation Platform clusters.

Types of Nodes on Automation Mesh

Red Hat Ansible Automation Platform 2 separates the control plane, which starts automation jobs, from the execution plane, which runs automation jobs. In each category, it provides the following different types of nodes:

Control Plane

The control plane runs persistent automation controller services, such as the web UI, the task dispatcher, project updates, and management jobs. Control plane nodes can be either hybrid nodes or control nodes.

  • Hybrid nodes: Hybrid nodes perform tasks for both the control plane and the execution plane. They control jobs and run jobs. This is the default node type for control plane nodes.

  • Control nodes: Control plane nodes that use the control node type only perform control plane tasks and do not perform any execution plane tasks.

Execution Plane

The execution plane executes automation functions on behalf of the control plane.

  • Execution nodes: Execution nodes run jobs by using container-based execution environments.

  • Hop nodes: Hop nodes route traffic to other execution nodes. Hop nodes cannot execute automation functions.

The connections between nodes in automation mesh are expressed by peer relationships. If two nodes have a peer relationship, they can directly contact each other over automation mesh.

The default inventory file for the installation script peers all nodes in the control plane with all nodes in the execution plane. You can reconfigure these peering relationships to suit your network infrastructure and requirements.

Figure 11.1: Execution and control plane
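
For example, a minimal installer inventory that separates the two planes might look like the following sketch. The host names are hypothetical, and required settings (such as admin_password, pg_host, and registry credentials in [all:vars]) are omitted. Because no peers variables are defined, the installer applies the default peering described above.

    [automationcontroller]
    controller1.example.com
    controller2.example.com

    [automationcontroller:vars]
    # Control-only nodes; the default node_type for this group is hybrid
    node_type=control

    [execution_nodes]
    exec1.example.com
    exec2.example.com

    [execution_nodes:vars]
    # The default node_type for this group is execution
    node_type=execution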

Using Instance Groups

An instance group is a group of execution or control nodes. You can use instance groups to run your automation jobs on specific sets of execution nodes.

You can configure a specific inventory of managed hosts to use a particular instance group of execution nodes to run automation jobs by default. You can also configure particular job templates to use a specific instance group by default. By selecting a specific instance group, you can improve performance, for example by running jobs on execution nodes that are in the same data center as the managed hosts in the inventory.

You might also have managed hosts that can only be reached by specific execution nodes. By putting those execution nodes in an instance group, you can make sure that jobs for those hosts only run on nodes in that instance group.

The installer creates the default instance group for all execution nodes and hybrid nodes, and the controlplane instance group for all control nodes and hybrid nodes. You cannot remove the default instance group.

You can create additional instance groups, either when you install Red Hat Ansible Automation Platform or through the web UI of automation controller. For example, you might create instance groups to group execution nodes that share a similar geographic location or that you want to serve a particular data center or cluster.
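
For example, the installer inventory accepts additional [instance_group_<name>] groups. In the following sketch, the group names and host names are hypothetical; hosts listed in an instance group must also appear in the execution_nodes group.

    [execution_nodes]
    exec-paris-1.example.com
    exec-paris-2.example.com
    exec-london-1.example.com

    [instance_group_paris]
    exec-paris-1.example.com
    exec-paris-2.example.com

    [instance_group_london]
    exec-london-1.example.com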

The instance group specified in the job template takes precedence over the one specified in the inventory. This in turn takes precedence over any default instance group specified for the organization that owns the project that was used for the job template.

The control node sends jobs to the execution node in the instance group with the most available capacity. If multiple execution nodes have the same capacity, jobs are sent to the first node listed in the instance group. If no instance group is specified for a job, then the default instance group is used.
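
If you have the awx command-line client installed and configured to authenticate to your automation controller, you can inspect instance groups and instance capacity with commands similar to the following sketch. It assumes that connection settings are provided through environment variables such as CONTROLLER_HOST, CONTROLLER_USERNAME, and CONTROLLER_PASSWORD.

    # List the configured instance groups
    awx instance_groups list -f human

    # List instances; the JSON output includes node_type, capacity, and consumed capacity
    awx instances list -f json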

Planning Network Communication and Firewalls

Automation mesh uses its own network protocol for communications between the control nodes, hop nodes, and execution nodes. This is a TLS-encrypted protocol that connects to port 27199/TCP by default, although you can change this port and the TLS certificates used during installation. The receptor service listens on this port for automation mesh communications.

Execution nodes still communicate directly with managed hosts using the SSH protocol. You can use a hop node to relay communications from a control node to an execution node using port 27199/TCP. By using a hop node, you can block control nodes from directly accessing execution nodes using the SSH protocol.

Important

The initial installation of all nodes requires SSH access from the machine where you run setup.sh.

Requirements for Control Nodes and Hybrid Nodes

Allow network communication using the following network ports:

Network ports        Service      Purpose
27199/TCP            Receptor     Used for communication between the control plane and hop nodes.
80/TCP and 443/TCP   HTTP/HTTPS   Used for accessing the automation controller web UI and API.
22/TCP               SSH          Used for secure access and administration, including configuration performed by Ansible using the setup.sh script.

If you restrict outbound access, then you must allow outbound access to 80/TCP and 443/TCP in order to download automation execution environments from a container registry.

Requirements for Hop Nodes

Allow network communication using the following network ports:

Network ports   Service    Purpose
27199/TCP       Receptor   Used for communication between the control plane and hop nodes.
22/TCP          SSH        Used for secure access and administration, including configuration performed by Ansible using the setup.sh script.

Requirements for Execution Nodes

Allow network communication using the following network ports:

Network ports   Service    Purpose
27199/TCP       Receptor   Used for automation mesh communication from control nodes or hop nodes.
22/TCP          SSH        Used for secure access and administration, including configuration performed by Ansible using the setup.sh script.

If you restrict outbound access, then you must allow outbound access to 80/TCP and 443/TCP in order to download automation execution environments from a container registry.

Execution nodes in a restricted environment might block SSH access from all machines except hop nodes.
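
If your nodes use firewalld (the default on Red Hat Enterprise Linux), you can open the required inbound ports with commands similar to the following sketch. Adjust the ports and services to match the node type and your own security policies.

    # On execution and hop nodes: allow inbound automation mesh traffic (default port 27199/TCP)
    sudo firewall-cmd --permanent --add-port=27199/tcp

    # On control and hybrid nodes: also allow access to the web UI and API
    sudo firewall-cmd --permanent --add-service=http --add-service=https

    # Activate the new permanent rules
    sudo firewall-cmd --reload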

Planning for Automation Mesh

Your use of automation mesh might evolve in stages.

  1. Initially, you might start with one hybrid node that acts as both a control node and an execution node.

    If you plan to add more nodes through automation mesh in the future, set up the initial hybrid node with an external, shared database or database cluster. (Otherwise the initial automation controller would be a single point of failure and would have load for both automation controller operations and database operations.)

    Figure 11.2: Single automation controller in hybrid mode
  2. You might add more hybrid nodes for resilience. The installer automatically configures peering among the control nodes.

    You might want to add a load balancer in front of the hybrid nodes to distribute user requests among them, or you might allow users to access whichever control node is most convenient.

    Figure 11.3: Control plane resilience with multiple hybrid mode automation controllers
  3. You might segregate the control plane and the execution plane by adding an execution node and converting the hybrid nodes to control nodes that provide the control interface but do not run jobs.

    Figure 11.4: Separated control plane and execution plane
  4. You might add more execution nodes for more execution capacity. Having more execution nodes also provides more resilience for the execution plane, in case an execution node fails.

    Figure 11.5: Execution plane resilience with multiple execution nodes
  5. Your network configuration might use an internal firewall to restrict access to certain hosts. You might add a hop node that has controlled access to both networks to reach restricted execution nodes that can manage the restricted hosts.

    Figure 11.6: Added hop node to relay communications

    Important

    You might need to work with your networking and security teams to ensure that the hop node can access the restricted network.

  6. You might add more hop nodes for resilience.

    In this example, the two hop nodes normally communicate with specific execution nodes, but both hop nodes can communicate with each of the execution nodes. If one hop node fails or loses connectivity, then you can update automation mesh peering relationships to direct traffic to the affected execution nodes through the other hop node.

    Figure 11.7: Multiple hop nodes that can access certain execution nodes
  7. You might organize execution nodes into instance groups. You can configure your job templates and inventories to run jobs for particular managed hosts on execution nodes in particular instance groups that are close to (or have access to) those managed hosts.

    Figure 11.8: Using instance groups to localize job execution
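
The following installer inventory sketch roughly corresponds to steps 5 and 7 above: two control nodes, one hop node that can reach the restricted network, and two restricted execution nodes that are reached only through the hop node. Host names are hypothetical, and required [all:vars] settings (admin_password, pg_host for the external database, registry credentials, and so on) are omitted.

    [automationcontroller]
    controller1.example.com
    controller2.example.com

    [automationcontroller:vars]
    node_type=control

    [execution_nodes]
    # The hop node initiates its mesh connection to the control plane
    hop1.example.com node_type=hop peers=automationcontroller
    # The restricted execution nodes peer only with the hop node
    exec-restricted-1.example.com peers=hop1.example.com
    exec-restricted-2.example.com peers=hop1.example.com

    [instance_group_restricted]
    exec-restricted-1.example.com
    exec-restricted-2.example.com

After installation, you can check mesh connectivity from a node with the receptorctl utility. The socket path shown here is an assumption; use the path configured on your deployment.

    sudo receptorctl --socket /run/receptor/receptor.sock status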

Providing Resilient Services

In addition to having multiple control nodes, you should consider deploying multiple private automation hub servers. If your job templates routinely use Ansible Content Collections from your private automation hub, or if you store custom automation execution environment images in the container registry provided by private automation hub, then that service becomes a single point of failure. Having multiple private automation hub servers can help reduce this risk.

You can also use a load balancer for multiple control nodes or multiple private automation hub servers. Using a load balancer can help avoid downtime if one or more nodes or servers fail or become unreachable.

Figure 11.9: Reference architecture for large scale with redundancy
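
For example, the installer inventory can define more than one private automation hub node. In this sketch, the host names are hypothetical and the automationhub_main_url variable is assumed to point to a load balancer that you manage in front of the hub nodes; other required hub settings (shared storage, database variables, and passwords) are omitted.

    [automationhub]
    hub1.example.com
    hub2.example.com

    [all:vars]
    # Clients and controller nodes reach automation hub through the load balancer
    automationhub_main_url=https://hub.example.com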

Note

Building more complex architectures that provide disaster recovery capabilities and use high availability database services is out of scope for this course.

If you are interested in these topics, then refer to the Deploying Red Hat Ansible Automation Platform reference architecture in the references at the end of this section.

Revision: do467-2.2-08877c1