Managing Distributed Execution with Automation Mesh

Objectives

  • Monitor execution of jobs on different execution nodes and maintain and adjust the automation mesh.

Managing Instance Groups in Automation Controller

When you install automation controller, it automatically creates the controlplane and default instance groups. The controlplane instance group is used by internal jobs that synchronize project contents and perform maintenance tasks. The default instance group runs user jobs that do not specify any other instance group.

You can create additional instance groups when you install automation controller by adding groups to the inventory for the setup.sh command with names that begin with instance_group_, followed by the name of the instance group you want to create.
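
For example, a minimal sketch of an installer inventory fragment that defines a testservers instance group might look like the following (the group suffix is illustrative, and the exec1 and exec2 hosts, which must also be defined as execution nodes elsewhere in the inventory, are taken from the examples later in this section):

[instance_group_testservers]
exec1.lab.example.com
exec2.lab.example.com

When setup.sh runs with this inventory, it creates an instance group named testservers that contains those execution nodes.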

You can also use the automation controller web UI to quickly create, modify, and delete instance groups without running the installation script.

Important

You cannot delete the controlplane or the default instance groups.

Creating Instance Groups

  • Navigate to the web UI for one of the control nodes and log in as the admin user.

  • Navigate to Administration > Instance Groups and then click Add > Add instance group.

  • Enter a name for the instance group and then click Save.
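
If you prefer to script instance group creation instead of using the web UI steps above, the controller also exposes instance groups through its REST API at /api/v2/instance_groups/. The following curl sketch is only an illustration: the prompt, admin password, and controller host name are placeholders to adapt to your environment.

[user@host ~]$ curl -k -X POST -u admin:PASSWORD \
> -H 'Content-Type: application/json' \
> -d '{"name": "Test Servers"}' \
> https://controller.lab.example.com/api/v2/instance_groups/

A successful request returns a JSON representation of the new instance group.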

Assigning Execution Nodes to an Instance Group

You can assign existing execution nodes to an instance group. If you created instance groups for geographic locations, then you might add execution nodes within those geographic locations to the appropriate instance groups. Using execution nodes that are in close geographic proximity to managed hosts reduces latency.

  • Navigate to Administration > Instance Groups and then click the link for an instance group.

  • Click the Instances tab and then click Associate.

  • Select the nodes that you want to add to the instance group, and then click Save.

The following screen capture shows hosts that belong to the Test Servers instance group.

Running a Health Check on the Nodes

Automation controller monitors the health of instances. On the instance group’s Instances tab in the web UI, hover over the health status for each instance to see the date and time stamp for the last health check. If desired, you can manually run a health check for one or more instances.

  1. To view the date and time of the last health check, hover over the health status icon for an instance, expand the brief details for an instance, or click the link for the instance name to view the full instance details.

    Figure 11.16: Display last health check date and time
  2. To manually run a health check, select one or more instances and then click Health Check.
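
Recent controller releases also expose health checks through the REST API. The exact path can vary by version, but in AWX-based releases a POST to /api/v2/instances/<id>/health_check/ typically triggers a check for that instance. The sketch below is an assumption to verify against your controller's API browser; the prompt, credentials, and instance ID are placeholders.

[user@host ~]$ curl -k -X POST -u admin:PASSWORD \
> https://controller.lab.example.com/api/v2/instances/42/health_check/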

Disassociating a Node from an Instance Group

Disassociating a node from an instance group removes the node from the instance group. You might do this so that you can associate the node with a different instance group or because you need to remove the node from your cluster.

  1. Navigate to Administration > Instance Groups and then click the name of the desired instance group.

  2. On the Instances tab, select the node that you want to remove from the instance group, and then click Disassociate. In the pop-up window, click Disassociate to confirm.

Important

Disassociating a node from an instance group does not remove the node from your cluster.

If you want to remove a node from your cluster, then you must add node_state=deprovision to the appropriate node or group in your installation script’s inventory file and then run the installation script again.
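
As a sketch, the relevant fragment of the installer inventory might look like the following, where exec2.lab.example.com is the execution node being removed (the group and host names follow the layout assumed by the examples in this section):

[execution_nodes]
exec1.lab.example.com
exec2.lab.example.com node_state=deprovision

Running the installation script again with this inventory removes exec2.lab.example.com from the automation mesh.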

Assigning Default Instance Groups to Inventories and Job Templates

You can explicitly configure inventories and job templates to use a particular instance group by default. If you do not specify an instance group for a job in either the job template or the inventory, then automation controller launches the job using the default instance group. When you install Ansible Automation Platform, the default instance group includes all hybrid and execution nodes (all nodes that can run jobs).

However, automation controller might select an execution node in the default instance group that is geographically distant on the network from the managed host, resulting in less than optimal performance.

Configuring an Inventory to Use Instance Groups

If you configure an inventory to select an instance group, then when a job template uses that inventory, automation controller assigns jobs to the execution nodes in that instance group. Configure an inventory to select an instance group using the following procedure:

  1. Navigate to Resources > Inventories and then click the Edit Inventory icon for the inventory that should use an instance group.

  2. Click the search icon for Instance Groups.

  3. Select the desired instance group from the list, click Select, and then click Save.

Configuring a Job Template to Use Instance Groups

Like inventories, you can configure job templates to use instance groups. If an instance group is defined in both an inventory and in a job template that uses the inventory, then the instance group defined in the job template takes precedence.

  1. Navigate to Resources > Templates and then click the Edit Template icon for the job template that you want to modify.

  2. Click the search icon for Instance Groups.

  3. Select the desired instance group from the list and click Select.

  4. Click Save.

Running a Job Template with Instance Groups

A job always runs in an instance group: the one defined in the job template, the one defined in the inventory, or the default instance group.

The Details tab for a job displays the name of the instance group and the name of the execution node used to run that job.

  1. Navigate to Resources > Templates and then click the Launch Template icon for the desired job template.

  2. After the job completes, click the Details tab.

  3. Job details display the name of the instance group and the name of the execution node used by the job.

Testing the Resilience of Automation Mesh

You can manually test the resilience of the control and execution planes by making the nodes unavailable and then running jobs to verify that they still succeed.

Testing Control Plane Resilience

  1. Navigate to Resources > Projects and click the Sync Project icon for the Demo Project resource.

  2. Navigate to Views > Jobs and then click the link for the Demo Project job. The job has the Source Control Update type.

  3. Click the Details tab and notice that the job used the controlplane instance group. Make note of the node that ran the job.

  4. Navigate to Administration > Instances and then disable the node identified in the previous step by setting the Enabled toggle to off. The status changes to Disabled.

  5. Navigate to Views > Jobs and then click the Relaunch Job icon for the Demo Project job.

  6. After the job completes, click the Details tab and notice that the job used a different node.

  7. Navigate to Administration > Instances and then enable the previously disabled node by setting the Enabled toggle back to on. The status changes to Enabled.

Testing Execution Plane Resilience

  1. Navigate to Resources > Templates and click the Launch Template icon for the Demo Job Template resource.

  2. After the job completes, click the Details tab. Make note of the instance group and execution node used by the job.

  3. Navigate to Administration > Instance Groups and then click the link for the instance group identified in the previous step.

  4. On the Instances tab, disable the previously identified execution node by setting Enabled to off. The status changes to Disabled.

  5. Navigate to Views > Jobs and then click the Relaunch Job icon for the Demo Job Template job.

  6. After the job completes, click the Details tab and notice that the job used a different execution node.

  7. Navigate to Administration > Instances and then enable the previously disabled execution node by setting the Enabled toggle back to on. The status changes to Enabled.

Monitoring Automation Mesh from the Web UI

Navigate to Administration > Topology View in the automation controller web UI to see an overview of the current status of automation mesh and its nodes.

  • Healthy nodes are marked with a checkmark and are colored green.

  • Unavailable nodes are marked with an exclamation point and are colored red.

  • Disabled nodes are marked with a circle and are colored gray.

In the following example, the control2, controller, exec1, and hop1 nodes are healthy. The exec2 node is in an error state and the exec3 node is disabled.

You can use the icons at the upper right of the Topology View page to zoom in and out, resize the diagram to fit the screen, reset the zoom to its default level, and turn the descriptive legend on and off. You can also click and drag to reposition the diagram, and use your mouse wheel to zoom in and out.

If you hover over any of the nodes, the web UI highlights the lines representing the peer relationships between that node and the other nodes in automation mesh. If you click any of the nodes, the web UI displays additional information about that node under Details to the right of the diagram.

Monitoring Automation Mesh from the Command Line

This section demonstrates useful commands for monitoring and troubleshooting automation mesh. Log in as the awx user on one of the automation controller machines and run the following commands.

Listing Nodes and Instance Groups

You can use the awx-manage list_instances command to list all the instances in the mesh. The command shows the status of each node.

  • Active and available nodes appear in green. These nodes display a Healthy status in the automation controller web UI.

  • Unavailable nodes appear in red and the version of ansible-runner displays as question marks. These nodes display an Error status in the automation controller web UI.

  • Disabled nodes contain the [DISABLED] text and appear in gray. These nodes display Disabled in the automation controller web UI.

In the following example, the control2, controller, exec1, and hop1 nodes are active and available. The exec2 node is unavailable and the exec3 node is disabled.

[awx@controller ~]$ awx-manage list_instances
[controlplane capacity=53 policy=100%]
	control2.lab.example.com capacity=16 node_type=control   ...
	controller.lab.example.com capacity=37 node_type=control ...

[default capacity=16 policy=100%]
	exec1.lab.example.com capacity=16 node_type=execution version=ansible-runner...
	exec2.lab.example.com capacity=0 node_type=execution version=ansible-runner-???
	[DISABLED] exec3.lab.example.com capacity=0 node_type=execution version=ansi...

[ungrouped capacity=0]
	hop1.lab.example.com node_type=hop heartbeat="2022-06-01 17:46:51"

Testing Communication on the Automation Mesh

You can use the receptorctl command to test communication on the automation mesh. The receptorctl command provides several subcommands, including:

  • receptorctl status to get the status of the entire automation mesh.

  • receptorctl ping to test connectivity between the current node and another node in the automation mesh.

  • receptorctl traceroute to determine the route and latency of communication on the automation mesh between the current node and another node.

The command requires that you specify the path to the control socket for the automation mesh receptor service. The following examples use the /var/run/awx-receptor/receptor.sock socket.

Use the status subcommand to view the entire mesh, including all of the nodes and how the nodes are connected.

In this example, the Route section indicates that communication to the exec3.lab.example.com execution node is routed through the hop1.lab.example.com hop node.

[awx@controller ~]$ receptorctl --socket /var/run/awx-receptor/receptor.sock \
> status
Node ID: controller.lab.example.com
Version: 1.2.3
System CPU Count: 4
System Memory MiB: 5752

Connection                 Cost
exec1.lab.example.com      1
exec2.lab.example.com      1
control2.lab.example.com   1
hop1.lab.example.com       1

Known Node                 Known Connections
control2.lab.example.com   controller.lab.example.com: 1 exec1.lab.example.co...
controller.lab.example.com control2.lab.example.com: 1 exec1.lab.example.com:...
exec1.lab.example.com      control2.lab.example.com: 1 controller.lab.example...
exec2.lab.example.com      control2.lab.example.com: 1 controller.lab.example...
exec3.lab.example.com      hop1.lab.example.com: 1
hop1.lab.example.com       control2.lab.example.com: 1 controller.lab.example...

Route                      Via
control2.lab.example.com   control2.lab.example.com
exec1.lab.example.com      exec1.lab.example.com
exec2.lab.example.com      exec2.lab.example.com
exec3.lab.example.com      hop1.lab.example.com
hop1.lab.example.com       hop1.lab.example.com

Node                       Service   Type       ...  Tags
exec1.lab.example.com      control   StreamTLS  ...  {'type': 'Control Service'}
exec2.lab.example.com      control   StreamTLS  ...  {'type': 'Control Service'}
control2.lab.example.com   control   StreamTLS  ...  {'type': 'Control Service'}
controller.lab.example.com control   StreamTLS  ...  {'type': 'Control Service'}
exec3.lab.example.com      control   StreamTLS  ...  {'type': 'Control Service'}
hop1.lab.example.com       control   StreamTLS  ...  {'type': 'Control Service'}

Node                       Secure Work Types
exec1.lab.example.com      ansible-runner
exec2.lab.example.com      ansible-runner
control2.lab.example.com   local, kubernetes-runtime-auth, kubernetes-inclust...
controller.lab.example.com local, kubernetes-runtime-auth, kubernetes-inclust...
exec3.lab.example.com      ansible-runner

Use the ping subcommand to test connectivity between the current host and another host.

The following example tests connectivity between the controller.lab.example.com and exec2.lab.example.com hosts.

[awx@controller ~]$ receptorctl --socket /var/run/awx-receptor/receptor.sock \
> ping exec2.lab.example.com
Reply from exec2.lab.example.com in 1.461675ms
Reply from exec2.lab.example.com in 504.934µs
Reply from exec2.lab.example.com in 528.547µs
Reply from exec2.lab.example.com in 722.001µs

Use the traceroute subcommand to view the route between nodes. In the following example, the controller.lab.example.com node connects to the exec3.lab.example.com node through the hop1.lab.example.com node.

[awx@controller ~]$ receptorctl --socket /var/run/awx-receptor/receptor.sock \
> traceroute exec3.lab.example.com
0: controller.lab.example.com in 507.316µs
1: hop1.lab.example.com in 1.032767ms
2: exec3.lab.example.com in 820.719µs
