Abstract
| Goal | Configure workloads to run on a dedicated set of cluster nodes and prevent other workloads from using those cluster nodes. |
Pod scheduling is the process of determining the node on which to place a pod in the OpenShift cluster.
The OpenShift built-in scheduler identifies the most suitable node for the pods when you create them. The default scheduler meets the needs of most OpenShift users.
However, in some situations, OpenShift administrators might want more control over which node is used for pod placement. You can use the OpenShift advanced scheduling features to configure pods to run on a particular node or alongside a specific pod.
OpenShift includes the following advanced scheduling features:
Scheduler profiles, to control how OpenShift schedules pods on nodes.
Pod affinity rules, to keep sets of pods close to each other, on the same nodes. For example, you can run a REST service and its database on the same node to minimize network latency.
Pod anti-affinity rules, to keep sets of pods far away from each other, on different nodes. For example, you can run replica pods of the same deployment in different nodes, so that if a node fails, then you do not lose all the pods for the workload.
Node affinity, to keep sets of pods running on the same group of nodes. For example, you can configure a pod to run on nodes with a specific CPU.
Node selectors, to schedule pods to a specific set of nodes. For example, you can schedule a pod to a node that provides special hardware that the pod needs.
Taints and tolerations, to avoid scheduling pods to a specific set of nodes. For example, you can block a pod from running on a node that is reserved for OpenShift cluster services or control plane services.
OpenShift also includes the pod disruption budget resource, which controls how many instances can be down at the same time during voluntary disruptions, such as when scaling down, updating applications, or draining a node for maintenance.
The previous diagram shows some of the advanced scheduling features.
For example, the pod with a node selector for the CPU=fast label can be placed only on the second node, which contains the CPU=fast label.
For a pod that has a preferred node affinity rule for the CPU=fast label, OpenShift first tries to schedule it on the node with the CPU=fast label.
If OpenShift cannot place the pod on that node, then OpenShift schedules the pod on another node.
If you create a deployment with the app=custom label and a pod anti-affinity rule for the same label, then OpenShift places the replicas for the deployment on different nodes.
If you taint a node with the type=mission-critical label, then OpenShift can schedule a pod on that node only if the pod has the correct toleration that matches the taint.
It is common in production environments to use more than one advanced scheduling feature at a time. For example, your cluster includes a workload that has a node selector to target GPU-enabled nodes, and pod anti-affinity rules to avoid downtime if a node is lost. The same workload also has a pod disruption budget to specify the accepted degradation during a cluster upgrade, so the workload could run slower but still be available to its users.
The default OpenShift pod scheduler determines the placement of pods onto nodes within the cluster. The scheduler reads data from the pod and identifies a suitable node based on configured profiles. After identifying the most suitable node, the scheduler creates a binding that associates the pod with a specific node, without modifying the pod.
The OpenShift default pod scheduler determines the placement of a pod onto a particular node according to the following procedure:
Node filtering: The OpenShift scheduler filters the nodes based on the configured constraints or requirements by running functions called predicates. Among other criteria, these predicates check whether a node has enough available CPU capacity to satisfy the pod's resource requests, whether the required ports are free, and whether the required volumes are available.
Prioritize the filtered list of nodes: In this step, the scheduler assesses each node by using a set of priority or scoring functions, and assigns each node a score from 0 to 10. A score of 0 indicates a poor fit, and a score of 10 indicates an excellent fit for hosting the pod. Additionally, OpenShift administrators can assign a numeric weight to each scoring function in the scheduler's configuration. With this weight attribute, administrators can prioritize certain scoring functions.
Select the best node: OpenShift sorts the nodes based on their scores and selects the node with the highest score. If multiple nodes receive the same score, then OpenShift selects one node randomly.
If a pod does not specify its resource requests, then the scheduler could place it on a node that is already full, which could lead to poor performance or even to the pod being killed if the node runs out of memory.
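For example, the following pod sketch declares CPU and memory requests so that the node-filtering step can exclude nodes without enough free capacity. The pod name, image, and values are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-scheduler                              # illustrative name
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi9/httpd-24  # example image
    resources:
      requests:
        cpu: 250m        # the scheduler only considers nodes with 250 millicores free
        memory: 256Mi    # and with 256 MiB of allocatable memory available
      limits:
        cpu: 500m
        memory: 512Mi
```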
The OpenShift scheduler profile controls how OpenShift schedules pods onto nodes.
You can set the scheduler profile by using the oc edit scheduler cluster command and modifying the spec.profile parameter.
The following scheduler profiles are available:
LowNodeUtilization:
By using this profile, OpenShift distributes pods evenly across nodes to keep resource usage low on each node.
This profile provides the default scheduler behavior.
HighNodeUtilization:
With this profile, OpenShift places as many pods as possible onto as few nodes as possible.
This profile minimizes the node count but increases the resource usage per node.
NoScoring:
This profile aims for quick scheduling cycles by disabling all the score plug-ins.
Thus, by using this profile you sacrifice better scheduling decisions for faster ones.
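For example, the following sketch of the cluster Scheduler resource, which the oc edit scheduler cluster command opens, selects the HighNodeUtilization profile:

```yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  profile: HighNodeUtilization   # or LowNodeUtilization (the default), or NoScoring
```

You can apply the same change non-interactively, for example with the oc patch scheduler cluster --type merge -p '{"spec":{"profile":"HighNodeUtilization"}}' command.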
In some situations, you might want more control over which node is used for pod placement. You can use the following OpenShift advanced scheduling features to configure pods to run on a particular node or alongside a specific pod.
A node selector specifies target node labels in your pod, so OpenShift schedules the pod only on nodes that have those labels. Thus, to use a node selector, you must define the target labels in the pod specification file and apply the same labels to the nodes.
You can use node selectors to place pods on specific nodes. You can also define cluster-wide node selectors to place pods on specific nodes anywhere in the cluster, and define project node selectors to place pods in a project on specific nodes.
For example, you can label nodes in your cluster by using their geographical location. Then, application developers can use node selectors to deploy pods on nodes in a geographical region for availability and latency.
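For example, after labeling a node with the oc label node node1 region=emea command (the node name and label are illustrative), the following pod sketch can be scheduled only on nodes that carry that label:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend                                     # illustrative name
spec:
  nodeSelector:
    region: emea             # schedule only on nodes with the region=emea label
  containers:
  - name: app
    image: registry.access.redhat.com/ubi9/httpd-24  # example image
```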
Affinity refers to a pod property that controls the pod preference for a node. Anti-affinity is a pod property that restricts the placement of the pod on a node.
You can use affinity and anti-affinity rules to control the following scenarios:
The node on which to place a pod, which is called node affinity.
The node on which to place a pod relative to other pods, which is called pod affinity.
With node affinity, you can specify in a pod the affinity in relation to a group of nodes. For example, you can configure a pod to run on nodes with a specific CPU or in a specific availability zone.
Node affinity, which is conceptually similar to node selectors, provides a more flexible way to specify constraints on node selection. Whereas node selectors specify the target nodes by using a set of label key-value pairs, node affinity enables a conditional approach with logical operators in the matching process.
With node affinity, you can define required and preferred rules. Required rules work similarly to node selectors: the OpenShift scheduler schedules the pod only if the conditions are met. With preferred rules, the OpenShift scheduler tries to schedule the pod on a node that meets the rules. If the scheduler does not find such a node, then it schedules the pod on a node that does not meet the rules.
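As a sketch, the following pod combines a required node affinity rule for a cpu=fast node label, similar to the earlier diagram example, with a preferred rule for a zone label. The pod name, label keys, and values are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fast-cpu-app                                 # illustrative name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cpu                 # required: only nodes labeled cpu=fast qualify
            operator: In
            values:
            - fast
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone   # preferred: favor this zone if possible
            operator: In
            values:
            - us-east-1a
  containers:
  - name: app
    image: registry.access.redhat.com/ubi9/httpd-24  # example image
```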
With pod affinity and pod anti-affinity, you can control the nodes for your pod based on the labels of other pods.
By using pod affinity rules, you can locate a pod on the same group of nodes as other pods based on its label selector. As an example, you can configure the scheduler to place pods of two services onto the same group of nodes, because the services communicate with each other regularly.
By using pod anti-affinity rules, you can prevent the scheduler from locating a pod on the same group of nodes as other pods based on its label selector. For example, you can configure the scheduler to prevent a pod of a particular service from being on the same nodes as pods of another service, because the two services interfere with each other and reduce performance. Another typical scenario is to use node topology labels, such as availability zones in a cloud provider or racks in a data center, and to use anti-affinity rules to avoid putting pods of the same workload in the same failure domain.
Pod affinity, pod anti-affinity rules, and node affinity rules each have two types: required and preferred.
Required rules must be met before the scheduler places a pod on a node. If you use required rules and those rules are not met, then the scheduler cannot find an appropriate node to place the pod.
Preferred rules specify conditions that the scheduler tries to enforce, but the scheduler still schedules the pod if it cannot find a matching node.
You can configure pod and node affinity and anti-affinity rules in the pod specification YAML files by setting the spec.affinity parameter.
For that parameter, you can set the nodeAffinity, podAffinity, and podAntiAffinity properties, and specify a required rule, or a preferred rule, or both.
If you specify both required and preferred rules, then the node must first meet the required rules, and then the scheduler attempts to meet the preferred rules.
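For example, the following deployment sketch uses a required pod anti-affinity rule under spec.template.spec.affinity so that replicas with the app=custom label never share a node, similar to the earlier diagram example. The names are illustrative.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-app                                   # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: custom
  template:
    metadata:
      labels:
        app: custom
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: custom                    # repel pods that carry the same label
            topologyKey: kubernetes.io/hostname   # at most one matching pod per node
      containers:
      - name: app
        image: registry.access.redhat.com/ubi9/httpd-24  # example image
```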
You can configure taints on nodes so that a node accepts a pod only if the pod has a matching toleration. A tainted node repels all pods except those pods that have a toleration for that taint. For example, OpenShift automatically taints the control plane nodes, so that the pods that manage the control plane can be scheduled on them. However, the control plane nodes reject any other data plane pods that users deploy.
You can apply taints to a node in the node specification YAML file by setting the spec.taints parameter, and apply tolerations to a pod in the pod specification YAML file by setting the spec.tolerations parameter.
Taints and tolerations consist of a key, value, and effect.
The key is any string, up to 253 characters.
The value is any string, up to 63 characters.
The effect is one of the following options:
NoSchedule: If the pod does not tolerate the taint, then the scheduler does not place it on the node.
Existing pods on the node continue running on the node.
PreferNoSchedule: If the pod does not tolerate the taint, then the scheduler tries to avoid placing it on the node.
Existing pods on the node continue running on the node.
NoExecute: If the pod does not tolerate the taint, then the scheduler does not place it on the node.
Existing pods on the node that do not tolerate the taint are evicted from the node.
Tolerations also include the operator parameter.
The following values for the operator parameter are possible:
Equal: The toleration matches the taint if the key, value, and effect parameters match.
This behavior is the default.
Exists: The toleration matches the taint if the key and effect parameters match.
You must leave the value parameter blank.
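For example, you can taint a node with the oc adm taint nodes node1 type=mission-critical:NoSchedule command (the node name is illustrative, and the key and value mirror the earlier example). Then, only pods with a matching toleration, such as in the following sketch, can be scheduled on that node:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-app                                 # illustrative name
spec:
  tolerations:
  - key: type
    operator: Equal                # match the key, value, and effect of the taint
    value: mission-critical
    effect: NoSchedule
  containers:
  - name: app
    image: registry.access.redhat.com/ubi9/httpd-24  # example image
```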
A pod disruption budget is a policy resource that controls the disruption of pods during voluntary disruptions, such as scaling down, updating applications, or draining a node for maintenance. Pod disruption budgets do not apply to involuntary disruptions, such as node failures.
By setting up pod disruption budgets, you can manage the availability and stability of your applications during disruptions, to reduce the risk of service degradation or downtime when maintaining or updating your cluster.
The configuration for a pod disruption budget consists of the following parts:
A label selector to target a specific set of pods. By using the label selector, you define the pods that the pod disruption budget protects. Only pods that match the label selector are subject to the budget constraints. The pod selector in a pod disruption budget typically selects only pods from the same deployment.
The availability level, which specifies the minimum number of pods that must be available simultaneously. You can define the following availability levels:
minAvailable: Defines the minimum number of pods that must always be available, even during a disruption.
A pod is available when it has the Ready condition with the True value.
maxUnavailable: Defines the maximum number of pods that can be unavailable during a disruption.
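For example, the following pod disruption budget sketch keeps at least two pods with the app=custom label available during voluntary disruptions. The name, selector, and count are illustrative.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: custom-app-pdb               # illustrative name
spec:
  minAvailable: 2                    # use either minAvailable or maxUnavailable, not both
  selector:
    matchLabels:
      app: custom                    # protects only the pods that match this selector
```

With this budget in place, a node drain that would reduce the number of available matching pods below two is blocked until other replicas become ready.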
For more information about pod scheduling on OpenShift, refer to the Controlling Pod Placement onto Nodes (Scheduling) section in the Red Hat OpenShift Container Platform 4.14 Nodes documentation at https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/nodes/index#controlling-pod-placement-onto-nodes-scheduling
For more information about pod disruption budgets, refer to the Understanding How to Use Pod Disruption Budgets to Specify the Number of Pods That Must Be Up section in the Red Hat OpenShift Container Platform 4.14 Nodes documentation at https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/nodes/index#nodes-pods-pod-distruption-about_nodes-pods-configuring