Configure a workload to run on dedicated nodes, and prevent other workloads from running on those nodes.
OpenShift advanced scheduling features include node selectors, which ensure that a pod is placed on certain nodes, and taints and tolerations, which prevent certain nodes from running some workloads.
Use node selectors to schedule pods to a dedicated set of nodes. For example, schedule a pod to a node that provides special hardware that the pod needs.
Use taints and tolerations to avoid scheduling pods to a dedicated set of nodes. For example, prevent a pod from running on a node that is reserved for OpenShift cluster services or for control plane services.
By using node selectors, you can specify target node labels on your pod, so the OpenShift scheduler tries to place the pod only on nodes that match the target labels. If the OpenShift scheduler does not find a node that matches the target labels, then your pod remains unscheduled.
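In that case, the pod stays in the Pending status, and the pod events show the scheduling failure. The following output is a sketch; the my-pod name and the node counts are illustrative:

[user@host ~]$ oc get pod my-pod
NAME     READY   STATUS    RESTARTS   AGE
my-pod   0/1     Pending   0          2m
[user@host ~]$ oc describe pod my-pod
...output omitted...
Events:
  Type     Reason            Message
  ----     ------            -------
  Warning  FailedScheduling  0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector.
...output omitted...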
Use pod node selectors to place pods on dedicated nodes. You define pod node selectors as attributes on the pod template.
Use project node selectors to place pods in a project on dedicated nodes. You define project node selectors as annotations in the project CR.
Use cluster-wide node selectors to place pods on dedicated nodes anywhere in the cluster. You define cluster-wide node selectors as attributes on the OpenShift scheduler CR.
Specific syntax details for node selectors are explained later in this section.
The previous diagram shows how pod node selectors work.
In the figure, the worker01 node has the key1=value1 label, but the worker02 node does not have that label.
For the first pod, OpenShift can schedule it on either node, because the pod does not have a node selector.
The second pod has a node selector that is defined as an attribute in the pod template.
OpenShift can schedule the second pod only on nodes with a label that matches the node selector.
Thus, the OpenShift scheduler can place the pod only on the worker01 node.
To ensure that the OpenShift scheduler finds a node that matches the target labels, first label your nodes. You can add labels directly to a node resource or indirectly by using a machine set resource.
If your cluster is installed in a way that supports machine set resources, then you must use machine sets to set labels on nodes; in that case, do not label nodes directly. If your cluster does not support machine set resources, for example because it is installed by using the user-provisioned infrastructure method, then you must label nodes directly.
If your cluster supports machine set resources, then OpenShift can delete and re-create the nodes at any time. Thus, if you label a node directly, then the label is lost when OpenShift re-creates the node, because the label is set outside the machine set.
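For example, to label nodes through a machine set, add the labels to the spec.template.spec.metadata.labels field of the machine set. The following sketch uses a hypothetical machine set name; machines that the machine set creates after the change get the label:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: mycluster-worker-a
...output omitted...
spec:
  template:
    spec:
      metadata:
        labels:
          key1: value1
...output omitted...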
You can add one or more labels to a node by using the following command:
[user@host ~]$ oc label node node1 key1=value1 ... keyn=valuen
node/node1 labeled

You can remove labels from a node by using the following command:
[user@host ~]$ oc label node node1 key1- ... keyn-
node/node1 unlabeled

Use node selectors on a pod to specify target node labels where the OpenShift scheduler tries to place the pod.
You can add one or more node selectors to a pod by using the following syntax:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
...output omitted...
spec:
  nodeSelector:
    key1: value1
    key2: value2
...output omitted...
In the previous example, the OpenShift scheduler places the pod only on nodes with both the key1=value1 and key2=value2 labels.
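In practice, you usually create pods through a workload resource, such as a deployment, and set the node selector on the pod template. The following command is a sketch that assumes a deployment named myapp:

[user@host ~]$ oc patch deployment/myapp --patch \
  '{"spec":{"template":{"spec":{"nodeSelector":{"key1":"value1"}}}}}'
deployment.apps/myapp patched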
You can read the node selectors for a pod by reviewing the Node-Selectors parameter as follows:
[user@host ~]$ oc describe pod myapp-95664666d-5l22c
...output omitted...
Node-Selectors:  key1=value1
                 key2=value2
...output omitted...

By default, the OpenShift scheduler tries to place pods onto any available node in your cluster. However, you can modify the default behavior of the OpenShift scheduler so that it adds default node selectors to any new pods that are created in the cluster. By using cluster-wide node selectors, you specify the default node selectors that OpenShift adds to the pods. You require administrative privileges to specify cluster-wide node selectors.
To define cluster-wide node selectors, you can edit the OpenShift scheduler CR as follows:
[user@host ~]$ oc edit scheduler cluster
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
...output omitted...
spec:
  defaultNodeSelector: key1=value1,key2=value2
...output omitted...

After you edit the OpenShift scheduler CR, wait for the pods in the openshift-kube-apiserver project to redeploy.
In the previous example, any new pods that you create in the cluster have the key1=value1 and key2=value2 node selectors, by default.
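To verify that a new pod receives the default node selectors, you can inspect the pod specification. The pod name in the following sketch is hypothetical:

[user@host ~]$ oc get pod my-pod -o yaml | grep -A2 nodeSelector
  nodeSelector:
    key1: value1
    key2: value2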
You can specify node selectors for a project, so OpenShift adds those node selectors to any new pods that are created in the project. You require administrative privileges to specify project node selectors.
To define project node selectors, you can create a YAML file for an OpenShift project CR, and define the node selectors in the metadata.annotations section, as follows:
apiVersion: v1
kind: Namespace
metadata:
  name: myproject
  annotations:
    openshift.io/node-selector: key3=value3,key4=value4
...output omitted...

In the previous example, any new pods that you create in the project have the key3=value3 and key4=value4 node selectors by default.
You can define node selectors at the following levels: pod, project, and cluster-wide.
Project node selectors take precedence over cluster-wide node selectors. Thus, if you define both project and cluster-wide node selectors, then OpenShift applies only the project node selectors to the new pods but not the cluster-wide node selectors.
Pod node selectors are additive to cluster-wide and project node selectors. Thus, if you define both pod and project node selectors, then OpenShift creates the pod with both the pod node selectors and the project node selectors. If you define both pod and cluster-wide node selectors, then OpenShift creates the pod with both the pod node selectors and the cluster-wide node selectors.
If you create a pod node selector with the same key but a different value than a project node selector or a cluster-wide node selector, then the node selectors conflict, and OpenShift fails to create your pod.
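For example, assume that the myproject project has the openshift.io/node-selector: key3=value3 annotation. A pod with the following node selector conflicts with the project node selector, and OpenShift rejects the pod at admission time:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  namespace: myproject
...output omitted...
spec:
  nodeSelector:
    key3: othervalue
...output omitted...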
You can use taints on nodes to prevent certain nodes from running some workloads. The OpenShift scheduler schedules pods on those nodes only if the pods have a matching toleration.
In the previous figure, the worker01 node has a taint, but the worker02 node does not have a taint.
The OpenShift scheduler cannot place the first pod on the worker01 node; the node rejects the pod, because the pod does not have a toleration that matches the taint.
Thus, OpenShift can schedule the first pod only on the worker02 node.
OpenShift can schedule the second pod on both nodes, because its toleration matches the taint.
Taints consist of a key, a value, and an effect.
The key is any string, up to 253 characters.
The value is any string, up to 63 characters.
The effect is one of the following options:
NoSchedule: If the pod does not tolerate the taint, then the node prevents scheduling the pod. Existing pods on the node continue running on the node.
PreferNoSchedule: If the pod does not tolerate the taint, then the scheduler tries to avoid scheduling the pod on the node. Existing pods on the node continue running on the node.
NoExecute: If the pod does not tolerate the taint, then the node prevents scheduling the pod, and the node evicts any existing pods that do not have a matching toleration.
You can apply a taint to a node in the node specification YAML file by setting the spec.taints parameter or by using the following command:
[user@host ~]$ oc adm taint nodes node1 key1=value1:NoSchedule
node/node1 tainted

After you apply a taint to a node, you can verify the taint by using the following command:
[user@host ~]$ oc describe nodes node1
Name:    node1
Roles:   worker
...output omitted...
Taints:  key1=value1:NoSchedule
...output omitted...
To remove a taint from a node, use the following command:
[user@host ~]$ oc adm taint nodes node1 key1=value1:NoSchedule-
node/node1 untainted

Tolerations consist of a key, a value, and an effect, similar to node taints.
Tolerations also include the operator parameter.
The following values for the operator parameter are possible:
Equal: The toleration matches the taint if the key, the value, and the effect parameters match.
This behavior is the default.
Exists: The toleration matches the taint if the key and the effect parameters match.
You must leave the value parameter blank.
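For example, the following toleration sketch matches any taint with the key1 key and the NoSchedule effect, regardless of the value of the taint:

...output omitted...
spec:
  tolerations:
  - key: "key1"
    operator: "Exists"
    effect: "NoSchedule"
...output omitted...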
When you define a toleration with the NoExecute effect, you can also define the tolerationSeconds parameter.
The tolerationSeconds parameter defines the time in seconds that a pod stays bound to a node with a matching taint before the node evicts the pod.
You can apply a toleration to a pod in the pod specification YAML file by setting the spec.tolerations parameter as follows:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
...output omitted...
spec:
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
    tolerationSeconds: 3600
...output omitted...
In the previous example, OpenShift can schedule the pod on nodes with the key1=value1:NoExecute taint.
If OpenShift schedules the pod on such a node, then the pod runs on the node for 3600 seconds before the node evicts it, and OpenShift reschedules the pod to another node.
By default, OpenShift taints nodes based on node conditions, and evicts pods accordingly. For example, OpenShift automatically taints nodes under some conditions, such as memory or disk pressure.
The following taints are built into OpenShift:
| Taint | Definition |
|---|---|
| node.kubernetes.io/not-ready | The node is not ready. This taint corresponds to the Ready=False node condition. |
| node.kubernetes.io/unreachable | The node is unreachable from the node controller. This taint corresponds to the Ready=Unknown node condition. |
| node.kubernetes.io/memory-pressure | The node has memory pressure issues. This taint corresponds to the MemoryPressure=True node condition. |
| node.kubernetes.io/disk-pressure | The node has disk pressure issues. This taint corresponds to the DiskPressure=True node condition. |
| node.kubernetes.io/network-unavailable | The node network is unavailable. |
| node.kubernetes.io/unschedulable | The node is unschedulable. |
| node.cloudprovider.kubernetes.io/uninitialized | This taint sets the node as unusable when you start the node controller with an external cloud provider. After the cloud controller manager initializes the node, the kubelet removes the taint. |
| node.kubernetes.io/pid-pressure | The node has process identifier (PID) pressure. This taint corresponds to the PIDPressure=True node condition. |
If a node reports one of the conditions from the previous table, then OpenShift applies the taint to the node until the condition clears.
For example, the following steps show how OpenShift automatically applies the node.kubernetes.io/unschedulable taint to an unschedulable node.
First, verify the taints for a node by using the following command:
[user@host ~]$ oc describe nodes node1 | grep Taints
Taints: <none>

Then, mark the node as unschedulable.
[user@host ~]$ oc adm cordon node1
node/node1 cordoned

Finally, verify that OpenShift automatically applies the node.kubernetes.io/unschedulable taint with the NoSchedule effect to the node.
[user@host ~]$ oc describe nodes node1 | grep Taints
Taints: node.kubernetes.io/unschedulable:NoSchedule

OpenShift applies the taints from the previous table with the NoSchedule effect, except for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints, where OpenShift uses the NoExecute effect.
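To return the node from the previous example to a normal state, mark the node as schedulable again, so that OpenShift removes the taint:

[user@host ~]$ oc adm uncordon node1
node/node1 uncordoned
[user@host ~]$ oc describe nodes node1 | grep Taints
Taints: <none>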
For taints with the NoSchedule effect, the OpenShift scheduler places a pod on the node only if the pod has a matching toleration, or after the node returns to a normal state and the taint clears. Pods that are already running on the node remain unaltered.
For taints with the NoExecute effect, the same scheduling restriction applies, and the node also starts evicting running pods, so that OpenShift reschedules them to different nodes.
For pods that do not tolerate the taint and that are running on the node, OpenShift evicts them immediately.
For pods that match the toleration, OpenShift evicts them only if the tolerationSeconds parameter is defined on the pod.
The tolerationSeconds parameter defines the time in seconds that a pod remains in a node until the node evicts it.
The tolerationSeconds parameter is useful in certain scenarios, such as when an application must create local files when running for the first time.
In this case, waiting for a few seconds for the pod to recover might be preferable to directly re-creating the pod on another node.
If you define the tolerationSeconds parameter for an application, then the node waits the specified time before evicting the pods.
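For example, the following toleration sketch keeps a pod bound to an unreachable node for 300 seconds before the node evicts the pod; Kubernetes adds similar default tolerations to pods that do not define them:

...output omitted...
spec:
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300
...output omitted...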
You can specify tolerations for a project, so OpenShift adds those tolerations to any new pods that are created in the project. You require administrative privileges to specify project tolerations. If you define both project and pod tolerations, then OpenShift creates the pod with both sets of tolerations.
To define project tolerations, you can create a YAML file for an OpenShift project CR and define the toleration in the metadata.annotations section as follows:
kind: Project
apiVersion: project.openshift.io/v1
metadata:
  name: myproject
  annotations:
    scheduler.alpha.kubernetes.io/defaultTolerations: >-
      [{"operator":"Equal","effect":"NoSchedule","key":"key1","value":"value1"}]

In the previous example, any new pods that you create in the project have the key1=value1:NoSchedule toleration by default.
You can configure a pod to tolerate all the node taints by using the operator: "Exists" toleration with no key or value parameters.
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
...output omitted...
spec:
  tolerations:
  - operator: "Exists"
A toleration with the operator: "Exists" parameter and no key matches every taint, so OpenShift does not evict pods with this toleration from a node that has taints.
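To implement the dedicated nodes scenario from the beginning of this section, you can combine a label and a taint on the node with a matching node selector and toleration on the pod. The following sketch assumes a hypothetical dedicated=backend key-value pair:

[user@host ~]$ oc label node node1 dedicated=backend
node/node1 labeled
[user@host ~]$ oc adm taint nodes node1 dedicated=backend:NoSchedule
node/node1 tainted

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
...output omitted...
spec:
  nodeSelector:
    dedicated: backend
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "backend"
    effect: "NoSchedule"
...output omitted...

The node selector steers the pod to the dedicated node, and the taint keeps pods without a matching toleration away from that node.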
For more information about node selectors, refer to the Placing Pods on Specific Nodes Using Node Selectors section in the Red Hat OpenShift Container Platform 4.14 Nodes documentation at https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/nodes/index#nodes-scheduler-node-selectors
For more information about taints and tolerations, refer to the Controlling Pod Placement Using Node Taints section in the Red Hat OpenShift Container Platform 4.14 Nodes documentation at https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/nodes/index#nodes-scheduler-taints-tolerations