Monitor Process Activity

Objectives

  • Define load average and determine resource-intensive server processes.

Describe Load Average

Load average is a measurement that the Linux kernel provides to represent the perceived system load over a period of time. It can be used as a rough gauge of how many system resource requests are pending, and to determine whether system load is increasing or decreasing.

The kernel collects the current load number every five seconds based on the number of processes in runnable and uninterruptible states. This number is accumulated and reported as an exponential moving average over the most recent 1, 5, and 15 minutes.
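
The sampling and averaging described above can be sketched as follows. This is a minimal illustration, not the kernel's code: it assumes floating-point arithmetic and a decay factor of e^(-5/window) per five-second sample, whereas the kernel uses fixed-point math. The function name simulate_load and its parameters are hypothetical.

```shell
# simulate_load ACTIVE SECONDS WINDOW [START]
# Apply one exponential-moving-average update per 5-second sample.
#   ACTIVE  = tasks in runnable or uninterruptible state at every sample (held constant here)
#   SECONDS = how long to simulate
#   WINDOW  = averaging window in seconds (60, 300, or 900)
#   START   = initial load average (default 0)
simulate_load() {
    awk -v n="$1" -v secs="$2" -v window="$3" -v load="${4:-0}" 'BEGIN {
        decay = exp(-5 / window)        # weight kept from the previous average
        for (t = 5; t <= secs; t += 5)
            load = load * decay + n * (1 - decay)
        printf "%.2f\n", load
    }'
}

simulate_load 1 60 60     # one busy task for one minute, 1-minute window
simulate_load 1 600 60    # the same task after ten minutes
```

Under these assumptions, a single task that stays runnable raises the 1-minute figure from 0 to about 0.63 after one minute and to nearly 1.00 after ten minutes, so the reported value lags behind sudden changes in load.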

Load Average Calculation

Linux determines the load average by counting how many processes are ready to run on a CPU and how many processes are waiting for disk or network I/O to complete.

  • The load number is a running average of the number of processes that are ready to run (in process state R) or are waiting for I/O to complete (in process state D).

  • Some UNIX systems consider only CPU usage or run queue length to indicate system load. Linux also includes disk or network usage, because high usage of these resources can impact system performance as significantly as CPU load. For high load averages with minimal CPU activity, examine disk and network activity.

Load average is a rough measurement of how many processes are currently waiting for a request to complete before they do anything else. The request might be for CPU time to run the process. Alternatively, the request might be for a critical disk I/O operation to complete, and the process cannot be run on the CPU until the request completes, even though the CPU is idle. Either way, system load is impacted, and the system appears to run more slowly because processes are waiting to run.
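
To see what currently contributes to the load number, you can count the processes in the R and D states. The following is a rough sketch that assumes the procps ps command; the function name count_load_states is hypothetical. The result only approximates what the kernel counts, because the kernel counts individual threads, and the ps process itself shows up as running.

```shell
# count_load_states: read one-letter process states on stdin (one per line)
# and report how many are runnable (R) and how many are in uninterruptible sleep (D).
count_load_states() {
    awk '
        $1 == "R" { r++ }
        $1 == "D" { d++ }
        END { printf "R=%d D=%d\n", r, d }
    '
}

# On a live system: "state=" prints the state column without a header line.
if command -v ps >/dev/null 2>&1; then
    ps -eo state= | count_load_states
    # List the processes in uninterruptible sleep, which add load without using CPU.
    ps -eo state,pid,user,comm | awk 'NR == 1 || $1 == "D"'
fi
```

A high D count together with a low R count suggests that the load comes from I/O waits rather than from CPU demand.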

Interpret Load Average Values

The uptime command is one way to display the current load average. It prints the current time, how long the machine has been up, how many user sessions are running, and the current load average.

[user@host ~]$ uptime
 15:29:03 up 14 min,  2 users,  load average: 2.92, 4.48, 5.20

The three values for the load average represent the load over the last 1, 5, and 15 minutes. Comparing them indicates whether the system load appears to be increasing or decreasing.
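
For scripting, the three values can be pulled out of the uptime output and compared. This is a minimal sketch that assumes an English locale with a period as the decimal separator; the function names load_values and load_trend are hypothetical, and comparing only the 1-minute and 15-minute values is one simple way to judge the trend.

```shell
# load_values: extract the three load averages from one line of uptime output (stdin).
load_values() {
    sed -n 's/.*load average: *//p' | tr -d ','
}

# load_trend ONE FIVE FIFTEEN: compare the 1-minute and 15-minute values.
load_trend() {
    awk -v one="$1" -v fifteen="$3" 'BEGIN {
        if (one > fifteen)      print "increasing"
        else if (one < fifteen) print "decreasing"
        else                    print "steady"
    }'
}

line=' 15:29:03 up 14 min,  2 users,  load average: 2.92, 4.48, 5.20'
values=$(printf '%s\n' "$line" | load_values)
echo "$values"
load_trend $values      # unquoted on purpose: split into three arguments
```

For the sample line, the sketch prints the three values followed by "decreasing", because the 1-minute value is lower than the 15-minute value.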

If the main contribution to load average is from processes that are waiting for the CPU, then you can calculate the approximate load value per CPU to determine whether the system is experiencing significant waiting.

Use the lscpu command to determine the number of CPUs on a system.

In the following example, the system is a dual-core, single-socket system with two hyperthreads per core. Linux treats this CPU configuration as a four-CPU system for scheduling purposes.

[user@host ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
...output omitted...
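
The logical CPU count can also be derived from the topology lines in the lscpu output. The following sketch assumes English-language lscpu output with the three field labels that are shown above; the function name logical_cpus is hypothetical. It allows leading spaces because some lscpu versions indent these fields.

```shell
# logical_cpus: read lscpu output on stdin and multiply
# threads per core x cores per socket x sockets.
logical_cpus() {
    awk -F: '
        /^ *Thread\(s\) per core:/ { threads = $2 + 0 }
        /^ *Core\(s\) per socket:/ { cores   = $2 + 0 }
        /^ *Socket\(s\):/          { sockets = $2 + 0 }
        END { print threads * cores * sockets }
    '
}

# On a live system, the result normally matches the CPU(s) line.
if command -v lscpu >/dev/null 2>&1; then
    lscpu | logical_cpus
fi
```

For the example system, 2 threads x 2 cores x 1 socket gives 4 logical CPUs. Systems with offline CPUs or mixed core types might not follow this simple product.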

Imagine that the only contribution to the load number is from processes that need CPU time. Then you can divide the displayed load average values by the number of logical CPUs in the system. A value below 1 indicates adequate resource use and minimal wait times. A value above 1 indicates resource saturation and some processing delay.

# From lscpu, the system has four logical CPUs, so divide by 4:
#                               load average: 2.92, 4.48, 5.20
#           divide by number of logical CPUs:    4     4     4
#                                             ----  ----  ----
#                       per-CPU load average: 0.73  1.12  1.30
#
# This system's load average appears to be decreasing.
# With a load average of 2.92 on four CPUs, all CPUs were in use ~73% of the time.
# During the last 5 minutes, the system was overloaded by ~12%.
# During the last 15 minutes, the system was overloaded by ~30%.
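
The same division can be scripted. This is a minimal sketch: the function name per_cpu_load is hypothetical, and the live part assumes a Linux system that provides /proc/loadavg and the nproc command. Note that nproc reports the CPUs available to the current process, which can be fewer than the CPUs that are online.

```shell
# per_cpu_load LOAD CPUS: divide one load average value by the number of logical CPUs.
per_cpu_load() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# The values from the example above, on four logical CPUs:
for value in 2.92 4.48 5.20; do
    per_cpu_load "$value" 4
done

# On a live Linux system, read the current values instead.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one five fifteen _ < /proc/loadavg
    for value in "$one" "$five" "$fifteen"; do
        per_cpu_load "$value" "$(nproc)"
    done
fi
```

The first loop reproduces the 0.73, 1.12, and 1.30 values from the worked example.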

An idle CPU queue has a load number of 0. Each process that runs on or waits for a CPU adds a count of 1 to the load number. If one process is running on a CPU, then the load number is 1, and the resource (the CPU) is in use, but no requests are waiting. If that process keeps running, then its contribution to the one-minute load average approaches 1.

However, processes that are uninterruptibly sleeping for critical I/O due to a busy disk or network resource are also included in the count and increase the load average. Although they do not indicate CPU use, these processes are added to the queue count, because they wait for resources and cannot run on a CPU until they get the resources. These processes still count as system load, because resource limitations prevent them from running.

Until resources are saturated, the per-CPU load average remains below 1, because tasks are seldom found waiting in the queue. Load average increases only when resource saturation causes requests to remain queued, and the load calculation routine counts them. When resource use approaches 100%, each extra request starts experiencing service wait time.

Real-time Process Monitoring

The top command displays a dynamic view of the system's processes and a summary header followed by a process or thread list. Unlike the static ps command output, the top command continuously refreshes at a configurable interval and provides column reordering, sorting, and highlighting. You can make persistent changes to the top settings. The default top output columns are as follows:

  • The process ID (PID).

  • The process owner username (USER).

  • Virtual memory (VIRT) is all the memory that the process uses, including the resident set, shared libraries, and any mapped or swapped memory pages (labeled VSZ in the ps command).

  • Resident memory (RES) is the physical memory that the process uses, including any resident, shared objects (labeled RSS in the ps command).

  • Process state (S) can be one of the following states:

    • D = Uninterruptible Sleeping

    • R = Running or Runnable

    • S = Sleeping

    • T = Stopped or Traced

    • Z = Zombie

  • CPU time (TIME) is the total processing time since the process started. It can be toggled to include a cumulative time of all previous children.

  • The process command name (COMMAND).
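
For scripts, top can also print a single non-interactive snapshot with its -b (batch mode) and -n (number of iterations) options. The following sketch filters such a snapshot by the S column. The function name top_rows_in_state is hypothetical, and the sketch assumes that the process list starts with a header row whose first field is PID and that the state column is labeled S.

```shell
# top_rows_in_state STATE: read `top -b -n 1` output on stdin and print the
# header row plus every process row whose S column equals STATE (for example D or R).
top_rows_in_state() {
    awk -v want="$1" '
        col == 0 && $1 == "PID" {       # header row of the process list
            for (i = 1; i <= NF; i++)
                if ($i == "S") col = i  # remember where the state column is
            print
            next
        }
        col > 0 && $col == want { print }
    '
}

# Usage on a live system (requires top):
#   top -b -n 1 | top_rows_in_state D
```

Locating the S column from the header row, rather than hard-coding a field number, keeps the filter working if the displayed columns were changed and saved in the top configuration.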

Table 8.3. Fundamental Keystrokes in top Command

Key          Purpose
? or h       Help for interactive keystrokes.
l, t, m      Toggles for the load, task and CPU state, and memory header lines.
1            Toggle for individual CPUs or a summary for all CPUs in the header.
s            Change the refresh (screen) rate, in decimal seconds (such as 0.5, 1, 5).
b            Toggle reverse highlighting for Running processes; the default is bold only.
Shift+b      Enables bold use in display, in the header, and for Running processes.
Shift+h      Toggle threads; show process summary or individual threads.
u, Shift+u   Filter for any username (effective, real).
Shift+m      Sort process listing by memory usage, in descending order.
Shift+p      Sort process listing by processor use, in descending order.
k            Kill a process. When prompted, enter PID, and then signal.
r            Renice a process. When prompted, enter PID, and then nice_value.
Shift+w      Write (save) the current display configuration for use at the next top restart.
q            Quit.
f            Manage the columns by enabling or disabling fields. You can also set the sort field for top.

Note

The s, k, and r keystrokes are not available when the top command is started in secure mode.

References

ps(1), top(1), uptime(1), and w(1) man pages

Revision: rh124-9.3-770cc61