Chapter 10.  Troubleshooting Kernel Issues

Abstract

Goal

Identify kernel issues and assist Red Hat Support in resolving kernel issues.

Objectives
  • Configure systems to enable kernel crash dumps.

  • Compile SystemTap scripts into kernel modules for deployment on remote systems.

Sections
  • Configuring Kernel Crash Dumps (and Guided Exercise)

  • Kernel Debugging with SystemTap (and Guided Exercise)

Lab
  • Troubleshooting Kernel Issues

Configuring Kernel Crash Dumps

Objectives

  • Configure systems to enable kernel crash dumps.

Generating Kernel Crash Dumps

When an application crashes, the Linux kernel captures its memory image in a core dump. Core dumps contain the application's memory at the moment that it stopped working. Application vendors analyze core dumps to determine why an application crashed.

Similarly, when an operating system crashes, it captures the kernel's memory image in a crash dump. Operating system vendors analyze crash dumps to determine why a system crashed.

kdump and kexec

In Red Hat Enterprise Linux, the kdump service captures kernel crash dumps. The kdump service uses the kexec system call to load a secondary Linux kernel, known as the capture kernel, into a memory area that the primary kernel reserves at boot time. When the primary kernel crashes, the system boots the capture kernel directly, without a firmware reset, so the contents of the primary kernel's memory are preserved. After booting, the capture kernel copies the primary kernel's memory image to a crash dump file.

Configuring kdump

By default, RHEL 8 installs the kexec-tools package, which provides the kdump service. The package provides command-line utilities to manage the kdump service. Alternatively, navigate to the Crash Dump tab of the web console to manage the kdump service in a graphical interface.

Memory Reservation

The capture kernel's reserved memory size depends on a system's architecture and on the total available physical memory. For x86_64 architectures, the minimum reserved memory to capture dumps is 160 MB.

On most systems, the kdump service automatically calculates the required memory. To enable this feature, add the crashkernel=auto setting in the GRUB_CMDLINE_LINUX parameter of the /etc/default/grub configuration file.

GRUB_CMDLINE_LINUX="console=tty0 crashkernel=auto no_timer_check net.ifnames=0 console=ttyS0,115200n8"

Note

The crashkernel=auto setting requires x86_64 systems to have at least 1 GB of memory installed. Size requirements for ARM and IBM Power architectures vary. For more information, consult the references at the end of this section.

If you modify the /etc/default/grub file, you must regenerate the GRUB2 configuration.

For systems that use BIOS firmware, use the following command.

[root@host ~]# grub2-mkconfig -o /boot/grub2/grub.cfg

For systems that use UEFI firmware, use the following command.

[root@host ~]# grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg

Reboot the system to implement the new amount of reserved memory.
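
After the reboot, one way to confirm that the kernel booted with the new setting is to inspect the kernel command line. The following is a minimal sketch, not part of the kexec-tools package. The function name crashkernel_setting is illustrative, and the sketch assumes that the running kernel exposes its boot parameters in the /proc/cmdline file.

```shell
# Print the value of the first crashkernel= parameter on a kernel command line.
# $1: file that holds the command line (defaults to /proc/cmdline).
# Returns non-zero when the parameter is absent.
crashkernel_setting() {
    cmdline_file=${1:-/proc/cmdline}
    # Put each parameter on its own line, then keep the crashkernel value.
    setting=$(tr -s ' \t' '\n\n' < "$cmdline_file" |
        sed -n 's/^crashkernel=//p' | head -n 1)
    [ -n "$setting" ] || return 1
    printf '%s\n' "$setting"
}
```

On a system that booted with the GRUB_CMDLINE_LINUX example above, running crashkernel_setting with no arguments would be expected to print auto. A non-zero return status suggests that the GRUB2 configuration was not regenerated or that the system has not yet rebooted.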

The kdump Service

Enable and start the kdump service to generate crash dumps.

[root@host ~]# systemctl enable kdump
Created symlink from /etc/systemd/system/multi-user.target.wants/kdump.service to /usr/lib/systemd/system/kdump.service.
[root@host ~]# systemctl start kdump
[root@host ~]# systemctl status kdump
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: active (exited) since Wed 2021-11-10 12:38:14 EST; 3s ago
  Process: 3482 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
 Main PID: 3482 (code=exited, status=0/SUCCESS)

Nov 10 12:38:13 host.lab.example.com systemd[1]: Starting Crash recovery kernel arming...
Nov 10 12:38:14 host.lab.example.com kdumpctl[3482]: kdump: kexec: loaded kdump kernel
Nov 10 12:38:14 host.lab.example.com kdumpctl[3482]: kdump: Starting kdump: [OK]
Nov 10 12:38:14 host.lab.example.com systemd[1]: Started Crash recovery kernel arming.

Modify the /etc/kdump.conf configuration file to alter the behavior and collection settings of kernel crash dumps.

Designating Crash Dump Targets

By default, the kdump service stores crash dump files in the /var/crash directory.

[root@host ~]# ls -la /var/crash
total 4
drwxr-xr-x.  3 root root   42 Feb 17 01:08 .
drwxr-xr-x. 20 root root 4096 Feb 17 01:07 ..
drwxr-xr-x.  2 root root   42 Feb 18 01:28 127.0.0.1-2021-11-09-13:11:30
[root@host ~]# ls -la /var/crash/127.0.0.1-2021-11-09-13\:11\:30/
total 117964
drwxr-xr-x. 2 root root        67 Nov  9 13:11 .
drwxr-xr-x. 3 root root        43 Nov  9 13:11 ..
-rw-r--r--. 1 root root     41567 Nov  9 13:11 kexec-dmesg.log
-rw-------. 1 root root 120707664 Nov  9 13:11 vmcore
-rw-r--r--. 1 root root     39868 Nov  9 13:11 vmcore-dmesg.txt

The vmcore file contains the crash dump. The vmcore-dmesg.txt file contains the kernel log at the time of the crash.

Large crash dump files can be difficult to send quickly to Red Hat Support. To expedite the crash dump analysis, send the smaller vmcore-dmesg.txt file first for a preliminary assessment.
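
When a system has crashed several times, the dump directory holds one subdirectory per crash. The following sketch locates the most recent one so that its vmcore-dmesg.txt file can be sent first. The function name latest_crash_dir is illustrative, and the sketch assumes GNU find (for the -printf option) and directory names without newline characters.

```shell
# Print the most recently modified crash dump directory under a dump root.
# $1: dump root (defaults to /var/crash, the default kdump path).
latest_crash_dir() {
    dump_root=${1:-/var/crash}
    # Emit "mtime path" pairs, sort by time, and keep the newest path.
    find "$dump_root" -mindepth 1 -maxdepth 1 -type d -printf '%T@ %p\n' |
        sort -n | tail -n 1 | cut -d' ' -f2-
}
```

For example, less "$(latest_crash_dir)/vmcore-dmesg.txt" would open the kernel log from the latest crash, provided that at least one dump exists.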

Modify the path option in the /etc/kdump.conf configuration file to change the crash dump directory.

path /var/crash

The kdump service offers crash dump targets other than local files. The following options are available in the /etc/kdump.conf configuration file.

Table 10.1. /etc/kdump.conf Options for Configuring Dump Target

Option                 Description
raw [partition]        Run the dd command to copy the crash dump to the specified partition.
nfs [nfs share]        Mount the NFS share and copy the crash dump to the location that the path option specifies on the share.
ssh [user@server]      Run the scp command to transfer the crash dump to the location in the path option on the remote server, with the specified user account for authentication.
sshkey [sshkeypath]    Used with the ssh crash dump type to specify the location of the SSH key to use for authentication.
[fs type] [partition]  Mount the specified partition with the specified file system type to the /mnt directory and write the crash dump to the location in the path option.
path [path]            Specifies the path to save the crash dump to on the target. If no dump target is specified, then the path is assumed to be from the root of the local file system.
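
To see which of these targets a system currently uses, the configuration file can be scanned for the first uncommented target option. The following is a rough sketch. The function name kdump_target is illustrative, and the list of file system types in the pattern (ext2, ext3, ext4, xfs, btrfs, minix) is an assumption to verify against the kdump.conf(5) man page for your release.

```shell
# Print the dump target configured in a kdump.conf-style file.
# $1: configuration file (defaults to /etc/kdump.conf).
# Prints "local" when no target option is set, in which case the path
# option is relative to the root of the local file system.
kdump_target() {
    conf_file=${1:-/etc/kdump.conf}
    target=$(awk '
        /^[[:space:]]*#/ { next }   # skip comment lines
        $1 ~ /^(raw|nfs|ssh|ext[234]|xfs|btrfs|minix)$/ { print $1, $2; exit }
    ' "$conf_file")
    printf '%s\n' "${target:-local}"
}
```

The pattern is anchored on the whole first word, so an sshkey line is not mistaken for an ssh target.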

Core Collection

By default, the makedumpfile utility generates kernel core dumps. The core_collector option in the /etc/kdump.conf configuration file modifies the collection parameters.

core_collector makedumpfile -l --message-level 1 -d 31

The -c, -l, and -p options change the compression algorithm of the core dump.

Table 10.2. makedumpfile Compression Options

Option  Compression algorithm for crash dump data
-c      zlib
-l      lzo
-p      snappy

Message levels control which types of messages the makedumpfile utility prints while it writes a crash dump. The previous example uses message level 1, which prints only a progress indicator. The following table lists some of the message levels.

Table 10.3. makedumpfile Message Levels

Message level  Description
0              Do not include any messages.
1              Include only progress indicator.
4              Include only error messages.
31             Include all messages.

Dump levels filter page types in crash dumps. Dump levels can filter out zero pages, cached pages, user data pages, and free pages. Dump level filtering decreases the size of the crash dump.

The previous example uses dump level 31, which excludes zero pages, cached pages, user data pages, and free pages. This dump level generates the smallest crash dump.

Table 10.4. makedumpfile Dump Levels

Dump level  Description
0           Include all page types.
1           Do not include zero pages.
31          Exclude zero pages, cached pages, user data pages, and free pages.
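
A dump level is a sum of per-page-type bit values, which is why level 31 excludes every filterable type. The following sketch decodes a level into the page types that it excludes. The bit assignments (1, 2, 4, 8, 16) are taken from the makedumpfile(8) man page, which splits cached pages into two bits, and should be checked against your installed version. The function name decode_dump_level is illustrative.

```shell
# Decode a makedumpfile dump level into the page types that it excludes.
# $1: dump level, an integer from 0 to 31.
decode_dump_level() {
    level=$1
    # Reject anything that is not a whole number in the valid range.
    case $level in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$level" -le 31 ] || return 2
    excluded=""
    for entry in 1:zero 2:cache 4:cache-private 8:user-data 16:free; do
        bit=${entry%%:*}
        name=${entry#*:}
        # Test whether this bit is set in the requested level.
        if [ $((level & bit)) -ne 0 ]; then
            excluded="${excluded:+$excluded }$name"
        fi
    done
    printf '%s\n' "${excluded:-none}"
}
```

For example, decode_dump_level 17 prints "zero free", because 17 is the sum of 1 (zero pages) and 16 (free pages).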

For SSH dump targets, specify the scp utility in place of makedumpfile.

core_collector scp

Note

The full list of message and dump levels is in the makedumpfile(8) man page.

Using kdumpctl

The kdumpctl command, from the kexec-tools package, performs common kdump administration tasks.

[root@host ~]# kdumpctl -h
kdump: Usage: /bin/kdumpctl {start|stop|status|restart|reload|rebuild|propagate|showmem}

The kdumpctl status command verifies the status of the kdump service.

[root@host ~]# kdumpctl status
kdump: Kdump is operational

The kdumpctl showmem command displays the current reserved memory for the capture kernel.

[root@host ~]# kdumpctl showmem
kdump: Reserved 192MB memory for crash kernel
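
The same figure can be read directly from sysfs, which is useful in scripts or on hosts where the kdumpctl utility is unavailable. The following sketch assumes that the kernel exposes the reservation, in bytes, in the /sys/kernel/kexec_crash_size file. The function name crash_reserved_mb is illustrative.

```shell
# Print the memory reserved for the capture kernel, in MB.
# $1: file that holds the size in bytes
#     (defaults to /sys/kernel/kexec_crash_size).
crash_reserved_mb() {
    size_file=${1:-/sys/kernel/kexec_crash_size}
    size_bytes=$(cat "$size_file") || return 1
    echo $((size_bytes / 1024 / 1024))
}
```

A result of 0 indicates that no memory is reserved, typically because the kernel booted without a crashkernel parameter.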

The kdumpctl propagate command simplifies the setup of SSH crash dump targets. It determines from the sshkey parameter in the /etc/kdump.conf file which SSH key to use. If the key does not exist, then the kdumpctl utility automatically creates it. The ssh-copy-id command is then automatically invoked to copy the key to the target SSH server.

[root@host ~]# kdumpctl propagate
WARNING: '/root/.ssh/kdump_id_rsa' doesn't exist, using default value '/root/.ssh/kdump_id_rsa'
Generating new ssh keys... done.
The authenticity of host 'server.lab.example.com (172.25.250.11)' can't be established.
ECDSA key fingerprint is 62:88:d6:2a:57:b1:3b:cd:9e:3c:52:e6:e3:94:f9:59.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@server.lab.example.com's password: redhat

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'root@server.lab.example.com'"
and check to make sure that only the key(s) you wanted were added.

/root/.ssh/kdump_id_rsa has been added to ~root/.ssh/authorized_keys on server.lab.example.com

Note

The Kdump Helper lab generates a script to automatically configure kdump targets. The lab is available on the Red Hat Customer Portal at https://access.redhat.com/labs/kdumphelper/.

Kernel Crash Dump Triggers

Systems generate crash dumps when their kernel encounters an unrecoverable error. By default, any other type of error does not generate a crash dump. Administrators can enable crash dumps to troubleshoot specific errors.

OOM Events

When a system runs out of memory, the kernel's OOM killer terminates processes to free memory and to keep the system operational. In many use cases, the OOM killer is preferable to a panic, because it tries to kill the process that causes the memory shortage while sparing other critical or long-running processes.

However, configuring a system to panic instead is appropriate when the stability of system memory is in doubt, and the recommended recovery method is to halt immediately, to limit possible data corruption in memory, and then reboot the system cleanly.

The following command temporarily configures a system to panic on OOM-killer events:

[root@host ~]# echo 1 > /proc/sys/vm/panic_on_oom

To make the configuration permanent, use the following commands:

[root@host ~]# echo "vm.panic_on_oom=1" >> /etc/sysctl.conf
[root@host ~]# sysctl -p
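
Appending with >> adds a duplicate line each time the command is repeated, and it leaves any earlier, conflicting value in the file. The following sketch is one possible way to make the change repeatable. The function name set_sysctl_persistent is illustrative, and the sketch assumes simple key=value lines as used in this section.

```shell
# Set KEY=VALUE in a sysctl configuration file without duplicating the key.
# $1: key, $2: value, $3: file (defaults to /etc/sysctl.conf).
set_sysctl_persistent() {
    sysctl_key=$1
    sysctl_value=$2
    sysctl_file=${3:-/etc/sysctl.conf}
    scratch=$(mktemp) || return 1
    # Copy every line except existing assignments of the key.
    if [ -f "$sysctl_file" ]; then
        awk -v k="$sysctl_key" '
            { line = $0; sub(/^[ \t]*/, "", line); split(line, parts, /[ \t]*=/) }
            parts[1] != k { print }
        ' "$sysctl_file" > "$scratch"
    fi
    printf '%s=%s\n' "$sysctl_key" "$sysctl_value" >> "$scratch"
    # Write through the original file so its ownership and mode are kept.
    cat "$scratch" > "$sysctl_file" && rm -f "$scratch"
}
```

For example, set_sysctl_persistent vm.panic_on_oom 1 followed by sysctl -p has the same effect as the commands above, and the helper can be run repeatedly without growing the file.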

Hung Processes

Applications sometimes experience bugs and their processes appear to hang.

The following command temporarily configures a system to panic when processes hang for longer than a specific timeout value:

[root@host ~]# echo 1 > /proc/sys/kernel/hung_task_panic

To make the configuration permanent, use the following commands:

[root@host ~]# echo "kernel.hung_task_panic=1" >> /etc/sysctl.conf
[root@host ~]# sysctl -p

The default timeout value is 120 seconds. The timeout value is configured in the /proc/sys/kernel/hung_task_timeout_secs file.

[root@host ~]# cat /proc/sys/kernel/hung_task_timeout_secs
120

Soft Lockup

Soft lockups occur when a task is executing in kernel space on a CPU without rescheduling.

The following command temporarily configures a system to panic when soft lockups occur:

[root@host ~]# echo 1 > /proc/sys/kernel/softlockup_panic

To make the configuration permanent, use the following commands:

[root@host ~]# echo "kernel.softlockup_panic=1" >> /etc/sysctl.conf
[root@host ~]# sysctl -p

Note

Do not enable the softlockup_panic or nmi_watchdog kernel parameters on a virtualized RHEL 8 machine. The virtualized environment might trigger spurious soft lockups that do not warrant a system panic.

Non-Maskable Interrupts

A non-maskable interrupt (NMI) usually occurs when a system detects a critical hardware error. The NMI Watchdog program, if it is enabled, generates NMIs automatically. Administrators can also generate an NMI manually by pressing the physical NMI button on system hardware, or a virtual NMI button in the system's out-of-band management interface, such as HP's iLO or Dell's iDRAC.

The following command temporarily configures a system to panic when NMIs are detected:

[root@host ~]# echo 1 > /proc/sys/kernel/panic_on_io_nmi

To make the configuration permanent, use the following commands:

[root@host ~]# echo "kernel.panic_on_io_nmi=1" >> /etc/sysctl.conf
[root@host ~]# sysctl -p
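
Before relying on any of these triggers, it can help to review how they are currently set. The following sketch prints the four panic settings described so far in sysctl key form. The function name show_panic_triggers is illustrative, and the optional directory argument exists only so that the function can be exercised against a copy of the /proc/sys tree.

```shell
# Print the current value of each panic trigger covered so far.
# $1: sysctl root directory (defaults to /proc/sys).
show_panic_triggers() {
    sysctl_root=${1:-/proc/sys}
    for entry in vm/panic_on_oom kernel/hung_task_panic \
                 kernel/softlockup_panic kernel/panic_on_io_nmi; do
        if [ -r "$sysctl_root/$entry" ]; then
            current=$(cat "$sysctl_root/$entry")
        else
            current="unavailable"
        fi
        # Convert the path form (vm/panic_on_oom) to the sysctl key form.
        printf '%s=%s\n' "$(printf '%s' "$entry" | tr '/' '.')" "$current"
    done
}
```

A value of 0 means that the corresponding event does not cause a panic, and therefore does not produce a crash dump.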

Magic SysRq

The "Magic" SysRq key is a key sequence to diagnose an unresponsive system. The following command temporarily enables the SysRq key:

[root@host ~]# echo 1 > /proc/sys/kernel/sysrq

To make the configuration permanent, use the following commands:

[root@host ~]# echo "kernel.sysrq=1" >> /etc/sysctl.conf
[root@host ~]# sysctl -p

When enabled, certain SysRq commands trigger system events. Use the Alt+PrintScreen+[CommandKey] key sequence to enter SysRq commands. The following table summarizes the SysRq commands and their associated events.

Table 10.5. SysRq Commands and Associated Events

SysRq command  Event
m              Dump information about memory allocation.
t              Dump thread state information.
p              Dump CPU registers and flags.
c              Crash the system.
s              Sync mounted file systems.
u              Remount file systems read-only.
b              Initiate system reboot.
o              Power off the system.
f              Start OOM killer.
w              Dump hung processes.

Alternatively, issue SysRq commands by writing their associated key characters to the /proc/sysrq-trigger file. For example, the following command initiates a system crash.

[root@host ~]# echo 'c' > /proc/sysrq-trigger

The c character is often used to test system crash dumps.
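
Triggering a test crash on a system where the capture kernel is not loaded panics the system without producing a dump. The following sketch adds a guard first. It assumes that the kernel reports the capture kernel state in the /sys/kernel/kexec_crash_loaded file (1 when loaded). The function name trigger_test_crash is illustrative, and the file arguments exist so that the logic can be tried safely against ordinary files.

```shell
# Trigger a test crash only when a capture kernel is loaded.
# $1: kexec state file (defaults to /sys/kernel/kexec_crash_loaded).
# $2: SysRq trigger file (defaults to /proc/sysrq-trigger).
trigger_test_crash() {
    loaded_file=${1:-/sys/kernel/kexec_crash_loaded}
    trigger_file=${2:-/proc/sysrq-trigger}
    if [ "$(cat "$loaded_file" 2>/dev/null)" != "1" ]; then
        echo "capture kernel not loaded; refusing to crash the system" >&2
        return 1
    fi
    # Writing c to the real trigger file crashes the system immediately.
    echo c > "$trigger_file"
}
```

Defining the function has no side effects. Calling it with no arguments as root on a correctly configured system crashes the system, so use it only on a machine where downtime is acceptable.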

The early kdump Feature

Early kdump is a feature of the kdump mechanism that captures the core dump of a booting kernel. In versions earlier than Red Hat Enterprise Linux 8, the kdump service starts late in the boot sequence, typically alongside system services such as sshd. This delayed start prevents kdump from capturing core dumps if a system crashes before system services start. In Red Hat Enterprise Linux 8, early kdump starts the kdump service earlier in the boot sequence, and creates core dumps even if the system crashes before system services start.

The following steps enable the early kdump feature.

  1. Ensure that a kdump initramfs image exists for the current kernel.

    [root@host ~]# ls /boot/initramfs-`uname -r`kdump.img
    /boot/initramfs-4.18.0-305.el8.x86_64kdump.img
  2. Rebuild the initramfs of the booting kernel with early kdump support.

    [root@host ~]# dracut -f --add earlykdump
  3. Append the rd.earlykdump kernel boot parameter to the kernelopts line in grub.

    [root@host ~]# grub2-editenv - set "kernelopts=root=/dev/mapper/rhel-root ro crashkernel=auto resume=/dev/mapper/rhel-swap rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap console=tty0 console=ttyS0,115200 rd.earlykdump"
  4. To implement the changes, reboot the system.

    [root@host ~]# reboot
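
Retyping the whole kernelopts value in step 3 is error-prone, because a mistake in any existing option can leave the system unable to boot. The following sketch builds the new value from the current one instead. The function name append_kernel_param is illustrative, and the commented usage assumes that the grub2-editenv - list command prints the current kernelopts line, as on RHEL 8 systems that store kernel options in the GRUB environment.

```shell
# Append a parameter to a kernel option string unless it is already present.
# $1: current option string, $2: parameter to add.
append_kernel_param() {
    current_opts=$1
    new_param=$2
    # Pad with spaces so that only whole-word matches count as present.
    case " $current_opts " in
        *" $new_param "*) printf '%s\n' "$current_opts" ;;
        *) printf '%s\n' "${current_opts:+$current_opts }$new_param" ;;
    esac
}

# Possible usage as root (not executed here):
#   current=$(grub2-editenv - list | sed -n 's/^kernelopts=//p')
#   grub2-editenv - set "kernelopts=$(append_kernel_param "$current" rd.earlykdump)"
```

Because the function leaves the string unchanged when the parameter is already present, repeating the step does not add rd.earlykdump twice.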

Analyzing Kernel Crash Dumps

Analyzing a crash dump is complex and requires knowledge of the Linux kernel. Specific tools help administrators to analyze high-level crash dump information.

Preparing a System for Crash Dump Analysis

To analyze a crash dump, install the following packages:

  • The kernel-debuginfo package that matches the version of the kernel where the dump was created. To find the kernel version, look in the vmcore-dmesg.txt file that is stored alongside the kernel crash dump, or run the strings command against the vmcore file.

    [root@host ~]# strings vmcore | head
    KDUMP
    Linux
    host.example.com
    4.18.0-305.el8.x86_64
    #1 SMP Thu Apr 29 08:54:30 EDT 2021
    ...output omitted...
  • The crash package.
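
The kernel version can also be extracted from the vmcore-dmesg.txt file in a script. The following sketch assumes that the log contains the usual "Linux version" boot banner. The function name dump_kernel_release is illustrative.

```shell
# Print the kernel release recorded in a vmcore-dmesg.txt file.
# $1: path to the vmcore-dmesg.txt file.
dump_kernel_release() {
    # The boot banner has the form "Linux version <release> (...)".
    sed -n 's/.*Linux version \([^ ]*\).*/\1/p' "$1" | head -n 1
}
```

The matching package is then named kernel-debuginfo followed by a hyphen and the printed release, for example kernel-debuginfo-4.18.0-305.el8.x86_64 for the dump shown above.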

Analyzing Crash Dumps

The crash command takes two parameters: the debug version of the kernel image (from the kernel-debuginfo package), and the kernel crash dump vmcore file. If the vmcore file is omitted, then the crash session runs against the currently running kernel.

[root@host ~]# crash /usr/lib/debug/lib/modules/4.18.0-305.el8.x86_64/vmlinux \
/var/crash/127.0.0.1-2021-11-09-13:11:30/vmcore

The crash prompt offers various useful commands.

  • files <PID>: Shows the open files for the specified process.

  • ps: Lists every process that was running at the time of the crash.

  • fuser <PATHNAME>: Displays which processes were using a certain file or directory.

The help command displays command usage information. The exit command quits the crash prompt.

References

Installing and Configuring kdump

kdump(8), kexec(8), grub-mkconfig(1), kdump.conf(5), and makedumpfile(8) manual pages

Revision: rh342-8.4-6dd89bd