Red Hat Enterprise Linux Diagnostics and Troubleshooting
Abstract
| Goal | Identify kernel issues and assist Red Hat Support in resolving kernel issues. |
When an application crashes, the Linux kernel captures its memory image in a core dump. Core dumps contain the application's memory at the moment that it stopped working. Application vendors analyze core dumps to determine why an application crashed.
Similarly, when an operating system crashes, it captures the kernel's memory image in a crash dump. Operating system vendors analyze crash dumps to determine why a system crashed.
In Red Hat Enterprise Linux, the kdump service captures kernel crash dumps. The kdump service uses the kexec system call to boot a secondary Linux kernel. The secondary kernel is also known as the capture kernel. Without restarting the system, the capture kernel boots from a reserved memory area in the primary kernel. After booting, the capture kernel copies the primary kernel's memory image to a crash dump file.
By default, RHEL 8 installs the kexec-tools package, which provides the kdump service and the command-line utilities to manage it. Alternatively, use the web console to manage the kdump service in a graphical interface.
The capture kernel's reserved memory size depends on a system's architecture and on the total available physical memory. For x86_64 architectures, the minimum reserved memory to capture dumps is 160 MB.
On most systems, the kdump service automatically calculates the required memory. To enable this feature, add the crashkernel=auto setting in the GRUB_CMDLINE_LINUX parameter of the /etc/default/grub configuration file.
GRUB_CMDLINE_LINUX="console=tty0 crashkernel=auto no_timer_check net.ifnames=0 console=ttyS0,115200n8"
Note
The crashkernel=auto setting requires x86_64 systems to have at least 1 GB of memory installed. Size requirements for ARM and IBM Power architectures vary. For more information, consult the references at the end of this section.
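After a reboot, one way to confirm which crashkernel value is active is to inspect the kernel command line. The following sketch defines crashkernel_setting, a hypothetical helper name that is not part of kexec-tools. It extracts the crashkernel value from a command-line string and assumes that, if the parameter is repeated, the last occurrence is the one of interest.

```shell
# Hypothetical helper (not part of kexec-tools): print the value of the
# crashkernel= parameter found in a kernel command-line string.
# Prints nothing if the parameter is absent.
crashkernel_setting() {
    # Put each parameter on its own line, keep only the text that follows
    # "crashkernel=", and report the last occurrence (an assumption).
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^crashkernel=//p' | tail -n 1
}

# Example usage on a running system:
#   crashkernel_setting "$(cat /proc/cmdline)"
```

An empty result suggests that no memory is reserved through the command line, in which case the kdump service is unlikely to start.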
If you modify the /etc/default/grub file, you must regenerate the GRUB2 configuration.
For systems that use BIOS firmware, use the following command.
[root@host ~]# grub2-mkconfig -o /boot/grub2/grub.cfg
For systems that use UEFI firmware, use the following command.
[root@host ~]# grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
Reboot the system to implement the new amount of reserved memory.
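The choice between the two grub.cfg paths can be scripted. The sketch below assumes the common convention that the /sys/firmware/efi directory exists only when the system booted through UEFI. The grub_cfg_path function name and its optional alternate-root argument are hypothetical and exist only for illustration and testing.

```shell
# Hypothetical helper: print the GRUB2 configuration path that matches
# the firmware type. The optional argument is an alternate root
# directory, which is only useful for testing.
grub_cfg_path() (
    root="${1:-}"
    # Assumption: /sys/firmware/efi is present only on UEFI boots.
    if [ -d "$root/sys/firmware/efi" ]; then
        echo /boot/efi/EFI/redhat/grub.cfg
    else
        echo /boot/grub2/grub.cfg
    fi
)

# Example usage (as root):
#   grub2-mkconfig -o "$(grub_cfg_path)"
```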
Enable and start the kdump service to generate crash dumps.
[root@host ~]# systemctl enable kdump
Created symlink from /etc/systemd/system/multi-user.target.wants/kdump.service to /usr/lib/systemd/system/kdump.service.
[root@host ~]# systemctl start kdump
[root@host ~]# systemctl status kdump
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: active (exited) since Wed 2021-11-10 12:38:14 EST; 3s ago
  Process: 3482 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
 Main PID: 3482 (code=exited, status=0/SUCCESS)
Nov 10 12:38:13 host.lab.example.com systemd[1]: Starting Crash recovery kernel arming...
Nov 10 12:38:14 host.lab.example.com kdumpctl[3482]: kdump: kexec: loaded kdump kernel
Nov 10 12:38:14 host.lab.example.com kdumpctl[3482]: kdump: Starting kdump: [OK]
Nov 10 12:38:14 host.lab.example.com systemd[1]: Started Crash recovery kernel arming.
Modify the /etc/kdump.conf configuration file to alter the behavior and collection settings of kernel crash dumps.
By default, the kdump service stores crash dump files in the /var/crash directory.
[root@host ~]# ls -la /var/crash
total 4
drwxr-xr-x.  3 root root   42 Feb 17 01:08 .
drwxr-xr-x. 20 root root 4096 Feb 17 01:07 ..
drwxr-xr-x.  2 root root   42 Feb 18 01:28 127.0.0.1-2021-11-09-13:11:30
[root@host ~]# ls -la /var/crash/127.0.0.1-2021-11-09-13\:11\:30/
total 117964
drwxr-xr-x. 2 root root        67 Nov  9 13:11 .
drwxr-xr-x. 3 root root        43 Nov  9 13:11 ..
-rw-r--r--. 1 root root     41567 Nov  9 13:11 kexec-dmesg.log
-rw-------. 1 root root 120707664 Nov  9 13:11 vmcore
-rw-r--r--. 1 root root     39868 Nov  9 13:11 vmcore-dmesg.txt
The vmcore file contains the crash dump. The vmcore-dmesg.txt file contains the kernel log at the time of the crash.
Large crash dump files can be difficult to send quickly to Red Hat Support. To expedite the crash dump analysis, send the smaller vmcore-dmesg.txt file first for a preliminary assessment.
Modify the path option in the /etc/kdump.conf configuration file to change the crash dump directory.
path /var/crash
The kdump service offers crash dump targets other than local files. The following options are available in the /etc/kdump.conf configuration file.
Table 10.1. /etc/kdump.conf Options for Configuring Dump Target
| Option | Description |
|---|---|
| raw | Run the dd command to copy the crash dump to the specified partition. |
| nfs | Mount the NFS share and copy the crash dump to the location that the path option specifies on the share. |
| ssh | Run the scp command to transfer the crash dump to the location that the path option specifies on the remote server, with the specified user account for authentication. |
| sshkey | Used with the ssh crash dump type to specify the location of the SSH key to use for authentication. |
| file system type (for example, ext4 or xfs) | Mount the specified partition with the specified file system type to the /mnt directory and write the crash dump to the location that the path option specifies. |
| path | Specifies the path to save the crash dump to on the target. If no dump target is specified, then the path is assumed to be from the root of the local file system. |
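When auditing a system, it can help to read the effective value of one of these options without opening the file. The following sketch assumes the simple "option value" line format shown above. The kdump_conf_get function name is hypothetical, and reporting the last uncommented occurrence is an assumption about how duplicates should be treated, not documented kdump behavior.

```shell
# Hypothetical helper: print the value of an option in a kdump.conf-style
# file. Comment lines are skipped. If the option appears more than once,
# the last occurrence is reported (an assumption).
kdump_conf_get() (
    file="$1"
    option="$2"
    awk -v opt="$option" '
        /^[[:space:]]*#/ { next }        # skip comment lines
        $1 == opt {
            $1 = ""                      # drop the option name
            sub(/^[[:space:]]+/, "")     # trim the leading separator
            value = $0
        }
        END { if (value != "") print value }
    ' "$file"
)

# Example usage:
#   kdump_conf_get /etc/kdump.conf path
#   kdump_conf_get /etc/kdump.conf core_collector
```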
By default, the makedumpfile utility generates kernel core dumps. The core_collector option in the /etc/kdump.conf configuration file modifies the collection parameters.
core_collector makedumpfile -l --message-level 1 -d 31
The -c, -l, and -p options change the compression algorithm of the core dump.
Table 10.2. makedumpfile Compression Options
| Option | Compression algorithm for crash dump data |
|---|---|
| -c | zlib |
| -l | lzo |
| -p | snappy |
Message levels control which types of messages the makedumpfile utility prints while it creates a crash dump. The previous example uses message level 1, which prints only a progress indicator. The following table lists some of the message levels.
Table 10.3. makedumpfile Message Levels
| Message level | Description |
|---|---|
| 0 | Do not include any messages. |
| 1 | Include only progress indicator. |
| 4 | Include only error messages. |
| 31 | Include all messages. |
Dump levels filter page types in crash dumps. Dump levels can filter out zero pages, cached pages, user data pages, and free pages. Dump level filtering decreases the size of the crash dump.
The previous example uses dump level 31, which excludes zero pages, cached pages, user data pages, and free pages. This dump level generates the smallest crash dump.
Table 10.4. makedumpfile Dump Levels
| Dump level | Description |
|---|---|
| 0 | Include all page types. |
| 1 | Do not include zero pages. |
| 31 | Exclude zero pages, cached pages, user data pages, and free pages. |
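The dump level is a sum of per-page-type values, which explains why level 31 excludes every filterable page type. The sketch below decodes a level into the page types that it excludes. The bit values (1, 2, 4, 8, 16) are taken from the makedumpfile(8) man page as understood here, and the decode_dump_level function name and the short page-type labels are hypothetical.

```shell
# Hypothetical helper: list the page types that a makedumpfile dump
# level excludes. Bit values follow the makedumpfile(8) man page.
decode_dump_level() (
    level="$1"
    excluded=""
    for entry in "1:zero" "2:cache" "4:private-cache" "8:user-data" "16:free"; do
        bit="${entry%%:*}"
        name="${entry#*:}"
        # A set bit means that this page type is left out of the dump.
        if [ $(( level & bit )) -ne 0 ]; then
            excluded="${excluded:+$excluded }$name"
        fi
    done
    echo "${excluded:-none}"
)

# Example usage:
#   decode_dump_level 31
#   decode_dump_level 17
```

Intermediate values trade dump size against detail. For example, level 17 drops only zero pages and free pages, so cached and user data pages remain available for analysis.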
For SSH dump targets, specify the scp utility in place of makedumpfile.
core_collector scp
Note
The full list of message and dump levels is in the makedumpfile(8) man page.
The kdumpctl command, from the kexec-tools package, performs common kdump administration tasks.
[root@host ~]# kdumpctl -h
kdump: Usage: /bin/kdumpctl {start|stop|status|restart|reload|rebuild|propagate|showmem}
The kdumpctl status command verifies the status of the kdump service.
[root@host ~]# kdumpctl status
kdump: Kdump is operational
The kdumpctl showmem command displays the current reserved memory for the capture kernel.
[root@host ~]# kdumpctl showmem
kdump: Reserved 192MB memory for crash kernel
The kdumpctl propagate command simplifies the setup of SSH crash dump targets. It determines from the sshkey parameter in the /etc/kdump.conf file which SSH key to use. If the key does not exist, then the kdumpctl utility automatically creates it. The ssh-copy-id command is then automatically invoked to copy the key to the target SSH server.
[root@host ~]# kdumpctl propagate
WARNING: '/root/.ssh/kdump_id_rsa' doesn't exist, using default value '/root/.ssh/kdump_id_rsa'
Generating new ssh keys... done.
The authenticity of host 'server.lab.example.com (172.25.250.11)' can't be established.
ECDSA key fingerprint is 62:88:d6:2a:57:b1:3b:cd:9e:3c:52:e6:e3:94:f9:59.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@server.lab.example.com's password: redhat
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'root@server.lab.example.com'"
and check to make sure that only the key(s) you wanted were added.
/root/.ssh/kdump_id_rsa has been added to ~root/.ssh/authorized_keys on server.lab.example.com
Note
The Kdump Helper lab generates a script to automatically configure kdump targets. The lab is available on the Red Hat Customer Portal at https://access.redhat.com/labs/kdumphelper/.
Systems generate crash dumps when their kernel encounters an unrecoverable error. By default, any other type of error does not generate a crash dump. Administrators can enable crash dumps to troubleshoot specific errors.
When a system runs out of memory, the kernel's OOM killer terminates processes to free memory and to keep the system operational. In many use cases, the OOM killer is preferable to a panic, because it tries to kill the process that causes the memory issue while protecting other critical or long-running processes from crashing.
However, configuring a system to panic instead is appropriate when the stability of system memory is in doubt and the recommended recovery method is to halt immediately, to limit possible memory corruption, and then reboot the system cleanly.
The following command temporarily configures a system to panic on OOM-killer events:
[root@host ~]# echo 1 > /proc/sys/vm/panic_on_oom
To make the configuration permanent, use the following commands:
[root@host ~]# echo "vm.panic_on_oom=1" >> /etc/sysctl.conf
[root@host ~]# sysctl -p
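Appending with >> adds a duplicate line each time the command is repeated. As a possible alternative, the following sketch replaces any existing assignment of a key before writing the new one. The set_sysctl_persistent function name and its argument order are hypothetical, and the sketch assumes simple "key = value" lines.

```shell
# Hypothetical helper: set KEY=VALUE in a sysctl configuration file,
# replacing an existing line for KEY instead of appending a duplicate.
# Arguments: key, value, and optionally the file (default /etc/sysctl.conf).
set_sysctl_persistent() (
    key="$1"
    value="$2"
    file="${3:-/etc/sysctl.conf}"
    scratch="$(mktemp)" || exit 1
    if [ -f "$file" ]; then
        # Copy every line except existing assignments of this key.
        awk -v key="$key" '
            {
                name = $0
                sub(/=.*/, "", name)             # text before the first "="
                gsub(/[[:space:]]/, "", name)    # ignore surrounding whitespace
            }
            name != key { print }
        ' "$file" > "$scratch"
    fi
    echo "$key=$value" >> "$scratch"
    # Overwrite in place so that the file keeps its ownership and mode.
    cat "$scratch" > "$file"
    rm -f "$scratch"
)

# Example usage (as root):
#   set_sysctl_persistent vm.panic_on_oom 1
#   sysctl -p
```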
Applications sometimes experience bugs and their processes appear to hang.
The following command temporarily configures a system to panic when processes hang for longer than a specific timeout value:
[root@host ~]# echo 1 > /proc/sys/kernel/hung_task_panic
To make the configuration permanent, use the following commands:
[root@host ~]# echo "kernel.hung_task_panic=1" >> /etc/sysctl.conf
[root@host ~]# sysctl -p
The default timeout value is 120 seconds. The timeout value is configured in the /proc/sys/kernel/hung_task_timeout_secs file.
[root@host ~]# cat /proc/sys/kernel/hung_task_timeout_secs
120
Soft lockups occur when a task is executing in kernel space on a CPU without rescheduling.
The following command temporarily configures a system to panic when soft lockups occur:
[root@host ~]# echo 1 > /proc/sys/kernel/softlockup_panic
To make the configuration permanent, use the following commands:
[root@host ~]# echo "kernel.softlockup_panic=1" >> /etc/sysctl.conf
[root@host ~]# sysctl -p
Note
Do not enable the softlockup_panic or nmi_watchdog kernel parameters on a virtualized RHEL 8 machine. A virtualized environment might trigger spurious soft lockups that rarely warrant a system panic.
A non-maskable interrupt (NMI) usually occurs when a system detects a critical hardware error. NMIs are automatically generated by the NMI Watchdog program, if it is enabled. NMIs can also be generated manually by pressing the physical NMI button on system hardware or a virtual NMI button in the system's out-of-band management interface, such as HP's iLO or Dell's iDRAC.
The following command temporarily configures a system to panic when NMIs are detected:
[root@host ~]# echo 1 > /proc/sys/kernel/panic_on_io_nmi
To make the configuration permanent, use the following commands:
[root@host ~]# echo "kernel.panic_on_io_nmi=1" >> /etc/sysctl.conf
[root@host ~]# sysctl -p
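To review the panic triggers described so far in one pass, a short loop over the corresponding /proc/sys files is enough. The following sketch prints each value in sysctl notation. The show_panic_settings function name and its optional base-directory argument are hypothetical; the argument exists mainly so that the function can be tried against a test directory.

```shell
# Hypothetical helper: print the current value of each panic-related
# kernel parameter covered in this section. The optional argument is an
# alternate directory to read instead of /proc/sys.
show_panic_settings() (
    base="${1:-/proc/sys}"
    for key in vm/panic_on_oom kernel/hung_task_panic \
               kernel/hung_task_timeout_secs kernel/softlockup_panic \
               kernel/panic_on_io_nmi; do
        if [ -r "$base/$key" ]; then
            value="$(cat "$base/$key")"
        else
            value="unavailable"
        fi
        # Print in sysctl notation, for example "vm.panic_on_oom = 1".
        echo "$(echo "$key" | tr / .) = $value"
    done
)

# Example usage:
#   show_panic_settings
```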
The "Magic" SysRq key is a key sequence to diagnose an unresponsive system. The following command temporarily enables the SysRq key:
[root@host ~]# echo 1 > /proc/sys/kernel/sysrq
To make the configuration permanent, use the following commands:
[root@host ~]# echo "kernel.sysrq=1" >> /etc/sysctl.conf
[root@host ~]# sysctl -p
When enabled, certain SysRq commands trigger system events. To enter a SysRq command, hold Alt and SysRq (usually the Print Screen key) and press the command key. The following table summarizes the SysRq commands and their associated events.
Table 10.5. SysRq Commands and Associated Events
| SysRq command | Event |
|---|---|
| m | Dump information about memory allocation. |
| t | Dump thread state information. |
| p | Dump CPU registers and flags. |
| c | Crash the system. |
| s | Sync mounted file systems. |
| u | Remount file systems read-only. |
| b | Initiate system reboot. |
| o | Power off the system. |
| f | Start OOM killer. |
| w | Dump hung processes. |
Alternatively, issue SysRq commands by writing their associated key characters to the /proc/sysrq-trigger file. For example, the following command initiates a system crash.
[root@host ~]# echo 'c' > /proc/sysrq-trigger
The c character is often used to test system crash dumps.
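Because a test crash halts the system immediately, it is worth confirming that kdump is armed before triggering one. The following sketch checks the text that kdumpctl status prints, based on the "Kdump is operational" message shown earlier. The kdump_is_operational function name is hypothetical, and matching on the message text is an assumption that might not hold for every kexec-tools version.

```shell
# Hypothetical helper: succeed only if the given `kdumpctl status`
# output reports that kdump is operational.
kdump_is_operational() {
    case "$1" in
        *"Kdump is operational"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage (as root, on a test system only):
#   if kdump_is_operational "$(kdumpctl status 2>&1)"; then
#       sync
#       echo c > /proc/sysrq-trigger
#   else
#       echo "kdump is not ready; skipping the test crash" >&2
#   fi
```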
Early kdump is a feature of the kdump mechanism that captures the core dump of a booting kernel. In versions earlier than Red Hat Enterprise Linux 8, the kdump service starts late in the boot sequence, typically alongside system services such as sshd. This delayed start prevents kdump from capturing core dumps if a system crashes before system services start. In Red Hat Enterprise Linux 8, early kdump starts the kdump service earlier in the boot sequence, and creates core dumps even if the system crashes before system services start.
The following steps enable the early kdump feature.
Ensure that a kdump initramfs image exists for the current kernel.
[root@host ~]# ls /boot/initramfs-`uname -r`kdump.img
/boot/initramfs-4.18.0-305.el8.x86_64kdump.img
Rebuild the initramfs of the booting kernel with early kdump support.
[root@host ~]# dracut -f --add earlykdump
Append the rd.earlykdump kernel boot parameter to the kernelopts line in grub.
[root@host ~]# grub2-editenv - set "kernelopts=root=/dev/mapper/rhel-root ro crashkernel=auto resume=/dev/mapper/rhel-swap rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap console=tty0 console=ttyS0,115200 rd.earlykdump"
To implement the changes, reboot the system.
[root@host ~]# reboot
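After the reboot, the two visible prerequisites from the steps above can be checked with a short script. This sketch verifies only that the kdump initramfs image exists and that the rd.earlykdump parameter is on the kernel command line. It does not confirm that the rebuilt initramfs actually contains the earlykdump dracut module. The check_early_kdump function name and its arguments are hypothetical.

```shell
# Hypothetical helper: report which early kdump prerequisites are missing.
# Arguments: kernel release, kernel command line, and optionally the
# boot directory (defaults to /boot).
check_early_kdump() (
    release="$1"
    cmdline="$2"
    bootdir="${3:-/boot}"
    status=0
    if [ ! -f "$bootdir/initramfs-${release}kdump.img" ]; then
        echo "missing: kdump initramfs image for $release"
        status=1
    fi
    # Pad with spaces so that only a whole parameter matches.
    case " $cmdline " in
        *" rd.earlykdump "*) ;;
        *) echo "missing: rd.earlykdump boot parameter"; status=1 ;;
    esac
    [ "$status" -eq 0 ] && echo "early kdump prerequisites look satisfied"
    exit "$status"
)

# Example usage:
#   check_early_kdump "$(uname -r)" "$(cat /proc/cmdline)"
```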
Analyzing a crash dump is complex and requires knowledge of the Linux kernel. Specific tools help administrators to analyze high-level crash dump information.
To analyze a crash dump, install the following packages:
The kernel-debuginfo package that matches the version of the kernel where the dump was created. To find the version, check the vmcore-dmesg.txt file that is stored alongside the kernel crash dump, or run the strings command against the vmcore file.
[root@host ~]# strings vmcore | head
KDUMP
Linux
host.example.com
4.18.0-305.el8.x86_64
#1 SMP Thu Apr 29 08:54:30 EDT 2021
...output omitted...
The crash package.
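The kernel release can also be pulled from the vmcore-dmesg.txt file with a one-line filter. The sketch below assumes that the log still contains the "Linux version" line that the kernel prints at boot; if the kernel log buffer wrapped before the crash, that line might be missing and the function prints nothing. The vmcore_kernel_release function name is hypothetical.

```shell
# Hypothetical helper: print the kernel release recorded in a
# vmcore-dmesg.txt file, taken from its first "Linux version" line.
vmcore_kernel_release() {
    sed -n 's/.*Linux version \([^ ]*\).*/\1/p' "$1" | head -n 1
}

# Example usage:
#   release="$(vmcore_kernel_release /var/crash/127.0.0.1-2021-11-09-13:11:30/vmcore-dmesg.txt)"
#   echo "kernel-debuginfo-$release"
#   echo "/usr/lib/debug/lib/modules/$release/vmlinux"
```

The second echo prints the expected location of the debug kernel image, which follows the path pattern used with the crash command in this section.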
The crash command requires two parameters: the debug version of the kernel image (from the kernel-debuginfo package), and the kernel crash dump vmcore file. If the vmcore file is omitted, then the crash session runs against the currently running kernel.
[root@host ~]# crash /usr/lib/debug/lib/modules/4.18.0-305.el8.x86_64/vmlinux \
/var/crash/127.0.0.1-2021-11-09-13:11:30/vmcore
The crash prompt offers various useful commands.
files <PID>: Shows the open files for the specified process.
ps: Lists every process that was running at the time of the crash.
fuser <PATHNAME>: Displays which processes were using a certain file or directory.
The help command displays command usage information. The exit command quits the crash prompt.
References
Installing and Configuring kdump
kdump(8), kexec(8), grub-mkconfig(1), kdump.conf(5), and makedumpfile(8) manual pages