Red Hat Enterprise Linux Diagnostics and Troubleshooting
Abstract
Goal: Identify hardware issues that can affect a system's ability to operate normally.
A simple and direct source of hardware information is the running Linux kernel, which functions as the mediator of all hardware access. The kernel exposes information structures through the /proc and /sys file systems, as well as by sending kernel messages and events.
Kernel messages are written to a preallocated ring buffer that is known as the dmesg buffer. A ring buffer is a sequential memory structure where data overflow starts again at the top of the buffer. Over time, recent messages fill the buffer and overwrite original messages, but the buffer never grows in size.
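The wrap-around behavior can be illustrated with a toy ring buffer in bash. This is a minimal sketch, not the kernel's implementation: the names RING_SIZE, ring_write, and ring_dump are hypothetical, and the real dmesg buffer is sized in bytes rather than in message slots.

```shell
#!/usr/bin/env bash
# Toy ring buffer: fixed capacity, newest entries overwrite the oldest.
RING_SIZE=3        # hypothetical capacity, counted in messages
ring=()            # storage slots
ring_count=0       # total number of messages ever written

ring_write() {
    # The write position wraps back to slot 0 after the last slot.
    local slot=$((ring_count % RING_SIZE))
    ring[slot]=$1
    ring_count=$((ring_count + 1))
}

ring_dump() {
    # Print the surviving messages from oldest to newest.
    local start=0 i
    if (( ring_count > RING_SIZE )); then
        start=$((ring_count - RING_SIZE))
    fi
    for (( i = start; i < ring_count; i++ )); do
        printf '%s\n' "${ring[i%RING_SIZE]}"
    done
}

for msg in boot-1 boot-2 boot-3 usb-attach disk-error; do
    ring_write "$msg"
done
ring_dump
```

With a capacity of three, the two earliest messages are overwritten, and the dump prints boot-3, usb-attach, and disk-error. The storage never grows beyond RING_SIZE slots.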
Examine the dmesg buffer for the following reasons:
Reviewing hardware that was detected at boot time.
Observing driver messages that are sent as hardware is detected, attached, or detached.
Observing warning or error messages as events occur.
To display ring buffer messages, use the dmesg or the journalctl -k command. The journald service can be configured for persistent storage, which is preferred over relying on the /var/log/dmesg file as a pointer to the ring buffer. The systemd-journald service captures the kernel ring buffer. The rsyslog service retrieves information from the kernel ring buffer through the journal by using the imjournal plug-in.
The first two lines of the dmesg output contain the kernel version and the kernel parameters that were used on the last system boot, which is useful for troubleshooting boot issues.
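Each line that dmesg prints starts with a bracketed time stamp that counts seconds since boot, so a small filter can narrow the output to messages that were logged after a given point. The following is a minimal sketch, assuming the default time stamp format. The kmsg_since and sample_dmesg names are hypothetical, and the sample lines are copied from output that appears later in this section.

```shell
# kmsg_since MIN: print dmesg-style lines whose time stamp is >= MIN seconds.
kmsg_since() {
    awk -v min="$1" '
        match($0, /^\[ *[0-9]+\.[0-9]+\]/) {
            ts = substr($0, 2, RLENGTH - 2) + 0   # drop the surrounding brackets
            if (ts >= min) print
        }'
}

# Sample input for demonstration only.
sample_dmesg() {
    cat <<'EOF'
[ 0.025358] acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
[ 6.095391] printk: systemd: 16 output lines suppressed due to ratelimiting
EOF
}

# Show only messages logged five or more seconds after boot.
sample_dmesg | kmsg_since 5
```

On a live system, the same filter could be applied to the real buffer with `dmesg | kmsg_since 5`. Lines without a leading time stamp, such as the output of dmesg -T, are skipped by this sketch.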
[root@host ~]# dmesg | head -n2
[ 0.000000] Linux version 4.18.0-305.el8.x86_64 (mockbuild@x86-vm-07.build.eng.bos.redhat.com) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Thu Apr 29 08:54:30 EDT 2021
[ 0.000000] Command line: BOOT_IMAGE=(hd0,gpt3)/boot/vmlinuz-4.18.0-305.el8.x86_64 root=/dev/vda3 ro no_timer_check net.ifnames=0 crashkernel=auto
Because of the quantity and complexity of dmesg log entries, view the log with filters to focus on relevant information. This example displays the memory that was made available during booting.
[root@host ~]# dmesg | grep "Memory"
[ 0.000000] Memory: 261668K/2096600K available (12292K kernel code, 2100K rwdata, 3816K rodata, 2348K init, 3320K bss, 271600K reserved, 0K cma-reserved)
[ 0.108065] x86/mm: Memory block size: 128MB
You can filter messages by their syslog facility and severity by using the -f and -l options.
[root@host ~]# dmesg -f kern -l warn
[ 0.025358] acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
[ 6.095391] printk: systemd: 16 output lines suppressed due to ratelimiting
Make message time stamps easier to read by using the -T option.
[root@host ~]# dmesg -f kern -l warn -T
[Wed Sep 22 07:17:39 2021] acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
[Wed Sep 22 07:17:45 2021] printk: systemd: 16 output lines suppressed due to ratelimiting
Modern systems typically have multiple CPUs, each with multiple cores per socket, and possibly multiple hyper-threads per core, each with different levels of local and shared caches. The lscpu command provides a quick summary of the local configuration. The following example shows lscpu output, and an explanation of the pertinent information.
[root@host ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 23
Model name: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
Stepping: 10
CPU MHz: 2833.000
BogoMIPS: 5665.57
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 6144K
NUMA node0 CPU(s): 0-3
Flags: fpu vme vmx ...
Architecture: The architecture of the CPU.
CPU(s): The number of logical cores that are available to the kernel for task scheduling.
Thread(s) per core: The number of hyper-threads per core.
Core(s) per socket: The number of cores per socket.
Socket(s): The number of physical sockets.
NUMA node(s): The number of distinct memory buses on NUMA systems.
Virtualization: The hardware-assisted virtualization support for the CPU.
L1d, L1i, and L2 cache: The sizes of the respective caches.
NUMA node0 CPU(s): The mapping of the logical CPUs to the NUMA address buses.
Flags: The list of extended technologies that the CPU supports.
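The topology fields relate arithmetically: when every CPU is online, the logical CPU count is normally the product of sockets, cores per socket, and threads per core. The following is a minimal sketch that assumes the lscpu field labels from the preceding example. The logical_cpus and sample_lscpu names are hypothetical.

```shell
# logical_cpus: read lscpu output on stdin and multiply the topology fields.
logical_cpus() {
    awk -F': *' '
        { sub(/^[ \t]+/, "") }                    # tolerate indented labels
        $1 == "Thread(s) per core" { threads = $2 }
        $1 == "Core(s) per socket" { cores   = $2 }
        $1 == "Socket(s)"          { sockets = $2 }
        END { print sockets * cores * threads }'
}

# Sample input copied from the lscpu example in this section.
sample_lscpu() {
    cat <<'EOF'
CPU(s): 4
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
EOF
}

sample_lscpu | logical_cpus
```

For the sample system, 1 socket with 4 cores and 1 thread per core yields 4, which matches the CPU(s) field. On a live system, `lscpu | logical_cpus` applies the same calculation. A mismatch with CPU(s) can indicate offline CPUs or an asymmetric topology, which this sketch does not handle.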
Note
A CPU might display support for a specific feature flag, but that flag does not guarantee that the feature is available or in use. For example, the Intel vmx flag indicates a processor that is capable of supporting hardware virtualization, but the feature might be disabled in the firmware.
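A flag check can be scripted against the flags line that the kernel publishes in /proc/cpuinfo on x86 systems. The following is a minimal sketch. The has_cpu_flag function name is hypothetical, and a successful match shows only that the CPU advertises the flag, not that the firmware has enabled the feature.

```shell
# has_cpu_flag FLAG: succeed if /proc/cpuinfo-style text on stdin lists FLAG.
has_cpu_flag() {
    awk -v flag="$1" '
        /^flags[ \t]*:/ {
            # Fields 1 and 2 are the label and the colon; flags start at field 3.
            for (i = 3; i <= NF; i++) if ($i == flag) found = 1
        }
        END { exit !found }'
}

# Demonstration with a shortened sample flags line.
if printf 'flags\t\t: fpu vme vmx\n' | has_cpu_flag vmx; then
    echo "vmx is advertised by the CPU"
else
    echo "vmx is not advertised by the CPU"
fi
```

On a live system, the check could be run as `has_cpu_flag vmx < /proc/cpuinfo`. The comparison is made on whole words, so a search for vmx does not match a longer flag name that merely contains those letters.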
Use the dmidecode -t memory command to retrieve information about physical memory banks, including the type, speed, and location.
[root@host ~]# dmidecode -t memory
# dmidecode 2.12
...output omitted...
Handle 0x0007, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x0005
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: SODIMM
Set: None
Locator: ChannelA-DIMM1
Bank Locator: BANK 1
Type: DDR3
Type Detail: Synchronous
Speed: 1600 MHz
Manufacturer: Hynix/Hyundai
Serial Number: 0E80EABA
Asset Tag: None
Part Number: HMT41GS6BFR8A-PB
Rank: Unknown
Configured Clock Speed: 1600 MHz
...output omitted...
To identify physical storage devices, use the lsscsi command, which lists the physical devices that are compatible with the SCSI driver and attached to the system, including SSD, USB, SATA, and SAS disks.
[root@host ~]# lsscsi -v
[0:0:0:0] disk ATA SAMSUNG MZ7LN512 4L0Q /dev/sda
dir: /sys/bus/scsi/devices/0:0:0:0 [/sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0]
[5:0:0:0] cd/dvd HL-DT-ST DVDRAM GUB0N LV20 /dev/sr0
dir: /sys/bus/scsi/devices/5:0:0:0 [/sys/devices/pci0000:00/0000:00:1f.2/ata6/host5/target5:0:0/5:0:0:0]
The hdparm command can provide detailed information for individual storage devices.
[root@host ~]# hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: SAMSUNG MZ7LN512HCHP-000L1
Serial Number: S1ZKNXAG806853
Firmware Revision: EMT04L0Q
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
...output omitted...
The lspci and lsusb commands query system buses to discover connected, active peripherals. The lspci command displays hardware devices that are connected to the system PCI bus. The hardware might be integrated on the motherboard, or be physically connected through a PCI host adapter. To display further device details, use the increasingly verbose options (-v, -vv, -vvv).
[root@host ~]# lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor DRAM Controller (rev 06)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06)
00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06)
...output omitted...
The lsusb command displays hardware devices that are connected to any system USB bus. The hardware might be integrated on the motherboard, or be physically connected through a USB port. This command also uses verbose options to show more device details.
[root@host ~]# lsusb
Bus 003 Device 002: ID 8087:8000 Intel Corp.
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 003: ID 1058:083a Western Digital Technologies, Inc.
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 009: ID 05e3:0608 Genesys Logic, Inc. Hub
Bus 001 Device 008: ID 17a0:0001 Samson Technologies Corp. C01U condenser microphone
Bus 001 Device 006: ID 04f2:b39a Chicony Electronics Co., Ltd
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
...output omitted...
Red Hat Enterprise Linux 8 provides the rasdaemon daemon to report hardware errors. The rasdaemon service processes reliability, availability, and serviceability (RAS) error events that are generated by kernel tracing. These trace events are logged in /sys/kernel/debug/tracing and are reported by rsyslog and journald.
To use rasdaemon, install the package, and then start and enable the service.
[root@host ~]# yum install rasdaemon
...output omitted...
[root@host ~]# systemctl enable --now rasdaemon
Created symlink /etc/systemd/system/multi-user.target.wants/rasdaemon.service → /usr/lib/systemd/system/rasdaemon.service.
The ras-mc-ctl command from the rasdaemon package works with error detection and correction (EDAC) drivers. The --help option displays the command options.
[root@host ~]# ras-mc-ctl --help
Usage: ras-mc-ctl [OPTIONS...]
--quiet Quiet operation.
--mainboard Print mainboard vendor and model for this hardware.
--status Print status of EDAC drivers.
--print-labels Print Motherboard DIMM labels to stdout.
--guess-labels Print DMI labels, when bank locator is available.
--register-labels Load Motherboard DIMM labels into EDAC driver.
--delay=N Delay N seconds before writing DIMM labels.
--labeldb=DB Load label database from file DB.
--layout Display the memory layout.
--summary Presents a summary of the logged errors.
--errors Shows the errors stored at the error database.
--error-count Shows the corrected and uncorrected error counts using sysfs.
--help This help message.
This example summarizes memory controller events:
[root@host ~]# ras-mc-ctl --summary
Memory controller events summary:
Corrected on DIMM Label(s): 'CPU_SrcID#0_Ha#0_Chan#0_DIMM#0'
location: 0:0:0:-1 errors: 1
No PCIe AER errors.
No Extlog errors.
MCE records summary:
1 MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read errors
2 No Errors
This example lists errors that the memory controller reports:
[root@host ~]# ras-mc-ctl --errors
Memory controller events:
1 3172-02-17 00:47:01 -0500 1 Corrected error(s): memory read error at CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 location: 0:0:0:-1, addr 65928, grain 7, syndrome 0 area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0
...output omitted...
MCE events:
1 3171-11-09 06:20:21 -0500 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error, n_errors=1, mcgcap=0x01000c16, status=0x8c00004000010090, addr=0x1018893000, misc=0x15020a086, walltime=0x57e96780, cpuid=0x00050663, bank=0x00000007
2 3205-06-22 00:13:41 -0400 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x01000c16, status=0x9400000000000000, addr=0x0000abcd, walltime=0x57e967ea, cpuid=0x00050663, bank=0x00000001
3 3205-06-22 00:13:41 -0400 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x01000c16, status=0x9400000000000000, addr=0x00001234, walltime=0x57e967ea, cpu=0x00000001, cpuid=0x00050663, apicid=0x00000002, bank=0x00000002
When physical memory is suspected of causing errors or being damaged, you can run a thorough, repetitive memory test with the memtest86+ package. Because live memory testing on an active system is not recommended and might cause instability, memtest86+ installs a separate boot entry that loads a memtest86+ bootable program instead of a regular Linux kernel. The MemTest86 open-source project now supports BIOS, UEFI, and ARM systems, and can use a USB device to load the bootable memory test application.
For BIOS-based RHEL systems, enable the boot entry by using standard boot loader utilities:
Install the memtest86+ package to add the memtest86+ application to the /boot directory.
[root@host ~]# yum install memtest86+
Run the memtest-setup command to add a new template to the /etc/grub.d/ directory.
[root@host ~]# memtest-setup
Update the grub2 boot loader configuration.
[root@host ~]# grub2-mkconfig -o /boot/grub2/grub.cfg
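Because the preceding steps apply to BIOS-based systems, it can help to confirm how the system was booted before running them. The following is a minimal sketch that relies on the kernel creating the /sys/firmware/efi directory only when the system is booted in UEFI mode. The firmware_type function name and its optional root-prefix argument are hypothetical; the argument exists only so that the check can be exercised against a test directory.

```shell
# firmware_type [ROOT]: print UEFI if ROOT/sys/firmware/efi exists, else BIOS.
firmware_type() {
    if [ -d "${1:-}/sys/firmware/efi" ]; then
        echo UEFI
    else
        echo BIOS
    fi
}

echo "Boot firmware: $(firmware_type)"
```

If the result is UEFI, the grub2 configuration path in the last step might not be the file that the boot loader reads, so verify the correct location for the system before regenerating the configuration.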
References
dmesg(1), lscpu(1), dmidecode(8), lspci(8), lsusb(8), rasdaemon(8), and ras-mc-ctl(8) man pages