Chapter 4.  Identifying Hardware Issues

Abstract

Goal

Identify hardware issues that can affect a system's ability to operate normally.

Objectives
  • Identify system hardware devices and hardware issues.

  • Manage kernel modules and parameters.

  • Identify and resolve virtualization issues.

Sections
  • Identifying Hardware Issues (and Guided Exercise)

  • Managing Kernel Modules (and Guided Exercise)

  • Resolving Virtualization Issues (and Guided Exercise)

Lab
  • Identifying Hardware Issues

Identifying Hardware Issues

Objectives

  • Identify system hardware devices and hardware issues.

Reviewing Kernel Messages

A simple and direct source of hardware information is the running Linux kernel, which functions as the mediator of all hardware access. The kernel exposes information structures through the /proc and /sys file systems, as well as by sending kernel messages and events.

Kernel messages are written to a preallocated ring buffer that is known as the dmesg buffer. A ring buffer is a sequential memory structure where data overflow wraps around to the start of the buffer. Over time, newer messages fill the buffer and overwrite the oldest messages, but the buffer never grows in size.
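The overwrite behavior described above can be illustrated with a toy sketch (bash, not related to the real kernel implementation): a fixed-capacity buffer where each new entry replaces the oldest one once the buffer is full.

```shell
# Toy illustration of ring-buffer semantics: capacity 3, five messages.
# Once full, each new message overwrites the oldest slot.
cap=3
i=0
buf=()
for msg in one two three four five; do
  buf[$((i % cap))]=$msg
  i=$((i + 1))
done
echo "${buf[@]}"   # prints: four five three
```

The slots retain the three most recent messages; "one" and "two" are gone, just as early boot messages eventually disappear from a busy dmesg buffer.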

Examine the dmesg buffer for the following reasons:

  • Reviewing hardware that was detected at boot time.

  • Observing driver messages that are sent as hardware is detected, attached, or detached.

  • Observing warning or error messages as events occur.

To display ring buffer messages, use the dmesg command or the journalctl -k command. The systemd-journald service captures the kernel ring buffer, and can be configured for persistent storage, which is preferred over relying on the /var/log/dmesg file. The rsyslog service retrieves information from the kernel ring buffer with the imjournal plug-in.
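Persistent journal storage is controlled by the Storage= setting in /etc/systemd/journald.conf. A minimal sketch of the change, demonstrated on a throwaway copy of the file so the real system configuration is untouched:

```shell
# Sketch: switch journald from volatile to persistent storage by setting
# Storage=persistent.  Demonstrated on a copy; on a real system, edit
# /etc/systemd/journald.conf and restart systemd-journald instead.
conf=/tmp/journald.conf
printf '[Journal]\n#Storage=auto\n' > "$conf"
sed -i 's/^#\?Storage=.*/Storage=persistent/' "$conf"
grep '^Storage=' "$conf"   # prints: Storage=persistent
# On the real system, afterwards:
#   mkdir -p /var/log/journal
#   systemctl restart systemd-journald
```

With persistent storage in place, journalctl -k -b -1 can show kernel messages from the previous boot, which the in-memory ring buffer cannot.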

Exploring Kernel Messages

The first two lines of dmesg output show the kernel version and the kernel parameters that were used on the last system boot, which is useful for troubleshooting boot issues.

[root@host ~]# dmesg | head -n2
[    0.000000] Linux version 4.18.0-305.el8.x86_64 (mockbuild@x86-vm-07.build.eng.bos.redhat.com) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Thu Apr 29 08:54:30 EDT 2021
[    0.000000] Command line: BOOT_IMAGE=(hd0,gpt3)/boot/vmlinuz-4.18.0-305.el8.x86_64 root=/dev/vda3 ro no_timer_check net.ifnames=0 crashkernel=auto

Because of the quantity and complexity of dmesg log entries, view the log with filters to focus on relevant information. This example displays the memory that was made available during booting.

[root@host ~]# dmesg | grep "Memory"
[    0.000000] Memory: 261668K/2096600K available (12292K kernel code, 2100K rwdata, 3816K rodata, 2348K init, 3320K bss, 271600K reserved, 0K cma-reserved)
[    0.108065] x86/mm: Memory block size: 128MB

You can filter messages by their syslog facility and severity by using the -f and -l options.

[root@host ~]# dmesg -f kern -l warn
[    0.025358] acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
[    6.095391] printk: systemd: 16 output lines suppressed due to ratelimiting

Make message time stamps easier to read by using the -T option.

[root@host ~]# dmesg -f kern -l warn -T
[Wed Sep 22 07:17:39 2021] acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
[Wed Sep 22 07:17:45 2021] printk: systemd: 16 output lines suppressed due to ratelimiting
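The same keyword-based triage works on saved kernel logs when the live ring buffer is unavailable. A minimal sketch, using a throwaway file containing log lines copied from the examples above:

```shell
# Triage a saved kernel log by keyword.  The sample lines are copied
# from the dmesg examples in this section, written to a scratch file.
cat > /tmp/sample-dmesg.log <<'EOF'
[    0.000000] Linux version 4.18.0-305.el8.x86_64
[    0.025358] acpi PNP0A03:00: fail to add MMCONFIG information
[    6.095391] printk: systemd: 16 output lines suppressed due to ratelimiting
EOF
# Case-insensitive match on common problem keywords
grep -iE 'fail|error|warn' /tmp/sample-dmesg.log
```

Only the ACPI line matches here; keyword filtering is a quick first pass, while the dmesg -f and -l options remain the precise way to filter by facility and severity.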

Identifying CPU Capabilities

Modern systems typically have multiple CPU sockets, multiple cores per socket, and possibly multiple hyper-threads per core, along with several levels of local and shared caches. The lscpu command provides a quick summary of the local CPU configuration. The following example shows lscpu output, with an explanation of the pertinent information.

[root@host ~]# lscpu
Architecture:          x86_64           1
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4                2
On-line CPU(s) list:   0-3
Thread(s) per core:    1                3
Core(s) per socket:    4                4
Socket(s):             1                5
NUMA node(s):          1                6
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 23
Model name:            Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz
Stepping:              10
CPU MHz:               2833.000
BogoMIPS:              5665.57
Virtualization:        VT-x             7
L1d cache:             32K              8
L1i cache:             32K
L2 cache:              6144K
NUMA node0 CPU(s):     0-3              9
Flags:                 fpu vme vmx ...  10

1

The architecture of the CPU.

2

The number of logical cores that are available to the kernel for task scheduling.

3

The number of hyper-threads per core.

4

The number of cores per socket.

5

The number of physical sockets.

6

The number of distinct memory buses on NUMA systems.

7

The hardware-assisted virtualization support for the CPU.

8

The sizes of the respective caches.

9

The mapping of the logical CPUs to the NUMA address buses.

10

The list of extended technologies that the CPU supports.

Note

A CPU might display support for a specific feature flag, but that flag does not guarantee that the feature is available or in use. For example, the Intel vmx flag indicates a processor that is capable of supporting hardware virtualization, but the feature might be disabled in the firmware.
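Much of what lscpu summarizes comes from /proc/cpuinfo, which can be queried directly. A small sketch that counts logical CPUs and checks for the virtualization flags mentioned in the note above:

```shell
# Count logical CPUs (what lscpu reports as "CPU(s)") straight from
# /proc/cpuinfo
grep -c '^processor' /proc/cpuinfo

# Look for hardware-virtualization flags: vmx = Intel VT-x, svm = AMD-V.
# A present flag does not guarantee the feature is enabled in firmware.
if grep -qE '^flags.*\b(vmx|svm)\b' /proc/cpuinfo; then
  echo "CPU advertises hardware virtualization"
else
  echo "no vmx/svm flag visible (possibly firmware-disabled, or a VM)"
fi
```

This is useful on minimal systems where util-linux tools might be unavailable, and for scripting checks before attempting to run virtual machines.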

Identifying Memory

Use the dmidecode -t memory command to retrieve information about physical memory banks, including the type, speed, and location.

[root@host ~]# dmidecode -t memory
# dmidecode 2.12
...output omitted...
Handle 0x0007, DMI type 17, 34 bytes
Memory Device
  Array Handle: 0x0005
  Error Information Handle: Not Provided
  Total Width: 64 bits
  Data Width: 64 bits
  Size: 8192 MB
  Form Factor: SODIMM
  Set: None
  Locator: ChannelA-DIMM1
  Bank Locator: BANK 1
  Type: DDR3
  Type Detail: Synchronous
  Speed: 1600 MHz
  Manufacturer: Hynix/Hyundai
  Serial Number: 0E80EABA
  Asset Tag: None
  Part Number: HMT41GS6BFR8A-PB
  Rank: Unknown
  Configured Clock Speed: 1600 MHz
...output omitted...
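The dmidecode output describes the physical modules as reported by the firmware; the kernel's own view is in /proc/meminfo. A quick cross-check:

```shell
# MemTotal is the memory the kernel actually manages, in kB.  It is
# slightly lower than the installed total reported by dmidecode, because
# firmware and the kernel reserve some memory at boot.
awk '/^MemTotal:/ {printf "MemTotal: %.1f GiB\n", $2 / 1024 / 1024}' /proc/meminfo
```

A large discrepancy between the dmidecode module sizes and MemTotal can indicate a failed or unseated DIMM.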

Identifying Storage Devices

To identify physical storage devices, use the lsscsi command to list the SCSI-compatible devices that are attached to the system, including SSD, USB, SATA, and SAS disks.

[root@host ~]# lsscsi -v
[0:0:0:0]    disk    ATA      SAMSUNG MZ7LN512 4L0Q  /dev/sda
  dir: /sys/bus/scsi/devices/0:0:0:0  [/sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0]
[5:0:0:0]    cd/dvd  HL-DT-ST DVDRAM GUB0N     LV20  /dev/sr0
  dir: /sys/bus/scsi/devices/5:0:0:0  [/sys/devices/pci0000:00/0000:00:1f.2/ata6/host5/target5:0:0/5:0:0:0]

The hdparm command can provide detailed information for individual storage devices.

[root@host ~]# hdparm -I /dev/sda
/dev/sda:

ATA device, with non-removable media
  Model Number:       SAMSUNG MZ7LN512HCHP-000L1
  Serial Number:      S1ZKNXAG806853
  Firmware Revision:  EMT04L0Q
  Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
...output omitted...
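When lsscsi or hdparm are not installed, block devices and their sizes can still be read directly from sysfs. A minimal sketch:

```shell
# List block devices and their sizes from sysfs.  The "size" attribute
# is always in 512-byte sectors, regardless of the device's real
# sector size.
for dev in /sys/block/*/; do
  [ -e "${dev}size" ] || continue
  printf '%s: %s MiB\n' "$(basename "$dev")" \
    $(( $(cat "${dev}size") * 512 / 1024 / 1024 ))
done
```

This covers virtual devices (loop, device-mapper) as well as physical disks, so expect more entries here than in lsscsi output.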

Identifying Peripherals on the USB and PCI Buses

The lspci and lsusb commands query system buses to discover connected, active peripherals. The lspci command displays hardware devices that are connected to the system PCI bus. The hardware might be integrated on the motherboard, or be physically connected through a PCI host adapter. To display further device details, use the increasingly verbose options (-v, -vv, -vvv).

[root@host ~]# lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor DRAM
Controller (rev 06)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express
x16 Controller (rev 06)
00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated
Graphics Controller (rev 06)
...output omitted...

The lsusb command displays hardware devices that are connected to any system USB bus. The hardware might be integrated on the motherboard, or be physically connected through a USB port. This command also uses verbose options to show more device details.

[root@host ~]# lsusb
Bus 003 Device 002: ID 8087:8000 Intel Corp.
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 003: ID 1058:083a Western Digital Technologies, Inc.
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 009: ID 05e3:0608 Genesys Logic, Inc. Hub
Bus 001 Device 008: ID 17a0:0001 Samson Technologies Corp. C01U condenser microphone
Bus 001 Device 006: ID 04f2:b39a Chicony Electronics Co., Ltd
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
...output omitted...
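The raw PCI vendor and device IDs that lspci translates into names are also exposed in sysfs, which is useful on minimal systems where the pciutils package is not installed. A small sketch:

```shell
# Print each PCI device's address with its vendor:device ID pair.
# lspci resolves these IDs to names using its own ID database.
for dev in /sys/bus/pci/devices/*/; do
  [ -e "${dev}vendor" ] || continue
  printf '%s %s:%s\n' "$(basename "$dev")" \
    "$(cat "${dev}vendor")" "$(cat "${dev}device")"
done
```

The ID pairs can be looked up with lspci -nn on another system, or in the public PCI ID repository that pciutils ships as /usr/share/hwdata/pci.ids.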

Hardware Error Reporting

Red Hat Enterprise Linux 8 provides the rasdaemon daemon to report hardware errors. The rasdaemon service processes reliability, availability, and serviceability (RAS) error events that are generated by kernel tracing. These trace events are logged in /sys/kernel/debug/tracing and are reported by rsyslog and journald.

To use rasdaemon, install the package, and then start and enable the service.

[root@host ~]# yum install rasdaemon
...output omitted...
[root@host ~]# systemctl enable --now rasdaemon
Created symlink /etc/systemd/system/multi-user.target.wants/rasdaemon.service → /usr/lib/systemd/system/rasdaemon.service.

The ras-mc-ctl command from the rasdaemon package works with error detection and correction (EDAC) drivers. The --help option displays the command options.

[root@host ~]# ras-mc-ctl --help
Usage: ras-mc-ctl [OPTIONS...]
 --quiet            Quiet operation.
 --mainboard        Print mainboard vendor and model for this hardware.
 --status           Print status of EDAC drivers.
 --print-labels     Print Motherboard DIMM labels to stdout.
 --guess-labels     Print DMI labels, when bank locator is available.
 --register-labels  Load Motherboard DIMM labels into EDAC driver.
 --delay=N          Delay N seconds before writing DIMM labels.
 --labeldb=DB       Load label database from file DB.
 --layout           Display the memory layout.
 --summary          Presents a summary of the logged errors.
 --errors           Shows the errors stored at the error database.
 --error-count      Shows the corrected and uncorrected error counts using sysfs.
 --help             This help message.

This example summarizes memory controller events:

[root@host ~]# ras-mc-ctl --summary
Memory controller events summary:
        Corrected on DIMM Label(s): 'CPU_SrcID#0_Ha#0_Chan#0_DIMM#0'
        location: 0:0:0:-1 errors: 1

No PCIe AER errors.

No Extlog errors.
MCE records summary:
        1 MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read errors
        2 No Errors

This example lists errors that the memory controller reports:

[root@host ~]# ras-mc-ctl --errors
Memory controller events:
1 3172-02-17 00:47:01 -0500 1 Corrected error(s): memory read error at CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 location: 0:0:0:-1, addr 65928, grain 7, syndrome 0  area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0
...output omitted...
MCE events:
1 3171-11-09 06:20:21 -0500 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error, n_errors=1, mcgcap=0x01000c16, status=0x8c00004000010090, addr=0x1018893000, misc=0x15020a086, walltime=0x57e96780, cpuid=0x00050663, bank=0x00000007
2 3205-06-22 00:13:41 -0400 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x01000c16, status=0x9400000000000000, addr=0x0000abcd, walltime=0x57e967ea, cpuid=0x00050663, bank=0x00000001
3 3205-06-22 00:13:41 -0400 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x01000c16, status=0x9400000000000000, addr=0x00001234, walltime=0x57e967ea, cpu=0x00000001, cpuid=0x00050663, apicid=0x00000002, bank=0x00000002

Memory Testing

When physical memory is suspected of causing errors or being damaged, you can run a thorough, repetitive memory test with the memtest86+ package. Because live memory testing on an active system is not recommended and might cause instability, memtest86+ installs a separate boot entry that loads the memtest86+ bootable program instead of a regular Linux kernel. The MemTest86 open-source project now supports BIOS, UEFI, and ARM systems, and can use a USB device to load the bootable memory test application.

For BIOS-based RHEL systems, enable the boot entry by using standard boot loader utilities:

  1. Install the memtest86+ package to add the memtest86+ application to the /boot directory.

    [root@host ~]# yum install memtest86+
  2. Run the memtest-setup command to add a new template to the /etc/grub.d/ directory.

    [root@host ~]# memtest-setup
  3. Update the grub2 boot loader configuration.

    [root@host ~]# grub2-mkconfig -o /boot/grub2/grub.cfg

References

dmesg(1), lscpu(1), dmidecode(8), lspci(8), lsusb(8), rasdaemon(8), and ras-mc-ctl(8) man pages

Revision: rh342-8.4-6dd89bd