Diary and notebook of whatever tech problems are irritating me at the moment.

20100903

Your Linux system keeps falling and it can't get up

Once in a while a Linux PC technician will encounter a system that has problems with lockups (a.k.a. hanging or freezing). Sometimes it is failing hardware but other times it's a software problem. Here are the common causes for this and how to identify which is the source of your problems. While I predominantly use Ubuntu (and some Mandriva) these tests are valid for most any distribution.

1. A kernel crash or panic is rare but generally fatal. As any PC tech knows, the first test is seeing if the Caps Lock, Num Lock, or Scroll Lock LEDs change state when the corresponding keys are pressed as this is performed by the PC, not the keyboard controller. If they don't change then you know you've got a freeze problem. If the keyboard lights are flashing repeatedly during the freeze then it's a panic. More severe problems can prevent the kernel from failing gracefully enough to even do that. These failures can be caused by fundamental compatibility problems with a kernel module and a critical piece of hardware (like a BIOS with a broken ACPI implementation), a device that was given commands it didn't like and has stopped responding (like a video GPU) or just failing hardware (bad RAM, overheating CPU, and loose PCI or AGP cards).

With hardware problems it is best to start by opening the case and blowing the dust out since that is the source of most overheating problems. After cleaning a few systems you'll understand why you should charge extra for smokers, pet owners, and homes with shag carpet. I find it easiest to use an air compressor with a tank and a blow-off nozzle to do the job as canned air is too weak. Leave the system plugged in (but powered off) to keep it grounded as air streams can produce static electricity which can damage electronics. With modern systems the only hazardous voltages are in the power supply so if you don't insert any metal objects into that you won't get shocked. Try spinning all fans with your finger or a plastic probe to see if they turn freely without resistance. Replace any that drag with anything other than a sleeve-bearing fan (like ball or fluid bearings). Power supply fans can be replaced but it requires soldering or swapping connectors as there isn't a standard connection for their internal fans. While it's fun to use the air nozzle to spin up the fans to 100K+ RPM it's bad for the bearings. They also get cleaner when you hold the blades stationary (I use the end of a big nylon tie). Start with the power supply and then the CPU and work your way around to the front case vents. Keep wires tied up and away from the fans so they don't jam the blades or block airflow. Check for cards and memory modules that are not fully seated in their slots and partially-connected drive cables. If possible, remove heat sinks from CPUs and other chips and check for adequate heat sink grease (should be an even but thin layer across the entire mating surface). Check that the chips are properly seated in the sockets and that the heatsink is pressing down evenly on them else they may tilt and lose contact. Check for failing capacitors. If you find any bad caps it's probably easier to replace the motherboard unless you have good soldering skills. Use your eyes and nose - if something looks or smells burnt then it probably is. Keep in mind that power supplies often have stronger "electrical" smell to them due to hand-soldering during manufacturing.

Laptops are harder to clean. Most can be opened by prying off the top bezel around the keyboard, usually starting with the section enclosing the display hinges. Some have screws and some just latch. Then the keyboard can be removed and the top mounting plate. There are how-to disassembly videos on the Internet for popular models that are often modded by hackers. Some laptops have externally removable heat sinks for easy cleaning. Just because it's easy doesn't mean that users clean them. I once recycled a high-end Sager laptop (about $5K USD) that had overheated and failed. The heatsink had plugged with lint and it kept shutting down so the parents gave it to their kids to play with. The kids laid it on their bed and had it running (bottom fans so no airflow whatsoever) and it overheated enough melt the case around the heatsink. Made me sick to throw it out but the motherboard wasn't practical to fix after that. Modern CPUs will reduce their clock speed when overheating but can't reduce power dissipation entirely and can still overheat and fail when running at minimum levels.

Intermittent failures are harder to diagnose so continuous monitoring with hardware or software tools is needed. Hardware temperature monitoring can be performed with a cooking probe, thermocouple meter, or a dedicated PC temperature monitor that mounts in a drive bay. The CPU, GPU, power supply fan exhaust, and hard drives are the ones to focus on. Temperature limits for devices vary. CPUs and GPUs can often hit 60°C but 50°C is rather hot for a hard drive.

For fan monitoring you can leave the case cover off and keep an eye on them or install a PC fan monitor/controller with a display which mounts in a drive bay (and often includes temperature monitoring probes). Thermostatically-controlled fans will vary a lot but if you are having an overheating problem due to inadequate speed from a non-faulty fan then it may be too far out of the primary air stream to have an adequate response time. Better models have adjustable thresholds or remote sensors but the best solution is one controlled by the motherboard via a 4-pin PWM fan connector. Be careful here - I burned out a CPU fan controller when I used a CPU fan that consumed more current (amperes) than the motherboard's controller could handle so check the specifications before plugging it in. I had to convert mine to a drive connector which meant it ran at maximum speed and sounded like a vacuum cleaner.

A voltmeter is useful for monitoring power supply voltages on the connectors under various loads. Generally supply voltages should be within 10% of the stated voltages on the power supply label. CPUs and GPUs usually have a local regulator on their boards as they need voltages that differ greatly from the normal 12/5/3.3 volts that most power supplies provide. The BIOS often has control over the CPU voltages and configuration so a bug in the BIOS (or incorrect manual settings) can cause erratic lock-ups by making the CPU unstable. Usually a CMOS reset or BIOS update can fix this. One way to test for an unstable or faulty CPU is to underclock it via manual settings in the BIOS (or jumper or switch settings on really old motherboards) and see if stability improves.

To verify what the CPU needs for power and clock rates you first need to identify exactly what one you have as manufacturers have many versions and steppings and their requirements may differ. To see what you have use the command "less /proc/cpuinfo". Use that information to search for exact specifications and compare it to your system. Pay close attention to power requirements as some motherboards, even with the same socket, can't handle some CPUs. This results in unstable CPU voltages and intermittent failures, especially under heavy loads. This problem tends to occur with long-lived socket designs where the CPU family is expanded to include models with higher power requirements (essentially changing the motherboard requirements) that earlier motherboard designs can't meet even though the CPU fits in their sockets. I've damaged a few boards that way. Heatsinks and fans also need to meet the requirements of the CPU. Mass-market PC systems usually have very little power margin between the shipped CPU requirements and the system cooling capabilities so failing to upgrade them both can result in instability. Many of these cheap systems use a ducted case fan for cooling and just replacing a failed one requires tracking down the specifications for the fan and finding a replacement that matches in airflow (CFM) and features (connector type and thermostatic control). Standard CPU fan/heatsink combos usually can't be used as they don't fit in the case or the motherboard lacks mounting holes for them.

Most modern motherboards have built-in sensors as do CPUs, GPUs, and storage devices. These can be queried by software for status information, monitoring, and logging. Some BIOSes report the sensor values and error conditions and advanced servers often have separate hardware modules for remote monitoring of them. The standards that the sensor systems conform to are imprecise so custom drivers and algorithms are needed by external software for each implementation. Software tools include lm-sensors and smartmontools.

The lm-sensors utilities report what thermal/fan/voltage sensors you have on your motherboard (if available and supported) and their current status. You first run "sensors-detect" to identify what kernel modules are needed and have it add them to /etc/modules and reboot (or just load them with modprobe). Then just run "sensors" to get the current status or use a graphical application like the Gnome Sensors Applet, KSensors or the XFCE4 Sensors panel plug-in. Note that wildly extreme readings may not indicate a fault but rather an unused sensor input or an unsupported implementation.

Most modern hard drives and SSDs have a monitoring and diagnostic system called SMART which can be access with smartmontools. While SMART can tell you about problems, it is not good at predicting failures. You use the smartctl program and specify the storage device to query. For most systems the primary storage device is named "sda" by the kernel so the command would be "smartctl -a /dev/sda | less". Most modern drives report temperature, log errors, and have built-in self tests that smartctl can activate. While the underlying registers on the drives are well-defined, what they represent is not so conversion data is needed by smartctl to interpret the values. It will tell you if it recognizes the model or not. The obvious status to check is the "overall-health self-assessment test" which tells you if any of the register values exceed an alarm threshold. More specifically the parameters of type "Pre-fail" are important. Also note the "worst" temperature value as it could indicate a prior significant overheating incident which is most likely to occur under heavy load (like during a backup or a RAID rebuild). Graphical tools include GSmartControl and Palimpsest disk utility in DeviceKit (a.k.a. gnome-disk-utility) but root access may be needed by them. Another is hddtemp which only reads the temperature but has a daemon that can be monitored through the sensor monitoring tools mentioned above.

RAM can be tested with Memtest86+ which is installed in Ubuntu by default. Reboot and hold the left Shift key down before Grub loads and starts booting. You'll get the Grub menu with Memtest86+ listed. You can also download a bootable ISO or USB image from the Memtest86+ site to test with. In the early days of PCs the memory had parity checking but modern RAM doesn't so the only way to identify a failure is by using a RAM test. If you are worried about memory problems then get ECC memory. This costs only a little more than standard RAM but the motherboard has to support it and it can reduce performance and limit overclocking. With ECC memory the BIOS can provide much more memory diagnostic information and testing. For example, wiping unused memory locations is a standard process that is performed at a user-definable interval to see if any bits changed state by themselves. Servers often use ECC memory but usually these are registered ECC memory modules which are sometimes called "server memory". The "registered" aspect isn't a certification - it's a signal amplifier built into the module for use in systems that have more modules than the motherboard's northbridge can communicate with directly. These are not compatible with standard memory or motherboards that use it. RAM memory modules have an SPD device that indicates it's specifications. To read it (and other BIOS information) use the command "dmidecode | less". Another source of intermittent memory problems is faulty configuration by the BIOS, either manually by the user or a faulty automatic configuration. A CMOS reset or BIOS update can often fix this.

Diagnosing a freezing system is difficult since you can't check log messages easily with a frozen system and the logs are often truncated as a result of it. The kernel (and Grub) have built-in remote communication options which can help with this. These out-of-band remote connections can be made through a serial console or with Netconsole and another system. A serial console can be used like an SSH connection but requires a hardware RS-232 serial port which is rare on modern systems. On Ubuntu 10.04 (Lucid Lynx) there is a Memtest86+ serial console configuration already in the menu that can be used to test memory remotely but it's probably more useful for headless (i.e. no display) servers. Netconsole requires a network connection (it uses UDP) and another system running a syslog server. For kernel crashes the Linux Kernel Crash Dump tools can be used to obtain crash data that is useful for diagnostics or reporting kernel bugs but I haven't used it yet.

Check the kernel messages and logs with "dmesg | less", "less /var/log/kern.log", "less /var/log/syslog". There are many different log files including compressed backups of previous logs. Some require you to be root to access them. With Ubuntu you just add "sudo" before the commands or just get a root login with "sudo su". Midnight Commander's internal editor is helpful for reading logs including the compressed ones. The built-in editor is not the default in Ubuntu - you have to enable it within MC with F9 > Options > Alt-I > Alt-S (use Esc 9 instead of F9 when connecting through a serial console).

Most distros have boot options that can be issued through the boot loader to the kernel to change its behavior or deactivate specific functions. Ubuntu and Debian have many but every distro has it's own. These can help to isolate problems or provide long-term stability when added permanently to the boot loader options.

2. An input error resulting in the loss of keyboard/mouse control acts like a freeze but isn't. The first clue is to see if there is any screen activity at all (most desktops at least have a clock applet running). Hardware causes include a faulty peripheral, USB hub, PS/2 port, or KVM switch. With PS/2 ports a failure with one device usually prevents the other from working. A simple test is to plug in a USB mouse or keyboard and see if they work during the freeze. When a kernel bug is responsible the keyboard works in the BIOS setup and Grub menu but fails during boot (I've had problems with a bug related to an Intel i8042 PS/2 controller). These can be intermittent between boots but once it's working during a session it usually stays working. It can also be a bug in X.org if they work in a tty terminal but not in X (as when booting into Ubuntu's recovery mode). I've encountered a freezing problem that affects only the mouse. It often occurs when an OpenGL game crashes. Besides restarting X with the keyboard (knowing the menu hotkeys helps here), I've found that launching the game again and then exiting usually fixes the problem.

Check your X session logs during the freeze by switching to a tty or connecting remotely through a SSH or serial console connection. Login, then do "less /home/<username>/.xsession-errors" and see if there are any crash messages from running applications. Most desktop applications will log messages there if they don't have their own log. If you have no control at all, reboot but don't log in to a graphical session (at the display manager login screen) as the session log will be overwritten as soon as you do. Don't just hit the reset or power button when rebooting - try the Magic Sysrq keys first or connect remotely and issue a "reboot" or "init 6" command.

An example of another traumatic but non-system freeze is when Nautilus hangs as this makes it difficult to do anything with the Gnome desktop until killed (it usually restarts automatically just like Windows Explorer). A Nautilus error would show up in .xsession-errors while a crash would also show up in the kernel logs. If the session log is rather big, making it hard to isolate messages related to a particular application, you can open a terminal window and try running the suspect application from there as any error messages would show up in that window instead. You can also capture the messages to a file by copying the screen or using shell I/O redirection to a file which is helpful when submitting bug reports.

X.org input driver errors will show up in it's log at /var/log/X.#.log where the # represents the instance that was running. Normally it's X.0.log unless you have multiple sessions running like multiple X logins or a non-Xinerama dual-head configuration. A different session ID could also be used if X crashes back to the display manger login screen and it thinks another session is still running (due to a leftover lock file) when you login again.

3. Outright X.org crash. When it involves screen corruption it's obvious but that symptom isn't always present. Sometimes this happens when switching to or from a tty terminal or when an OpenGL application is running full-screen. Sometimes it happens with the display manger at the login screen. If the keyboard lights don't toggle then try switching to a tty. If that doesn't work then try killing X with left-Alt+SysReq+K (or Ctrl+Alt+Backspace if enabled). If that doesn't do anything either then try a remote connection (or just pinging it). If that also fails then you are facing a kernel crash (which can be caused by a misbehaving video device due to integration between X, the drivers, and the kernel). If you do get remote access then save and review the logs including that of the display manger (/var/log/gdm/:0-greeter.log). These crashes are usually caused by video driver problems. In Ubuntu 8.10 through 10.04 (and several other distros) almost any Intel 8xx series graphics device will cause problems. There is a lot of architectural changes occurring with video which involves the kernel, X.org, and DRI and there has been a lot of breakage. Some drivers are not keeping up with the changes and some latent driver bugs are being discovered. The older Intel devices are currently the worst (of the "supported" devices) even though Intel is the one of the companies that is pushing these changes and has engineers working on it. But not all video crashes are the fault of the driver. Some may be kernel bugs with the motherboard chipset that the video GPU is triggering. This was a common problem with AGP ports and video device manufacturers like Nvidia wrote their own AGP modules for specific chipsets.

To save time, instead of analyzing the logs for something that indicates a driver problem, search the Internet for distro bugs relating to the one you have. Identify your graphics device with "lspci | less" or "lshw | less" and then search with Google for the device part number and the distro like "Ubuntu 10.04" or "Lucid". Check the release notes for known problems and possible workarounds. With Ubuntu there are usually workarounds in the Ubuntu help wiki.

4. CPU overload. Some process is hogging the CPU and slowing everything down to a crawl. More common with single cores but can still happen with multicores due to memory bottlenecks. If you can't get a graphical process management tool like Gnome System Monitor to load then switch to a terminal and use the "top" command to see who the culprit is (probably Flash but X.org driver faults can cause overloads without a outright freeze or crash). Identify it by process ID and kill it using top's built-in kill option (press k). You can also list processes with "ps -A" then use "kill -s <signal> <process number>". If there are multiple instances of the same process then use "killall -s <signal> <process name>". The signal is 15 (terminate) by default which means "ask nicely". If that doesn't work then use 9 (kill) which isn't as friendly.

5. I/O overload. Something is hogging the hard drive/SSD which can slow everything down to the point of being non-responsive. You can usually identify this by the hard drive activity LED being lit continuously. You can narrow down the list of processes responsible with the lsof command with "lsof | less" but you'll find the output can be overwhelming. If you know which file it is then you can identify the process responsible with fuser. Interactions between Firefox's database and the EXT filesystem can cause I/O overload intermittently but it's not as often with newer versions. A lack of storage space can cause it if applications that are trying to write to the disk don't handle failed writes well, especially logs and temporary files. They may hang and start hogging the CPU also. To check available storage space use "df -Th".

6. Memory hogging. Some process is eating memory and increasing in size. If the RAM is used up then the swap partition is used which can manifest itself as #5 also. Eventually the system runs out of memory and the kernel starts killing processes to fix it. Identifying the culprit is essentially the same as for CPU hogs. Use the command "free" to check memory and swap usage.


This is only the start of the diagnosis. Once you identify the source of the problem then you can try to find a workaround, file bug reports, and test patches. This all seems rather complicated but after you've fixed a few dozen systems you eventually recognize specific symptoms and behavior patterns right away and can quickly narrow down the problem. What differentiates real technicians and hackers from the amateurs is the stubborn resolve to find the problem.

6 comments:

bjrosen said...

You'll want to add sys_basher to your arsenal. It's available in the Fedora repositories and in the EPEL repository for RHEL/CentOS/SL. For all other distros you can get the source here,

http://www.polybus.com/sys_basher_web/

jhansonxi said...

Thanks for the tip bjrosen. That looks like a good command-line tool for stress testing. For graphical stress testing of the GPU, Phronix Test Suite is another option.

Curtis said...

You should just install Linux then your system wouldn't crash ... oh, wait ...

raxon said...

I have fixed the issue of Linux data loss , using Stellar Phoenix Linux data recovery software. I got it downloaded from http://www.hard-drive-recovery-software.com/linux-recovery.php

jhansonxi said...

Thanks raxon. That was such a smoothly worded bit of commercial spam I decided to leave it. :D

Anonymous said...

I didn't even know that SysRQ still has a function. I should have known your blogpost before when ff froze everything up and HD was like crazy. Bookmarked for future use.

About Me

Omnifarious Implementer = I do just about everything. With my usual occupations this means anything an electrical engineer does not feel like doing including PCB design, electronic troubleshooting and repair, part sourcing, inventory control, enclosure machining, label design, PC support, network administration, plant maintenance, janitorial, etc. Non-occupational includes residential plumbing, heating, electrical, farming, automotive and small engine repair. There is plenty more but you get the idea.