How to give your hard drive a checkup
Longtime computer users have likely experienced that sinking feeling when their trusty rig develops a new "click-click-click" or a whirring sound. It should certainly raise the suspicion that this is the last gasp of a failing hard drive. The problem is that while these new sounds are sometimes a "click of death" for an aging storage device, that's not always the case. For example, a worn-out cooling fan can sound much like a failing hard drive.
If you have any fear that your mechanical hard drive is reading and writing its last bytes of data, the first priority should be to grab any and all data from it. In some cases, the telltale noises will precede the drive's total failure, and there should be a window of opportunity to access the data before the drive completely calls it quits. This can also be an opportunity to verify that backups are working and current, including those on a USB external hard drive or network attached storage (NAS). However, even with backups functioning, a direct data grab off the potentially failing hard drive is a good idea at this point, so that you can be sure the data is captured to a convenient location, such as a USB flash drive, optical media, or another hard drive. Once the drive crashes completely, it will require a data recovery service to retrieve any data from it, which is quite expensive and has no guarantee of success.
Get S.M.A.R.T.
With the data saved from the drive, and the suspicion of a drive failure remaining, there are methods to help you figure out what's going on. Modern hard drives report what is called S.M.A.R.T. data, which stands for Self-Monitoring, Analysis, and Reporting Technology, and is commonly referred to as SMART. It emerged in the mid-1990s, and grew out of IBM's Predictive Failure Analysis technology for monitoring the health of multiple drives in servers. SMART data can be accessed on internal drives on both the IDE and SATA interfaces, as well as external drives of the eSATA and USB varieties.
Crystal Disk Info
You might think that figuring out if a storage drive is dying would be a core function of Windows, but this is not the case. However, there are several freeware solutions that can easily access the SMART data on a storage drive and display it. The program we turn to for this is Crystal Disk Info, as it is frequently updated and has a good interface. An alternative that works just as well is HD Tune, although it doesn't seem to be updated as frequently.
Here we see a screenshot of Crystal Disk Info. At first glance, it's clear that there's a lot of data provided about the storage device. Starting with the general information, the type of drive, the manufacturer, model, and capacity are listed at the top. In this case it's a Toshiba 1TB drive. Below that, we see the firmware version, the serial number (removed from the screenshot), the interface, and its speed, which is SATA 300. There is also the drive letter Windows has assigned to the drive, the buffer size of 8MB, and the spindle rotation rate of 5,400rpm. So far, this is all information that could be gleaned from looking at the physical drive, except for the firmware and the drive letter. While it is interesting background info, it doesn't really tell us anything about the health of the drive.
SMART data analysis
On the left side is a general "Health Status," which in this case is "Good," and reassures us that this drive is not failing and is fine to continue to use. This Health Status is an overall indicator of the condition of the drive. There's also an indicator of the drive's temperature, which, at 29°C, shows that the drive is within the ideal temperature range of 25°C to 40°C for maximum longevity.
When considering the age of a drive, there are some indicators of how much the drive has been used, which can raise the suspicion that perhaps it's worn out. One of these is the "Power On Count," which tells you the number of times that this computer's drive has been powered on (137 in this example). The other is the "Power On Hours," which tells you how many hours the drive has been used (171 in this example). Both of these are low in this case, as this system is only a few months old. Older research indicated that three percent of drives failed in their first three months of use, followed by a steady rate of failure from two years onward until five years, when most of the drives were at their end of life. Unfortunately, while the Power On Hours and Power On Count give an idea of the drive's use, they do not predict the drive's impending failure.
One of the difficulties in using SMART data to spot an impending drive failure is that the hard drive vendors are not forthcoming about which parameters are most predictive of failure. This is probably because drive manufacturers hardly want to highlight that their drives fail at all (why dwell on the negative?), but as sure as death and taxes, a hard drive failing sooner or later is just one more eventuality of life.
Predicting drive failure
While mum's the word over at the hard drive vendors, one company that has been forthcoming is Backblaze, which has experience with 40,000 hard drives' worth of data. With that many drives, you can bet it has learned a thing or two about drive failures along the way. The company reports that it uses five stats to predict a hard drive failure:
- SMART 5 – Reallocated Sector Count: This is a count of the number of drive sectors that have been found to have errors, and remapped to good sectors. As bad sectors accumulate, this negatively affects drive performance.
- SMART 187 – Reported Uncorrectable Errors: A count of the number of errors that could not be recovered.
- SMART 188 – Command Timeout: A count of the number of operations that had to be aborted due to a hard drive timeout.
- SMART 197 – Current Pending Sector Count: A count of the current sectors that have unrecoverable read errors that are pending to be remapped.
- SMART 198 – Uncorrectable Sector Count: This is a total count of the uncorrected sectors, and will increase in value as the drive is failing.
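As a rough sketch of how the five stats above might be checked together, the Python below flags any watched attribute whose raw count is above zero. The "above zero" rule, the function name, and the sample readings are all assumptions made for illustration; they are not taken from a real drive or from any particular tool.

```python
# Attribute IDs and names as listed in the article.
WATCHED_ATTRIBUTES = {
    5: "Reallocated Sector Count",
    187: "Reported Uncorrectable Errors",
    188: "Command Timeout",
    197: "Current Pending Sector Count",
    198: "Uncorrectable Sector Count",
}


def flag_attributes(raw_values):
    """Return the names of watched attributes whose raw count is above zero.

    raw_values maps a decimal SMART attribute ID to its raw value.
    Assumption: any nonzero raw count is treated as worth a closer look.
    Attributes missing from raw_values are treated as zero.
    """
    return [
        name
        for attr_id, name in WATCHED_ATTRIBUTES.items()
        if raw_values.get(attr_id, 0) > 0
    ]


# Hypothetical readings, not taken from a real drive.
sample = {5: 0, 187: 0, 188: 0, 197: 8, 198: 0}
print(flag_attributes(sample))
```

A flagged attribute is a prompt to look more closely and check your backups, not proof that the drive is about to fail.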
If we look at Crystal Disk Info, it lists the same attribute IDs in hexadecimal rather than decimal, so we'll have to put on our hexadecimal thinking caps and do the conversions. So, while 5 is still the Reallocated Sector Count, the Reported Uncorrectable Errors is not 187, but BB, and Command Timeout is BC. The Current Pending Sector Count is C5, and the Uncorrectable Sector Count, shown as Offline Uncorrectable, is C6.
By comparing each attribute's Current value with its Threshold, you can estimate how close a drive is to the end of its life. These are normalized values that typically fall as the drive degrades, so the smaller the gap between Current and Threshold, the closer that attribute is to a failing state.
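If you'd rather not do the conversions by hand, a few lines of Python can handle them. This is only a convenience sketch; the helper names are made up for this example.

```python
def smart_id_to_hex(attr_id):
    """Format a decimal SMART attribute ID as two-digit uppercase hex."""
    return format(attr_id, "02X")


def smart_hex_to_id(hex_id):
    """Convert a hexadecimal attribute ID back to decimal."""
    return int(hex_id, 16)


for attr_id in (5, 187, 188, 197, 198):
    print(attr_id, "->", smart_id_to_hex(attr_id))
```

The loop prints 05, BB, BC, C5, and C6 for the five attributes, matching the IDs given above.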
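The comparison can be sketched in Python as follows. It assumes the common SMART convention that normalized values start high and fall toward the threshold as the drive wears, and that an attribute counts as failing once its current value is at or below the threshold. The function names and sample numbers are hypothetical.

```python
def headroom(current, threshold):
    """Points of margin left before the normalized value reaches its threshold."""
    return current - threshold


def is_failing(current, threshold):
    # Assumption: an attribute is failing once current <= threshold.
    return current <= threshold


# Hypothetical normalized values, not taken from a real drive.
print(headroom(100, 36), is_failing(100, 36))
print(headroom(30, 36), is_failing(30, 36))
```

The first pair shows a healthy attribute with 64 points of margin; the second has dropped 6 points below its threshold and would be flagged.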
A word about SSDs
Although the previous discussion focused on mechanical hard drives, the same approach can be applied to solid-state drives. These flash memory storage solutions with no moving parts (except the electrons) do not fail from the mechanical issues that plague traditional hard drives, but the NAND chips and controller boards of SSDs can still fail. Crystal Disk Info can certainly be used for SSD SMART data analysis, and there is also manufacturer-specific software for the job. One of the advantages of manufacturer-specific software for an SSD is that it can check for a firmware upgrade to the drive.
Here we have an example of a Crucial MX100 SSD that has a 256GB capacity. Most of the parameters in terms of SMART Data for an SSD are the same as on a mechanical hard drive, although there are some differences due to the flash-based storage technology of the SSD. For example, parameter 5 is no longer the reallocated sector count as on a mechanical hard drive, but now Retired NAND Blocks.
Another useful parameter is 202, which is the "Percentage Lifetime Used." This is not reported for mechanical hard drives, but takes advantage of the fact that SSDs have a predicted lifespan based on a finite number of write cycles according to the manufacturer. For example, on Intel SSDs, this is based on 20GB of data written daily, over five years, which is a total of about 37TB of data written (and the 20GB daily number is higher than what the majority of home users will write on a daily basis). The "Percentage Lifetime Used" tracks the amount of data written to date and expresses it as a share of the SSD's rated lifespan. This is another way to figure out if a drive is close to a failure, or has plenty of life left in it.
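The arithmetic behind that figure is easy to check. The sketch below works out the rated total from the 20GB-per-day, five-year example and then estimates a percentage used from a hypothetical amount of data written. It is a simplification: the function names are invented here, decimal units (1TB = 1,000GB) are assumed, and real drive firmware may calculate the value differently.

```python
GB_PER_TB = 1000  # assumption: decimal units, 1TB = 1,000GB


def rated_writes_tb(gb_per_day, years):
    """Total rated writes in TB for a daily write budget over a number of years."""
    return gb_per_day * 365 * years / GB_PER_TB


def percent_lifetime_used(written_tb, rated_tb):
    """Share of the rated writes consumed so far, as a percentage."""
    return 100 * written_tb / rated_tb


rated = rated_writes_tb(20, 5)
print(rated)  # 36.5, which rounds to the 37TB quoted above

# Hypothetical: 3.65TB written so far.
print(round(percent_lifetime_used(3.65, rated), 1))
```

At a lighter 10GB per day, the same rated total would take ten years to reach, which is why most home users are unlikely to exhaust it.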
An ounce of prevention...
A storage drive failure is a reality that most users will encounter at some point. With the use of software, such as Crystal Disk Info, or manufacturer-specific tools, SMART Data can be obtained from the hard drive. With an understanding of the value and limitations of SMART data, as well as which parameters to keep an eye on, you can get a feel for the health of your drives.