Blake Gonzales

As discussed earlier, a failed memory bit in either an SMP or a clustered system can cause data integrity issues or even a crash of the entire system. Error-correcting code (ECC) memory was introduced to mitigate memory bit failures by detecting and correcting them.
First, let us address the data integrity issues associated with memory failures. There are two issues to address here: 1) detecting data integrity issues, and 2) handling data integrity failures. In a typical memory system, there is always a chance that one or more bits will flip without the memory subsystem failing completely. The risk of this happening is actually quite high, although it has decreased somewhat over the years as memory reliability has improved. The problem is that you may not know that memory errors have occurred, because the system continues to run. This can easily lead to data integrity issues, which can show up as erroneous data, incorrect calculations, or errors in data saved to the file system. Not being able to detect data integrity issues can become a severe failure, because you may not notice the damage until long after the original error occurred.

As stated above, the first concern is detecting that a memory failure has occurred and that data may therefore be corrupted. The introduction of ECC memory has solved many of the problems associated with memory failures. ECC memory adds extra memory that stores check (parity) data each time data is written to the memory subsystem. When data is read back from memory, the check data is consulted to ensure that the data being read is not corrupted. If corruption is detected, the hardware attempts to correct the bad bit by reconstructing it from the check data. Generally, memory errors are successfully corrected by ECC memory, and the system continues to process data without incident. Sometimes, though, the error is too severe to repair (for example, when too many bits in the same word have flipped), and the result is an uncorrectable memory error.
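
To make the write, check, and correct cycle concrete, here is a minimal sketch of a single-error-correcting, double-error-detecting code in Python. It is only an illustration: it assumes an extended Hamming code protecting 4 data bits with 4 check bits, whereas real ECC memory does the same job in hardware over much wider words, and the exact code used varies by platform. The names encode and decode are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (a list of 0/1 values).

    Index 0 holds an overall parity bit; indexes 1..7 are Hamming positions,
    with check bits at positions 1, 2 and 4 and data bits at 3, 5, 6 and 7.
    """
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]      # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]      # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]      # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2                # makes the total number of 1s even
    return code


def decode(code):
    """Check a codeword and return (nibble, status).

    status is "ok", "corrected" (one bit was flipped and repaired) or
    "uncorrectable" (two bits were flipped; nibble is None in that case).
    """
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                    # XOR of the positions holding a 1
    overall_odd = sum(code) % 2 == 1           # True if overall parity is violated

    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Exactly one bit flipped: the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with intact overall parity: two bits flipped.
        return None, "uncorrectable"

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1                               # simulate a single flipped bit
    print(decode(word))
    word[2] ^= 1                               # a second flip in the same word
    print(decode(word))
```

A single flipped bit is repaired silently, which corresponds to a correctable error. A second flip in the same word can still be detected but not repaired, which corresponds to the uncorrectable case described above.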

The question, then, is how to report these errors to the operating system to ensure data integrity. On most modern Linux and UNIX systems, the ECC memory is tightly coupled with the operating system, and errors are reported to the kernel and logged. In this way, an administrator can watch for high rates of correctable errors (e.g., 100 correctable errors in 24 hours) that indicate repair is warranted. In the event of an uncorrectable error, the kernel is notified and the operating system will generally panic and shut down the system. Why should the operating system panic instead of just logging the uncorrectable error? Once an uncorrectable error has occurred, the data integrity of the system is compromised, and the safest thing to do is shut down and repair the memory.
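
As a rough sketch of what watching for a high correctable error rate could look like, the Python below sums the correctable error counters that the Linux EDAC driver exposes under /sys/devices/system/edac/mc and flags the system when the count rises by a threshold within a time window. The 100 errors in 24 hours figure is just the example from the text, not a vendor recommendation. The names CorrectableErrorMonitor and read_total_ce_count are made up for this example, and it assumes an EDAC driver is loaded for your memory controller.

```python
from collections import deque
from pathlib import Path


def read_total_ce_count(edac_root="/sys/devices/system/edac/mc"):
    """Sum the cumulative correctable error counts of all memory controllers.

    Assumes the Linux EDAC sysfs layout (mc0/ce_count, mc1/ce_count, ...).
    Returns 0 if no counters are found.
    """
    total = 0
    for counter in Path(edac_root).glob("mc*/ce_count"):
        total += int(counter.read_text().strip())
    return total


class CorrectableErrorMonitor:
    """Flags when correctable errors grow by `threshold` within a time window."""

    def __init__(self, threshold=100, window_seconds=24 * 3600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, cumulative count), oldest first

    def record(self, timestamp, ce_count):
        """Store one counter sample; return True if repair looks warranted.

        The increase is measured against the oldest sample still inside the
        window, so the result is only as precise as the sampling interval.
        """
        self.samples.append((timestamp, ce_count))
        cutoff = timestamp - self.window_seconds
        while self.samples[0][0] < cutoff:
            self.samples.popleft()
        baseline = self.samples[0][1]
        return ce_count - baseline >= self.threshold


if __name__ == "__main__":
    import time

    monitor = CorrectableErrorMonitor()
    if monitor.record(time.time(), read_total_ce_count()):
        print("High correctable error rate: schedule a memory repair")
```

In practice the sample would be taken on a schedule (for example from cron or a monitoring agent) and the monitor state kept between runs. Uncorrectable errors are deliberately left out here, since the kernel handles those itself as described above.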

Next Up… HEAD/LOGIN NODES


-- Blake Gonzales, Dell HPC Scientist

See other posts from this Blog series:
INTRO
SMP
CLUSTERED SYSTEMS
CLUSTERED SYSTEM INFRASTRUCTURE
POWER DISTRIBUTION
COOLING