
RAID and Data Storage Protection Solutions for Linux

When a system administrator is first asked to provide a reliable, redundant means of protecting critical data on a server, RAID is usually the first term that comes to mind. In fact, RAID is just one part of an overall data availability architecture. RAID and some of the complementary storage technologies are reviewed below.

RAID, short for Redundant Array of Inexpensive Disks, is a method whereby information is spread across several disks, using techniques such as disk striping (RAID Level 0) and disk mirroring (RAID Level 1) to achieve redundancy, lower latency and/or higher bandwidth for reading and/or writing, and recoverability from hard-disk crashes. More than six different RAID configurations have been defined. A brief introduction can be found in Mike Neuffer's What Is RAID? page.


Types of Data Loss

Many users come to RAID with the expectation that using it will prevent data loss. This is expecting too much: RAID can help avoid data loss, but it can't prevent it. To understand why, and to be able to plan a better data protection strategy, it is useful to understand the different types of failures, and the way they can cause data loss.

Accidental or Intentional Erasure
A leading cause of data loss is the accidental or intentional erasure of files by you or another (human) user. This includes files that were erased by hackers who broke into your system, files that were erased by disgruntled employees, and files erased by you, thinking that they weren't needed any more, or due to a sense of discovery, to find out what old-timers mean when they say they fixed it for good by using the wizardly command su - root; cd /; rm -r *. RAID will not help you recover data lost in this way; to mitigate these kinds of losses, you need to perform regular backups (to archive media that aren't easily lost in a fire, stolen, or accidentally erased).

Total Disk Drive Failure
One possible disk drive failure mode is "complete and total disk failure". This can happen when a computer is dropped or kicked, although it can also happen due to old age (of the drive). Typically, the read head crashes into the disk platter, thereby trashing the head and rendering everything on that platter unreadable. If the disk drive has only one platter, this means everything. Failure of the drive electronics (e.g. due to electrostatic discharge, or moisture buildup in capacitors) can result in the same symptoms. This is the pre-eminent failure mode that RAID protects against: because data is spread redundantly across many disks, the total failure of any one disk will not cause any actual data loss.

Power Loss and Ensuing Data Corruption
Many beginners think that they can test RAID by starting a disk-access intensive job, and then unplugging the power while it is running. This is almost guaranteed to cause some kind of data corruption, and RAID does nothing to prevent it or to recover the resulting lost data. This kind of data corruption/loss can be avoided by using a journaling file system, and/or a journaling database server (to avoid data loss in a running SQL server when the system goes down). In discussions of journaling, there are typically two types of protection that can be offered: journaled meta-data, and journaled (user's) data. The term "meta-data" refers to the file name, the file owner, creation date, permissions, etc., whereas "data" is the actual contents of the file. By journaling the meta-data, a journaling file system can guarantee fast system boot times, by avoiding long integrity checks during boot. However, journaling the meta-data does not prevent the contents of the file from getting scrambled. Note that most journaling file systems journal only the meta-data, and not the data. (Ext3fs can be made to journal data, but at a tremendous performance loss). Note that databases have their own unique ways of guaranteeing data integrity in the face of power loss or system crash.
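
For example, ext3 can be told to journal file data as well as meta-data at mount time; the device and mount point below are arbitrary placeholders, and this is only a sketch:

    # Mount an existing ext3 file system with full data journaling
    mount -o data=journal /dev/sda2 /mnt/data
    # Or make the setting permanent with an /etc/fstab entry such as:
    #   /dev/sda2   /mnt/data   ext3   data=journal   0  2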

Bad Blocks on Disk Drive
The most common form of disk drive failure is a slow but steady loss of 'blocks' on the disk drive. Blocks can go bad in a number of ways: the thin film of magnetic media can separate or slide on its underlying disk platter; the film of magnetic media can span a pit or gouge in the underlying platter, and eventually, like a soap bubble, it can pop. Although disk drives have filters that prevent dust from entering, the filters will not keep out humidity, and slow corrosion can set in. Mechanical abrasion can occur in several ways: the disk head can smash into the disk; alternately, a piece of broken-off media can temporarily jam under the head, or can skitter across the disk platter. Disk head crashes can be caused by kicking or hitting the CPU cabinet; they can also be caused by vibration induced by cooling fans, construction work in the room, etc. There are many other mechanical causes leading to (permanent) bad blocks. In addition, there are also "soft" or corrupted blocks: in modern hard drives, the size of one bit is so small that ordinary thermal noise (Boltzmann noise) is sufficient to occasionally flip a bit. This occurs so frequently that it is normally handled by the disk firmware: modern disk drives store ECC bits to detect and correct such errors. The number of ECC-corrected errors on a disk can be monitored with smartmon tools. Although on-disk ECC correction is sufficient to correct most soft errors, a tiny fraction will remain uncorrectable. Such soft errors damage the data, but do not render the block permanently (physically) unusable. Other soft errors are described in the next section below.
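
For example, the per-drive error counters can be inspected with the smartctl utility from the smartmontools package; the device name below is a placeholder, and the exact attribute names vary from vendor to vendor:

    # Dump the vendor-specific SMART attribute table for the drive
    smartctl -A /dev/sda
    # Attributes of interest typically include Hardware_ECC_Recovered,
    # Reallocated_Sector_Ct and Current_Pending_Sector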

Over time, bad blocks can accumulate, and, from personal experience, do so as fast as one a day. Once a block is bad, data cannot be (reliably) read from it. Bad blocks are not uncommon: all brand new disk drives leave the factory with hundreds (if not thousands) of bad blocks on them. The hard drive electronics can detect a bad block, and automatically reassign in its place a new, good block from elsewhere on the disk. All subsequent accesses to that block by the operating system are automatically and transparently handled by the disk drive. This feature is both good and bad. As blocks slowly fail on the drive, they are automatically handled until one day the bad-block lookup table on the hard drive is full. At this point, bad blocks become painfully visible to the operating system: Linux will grind to a near halt, while spewing dma_intr: status=0x51 { DriveReady SeekComplete UnrecoverableError } messages.

Using RAID can mitigate the effect of bad blocks. A Linux md-based software RAID array can be forced to run a check/repair sequence by writing the appropriate command to /sys/block/mdX/md/sync_action (see RAID Administration commands, and also below, for details). During repairs, if a disk drive reports a read error, the RAID array will attempt to obtain a good copy of the data from another disk, and then write the good copy onto the failing disk. Assuming the disk has spare blocks for bad-block relocation, this should trigger the bad-block relocation mechanism of the disk. If the disk no longer has spare blocks, then syslog error messages should provide adequate warning that a hard drive needs to be replaced. In short, RAID can protect against bad blocks, provided that the disk drive firmware is correctly detecting and reporting bad blocks. For the case of general data corruption, discussed below, this need not be the case.
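
A typical check-and-repair session on an md array might look like the following sketch (assuming the array is /dev/md0, and run as root):

    # Start a background consistency check of the whole array
    echo check > /sys/block/md0/md/sync_action
    # Watch the progress of the check
    cat /proc/mdstat
    # When the check finishes, see how many inconsistent blocks were found
    cat /sys/block/md0/md/mismatch_cnt
    # Rewrite inconsistent blocks, forcing reads (and relocation) of failing sectors
    echo repair > /sys/block/md0/md/sync_action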

General System Corruption
All computer systems are subject to a low-level but persistent corruption of data. Very rarely, a bit in system RAM will flip, due to any one of a large number of reasons, including Boltzmann thermal energy, ground bounce in signal paths, a low level of natural radioactivity in the silicon, and cosmic rays! Similar remarks apply to the CPU itself, as well as all on-chip busses. Off-chip busses (cabling), including IDE, SATA and SCSI cabling, are subject to ground bounce, clock skew, electrical interference, kinks in the wires, bad termination, etc. The size of an individual bit in modern disk drives is now so small that even ordinary thermal noise (Boltzmann noise) will occasionally flip a bit. Not to be ignored is that obscure software or system bugs can also lead to corrupted data. Some components include mechanisms to minimize such errors. For example, mid-range/high-end RAM chips include parity or ECC bits, or employ chipkill-based technology. Serial lines, including Ethernet and SATA, commonly use parity and checksumming to avoid errors. High-end disk drives store parity bits on the disk itself; unfortunately, such high-end drives typically cost about seven times more than consumer-grade disk drives (and/or offer one-seventh the capacity).

If some random "soft" error created in RAM is written to disk, it becomes enshrined and persistent, unless some step is taken to repair it. Random bit-flips on-disk are, by definition, persistent. As these random errors accumulate, they can render a system unusable, and permanently spoil data. Unfortunately, regular data backups do little to avoid this kind of corruption: most likely, one is backing up corrupted data.

Despite this being a common disk failure mode, there is very little (almost nothing) that can be done about it, at least on current Linux with current open source tools. RAID, even in theory, does not address this problem, nor does file system journaling. At this point, I am aware of only a small number of options.

Woe is I! Over the last 15 years, I've retired over 25 hard drives with bad blocks, while managing a small stable of four or five servers and home computers. This works out to a per-machine failure rate of less than one every three years, but, multiplied by the number of machines, it adds up. Most recently, I installed a brand new WDC SATA drive, only to discover weeks later that it was silently corrupting my data, and at a phenomenal rate: dozens of files a day. It took weeks of crazy system instability before I realized what was going on: the drive was really, really cheap! Defective from the factory! The silent part of the corruption was particularly disturbing: at no point did any Linux system component tell me what was really happening. Yet, this could have been avoided. Woe is I.

This lack of data error detection and data error correction options for Linux prompts the following wish-list:

Better smartmon integration
On ordinary, consumer-grade PC's, the smartmon tools should be integrated into the desktop panel: much like a low-battery alert on a laptop, there should be a badblocks/impending failure alert. Similarly, DriveReady/SeekComplete errors appearing in the syslog should also be promptly reflected in panel/dock status applets.

ECC integration into RAID
With only "minor" changes, software ECC could be incorporated into Linux RAID. Currently, Linux RAID provides no protection against corrupted data at all. In RAID-1, reads are balanced across both disks: any given block may be returned from either disk. Data integrity could be improved by reading both disks and comparing the results: a mismatch would immediately indicate a problem. However, a mis-compare does not tell you which block is bad, and which one is good. The echo check > /sys/block/mdX/md/sync_action command, discussed above, performs such a compare. However, the subsequent repair action has only a 50-50 chance of picking the right block (when the drive itself doesn't signal a read error).

Similarly, RAID-5 stores parity bits, but does not actually use them for data integrity checks. RAID-6 stores a second set of parity bits, but does not use these in an ECC-like fashion. A "simple" modification of RAID-6 could, in principle, store ECC bits, and then use these for recovering from a bad block.

File system integrity monitors
Long-term archival data can slowly accumulate both soft and hard errors, even if the data is never explicitly accessed. It would be good to have an automatic archive monitoring tool that scans the disk for errors and reports them as they occur. Such scanning can be achieved by computing file checksums, periodically copying the files, and verifying that the checksums of the old and new copies match. A mismatch indicates that either the old file or the new copy has been damaged by soft or hard errors.
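
In the absence of such a tool, a crude variant (re-reading files rather than copying them) can be improvised with ordinary shell utilities; the directory and manifest file names below are arbitrary examples:

    # Record a checksum manifest for an archive tree (re-run after deliberate changes)
    find /srv/archive -type f -exec sha256sum {} + > /var/lib/archive.sha256
    # Later, re-read every file and report only the mismatches
    sha256sum --quiet -c /var/lib/archive.sha256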

Currently, all file system integrity tools, such as Tripwire, AIDE or FCheck, are aimed at intrusion detection, and not at data decay. This means that all file changes are assumed to be malicious, even if they were initiated by the user. This makes them impractical for day-to-day operation on a normal file system, where user-driven actions cause many files to be added, modified and removed regularly. They are also inappropriate for triggering bad-block replacement mechanisms, since unchanged files are never physically moved about the disk. (In most drives, physically writing a disk block will trigger the firmware's bad-block replacement algorithms; simply reading a block will not.)

A core assumption of such a file-system integrity checker is that on-disk data corruption is far more frequent than data corruption due to spontaneous bit flips in RAM or other system components. If corruption in other system components were common, the likelihood of false positives would increase: good on-disk data would be mis-identified as bad.


Disk controller failure
Disk controller failure does not normally lead to data loss, but it can lead to system downtime. High-availability systems typically use multiple controllers, each with its own cabling to a disk drive, to minimize the impact of disk controller failure. The Linux multi-path driver supports such systems.

Linux RAID Solutions

There are three types of RAID solutions available to Linux users: software RAID, outboard DASD boxes, and RAID disk controllers.

Software RAID
Pure software RAID implements the various RAID levels in the kernel disk (block device) code. Pure-software RAID offers the cheapest possible solution: not only are expensive disk controller cards or hot-swap chassis not required, but software RAID works with cheaper IDE disks as well as SCSI disks. With today's fast CPU's, software RAID performance can hold its own against hardware RAID in all but the most heavily loaded and largest systems. The current Linux Software RAID is becoming increasingly fast, feature-rich and reliable, making many of the lower-end hardware solutions uninteresting. Expensive, high-end hardware may still offer advantages, but the nature of those advantages is not entirely clear.

The basic Linux Software RAID implementation is provided by the md (multi-disk) driver, which has been around since the late 1990's. The md driver supports linear concatenation, RAID-0, RAID-1, RAID-4, RAID-5, RAID-6 and RAID-10 arrays, as well as multipath configurations; arrays are created, assembled and monitored with the mdadm userspace tool, with support for hot spares, background resynchronization and the consistency checking described above.
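
As a minimal sketch (the device names are placeholders), a two-disk mirror can be created and watched with mdadm:

    # Create a RAID-1 (mirrored) array from two partitions
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    # Check array status and resynchronization progress
    cat /proc/mdstat
    mdadm --detail /dev/md0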

Outboard DASD Solutions
DASDs (Direct Access Storage Devices, an old IBM mainframe term) are separate boxes that come with their own power supply, provide a cabinet/chassis for holding the hard drives, and appear to Linux as just another SCSI device. In many ways, these offer the most robust RAID solution. Most boxes provide hot-swap disk bays, where failing disk drives can be removed and replaced without turning off power. Outboard solutions usually offer the greatest choice of RAID levels: RAID 0, 1, 3, 4 and 5 are common, as well as combinations of these levels. Some boxes offer redundant power supplies, so that a failure of a power supply will not disable the box. Finally, with Y-SCSI cables, such boxes can be attached to several computers, allowing high availability to be implemented, so that if one computer fails, another can take over operations.

Because these boxes appear as a single drive to the host operating system, yet are composed of multiple SCSI disks, they are sometimes known as SCSI-to-SCSI boxes. Outboard boxes are usually the most reliable RAID solutions, although they are usually the most expensive (e.g. some of the cheaper offerings from IBM are in the twenty-thousand dollar ballpark). The high-end of this technology is frequently called 'SAN' for 'Storage Area Network', and features cable lengths that stretch to kilometers, and the ability for a large number of host CPU's to access one array.

Inboard DASD Solutions
Similar in concept to outboard solutions, there are now a number of bus-to-bus RAID converters that will fit inside a PC case. These come in several varieties. One style is a small disk-like box that fits into a standard 3.5 inch drive bay, and draws power from the power supply in the same way that a disk would. Another style will plug into a PCI slot, and use that slot only for electrical power (and the space it provides).

Both SCSI-to-SCSI and EIDE-to-EIDE converters are available. Because these are converters, they appear as ordinary hard drives to the operating system, and do not require any special drivers. Most such converters seem to support only RAID 0 (striping) and 1 (mirroring), apparently due to size and cabling restrictions.

The principal advantages of inboard converters are price, reliability, ease-of-use, and in some cases, performance. Disadvantages are usually the lack of RAID-5 support, lack of hot-plug capabilities, and the lack of dual-ended operation.

RAID Disk Controllers
Disk controllers are adapter cards that plug into the PCI bus. Just like regular disk controller cards, a cable attaches them to the disk drives. Unlike regular disk controllers, RAID controllers will implement RAID on the card itself, performing all necessary operations to provide various RAID levels. Just as with outboard boxes, the Linux kernel does not know (or need to know) that RAID is being used. However, just like ordinary disk controllers, these cards must have a corresponding device driver in the Linux kernel to be usable.

If the RAID disk controller has a modern, high-speed DSP/controller on board, and a sufficient amount of cache memory, it can outperform software RAID, especially on a heavily loaded system. However, using an old controller on a modern, fast 2-way or 4-way SMP machine may easily prove to be a performance bottleneck as compared to a pure software-RAID solution. Some of the performance figures in the (now obsolete) RAID Reviews, linked below, provide additional insight into this claim.

Related Data Storage Protection Technologies

There are several related storage technologies that can provide various amounts of data redundancy, fault tolerance and high-availability features. These are typically used in conjunction with RAID, as a part of the overall system data protection design strategy.

SAN and NAS
There are a variety of high-end storage solutions available for large installations. These typically go under the acronyms 'NAS' and 'SAN'. NAS abbreviates 'Network Attached Storage', and refers to NFS and Samba servers that Unix and Windows clients can mount. SAN abbreviates 'Storage Area Network', and refers to schemes that are the conceptual equivalent of thousand-foot-long disk-drive ribbon cables. Although the cables themselves may be fiber-optic (Fibre-Channel) or Ethernet (e.g. iSCSI), the attached devices appear to be 'ordinary disk drives' from the point of view of the host computer. These systems can be quite sophisticated: for example, this white-paper describes a SAN-like system that has built-in RAID and LVM features.

Journaling
Journaling refers to the concept of having a file system write a 'diary' of information to the disk in such a way as to allow the file system to be quickly restored to a consistent state after a power failure or other unanticipated hardware/software failure. A journaled file system can be brought back on-line quickly after a system reboot, and, as such, is a vital element of building a reliable, available storage solution.

There are a number of journaled file systems available for Linux, including ext3, ReiserFS, XFS and JFS.

These different systems have different performance profiles and differ significantly in features and functions. There are many articles on the web which compare these. Note that some of these articles may be out-of-date with respect to features, performance or reputed bugs.

LVM
Several volume management systems are available for Linux; the best-known of these is LVM, the Logical Volume Manager. LVM implements a set of features and functions that resemble those found in traditional LVM systems on other Unixes. The Linux LVM (like all traditional Unix volume management systems) provides an abstraction of the physical disks that makes it easier to administer large file systems and disk arrays. It does this by grouping sets of disks (physical volumes) into a pool (volume group). The volume group can in turn be carved up into virtual partitions (logical volumes) that behave just like ordinary disk block devices, except that (unlike disk partitions) they can be dynamically grown, shrunk and moved about without rebooting the system or entering into maintenance/standalone mode. A file system (or a swap space, or a raw block device) sits on top of a logical volume. In short, LVM adds an abstraction layer between the file system mount points (/, /usr, /opt, etc.) and the hard drive devices (/dev/hda, /dev/sdb2, etc.).

The benefit of LVM is that you can add and remove hard drives, and move data from one hard drive to another without disrupting the system or other users. Thus, LVM is ideal for administering servers to which disks are constantly being added, removed or simply moved around to accommodate new users, new applications or just provide more space for the data. If you have only one or two disks, the effort to learn LVM may outweigh any administrative benefits that you gain.

Linux LVM and Linux Software RAID can be used together, although neither layer knows about the other, and some of the advantages of LVM seem to be lost as a result. The usual way of using RAID with LVM is as follows (a command sketch is given after the list):

  1. Use fdisk (or cfdisk, etc.) to create a set of equal-sized disk partitions.
  2. Create a RAID-5 (or other RAID level array) across these partitions.
  3. Use LVM to create a physical volume on the RAID device. For instance, if the RAID array was /dev/md0, then pvcreate /dev/md0.
  4. Finish setting up LVM as normal.
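
A command sketch of the above recipe, with hypothetical device, volume group and logical volume names, might look like this:

    # Step 2: build a RAID-5 array across three equal-sized partitions
    mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1
    # Step 3: put an LVM physical volume on the array
    pvcreate /dev/md0
    # Step 4: create a volume group, carve out a logical volume, and put a file system on it
    vgcreate vg0 /dev/md0
    lvcreate -L 100G -n lv_data vg0
    mkfs.ext3 /dev/vg0/lv_data
    # The logical volume can later be grown without repartitioning the disks
    lvextend -L +20G /dev/vg0/lv_data
    resize2fs /dev/vg0/lv_data
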
In this scenario, although LVM can still be used to dynamically resize logical volumes, one does lose the benefit of adding and removing hard drives willy-nilly. Linux RAID devices cannot be dynamically resized, nor is it easy to move a RAID array from one set of drives to another. One must still do space planning in order to have RAID arrays of the appropriate size. This may change: note that LVM is in the process of acquiring mirroring capabilities, although RAID-5 for LVM is still not envisioned.

Another serious drawback of this RAID+LVM combo is that neither Linux Software RAID (MD) nor LVM has any sort of bad-block replacement mechanism. If (or rather, when) disks start manifesting bad blocks, one is up a creek without a paddle.

Veritas
The Veritas Foundation Suite is a storage management software product that includes an LVM-like system. The following very old press release announces this system: VERITAS Software Unveils Linux Strategy and Roadmap (January 2000). It seems that it is now available for IBM mainframes running Linux (August 2003)!

Diagnostic and Monitoring Tools

Sooner or later, you will feel the need for tools to diagnose hardware problems, or simply monitor the hardware health. Alternately, some rescue operations require low-level configuration tools. In this case, you might find the following useful:

smartmontools
The smartmontools package provides a set of utilities for working with the Self-Monitoring, Analysis and Reporting Technology (SMART) system built into modern IDE/ATA/PATA, SATA and SCSI-3 disks. These tools can report a variety of disk drive health statistics, and the smartd daemon can run continuously to log events into the syslog.
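
For example (the device name is a placeholder), drive health can be queried and self-tests scheduled with smartctl, while smartd watches the drive continuously:

    # Overall health verdict, plus the full attribute and error-log dump
    smartctl -H /dev/sda
    smartctl -a /dev/sda
    # Run a short self-test, then read back the results
    smartctl -t short /dev/sda
    smartctl -l selftest /dev/sda
    # Sample /etc/smartd.conf line: monitor /dev/sda and mail root on trouble
    #   /dev/sda -a -m root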

scsirastools
"This project includes changes that enhance the Reliability, Availability and Serviceability (RAS) of the drivers that are commonly used in a Linux software RAID-1 configuration. Other efforts have been made to enable various common hardware RAID adapters and their drivers on Linux." The project is slightly misnamed: the Linux scsi layer handles all modern SATA drives, as well as FCP, SAS and USB drives, and thus is applicable to most all modern hardware.

The package contains low level utilities including sgdskfl to load disk firmware, sgmode to get and set mode pages, sgdefects to read defect lists, and sgdiag to perform format and other test functions.

sg3_utils
The sg3_utils package provides a set of utilities for use with the Linux SCSI Generic (sg) device driver. This driver supports modern SATA and USB-connected disks, as well as SCSI, FCP, SAS disks. The utilities include sg variants for the traditional dd command, tools for scanning and mapping the SCSI bus, tools for issuing low-level SCSI commands, tools for timing and testing, and some example source & miscellany.
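
A few illustrative invocations (device names are placeholders; see the package documentation for the complete tool list):

    # List sg devices and show how they map onto other device nodes
    sg_map -i
    # Send an INQUIRY and a READ CAPACITY command to a device
    sg_inq /dev/sg0
    sg_readcap /dev/sg0
    # sg_dd is the sg-aware variant of dd mentioned above
    sg_dd if=/dev/sg0 of=/dev/null bs=512 count=1024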

The sg3_utils web page is also notable for providing a nice cross-reference to other diagnostic and monitoring tools.


Obsolete/Historical data

This page was originally created in 1996, and only sporadically updated. A copy of some of the old/obsolete data formerly on this page can be found on the Obsolete RAID page.

Also, the RAID Reviews contains some product reviews and performance benchmarks, circa 1998, that were originally a part of this web page. Obsolete/unmaintained.


History

Last updated July 2008 by Linas Vepstas ([email protected])

Copyright (c) 1996-1999, 2001-2003, 2008 Linas Vepstas.
Copyright (c) 2003 Douglas Gilbert <[email protected]>

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is included at the URL http://www.linas.org/fdl.html, the web page titled "GNU Free Documentation License".

The phrase 'Enterprise Linux' is a trademark of Linas Vepstas.
All trademarks on this page are property of their respective owners.