Data availability is defined as having your data accessible at
all times. There are two components to data availability: data
integrity and fault tolerance.
Data Integrity. Data integrity means getting the
correct data, every time. Most RAID solutions offer dynamic sector
repair, where sectors that have become defective because of soft
media errors are repaired on the fly. The real differentiating factor
is the amount of error detection and error correction code provided.
Software-based RAID typically relies on the standard SCSI bus for
data integrity protection; the bus can detect 1-bit errors but
has no ability to correct any errors. Hardware-based RAID
solutions usually contain more robust code. For instance,
Adaptec's hardware-based RAID solutions not only detect 4-bit
errors, but also correct 1-bit errors on the entire data path from
the storage media to the host system bus. You could purchase ECC
main memory to augment the SCSI bus capabilities of your
software-based RAID solution, but this is an additional cost.
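The difference between detecting and correcting an error can be made concrete with a small sketch. It contrasts a single even-parity bit, which can only report that a 1-bit error occurred, with a textbook Hamming(7,4) code, which can also locate and repair it. This is an illustration only: the function names are hypothetical, and the Hamming code stands in for error-correcting codes in general, not for the code used on any particular bus or RAID card.

```python
def parity_bit(bits):
    """Even-parity bit for a list of 0/1 values."""
    return sum(bits) % 2


def parity_check(bits, parity):
    """True if the stored parity still matches the data.

    A mismatch reveals that an odd number of bits flipped, but not which
    one, so the error can be detected and never corrected.
    """
    return parity_bit(bits) == parity


def hamming74_encode(data):
    """Encode 4 data bits as the codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error found."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # 1-based index of the flipped bit
    if position:
        c[position - 1] ^= 1  # repair the single-bit error in place
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    damaged = hamming74_encode(data)
    damaged[4] ^= 1  # simulate a 1-bit error in transit
    print(hamming74_decode(damaged))
```

With parity alone, the only recovery from a detected error is to re-send the data; with a correcting code, the receiver repairs the bit itself.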
Frequently, software-based RAID is used in conjunction with
less robust packaging in installations where low initial cost is
the primary consideration. These packaging configurations may not
support drive hot swap.[7] If a drive is accidentally pulled out, the
enclosure may not notify the card of the need to re-issue commands,
thereby risking data loss or corrupted data - a potential cost that
may far outweigh the initial savings. And without hot swap, if a
drive in a RAID array fails, you must wait until you can schedule
system downtime to replace it.
Since most software- and hardware-based RAID solutions
support drive hot swap, the key to eliminating these potential
problems is making sure the disk drive enclosure supports it as
well.
Fault Tolerance. Fault tolerance is defined as
maintaining data availability in the event of one or more failures
in the system. The most common method of achieving fault tolerance
on servers and workstations today is RAID technology.
Each RAID level offers different tradeoffs among performance,
cost, and availability, and as such, it may be appropriate to use
different RAID levels for different applications - even on the
same server or workstation. RAID 0 (i.e., striping) should only be
used in high performance applications that can afford downtime
and/or lost data. Critical files in which an outage would severely
cripple business activities, such as boot drives, would best be
protected using RAID 1 (i.e., mirroring), or for even better
performance, RAID 0/1 (mirrored striping). Most applications can
best be protected by RAID 5 (striped parity), which offers the
best balance between performance, cost and availability.
Typically, these applications can't afford downtime but can
tolerate somewhat degraded performance in exchange for a lower
cost of data protection.
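The cost and availability tradeoff behind RAID 5 can be sketched in a few lines. The parity block of a stripe is the byte-wise XOR of its data blocks, so any one lost block can be recomputed from the survivors. The sketch below is a simplification under stated assumptions: the function names are hypothetical, parity is always stored last in the stripe (real implementations rotate it across drives), and the capacity figures assume two-way mirrors and equal-sized drives.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild_block(stripe, failed_index):
    """Recompute the block on a failed drive from the surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


def usable_drives(level, drives):
    """Drives' worth of usable capacity for a given RAID level.

    Assumes a valid drive count for the level (an even number for
    mirroring, at least three drives for RAID 5).
    """
    if level == "0":
        return drives          # striping only, no redundancy
    if level in ("1", "0/1"):
        return drives // 2     # every block is stored twice
    if level == "5":
        return drives - 1      # one drive's worth holds parity
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
    print(rebuild_block(stripe, 1))  # contents of the failed second drive
    for level in ("0", "1", "0/1", "5"):
        print(level, usable_drives(level, 4))
```

The rebuild step also shows where the degraded performance comes from: while a drive is missing, every read of its data requires reading all the surviving drives in the stripe.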
With NT's embedded RAID software, your choices are: RAID 0
(i.e., no protection) for workstations, and RAID 0, 1, or 5 for
servers. Microsoft's Windows NT Server 4.0 Enterprise Edition
cluster server services use hardware-based RAID exclusively.
Novell NetWare offers RAID 1. Very high-end servers sometimes
employ software-based remote mirroring (RAID 1) for disaster
protection, such as that offered on EMC's Symmetrix series or by
Compaq/Digital's OpenVMS operating system.
Entry-level hardware-based RAID solutions typically feature
RAID 0, 1, and 5 and occasionally RAID 0/1. Higher-end
hardware-based solutions usually offer RAID 3 and sometimes other
variations, but those are beyond the scope of this paper. So, if
you want RAID protection on your workstation or clustered server,
or if you want more RAID choices, such as RAID 0/1 or RAID 3, your
only choice is a hardware-based solution.
Another important difference between software- and
hardware-based RAID is the ability to RAID-protect the boot disk
drive. Windows NT does not allow you to RAID-protect your boot
drive, so if it fails, your system goes down. You must then
determine the cause of the downtime, either swap in a spare boot
drive or replace the failed drive and restore its data from tape or
other secondary storage media, and reboot the system. During this time
your system is unavailable to users. Hardware-based RAID solutions
make no exceptions for RAID protection based on the data contained
on the disk drive.
But RAID protection is just the first step in achieving higher
fault tolerance. Hot sparing with automatic recovery is another
important consideration when evaluating RAID solutions - one that
is only offered on hardware-based solutions. With hot spares, if a
drive in the array fails, the RAID card automatically detects the
failed drive, replaces it with a spare drive, and reconfigures the
array with the new drive - all while the system is running and
data remains available to users. Some cards allow the system
administrator to define the priority of drive reconstruction (for
instance, low, medium, or high priority) to optimize tradeoffs
between availability requirements (i.e., exposure to a second
drive failure) and application performance during data
reconstruction.
More sophisticated hardware-based RAID solutions also offer the
option of either dedicating spares to each array, or using a pool
of spares for all arrays to draw on. Using dedicated spares on the
most critical applications eliminates contention for spares in the
event of multiple drive failures. Pooling spares is a more
cost-effective way to maintain data availability and is appropriate
for less critical applications.
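The difference between dedicated and pooled spares comes down to the order in which a controller looks for a replacement drive. The following is a minimal sketch of that selection logic, not any vendor's firmware: the Array and SpareManager names, the drive labels, and the "dedicated first, then pool" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Array:
    """A RAID array with its member drives and any spares reserved for it."""
    name: str
    drives: list
    dedicated_spares: list = field(default_factory=list)


class SpareManager:
    """Hands out hot spares: an array's own spares first, then the pool."""

    def __init__(self, pool=None):
        self.pool = list(pool or [])

    def replace_failed(self, array, failed_drive):
        """Swap a failed drive for a spare; return the spare used, or None.

        None means no spare was available, so the array stays degraded
        and exposed to a second drive failure.
        """
        if failed_drive not in array.drives:
            raise ValueError(f"{failed_drive} is not a member of {array.name}")
        if array.dedicated_spares:
            spare = array.dedicated_spares.pop(0)
        elif self.pool:
            spare = self.pool.pop(0)
        else:
            return None
        array.drives[array.drives.index(failed_drive)] = spare
        return spare


if __name__ == "__main__":
    manager = SpareManager(pool=["pool-spare"])
    critical = Array("critical", ["c0", "c1", "c2"], ["dedicated-spare"])
    mail = Array("mail", ["m0", "m1", "m2"])
    archive = Array("archive", ["a0", "a1", "a2"])
    for array, failed in ((mail, "m1"), (archive, "a0"), (critical, "c2")):
        print(array.name, manager.replace_failed(array, failed))
```

In the usage example, two pooled arrays fail in succession and the second finds the pool empty, while the critical array still has its reserved spare: this is the contention that dedicated spares are meant to eliminate.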
The next level of fault tolerance protection is to add
redundancy to non-disk drive components. The downside is that it
significantly increases the cost of your configuration. Typical
areas to add redundancy include packaging (such as extra fans),
dual I/O paths from the server to the disk drives (i.e., redundant
controllers), and multiple servers. Using an Uninterruptible Power
Supply (UPS) is also a good idea.
Some software-based RAID solutions support disk duplexing, a
form of mirroring (RAID 1) using redundant controllers where each
disk drive is attached to a separate controller, thereby
eliminating the controller as the single point of failure. The
hardware-based RAID equivalent solution is called active-active
controller failover, available only on more expensive, high-end
external RAID controllers such as Data General's CLARiiON series.
Server redundancy is most cost-effectively achieved through
clustering, such as that offered on Microsoft's Windows NT Server
4.0 Enterprise Edition or many Unix and mainframe computer
systems. With clustering, multiple servers access the same
storage. In the event of a server failure, data on the disk drives
can still be accessed using other servers in the cluster.
Hardware-based external RAID controllers are typically used to
provide RAID protection for the disk drives in a clustered
environment.
In very high-end mission-critical applications, remote
mirroring (RAID 1) software such as that offered by Compaq/Digital's
OpenVMS is employed to mirror data to a remote site for disaster
protection. Such configurations are very expensive, because the
entire server configuration is duplicated at an offsite location.