
Hardware- or Software-Based RAID
Which Solution is Best for You?

RAS (Reliability, Availability, Serviceability)


RAS considerations

In today's environment, a company's data is its lifeblood. Once the exclusive realm of large corporations, applications with mission-critical data have become commonplace in small and medium businesses. With the cost of downtime varying from $10,000[1] to $84,000[2] per hour, and estimates of the cost of losing just 100 MBytes of data ranging from $85,000 to $500,000,[3] few businesses can withstand an extended data outage. Even more disturbing is a recent study revealing that 94% of businesses that had suffered a catastrophic nonrecoverable failure in their corporate IT storage systems went out of business within two years.[4]

Avoiding downtime is especially critical for smaller businesses and IT sites, because they are less likely to have the staff, resources or disaster-planning capabilities of larger computer installations.


RAS definitions

Let's examine the three main considerations for evaluating a RAID storage solution from a data availability standpoint: reliability, availability, and serviceability.


Reliability

Reliability describes when, or how often, you can expect the item in question to fail. It is typically expressed as Mean Time Between Failures (MTBF), a metric used to quantify hardware component failures that follow an exponential failure distribution. For instance, disk drive manufacturers claim MTBFs of 300,000 to 800,000 hours or more.

Those disk drive MTBFs sound good, but that's only part of the picture. What is stated on a specification sheet may represent the average of the population, not your drive in particular. Your drive's environment may not be optimal, either because a fan in the server packaging is not running properly, or because your system experiences a power surge that cripples your disk drive. The manufacturer may also specify theoretical, not operational, MTBFs. Theoretical MTBF specifications are derived from mathematical models of empirical field data of the individual drive components. They do not account for failures due to drive infancy, manufacturing-induced defects, drive returns in which the failure cannot be repeated (i.e., NTFs - No Trouble Founds), and damage due to improper handling. Operational MTBFs,[5] derived from actual disk drive field return data, are typically significantly lower than theoretical MTBFs and by definition are not published until at least 6 months after the drive is in volume production.

Even if you have an optimal environment and perfect drives, the MTBF of your total drive population could be unacceptably low, because overall MTBF of the storage is decreased in proportion to the number of components in the system (see Calculating Theoretical MTBFs). In other words, the more disk drives you have in your configuration, the more likely it is that one of them will fail. For instance, if your storage subsystem consists of two disk drives each with an MTBF of 300,000 hours, the theoretical MTBF of the disk drives alone (not including the card, server, fans, packaging or other components) is half that, or 150,000 hours. For a 10-drive configuration, the MTBF drops to 30,000 hours; a 100-drive configuration is 3,000 hours.
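
The arithmetic above can be reproduced with a short sketch. It assumes identical, independent drives with a constant failure rate (the exponential model mentioned earlier); the function names and the one-year window are illustrative choices, not part of the original paper.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def array_mtbf(drive_mtbf_hours: float, drive_count: int) -> float:
    """Theoretical MTBF of a set of identical, independent drives.

    Failure rates add, so the combined MTBF is the single-drive
    MTBF divided by the number of drives.
    """
    return drive_mtbf_hours / drive_count


def failure_probability(mtbf_hours: float, period_hours: float) -> float:
    """Chance of at least one failure within the period (exponential model)."""
    return 1.0 - math.exp(-period_hours / mtbf_hours)


for drives in (1, 2, 10, 100):
    mtbf = array_mtbf(300_000, drives)
    risk = failure_probability(mtbf, HOURS_PER_YEAR)
    print(f"{drives:>3} drives: MTBF {mtbf:>9,.0f} h, "
          f"P(failure within one year) = {risk:.1%}")
```

Under these assumptions, a 10-drive configuration has roughly a one-in-four chance of seeing at least one drive failure in a year of continuous operation.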

Conclusion: Your disk drives or other components of your system will eventually fail. If a critical drive fails, such as a boot drive or a drive containing payroll information, your entire organization may be affected. How prepared are you going to be to handle that situation, and can your business weather a prolonged outage?

So far, we've only addressed hardware reliability. Many software failures cannot even be categorized by MTBF statistics, because their failure mechanisms don't follow an exponential failure distribution. The amount of design and test time it would take to make "perfect" software is prohibitive for most vendors. According to Microsoft®, software (operating system and application) failures account for slightly less than 35% of all system outages; hardware accounts for another 30% of downtime.[6] Conclusion: software is just as likely to fail as hardware.

So, how do you protect your critical data? Implementing RAID technology, either software- or hardware-based, is a logical first step in protecting your data from disk drive failures. RAID technology should be deployed on any server or workstation where the cost of lost data or downtime warrants it. But that's just the beginning. There are other availability and serviceability features that you should examine to determine the optimum RAID solution for your environment.


Availability

Data availability is defined as having your data accessible at all times. There are two components to data availability: data integrity and fault tolerance.

Data Integrity. Data integrity means getting the correct data, every time. Most RAID solutions offer dynamic sector repair, in which sectors made defective by soft media errors are repaired on the fly. The real differentiating factor is the amount of error correction and error detection code provided. Software-based RAID typically relies on a standard SCSI bus for data integrity protection, which can detect 1-bit errors but cannot correct any errors. Hardware-based RAID solutions usually contain more robust code. For instance, Adaptec's hardware-based RAID solutions not only detect 4-bit errors, but also correct 1-bit errors on the entire data path from the storage media to the host system bus. You could purchase ECC main memory to augment the SCSI bus capabilities of your software-based RAID solution, but this is an additional cost.
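
The difference between detecting and correcting an error can be illustrated with a single parity bit. The sketch below assumes odd parity computed over each byte, which is how parallel SCSI bus parity is commonly described; the paper itself does not specify the scheme, and the function names are hypothetical.

```python
def odd_parity_bit(byte: int) -> int:
    """Return the parity bit that makes the total number of 1 bits odd."""
    ones = bin(byte & 0xFF).count("1")
    return 0 if ones % 2 == 1 else 1


def passes_parity_check(byte: int, parity_bit: int) -> bool:
    """True when the byte plus its parity bit still hold an odd number of 1s."""
    ones = bin(byte & 0xFF).count("1") + parity_bit
    return ones % 2 == 1


sent = 0b1011_0010
parity = odd_parity_bit(sent)

one_bit_error = sent ^ 0b0000_0100   # one bit flipped in transit
two_bit_error = sent ^ 0b0001_0100   # two bits flipped in transit

print(passes_parity_check(sent, parity))           # intact byte
print(passes_parity_check(one_bit_error, parity))  # single-bit error
print(passes_parity_check(two_bit_error, parity))  # double-bit error
```

A failed check says only that some bit changed, not which one, so the receiver cannot repair the byte and must have it re-sent. An even number of flipped bits passes the check unnoticed, which is why stronger codes such as ECC are needed to detect multi-bit errors and correct single-bit ones.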

Frequently, software-based RAID is used in conjunction with less robust packaging in installations where low initial cost is the primary consideration. These packaging configurations may not support drive hot swap.[7] If a drive is accidentally pulled out, the enclosure may not notify the card of the need to re-issue commands, thereby risking data loss or data corruption - a potential cost hit that may far outweigh the initial cost savings. And if a drive in a RAID array fails in such an enclosure, you must wait until you can schedule system downtime to replace the failed drive. Since most software- and hardware-based RAID solutions support drive hot swap, the key to eliminating these potential problems is making sure the disk drive enclosure supports it as well.

Fault Tolerance. Fault tolerance is defined as maintaining data availability in the event of one or more failures in the system. The most common method of achieving fault tolerance on servers and workstations today is RAID technology.

Each RAID level offers different tradeoffs on performance, cost, and availability, and as such, it may be appropriate to use different RAID levels for different applications - even on the same server or workstation. RAID 0 (i.e., striping) should only be used in high performance applications that can afford downtime and/or lost data. Critical files in which an outage would severely cripple business activities, such as boot drives, would best be protected using RAID 1 (i.e., mirroring), or for even better performance, RAID 0/1 (mirrored striping). Most applications can best be protected by RAID 5 (striped parity), which offers the best balance between performance, cost and availability. Typically, these applications can't afford downtime but can tolerate somewhat degraded performance in exchange for a lower cost of data protection.
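
To see why striped parity can survive the loss of one drive, consider a single stripe in which the parity block is the byte-wise XOR of the data blocks. This is a minimal sketch of the general parity technique, not any vendor's implementation; the block contents and the `xor_blocks` helper are made up for illustration, and real RAID 5 arrays also rotate the parity block across the drives.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread over three data drives (illustrative contents).
data_blocks = [b"PAYROLL1", b"INVOICE2", b"LEDGER03"]
parity_block = xor_blocks(data_blocks)

# Simulate losing the second drive: rebuild its block from the
# surviving data blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity_block]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_blocks[1])
```

The cost of this protection is one drive's worth of capacity per array, compared with half the total capacity for mirroring, which is the cost advantage of RAID 5 noted above. Reads from a degraded array must recompute the missing block in this way, which accounts for the degraded performance.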

With NT's embedded RAID software, your choices are: RAID 0 (i.e., no protection) for workstations, and RAID 0, 1, or 5 for servers. Microsoft's Windows NT Server 4.0 Enterprise Edition cluster server services use hardware-based RAID exclusively. Novell Netware offers RAID 1. Very high-end servers sometimes employ software-based remote mirroring (RAID 1) for disaster protection, such as that offered on EMC's Symmetrix series or by Compaq/Digital's OpenVMS operating system.

Entry-level hardware-based RAID solutions typically feature RAID 0, 1, and 5 and occasionally RAID 0/1. Higher-end hardware-based solutions usually offer RAID 3 and sometimes other variations, but those are beyond the scope of this paper. So, if you want RAID protection on your workstation or clustered server, or if you want more RAID choices, such as RAID 0/1 or RAID 3, your only choice is a hardware-based solution.

Another important difference between software- and hardware-based RAID is the ability to RAID-protect the boot disk drive. Windows NT does not allow you to RAID-protect your boot drive, so if it fails, your system goes down. You must then determine the cause of the downtime, either replace the failed drive with a spare boot drive or restore the data from tape or other secondary storage media, and reboot the system. During this time your system is unavailable to users. Hardware-based RAID solutions make no exceptions for RAID protection based on the data contained on the disk drive.

But RAID protection is just the first step in achieving higher fault tolerance. Hot spares with automatic recovery are another important consideration when evaluating RAID solutions - one that is only offered on hardware-based solutions. With hot spares, if a drive in the array fails, the RAID card automatically detects the failed drive, replaces it with a spare drive, and reconfigures the array with the new drive - all while the system is running and data remains available to users. Some cards allow the system administrator to define the priority of drive reconstruction (for instance, low, medium, or high priority) to optimize tradeoffs between availability requirements (i.e., exposure to a second drive failure) and application performance during data reconstruction.

More sophisticated hardware-based RAID solutions also offer the option of either dedicating spares to each array, or using a pool of spares for all arrays to draw on. Using dedicated spares on the most critical applications eliminates contention for spares in the event of multiple drive failures. Pooling spares is a more cost-effective method of data availability that is appropriate for less critical applications.
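
The two spare policies can be sketched as a small allocation routine. Everything here (the class names, the drive identifiers, and the rule that a dedicated spare is tried before the shared pool) is an assumption made for illustration rather than the behavior of a particular RAID card.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Array:
    name: str
    dedicated_spares: List[str] = field(default_factory=list)


@dataclass
class SpareManager:
    global_pool: List[str] = field(default_factory=list)

    def claim_spare(self, array: Array) -> Optional[str]:
        """Pick a spare for a rebuild: dedicated first, then the shared pool."""
        if array.dedicated_spares:
            return array.dedicated_spares.pop(0)
        if self.global_pool:
            return self.global_pool.pop(0)
        return None  # no spare left; the array stays degraded


payroll = Array("payroll", dedicated_spares=["spare-A"])
archive = Array("archive")
scratch = Array("scratch")
manager = SpareManager(global_pool=["spare-B"])

for failed in (archive, scratch, payroll):
    print(failed.name, "->", manager.claim_spare(failed))
```

In this run the two less critical arrays compete for the single pooled spare and the second one goes without, while the payroll array still finds its dedicated spare waiting. That is the contention a dedicated spare is meant to eliminate.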

The next level of fault tolerance protection is to add redundancy to non-disk drive components. The downside is that it significantly increases the cost of your configuration. Typical areas to add redundancy include packaging (such as extra fans), dual I/O paths from the server to the disk drive (i.e., redundant controllers), and multiple servers. Using an Uninterruptible Power Supply (UPS) is also a good idea.

Some software-based RAID solutions support disk duplexing, a form of mirroring (RAID 1) using redundant controllers where each disk drive is attached to a separate controller, thereby eliminating the controller as the single point of failure. The hardware-based RAID equivalent solution is called active-active controller failover, available only on more expensive, high-end external RAID controllers such as Data General's CLARiiON series.

Server redundancy is most cost-effectively achieved through clustering, such as that offered on Microsoft's Windows NT Server 4.0 Enterprise Edition or many Unix and mainframe computer systems. With clustering, multiple servers access the same storage. In the event of a server failure, data on the disk drives can still be accessed using other servers in the cluster. Hardware-based external RAID controllers are typically used to provide RAID protection for the disk drives in a clustered environment.

In very high-end mission-critical applications, remote mirroring (RAID 1) software such as that offered by Compaq/Digital's OpenVMS is employed to mirror data to a remote site for disaster protection. Such configurations are very expensive, because the entire server configuration is duplicated at an offsite location.


Serviceability

As defined here, serviceability means how quickly and easily, in the event of a failure, you can detect and isolate the failure, repair or replace the failed component, and reset the application or operating system. Serviceability also includes preventive maintenance features that help you monitor and replace marginal components before they fail.

S.M.A.R.T. and SAF-TE are two standards that have emerged in recent years and that should be employed on any serious RAID implementation. Configurations supporting the S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) standard monitor disk drives and report any out-of-threshold conditions that may signify a potential failure to the array card or server management software, permitting you to replace the drive before it fails. Configurations supporting SAF-TE (SCSI Accessed Fault-Tolerant Enclosure) monitor and report enclosure conditions to array or server management software, assisting in alerting and isolating enclosure-related failures.

In either case, you need to check not only that the disk drives are S.M.A.R.T.-compliant or the enclosure is SAF-TE-compliant, but also that the RAID card's management software and operating system support these standards. Many software- and hardware-based RAID solutions support S.M.A.R.T. and SAF-TE. However, just as there are many different vendor implementations of SCSI drives, there are many different implementations of SAF-TE enclosures, all of which need to be tested for compatibility to ensure that enclosure-related events are properly reported and interpreted by the card and RAID management software.
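
Out-of-threshold reporting of the kind described above amounts to comparing each monitored reading against an acceptable range. The sketch below is purely illustrative: the attribute names and limits are hypothetical and are not taken from the S.M.A.R.T. or SAF-TE specifications, where the monitored attributes and thresholds are defined by the drive or enclosure vendor.

```python
from typing import Dict, List, NamedTuple


class Limit(NamedTuple):
    low: float
    high: float


# Hypothetical acceptable ranges; real thresholds are vendor-defined.
LIMITS: Dict[str, Limit] = {
    "temperature_c": Limit(5, 55),
    "reallocated_sectors": Limit(0, 50),
    "spin_up_time_ms": Limit(0, 9000),
}


def out_of_threshold(readings: Dict[str, float]) -> List[str]:
    """Return one message per reading that falls outside its allowed range."""
    alerts = []
    for name, value in readings.items():
        limit = LIMITS.get(name)
        if limit is not None and not (limit.low <= value <= limit.high):
            alerts.append(f"{name}={value} outside {limit.low}..{limit.high}")
    return alerts


drive_readings = {"temperature_c": 61, "reallocated_sectors": 3}
for alert in out_of_threshold(drive_readings):
    print("WARNING:", alert)
```

A drive that trips a limit like this is still working, which is what gives the system manager time to replace it before it fails.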

With Microsoft NT software-based RAID, drive and enclosure events are reported via SNMP to the general management log, a log that contains storage- as well as server- and network-related events. The system manager can then employ a filter to view only storage-related events. Each storage installation can only be monitored locally on each server, so the system manager must physically "make the rounds" to monitor each RAID installation.

[Figure: Software-based RAID with local management vs. hardware-based RAID with remote management]

Many hardware-based RAID solutions offer RAID management software specifically designed not only to configure and manage RAID arrays but also to report storage-related events. The more sophisticated of these RAID management software packages categorize errors and events by severity, such as color-coded alerts highlighted in yellow for a potential problem and red for an actual component failure. Some even e-mail, fax or page the system manager in the event of alerts requiring immediate attention, greatly increasing the system manager's ability to detect problems and decreasing the time it takes to bring the storage subsystem back up to full operational capability. Others allow you to manage, monitor and in some instances repair all hardware-based RAID installations from a single station, even remotely.
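
Severity-based reporting of this kind can be sketched as a simple classification step. The yellow and red levels follow the description above; the event names, the uncolored informational level, and the rule that only component failures trigger an immediate notification are assumptions made for illustration.

```python
from enum import Enum


class Severity(Enum):
    INFO = "none"        # assumed level for routine events
    WARNING = "yellow"   # potential problem
    FAILURE = "red"      # actual component failure


# Hypothetical event names, grouped by how serious they are.
POTENTIAL_PROBLEMS = {"smart_threshold_exceeded", "enclosure_temperature_high"}
COMPONENT_FAILURES = {"drive_failed", "fan_failed", "power_supply_failed"}


def classify(event_kind: str) -> Severity:
    """Map a storage event to the severity used for color coding."""
    if event_kind in COMPONENT_FAILURES:
        return Severity.FAILURE
    if event_kind in POTENTIAL_PROBLEMS:
        return Severity.WARNING
    return Severity.INFO


def needs_immediate_notification(severity: Severity) -> bool:
    """Assumed policy: page or e-mail the system manager only for failures."""
    return severity is Severity.FAILURE


for kind in ("rebuild_completed", "smart_threshold_exceeded", "drive_failed"):
    level = classify(kind)
    print(kind, level.value, needs_immediate_notification(level))
```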



[1] Annual Disaster Impact Research, Microsoft

[2] Oracle Corporation survey

[3] Computer Weekly, April 1996

[4] Computer Weekly, April 1996

[5] In the example described in disk drive manufacturer Quantum's white paper entitled "HDD Operational Vs Theoretical MTBF," the disk drive's theoretical MTBF is calculated at 713,000 hours. Field return data revealed an operational MTBF of 188,005 hours - about one quarter of the theoretical MTBF. For more information, see http://www.quantum.com/src/whitepapers/mtbf/mtbf1.html.

[6] Deploying Microsoft Windows NT® Server for High Availability, Microsoft Corporation

[7] Drive hot swap is defined as the ability to pull out and replace a drive while the system is running and data is being accessed. With warm swap, you must first pause activity on the SCSI bus before removing the drive.


 

Copyright © 1999 VoxTechnologies Corporation- An Industrial Partner
Last modified: June 20, 2002