Unrecoverable read errors and RAID5

In summary, the article argues that RAID5 is no longer practical at large data volumes due to the chance of an individual HDD having an unrecoverable read error. However, the article's underlying math may be in error, and I'd like someone to check it.
  • #1
joema
A number of popular-level articles have been published saying that RAID5 is increasingly impractical at larger data volumes due to the chance of an individual HDD having an unrecoverable read error (URE) during the rebuild phase. I think the underlying math which produced this conclusion may be in error, and I'd like someone to check it.

Published reasoning: typical HDDs have a URE rate of 1 in 10^14 reads (interpreted as bytes). If a 16TB 8-drive RAID5 array has a single URE, that HDD must be replaced and data rebuilt on the spare. During the rebuild no further errors can be tolerated else the entire array is bad. Apparent mathematical reasoning: 1 URE per 10^14 bytes / 8 HDD per array = 1 URE per 12.5 TB read from the 8-drive array. Conclusion: given that URE rate there is a nearly 100% chance of a 2nd HDD failing while rebuilding the 16TB array. Example article: http://www.zdnet.com/article/has-raid5-stopped-working/

I think this is incorrect for several reasons:

(1) Common sense: if the chance of a URE is 1 per 12.5 TB read, large RAID0 and RAID5 arrays would be failing at an incredible rate.

(2) The specified URE rate is 1 per 10^14 *reads*, not bytes. Modern drives do reads in 4k-byte sectors, not bytes or 512-byte sectors. So translated to bytes, the URE rate is 1 per 10^14 * 4096 bytes per sector, or 1 URE per 409,600 terabytes read.

(3) While the number of drives in an array increases the chance of an individual failure per unit of *operating time*, it does not increase the failure chance per volume of data transferred compared with a single HDD of equal capacity. E.g., if reading the *same* data volume from a single HDD vs an n-drive RAID array, each drive in the array does only 1/n of the reads, and hence has 1/n of the failure chance per aggregate volume of data. Therefore, for a given number of reads, the URE probability is about the same for a single drive and a multi-drive RAID array.
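A minimal sketch of point (3), assuming UREs occur independently at a fixed rate per read (an assumption for illustration only). The function name and the example numbers are hypothetical, and the rate is left unit-agnostic. Under that assumption the expected error count depends only on the total number of reads, not on how many drives share them:

```python
def expected_ures(reads: float, reads_per_error: float = 1e14) -> float:
    """Expected number of UREs for a given number of reads.

    Assumes a constant, independent error rate of one URE per
    `reads_per_error` reads (illustrative model, not a drive spec).
    """
    return reads / reads_per_error


total_reads = 1e13   # hypothetical workload
n_drives = 8         # drive count from the example in this post

# All reads served by one drive.
single = expected_ures(total_reads)

# The same reads split evenly across n drives: each drive does 1/n of them.
striped = sum(expected_ures(total_reads / n_drives) for _ in range(n_drives))

print(single, striped)
```

Both values come out the same (up to floating-point rounding), which is the claim in (3).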
 
  • #2
Link to an article that may help explain this:

raid 5 and ure

Also, if a URE is encountered, it would seem that a smart RAID controller would attempt to correct the error by using the array's redundancy (parity) to calculate the data needed for the failed sector and rewrite it, giving the hard drive some chance to recover from the error (the hard drive might remap the bad sector to a spare sector, which will get updated by the write).
 
  • #3
I'm still investigating this. I think my above numbers and reasoning are incorrect, because the HDD non-recoverable error spec is failed reads per total bits read, not sectors read. However the fact remains, hard drives and RAID5 are much more reliable than indicated by the manufacturer's "non-recoverable error rate". That was the key point in the article you mentioned -- thanks!

If the spec is 1 failure per 10^14 bits read, 10^14 bits = 12.5 terabytes. So by that spec you'd expect a failure on average every 12.5TB -- that's where Robin Harris who wrote all the "doom and gloom" ZDNet articles got his number. However -- we know from observation that HDDs and RAID systems do not fail anywhere near that often.
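To make the arithmetic concrete, here is a rough sketch that converts the spec to bytes and then takes it literally as an independent per-bit error probability. That literal reading is exactly the assumption this thread questions. The drive size (2 TB each, for the 16TB 8-drive array in post #1) and the function names are my own assumptions for illustration:

```python
import math

BITS_PER_ERROR = 1e14  # spec value quoted above
TB = 1e12              # decimal terabyte, in bytes


def bytes_per_error(bits_per_error: float = BITS_PER_ERROR) -> float:
    """Convert 'one error per N bits read' to bytes read per error."""
    return bits_per_error / 8


def p_at_least_one_ure(bytes_read: float,
                       bits_per_error: float = BITS_PER_ERROR) -> float:
    """P(at least one URE) = 1 - (1 - p)^n for n bits read.

    Assumes every bit read fails independently with p = 1/bits_per_error,
    i.e. the spec is treated as an average rate rather than a worst case.
    """
    n_bits = bytes_read * 8
    p_bit = 1.0 / bits_per_error
    # log1p/expm1 keep precision when p is tiny and n is huge.
    return -math.expm1(n_bits * math.log1p(-p_bit))


print(bytes_per_error() / TB)  # 12.5

# Assumed rebuild: 7 surviving drives of 2 TB each are read in full.
rebuild_bytes = 7 * 2 * TB
print(p_at_least_one_ure(rebuild_bytes))
```

Under these assumptions the rebuild hits at least one URE roughly two times out of three, which is high but not the "nearly 100%" of the articles, and it only holds if the spec is an average rate.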

One answer is the spec is simply a "worst case" spec -- IOW a guarantee the HDD will be no *worse* than that. It is not an average failure rate, nor a predicted failure rate. It's more like an uptime or availability guarantee. If a vendor promises 90% availability, that doesn't mean the system is unavailable 10% of the time. It may well achieve 99% availability -- it's just a worst case guarantee.

If the HDD is really more reliable than, say, 1 error per 10^14 bits read, why don't they say that? We could just as well ask if the average engine in a Honda car will last 180,000 miles, why is the engine warranty only 60,000 miles? There are many reasons for that.
 
  • #4
A key study (while several years old) covers this exact area: namely the disparity between the "non-recoverable error rate" spec published by HDD manufacturers and empirically-observed results. A spec of one non-recoverable error per 10^14 bits read would equate to one error every 12.5 terabytes read. This study found four non-recoverable errors per two petabytes read, which equates to one error per 4E15 bits read, or about 40 times more reliable than the HDD manufacturer spec. Empirical Measurements of Disk Failure Rates and Error Rates (2005, Jim Gray, et al):
http://research.microsoft.com/apps/pubs/default.aspx?id=64599
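The comparison above can be reproduced in a few lines. This is only a check of the arithmetic using the figures quoted in this post (four errors per two petabytes read, spec of one error per 10^14 bits); the function name is hypothetical:

```python
PB = 1e15  # decimal petabyte, in bytes


def bits_per_error(bytes_read: float, errors: int) -> float:
    """Observed bits read per non-recoverable error."""
    return bytes_read * 8 / errors


spec_bits_per_error = 1e14                   # manufacturer spec
observed = bits_per_error(2 * PB, errors=4)  # figures quoted from the study

print(observed)                        # bits read per observed error
print(observed / spec_bits_per_error)  # how many times better than spec
```

With only four observed errors the ratio is a rough estimate rather than a precise measurement.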
 
  • #5


I believe it is important to critically evaluate and question information presented in popular-level articles. In this case, the published reasoning for RAID5 becoming impractical at larger data volumes due to UREs does not seem to be supported by accurate calculations.

Firstly, the common-sense argument is valid. If one URE per 12.5 TB read were the rate drives actually delivered, we would see a much higher failure rate in large RAID arrays, which is not the case.

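Secondly, reading the spec as 1 URE per 10^14 sector reads (which, with 4 KB sectors, would give 1 per 409,600 terabytes read) does not hold: as noted in post #3, the spec counts bits read, and 10^14 bits is 12.5 TB. The 12.5 TB figure therefore follows from the spec as written. What the spec does not give is an average or predicted failure rate. It is a worst-case bound, and the study cited in post #4 observed about 40 times fewer errors than the spec allows, which lowers the chance of a URE during a rebuild accordingly.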

Finally, the number of drives in a RAID array does not necessarily increase the chances of a URE. Each drive in the array only reads a fraction of the data compared to a single drive, so the probability of a URE occurring is about the same for a single drive and a multi-drive RAID array. This means that the URE rate should not be calculated based on the total capacity of the RAID array, but rather on the amount of data each drive reads during a rebuild.

In conclusion, the apparent mathematical reasoning presented in popular-level articles may be flawed and should be critically evaluated. While UREs should always be taken into consideration when designing a RAID array, the chances of a URE occurring during a rebuild phase may not be as high as suggested. Further research and evaluation is needed to accurately determine the impact of UREs on RAID5 arrays at larger data volumes.
 

Related to Unrecoverable read errors and RAID5

1. What are unrecoverable read errors?

An unrecoverable read error (URE) occurs when a drive cannot return the data stored in a sector, even after its internal error correction and retries. Causes include media defects, physical damage to the storage medium, and other hardware faults.

2. What is RAID5?

RAID5 is a RAID (Redundant Array of Independent Disks) level that uses block-level striping with distributed parity to improve performance and fault tolerance. It requires at least three disks and can withstand the failure of one disk without losing data.

3. How does RAID5 handle unrecoverable read errors?

When an unrecoverable read error occurs on one disk in a RAID5 array, the unreadable block can be reconstructed from the corresponding data and parity blocks on the other disks. This is possible because each stripe stores parity computed across its data blocks, so any single missing block in a stripe can be recomputed from the rest.
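As a toy illustration of that reconstruction, RAID5 parity is a bytewise XOR across the blocks of a stripe, so XOR-ing the surviving blocks with the parity block yields the missing one. The sketch below models a single stripe in memory. The block contents and the helper name are made up for the example, and real controllers work on whole sectors with rotating parity placement:

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus one parity block (a 4-disk array).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Suppose the block on disk 1 is unreadable (a URE or a failed disk).
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])  # True
```

If a second block in the same stripe is also unreadable, the XOR no longer has enough information, which is the single-failure limit of RAID5.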

4. Can RAID5 protect against multiple unrecoverable read errors?

RAID5 can reconstruct only one missing block per stripe. A URE on a healthy array is therefore recoverable, but if one disk has already failed and a URE occurs on a surviving disk during the rebuild, the affected stripe cannot be reconstructed and its data may be lost. This is why it is important to regularly monitor and maintain RAID5 arrays so that bad sectors are found and repaired while redundancy is still intact.

5. Are there any alternatives to RAID5 for protecting against unrecoverable read errors?

Yes, there are other RAID levels, such as RAID6, that provide higher levels of fault tolerance and can protect against multiple unrecoverable read errors. Additionally, some storage systems offer built-in error correction and detection mechanisms to prevent and handle unrecoverable read errors. It is important to research and consider all options when choosing a storage solution for your data.
