
Is the core issue that RAID rebuild timeouts are too short for what happens when an SMR drive's buffer overflows during a rebuild?

Are the timeouts that trigger the rebuild failure on the order of 5 seconds, 60 seconds, or 300+ seconds?



It sounds like at least 60 seconds. Even with no timeouts, RAID rebuilds go from a day with CMR to over a week with SMR.
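
Back-of-the-envelope, assuming a hypothetical 8 TB drive: at a sustained ~100 MB/s, rewriting the whole disk takes about 8,000,000 MB / 100 MB/s ≈ 80,000 s, or roughly 22 hours. If the drive falls to the ~10-15 MB/s of sustained random writes an SMR drive can hit once its CMR cache is exhausted, the same rewrite takes on the order of 6-9 days.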


Okay, so they do succeed, they just take a long time. Good to know! Thanks.


Only if your RAID software/hardware is exceptionally tolerant of the drive.

Don't forget that the reason "NAS" drives exist in the first place is that several years ago drive manufacturers added a feature where, if a read failed, the drive would go into an extremely thorough but long (60+ second) recovery effort to get the sector back. RAID controllers would just see the drive stop responding to commands and mark it dead. So the NAS drives come with firmware that doesn't do the extreme recovery and instead just returns "read error" and lets the RAID controller rebuild it with the parity information.
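
You can check whether a given drive behaves this way, and what its recovery time limit is, by querying its SCT Error Recovery Control settings. A minimal sketch, assuming a Linux box with smartmontools and a hypothetical drive at /dev/sdX:

  # Report the current read/write error recovery time limits (if supported)
  smartctl -l scterc /dev/sdX

Desktop drives typically report the feature as unsupported or disabled, while NAS drives usually ship with a limit of around 7 seconds.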

If the drives go out to lunch due to an SMR writeback bottleneck, then they will have lost their main selling point. Presumably in the normal case the drive will write the data just fine, but at a slower rate, so you can rebuild your array but it will take all week. However, if one of the sectors fails the CRC check after the write and the drive has to try several times to get it right, I can definitely see the RAID controller getting frustrated and kicking it out.

I would be interested to see if any RAID software comes with an "SMR" mode where, if a drive stops responding to commands during a rebuild, the controller lets the drive take a 20-minute break before resuming the rebuild.
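
I don't know of one that does, but on Linux you can approximate it by raising the kernel's per-device command timeout (default 30 seconds). A sketch, assuming a hypothetical drive at /dev/sdX; 1200 is a hypothetical 20-minute value:

  # Give each command up to 20 minutes before the kernel gives up on the drive
  echo 1200 > /sys/block/sdX/device/timeout

Whether a 20-minute stall is actually tolerable depends on everything queued up behind that drive, of course.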


> So the NAS drives come with firmware that doesn't do the extreme recovery and instead just returns "read error" and lets the RAID controller rebuild it with the parity information.

Hang on a sec. Is this documented somewhere?

I bought a WD Red to plug into my Raspberry Pi which I use as a file server. There's no RAID, just the one disk. I thought I was buying a more energy efficient or bulk-storage-oriented drive.

But if what you say is true, then the "NAS" or "Red" drives should _never_ be used outside of a RAID, because robust error recovery was limited by design. Do I have that right?


That's exactly correct: see https://en.wikipedia.org/wiki/Error_recovery_control for details.

Basically, NAS drives have a hard limit on how long they'll try to recover from errors before just reporting the failure back to the RAID controller so that it can handle it.


Yes, that's the right idea. NAS/RAID drives have a different error recovery strategy because the assumption is that they'll be part of a multi-drive arrangement, where failing fast (and letting the containing system handle recovery) is preferable to avoiding failure at all costs (which can take so long that the containing system concludes the drive has stopped functioning properly and fails the whole thing out). I can't point you to any specific documentation off the top of my head, but this is a well-known position that I've seen described explicitly several times.

I'm afraid that does mean your choice of a Red for a single-disk system was not ideal. Presumably you keep backups of any valuable data anyway, but if downtime for recovery would be a significant problem for you then you might want to consider replacing that drive with something more suitable for your situation.


I should hand in my geek card; this feels like something I should have known about. In my defense, though, the HD manufacturers offer little to no information about the _technical_ differences between their drive lines. All of their documentation just says "designed for X use case".

I do have backups; that's not my concern. My concern is that _when_ there is a read/write error (a completely normal event with today's hard drive technology), the drive just gives up right away instead of making a few attempts. This could easily translate into (silently!) lost data in a single-disk scenario.


If one uses ZFS, one can instruct ZFS to keep multiple copies of the data. It will try to spread those copies among multiple disks, but in single-disk systems it will just spread the duplicate blocks over that disk.

Since ZFS does checksum verification on every read, it has a much better chance of recovering from a few bad sectors.
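
A minimal sketch, assuming a hypothetical dataset named tank/data:

  # Store two copies of every block written from here on
  zfs set copies=2 tank/data

Note that this only applies to data written after the property is set, and it roughly halves the usable capacity of that dataset.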

The downside, though, is that the default RPi installs are 32-bit, while ZFS was written with 64-bit systems in mind; AFAIK there are still some issues and limitations when running on a 32-bit system.


You can set the TLER value via smartctl, though that might not work through a USB interface.

smartctl -l scterc,<READTIME>,<WRITETIME> /dev/xxxx

WD Red drives should retain this setting across reboots. Some drives don't, and some don't support the command.
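
For example, to cap error recovery at 7 seconds for both reads and writes (the values are in units of 100 ms, so 70 means 7.0 seconds), assuming a hypothetical /dev/sda:

  smartctl -l scterc,70,70 /dev/sda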


I think this is called TLER (Time Limited Error Recovery).

https://en.wikipedia.org/wiki/Error_recovery_control


If you are ever in a situation where this happens, the drive is at end of life and should be tossed, with the new one rebuilt from backups. You do have backups, right? Drives fail without notice all the time.


Wait, if it's a NAS drive, the drive firmware will ensure that it doesn't time out due to media failure. Which the RAID can trust, because it's a NAS drive.

So... why do the RAID rebuilds have timeouts on NAS drives at all? If you paid all that extra money for special firmware that doesn't time out on media errors, and the drive is still accepting and processing commands in less than X hours per command, then wiring in your own timeout seems like a really bad idea.

When the cache is full and something sends a write to the drive, does the drive still accept "are you still there?" commands while the write is queued?


> Which the RAID can trust, because it's a NAS drive.

There's the problem right there! :) Drives aren't trustworthy, regardless of label.


So the RAID software thinks it knows better than the drive firmware, ignores the fact that it's operating a drive with no I/O timeouts, and helpfully times the drive out of the array because obviously it's not behaving "correctly" according to the RAID software's unverified assumptions?

It reads to me like the fault here isn't just on the hard drive manufacturers, as everyone has made it appear in the top-level comments of both threads about this issue this week. I'm glad I asked more questions, so that I'm better informed to help my friends when they encounter this. I appreciate everyone in the thread offering help with the details.


There've been lots of detailed reports about rebuilds not succeeding. For example: https://github.com/openzfs/zfs/issues/10214


This seems like a different problem than people are making it out to be.

SMR drives have slow random writes and paper over them with caching, until the cache gets full. Then, in theory, what should happen is that the actual write speed of the drive is exposed. That means resilvers would take a long time, but they should still finish.

What seems to be actually happening is that some of these drives have a firmware bug such that when caching is enabled and the cache gets full, the write speed drops to zero. The system then regards the drive as faulty and boots it out.

So it seems like this should be solvable with a firmware update that causes the drive to behave differently (slow rather than stopped) when the cache gets full.

This also implies that some other SMR drives with different firmware might not behave like this, and that it might not happen with ordinary RAID rebuilds, where the writes should be almost entirely sequential (i.e. what SMR is good at), as opposed to ZFS resilvering, which is more random.
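
One way to test the theory would be a destructive run against a scratch drive with fio: sustained small random writes should eventually blow through the CMR cache, at which point a buggy drive would stall rather than merely slow down. A sketch, with a hypothetical device name (this overwrites the drive):

  # Hammer the drive with small random writes until the persistent cache fills
  fio --name=smr-cache-fill --filename=/dev/sdX --rw=randwrite \
      --bs=4k --size=100G --direct=1 --ioengine=libaio --iodepth=32

Watching per-command latency during the run (e.g. with iostat -x) would distinguish "slow" from "stopped".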


It doesn't have to stop completely to trigger the problem case, does it? It seems like just being slow enough to trigger the containing system's timeout response would be enough.


SMR drives shouldn't be that slow. They're slower than PMR drives, but not by so much that individual writes should take tens of seconds.

What's probably happening is one of two things. Either the cache gets full and the drive blocks while it flushes the cache to the disk, or the drive is advertising that it can handle a large number of simultaneous write operations which it can't actually complete in a reasonable amount of time once the cache is full, so the writes queued last time out. In the first case, the firmware could continue to process uncached writes when the cache is full; in the second, the drive shouldn't advertise the ability to accept as many simultaneous writes. A mitigation for the second case is sketched below.
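
On Linux you can test the second theory by shrinking the number of commands the kernel keeps in flight to the drive. A sketch, assuming a hypothetical /dev/sdX:

  # Check how many simultaneous commands are queued to the drive
  cat /sys/block/sdX/device/queue_depth

  # Drop to one outstanding command so nothing waits long enough to time out
  echo 1 > /sys/block/sdX/device/queue_depth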

Another alternative might be for the system to use a longer timeout value for these drives, but whether that's reasonable depends on how long it would actually have to be.


This is precisely the origin of my question upthread: why, in specific detail, are these rebuilds failing?



