Replacing faulty disks with ZFS
OK, this morning I got a notification on my phone that my storage pool was DEGRADED. After letting my fear of downtime cool down and jumping on my machine to see what was going on, I could see where the issue was coming from:
$ zpool status

  pool: data
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 3.46M in 0 days 07:57:09 with 0 errors on Sun Nov 20 08:06:11 2022
config:

        NAME                        STATE     READ WRITE CKSUM
        data                        DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            wwn-0x5000cca25dc75594  FAULTED     10     0 37.1K  too many errors
            wwn-0x5000cca25dc76378  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            wwn-0x5000cca25c18ca4c  ONLINE       0     0     0
            wwn-0x5000cca25c18caa4  ONLINE       0     0     0
          mirror-2                  ONLINE       0     0     0
            wwn-0x5000cca25c1749fc  ONLINE       0     0     0
            wwn-0x5000cca244329824  ONLINE       0     0     0

errors: No known data errors
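With more disks than this, spotting the bad row by eye gets tedious. Here is a minimal sketch of a filter that reads `zpool status` output on stdin and prints every row of the config table whose state is not ONLINE. The function name `unhealthy_devices` is made up for illustration, and it assumes the usual NAME/STATE/READ/WRITE/CKSUM column layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: list every row of the config table that is not ONLINE.
# Reads `zpool status` output from stdin and prints "<name> <state>" per row.
# Assumes the standard NAME STATE READ WRITE CKSUM table layout.
unhealthy_devices() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # The "errors:" line comes after the table, so stop there.
        $1 == "errors:" { in_config = 0 }
        # Table rows have at least five columns; keep the non-ONLINE ones.
        in_config && NF >= 5 && $2 != "ONLINE" { print $1, $2 }
    '
}
```

It would be used as `zpool status data | unhealthy_devices`. Note that it also prints the pool and mirror rows when they are DEGRADED, not only the failed disk itself.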
So I’m facing a faulty disk, and as my vdevs are mirrored, nothing was lost or even stopped. Thank you ZFS!
I picked up my spare drive, like the good sysadmin that I am, and pushed it into the storage array (thank you hot plugging!), and then it was available.
So now, how to replace the faulty drive?
I thought I needed to attach the drive to the pool, but it failed:
$ sudo zpool attach data mirror-0 /dev/disk/by-id/wwn-0x5000cca25dc2f1d8
cannot attach /dev/disk/by-id/wwn-0x5000cca25dc2f1d8 to mirror-0: can only attach to mirrors and top-level disks
So I tried the next thing I knew: replacing. Replacing attaches the new device, resilvers, and then detaches the faulty (or missing) one. So I gave it a shot:
$ sudo zpool replace data /dev/disk/by-id/wwn-0x5000cca25dc75594 /dev/disk/by-id/wwn-0x5000cca25dc2f1d8
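For comparison, here is a rough sketch of what doing the same thing by hand could look like, since `zpool replace` is documented as equivalent to an attach, a resilver, and a detach. As far as I can tell, `zpool attach` wants the name of an existing disk in the mirror rather than the `mirror-0` vdev name, which would explain the error above. The function names and the `ZPOOL` override are made up for illustration; I did not run this against the pool.

```shell
#!/bin/sh
# ZPOOL is a hypothetical override so the steps can be dry-run,
# for example ZPOOL="echo zpool". It defaults to the real binary.
ZPOOL=${ZPOOL:-zpool}

# Step 1: attach the new disk next to a healthy member of the mirror.
# This turns the two-way mirror into a three-way one and starts a resilver.
# usage: attach_new_disk POOL HEALTHY_MEMBER NEW_DISK
attach_new_disk() {
    $ZPOOL attach "$1" "$2" "$3"
}

# Step 2: only once `zpool status` reports the resilver as finished,
# drop the faulted disk from the mirror.
# usage: detach_old_disk POOL FAULTED_MEMBER
detach_old_disk() {
    $ZPOOL detach "$1" "$2"
}
```

With the disks from this post, that would be `attach_new_disk data wwn-0x5000cca25dc76378 /dev/disk/by-id/wwn-0x5000cca25dc2f1d8`, then later `detach_old_disk data wwn-0x5000cca25dc75594`, both run as root. In practice `zpool replace` does all of this in one command, so that is the simpler choice.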
And it worked! Now it’s resilvering without anything going down. The status shows the pool being busy:
$ zpool status
  pool: data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Nov 20 11:09:02 2022
        186G scanned at 23.3G/s, 123G issued at 15.4G/s, 8.29T total
        0B resilvered, 1.45% done, 0 days 00:09:03 to go
config:

        NAME                          STATE     READ WRITE CKSUM
        data                          DEGRADED     0     0     0
          mirror-0                    DEGRADED     0     0     0
            replacing-0               DEGRADED     0     0     0
              wwn-0x5000cca25dc75594  FAULTED     10     0 37.1K  too many errors
              wwn-0x5000cca25dc2f1d8  ONLINE       0     0     0
            wwn-0x5000cca25dc76378    ONLINE       0     0     0
          mirror-1                    ONLINE       0     0     0
            wwn-0x5000cca25c18ca4c    ONLINE       0     0     0
            wwn-0x5000cca25c18caa4    ONLINE       0     0     0
          mirror-2                    ONLINE       0     0     0
            wwn-0x5000cca25c1749fc    ONLINE       0     0     0
            wwn-0x5000cca244329824    ONLINE       0     0     0

errors: No known data errors
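To keep an eye on the resilver without rereading the whole status each time, a small filter can pull out just the percentage. This is a minimal sketch: `resilver_progress` is a made-up name, and it assumes the scan section contains a field ending in `%` followed by the word `done`, as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: print the "NN.NN%" progress figure from
# `zpool status` output read on stdin. Prints nothing if no such
# field is found (for example when no resilver is running).
resilver_progress() {
    awk '{
        # Look for a field ending in "%" that is followed by "done".
        for (i = 1; i < NF; i++) {
            if ($i ~ /%$/ && $(i + 1) ~ /^done/) {
                print $i
                exit
            }
        }
    }'
}
```

It would be used as `zpool status data | resilver_progress`, for instance inside a loop with a `sleep` between calls.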
Now I’m just crossing my fingers that the other drive in that mirror does not die on me before the resilver finishes. That would be really problematic: if it failed too, the whole pool would be dead (I have backups, but restoring from them is never a pleasant process to go through…).