Replacing faulty disks with ZFS

OK, this morning I got a notification on my phone that my storage pool was DEGRADED. After letting my fear of downtime cool down and jumping on my machine to see what was going on, I could see where the issue was coming from:

    $ zpool status

      pool: data
     state: DEGRADED
    status: One or more devices are faulted in response to persistent errors.
            Sufficient replicas exist for the pool to continue functioning in a
            degraded state.
    action: Replace the faulted device, or use 'zpool clear' to mark the device
            repaired.
      scan: scrub repaired 3.46M in 0 days 07:57:09 with 0 errors on Sun Nov 20 08:06:11 2022
    config:

            NAME                        STATE     READ WRITE CKSUM
            data                        DEGRADED     0     0     0
              mirror-0                  DEGRADED     0     0     0
                wwn-0x5000cca25dc75594  FAULTED     10     0 37.1K  too many errors
                wwn-0x5000cca25dc76378  ONLINE       0     0     0
              mirror-1                  ONLINE       0     0     0
                wwn-0x5000cca25c18ca4c  ONLINE       0     0     0
                wwn-0x5000cca25c18caa4  ONLINE       0     0     0
              mirror-2                  ONLINE       0     0     0
                wwn-0x5000cca25c1749fc  ONLINE       0     0     0
                wwn-0x5000cca244329824  ONLINE       0     0     0

    errors: No known data errors
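
Reading that table by eye works, but the same check can be scripted. Here is a minimal sketch, assuming the column layout shown above (device name first, state second). `faulted_devices` is a hypothetical helper name, not something that ships with ZFS.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS): reads `zpool status` output on stdin
# and prints the name of every device whose STATE column is not healthy.
# Assumes the layout shown above: name in column 1, state in column 2.
faulted_devices() {
  # Skip header lines such as "state:" or "status:" (first field ends in ':').
  awk '$1 !~ /:$/ && $2 ~ /^(FAULTED|UNAVAIL|REMOVED|OFFLINE)$/ { print $1 }'
}

# Usage on a live system (commented out because it needs a real pool):
#   zpool status data | faulted_devices
```

Fed the output above, it should print only the faulted disk, since `data` and `mirror-0` are DEGRADED rather than FAULTED.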

So I’m facing a faulty disk, and since my vdevs are mirrored, nothing was lost or even stopped. Thank you ZFS!

Being the good sysadmin that I am, I grabbed my spare drive and pushed it into the storage array (thank you, hot plugging!), and it showed up as available.

So now, how to replace the faulty drive?

I thought I needed to attach the drive to the pool, but that failed:

    $ sudo zpool attach data mirror-0 /dev/disk/by-id/wwn-0x5000cca25dc2f1d8
    cannot attach /dev/disk/by-id/wwn-0x5000cca25dc2f1d8 to mirror-0: can only attach to mirrors and top-level disks
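
The likely reason, going by the `zpool attach` man page: its synopsis is `zpool attach pool device new_device`, where `device` is an existing disk in the pool rather than a vdev name such as `mirror-0`. Below is a sketch of the manual route under that assumption. It only prints the commands, `manual_replace_plan` is a hypothetical helper name, and this route was not tried on the pool above.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the commands for a manual
# attach-then-detach replacement instead of running them.
#   $1 pool, $2 healthy disk already in the mirror, $3 faulted disk, $4 new disk
manual_replace_plan() {
  # Attach the new disk next to an existing member of the mirror.
  echo "zpool attach $1 $2 $4"
  # Only once the resilver has finished: drop the faulted disk.
  echo "zpool detach $1 $3"
}

manual_replace_plan data \
  wwn-0x5000cca25dc76378 \
  wwn-0x5000cca25dc75594 \
  /dev/disk/by-id/wwn-0x5000cca25dc2f1d8
```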

So I tried the next thing I knew: replacing. Replacing attaches the new device, resilvers it, and then detaches the faulted (or missing) one. So I gave it a shot:

    $ sudo zpool replace data /dev/disk/by-id/wwn-0x5000cca25dc75594 /dev/disk/by-id/wwn-0x5000cca25dc2f1d8

And it worked! Now it’s resilvering with nothing down. The status shows it being busy:

    $ zpool status
      pool: data
     state: DEGRADED
    status: One or more devices is currently being resilvered.  The pool will
            continue to function, possibly in a degraded state.
    action: Wait for the resilver to complete.
      scan: resilver in progress since Sun Nov 20 11:09:02 2022
            186G scanned at 23.3G/s, 123G issued at 15.4G/s, 8.29T total
            0B resilvered, 1.45% done, 0 days 00:09:03 to go
    config:

            NAME                          STATE     READ WRITE CKSUM
            data                          DEGRADED     0     0     0
              mirror-0                    DEGRADED     0     0     0
                replacing-0               DEGRADED     0     0     0
                  wwn-0x5000cca25dc75594  FAULTED     10     0 37.1K  too many errors
                  wwn-0x5000cca25dc2f1d8  ONLINE       0     0     0
                wwn-0x5000cca25dc76378    ONLINE       0     0     0
              mirror-1                    ONLINE       0     0     0
                wwn-0x5000cca25c18ca4c    ONLINE       0     0     0
                wwn-0x5000cca25c18caa4    ONLINE       0     0     0
              mirror-2                    ONLINE       0     0     0
                wwn-0x5000cca25c1749fc    ONLINE       0     0     0
                wwn-0x5000cca244329824    ONLINE       0     0     0

    errors: No known data errors
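
To keep an eye on progress without rereading the whole status, the percentage can be pulled out of the `scan:` section. This is a minimal sketch, assuming the `N% done` wording shown above, which may differ between ZFS versions. `resilver_percent` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` output on stdin and prints the
# resilver completion percentage, or nothing if no matching line is found.
resilver_percent() {
  sed -n 's/.*resilvered, \([0-9.]*\)% done.*/\1/p'
}

# Usage on a live system (commented out because it needs a real pool):
#   zpool status data | resilver_percent
# On OpenZFS 2.0 or later, `zpool wait -t resilver data` should block until
# the resilver is done.
```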

Now I’m just crossing my fingers that the second drive in that mirror does not die on me before the resilver finishes: if it failed too, the whole pool would be dead (I have backups, but restoring is never a pleasant process to go through…).