I'm running a burn-in period on a new set of disks. The pool is three RAIDZ1 vdevs of four 4TB disks each; I was experimenting with throughput for 10GbE use, and this layout runs at about 900 meg/second read/write. I also have a spare drive allocated to the pool.
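For anyone who wants the layout spelled out, here is a rough sketch of the arithmetic behind that shape. The variable names are just mine for illustration, and the space figure ignores ZFS overhead and TB/TiB differences, so treat it as a ballpark only.

```shell
# Pool shape from the post: 3 RAIDZ1 vdevs, 4 disks each, 4TB per disk.
vdevs=3
disks_per_vdev=4
parity=1      # RAIDZ1 keeps one disk's worth of parity per vdev
disk_tb=4

total_disks=$(( vdevs * disks_per_vdev ))
data_tb=$(( vdevs * (disks_per_vdev - parity) * disk_tb ))

echo "disks in pool (excluding spares): ${total_disks}"
echo "space before overhead: ${data_tb} TB"
echo "disk failures tolerated per vdev: ${parity}"
```

The last line is the one that matters for this thread: each RAIDZ1 vdev is only expected to survive the loss of a single member.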
Yesterday one of the disks popped, recovered, resilvered to the Spare, that completed, then popped again and died, and the Spare stepped in again. The pool was resilvering when I came to look this morning. But some worrying additional r/w errors are seen on other hdds in the same RAIDZ1:
Code:
root@Sisyphus:~ # zpool status
  pool: DataPool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Mar 28 10:37:08 2020
        25.0T scanned at 3.82G/s, 19.2T issued at 2.93G/s, 25.0T total
        2.62G resilvered, 76.73% done, 0 days 00:33:48 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        DataPool                                          DEGRADED     2     0 5.48M
          raidz1-0                                        ONLINE       0     0     0
            gptid/7abf8085-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/8d7ccf17-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/94977190-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/9348eedb-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
          raidz1-1                                        DEGRADED     2     0 11.0M
            gptid/82f438bf-6a02-11ea-a922-a0369f4e18bc    DEGRADED     0     0     0  too many errors
            spare-1                                       DEGRADED     0     0 4.03K
              12077469904772203790                        REMOVED      0     0     0  was /dev/gptid/8d6b27be-6a02-11ea-a922-a0369f4e18bc
              gptid/014fbddf-6bcc-11ea-9cdb-a0369f4e18bc  ONLINE       0     0     0
            gptid/8f66452c-6a02-11ea-a922-a0369f4e18bc    FAULTED    114   693     0  too many errors
            gptid/8f771d80-6a02-11ea-a922-a0369f4e18bc    ONLINE     104   112     0
          raidz1-2                                        ONLINE       0     0     0
            gptid/8ad40550-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/8f548563-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/44ad6394-6aec-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/8ac32412-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
        spares
          4973883019692991099                             INUSE     was /dev/gptid/014fbddf-6bcc-11ea-9cdb-a0369f4e18bc

errors: 5744771 data errors, use '-v' for a list
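With this many rows it is easy to miss which devices actually have non-zero counters. Here is a small filter that might help: a sketch only, assuming the usual NAME STATE READ WRITE CKSUM column order in the config section. The function name `zpool_nonzero_errors` is made up for illustration.

```shell
# Print only the rows of `zpool status` (read from stdin) whose READ, WRITE
# or CKSUM counter is non-zero. Assumes the standard five-column device table.
zpool_nonzero_errors() {
    awk '
        # The header row marks the start of the device table.
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }
        # The "errors:" summary line marks the end of it.
        $1 == "errors:" { in_cfg = 0 }
        # Rows with fewer than five fields (e.g. the spares list) are skipped.
        in_cfg && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            printf "%s %s read=%s write=%s cksum=%s\n", $1, $2, $3, $4, $5
        }
    '
}
```

Usage would be something like `zpool status DataPool | zpool_nonzero_errors`. Note that pool and vdev summary rows are printed too if their counters are non-zero.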
I didn't want to swap out the spare with those other errors also in play, so I'm leaving the resilver to run to completion. I also didn't want to pull a second disk in a RAIDZ1; that would of course be suicidal.
Instead I added another spare to take on that FAULTED drive:
Code:
root@Sisyphus:~ # zpool status
  pool: DataPool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Mar 28 12:30:51 2020
        14.6T scanned at 471M/s, 9.82T issued at 1.20G/s, 25.0T total
        0 resilvered, 39.34% done, 0 days 03:36:12 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        DataPool                                          DEGRADED     2     0 5.48M
          raidz1-0                                        ONLINE       0     0     0
            gptid/7abf8085-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/8d7ccf17-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/94977190-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/9348eedb-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
          raidz1-1                                        DEGRADED     2     0 11.0M
            spare-0                                       DEGRADED     0     0     0
              gptid/82f438bf-6a02-11ea-a922-a0369f4e18bc  DEGRADED     0     0     0  too many errors
              gptid/cfe11c7a-70ef-11ea-895f-a0369f4e18bc  ONLINE       0     0     0
            spare-1                                       DEGRADED     0     0 4.03K
              12077469904772203790                        REMOVED      0     0     0  was /dev/gptid/8d6b27be-6a02-11ea-a922-a0369f4e18bc
              gptid/014fbddf-6bcc-11ea-9cdb-a0369f4e18bc  ONLINE       0     0     0
            gptid/8f66452c-6a02-11ea-a922-a0369f4e18bc    FAULTED    114   693     0  too many errors
            gptid/8f771d80-6a02-11ea-a922-a0369f4e18bc    ONLINE     104   112     0
          raidz1-2                                        ONLINE       0     0     0
            gptid/8ad40550-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/8f548563-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/44ad6394-6aec-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/8ac32412-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
        spares
          4973883019692991099                             INUSE     was /dev/gptid/014fbddf-6bcc-11ea-9cdb-a0369f4e18bc
          8076838506038568445                             INUSE     was /dev/gptid/cfe11c7a-70ef-11ea-895f-a0369f4e18bc

errors: No known data errors
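For reference, whether it is done through the GUI or the shell, adding a hot spare generally boils down to `zpool add <pool> spare <device>`. The sketch below only prints that command rather than running it. The helper name `spare_add_cmd` is made up for illustration, and the device shown is simply the gptid that appears under spare-0 in the output above.

```shell
# Hypothetical helper: print (do not run) the command that would add
# a device to a pool as a hot spare.
spare_add_cmd() {
    pool=$1
    dev=$2
    printf 'zpool add %s spare %s\n' "$pool" "$dev"
}

# Dry run with the pool name and spare device seen in the status output.
spare_add_cmd DataPool gptid/cfe11c7a-70ef-11ea-895f-a0369f4e18bc
```

Printing first and pasting afterwards is just a precaution, given how easy it is to mistype a gptid on a pool that is already degraded.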
Now it's "No known data errors", magically? And the new spare-0 has allocated itself to a different drive? Well, ok - it did say "too many errors" on that one too. Well, let's bang in another spare and try again:
Code:
root@Sisyphus:~ # zpool status
  pool: DataPool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Mar 28 12:35:20 2020
        24.0T scanned at 1.43G/s, 16.1T issued at 228M/s, 25.0T total
        1.21G resilvered, 64.57% done, 0 days 11:17:39 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        DataPool                                          DEGRADED     2     0 5.48M
          raidz1-0                                        ONLINE       0     0     0
            gptid/7abf8085-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/8d7ccf17-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/94977190-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/9348eedb-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
          raidz1-1                                        DEGRADED     2     0 11.0M
            spare-0                                       DEGRADED     0     0     0
              gptid/82f438bf-6a02-11ea-a922-a0369f4e18bc  DEGRADED     0     0     0  too many errors
              gptid/cfe11c7a-70ef-11ea-895f-a0369f4e18bc  ONLINE       0     0     0
            spare-1                                       DEGRADED     0     0 4.03K
              12077469904772203790                        REMOVED      0     0     0  was /dev/gptid/8d6b27be-6a02-11ea-a922-a0369f4e18bc
              gptid/014fbddf-6bcc-11ea-9cdb-a0369f4e18bc  ONLINE       0     0     0
            spare-2                                       DEGRADED     0     0 14.9K
              gptid/8f66452c-6a02-11ea-a922-a0369f4e18bc  FAULTED    114   693     0  too many errors
              gptid/4b125e82-70f0-11ea-895f-a0369f4e18bc  ONLINE       0     0     0
            gptid/8f771d80-6a02-11ea-a922-a0369f4e18bc    ONLINE     104   112     0
          raidz1-2                                        ONLINE       0     0     0
            gptid/8ad40550-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/8f548563-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/44ad6394-6aec-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
            gptid/8ac32412-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
        spares
          4973883019692991099                             INUSE     was /dev/gptid/014fbddf-6bcc-11ea-9cdb-a0369f4e18bc
          8076838506038568445                             INUSE     was /dev/gptid/cfe11c7a-70ef-11ea-895f-a0369f4e18bc
          14815739748345029129                            INUSE     was /dev/gptid/4b125e82-70f0-11ea-895f-a0369f4e18bc

errors: No known data errors
Can someone explain to me how this pool is working at all? Three disks in a RAIDZ1 that are in a bad enough way that the system has all by itself subbed in three spares, and yet "No known data errors" (and there were originally!) and the pool is still up and running?
Spares are clearly doing some sort of deep magic here. Much more than was revealed in my recent thread "What are Spare drives in pools useful for?"
Also I probably ought to add another spare for that last member of the RAIDZ1 since that's having read/write errors too! But I've run out of slots unless I plumb in the other disk shelf. And I don't understand how this is still working! With zero data errors!
Help?