Vegard Kolbeinsen
Explorer
- Joined: Jul 19, 2016
- Messages: 72
It first started with the NAS suddenly rebooting. The NAS is now several years old, a Supermicro with 16 of 24 hard drive bays populated, with drives ranging in size from the oldest 320GB disks up to newer 4TB disks. They are all set up as mirrors, so if I need more space I just add 2 more drives, and if one fails I just replace the disk in that mirror.
So the problem is this sudden reboot every 15-45 minutes. To start with it looked like a 3TB disk in the first mirror was failing: "less /var/log/messages" did not show any errors, but I was standing by the NAS once when it rebooted, and the display showed errors on the disk in bay 2 just before the reboot. I tried running some SMART tests on it; at first it gave no errors, then it gave some errors, then no errors again, which was kind of weird.
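For reference, the SMART tests were just the standard smartctl ones, roughly like this (ada2 here is only a placeholder for whatever device bay 2 actually maps to):
Code:
# start a short self-test, then a long one once the short has finished
smartctl -t short /dev/ada2
smartctl -t long /dev/ada2

# afterwards, check the self-test log and the health/attributes
smartctl -l selftest /dev/ada2
smartctl -a /dev/ada2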
Without spending too much time on it, I went out and got 2 new 4TB disks to just replace it. And then I ran into the second problem: I was unable to replace the unit. I tried gpart destroy on the disk, and only after a couple of reboots was I able to start the replacement, as one can see from the zpool status below.
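The wipe itself was just gpart, something like this (again, ada2 as a placeholder for the disk going into bay 2):
Code:
# check what is on the disk, then force-destroy its partition table
gpart show ada2
gpart destroy -F ada2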
And this led to the third problem: even with the replacement running, the system still randomly reboots, and the same error appeared on bay 2 with the brand new drive. By now I had dusted off an old PC lying around, hooked up the "defective" 3TB disk and started running some extended testing (so far the short and long tests show no errors, and it is doing a zero bit test now).
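The zero bit test is basically just writing zeros across the whole disk and watching for failures; the rough dd equivalent would be this (da0 being a placeholder for whatever the 3TB disk shows up as on that machine, and it of course wipes everything on it):
Code:
# write zeros across the entire disk, then re-run a long SMART test on it
dd if=/dev/zero of=/dev/da0 bs=1m
smartctl -t long /dev/da0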
It's been a while since I last cleaned the NAS, so I shut it off and went over it with a vacuum cleaner and compressed air, taking out each bay and getting all the dust out.
After booting up there were no errors on bay 2, but the random reboots continued. This time I saw an error just before a reboot (nothing in the log) saying that bay 4 was acting weird. Again I shut down, took the drive out, checked the SATA cable and put it back in. That error seems to be gone, but there are still some weird reboots.
I am now thinking the problem is either a SATA cable that has suddenly gone bad (not sure if that's even possible) or the SATA controller starting to die on me. I will have to check whether bay 2 and bay 4 are on the same SATA controller to be sure. For now I am just letting it keep resilvering to see if that can finish before the next reboot; it has been up for 35 minutes without problems so far.
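When I get a chance, I plan to map the devices to their controllers with something like this (ada2/ada4 are placeholders for whatever bay 2 and bay 4 actually show up as):
Code:
# list all disks together with the bus/controller they are attached to
camcontrol devlist -v

# or check which driver/channel each device attached on at boot
dmesg | grep -E "ada2|ada4"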
But is there a way to "stress test" each drive beyond just SMART, to see which bay triggers a reboot?
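The only idea I have so far is a plain sequential read of each disk with dd, one bay at a time, roughly like this (ada2 again just a placeholder), and seeing whether hammering one particular bay triggers the reboot. I'd like something more thorough if it exists.
Code:
# read the whole disk sequentially and watch for errors or resets
dd if=/dev/ada2 of=/dev/null bs=1m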
Zpool Status:
Code:
root@freenas:~ # zpool status
  pool: FreeNAS
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Mar 27 15:04:42 2020
        4.10T scanned at 2.37G/s, 2.56T issued at 3.98G/s, 13.6T total
        0 resilvered, 18.82% done, 0 days 00:47:22 to go
config:

        NAME                                                STATE     READ WRITE CKSUM
        FreeNAS                                             DEGRADED     0     0     0
          mirror-0                                          DEGRADED     0     0     0
            gptid/b6ff17e7-5d7b-11e6-9dea-0025900943c8      ONLINE       0     0     0
            replacing-1                                     DEGRADED     0     0     0
              6299447651800029231                           OFFLINE      0     0     0  was /dev/gptid/b7b83d80-5d7b-11e6-9dea-0025900943c8
              gptid/e05a2fe7-7033-11ea-84e3-0025900943c8    ONLINE       0     0     0
          mirror-1                                          ONLINE       0     0     0
            gptid/682e9cdf-5d7d-11e6-9dea-0025900943c8      ONLINE       0     0     0
            gptid/690ee98c-5d7d-11e6-9dea-0025900943c8      ONLINE       0     0     0
          mirror-2                                          ONLINE       0     0     0
            gptid/62810637-5d8f-11e6-a561-0025900943c8      ONLINE       0     0     0
            gptid/646e6ae8-5d8f-11e6-a561-0025900943c8      ONLINE       0     0     0
          mirror-3                                          ONLINE       0     0     0
            gptid/27a61ff8-5d93-11e6-a561-0025900943c8      ONLINE       0     0     0
            gptid/2886473a-5d93-11e6-a561-0025900943c8      ONLINE       0     0     0
          mirror-4                                          ONLINE       0     0     0
            gptid/1722191b-5f8d-11e6-9f9d-0025900943c8      ONLINE       0     0     0
            gptid/5ef74856-1e00-11e8-b701-0025900943c8      ONLINE       0     0     0
          mirror-5                                          ONLINE       0     0     0
            gptid/cf05621c-de46-11e6-aa30-0025900943c8      ONLINE       0     0     0
            gptid/af074c71-605e-11e6-9f9d-0025900943c8      ONLINE       0     0     0
          mirror-6                                          ONLINE       0     0     0
            gptid/ade86ee6-647d-11e6-b38d-0025900943c8      ONLINE       0     0     0
            gptid/aeab2d2c-647d-11e6-b38d-0025900943c8      ONLINE       0     0     0
          mirror-7                                          ONLINE       0     0     0
            gptid/033fc681-6486-11e6-b38d-0025900943c8      ONLINE       0     0     0
            gptid/058ab140-6486-11e6-b38d-0025900943c8      ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:24 with 0 errors on Fri Mar 27 03:45:24 2020
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          ada0p2      ONLINE       0     0     0

errors: No known data errors