cmh · Explorer · Joined: Jan 7, 2013 · Messages: 75
Have searched, read through a couple posts that looked related, nothing seemed to match up quite right.
Problem summary:
Got a single checksum error two months ago and cleared it. Got it again yesterday; while trying to figure out which drive was at fault I made some mistakes, and then I got checksum errors on all five drives in the pool. Cleaned the machine and reseated all the cables, and now it won't boot.
Details:
I've got an old FreeNAS Mini motherboard - got it from a friend when support replaced his; he had already bought another. It's connected to two SSDs for the boot mirror and five 8TB Seagate IronWolf drives, all bought in early 2018 over a span of four months. The drives sit in a SuperMicro 5-bay unit that fills 3x 5.25" bays in an old case. The power supply was modern at the time of the build - a 380W Antec.
TrueNAS Core is installed on the SSDs, updated to 13.0-U5.3 last week. (I've also got a backup NAS - my previous one - which I upgrade first.) The main pool is a RAIDZ2.
On June 27th, I got a warning about a single checksum error. Followed the procedure, reset the count, triggered a scrub of the pool, and confirmed all my drives were included in both the long and short SMART test schedules. All good.
Yesterday I got the warning again; checked, and it was a single checksum error again. I was busy with several other things going on, so I did the zpool clear again and didn't note which drive had the errors. (I didn't note the drive in June, either - I know, dumb, but I really didn't want to have to deal with it at that moment.)
Last night it failed again, but this time with a bunch of checksum errors. The pool was still okay and everything was working. Triggered a scrub and wound up with over 3,000 checksum errors across all five drives, but the scrub finished this morning, having repaired 694M out of 36TB. I grabbed a portion of the zpool status output:
Code:
scan: scrub repaired 694M in 12:13:17 with 0 errors on Fri Aug 11 07:45:10 2023
config:

	NAME                                            STATE     READ WRITE CKSUM
	sto                                             ONLINE       0     0     0
	  raidz2-0                                      ONLINE       0     0     0
	    gptid/d012e43e-67e7-11e8-9745-d05099c38c29  ONLINE       0     0 3.14K
	    gptid/d0fa3a57-67e7-11e8-9745-d05099c38c29  ONLINE       0     0 3.15K
	    gptid/d1dffb15-67e7-11e8-9745-d05099c38c29  ONLINE       0     0 3.21K
	    gptid/d2e4a62a-67e7-11e8-9745-d05099c38c29  ONLINE       0     0 3.21K
	    gptid/d3d3b8b0-67e7-11e8-9745-d05099c38c29  ONLINE       0     0 3.18K
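In case it helps anyone else avoid my mistake: a minimal sketch of how I could have recorded the offending drive before running zpool clear. The awk filter assumes the column layout shown above (CKSUM is the fifth field on the vdev lines); the demo feeds it a captured fragment rather than the live pool, so the idea would be something like `zpool status sto | awk '...'` on a real system.

```shell
# Sketch: list vdevs with a nonzero CKSUM count from saved `zpool status`
# output, so the culprit drive is on record before `zpool clear sto`.
# Demo runs against a captured fragment instead of the live pool.
awk '$1 ~ /^gptid\// && $5 != "0" { print $1, "CKSUM=" $5 }' <<'EOF'
	NAME                                          STATE     READ WRITE CKSUM
	sto                                           ONLINE       0     0     0
	  raidz2-0                                    ONLINE       0     0     0
	    gptid/d012e43e-67e7-11e8-9745-d05099c38c29  ONLINE    0     0 3.14K
	    gptid/d0fa3a57-67e7-11e8-9745-d05099c38c29  ONLINE    0     0 3.15K
EOF
```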
Oh, I should also mention that I looked at the smartctl -a output for all of those drives, and *all five* had a significant number of errors: raw read error rate, seek error rate, and G-Sense error rate were all concerningly high. I didn't save that output, and like a bonehead I took the opportunity to finally install that pending iTerm update, so all my scrollback is lost. I'm hard pressed to imagine all five disks failing simultaneously, and I had the system configured to run SMART tests on all disks and haven't heard a peep from that. The drives were purchased in 2018 - one in February, three in March, and the last in May - so I really doubt they're from the same batch. Why did I spread the purchases across four months? That's a question for past me, because present me has absolutely no idea.
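For what it's worth, next time I'll dump each drive's report to a file first (smartctl -a /dev/adaN > adaN.smart, or whatever the device names are) before touching anything. A throwaway sketch of pulling those three attributes back out of a saved report - the attribute names and columns follow smartmontools' standard table format, but the raw values in the sample are made up for illustration:

```shell
# Sketch: extract the worrying attributes from saved `smartctl -a`
# output. Attribute table columns are: ID# ATTRIBUTE_NAME FLAG VALUE
# WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE; print name + raw.
# Raw values in this sample are made up for illustration.
awk '$2 ~ /^(Raw_Read_Error_Rate|Seek_Error_Rate|G-Sense_Error_Rate)$/ { print $2, $NF }' <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   081   064   044    Pre-fail  Always       -       117642880
  7 Seek_Error_Rate         0x000f   090   060   045    Pre-fail  Always       -       1015728642
191 G-Sense_Error_Rate      0x0032   092   092   000    Old_age   Always       -       16538
EOF
```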
At this point it feels like maybe a power supply issue, or a bad connection, or possibly the controller on the motherboard?
So I shut it down, took it outside, blew it out (not really all that much dust, despite it living in the basement), and reconnected every cable. I looked over the system board and nothing seemed amiss, but it would have to be spewing smoke or visibly damaged for me to pick up on it.
Brought it back downstairs, plugged it in, and it starts to boot, but it gets to "lo0: link state changed to UP" and that's the last thing it does. I've given it plenty of time to get past that point, and so far nothing.
Anyone have any thoughts or suggestions on how I might get this thing back online? I'm thinking it might be time to pony up and order a new TrueNAS Mini, but I'd like to get this system back up and running at least until that arrives. I'm considering reinstalling onto a different drive to see if that works.
Thanks - hopefully this info is as complete as it can be, given that I didn't log the data that sure would have been useful to have now.