Was hoping some folks on the forum might be able to help me troubleshoot this issue.
This is an all-flash system I built a little less than a year ago with the help of this forum. The specs are:
Chassis: SUPERMICRO CSE-216BE2C-R920LPB
Motherboard: SUPERMICRO MBD-X10SRH-CLN4F-O (HBA is an onboard LSI 3008 flashed to IT mode)
CPU: Xeon E5-1650v4
RAM: 4x32GB Samsung DDR4-2400
Boot Drive: Two mirrored SSD-DM064-SMCMVN1 (64GB DOM)
SLOG: Intel P4800X (my performance on that is posted on the SLOG benchmarking thread)
Array Drives: 8 x Samsung 883 DCT 1.92 TB
Network: Chelsio T520-BT
It has worked extremely well in production since then, serving exclusively as the datastore for an ESXi cluster on a 10Gb network. The activity it sees is a few dozen VMs with varying workloads (virtual desktops, network servers, a few light databases) and nightly backups to a few different places. All great; no issues.
About a month ago I upgraded to the 11.3 chain (I think it was RC2 but I'm not positive; it could have been RC1). A little after that, the following error popped up in the daily alerts:
Code:
Device: /dev/da7 [SAT], not capable of SMART self-check.
Went into the console; SMART seemed to be working fine. Ran short and long tests on the drive; no issues.
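For reference, the manual tests were roughly the following (da7 being the flagged drive; substitute the device node as appropriate):
Code:
smartctl -t short /dev/da7     # short self-test
smartctl -t long /dev/da7      # extended self-test (takes a while, even on SSDs)
smartctl -l selftest /dev/da7  # review the self-test log once each run completes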
A little while later, the system reboots. I'd never had a random reboot on this system before. OK, I think (maybe not entirely rationally), this is just bad luck--maybe a cosmic ray hit the controller, whatever. But the reboots keep happening, about once a day. The logs have entries like this around the time of the reboots:
Code:
mpr0: mprsas_action_scsiio: Freezing devq for target ID 24
(da7:mpr0:0:24:0): WRITE(10). CDB: 2a 00 db 9d 83 18 00 00 10 00
(da7:mpr0:0:24:0): CAM status: CAM subsystem is busy
(da7:mpr0:0:24:0): Retrying command
I've been running FreeNAS for probably 6 or 7 years, and the only issues I've ever had are bad drives, so I immediately assume that da7 has just gone bad. Swap it. A few hours later I get:
Code:
Device: /dev/da6 [SAT], not capable of SMART self-check.
da6 is not the replacement drive; it's the next one down in the pool. And the reboots keep happening--about once a day. (BTW, kudos to VMware for how resilient ESXi is to a complete reboot of the datastore; the VMs freeze, but nearly all of them resume after the store comes back--without a reboot.) I don't really know what is going on, but it seems more likely to be a controller or cable issue.
So I migrate all the VMs off this store, pull out the chassis, and take a look. It's fairly dusty (note to self: look at improving the server room environment), so I blow out all the dust, reseat all the cables, and, for good measure, move all the drives to previously unused ports on the backplane (8 drives in a 24-bay chassis, plenty of room). I also take the old da7 drive and add it back as a hot spare (bringing the drive total to 9), as I'm now thinking it wasn't bad.
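(For the record, re-adding the old drive as a hot spare is a one-liner from the CLI; the device name below is hypothetical, since the drive re-enumerated after the port shuffle:)
Code:
zpool add flashpool spare /dev/da8   # old da7, re-added as a hot spare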
Reboot, and all seems well. Drives all pass smartctl. Run a few read/write tests; wait a day; all good. Start migrating some non-essential VMs back onto it. No issues. VMs work fine, backups work fine. Six days of solid uptime. I'm feeling good; migrate a few of the bigger--but still not mission-critical--VMs back onto it. All OK.
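(The smartctl pass was just a quick health sweep over all nine drives, something like:)
Code:
for d in da0 da1 da2 da3 da4 da5 da6 da7 da8; do
  echo "== $d =="
  smartctl -H /dev/$d | grep -i overall-health
done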
Then, this morning (7 days of uptime in), I wake up to the following e-mail:
Current alerts:
* Scrub of pool 'flashpool' finished.
* Scrub of pool 'freenas-boot' finished.
* Device: /dev/da8 [SAT], not capable of SMART self-check.
* Device: /dev/da6 [SAT], Read SMART Self-Test Log Failed.
* Device: /dev/da5 [SAT], not capable of SMART self-check.
* Device: /dev/da5 [SAT], failed to read SMART Attribute Data.
I don't need to tell you that these drives are all fully capable of SMART tests and, in fact, pass them when run manually. The log is attached; you'll see the similar CAM status errors.
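(By "run manually" I mean running the same queries smartd failed on, e.g.:)
Code:
smartctl -l selftest /dev/da6  # the self-test log reads fine by hand
smartctl -A /dev/da5           # attribute data reads fine by hand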
I have not had any reboots, but I feel like that's coming...
Any thoughts or ideas on how to troubleshoot further? At this point, I'm not sure what to try besides swapping the SAS breakout cables, then the controller (annoying, as I'll have to get a PCIe HBA since this one is on the mobo), and, really worst case, the backplane (yuck). Any other/better ideas?
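One cheap check I can do first, in case a firmware/driver mismatch on the SAS3008 is somehow in play (just a guess on my part), is to compare the HBA firmware against the mpr(4) driver version:
Code:
sas3flash -list      # reports the controller's firmware and BIOS versions
dmesg | grep -i mpr  # shows the driver version logged at boot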
What else can I tell you that might be helpful? Let me see:
- At this point I'm on 11.3-RELEASE (I upgraded when I pulled the chassis).
- The only thing "unusual" that happened last night was that I had added an extra backup location in Veeam, so there would have been a little more load on the drives copying to the new repository.
- Drive temperatures are good--34 C max under load.
- I have logs going further back which I'm not posting, (a) to spare you, and (b) because they're handled by a remote server that stores them in an SQL database, so it's a little more work to post them in a usable format (but I'm happy to if it would help; see the query sketch below). To my eye, they look similar to the one I'm posting now.
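Pulling them would look something like this (the table and column names here are made up for illustration; the real schema comes from my remote syslog setup, and the database isn't actually SQLite, so treat this as a sketch):
Code:
# hypothetical schema; adjust table/column names to the real one
sqlite3 syslog.db "SELECT received_at, message FROM logs
                   WHERE host = 'freenas' AND message LIKE '%CAM status%'
                   ORDER BY received_at;"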
As always, I greatly appreciate the patience and insight on these boards.
Best,
Sam