Regular system (not storage) crashing since upgrade from Bluefin

artstar

Dabbler
Joined
Jan 10, 2022
Messages
36
So I waited to see what the general consensus was and felt it was fine to upgrade to Cobia 23.10.0.1 from my Bluefin install. I'm beginning to regret it, though, as every so often Cobia crashes: none of my mounts (NFS or SMB) are accessible, I can't SSH into the appliance, and the web UI loads the login page but that's as far as it gets.

When I look at the IPMI, I see the following:
[IPMI console screenshot showing kernel messages, including one referencing md127]


I have syslog streaming to Graylog, but since the storage behind it is an NFS share on the NAS, I lose the ability to keep logging when the NAS goes down, so there's a gap between the failure and the point where storage is restored after a lengthy shutdown and reboot. I'm going to move that to an alternative storage location to see what, if anything, the syslog may further reveal.
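In the meantime, assuming SCALE keeps a persistent journal on the boot devices (I still need to confirm that on my box), something like this after the next reboot should show what was logged locally just before the hang, without depending on the NFS share being up:

journalctl -b -1 -p warning --no-pager              # warnings and errors from the previous (crashed) boot
journalctl -b -1 --since "20:00" --no-pager         # or narrow it to a time window (time here is just an example)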

The storage itself survives, which is comforting, but the system just grinds to a halt. It's an Atom-based Supermicro system, using a pair of SSDs in a hardware RAID1 for system drives, so they're not at all part of the ZFS pool, if that detail is of any value in this discussion.

The shutdown process is long, as everything being shut down times out, including unmounts:
[Console screenshot of the shutdown, with services and unmounts timing out]


Is anyone else seeing similar issues, and if so, were you able to resolve them?
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Don't think we've seen anything else like this.
Can you describe your system?

What was the upgrade process... did problems start immediately?

Are boot devices OK... not full?
md127 seems to have a strange message... is that a software RAID?
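For the boot devices, something along these lines should show whether they're healthy and not filling up (assuming the default SCALE boot pool name):

zpool list boot-pool     # boot pool capacity
zpool status boot-pool   # any errors on the boot devices
df -h /                  # root filesystem usage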
 

artstar

Dabbler
Joined
Jan 10, 2022
Messages
36
Sure thing and thanks for getting back to me.

Supermicro A2SDi-8C-HLN4F (Atom C3758 CPU), 32GB RAM, two 256GB Samsung 860 EVO drives (RAID1)

Problem started about 19 hours after installation. Second incident (the one reported here) was about 28 hours later.

Boot devices are seemingly healthy (this is about six hours after boot). Can't identify the state of the drives when it does crash, unfortunately.
[Screenshot of the boot device status, showing healthy drives]


RAID (for system drives) is set up in motherboard BIOS, so just using the onboard SATA controller.
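If it helps, this is how I was planning to check whether that BIOS RAID is what Linux is exposing as md127 (standard tools, nothing SCALE-specific assumed):

cat /proc/mdstat                      # lists any md software/firmware RAID arrays and their state
mdadm --detail /dev/md127             # members, RAID level, and whether the array is degraded
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT    # how the physical drives and the md device map to mountpoints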
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
It may or may not be related to SCALE 23.10... 19 hours later is quite a delay.

We don't really recommend the RAID-1 boot device setup... for good SSDs, we find it makes systems less reliable. The SSDs don't fail very often, and RAID adds more complexity.

In TrueNAS appliances we have switched to a single boot device per controller...
 

artstar

Dabbler
Joined
Jan 10, 2022
Messages
36
That's an interesting approach, considering the enterprise systems I work with all use RAID mirrored SSDs for system drives.

This was never a problem with Angelfish or Bluefin over the last two years or so, but I'll give the single-drive approach a shot to see if that helps.

Not to be rude, but I'm not inclined to put this down to coincidence. Something about Cobia definitely doesn't like my setup compared to previous versions. It's likely I'll end up having to go back to Bluefin.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Understand your skepticism... I just didn't find any similar cases.
If you rolled back to Bluefin and it's stable... that might confirm your suspicion.
 