System failure/crashing on reboot

bieringm

Cadet
Joined
Oct 15, 2021
Messages
5
Hey all,

I upgraded my TrueNAS machine about 3-4 weeks ago and all was good. Fast forward to yesterday: I got an email alert stating "Pool ... state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected." I went to log into the dashboard but was unable to connect to the system, and I was also unable to ping it. I then restarted the machine, got the email alert again (a few minutes after boot), and was still unable to log in. I then plugged a monitor into my GPU and was greeted with the following screenshots:
[Two screenshot links that failed to embed]

At this point I cannot SSH into or ping the system, and it just sits in this state. I am not sure what the cause of this error is.

The system is as follows:
Ryzen 5 3600
Asrock B550 Phantom Gaming mobo
Mushkin Redline 64GB DDR4 3200 RAM
4x Intel 670p 2TB M.2 running on Dell 6N9RH (Raid Z1)
4x Seagate IronWolf 4TB HDDs (Raid Z1)
1x Kingston A400 240GB Sata SSD (Cache Drive)
1x Kingston A400 120GB Sata SSD (Install Drive)

The Intel drives were added on with the most recent rebuild.
Last edited:

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
bieringm said: "4x Intel 670p 2TB M.2 running on Dell 6N9RH (Raid 5)"
I think this is already the source of your problem and unfortunately it's going to be too late to avoid data loss.

Please read this:

If you've used a RAID5 configuration in your RAID card to present a volume to ZFS, it's not possible for ZFS to control the data in a way that prevents (and can correct) corruption.

It can, however, detect such corruption, which is what you're now seeing.
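For reference, the error counters ZFS keeps for this are visible in zpool status; a minimal check, assuming a hypothetical pool name "flash" (the command -v guard just skips hosts without ZFS installed):

```shell
# Show pool health, per-device READ/WRITE/CKSUM error counters, and any
# files affected by unrecoverable errors. On a pool backed by a single
# hardware-RAID volume, ZFS can detect these errors but not repair them.
if command -v zpool >/dev/null 2>&1; then
    zpool status -v flash
else
    echo "zpool not available on this host"
fi
```

After the underlying fault is fixed, "zpool clear flash" resets the counters, and "zpool scrub flash" re-verifies every block.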

I don't think you've attached your screenshots properly, so I have not seen what you're sharing there.

bieringm

Cadet
Joined
Oct 15, 2021
Messages
5
sretalla said: "I think this is already the source of your problem and unfortunately it's going to be too late to avoid data loss. [...] I don't think you've attached your screenshots properly, so I have not seen what you're sharing there."
I was afraid of that. Thankfully I have a cron job that backs up that pool to the main SATA pool. I have attached the screenshots this time, so hopefully they can be viewed.
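A pool-to-pool backup like the one described can be a nightly cron job doing an incremental ZFS snapshot send; a rough sketch, where the pool/dataset names (flash/data, tank/backup) and the prev/now snapshot names are all hypothetical — TrueNAS's built-in Replication Tasks do the same thing with proper retention handling:

```shell
# Crontab entry (config fragment): run the replication script at 03:00 nightly.
# 0 3 * * * /root/replicate.sh

# /root/replicate.sh -- incremental send from the NVMe pool to the HDD pool.
zfs snapshot flash/data@now
zfs send -i flash/data@prev flash/data@now | zfs recv -F tank/backup/data
zfs destroy flash/data@prev                # drop the old base snapshot
zfs rename flash/data@now flash/data@prev  # new snapshot becomes the base
```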

Attachments

  • PXL_20230827_175150383.jpg (573.4 KB)
  • PXL_20230827_181719874.jpg (510.6 KB)

bieringm

Cadet
Joined
Oct 15, 2021
Messages
5
sretalla said: "I think this is already the source of your problem and unfortunately it's going to be too late to avoid data loss. [...] I don't think you've attached your screenshots properly, so I have not seen what you're sharing there."
With that being said, my understanding of the 6N9RH is that it is just a bifurcation card and has no real RAID controller on it; all the RAID is done in software via TrueNAS.

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
OK, so the "controllers" are all NVMe, so there's no hardware RAID... all good in that area then.

Running smartctl -a on each of the drives might point to the source of the issue.
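Something like the following loop gathers the usual failure indicators from every disk at once; the device patterns are assumptions (on TrueNAS CORE the SATA disks typically show up as ada*, with NVMe controllers at nvme*) — adjust them to match your system:

```shell
# Summarize SMART health for each disk. The -e test skips glob patterns
# that matched no device, so the loop is safe to paste as-is.
command -v smartctl >/dev/null 2>&1 || { echo "smartctl not installed"; exit 0; }
for disk in /dev/ada? /dev/nvme?; do
    [ -e "$disk" ] || continue
    echo "=== $disk ==="
    smartctl -a "$disk" | grep -Ei 'overall-health|reallocated|pending|uncorrect|media'
done
```

Nonzero Reallocated_Sector_Ct or Current_Pending_Sector counts on the IronWolf drives would line up with a failing-disk theory.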

bieringm

Cadet
Joined
Oct 15, 2021
Messages
5
I was able to get the machine to boot into the GUI and allow SSH again. I seem to have 2 failing drives in my HDD array, so I think that was causing most of the issues. Thanks, sretalla, for the help.