System failure/crashing on reboot

bieringm

Cadet
Joined
Oct 15, 2021
Messages
5
Hey all,

I upgraded my TrueNAS machine about 3-4 weeks ago and all was good. Fast forward to yesterday: I got an email alert stating "Pool ... state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected." I went to log into the dashboard but was unable to connect to the system, and I was also unable to ping it. I then restarted the machine, got the email alert again (a few minutes after boot), and was still unable to log in. I then plugged a monitor into my GPU and was greeted with the following screenshots:
[Two screenshot links that failed to embed]

At this point I cannot SSH into or ping the system, and it just sits in this state. I am not sure what the cause of this error is.

The system is as follows:
Ryzen 5 3600
Asrock B550 Phantom Gaming mobo
Mushkin Redline 64GB DDR4 3200 RAM
4x Intel 670p 2TB M.2 running on Dell 6N9RH (Raid Z1)
4x Seagate IronWolf 4TB HDDs (Raid Z1)
1x Kingston A400 240GB Sata SSD (Cache Drive)
1x Kingston A400 120GB Sata SSD (Install Drive)

The Intel drives were added on with the most recent rebuild.
Last edited:

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
bieringm said: "4x Intel 670p 2TB M.2 running on Dell 6N9RH (Raid 5)"
I think this is already the source of your problem and unfortunately it's going to be too late to avoid data loss.

Please read this:

If you've used a RAID5 configuration in your RAID card to present a volume to ZFS, it's not possible for ZFS to control the data in a way that prevents (and can correct) corruption.

It can, however, detect such corruption, which is what you're now seeing.
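For reference, the error counters ZFS keeps for this are visible in zpool status; a minimal check, assuming a hypothetical pool name "flash" (the command -v guard just skips hosts without ZFS installed):

```shell
# Show pool health, per-device READ/WRITE/CKSUM error counters, and any
# files affected by unrecoverable errors. On a pool backed by a single
# hardware-RAID volume, ZFS can detect these errors but not repair them.
if command -v zpool >/dev/null 2>&1; then
    zpool status -v flash
else
    echo "zpool not available on this host"
fi
```

After the underlying fault is fixed, "zpool clear flash" resets the counters, and "zpool scrub flash" re-verifies every block.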

I don't think you've attached your screenshots properly, so I have not seen what you're sharing there.

bieringm

Cadet
Joined
Oct 15, 2021
Messages
5
sretalla said: "I think this is already the source of your problem and unfortunately it's going to be too late to avoid data loss. [...] I don't think you've attached your screenshots properly, so I have not seen what you're sharing there."
I was afraid of that. Thankfully I have a cron job that backs up that pool to the main SATA pool. I have attached the screenshots this time, so hopefully they can be viewed.
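A pool-to-pool backup like the one described can be a nightly cron job doing an incremental ZFS snapshot send; a rough sketch, where the pool/dataset names (flash/data, tank/backup) and the prev/now snapshot names are all hypothetical — TrueNAS's built-in Replication Tasks do the same thing with proper retention handling:

```shell
# Crontab entry (config fragment): run the replication script at 03:00 nightly.
# 0 3 * * * /root/replicate.sh

# /root/replicate.sh -- incremental send from the NVMe pool to the HDD pool.
zfs snapshot flash/data@now
zfs send -i flash/data@prev flash/data@now | zfs recv -F tank/backup/data
zfs destroy flash/data@prev                # drop the old base snapshot
zfs rename flash/data@now flash/data@prev  # new snapshot becomes the base
```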

Attachments

  • PXL_20230827_175150383.jpg (573.4 KB)
  • PXL_20230827_181719874.jpg (510.6 KB)

bieringm

Cadet
Joined
Oct 15, 2021
Messages
5
sretalla said: "I think this is already the source of your problem and unfortunately it's going to be too late to avoid data loss. [...] I don't think you've attached your screenshots properly, so I have not seen what you're sharing there."
With that being said, my understanding of the 6N9RH is that it is just a bifurcation card and has no real RAID controller on it; all the RAID is done in software via TrueNAS.

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
OK, so the "controllers" are all NVMe, so there's no hardware RAID... all good in that area then.

Running smartctl -a on each of the drives might point to the source of the issue.
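Something like the following loop gathers the usual failure indicators from every disk at once; the device patterns are assumptions (on TrueNAS CORE the SATA disks typically show up as ada*, with NVMe controllers at nvme*) — adjust them to match your system:

```shell
# Summarize SMART health for each disk. The -e test skips glob patterns
# that matched no device, so the loop is safe to paste as-is.
command -v smartctl >/dev/null 2>&1 || { echo "smartctl not installed"; exit 0; }
for disk in /dev/ada? /dev/nvme?; do
    [ -e "$disk" ] || continue
    echo "=== $disk ==="
    smartctl -a "$disk" | grep -Ei 'overall-health|reallocated|pending|uncorrect|media'
done
```

Nonzero Reallocated_Sector_Ct or Current_Pending_Sector counts on the IronWolf drives would line up with a failing-disk theory.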

bieringm

Cadet
Joined
Oct 15, 2021
Messages
5
I was able to get the machine to boot into the GUI and allow SSH again. I seem to have 2 failing drives in my HDD array, so I think that was causing most of the issues. Thanks, sretalla, for the help.