Disk problems after upgrading to 9.3

Delivereath · Feb 21, 2015

Hi,

I was running my NAS with FreeNAS 9.1 until a week ago. I upgraded it to 9.3 and right after that, it started to have issues. My setup is two RAIDZ1 pools, one with 4x 1TB and one with 4x 2TB. Both are attached to an M1015 controller, and FreeNAS runs as a virtual machine in ESXi. FreeNAS has direct access to the controller using passthrough.

First I got a few warnings like :
- Firmware version 15 does not match driver version 16 for /dev/mps0
- zpool version not up to date

I updated my zpool version without any issue and this cleared the corresponding warning.

I also had some issues when reading video files from my NAS. At some point, the video would just freeze and I had to kill the task. At that exact moment, FreeNAS reported the following:

Feb 21 04:20:06 freenas (da1:mps0:0:0:0): READ(10). CDB: 28 00 0d c0 8a e8 00 00 08 00
Feb 21 04:20:06 freenas (da1:mps0:0:0:0): CAM status: SCSI Status Error
Feb 21 04:20:06 freenas (da1:mps0:0:0:0): SCSI status: Check Condition
Feb 21 04:20:06 freenas (da1:mps0:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Feb 21 04:20:06 freenas (da1:mps0:0:0:0): Info: 0xdc08ae8
Feb 21 04:20:06 freenas (da1:mps0:0:0:0): Error 5, Unretryable error

This happens continuously on da1.
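
As a side note on reading the log above: the READ(10) command block (CDB) encodes the starting logical block address and the number of blocks requested. A minimal sketch of decoding it, assuming the standard 10-byte READ(10) layout (opcode 0x28, flags, 4-byte big-endian LBA, group number, 2-byte transfer length, control). The function name `decode_read10` is only illustrative.

```python
import struct


def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes.

    Returns (lba, block_count). Assumes the standard 10-byte layout.
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    # opcode, flags, LBA (4 bytes), group number, transfer length (2 bytes), control
    _opcode, _flags, lba, _group, length, _control = struct.unpack(">BBIBHB", cdb)
    return lba, length


# CDB taken from the log line in this post.
lba, blocks = decode_read10("28 00 0d c0 8a e8 00 00 08 00")
print(hex(lba), blocks)  # prints: 0xdc08ae8 8
```

The decoded address matches the "Info: 0xdc08ae8" line, so the failed read started at that block and asked for 8 blocks.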

Next I ran a long smartctl test and got 2 disks (da1 and da4) which failed due to read errors.

I've also seen the following message in the logs:

Feb 21 04:42:33 freenas smartd[2479]: Device: /dev/da1 [SAT], 16 Currently unreadable (pending) sectors

I ran a manual scrub. It repaired 50MB of data, but I now have the following alert in the GUI:

CRITICAL: The volume PoolA (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

Two open points:
- Regarding the firmware/driver version mismatch message, could this lead to disk errors? Should I plan a firmware update?

- About the disk, do you think my da1 drive is dying? It's quite odd that this happened right after my update to 9.3. Is there anything I could do or test before changing the disk?

Thanks

Bidule0hm · Feb 21, 2015

You should back up the data NOW if you haven't already. da1 is dying; you should replace it as fast as you can after doing the backup. The replace operation has a real chance of failing since it's a RAID-Z1 and you say da4 is also beginning to have errors. This is why you should back up your data before doing anything.

I helped a member a few days ago with pretty much the exact same problem (but a bit worse, his pool doesn't mount after a reboot), and yesterday I found that the two failing drives are too badly damaged to get his data back.

You should also update the firmware to the P16 version to avoid some problems, but that's not what's causing the errors right now.

Delivereath · Feb 21, 2015

Ok, I will change the disk. I already have a backup, but I'm copying the data to my other pool anyway.

Is it enough to use the command cp -R -a to correctly copy the files? Is there any way to easily compare checksums of all the files after the copy is done?

Bidule0hm · Feb 21, 2015

Ok, good ;)

Yes. Note that -R is redundant with -a, as -a is the same as -RPp.

I don't know about the checksums, but I imagine there is a solution; just search the web.
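
One possible approach to the checksum question, as a rough sketch rather than a tested FreeNAS procedure: hash every regular file under the source and under the copy, then compare the two sets by relative path. The function names (`file_sha256`, `tree_checksums`, `compare_trees`) and the example paths in the comment are hypothetical; symlinks are skipped, and only file contents are compared, not permissions or timestamps.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_checksums(root):
    """Map each regular file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file() and not path.is_symlink()
    }


def compare_trees(source, copy):
    """Return (missing, extra, changed) lists of relative paths.

    missing: present in source but not in copy
    extra:   present in copy but not in source
    changed: present in both, but the contents differ
    """
    src = tree_checksums(source)
    dst = tree_checksums(copy)
    missing = sorted(set(src) - set(dst))
    extra = sorted(set(dst) - set(src))
    changed = sorted(name for name in set(src) & set(dst) if src[name] != dst[name])
    return missing, extra, changed


# Example call with hypothetical mount points (adjust to the real dataset paths):
# missing, extra, changed = compare_trees("/mnt/PoolA/data", "/mnt/PoolB/data")
```

If all three lists come back empty, every file in the source has a copy with identical contents. Note that reading a file that sits on the failing da1 may itself stall or fail with an I/O error, so this is best run once the copy has finished.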

DrKK · Feb 21, 2015

Just for the record, it would appear to be complete coincidence that the disk failure occurred around the same time as the upgrade to 9.3.

I agree with the other guy, you are in imminent failure zone here.

Delivereath · Feb 22, 2015

I've ordered the new disk.

Regarding the reconstruction of my RAIDZ1, what would be the best way to do it? Since I have all the data backed up, could I just delete the RAIDZ, create a new one and copy the data again?

gpsguy · Feb 22, 2015

Did you order 2 disks?

Your original message said "Next I ran a long smartctl test and got 2 disks (da1 and da4) which failed due to read errors."

Delivereath said:
I've ordered the new disk.

Delivereath · Feb 22, 2015

I had no errors when using da4. However, it failed the smartctl test due to a read error.

It may be that it just has a few bad sectors. Any way to test that and to disable any bad sector ?

Bidule0hm · Feb 22, 2015

You can write to that sector with the dd command to force a remap, but be extremely careful with dd: you can wipe your data if you don't know what you're doing. There are some threads on the forum with the details on how to do this ;)

But when a drive starts to have some bad sectors, it generally gets worse, and a few weeks (or even days) later you have many thousands of bad sectors and the drive is dead at that point... But you can be lucky and only have a few bad sectors for months without any problem; just keep an eye on that value :)
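
To keep an eye on that value without reading the logs by hand, something like the following sketch could pull the pending-sector count out of smartd messages. It assumes the message format shown earlier in this thread ("Device: /dev/da1 [SAT], 16 Currently unreadable (pending) sectors"); the function name `pending_sectors` is illustrative, and other smartd versions may word the message differently.

```python
import re

# Matches smartd lines like the one quoted earlier in this thread.
# The bracketed device type (e.g. "[SAT]") is treated as optional.
PENDING_RE = re.compile(
    r"Device: (?P<device>/dev/[\w/]+)(?: \[[^\]]*\])?, "
    r"(?P<count>\d+) Currently unreadable \(pending\) sectors"
)


def pending_sectors(log_lines):
    """Return {device: most recent pending-sector count} from syslog lines."""
    latest = {}
    for line in log_lines:
        match = PENDING_RE.search(line)
        if match:
            # Later lines overwrite earlier ones, so the last report wins.
            latest[match.group("device")] = int(match.group("count"))
    return latest


# Log line copied from the first post.
sample = [
    "Feb 21 04:42:33 freenas smartd[2479]: Device: /dev/da1 [SAT], "
    "16 Currently unreadable (pending) sectors",
]
print(pending_sectors(sample))  # prints: {'/dev/da1': 16}
```

Feeding it the lines of the system log (assumed here to be /var/log/messages) on a schedule and comparing the counts over time would show whether the number is stable or growing.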

Ericloewe · Feb 22, 2015

Delivereath said:
I had no errors when using da4. However, it failed the smartctl test due to a read error.

It may be that it just has a few bad sectors. Any way to test that and to disable any bad sector ?

Wrong thought process. By the time you know the sectors are bad, the drive knows it as well. ZFS should fill in any gaps that might exist (of course, RAIDZ1 severely limits this capability...).

You might tolerate one or two bad sectors, but have a replacement drive burned-in and ready to go anyway. If you don't tolerate the bad sector, replace it now. In both cases, make sure you have a spare ready to go at the end of this little exercise.
