SOLVED Pool Suspended for I/O Failure

albrecd

Dabbler
Joined
Jul 3, 2023
Messages
13
Hi everyone,
I've been running TrueNAS Core for about 5 years, and relatively recently upgraded to a pool of 4x 8TB drives in two mirrored VDEVs. It had been running smoothly for a few weeks, but unfortunately I got a pretty scary alert this evening:

WARNING: Pool 'NASMirror' has encountered an uncorrectable I/O failure and has been suspended.

Here is my current pool status:

Code:
root@ateamnas:~ # zpool status NASMirror
  pool: NASMirror
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
  scan: scrub repaired 0B in 03:39:53 with 0 errors on Fri Jun 30 23:05:27 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        NASMirror                                       ONLINE       0     0     0
          mirror-0                                      ONLINE      24     8     0
            gptid/af89a88d-128b-11ee-9eea-7824af32148b  ONLINE      13     8     0
            gptid/6f2ec921-1221-11ee-ad59-7824af32148b  ONLINE       9     8     0
          mirror-1                                      ONLINE      33    12     0
            gptid/99be896b-1614-11ee-b28a-7


I ran a zpool clear, but the pool is still suspended. I can traverse the directory tree via the shell, but otherwise can't read or write. I didn't want to try too much beyond that before checking in here, so I don't make things worse.

The system is currently powered on, and several other pools are still up and running (Including another pool on the same SAS controller).

Some additional error logs:

Code:
...
Jul 20 21:13:14 ateamnas mps0: Controller reported scsi ioc terminated tgt 3 SMID 1455 loginfo 31120100
Jul 20 21:13:14 ateamnas (da1:mps0:0:3:0): READ(10). CDB: 28 00 24 40 5c b0 00 01 00 00
Jul 20 21:13:14 ateamnas (da1:mps0:0:3:0): CAM status: CCB request completed with an error
Jul 20 21:13:14 ateamnas (da1:mps0:0:3:0): Error 5, Retries exhausted
Jul 20 21:13:14 ateamnas mps0: Controller reported scsi ioc terminated tgt 4 SMID 1044 loginfo 31170000
Jul 20 21:13:14 ateamnas (da2:mps0:0:4:0): READ(10). CDB: 28 00 24 40 51 38 00 01 00 00
Jul 20 21:13:14 ateamnas (da2:mps0:0:4:0): CAM status: CCB request completed with an error
Jul 20 21:13:14 ateamnas (da2:mps0:0:4:0): Retrying command, 3 more tries remain
Jul 20 21:13:15 ateamnas (da2:mps0:0:4:0): READ(10). CDB: 28 00 24 40 51 38 00 01 00 00
Jul 20 21:13:15 ateamnas (da2:mps0:0:4:0): CAM status: SCSI Status Error
Jul 20 21:13:15 ateamnas (da2:mps0:0:4:0): SCSI status: Check Condition
Jul 20 21:13:15 ateamnas (da2:mps0:0:4:0): SCSI sense: UNIT ATTENTION asc:29,1 (Power on occurred)
Jul 20 21:13:15 ateamnas (da2:mps0:0:4:0): Field Replaceable Unit: 22
Jul 20 21:13:15 ateamnas (da2:mps0:0:4:0): Retrying command (per sense data)
Jul 20 21:13:15 ateamnas (da2:mps0:0:4:0): READ(10). CDB: 28 00 24 40 51 38 00 01 00 00
Jul 20 21:13:15 ateamnas (da2:mps0:0:4:0): CAM status: SCSI Status Error
Jul 20 21:13:15 ateamnas (da2:mps0:0:4:0): SCSI status: Check Condition
Jul 20 21:13:15 ateamnas (da2:mps0:0:4:0): SCSI sense: NOT READY asc:4,11 (Logical unit not ready, notify (enable spinup) required)
Jul 20 21:13:15 ateamnas (da2:mps0:0:4:0): Field Replaceable Unit: 83
Jul 20 21:13:15 ateamnas (da2:mps0:0:4:0): Command Specific Info: 0
Jul 20 21:13:15 ateamnas (da2:mps0:0:4:0): Descriptor 0x80: f5 53
Jul 20 21:13:15 ateamnas (da2:mps0:0:4:0): Descriptor 0x81: 00 00 00 00 00 00
Jul 20 21:13:15 ateamnas (da2:mps0:0:4:0): Polling device for readiness
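
For anyone digging into similar logs: a quick way to see which devices the controller is complaining about is to tally the `da` names in those lines. This is just a sketch run against an abridged copy of the log above (the `/tmp` path is for illustration only):

```shell
# Tally error lines per device from a saved chunk of the console log.
# The heredoc reproduces (abridged) a few of the messages posted above.
cat > /tmp/mps_errors.log <<'EOF'
Jul 20 21:13:14 ateamnas (da1:mps0:0:3:0): CAM status: CCB request completed with an error
Jul 20 21:13:14 ateamnas (da2:mps0:0:4:0): CAM status: CCB request completed with an error
Jul 20 21:13:15 ateamnas (da2:mps0:0:4:0): CAM status: SCSI Status Error
EOF
# Pull out the daN identifier and count occurrences per device.
grep -o 'da[0-9]*:mps0' /tmp/mps_errors.log | sort | uniq -c
```

If every `daN` behind one controller shows up, that points away from any single disk and toward the shared cable/controller path.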


My current hardware:
- Motherboard: ASUS H87M-PLUS
- CPU: Intel Core i5-4750
- RAM quantity: 8GB (2x 4GB)
- Hard Drives:
  - Pool 'NASMirror': 4x HGST 8TB SAS drives, 2x mirrored VDEVs, connected via Supermicro controller
  - Pool 'PlexVol': 1x Patriot Burst 120GB SATA SSD, connected via main board
  - Pool 'SafeMirror': 1x Seagate 1TB 2.5" SATA and 1x Western Digital 1TB 2.5" SATA, connected via main board
  - Pool 'TempMirror': 2x HGST 8TB SAS drives, mirrored VDEV, connected via Supermicro controller (I had recently connected these drives and was running SMART tests; I just put them into a pool to transfer data over if I can read from NASMirror)
  - Pool 'Boot': 2x SanDisk Cruiser 16GB USB drives, mirrored VDEV
- Disk Controller: Supermicro 9207-8I
- Network: Built-in

This is just a hobby machine and nothing critical is on it that isn't also stored somewhere else, but it would be convenient not to lose the pool if possible (and also to figure out what went wrong since it seems to have impacted so many disks all at once).

Thanks everyone!

*Edit: Cleaned up code tags.*
 

albrecd

Update:

This morning I re-ran zpool clear on the pool and the suspension was lifted. I copied a few files off of the pool to test and had no issues, then shut the server down for now.

Updated pool status:


Code:
root@ateamnas:~ # zpool status -v NASMirror
  pool: NASMirror
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 2.73M in 00:00:01 with 21 errors on Fri Jul 21 07:17:44 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        NASMirror                                       ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/af89a88d-128b-11ee-9eea-7824af32148b  ONLINE       0     0    48
            gptid/6f2ec921-1221-11ee-ad59-7824af32148b  ONLINE       0     0    48
          mirror-1                                      ONLINE       0     0     0
            gptid/99be896b-1614-11ee-b28a-7824af32148b  ONLINE       0     0    78
            gptid/d04e2182-128b-11ee-9eea-7824af32148b  ONLINE       0     0    78

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x1>
        <metadata>:<0x293>


I'm planning to roll back to a recent snapshot - otherwise, is there anything I should be doing?

Also if anyone has thoughts on where I should focus trying to determine what happened here that would be much appreciated!
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
You have a bunch of checksum errors across all 4 disks (on top of the earlier read/write errors) - and if I understand correctly all 4 are connected to the SMC 9207-8i.
Given that one port of the LSI (SMC) card can support 4 direct-attached drives - are they all on the same cable? Try reseating / replacing that cable - or swap to the other cable and see where the errors come up after a scrub.
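
To make "see where the errors come up" easy to eyeball after each scrub, you can pull the CKSUM column out of `zpool status` per device. A sketch against a saved copy of the status output from earlier in the thread (the file path is illustrative):

```shell
# Print per-disk checksum error counts from a saved `zpool status` listing.
# The heredoc reproduces (abridged) the device lines posted above.
cat > /tmp/status.txt <<'EOF'
gptid/af89a88d-128b-11ee-9eea-7824af32148b  ONLINE  0  0  48
gptid/6f2ec921-1221-11ee-ad59-7824af32148b  ONLINE  0  0  48
gptid/99be896b-1614-11ee-b28a-7824af32148b  ONLINE  0  0  78
gptid/d04e2182-128b-11ee-9eea-7824af32148b  ONLINE  0  0  78
EOF
# Columns are NAME STATE READ WRITE CKSUM, so $5 is the checksum count.
awk '$5 > 0 { print $1, "CKSUM =", $5 }' /tmp/status.txt
```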
Also does the new pool show any errors at all?

To me this looks like a cable / controller issue - what sort of case is this in? LSI cards get very hot, and they are designed to sit in a case with proper airflow, so they can overheat in a consumer-grade case.

As for:
<metadata>:<0x0>
<metadata>:<0x1>
<metadata>:<0x293>
This is metadata - which you cannot fix or repair. If restoring a snapshot does not fix it, then the pool itself is borked and will need trashing and rebuilding - but you can still get the data off it before having to trash things.
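
If it does come to rolling back, the usual sequence is to list the snapshots and roll the dataset back to the newest one from before the corruption. A sketch with invented dataset/snapshot names (yours will differ) - and note that `zfs rollback -r` destroys any snapshots newer than the target:

```shell
# On the NAS this would start from real output, e.g.:
#   zfs list -t snapshot -o name NASMirror/data
# Invented snapshot names stand in for that output here.
cat > /tmp/snaps.txt <<'EOF'
NASMirror/data@auto-2023-07-14
NASMirror/data@auto-2023-07-18
NASMirror/data@auto-2023-07-20
EOF
# Pick the newest snapshot taken before the Jul 20 incident.
target=$(grep -v '07-20' /tmp/snaps.txt | sort | tail -n 1)
echo "$target"
# The actual rollback (destructive to newer snapshots) would then be:
#   zfs rollback -r "$target"
```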
 

albrecd

and if I understand correctly all 4 are connected to the SMC 9207-8i.

That is correct, and yes, all on the same cable. The other pool attached to the controller is on a different cable. It could be a cable issue, and the timing is coincidentally close to when I added the new drives, so I may have bumped something without realizing it.

Also does the new pool show any errors at all?

Not currently, though the drives have only been attached for ~36 hours and in a pool for ~12 hours (they were not in a pool at the time of the other pool's suspension).

what sort of case is this in?

This is a consumer-grade Rosewill 2U rack case (with some drives in an external bay), so it could be an airflow issue. I've upgraded the fans in the case, haven't noticed any temperature issues reported for the CPU or drives in the main case, and have alerting set up that should email me if temps get high - but I definitely can't rule it out.

So based on all of this, I think my next steps are to re-seat the cable, copy everything to the temp pool, roll back to a snapshot on the main pool, then run a scrub and see what errors persist. If there are still issues I'll kill the pool and start over (I wouldn't hate adding a bit more swap space anyway, to make a possible future failed-drive replacement easier). I'll also see if I can monitor the controller's temperature a bit more closely.

I also know that my RAM is pretty low for the amount of storage - any chance that not enough space for ARC could cause I/O issues like this? I suspect that's wishful thinking on my part, since it would be a fairly easy fix.
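
For reference, on Core (FreeBSD) the current ARC size is visible via `sysctl -n kstat.zfs.misc.arcstats.size`; by default the ARC can grow to most of RAM, so with 8GB it tops out somewhere around 5-6GB. The value below is a made-up sample just to show the bytes-to-GiB conversion, not real output from this machine:

```shell
# Sample ARC size in bytes (invented for illustration); on the NAS itself:
#   arc_bytes=$(sysctl -n kstat.zfs.misc.arcstats.size)
arc_bytes=5905580032
# Integer GiB currently used by the ARC.
echo $(( arc_bytes / 1073741824 ))
```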

Thanks very much for your response!
 

NugentS

Yeah - you are short of memory, but I would have thought that would cause performance issues rather than borking a pool. If this were SCALE I would say you were way short; on Core you are merely short - more would be much better, but not to the point of destroying data.

When you say you upgraded the fans - what do you mean?
Rack cases tend to use high-noise, high-flow / high-static-pressure fans to force the air through, and for good reason. An "upgrade" to quieter fans is very likely a downgrade in terms of cooling, even if they are quieter. A 2U case uses little fans that will scream; quieter fans may cause thermal issues.
 

albrecd

Okay, yeah, I figured the RAM was a long shot.

I changed to Noctua NF-R8s. That was ~5 years ago and I don't remember the exact model that was in the case but I don't think they were server grade. Either way it is certainly possible that my airflow just hasn't been adequate for a 2U case and I'm only now seeing the issue since the LSI controller was installed ~1 month ago.
 

NugentS

Do you have room to stick a fan on the LSI (SMC) card?
Tie Wraps, thermal epoxy / screws (whatever) just something to keep it cool
 

albrecd

Yes, I'll do that, thanks!
 

albrecd

Update:

I beefed up the airflow in the case (added a few fans externally to the back of the case for outflow, and added a 60mm fan internally pointed directly at the SAS controller's heat sink). It'll be hard to prove the negative if everything remains stable, but if I don't have a recurrence of the data corruption I'm pretty happy to assume this was the issue.

The main pool (NASMirror) has been online and stable (though with 'unhealthy' status) since coming back Friday morning, and I was able to copy everything to TempMirror as a precaution (and since I may need to rebuild NASMirror to deal with the reported permanent errors in the metadata).

That said, I rolled NASMirror back to a snapshot from before the corruption and re-ran zpool clear (which did not change the 'unhealthy' status reported for the pool), then ran a scrub, after which the pool is reported as healthy. Current status:

Code:
root@ateamnas:~ # zpool status -v NASMirror
  pool: NASMirror
 state: ONLINE
  scan: scrub repaired 0B in 03:52:27 with 0 errors on Sun Jul 23 14:57:56 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        NASMirror                                       ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/af89a88d-128b-11ee-9eea-7824af32148b  ONLINE       0     0     0
            gptid/6f2ec921-1221-11ee-ad59-7824af32148b  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/99be896b-1614-11ee-b28a-7824af32148b  ONLINE       0     0     0
            gptid/d04e2182-128b-11ee-9eea-7824af32148b  ONLINE       0     0     0

errors: No known data errors


I'm guessing the rollback was the fix rather than the scrub, since the scrub reports 0B repaired (there may just have been a delay in the status updating) - but either way the pool seems healthy again. Is there any reason I should still rebuild, or am I fairly safe to trust that everything is back to healthy and carry on with NASMirror as-is?

Thank you!
 

NugentS

No - if it's not reporting metadata errors, then don't worry about it.

Your pool looks good to me. See if it stays that way for a week or so before declaring it solved, though.
 

albrecd

That sounds good, and I'll wait a while before taking down the temporary pool.

Thanks very much for all of your help and advice @NugentS !
 