Strange SATA-DOM issue

Status
Not open for further replies.
Joined
Dec 2, 2015
Messages
730
I had a strange event with my backup FreeNAS server last night, and I'm looking for possible explanations and recommendations to troubleshoot if it happens again.

The system specs are:

Motherboard: Supermicro X10SL7-F
CPU: G3258
RAM: 16 GB ECC Crucial CT2KIT102472BD160B/CT2CP102472BD160B
Boot Device: Supermicro 16GB SATA-DOM at /dev/ada3
Hard drives: 8 x Western Digital Red 4 TB in RAIDZ2
Encryption: enabled (server will be moved off site, and it is possible that the server could be stolen)
Case: Fractal Design Node 804
OS: FreeNAS-11.0-U1 (aa82cc58d)

This machine's sole purpose is to receive replications overnight from the main server, so that it holds a complete copy of the data on the main server.

This morning's daily kernel log email showed the following:

Code:
ahcich5: Timeout on slot 11 port 0
ahcich5: is 00000000 cs 00000800 ss 00000000 rs 00000800 tfd c0 serr 00000000 cmd 0004cb17
(ada3:ahcich5:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada3:ahcich5:0:0:0): CAM status: Command timeout
(ada3:ahcich5:0:0:0): Retrying command
ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich5: Timeout on slot 12 port 0
ahcich5: is 00000000 cs 00001000 ss 00000000 rs 00001000 tfd 80 serr 00000000 cmd 0004cc17
(aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich5:0:0:0): CAM status: Command timeout
(aprobe0:ahcich5:0:0:0): Retrying command
ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich5: Timeout on slot 13 port 0
ahcich5: is 00000000 cs 00002000 ss 00000000 rs 00002000 tfd 80 serr 00000000 cmd 0004cd17
(aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich5:0:0:0): CAM status: Command timeout
(aprobe0:ahcich5:0:0:0): Error 5, Retries exhausted
ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)


Note that ada3 is the boot SSD.
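
For anyone triaging a similar log, here is a minimal sketch in Python that tallies the timeouts, failed resets, and timed-out commands per AHCI channel. It is only an illustration: the function name `summarize` and the regular expressions are assumptions written against the excerpt above (abbreviated here to its distinct line types), not part of any FreeNAS tool, and other kernel messages may need different patterns.

```python
import re
from collections import Counter

# Abbreviated copy of the kernel log excerpt quoted in this post.
SAMPLE_LOG = """\
ahcich5: Timeout on slot 11 port 0
(ada3:ahcich5:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada3:ahcich5:0:0:0): CAM status: Command timeout
ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich5: Timeout on slot 12 port 0
(aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich5:0:0:0): CAM status: Command timeout
ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
"""

# Hypothetical patterns, written only against the line shapes shown above.
TIMEOUT_RE = re.compile(r"^(ahcich\d+): Timeout on slot (\d+) port (\d+)")
RESET_RE = re.compile(r"^(ahcich\d+): AHCI reset: device not ready")
COMMAND_RE = re.compile(r"^\((\w+):(ahcich\d+):[\d:]+\): ([A-Z0-9_]+)\. ACB:")


def summarize(log_text):
    """Count timeouts, failed resets, and timed-out commands per channel."""
    timeouts = Counter()
    failed_resets = Counter()
    commands = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if m := TIMEOUT_RE.match(line):
            timeouts[m.group(1)] += 1
        elif m := RESET_RE.match(line):
            failed_resets[m.group(1)] += 1
        elif m := COMMAND_RE.match(line):
            # Key on (channel, ATA command name), e.g. ("ahcich5", "FLUSHCACHE48").
            commands[(m.group(2), m.group(3))] += 1
    return timeouts, failed_resets, commands


if __name__ == "__main__":
    timeouts, failed_resets, commands = summarize(SAMPLE_LOG)
    print("timeouts:", dict(timeouts))
    print("failed resets:", dict(failed_resets))
    print("commands:", dict(commands))
```

If every entry lands on a single channel, as it does here with ahcich5, that points at one device, cable, or port rather than at the controller as a whole.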

I tried logging in to the GUI, but the browser timed out before the GUI appeared, and ssh also failed. The server still responded to ping. I tried a graceful shutdown through IPMI, but that timed out as well, so I did a hard shutdown and restart through IPMI. The server came up OK, with no apparent problems. I checked /var/log/messages, and it showed nothing between 02:57:24 and the reboot, presumably because the system dataset is on the SSD. It looks like the server kept receiving the replications during that time: zfs list -t snapshot shows that all the expected hourly snapshots were received even after the SSD went offline.
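
The snapshot check can be made less manual. The sketch below assumes the output of `zfs list -H -p -t snapshot -o name,creation` (tab-separated, creation time in epoch seconds) has been captured as text, and it lists the snapshots created after a given cutoff, such as the time of the last log entry. The function name `snapshots_after`, the dataset names, and the timestamps in the sample are all hypothetical placeholders, not values from this server.

```python
def snapshots_after(zfs_output, cutoff_epoch):
    """Return (name, creation_epoch) pairs for snapshots newer than cutoff_epoch.

    zfs_output is expected to be the text produced by:
        zfs list -H -p -t snapshot -o name,creation
    """
    result = []
    for line in zfs_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, creation = line.split("\t")
        created = int(creation)
        if created > cutoff_epoch:
            result.append((name, created))
    return result


# Hypothetical sample data, for illustration only.
SAMPLE = (
    "tank/data@auto-a\t1000\n"
    "tank/data@auto-b\t2000\n"
    "tank/data@auto-c\t3000\n"
)

if __name__ == "__main__":
    for name, created in snapshots_after(SAMPLE, 1500):
        print(name, created)
```

An empty result for a cutoff at the time of the failure would suggest that replication stopped along with logging; a non-empty one matches what I saw here.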

It seems that the SATA-DOM boot SSD became unavailable for some reason, but came back on restart. What could cause that to happen? Is there any troubleshooting I can do now, or if it occurs again? Is there anything I could do to reduce the likelihood that this happens again? Would reseating the SATA-DOM in the SATA port potentially be useful?

Thanks for any advice.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Would reseating the SATA-DOM in the SATA port potentially be useful?
Possibly. Beyond that, you'll have to monitor it and see how it behaves.
 