Random Freezes

Status
Not open for further replies.

kyleman7

Dabbler
Joined
Jul 17, 2013
Messages
14
I've been having an issue on my FreeNAS system where it randomly locks up and becomes inaccessible. There are generally no error messages on the screen, though sometimes there is a recent series of messages similar to this:

> GEOM_ELI: Device da1p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 256
> GEOM_ELI: Crypto: hardware
> GEOM_ELI: Device da2p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 256
> GEOM_ELI: Crypto: hardware
> GEOM_ELI: Device da3p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 256
> GEOM_ELI: Crypto: hardware
> GEOM_ELI: Device da4p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 256
> GEOM_ELI: Crypto: hardware
> GEOM_ELI: Device da5p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 256
> GEOM_ELI: Crypto: hardware
> GEOM_ELI: Device da6p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 256
> GEOM_ELI: Crypto: hardware
> GEOM_ELI: Device da7p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 256
> GEOM_ELI: Crypto: hardware
> GEOM_ELI: Device da8p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 256
> GEOM_ELI: Crypto: hardware
> GEOM_ELI: Device da0p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 256
> GEOM_ELI: Crypto: hardware
> GEOM_ELI: Device da9p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 256
> GEOM_ELI: Crypto: hardware
> GEOM_ELI: Device da10p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 256
> GEOM_ELI: Crypto: hardware
> GEOM_ELI: Device da11p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 256
> GEOM_ELI: Crypto: hardware

However, I don't think that is related. There seems to be no rhyme or reason to when it locks up.
I have attached the debug dump, but I don't see anything that stands out to me. Any help as to where to look would be great.
Here is my hardware:
Category     Item                               Part Number        Quantity
Motherboard  SUPERMICRO X9SRi-3F                MBD-X9SRI-3F-O     1x
Processor    Intel Xeon E5-1620 / 3.6 GHz       CM8062101038606    1x
Memory       Kingston ValueRAM memory - 32 GB   KVR16E11K4/32      2x
HBA/RAID     LSI SAS 9207-4i4e                  LSI00303           1x
Hard Drive   WD RE SAS WD4001FYYG 4 TB, SAS-2   WD4001FYYG         12x
Chassis      Supermicro SC846                   CSE-846BE16-R920B  1x
 

Attachments

  • ixdiagnose.rar
    343.5 KB · Views: 248

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Can you post the output of the following:

smartctl -a -q noserial /dev/da0
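
To collect the same report from every disk rather than only da0, the command can be looped over the device names. This is a minimal sketch, assuming the twelve disks are da0 through da11 (as the GEOM_ELI messages in the first post suggest) and that smartctl is on the PATH. The helper names are made up for illustration.

```python
import subprocess


def smartctl_commands(disk_count=12):
    """Build one 'smartctl -a -q noserial' argument list per disk (da0, da1, ...)."""
    return [
        ["smartctl", "-a", "-q", "noserial", "/dev/da%d" % i]
        for i in range(disk_count)
    ]


def collect_reports(disk_count=12):
    """Run every command and return {device path: report text}.

    Meant to be run as root on the NAS itself. The exit status is ignored
    on purpose, because smartctl uses it as a set of status bit flags.
    """
    reports = {}
    for argv in smartctl_commands(disk_count):
        proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                                universal_newlines=True)
        output, _ = proc.communicate()
        reports[argv[-1]] = output
    return reports
```

Calling `collect_reports()` returns a dictionary keyed by device path, which makes it easier to compare the twelve reports side by side.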
 

kyleman7

Dabbler
Joined
Jul 17, 2013
Messages
14
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: WD
Product: WD4001FYYG-01SL3
Revision: VR02
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Logical block size: 512 bytes
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device type: disk
Transport protocol: SAS
Local Time is: Wed May 21 12:44:26 2014 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature: 35 C
Drive Trip Temperature: 69 C

Manufactured in week 01 of year 2012
Specified cycle count over device lifetime: 1048576
Accumulated start-stop cycles: 31
Specified load-unload count over device lifetime: 1114112
Accumulated load-unload cycles: 29
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      78296        6        14     78302          6      16829.323           0
write:    646490        3         4    646493          3      14840.213           0
verify:        0        1         1         1          1          0.000           0

Non-medium error count: 980

SMART Self-test log
Num  Test              Status     segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                  number   (hours)
# 1  Background short  Completed        -      7485              - [-   -    -]
# 2  Background short  Completed        -      7484              - [-   -    -]
# 3  Background short  Completed        -      7483              - [-   -    -]
# 4  Background short  Completed        -      7482              - [-   -    -]
# 5  Background short  Completed        -      7481              - [-   -    -]
# 6  Background short  Completed        -      7480              - [-   -    -]
# 7  Background short  Completed        -      7479              - [-   -    -]
# 8  Background short  Completed        -      7478              - [-   -    -]
# 9  Background short  Completed        -      7477              - [-   -    -]
#10  Background short  Completed        -      7476              - [-   -    -]
#11  Background short  Completed        -      7475              - [-   -    -]
#12  Background short  Completed        -      7474              - [-   -    -]
#13  Background short  Completed        -      7473              - [-   -    -]
#14  Background short  Completed        -      7472              - [-   -    -]
#15  Background short  Completed        -      7471              - [-   -    -]
#16  Background short  Completed        -      7470              - [-   -    -]
#17  Background short  Completed        -      7469              - [-   -    -]
#18  Background short  Completed        -      7468              - [-   -    -]
#19  Background short  Completed        -      7467              - [-   -    -]
#20  Background short  Completed        -      7466              - [-   -    -]
Long (extended) Self Test duration: 31120 seconds [518.7 minutes]
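
For anyone comparing this output across several disks, the error counter rows can be pulled out of the report with a few lines of Python. This is only a sketch: the field names are labels chosen here to mirror the column headings, not anything smartctl itself defines, and the sample rows are copied from the report above.

```python
def parse_error_counters(report):
    """Return {operation: {field: value}} for the read/write/verify rows."""
    fields = ("ecc_fast", "ecc_delayed", "rereads_rewrites",
              "total_corrected", "algorithm_invocations",
              "gigabytes_processed", "total_uncorrected")
    counters = {}
    for line in report.splitlines():
        parts = line.split()
        # A counter row is the operation name followed by seven numbers.
        if len(parts) == 8 and parts[0] in ("read:", "write:", "verify:"):
            values = [float(p) if "." in p else int(p) for p in parts[1:]]
            counters[parts[0].rstrip(":")] = dict(zip(fields, values))
    return counters


sample = """\
read: 78296 6 14 78302 6 16829.323 0
write: 646490 3 4 646493 3 14840.213 0
verify: 0 1 1 1 1 0.000 0
"""

counters = parse_error_counters(sample)
for op in ("read", "write", "verify"):
    print(op, "uncorrected errors:", counters[op]["total_uncorrected"])
```

For this disk all three operations report zero uncorrected errors. The parser only extracts the numbers; it makes no judgement about drive health.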
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yeah... your RAID controller is masking the status of your disks. There's no raw SMART data being provided. It's possible your problems are because of a failing disk, but since we can't verify whether it is or isn't, you are kind of in a world of hurt.

You may be able to reflash that controller to IT mode. I'm not sure if you can or not, you'll have to do some googling to find out. I can't keep track of the list of controllers that can be flashed to IT anymore.

Background short tests every hour are a bit excessive too. The log only holds 20 entries, so a failed test will be pushed off the list in less than a day (which isn't very useful for troubleshooting). I have a guide on testing and scrubs if you want to read up on the kinds of schedules I recommend.
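
To put a number on that: with a 20-entry log and one short test per hour, a result stays visible for roughly 20 hours before it is overwritten. Here is a rough sketch of the arithmetic. The function name and the daily interval are only illustrative, not a recommended schedule.

```python
def selftest_log_retention_hours(log_entries=20, hours_between_tests=1):
    """Hours until the oldest self-test result is pushed out of the log."""
    return log_entries * hours_between_tests


# Hourly short tests, as in the log posted above.
print(selftest_log_retention_hours())

# The same 20-entry log with one test per day, for comparison.
print(selftest_log_retention_hours(20, 24))
```

The second call gives 480 hours (20 days), which shows how much longer a failed result would stay in the log with less frequent testing.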
 

kyleman7

Dabbler
Joined
Jul 17, 2013
Messages
14
I'm running an LSI SAS 9207-4i4e Host Bus Adapter. Shouldn't that be in IT mode by default?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm not sure what the default mode is, but it should be in JBOD mode, or reflashed to IT mode if it's supported.
 

kyleman7

Dabbler
Joined
Jul 17, 2013
Messages
14
Alright, it's been a while since I had any lockups, but when the monthly scrub came around, it locked up. I think this is related to scrubbing. I got my nightly e-mail announcing that scrubbing had started; then, when I went to check the server, it was frozen. So I rebooted it, and now the scrub is humming along just fine. It's been going at ~212M/s since the reboot.

Also, I checked the firmware of the HBA and it is indeed in IT mode, as I suspected.

I'm attaching a couple screenshots
Screenshot 2014-06-15 08.38.35.png Screenshot 2014-06-15 08.40.11.png Screenshot 2014-06-16 10.13.18.png Screenshot 2014-06-16 10.18.44.png
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
If you are running FreeNAS 9.2.1.x you should be using firmware v16. You have v15.

You need to reflash it.


 

kyleman7

Dabbler
Joined
Jul 17, 2013
Messages
14
I am running FreeNAS-9.2.0-RELEASE-x64 (ab098f4). Is v15 the correct firmware for that?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194

kyleman7

Dabbler
Joined
Jul 17, 2013
Messages
14
I found this in one of the dmesg files from the debug dump, at line 94:
May 11 13:34:28 GISV kernel: mps0: <LSI SAS2308> port 0xd000-0xd0ff mem 0xfba40000-0xfba4ffff,0xfba00000-0xfba3ffff irq 42 at device 0.0 on pci7
May 11 13:34:28 GISV kernel: mps0: Firmware: 15.00.00.00, Driver: 14.00.00.02-fbsd
May 11 13:34:28 GISV kernel: mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>

Does the driver version number directly relate to the required firmware version?
 

Attachments

  • dmesg.yesterday.txt
    18.6 KB · Views: 295

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
> I found this in one of the dmesg files from the debug dump, at line 94
>
> Does the driver version number directly relate to the required firmware version?

The driver should match the firmware.
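
A quick way to check this against the dmesg output is to pull both version strings out of the mps0 line and compare them. The sketch below assumes that "match" means the leading (major) version number is the same, which is an interpretation rather than something stated in this thread. The helper name is also made up.

```python
import re

# Captures the leading number of the firmware and driver version strings.
MPS_VERSION_RE = re.compile(
    r"Firmware: (\d+)(?:\.\d+)*, Driver: (\d+)(?:\.\d+)*"
)


def mps_major_versions(dmesg_line):
    """Return (firmware_major, driver_major), or None if the line has no versions."""
    match = MPS_VERSION_RE.search(dmesg_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = ("May 11 13:34:28 GISV kernel: mps0: "
        "Firmware: 15.00.00.00, Driver: 14.00.00.02-fbsd")
firmware_major, driver_major = mps_major_versions(line)
print("firmware", firmware_major, "driver", driver_major,
      "match:", firmware_major == driver_major)
```

For the line posted above this reports firmware 15 against driver 14, so under that assumption the two do not match.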
 