FreeNAS intermittently going unresponsive after adding new volume.

jpi

Dabbler
Joined
Apr 21, 2019
Messages
14
Hello,

My FreeNAS install has been running great up until recently. The most recent change I made was adding 6x 3TiB drives in bays 00-05. These are Western Digital Red 5400 RPM drives configured as a RAIDZ1 (slow, non-critical storage). Let's call this volume "media".

The 600GiB drives in bays 06-11 are three striped mirrors (I think this is equivalent to RAID 10). These drives are Hitachi 15k SAS drives. This volume is fast(ish) and considered critical, as it has a zvol on it that is presented over iSCSI to a VMware cluster. The disks in bays 12 and 13 operate standalone, and each has its own zvol that is also presented over iSCSI to VMware (non-critical). iSCSI runs on its own dedicated 2x 1Gbps NICs with isolated VLANs to the ESXi hosts. All of these had been working flawlessly up until recently. The internal USB-to-SATA device is an SSD that has the FreeNAS OS installed on it.

On the previously mentioned media volume, I have created a zvol with two SMB shares that are accessed by a Windows Plex server and also an Ubuntu Docker host that mounts the SMB shares locally and passes them into containers like Sonarr, Radarr, and SABnzbd.

That covers my setup and the change I made recently. Beginning about 2 days after adding the 6x disks that make up my media volume, FreeNAS has been intermittently going unresponsive.

1st incident:
Web UI: timed out
SSH: timed out
SMB shares: inaccessible
iSCSI LUNs: inaccessible (all VMware VMs down)
Resolution: selected the reboot option on the R510 console. The system rebooted successfully.

2nd incident:
Web UI: allowed login, but the page loaded only about 50% and nothing was clickable
SSH: authenticated but never gave a prompt
SMB shares: inaccessible
iSCSI LUNs: inaccessible (all VMware VMs down)
Resolution: selected the shell option on the R510 console. The shell would not accept input. Ended up hard resetting the R510.

3rd incident:
Web UI: timed out
SSH: timed out
SMB shares: inaccessible
iSCSI LUNs: inaccessible (all VMware VMs down)
Resolution: selected the reboot option on the R510 console. The system failed to reboot. The console showed multiple "sonewconn: pcb 0xfffff800afcdacb0: Listen queue overflow: 193 already in queue awaiting acceptance (80 occurrences)" messages. Ended up hard resetting the R510.

I have noticed lots of "CIFS VFS: no writable handles for inode" and "task unrar blocked for more than 120 seconds" messages on my Ubuntu Docker host. I started to think that maybe I was causing too much IO on the SMB share, so after the 2nd incident I stopped the SABnzbd container that downloads and unrars stuff on the share. However, even with that high-IO container stopped, FreeNAS still hung a few days later. I'm not sure where to go from here other than disabling everything that accesses the shares (Plex, the Sonarr container, and the Radarr container) to try to rule out some weird IO issue causing FreeNAS to freak out. Thanks for the help and for reading a rather long post.
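
The "Listen queue overflow" messages from the 3rd incident suggest that some listening socket had stopped accepting connections. A rough sketch for spotting which one, assuming the FreeBSD `netstat -Lan` layout of "Proto", "qlen/incqlen/maxqlen", "Local Address" columns. The function name and the "queue at or above its limit" threshold are my own illustrative choices, not anything FreeNAS ships:

```shell
#!/bin/sh
# Flag listening sockets whose accept queue is at or above its limit.
# Reads text on stdin in the layout assumed above:
#   Proto  qlen/incqlen/maxqlen  Local-Address
# flag_full_listen_queues is a hypothetical helper name.
flag_full_listen_queues() {
    awk '
        # Only look at rows whose second field is qlen/incqlen/maxqlen.
        $2 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ {
            split($2, q, "/")
            # Force numeric comparison of queue length against the limit.
            if (q[3] + 0 > 0 && q[1] + 0 >= q[3] + 0) {
                printf "%s %s qlen=%d maxqlen=%d\n", $1, $3, q[1], q[3]
            }
        }
    '
}
```

On the FreeNAS shell this would be fed with `netstat -Lan | flag_full_listen_queues`. The local address it prints should show which service's port (web UI, SSH, SMB, iSCSI) is backing up, assuming the shell is still responsive enough to run it.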

Code:
R510 Bay\Drive Layout

Bay 00 3TiB | Bay 03 3TiB | Bay 06 600GiB | Bay 09 600GiB
Bay 01 3TiB | Bay 04 3TiB | Bay 07 600GiB | Bay 10 600GiB
Bay 02 3TiB | Bay 05 3TiB | Bay 08 600GiB | Bay 11 600GiB

Internal 2.5" Bay 12 1.2TiB
Internal 2.5" Bay 13 800GiB
Internal USB to SATA: 250GB SSD


Chassis: Dell R510
Controller: Dell PERC H200 flashed to IT mode
CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Memory: 16GB
Build: FreeNAS-11.2-U2
 

jpi

Dabbler
Joined
Apr 21, 2019
Messages
14
Update: FreeNAS has gone unresponsive again. It's currently in that state. Console shows nothing other than the default selectable options. Open to suggestions, logs to look at etc. I'll probably attempt an option 10 reboot later tonight if I don't hear anything from the community (last time I tried option 10 reboot it ended up hanging as mentioned in "3rd incident" resulting in me having to hard reset the R510). Thanks again!
 

jpi

Dabbler
Joined
Apr 21, 2019
Messages
14
Update: Ended up trying to reboot via the console, which resulted in it hanging again, so I reset the R510 (again, really hate doing that). After reading the thread "FreeNAS Mini on 11.2 freezes / locks up", I've gone ahead and upgraded to 11.2-U3. We'll see what happens.
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
I assume you checked the SMART information from the drives and it was good?
 

jpi

Dabbler
Joined
Apr 21, 2019
Messages
14
@AlexGG The 6x newly added 3TiB drives in bays 00-05 were pulled from a working Synology setup (been working solid for ~2+ years). Given that, I didn't think it was necessary to run SMART tests on them. I'll kick those off now just to be sure. Good idea!! Will post back with results later this evening. Thanks.
 

jpi

Dabbler
Joined
Apr 21, 2019
Messages
14
Tests look good.

Code:
root@freenas[~]# for disk in 0 1 10 11 12 13; do smartctl -a /dev/da$disk | grep -e "# 1" -e overall; done
SMART overall-health self-assessment test result: PASSED
# 1  Extended offline    Completed without error       00%      1086         -
SMART overall-health self-assessment test result: PASSED
# 1  Extended offline    Completed without error       00%      1068         -
SMART overall-health self-assessment test result: PASSED
# 1  Extended offline    Completed without error       00%     26830         -
SMART overall-health self-assessment test result: PASSED
# 1  Extended offline    Completed without error       00%     31304         -
SMART overall-health self-assessment test result: PASSED
# 1  Extended offline    Completed without error       00%     36488         -
SMART overall-health self-assessment test result: PASSED
# 1  Extended offline    Completed without error       00%     11584         -
 

jpi

Dabbler
Joined
Apr 21, 2019
Messages
14
Right, the important stuff :). See below...

Code:
da0 s/n: v0
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   181   178   021    Pre-fail  Always       -       5916
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       47
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1102
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       47
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       46
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       27
194 Temperature_Celsius     0x0022   116   103   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

Code:
da1 s/n: 26
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   192   167   021    Pre-fail  Always       -       5400
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       43
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1086
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       41
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       4
194 Temperature_Celsius     0x0022   118   113   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

Code:
da10 s/n: tp
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   185   180   021    Pre-fail  Always       -       5750
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       37
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   064   064   000    Old_age   Always       -       26846
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       37
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       18
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       532
194 Temperature_Celsius     0x0022   119   107   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

Code:
da11 s/n: 7a
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   178   174   021    Pre-fail  Always       -       6091
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       39
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   058   058   000    Old_age   Always       -       31320
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       39
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       14
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       728
194 Temperature_Celsius     0x0022   117   112   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

Code:
da12 s/n: 7p
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   184   178   021    Pre-fail  Always       -       5791
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       48
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   050   050   000    Old_age   Always       -       36504
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       45
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       14
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       924
194 Temperature_Celsius     0x0022   119   113   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

Code:
da13 s/n: 02
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   183   177   021    Pre-fail  Always       -       5841
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       28
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11600
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       28
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       364
194 Temperature_Celsius     0x0022   119   113   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
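
Rather than eyeballing each table, a small filter can pull out only the rows that matter. This is a minimal sketch, assuming the 10-column `smartctl -A` layout shown above with the raw value in the last column. The watch list (IDs 5, 197, 198, 199) and the function name are my assumptions, so adjust them to taste:

```shell
#!/bin/sh
# Print watched SMART attributes whose raw value is non-zero.
# Reads `smartctl -A` style text on stdin (10 columns, raw value in column 10).
# flag_smart_attrs and the watch list below are illustrative assumptions.
flag_smart_attrs() {
    awk '
        BEGIN {
            # Reallocated, pending, and uncorrectable sectors, plus CRC errors.
            watch[5] = 1; watch[197] = 1; watch[198] = 1; watch[199] = 1
        }
        # Attribute rows start with a numeric ID; the header row does not.
        $1 ~ /^[0-9]+$/ && ($1 in watch) && ($10 + 0 != 0) {
            printf "%s %s raw=%s\n", $1, $2, $10
        }
    '
}
```

Usage would be something like `smartctl -A /dev/da0 | flag_smart_attrs`. No output means none of the watched attributes has a non-zero raw value.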
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
My guess is that memory and the new slow disk pool are contributing factors.

Typically, we recommend a minimum of 32-64GB of RAM for iSCSI use. You have 16GB, and in addition to FreeNAS you're running at least 4 VMs.

Striped mirrors will give you the best performance for iSCSI. RAIDZ1 might be fine for SSDs.

I'd probably start by adding more RAM. Run the zilstat utility to determine if the system will benefit from a SLOG. Read the ZFS primer for more information.
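
As a rough first check on the memory side, it can help to see how much of the installed RAM the ZFS ARC is holding. This is only a sketch: the helper name is made up, and it assumes the FreeBSD sysctls `hw.physmem` and `kstat.zfs.misc.arcstats.size` both report bytes. A high percentage is normal for ZFS and does not prove memory pressure on its own.

```shell
#!/bin/sh
# Report ARC size as a whole-number percentage of physical memory.
# Arguments: physical memory in bytes, ARC size in bytes.
# arc_percent is a hypothetical helper name.
arc_percent() {
    physmem=$1
    arcsize=$2
    # Guard against division by zero or a bogus reading.
    if [ "$physmem" -le 0 ]; then
        echo "physmem must be positive" >&2
        return 1
    fi
    # Integer arithmetic, rounded down.
    echo $(( arcsize * 100 / physmem ))
}
```

On the FreeNAS shell this could be called as `arc_percent "$(sysctl -n hw.physmem)" "$(sysctl -n kstat.zfs.misc.arcstats.size)"`.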
 

jpi

Dabbler
Joined
Apr 21, 2019
Messages
14
@gpsguy Thanks for the reply. I'll see about getting more memory. Do note, though, that the VMs I mentioned before are running in VMware on two dedicated ESXi hosts (clustered). Yes, striped mirrors are backing the storage presented to VMware. Reading up on zilstat. Thanks.

Note: I only have one CPU in the R510, so 64GB will have to suffice. Now... to go salvage stuff from old systems at work!
 