NAS transfer and access became ultra slow

Cupcake · Jun 2, 2019

Hi guys
I have been running FreeNAS on this server for a couple of years now. The box is:

6x4TB in RAIDZ2 config
16 GB RAM
Pool is 72% full
I run scrubs and SMART checks frequently
Originally I had transfer rates of ~ 80 MB/s
Transfer rates now are ~ 50 KB/s
web interface is still responsive as always
Listing the directories in the pool takes forever. That's regardless of whether I access it via a windows network map or directly via SSH and run ls

I don't remember making any changes around the time it became so slow. Transfer rates are now really slow, as in a few KB/s. After copying for a few seconds, the rate drops to 0 KB/s before it spikes again for a few more seconds. Copying 15 MB takes more than a minute...

The only out of the ordinary thing I see is "CRITICAL: June 2, 2019, 1:08 p.m. - Device: /dev/ada1, 19 Currently unreadable (pending) sectors" in the web interface. Haven't had time to look into this yet, but I would expect the system to still run fine even with problems on a single drive. I did have to replace and resilver a drive before and even then I could still use the NAS better than now.

Any ideas what I should look into?

EDIT: This is my system
Mainboard: Asus M5A78LM/USB-3
CPU: AMD Athlon II X2 270
RAM: Kingston 16GB (Kit 2x8GB) DDR3-1333 ECC CL9 w/TS
Drives: 6x WD 4TB red

joeschmuck · Jun 2, 2019

skaarj441 said:
The only out of the ordinary thing I see is "CRITICAL: June 2, 2019, 1:08 p.m. - Device: /dev/ada1, 19 Currently unreadable (pending) sectors" in the web interface. Haven't had time to look into this yet, but I would expect the system to still run fine even with problems on a single drive.

Well, this could be the cause. I'm not saying it is, but it's a contributing factor. The other obvious things that make a network slow are a normal Cat5/6 cable magically going bad (yes, it does happen) or a change in your network hardware (a new switch, for example). Other things that make transfers slow are file fragmentation, a ton of small files, lots of reasons. Heck, if you exceeded 80% full even momentarily then that is a different problem.

My advice: Investigate your drive failures; you may have more than one. Run a SMART Long test on all of your drives. If you are not sure how to read the results then look at my tagline for a link, or just post the results of each drive and we will tell you what is good/bad.

Lastly, please provide your system specs per the forum rules so we can give you more accurate advice. It really does save us a lot of guessing and helps solve the problem sooner.

Good Luck!

HoneyBadger · Jun 2, 2019

skaarj441 said:
but I would expect the system to still run fine even with problems on a single drive

Unfortunately that's a bit of a fallacious assumption; if you have a drive in your vdev that's hanging or stalling in response to commands then it will impact things pretty severely.

You'd likely get faster results by failing out the unhealthy drive; either way, a replacement is likely required. Post SMART results as suggested by @joeschmuck

Cupcake · Jun 2, 2019

Thanks for the hints. I updated my initial post with the system specs. SMART tests are running periodically, but I have not yet had the time to figure out what the pending sectors error actually means :/

What I tried out of curiosity is restoring the factory defaults, and then putting the drive offline and online again. Now the LAN connection is maxed out again...

Exceeding the 80% mark is a state of the ZFS volume, not the FreeNAS configuration, right? So if that were the case, resetting the config would not have helped, correct?

joeschmuck · Jun 3, 2019

skaarj441 said:
Exceeding the 80% mark is a state of the ZFS volume, not the FreeNAS configuration, right? So if that were the case, resetting the config would not have helped, correct?

Correct on both; however, rebooting would have reverted the system back to the non-optimized writing.

Did you read these two links: https://www.ixsystems.com/community...bleshooting-guide-all-versions-of-freenas.17/
and https://en.wikipedia.org/wiki/S.M.A.R.T. ?

Pending Sector Errors means there was a write/read error for a specific sector or multiple sectors. A single sector is not bad; however, a grouping of sectors is bad. Sometimes a hard drive will remove the sector from use (remap it), but sometimes the damage leads to intermittent good reads/writes where it seems to repair itself. Read my Hard Drive Troubleshooting Guide; it will provide you the information you need. After you have run a SMART Long Test (do not think the Short Tests are good enough; they are not on their own), post the results of the tests for each drive. We will interpret the results for you and provide you some sound advice. I will not tell you to replace your hard drive(s) unless you have indication of a hardware failure. Some failures can be attributed to a SATA cable as well, so post the SMART Long test results for each drive. Save yourself the aggravation of a complete pool failure.
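
For anyone who would rather script the long tests than start them one by one, here is a minimal Python sketch. The smartctl options `-t long` (start an extended self-test) and `-a` (print all SMART information) are standard smartmontools options. The device list is an assumption, since only /dev/ada1 is named in this thread, and the helper function names are hypothetical.

```python
import subprocess

# Assumed device names: only /dev/ada1 appears in this thread.
# Adjust the list to match the drives in your own system.
DEVICES = [f"/dev/ada{i}" for i in range(6)]


def long_test_command(device):
    """Command that starts a SMART extended (long) self-test on one drive."""
    return ["smartctl", "-t", "long", device]


def report_command(device):
    """Command that prints all SMART information, including the self-test log."""
    return ["smartctl", "-a", device]


def start_long_tests(devices, dry_run=True):
    """Start a long test on every drive; with dry_run=True only return the commands."""
    commands = [long_test_command(d) for d in devices]
    if not dry_run:
        for cmd in commands:
            # Needs root privileges and smartmontools installed.
            subprocess.run(cmd, check=False)
    return commands


if __name__ == "__main__":
    # Dry run: just show what would be executed.
    for cmd in start_long_tests(DEVICES):
        print(" ".join(cmd))
```

The test itself runs in the background on the drive, so the report command only shows the result once the test has finished.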

Cupcake · Jun 10, 2019

joeschmuck said:
so post the SMART Long test results for each drive. Save yourself the aggravation of a complete pool failure.

Thanks a lot for your help. I finally got the output.

Drive1 (The problematic drive with pending sectors)

Code:

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E0168108
LU WWN Device Id: 5 0014ee 2b3bf8a80
Firmware Version: 80.00A80
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Jun 10 11:53:34 2019 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (53280) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 532) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1299
  3 Spin_Up_Time            0x0027   186   176   021    Pre-fail  Always       -       7691
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       502
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   048   048   000    Old_age   Always       -       38424
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       275
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       68
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       3203754
194 Temperature_Celsius     0x0022   120   109   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       19
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     38381         -
# 2  Short offline       Completed: read failure       90%     38343         2331490408
# 3  Short offline       Completed without error       00%     38341         -
# 4  Short offline       Completed without error       00%     38339         -
# 5  Short offline       Completed without error       00%     38335         -
# 6  Short offline       Completed without error       00%     38333         -
# 7  Short offline       Completed without error       00%     38331         -
# 8  Short offline       Completed without error       00%     38329         -
# 9  Short offline       Completed without error       00%     38327         -
#10  Short offline       Completed without error       00%     38325         -
#11  Short offline       Completed without error       00%     38323         -
#12  Short offline       Completed without error       00%     38322         -
#13  Short offline       Completed without error       00%     38322         -
#14  Short offline       Completed without error       00%     38320         -
#15  Short offline       Completed without error       00%     38318         -
#16  Short offline       Completed without error       00%     38317         -
#17  Short offline       Completed without error       00%     38315         -
#18  Short offline       Completed without error       00%     38313         -
#19  Short offline       Completed without error       00%     38311         -
#20  Short offline       Completed without error       00%     38310         -
#21  Short offline       Completed without error       00%     38308         -
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 1

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

"1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 1" sounds good to me :D
I ordered a replacement drive anyway. Nice to have.
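
To make a table like the one above quicker to scan, here is a small Python sketch that picks out counters with a non-zero raw value. The choice of watched IDs is an assumption for illustration, not an official list, and the parser assumes the plain ten-column layout that smartctl printed here.

```python
# A few rows copied from the smartctl attribute table above.
TABLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       19
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

# Assumed watch list: counters where a non-zero raw value deserves a closer look.
WATCHED_IDS = {5, 196, 197, 198, 199}


def parse_attributes(text):
    """Map attribute ID -> (name, raw value) for every well-formed table row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Columns: ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[int(fields[0])] = (fields[1], int(fields[9]))
    return attrs


attrs = parse_attributes(TABLE)
nonzero = {i: attrs[i] for i in sorted(WATCHED_IDS) if i in attrs and attrs[i][1] != 0}

for attr_id, (name, raw) in nonzero.items():
    print(f"ID {attr_id} {name}: raw value {raw}")
```

On this drive the only watched counter that is not zero is ID 197, matching the alert from the web interface.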

tfran1990 · Jun 10, 2019

What would cause the short test to fail with an error but the long test to pass?
Seems odd to me.

Code:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     38381         -
# 2  Short offline       Completed: read failure       90%     38343         2331490408
# 3  Short offline       Completed without error       00%     38341         -
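
This does not answer the question, but lining up the timeline can help. The Python sketch below parses the three log rows quoted above and reports how many power-on hours passed between the failed short test and the clean extended test. The regular expression is an assumption about the column layout, based only on the rows shown in this thread.

```python
import re

# Excerpt of the self-test log posted above (smartctl output).
LOG = """\
# 1  Extended offline    Completed without error       00%     38381         -
# 2  Short offline       Completed: read failure       90%     38343         2331490408
# 3  Short offline       Completed without error       00%     38341         -
"""

# Columns: number, test description, status, remaining %, lifetime hours, first error LBA.
ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftest_log(text):
    """Turn each log row into a dict; rows that do not match are skipped."""
    entries = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            num, desc, status, remaining, hours, lba = m.groups()
            entries.append({
                "num": int(num),
                "test": desc,
                "status": status,
                "remaining_pct": int(remaining),
                "hours": int(hours),
                "lba": None if lba == "-" else int(lba),
            })
    return entries


entries = parse_selftest_log(LOG)
failed = [e for e in entries if "failure" in e["status"]]
newest_ok_extended = max(
    (e for e in entries
     if e["test"].startswith("Extended") and e["status"] == "Completed without error"),
    key=lambda e: e["hours"],
)

for e in failed:
    gap = newest_ok_extended["hours"] - e["hours"]
    print(f"Test #{e['num']} failed at hour {e['hours']}, LBA {e['lba']}; "
          f"a clean extended test ran {gap} power-on hours later.")
```

For the rows above, the gap between the failed short test and the clean extended test is 38 power-on hours.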

HoneyBadger · Jun 10, 2019

skaarj441 said:
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 3203754

3.2 million load cycles? I thought the WD Red's weren't supposed to park that often.

tfran1990 · Jun 10, 2019

HoneyBadger said:
3.2 million load cycles? I thought the WD Red's weren't supposed to park that often.

Could that be because of an aggressive power save mode?

joeschmuck · Jun 11, 2019

Code:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1299
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   048   048   000    Old_age   Always       -       38424
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       3203754
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       19
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     38381         -
# 2  Short offline       Completed: read failure       90%     38343         2331490408

So looking at this with my glasses on I see a few things but this lines up with what others have said.
ID 1 = 1299. If you run another SMART Short Test and the value is the same or slowly incrementing then this is a failure indicator for this drive. If the value changes to crazy values then this is an ignore item. Keep this in mind for any other drives you have.

ID 5 = 0, which is a good thing. You haven't had a failure that required the drive to remap a sector.

ID 9 = 38424 = 4.38 years of power on hours. That is a long time. It is likely about time the drive starts to fail.

ID 193 = 3203754 and that is a high load count and WD only warrants these drives for 600,000 counts (this is double what the original warranty was). My advice is to apply the 5 minute timer setting to reduce the load count for your other WD drives. It's a simple procedure and if you always leave your drives running, just disable the timer like many of us do. The heads will unload on power failure or shutdown.

ID 196 = 0, same as ID 5 above.

ID 197 = 19, not terrible but an indication of possible pending doom, along with the SMART Test Failure.

ID 198/199 = 0 and are fine.

ID 200 = 1 and this can also be a failing indication on many drives, yours is one of them.

It is odd, as others have stated, that you failed a SMART Short test but passed a SMART Extended/Long test, but your drive is having intermittent write/read failures. Count your blessings that your drive is having a "nice" failure and didn't just crap out on you like they do for many folks. When you replace the failed drive, remember to redo the SMART Short and Long tests in FreeNAS; normally the replaced drive will drop off the list of drives to be tested. This is a problem with FreeNAS, but it's a well-known issue and I'm not sure it can easily be fixed, though it sure can be an annoyance. With that said, after your new drive is installed I'd just check the SMART data to ensure it is actually running a test on each drive you have.

Sorry I got a bit wordy, just trying to point out everything I see.
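
The arithmetic behind ID 9 and ID 193 can be checked with a few lines of Python. This is only a sketch: the raw values are copied from the SMART table above, the 600,000 figure is the one quoted in this post and is not verified here, and a year is taken as 365 days.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap days

# Raw values copied from the SMART table above.
power_on_hours = 38424        # ID 9
load_cycles = 3203754         # ID 193
rated_load_cycles = 600_000   # figure quoted in this post, not verified here

years = power_on_hours / HOURS_PER_YEAR
cycles_per_hour = load_cycles / power_on_hours
seconds_per_cycle = 3600 / cycles_per_hour
times_rated = load_cycles / rated_load_cycles

print(f"Power-on time: {years:.2f} years")
print(f"Load cycles per power-on hour: {cycles_per_hour:.1f}")
print(f"Average seconds between load cycles: {seconds_per_cycle:.0f}")
print(f"Load cycles vs. quoted rating: {times_rated:.1f}x")
```

That works out to roughly 83 load cycles per power-on hour, or about one every 43 seconds on average, and a bit more than five times the quoted figure.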

joeschmuck · Jun 11, 2019

tfran1990 said:
Could that be because of an aggressive power save mode?

Yes, but it could also be due to using factory defaults over the 4+ years of runtime.

Cupcake · Jun 11, 2019

HoneyBadger said:
3.2 million load cycles? I thought the WD Red's weren't supposed to park that often.

tfran1990 said:
Could that be because of an aggressive power save mode?

I will definitely look into changing the default timeout. Thought I did that, but to be honest I have not touched the system much since I set it up 5+ years ago. It just worked :D

Thanks a lot for the elaborate explanation. Learned something new. Since it's RAIDZ2 I think I will keep the drive in there for now, out of curiosity, and see if it worsens soon. The replacement drive is ready and waiting on the shelf. Thanks again for the details, I really appreciate it.
