Disks change state to "Faulted" randomly under load

Status
Not open for further replies.

Grorkef

Cadet
Joined
Sep 23, 2018
Messages
5
Hello,
Since I started my FreeNAS project I have had these problems. I use this system as storage for my VMs.

System Specifications:
Mainboard Asrock E3C224-4L
CPU Intel Xeon E3-1231v3
RAM 4x Kingston ValueRAM 8GB DDR3 ECC
Network Chelsio T520-SO-CR
HBA LSI SAS 9305-16i + 4x CBL-SFF8643-06M 0.6m Cab
Case Gooxi RMC3116-670-HSE (12Gbit SAS Backplane)
PSU Seasonic SS-500 (500 Watt)

Drives 10x 2TB SATA (3Gb 5200rpm) (I know that's not the best. The performance is OK, but I will replace the drives in the future)
2x Samsung 930 PRO (ZIL and L2ARC)
2x Kingston SSD (system drives, attached to mainboard SATA Controller)



pool: freenas-boot
state: ONLINE
scan: scrub repaired 0 in 0 days 00:00:05 with 0 errors on Sun Oct 28 03:45:05 2018
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            ada0p2    ONLINE       0     0     0
            ada1p2    ONLINE       0     0     0

errors: No known data errors

pool: zPool
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: scrub repaired 0 in 0 days 00:11:46 with 0 errors on Fri Oct 12 23:41:16 2018
config:

        NAME                                            STATE     READ WRITE CKSUM
        zPool                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/468bd5db-c7cb-11e8-bfe1-000743495cb0  ONLINE       0     0     0
            gptid/474a3ba5-c7cb-11e8-bfe1-000743495cb0  FAULTED      3     0     0  too many errors
            gptid/4936a5ad-c7cb-11e8-bfe1-000743495cb0  ONLINE       0     0     0
            gptid/4b18c4eb-c7cb-11e8-bfe1-000743495cb0  ONLINE       0     0     0
            gptid/4cb36de4-c7cb-11e8-bfe1-000743495cb0  ONLINE       0     0     0
          raidz2-1                                      DEGRADED     0     0     0
            gptid/296c77da-ce04-11e8-b12d-000743495cb0  ONLINE       0     0     0
            gptid/3741d807-ce04-11e8-b12d-000743495cb0  ONLINE       0     0     0
            gptid/5793acf8-ce04-11e8-b12d-000743495cb0  ONLINE       0     0     0
            gptid/624fc955-ce04-11e8-b12d-000743495cb0  ONLINE       0     0     0
            gptid/6d155253-ce04-11e8-b12d-000743495cb0  FAULTED      3     0     0  too many errors
        logs
          mirror-2                                      UNAVAIL      0     0     0
            da12p1                                      FAULTED      9     0     0  too many errors
            da11p1                                      FAULTED      6     0     0  too many errors
        cache
          da11p2                                        FAULTED      3     0     0  too many errors
          da12p2                                        FAULTED      3     0     0  too many errors

errors: No known data errors
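
For anyone who wants to pull only the faulted rows out of status output like the above, here is a minimal Python sketch. The function name faulted_devices is made up for illustration, and the sketch assumes each device row is whitespace-separated as NAME STATE READ WRITE CKSUM; the counters are kept as strings because they are not guaranteed to be plain integers.

```python
def faulted_devices(status_text):
    """Return (name, read, write, cksum) for every row whose state is FAULTED.

    Hypothetical helper: assumes rows look like 'NAME STATE READ WRITE CKSUM ...'.
    """
    faulted = []
    for line in status_text.splitlines():
        fields = line.split()
        # Section labels such as 'logs' or 'cache' have fewer than five fields.
        if len(fields) >= 5 and fields[1] == "FAULTED":
            name, _state, read, write, cksum = fields[:5]
            faulted.append((name, read, write, cksum))
    return faulted


# A few rows copied from the zPool output above.
sample = """\
raidz2-0 DEGRADED 0 0 0
gptid/468bd5db-c7cb-11e8-bfe1-000743495cb0 ONLINE 0 0 0
gptid/474a3ba5-c7cb-11e8-bfe1-000743495cb0 FAULTED 3 0 0 too many errors
logs
mirror-2 UNAVAIL 0 0 0
da12p1 FAULTED 9 0 0 too many errors
da11p1 FAULTED 6 0 0 too many errors
"""

for name, read, write, cksum in faulted_devices(sample):
    print(f"{name}: read={read} write={write} cksum={cksum}")
```

On the sample rows this lists one pool disk and both log devices, which makes it easy to see whether faults cluster on one group of devices.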




The SMART info of all drives is OK: no reallocated sectors. All self-tests completed without any errors.

As long as the system is running under low load there is no problem.
But when I start larger copy jobs or migrate VMs from other storage, errors occur randomly on the disks.

After restarting the system all error counts are zeroed and the system runs normally.


I hope you can help me identify the problem. As long as this problem is not solved I can't use the system in production.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
You may need a bigger power supply... I would have gone with a 650 with 10 drives. How's your performance otherwise? I suspect it's not amazing.
 

Grorkef

Cadet
Joined
Sep 23, 2018
Messages
5
Hm, OK, I thought that 500 watts would be enough. But so far I haven't been able to measure the power consumption.
Yeah, the performance is not as good as it could be. But I will change the drives in the next few months.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I agree with @kdragon75 on the PSU sizing... 10 drives + boot drive at startup may be too much for 500.

Also check your cabling (both SATA and power).

You may be wasting your time with ZIL and/or L2ARC with only 32GB of RAM... you're right on the limit of making things worse (possibly still over it).
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Drives 10x 2TB SATA (3Gb 5200rpm) (I know that's not the best. The performance is OK, but I will replace the drives in the future)
What kind?
The SMART info of all drives is OK: no reallocated sectors. All self-tests completed without any errors.
Too many errors doesn't relate to bad sectors. Those drives will need to be replaced.
As long as the system is running under low load there is no problem.
But when I start larger copy jobs or migrate VMs from other storage, errors occur randomly on the disks.
This does sound like it could be power related because the system being under load could draw enough power that the supply starts to have trouble.
Here is the guide on power:

Proper Power Supply Sizing Guidance
https://forums.freenas.org/index.php?threads/proper-power-supply-sizing-guidance.38811/

I always like to go a little over on power supply size because the supply loses capability over time as it ages. So, if I only 'need' a 650 watt supply, I might buy a 750 watt supply and expect it to last a bit longer.
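
As a rough illustration of why headroom matters when sizing a supply, here is a minimal Python sketch. The function name and every wattage in it are placeholder assumptions chosen for illustration; they are not measurements of this system and not figures from the guide, so real values should come from the component datasheets.

```python
def psu_headroom_watts(psu_watts, drive_count, peak_watts_per_drive, base_system_watts):
    """Return watts left after an estimated peak draw (negative means overloaded).

    Hypothetical helper: a simple sum, ignoring per-rail limits and PSU aging.
    """
    estimated_peak = base_system_watts + drive_count * peak_watts_per_drive
    return psu_watts - estimated_peak


# ASSUMED placeholder inputs, not measured values for the hardware in this thread.
ASSUMED_PEAK_W_PER_DRIVE = 30   # assumption: peak draw of one 3.5" drive
ASSUMED_BASE_SYSTEM_W = 150     # assumption: board, CPU, RAM, HBA, NIC, SSDs, fans

for psu in (500, 650, 750):
    headroom = psu_headroom_watts(psu, 10, ASSUMED_PEAK_W_PER_DRIVE, ASSUMED_BASE_SYSTEM_W)
    print(f"{psu} W supply: {headroom} W headroom under the assumed peak")
```

With these assumed inputs a 500 watt supply is left with very little margin, while the larger sizes keep a comfortable reserve; different inputs will of course shift the result.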
2x Samsung 930 PRO (ZIL and L2ARC)
2x Kingston SSD (system drives, attached to mainboard SATA Controller)
This is not the correct kind of hardware for this implementation. That is why these have faulted on you.
You need to review these threads about SLOG:

The ZFS ZIL and SLOG Demystified
http://www.freenas.org/blog/zfs-zil-and-slog-demystified/

Testing the benefits of SLOG using a RAM disk!
https://forums.freenas.org/index.ph...s-of-slog-using-a-ram-disk.56561/#post-396630

Testing the benefits of SLOG
https://forums.freenas.org/index.php?threads/testing-the-benefits-of-slog-using-a-ram-disk.56561

SLOG benchmarking and finding the best SLOG
https://forums.freenas.org/index.ph...-and-finding-the-best-slog.63521/#post-454773
You may be wasting your time with ZIL and/or L2ARC with only 32GB of RAM...
Absolutely. You shouldn't bother until you maximize RAM, but if you have a sync write requirement, you do need SLOG.
 

Grorkef

Cadet
Joined
Sep 23, 2018
Messages
5
I ordered a 700 watt PSU.

I informed me about the ZIL and L2ARC and yes it usually should not improve the performance, but I tested it, and on my system the performance with ZIL and L2ARC was 5 times better than without.
Without the ZIL and L2ARC, it took 2 hours to migrate a 40 GB VM. After adding ZIL and L2ARC it was done in 5 minutes.
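
To put the two migration timings side by side, here is a quick back-of-the-envelope calculation in Python. It assumes 1 GB = 1024 MB and that the full 40 GB was transferred in both runs; the function name is only for illustration.

```python
def throughput_mb_per_s(size_gb, seconds):
    """Average throughput in MB/s, assuming 1 GB = 1024 MB."""
    return size_gb * 1024 / seconds


before = throughput_mb_per_s(40, 2 * 60 * 60)  # 2 hours without SLOG/L2ARC
after = throughput_mb_per_s(40, 5 * 60)        # 5 minutes with SLOG/L2ARC

print(f"without SLOG/L2ARC: {before:.1f} MB/s")
print(f"with SLOG/L2ARC:    {after:.1f} MB/s")
print(f"ratio:              {after / before:.0f}x")
```

Taken at face value, those two timings work out to roughly 5.7 MB/s versus 136.5 MB/s, a ratio of about 24.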
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I informed me
I don't know who informed you, but
and yes it usually should not improve the performance
It depends entirely on your use of the system and you are using the system in a way that does call for SLOG at the least. So adding it is going to improve performance.
The reason I gave you those links today is so you can select more appropriate hardware because the L2ARC and SLOG that you are using are not the correct hardware for the task and that will impact performance and longevity.
 

Grorkef

Cadet
Joined
Sep 23, 2018
Messages
5
Sorry for the late reply. I got the 700 watt PSU. After the change the vdev disks are stable; there are no random faults anymore.
But the problem with the SLOG and cache is still there.

After a few hours, one SSD of the cache and one of the SLOG are faulted. I can reset this by removing the SLOG and the cache and reassigning them to the pool. The SSDs are OK. I also bought an additional one to check this, but the problem is the same.

There was a question about the disks:
Code:
not in use <ATA ST1000DM003-1ER1 CC46>        at scbus0 target 0 lun 0 (pass0,da0)
<ATA SAMSUNG HD204UI 0001>         at scbus0 target 1 lun 0 (pass1,da1)
<ATA SAMSUNG HD204UI 0001>         at scbus0 target 2 lun 0 (pass2,da2)
<ATA SAMSUNG HD204UI 0001>         at scbus0 target 3 lun 0 (pass3,da3)
<ATA SAMSUNG HD204UI 0001>         at scbus0 target 4 lun 0 (pass4,da4)
<ATA SAMSUNG HD204UI 0001>         at scbus0 target 5 lun 0 (pass5,da5)
<ATA SAMSUNG HD204UI 0001>         at scbus0 target 6 lun 0 (pass6,da6)
<ATA SAMSUNG HD204UI 0001>         at scbus0 target 7 lun 0 (pass7,da7)
<ATA SAMSUNG HD204UI 0001>         at scbus0 target 9 lun 0 (pass8,da8)
<ATA ST2000DM001-1ER1 CC25>        at scbus0 target 10 lun 0 (pass9,da9)
<ATA SAMSUNG HD204UI 0001>         at scbus0 target 11 lun 0 (pass10,da10)
<ATA Samsung SSD 860 1B6Q>         at scbus0 target 13 lun 0 (pass11,da11)
not in use <ATA Samsung SSD 860 1B6Q>         at scbus0 target 14 lun 0 (pass12,da12)
<ATA Samsung SSD 860 1B6Q>         at scbus0 target 15 lun 0 (pass13,da13)

freenas-boot:
<KINGSTON SA400S37120G SBFKB1E1>   at scbus3 target 0 lun 0 (pass14,ada0)
<KINGSTON SA400S37120G SBFKB1E1>   at scbus4 target 0 lun 0 (pass15,ada1)
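
If it helps to see at a glance which model sits behind which device name, here is a minimal Python sketch that parses lines in the format shown above. The regular expression and the function name parse_devlist are assumptions based only on the lines in this listing, not an official description of the output format.

```python
import re

# Assumed row shape, taken from the listing above:
#   <MODEL STRING>  at scbusN target N lun N (name1,name2)
DEVLIST_ROW = re.compile(
    r"<(?P<model>[^>]+)>\s+at scbus(?P<bus>\d+) target (?P<target>\d+) "
    r"lun (?P<lun>\d+) \((?P<n1>\w+),(?P<n2>\w+)\)"
)


def parse_devlist(text):
    """Return (device, model) pairs; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        match = DEVLIST_ROW.search(line)  # search() tolerates a leading note like 'not in use'
        if not match:
            continue
        n1, n2 = match.group("n1"), match.group("n2")
        # One name is the pass-through device; keep the other one (da*/ada*).
        device = n2 if n1.startswith("pass") else n1
        rows.append((device, match.group("model").strip()))
    return rows


sample = """\
not in use <ATA ST1000DM003-1ER1 CC46>        at scbus0 target 0 lun 0 (pass0,da0)
<ATA SAMSUNG HD204UI 0001>         at scbus0 target 1 lun 0 (pass1,da1)
freenas-boot:
<KINGSTON SA400S37120G SBFKB1E1>   at scbus3 target 0 lun 0 (pass14,ada0)
"""

for device, model in parse_devlist(sample):
    print(f"{device}: {model}")
```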


I have one big problem now. After a few days the system crashes with "swap_pager indefinite wait buffer bufobj 0" errors. I can only reset the server then. This is a huge problem because I use this system as a datastore for my ESXi hosts.
 