ZFS and phy bad for slot error(s)


louisk

Patron
Joined
Aug 10, 2011
Messages
441
I just came across an odd issue. I'm running FreeNAS-8.3.1-BETA2-x64 (r13015) on a Dell R610 connected to an MD1000 JBOD. I've been running this config for a couple of years now, and it's been quite solid. Yesterday I physically moved my gear to a different room. When I brought things back online, I got the following messages:

55725 (Fri Jan 18 00:21:00 PST 2013/ENCL/CRIT) - Enclosure PD 00(e1/s255) phy bad for slot 9
55723 (Fri Jan 18 00:21:00 PST 2013/ENCL/CRIT) - Enclosure PD 00(e1/s255) phy bad for slot 4
55724 (Fri Jan 18 00:21:00 PST 2013/ENCL/CRIT) - Enclosure PD 00(e1/s255) phy bad for slot 6

When I log in to the web UI, I see no such claims about any errors at all (the alerts icon is still green, and the storage section claims things are healthy; see the screenshot attached). I checked the CLI and got the same results as the web UI. I'm left wondering why this was sent out.

pool-healthy.jpg
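
For reference, the CLI check amounted to something like this (a rough sketch; "tank" stands in for the actual pool name):

zpool status -x      # prints "all pools are healthy" when nothing is degraded
zpool status tank    # per-vdev state plus read/write/checksum error counters
camcontrol devlist   # confirms all of the da devices behind the MD1000 are still attached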

Basic googling didn't turn up anything that seemed particularly relevant. Any thoughts on what this means?

Thanks.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Not sure if this relates to your issue, but I'll share my experience with 'phy' errors.

I had an Adaptec SAS controller with an external x4 port connected to a SAS port expander with 16 hard drives attached. The SAS cable between the controller and the expander started getting flaky, and research pointed to a problem with the signal between the two. Phy errors are basically "communication errors" that meet certain criteria. When my cable finally went bad, I lost 1 of the 4 links to the SAS expander (which slowed my RAID array down by about 30%). I'd check your cabling and see if anything is being pinched, stressed, etc. I'm not sure how you'd determine which "slot" maps to which hard drive or cable to help isolate the problem.
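
If your drives expose the SAS phy event counters, smartctl can help narrow down which link is taking the hits. A rough sketch, assuming smartmontools is installed and da4 stands in for one of the MD1000 disks:

smartctl -l sasphy /dev/da4        # per-phy counters: invalid DWORDs, disparity errors, loss of DWORD sync
smartctl -l sasphy,reset /dev/da4  # same report, then zero the counters so fresh errors stand out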

That said, it looks like the issue is likely inside the MD1000 and not between your controller and the MD1000. I'm not familiar with the MD1000, but perhaps it has some failing cables inside? I would assume that if you ignore the errors, as I did, everything will continue to run okay until whatever link is going bad fails for good. That could mean a hard drive suddenly becomes unavailable, or your entire MD1000 stops working (depending on which cable is going bad). The fact that all 3 errors happened at the exact same time makes me wonder if it was a power fluctuation and those 3 slots are simply more susceptible to it than the rest.

Your zpool should continue to show as healthy since everything is still connected and detected. That will change when whatever is getting flaky finally fails. Of course, you should probably find and fix the fault before it gets to that point (which I think you're trying to do already).

Is your hardware attached to a UPS? Is the UPS sufficiently sized for your hardware?
 

louisk

Patron
Joined
Aug 10, 2011
Messages
441
Yes, the R610 and MD1000 are sitting on an APC 3000XL all by themselves (currently along with a Cisco 2960G). I think the UPS is sufficient, and it reports everything green as well.

Interesting. I'll poke around a bit and see if I find anything else.

Thanks.
 

doesnotcompute

Dabbler
Joined
Jul 28, 2014
Messages
18
Louis,

What came of this error and any fix?

You mention a PERC 5 in your sig; is that a 5/i or a 5/e? Are you using a SAS 5/E or another external SAS card to connect to the MD1000 so ZFS can see the individual drives?

Thanks,

-hak
 

louisk

Patron
Joined
Aug 10, 2011
Messages
441
The box has a SAS 6/iR for the six 2.5" slots. I originally had a PERC 5/E connecting the MD1000, but I have since switched to a SAS 5/E. With the PERC, I had to create a collection of RAID0 volumes, one for each drive. It made things annoying, so it got swapped out.
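
For reference, the per-drive RAID0 workaround looked roughly like this with MegaCLI (illustrative only; the [enclosure:slot] IDs and adapter number are placeholders, and exact syntax varies by MegaCLI version):

MegaCli -CfgLdAdd -r0 [32:0] -a0   # wrap one physical drive (enclosure 32, slot 0) in its own single-drive RAID0
MegaCli -LDInfo -Lall -a0          # list the resulting per-drive logical disks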

I honestly don't recall doing anything special. It may have gone away on its own, it may have gone away when I swapped in the SAS 5/E card, or it may have gone away with some drive swapping. I know there weren't any changes to the MD1000 itself, nor any changes to the R610 other than swapping the PERC for the SAS cards.
 

doesnotcompute

Dabbler
Joined
Jul 28, 2014
Messages
18
OK, thanks. I have a crossflashed M105 internally and a SAS 5/E (not a PERC) for the MD1000, but I only see 2 "disks", which look like the EMs / enclosure managers (?). It's definitely not a PERC 5/E. I've got an 8088 LSI card on a truck and will try with that to see if I can get the FreeNAS server to see any individual disks; right now it only sees the internal drives on the M105 and 2 non-disks listed as disks. I've rebooted the box to no avail. Maybe I'll reboot the MD1000 as well next time I'm on site.
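
For what it's worth, something like this should confirm whether those two entries are really enclosure processors rather than disks (a rough sketch with the base camcontrol tool; da4 is just a placeholder device name):

camcontrol devlist      # SES enclosure processors normally attach as sesN, real disks as daN
camcontrol inquiry da4  # the inquiry data shows the peripheral type (disk vs. enclosure-services/processor device)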
 

doesnotcompute

Dabbler
Joined
Jul 28, 2014
Messages
18
Rebooted the MD1000; no change. I confirmed it is a SAS 5/E (not a PERC) connecting to the MD1000 externally. Just in case it was the Dell SAS 5/E card, I pulled it and replaced it with an LSI 9200-8e and different cables, and it looks the same: DA4 and DA5 in the attached screengrab. Off to google how to factory reset an MD1000.
 

Attachments

  • md disks.png

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Pardon me, but 21MB is a freakin' weird size for those devices! Is that actually somehow accurate?

Just out of curiosity, did you make sure the firmware on your LSI controller matches your driver?
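
A quick way to compare the two on a FreeBSD-based FreeNAS box (rough sketch; sas2flash is LSI's flash utility, and the mps driver logs its versions when it attaches):

sas2flash -listall   # firmware/BIOS versions as the controller itself reports them
dmesg | grep -i mps  # the mps attach messages include both the firmware and driver versions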
 