How best to find the offending disk?

Status
Not open for further replies.

Ytsejamer1

Dabbler
Joined
May 28, 2013
Messages
28
Hello all!

I've been building and rebuilding my SunFire x4500 (16GB ECC RAM) with 48 disks on FreeNAS 9.2.0 x64. Two of the disks are 128GB SSDs currently not used in my pool, and one is a 256GB SSD used as a cache disk in the pool. Everything was running great until the scrub kicked in, and then the system just puked.

I'm guessing it ran across a potentially faulty disk, but I would think that shouldn't take the entire system offline. When I rebooted the system, it did come back, and the pool looked clean.

I ran smartctl -t short against all disks active in the pool. Two came back with SMART errors: ada30 and ada37.
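
For anyone repeating this step, here is a minimal sketch of starting the short test on every ada disk and then reading back the health verdict. It assumes a FreeBSD/FreeNAS shell where `sysctl -n kern.disks` prints the disk names and smartmontools is installed. The function names are made up for illustration.

```shell
#!/bin/sh
# Keep only the adaN names from a whitespace-separated disk list
# (the format printed by `sysctl -n kern.disks` on FreeBSD).
filter_ada() {
    tr -s ' \t' '\n\n' | grep -E '^ada[0-9]+$'
}

# Start a SMART short self-test on every ada disk (hypothetical helper, run as root).
start_short_tests() {
    sysctl -n kern.disks | filter_ada | while read -r disk; do
        smartctl -t short "/dev/$disk"
    done
}

# Print the overall health line for each ada disk once the tests have finished.
show_health() {
    sysctl -n kern.disks | filter_ada | while read -r disk; do
        printf '%s: ' "$disk"
        smartctl -H "/dev/$disk" | grep -i 'overall-health' || echo 'no health line found'
    done
}
```

Run `start_short_tests`, wait a few minutes, then run `show_health`. For the full error log of a suspect disk, `smartctl -a /dev/ada30` gives more detail.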

My question is... how the heck do I find the disk? I remember Oracle had a package (SUNWhd) that would identify the device and its position in the chassis. I do have the SN, so theoretically I could shut the system down, open up the chassis, and pull each drive one at a time to find it, and then replace the drives in question. Any other suggestions?
 

rm-r

Contributor
Joined
Jan 7, 2013
Messages
166
If you look in /var/log/messages you can see the boot sequence recorded, and it will show each "adaNN" device along with the type of disk and its serial number. Then you would need to look at each physical disk. I learnt this the hard way, so now all my disks are labelled on the outside edge so I can see the serial without removing them!

That's the only way I know, unless someone else has a shortcut. I would also be interested! :)
 

rm-r

Contributor
Joined
Jan 7, 2013
Messages
166
you could also use

Code:
dmesg | grep ada
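
To cut that output down to device-to-serial pairs, something like the sketch below may help. It assumes the kernel logs a line of the form `adaN: Serial Number <serial>` for each disk, either bare (dmesg) or behind a syslog prefix (/var/log/messages). The function name is made up for illustration.

```shell
#!/bin/sh
# Read dmesg- or syslog-style text on stdin and print "device serial" pairs.
# Assumes each disk logs a line containing "adaN: Serial Number <serial>".
ada_serials() {
    awk '{
        for (i = 1; i + 3 <= NF; i++)
            if ($i ~ /^ada[0-9]+:$/ && $(i+1) == "Serial" && $(i+2) == "Number") {
                dev = $i
                sub(/:$/, "", dev)      # strip the trailing colon from "ada30:"
                print dev, $(i+3)
            }
    }' | sort -u                        # drop repeats from multiple boots
}

# Usage on the NAS itself:
#   dmesg | ada_serials
#   ada_serials < /var/log/messages
```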
 

Ytsejamer1

Dabbler
Joined
May 28, 2013
Messages
28
Thanks rm-r! I shut the system down after copying the SNs into Excel. One other thing bugs me about the FreeNAS interface... you can't export your disk list or even copy the contents of the disk information.

I replaced both disks that reported SMART errors. It was a joy to replace a disk in FreeNAS compared to my other production ZFS appliances (cough SUN/Oracle 74x0 cough). One thing I found though... when I ran a scrub again, the whole pool went offline as it was finding checksum errors on most of the devices. I call shenanigans on that one... I don't have 25 bad disks. It was only a few months ago that I ran Oracle's scrub via Solaris 11 without issue. I'm guessing there's a driver issue or something amiss somewhere.
 

chylewe

Cadet
Joined
Jul 18, 2013
Messages
4
Hi Ytsejamer
I was looking into running FreeNAS on a re-purposed X4500 that our company has.
Are you happy with the results so far?
You seem to have had problems with aspects / features of FreeNAS that we will also be using.
i.e. AD integration, pool scrubbing, etc.

Also where did you find drives to replace your faulty ones with?
 

chylewe

Cadet
Joined
Jul 18, 2013
Messages
4
Maybe this will help with your previous question regarding disk locations:
thumper-diagram-sm.jpg
 

Scareh

Contributor
Joined
Jul 31, 2012
Messages
182
16GB RAM on 48 disks? 0_o

How big are your disks? And what kind of performance do you get on your system? (Transfer rate to and from your NAS.)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
If your disks are so small that 16GB of RAM is "enough" based on the rule of thumb, you could probably replace all of those disks with three or four 4TB drives and have improved system reliability as a result. Not to mention that the power saved from not running that many disks will probably pay for the new drives in a year or less!
 

untg

Cadet
Joined
Feb 24, 2014
Messages
4
For anyone who is interested, I've found that at the end of my disks (not the connector end) the serial number is on a little sticker.
 

Ytsejamer1

Dabbler
Joined
May 28, 2013
Messages
28
Many thanks for the chart chylewe! I had a good idea of the disk positions as to the assignment within the chassis...but was trying to marry up the ada## in FreeNAS to the chart/chassis location.

As for my disks, they are 500GB SATA (see signature for the gory details). It is nice that Sun put the stickers on the top of the carrier.

The issue I seem to have is simply the scrub at this point. 9.2.0 x64 seems quite stable and I have had fewer oddball issues with it. As for the scrub, it starts to find checksum errors all over the place and then the whole system eventually locks up and reboots. I've run SMART tests on all the disks. Two of them did show an error, and I replaced those with two 500GB SATA disks I had on the shelf; the rest are clean. I can't find anything wrong. In fact, when I was running Solaris 11 with the same pool design, the scrubs ran without issue.

With AD integration... I have to say that once you get the hang of things (some of the sort-of gotchas), it works fine! I had small issues with sharing out mount points and setting permissions at those mount points. I'm used to how my NetApp and Sun/Oracle ZFS appliances handle permissions at the mount point level. The only difference with FreeNAS is that I have to give my AD account full access at the mount point via the command line. From there, I can configure the shares via Windows management as normally expected. I'm not a *nix guy... but it's remedial even for me. Just an extra step to be aware of.

I wish I could add more memory, but this old beast is maxed out at 16GB. I'm not looking to do more than store a few old things via CIFS... I almost don't care about performance, as long as it works well enough. It's not hosting production data or services. I haven't measured the performance in any way during my test data transfers, and I can't remember what the Richcopy rate was.
 

Ytsejamer1

Dabbler
Joined
May 28, 2013
Messages
28
Here is a quick and dirty Excel sheet with my information. Please feel free to use it for your SunFire system, replacing the SNs and whatnot.
 

Attachments

  • x4500DiskLayout.zip
    11.6 KB

untg

Cadet
Joined
Feb 24, 2014
Messages
4
I've used the following commands in FreeBSD to get information about the disks, including the serial number (replace adaX with the actual device, e.g. ada30):
gpart show -lp
camcontrol devlist
smartctl -a /dev/adaX
dmesg
camcontrol identify adaX
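
A minimal sketch of using the last of those commands to print one "device serial" line per disk. It assumes `camcontrol identify` prints its usual `serial number   <value>` line. Both function names are hypothetical.

```shell
#!/bin/sh
# Extract the serial from `camcontrol identify` output supplied on stdin.
# Assumes a line of the form "serial number   <value>".
identify_serial() {
    awk '$1 == "serial" && $2 == "number" { print $3; exit }'
}

# Print "device serial" for each device name given as an argument
# (hypothetical helper, needs root on the NAS).
print_serials() {
    for disk in "$@"; do
        printf '%s %s\n' "$disk" "$(camcontrol identify "$disk" | identify_serial)"
    done
}

# Example: print_serials ada30 ada37
```

The resulting list can be matched against the stickers on the drive carriers without pulling each disk.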
 

joelmusicman

Patron
Joined
Feb 20, 2014
Messages
249
Posting output from "zpool status" would be quite helpful... I'm really curious about your pool design to say the least.
 

Ytsejamer1

Dabbler
Joined
May 28, 2013
Messages
28
joelmusicman - I designed the pool layout based on the legacy Sun ZFS pool documentation/SolarisInternals stuff (http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide). They had an example pool design based on the Thumper box specifically.

[root@storage05] ~# zpool list
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
pool-0  18.9T   688G  18.2T     3%  1.00x  ONLINE  /mnt
[root@storage05] ~# zpool status
  pool: pool-0
 state: ONLINE
  scan: resilvered 15.1M in 307445734561825859h24m with 0 errors on Tue Feb 11 13:28:10 2014
config:

    NAME                                            STATE     READ WRITE CKSUM
    pool-0                                          ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/537c58c0-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/540a2d08-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/549c0c19-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/552c1c67-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/55bd01a6-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/564ec28a-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
      raidz2-1                                      ONLINE       0     0     0
        gptid/577553d8-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/58059e24-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/589a160a-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/592bf9d0-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/59e07666-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/5a74a487-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
      raidz2-2                                      ONLINE       0     0     0
        gptid/5b9b8149-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/5c346245-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/5cc3baa6-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/5d563984-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/5dec4a8c-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/5e81a424-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
      raidz2-3                                      ONLINE       0     0     0
        gptid/5fa7bbee-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/6040e9ed-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/60d60fa3-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/616e360d-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/6204cc88-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/629f1da0-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
      raidz2-4                                      ONLINE       0     0     0
        gptid/63cfd962-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/646d1825-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/6504aa88-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/659f063d-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/663a2f0b-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/66d4f539-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
      raidz2-5                                      ONLINE       0     0     0
        gptid/68061096-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/68a67d53-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/6941d472-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/69de095b-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/6a7cb174-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/ed1dab43-9344-11e3-87e0-00144ff28650  ONLINE       0     0     0
      raidz2-6                                      ONLINE       0     0     0
        gptid/6c554a8a-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/6cf35eb1-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/6d9b6f21-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/6e39461b-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/6edcdd7b-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
        gptid/6f84daaa-6f39-11e3-9c36-00144ff28650  ONLINE       0     0     0
    cache
      gptid/7084f749-6f39-11e3-9c36-00144ff28650    ONLINE       0     0     0
    spares
      gptid/7129fe28-6f39-11e3-9c36-00144ff28650    AVAIL
      gptid/72714fc5-6f39-11e3-9c36-00144ff28650    AVAIL
      gptid/5be77053-9346-11e3-87e0-00144ff28650    AVAIL

errors: No known data errors
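
Because the pool members show up as gptid labels rather than ada## names, tying a pool entry back to a device (and from there to a serial number and chassis slot) takes one more lookup. A minimal sketch, assuming `glabel status` prints its usual three columns (Name, Status, Components). The function name is made up for illustration.

```shell
#!/bin/sh
# Read `glabel status` output on stdin and print "gptid-label partition" pairs,
# e.g. "gptid/537c58c0-... ada0p2". Non-gptid labels and the header are skipped.
gptid_to_dev() {
    awk '$1 ~ /^gptid\// { print $1, $3 }'
}

# Usage on the NAS itself:
#   glabel status | gptid_to_dev
#   glabel status | gptid_to_dev | grep ed1dab43    # look up one pool member
```

The partition name (for example ada30p2) gives the ada## device, which can then be matched to a serial number.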
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526

joelmusicman

Patron
Joined
Feb 20, 2014
Messages
249
On a more serious note, your config has 14TB of actual storage capacity. I second what CJ said earlier about upgrading hard drives. I recognize that this system may not support 4TB disks, but if it does, a single 6x4TB RAIDZ2 config would net more usable space.
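
The arithmetic behind those two figures, as a rough sketch. It counts only data disks (disks minus parity) per vdev and ignores ZFS overhead and TB-versus-TiB differences. The function name is made up for illustration.

```shell
#!/bin/sh
# Rough usable space in GB: vdevs * (disks_per_vdev - parity) * disk_size_gb.
raidz_usable_gb() {
    vdevs=$1 disks=$2 parity=$3 size_gb=$4
    echo $(( vdevs * (disks - parity) * size_gb ))
}

raidz_usable_gb 7 6 2 500     # current layout: seven 6-disk RAIDZ2 vdevs of 500GB disks
raidz_usable_gb 1 6 2 4000    # proposed layout: one 6-disk RAIDZ2 vdev of 4TB disks
```

The first call prints 14000 (14TB) and the second prints 16000 (16TB).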

The system is probably unable to handle the huge amount of parity calculations and that's why it's puking.
 

Ytsejamer1

Dabbler
Joined
May 28, 2013
Messages
28
You've got time, right?
lol... yeah, I don't think that's entirely accurate. I replaced two disks that had SMART errors and replaced them in the pool. It resilvered in about three minutes. I have no idea where that number is coming from!
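
One possible explanation, offered as a guess rather than anything confirmed from the ZFS code: the reported duration is exactly 52 minutes short of 2^64 minutes. That is what a duration of minus 52 minutes would look like if it were printed as an unsigned 64-bit number, for example if the clock changed between the start and end of the resilver. The check below only verifies the arithmetic; the function name is made up for illustration.

```shell
#!/bin/sh
# 2^64 minutes is 307445734561825860 hours and 16 minutes.
# Given a reported "<hours>h<minutes>m", print how many minutes it falls
# short of 2^64 minutes. The subtraction stays well inside signed 64-bit range.
minutes_below_wrap() {
    hours=$1 minutes=$2
    echo $(( (307445734561825860 - hours) * 60 + 16 - minutes ))
}

minutes_below_wrap 307445734561825859 24
```

This prints 52.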

I'm sure I probably could replace several 500GB disks with at least 1 or 2TB SATA disks. I'm not sure why it can't handle the parity calcs...it was doing fine running on Solaris with a similar pool setup. I just got tired of command line everything. Unless the BSD ZFS is a completely different beast... *shrugs*

This box is my plaything... I'm not doing much with it at all, but the idea of different disks is worth exploring.
 