Citadel - Build Plan and Log

ctag · Jan 25, 2018

Hrm, the host doesn't have any sleep mode or anything. I was about to try returning the drive today (at the risk of replacing it with a different model once shucked) and noticed that all three drives failed the extended SMART tests... Something weird is going on.

I ran a system update on the host, rebooted it, and then plugged one of the drives back in and started an extended test, and it failed a little while later just like before:

Code:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Aborted by host               90%       190         -
# 2  Short offline       Completed without error       00%       189         -
# 3  Extended offline    Aborted by host               80%       162         -
# 4  Short offline       Completed without error       00%       161         -
# 5  Extended offline    Aborted by host               90%        23         -
# 6  Short offline       Completed without error       00%        22         -
# 7  Extended offline    Aborted by host               80%         1         -
# 8  Short offline       Completed without error       00%         0         -
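A minimal sketch of how a log like the one above could be checked automatically, assuming the self-test rows keep this layout and that every test description is two words (as in "Extended offline"). The function names are hypothetical and the sample rows are copied from the log with whitespace normalised.

```python
import re

# Sample rows copied from the self-test log above (whitespace normalised).
LOG = """\
# 1  Extended offline    Aborted by host               90%       190         -
# 2  Short offline       Completed without error       00%       189         -
# 3  Extended offline    Aborted by host               80%       162         -
# 4  Short offline       Completed without error       00%       161         -
"""

# Columns: test number, description, status, remaining %, lifetime hours, LBA.
# Assumption: the description is always exactly two words.
ROW = re.compile(
    r"^#\s*(\d+)\s+(\S+ \S+)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftest_log(text):
    """Return one dict per self-test row found in `text`."""
    entries = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # skip headers and anything that is not a test row
        num, desc, status, remaining, hours, lba = match.groups()
        entries.append({
            "num": int(num),
            "description": desc,
            "status": status,
            "remaining_pct": int(remaining),
            "lifetime_hours": int(hours),
            "lba_of_first_error": lba,
        })
    return entries


def aborted_tests(entries):
    """Return the rows whose status says the host aborted the test."""
    return [e for e in entries if e["status"] == "Aborted by host"]


entries = parse_selftest_log(LOG)
for entry in aborted_tests(entries):
    print(entry["num"], entry["description"], f'{entry["remaining_pct"]}% remaining')
```

Every extended test in the sample is flagged while the short tests pass, which matches the pattern in the log: only the long-running tests are interrupted.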

ctag · Feb 3, 2018

Well the SMART tests pass just fine on another machine, so I can assume that the desktop I was using had issues. I ran the disk burnin on the last three drives, and everything passed. The spin-up-time values are still weird, but I'm going to roll with it. For starters, I checked the value on some old WDC 2TB red disks, and it was close to 170, which matches what the "odd" drive I have was reporting. So I actually think the drives that are returning 253 may be wrong or are still settling down to a 'real' value, since 253 is a perfect score.

I also noticed that the drives are half and half on SCT ERC. Some have it enabled, others disabled. I intend to go through and enable it on all of them, since that seems to be recommended for RAID setups.

Other than that I think I'm about ready to start building this thing :)

ctag · Feb 3, 2018

Well I wasted the morning searching around for my second 32GB flash drive that I bought to make the FreeNAS boot disk mirrored. It looks like you can go from a single disk to mirrored after installation though, so I guess I'll start with one and then order another.

I got through the initial installation, shut down, removed the install media, and rebooted. It chugged along for a bit, and then keeled over with a bunch of "CCB request completed with an error" messages.

[Screenshot of the console errors; image not preserved]

I assume this is an issue with the SAS card, since da6 was one of the WD drives and the boot flash drive was da7 during the first boot (though thinking about it now, I bet that changed when I removed the install media?). A lot of other forum threads seem to encounter this error and attribute it to the flash drive, but I haven't seen much in the way of solutions other than buying a real hard drive to boot from...

*An hour later*
Rebooted with the flash drive plugged into a different port. Everything came up OK and I went through the introductory Wizard in the web-ui. I set up email (with app-password), and created one main volume. Now I'm just dorking around trying to get a feel for things.

*Later*
I tried out the beta UI, and found it to be quite nice, although somewhat glitchy. Ultimately reboots wouldn't work from it, so I went back.
Also, NFS permissions are giving me trouble. This happened on the old NAS too, but now I'd like to figure it out and get things set up nicely rather than "just make it work".

*Later*
I was going to go back today and set all of the SCT/ERC SMART values to be 7 seconds on all the drives with smartctl -l scterc,70,70 /dev/daX, but it turns out they're all already set to that! I guess part of the pool creation process or something includes setting TLER stuff. Nice.
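The 70 in that command is easy to misread: smartctl takes the SCT ERC read and write timeouts in tenths of a second, so 7 seconds is written as 70. A small sketch of the conversion follows; the helper name `scterc_args` is made up for illustration, and it only builds the argument list rather than running anything.

```python
def scterc_args(read_seconds, write_seconds, device):
    """Build the smartctl arguments for setting SCT ERC timeouts.

    smartctl expects the timeouts in tenths of a second, so 7 s becomes 70.
    """
    read_ds = round(read_seconds * 10)
    write_ds = round(write_seconds * 10)
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


# /dev/da0 stands in for the /dev/daX placeholder used above.
command = scterc_args(7, 7, "/dev/da0")
print(" ".join(command))  # prints: smartctl -l scterc,70,70 /dev/da0
```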

ctag · Feb 5, 2018

I'm planning to go back through a re-install with GELI, but before that I'm going to make the most of having everything set up: I'm going to simulate a disk failure and (hopefully) resilver.

First up, I can't get dd to wreck an in-use drive:

Code:

# dd if=/dev/zero of=/dev/da0 bs=64 count=64				
dd: /dev/da0: Operation not permitted

So I shut down the server and pulled a disk. Upon boot, the tty doesn't show anything amiss, but the web UI has a critical warning for me:

[Screenshot of the web UI critical warning; image not preserved]

And I got an email from the server:

Code:

The volume main-pool state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.

So the computer knows that something's wrong, and let me know. That's to be expected, but it's nice to see it work.

Next, I "damage" the removed disk:

Code:

# dd if=/dev/zero of=/dev/sdf bs=4M count=1000

Shut down the server and reinstall the disk. Bootup seems to take longer...

And back at the web UI... Nothing? No sign of the critical warning from before, and no new emails... The volume shows healthy. Did the pool resilver during boot?

I'm not sure I like that at all.. At least the pool/volume is fixed.. I think.

*Edit*
Running zpool status shows:

Code:

scan: resilvered 2.30M in 0 days 00:00:00 with 0 errors on Mon Feb  5 20:56:53 2018

So the pool did resilver. Good to know.
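For anyone who wants to check this from a script rather than by eye, here is a rough sketch that pulls the numbers out of a scan line. It assumes the exact wording shown in this thread ("resilvered ... in N days HH:MM:SS with N errors", and the matching "scrub repaired" form); other ZFS versions format this line differently, so treat the pattern as an assumption.

```python
import re
from datetime import timedelta

# Pattern for the scan line format seen in this thread (an assumption for
# any other ZFS version).
SCAN = re.compile(
    r"scan:\s+(?P<kind>resilvered|scrub repaired)\s+(?P<amount>\S+)\s+in\s+"
    r"(?P<days>\d+) days (?P<h>\d+):(?P<m>\d+):(?P<s>\d+)\s+"
    r"with\s+(?P<errors>\d+) errors"
)


def parse_scan(line):
    """Return kind, amount, duration and error count, or None if no match."""
    match = SCAN.search(line)
    if match is None:
        return None
    duration = timedelta(
        days=int(match["days"]),
        hours=int(match["h"]),
        minutes=int(match["m"]),
        seconds=int(match["s"]),
    )
    return {
        "kind": "resilver" if match["kind"] == "resilvered" else "scrub",
        "amount": match["amount"],
        "duration": duration,
        "errors": int(match["errors"]),
    }


resilver = parse_scan(
    "scan: resilvered 2.30M in 0 days 00:00:00 with 0 errors on Mon Feb  5 20:56:53 2018"
)
scrub = parse_scan(
    "scan: scrub repaired 0 in 0 days 00:00:18 with 0 errors on Tue Feb  6 12:19:29 2018"
)
print(resilver)
print(scrub)
```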

Chris Moore · Feb 6, 2018

ctag said:
Shut down the server and reinstall the disk. Bootup seems to take longer...

And back at the web UI... Nothing? No sign of the critical warning from before, and no new emails... The volume shows healthy. Did the pool resilver during boot?

I'm not sure I like that at all.. At least the pool/volume is fixed.. I think.

If you had deleted the partition on the disk you removed, then you would have seen a full resilver. Even so, a resilver only restores the amount of data actually present, so if your pool is empty, it goes very fast. The more data there is, the longer it takes. Hardware RAID doesn't know anything about the data, so it must rebuild the entire disk; ZFS is smarter, and it knows the data.

Because you didn't delete the partition, it was recognized as being the same disk, and ZFS knew that all it needed was missing data restored, which it did without even asking. For a blank disk, you would have to direct the system to use it to begin the resilver.

ctag · Feb 6, 2018

@Chris Moore Thanks for the explanation, that makes sense now.

ctag · Feb 6, 2018

Something interesting happened!

I couldn't figure out how to manually instigate a scrub, so this morning I scheduled the next one to happen today at 12:00. At noon I got three emails from the machine all within a few seconds of each other:

Code:

starting scrub of pool 'main-pool'

Then:

Code:

scrub of pool 'main-pool' finished

And finally a critical alert:

Code:

The volume main-pool state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

Corresponding message in the UI:

[Screenshot of the UI alert; image not preserved]

And the volume storage details:

[Screenshot of the volume storage details; image not preserved]

So it looks like the scrub found some additional errors that needed to be fixed from the resilver? And it fixed them? But the critical message on the UI is still there as though I need to do something about it.

From the scrub email I learned what command starts a scrub. So I ran another:

Code:

# /usr/local/libexec/nas/scrub -t 0 main-pool				   
   starting scrub of pool 'main-pool'

And within a minute I got another scrub finished email; this time no repairs were reported.
Running zpool status again shows some more information:

Code:

  pool: main-pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0 days 00:00:18 with 0 errors on Tue Feb  6 12:19:29 2018

So the error is hanging around for me to decide whether the disk needs replacement or not. That's handy. Since I'm pretty sure I caused the error by dd'ing part of the disk, I'm going to run zpool clear to reset the error message.
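The catch here is that the state line alone still says ONLINE, so a check that only looks at the state would miss the lingering error. A rough sketch of reading the header fields instead, assuming the layout shown above (a "name: value" line followed by indented continuation lines) and assuming a healthy pool prints no status line at all; `parse_status` and `needs_attention` are hypothetical names.

```python
import re

# Header copied from the zpool status output above.
STATUS = """\
  pool: main-pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
"""

FIELD = re.compile(r"^\s*(pool|state|status|action|see|scan):\s*(.*)$")


def parse_status(text):
    """Collect the header fields, joining wrapped continuation lines."""
    fields = {}
    current = None
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2).strip()
        elif current and line.strip():
            fields[current] += " " + line.strip()
    # Collapse runs of whitespace left over from the wrapped layout.
    return {key: " ".join(value.split()) for key, value in fields.items()}


def needs_attention(fields):
    """True when the pool reports a status message, even if state is ONLINE."""
    return fields.get("state") != "ONLINE" or "status" in fields


fields = parse_status(STATUS)
print(fields["state"], "| needs attention:", needs_attention(fields))
print(fields["action"])
```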

Chris Moore · Feb 6, 2018

ctag said:
So the error is hanging around for me to decide whether the disk needs replacement or not. That's handy. Since I'm pretty sure I caused the error by dd'ing part of the disk, I'm going to run zpool clear to reset the error message.

This is not the technical explanation, but a kind of 'in my own words' explanation of the way it works.
ZFS keeps a kind of history on every disk (like traffic tickets) and if they get enough it gives the offending disk the boot. Like an eviction notice. It advised you about it so you can make a determination because it expects you to do additional testing. I had a disk that was throwing CRC errors repeatedly and the alert from FreeNAS allowed me to check into it and I ultimately replaced the data cable. The disk was fine but the cable was bad.

By the way, like with a resilver, the more data you have, the longer the scrub will take.

ctag · Feb 7, 2018

Cool. I had written 6GB of documents to the pool before messing with it, but probably should have weighed it down a little more to get a feel for things.

Last night the scrub for boot-volume ran, and returned critical error email messages. Now the web UI is unresponsive, and the tty is filled with more "CAM status: CCB request completed with an error" messages.
The emails:
11:02pm

Code:

The boot volume state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

11:03pm

Code:

The boot volume state is ONLINE: One or more devices are faulted in response to IO failures.

3:45am

Code:

cannot get history for 'freenas-boot': pool I/O is currently suspended
cannot get history for 'freenas-boot': pool I/O is currently suspended
   skipping scrubbing of pool 'freenas-boot':
	  can't get last scrubbing date

Reading around, it seems this is definitely a USB issue, where USB devices in FreeBSD are presented as SCSI. I guess now I have to decide whether to try using non-USB3.0 flash drives, or switch to SATA boot drives :(

SCSI system https://people.freebsd.org/~gibbs/ARTICLE-0001.html
SCSI CDB https://en.wikipedia.org/wiki/SCSI_CDB

Chris Moore · Feb 7, 2018

ctag said:
Reading around, it seems this is definitely a USB issue, where USB devices in FreeBSD are presented as SCSI. I guess now I have to decide whether to try using non-USB3.0 flash drives, or switch to sata boot drives

If you look at the build in my signature, I use a pair of 2.5 inch 40GB laptop hard drives for my boot pool. I do that in all three of my FreeNAS builds. It is more than adequate and since they are hard drives, you can do diagnostic testing on them just the same as any other drive in the system.
Manufacturers are making the USB memory sticks so cheap that they are just not reliable. Lots of the other forum members are suggesting using a regular (small as you can get) SSD because just one of those is more reliable than a mirrored pair of the USB sticks. There is actually a push to take the USB boot option out of the documentation, but it may not happen any time soon.
I went to the method I have now about three years ago because of having two USB sticks fail on me, one after another, within six months. It is a real annoyance, but if you have a copy of your system config DB, it is not too bad to recover from.

ctag · Feb 9, 2018

There are still SATA ports available, so if it comes to that it'll work. Unfortunately I can't seem to find ~40GB drives that don't look like sketchy junk.

I had already bought two USB 2.0 flash drives anyway, and want to give them a shot first since my gut feeling is that having the USB3.0 drive operating as USB2.0 was a contributing factor. All the way up until a few minutes ago my plan was to blow the system away and re-install with the new flash drives. This was because the original installation crashed with the USB issues, and I'm worried the system is somehow minutely corrupted.

But then I found this blog post about converting to mirrored boot, and decided to give it a shot. The process was surprisingly smooth: I added a drive to create a mirror, then replaced the original drive. For a few minutes the system was bugged out showing three drives with one of the drive IDs as a bunch of garbled numbers, but it eventually cleared up on its own. I'm really stoked that it was possible to totally replace the boot media with the system live; that's just awesome.

I ran a couple of boot scrubs back-to-back and everything checked out OK, so if that freezing issue is still there, it's biding its time.

Now I'm feeling a little conflicted. On one hand, it'd be super cool if the first ever FreeNAS installation on this machine was 'canon' and stayed with it forever. On the other hand it'd be nice to go back through the installation process just to be sure everything is kosher. Definitely not a bad problem to have though, this was fun!

wblock · Feb 10, 2018

Chris Moore said:
There is actually a push to take the USB boot option out of the documentation, but it may not happen any time soon.

This is news to me. I have de-emphasized USB somewhat, but there are still a lot of people with limited SATA ports and empty USB ports.

ctag · Feb 10, 2018

Well I might be going to SATA drives yet.

I went to re-install FreeNAS this morning, using the two new drives as a mirror, and can't get past the first screen:

[Screenshot of the installer failure; image not preserved]

I'm worried, because bug reports about page faults say they indicate hardware issues. So far it's happened twice in a row at this same point.

wblock · Feb 10, 2018

ctag said:
There are still SATA ports available, so if it comes to that it'll work. Unfortunately I can't seem to find ~40GB drives that don't look like sketchy junk.

A consumer SSD is my preference. It does not have to be small; extra space is useful for boot environments, log files, and wear leveling.

Jailer · Feb 10, 2018

You don't have to get a small SSD for a boot drive. Get an inexpensive SSD that you can afford.

Edit: @wblock beat me to it.

ctag · Feb 10, 2018

Sounds good to me. If USB doesn't work out I'll look into small SSDs.

For some reason swapping the ports that the USB drives were in resulted in a successful installation... Not super encouraging, but oh well.

So I've got the system set up with mirrored boot drives, and a GELI main pool. I've backed up the master and recovery key. Now I just need to plan out how I want my services set up and then go do it.

ctag · Feb 10, 2018

Aaaaand a disk in the old ReadyNAS just failed. Absolutely incredible timing. Now things are a bit more imperative.

*Edit - liveblogging the disk replacement*

I don't have a forum thread for the ReadyNAS, so it's going here.
OK, so this happened because I had to reboot the ReadyNAS in order to move the UPS around and add the new FreeNAS box to my little networking closet. When the ReadyNAS rebooted, it ran a volume check (possibly for the first time since 2016) and discovered that a disk had gone bad!
I used to have an offline 2TB replacement, but it's gone, either lost or used.
Pulled the trouble disk and checked it on my desktop. It enumerates, but then keels over if I try to query SMART data. So it looks like it is actually FUBAR.
Try swapping in a 3TB WD Red disk, it's wasting 1TB, but that's fine by me.
Ugh! The ReadyNAS only supports up to 2TB! I had totally forgotten!
There's a 2TB disk in my desktop's software raid. I'm going to swap it with the 3TB disk.
First, run a scrub of the LVM raid array with lvchange --syncaction {check|repair} vg/raid_lv
And monitor it with lvs -o +raid_sync_action,raid_mismatch_count vg/raid_lv
Wait for the scrub to finish... Done.
Read the man pages. Read the man pages. man lvmraid was super helpful.
Run lvconvert --replace old-disk vg/raid_lv new-disk to migrate disks. I have no idea if the old disk needs to be left alone while this is happening, but I'm not touching it.
Remove the old disk from the lv (forgot the command)
Shut down the desktop, swap the disks, put the dinky 2TB in the ReadyNAS and let it do whatever it does.
Done! Both arrays appear to have survived.

ctag · Feb 15, 2018

Scheduling things with the stable UI is a little tricky, but I think I have it set up the way I want:
- Long SMART tests run once a month, each disk on a different day from the 1st through the 7th
- Short SMART tests run once a week, each disk on a different day, always starting an hour before a Long test could start.
- Scrubs run twice a month, on days past the 7th.
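To make the staggering concrete, here is a rough sketch of the plan above as a small generator. The disk names, the start hour, and the two scrub days are all assumptions picked for illustration (the real values live in the FreeNAS UI), and `build_schedule` is a hypothetical helper, not anything FreeNAS provides.

```python
# Hypothetical device names; the real pool's disk list is not given here.
DISKS = ["da0", "da1", "da2", "da3", "da4", "da5", "da6"]
LONG_TEST_HOUR = 3      # assumed start hour for long tests
SCRUB_DAYS = (10, 24)   # assumed: two scrub days, both after the 7th


def build_schedule(disks, long_hour=LONG_TEST_HOUR, scrub_days=SCRUB_DAYS):
    """Give each disk its own long-test day (1-7) and short-test weekday."""
    if len(disks) > 7:
        raise ValueError("only days 1-7 are available for long tests")
    if any(day <= 7 for day in scrub_days):
        raise ValueError("scrubs must fall after the 7th")
    per_disk = {}
    for index, disk in enumerate(disks):
        per_disk[disk] = {
            "long_day_of_month": index + 1,  # 1st through 7th
            "long_hour": long_hour,
            "short_weekday": index,          # 0 = Monday ... 6 = Sunday
            "short_hour": long_hour - 1,     # an hour before a long test could start
        }
    return {"smart": per_disk, "scrub_days": tuple(scrub_days)}


schedule = build_schedule(DISKS)
for disk, slots in schedule["smart"].items():
    print(disk, slots)
print("scrubs on days", schedule["scrub_days"])
```

Because each disk gets a distinct day of the month and a distinct weekday, no two disks ever start the same kind of test together, and the scrub days never overlap the long-test window.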

I'm following advice from this post and also this other post.

danb35 · Feb 16, 2018

ctag said:
each disk on a different day

Why on earth would you complicate your setup this way? It makes your configuration far more complicated than it has any reason to be, and gains you nothing at all.

ctag · Feb 16, 2018

I'm glad you asked. It seemed like a bad idea to have all of the disks running a SMART test at the same time, so it just makes sense to do them incrementally. There's advice around the forum not to scrub and SMART test at the same time, which suggests the tests carry some performance penalty, so limiting it to a single disk at a time seems reasonable to me.
