SOLVED CAM status; SCSI status - only when using ZFS

sq8ijk · Oct 16, 2015

Hello,

This is my first post on the FreeNAS forum, so I would like to say hello to everyone!

I am running into a very strange issue and cannot find anything similar in other topics. (A few topics discuss CAM status and similar errors, but this case is different because I used two different boxes and three drives.)

I had one pool with three similar hard drives in a mirror-0 configuration:

Drive A: WD2003FZEX
Drive B: WD2002FAEX
Drive C: WD2002FAEX

These drives have exactly the same geometry, and one drive was always kept unplugged in case the pool died. Every few weeks I swapped one of the connected drives for the unplugged one.

All of them were running on one system (mini-ITX with non-ECC memory), where I noticed that Drive B went unhealthy due to errors. In dmesg I spotted the following messages for Drive B (the commands are shortened because I don't have the originals, only what I copied and pasted):

CAM status: Uncorrectable parity/CRC error
Retrying command
WRITE_FPDMA_QUEUED. ACB: [...]

At this point I decided to unplug Drive B and connect Drive C so that it would resilver from Drive A.
It did not, because I was getting errors like the one below from Drive A:

CAM status: CCB request terminated by the host
Retrying command
Read(10). CDB: [...]

At this stage Drive C went crazy and was marked as unhealthy.

With only one copy of my data left on Drive A, I decided to use rsync to copy all my important data to an external USB drive. Everything went smoothly, without any errors in dmesg.

Once I had a copy, I decided to move the healthy Drive A to another system (a very stable Ultra 40 M2 workstation with 16 GB of ECC memory).

The HDDs run in SATA 1.5 Gbit/s mode because the system supports neither 3 nor 6 Gbit/s.

I wiped Drive C and connected it to the system, created another pool (bkpool), and started moving data from Drive A using send/recv of a snapshot.

The system started reporting lots of errors straight away:

CAM status: CCB request terminated by the host
Retrying command
Read(10). CDB: [...]

I canceled the send/recv command and manually created all the ZFS filesystems on Drive C. Using rsync I was able to copy all of the data from Drive A to Drive C without a single error in dmesg.

Question: What is wrong, and why does everything work as expected with rsync, while zfs send/recv or adding a disk to the mirror produces these errors?

I thought it was a cable/controller issue, but seeing it on a separate box made me think there could be a bug in FreeNAS?

FreeNAS 9.1-STABLE

sfcredfox · Oct 16, 2015

I haven't personally experienced this problem, but I'm wondering if you used the GUI for removing the drives properly according to the docs?

sq8ijk said:
At this point I decided to unplug Drive B and connect Drive C so that it would resilver from Drive A.
It did not, because I was getting errors like the one below from Drive A:

sq8ijk · Oct 16, 2015

sfcredfox said:
I haven't personally experienced this problem, but I'm wondering if you used the GUI for removing the drives properly according to the docs?

sfcredfox,

Forgot to mention: everything was done using the CLI and the proper zpool / zfs commands.
I did not use the GUI for removing the drives, but attach/detach from the zpool CLI.

jgreco · Oct 16, 2015

I won't bother telling you that you are doing really bad things since I expect you already know that.

Check disk C with a SMART long test and report the results please.

cyberjock · Oct 16, 2015

jgreco said:
I won't bother telling you that you are doing really bad things since I expect you already know that.

Check disk C with a SMART long test and report the results please.

+1

Having a 3-way mirror and thinking you'd be able to recover if a disk failed is nonsense. Sounds great in theory, but if you know how ZFS works you'd know it doesn't work that way. ;)

sq8ijk · Oct 17, 2015

Hi,

Sorry, but I don't get it. Why is a 3-way mirror a bad thing? Note that one disk was always disconnected, so it was a golden copy that actually allowed me to recover from the current situation. If I had kept it connected, then I agree that a corrupted pool would have caused complete data loss, but here I just connected the third disk to the system and was able to rsync the data to another disk, which saved my valuable data.

I am trying to set up a system that reduces the possibility of data loss to a minimum, as I already lost important information around 3 years ago (there was a lightning strike close to my house and 2 HDDs working in a NAS went very much dead).

What would you suggest? Keeping another pool on the third drive and rsyncing to it every once in a while from a 2-way mirror? Or maybe a different filesystem on the third drive? There is so much data that I cannot burn all of it to DVDs or Blu-ray.

Currently I am setting up a new pool with all the data as a 2-way mirror. I will run a SMART long test on each of the drives, but it will take a while before I can post the results here.

cyberjock · Oct 17, 2015

What do you do when you remove a disk from a 3 way mirror for 3 weeks, then plug it back in? Which disk(s) do you trust when ZFS thinks one drive has 10000 transactions and two others have 50000 transactions?

Now what do you do when you have one disk fail, one disk claims 10000 transactions, and one disk that claims 50000 transactions? Which disk do you believe?

You wouldn't remove disks from a hardware RAID in this fashion, so why would you do this with ZFS?

Your fatal flaw was removing a disk and keeping it as a backup, thinking you could just plug it back in and get back up quickly. ZFS is not designed to allow this, and has no good mechanism to recover except to resilver the disk. But I'm sure you weren't wiping the disk and resilvering it back into the zpool after removing it for some period of time.

As for what I would suggest, I'll refrain from trying to make suggestions. Only you know what your options are, what risks you consider to be of the highest priority, and what provides you with the most peace of mind. Many people wouldn't consider a backup to be a backup unless it was a backup that was offsite.

sq8ijk · Oct 17, 2015

cyberjock,

One last question: if the system was shut down prior to the disk swap, could the transactions still differ?

Whenever I did a poweroff / disk swap / poweron, resilvering kicked in, and so far the disk that had just been connected was resilvered after a few minutes.

sq8ijk · Oct 18, 2015

Small update:

1. I did a fresh install of FreeNAS 9.3-RELEASE-p26.
2. Connected only Disk C to the system with 16 GB of ECC memory.
3. Ran a long SMART test: no errors, "Completed without error".
4. While zeroing Disk C with "dd if=/dev/zero of=/dev/da0 bs=1M", the following errors came up (not many: 9 different CDB values and 27 errors in total, meaning 3 retries per error):

Code:

(da0: mpt0:0:0:0): WRITE(10). CDB: 2a 00 29 43 c6 00 00 01 00 00
(da0: mpt0:0:0:0): CAM status: CCB request terminated by the host
(da0: mpt0:0:0:0): Retrying command
(da0: mpt0:0:0:0): WRITE(10). CDB: 2a 00 29 43 c6 00 00 01 00 00
(da0: mpt0:0:0:0): CAM status: CCB request terminated by the host
(da0: mpt0:0:0:0): SCSI status: Check Condition
(da0: mpt0:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
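For anyone wanting to tally such messages, here is a minimal Python sketch (not part of the original post) that counts how often each CDB appears in dmesg text and decodes a 10-byte READ/WRITE CDB into its LBA and block count. The helper names `count_cdbs` and `decode_rw10` are made up for illustration, and the byte offset assumes 512-byte logical sectors, which the thread does not confirm.

```python
import re
from collections import Counter

# Sample taken from the dmesg output above.
LOG = """\
(da0: mpt0:0:0:0): WRITE(10). CDB: 2a 00 29 43 c6 00 00 01 00 00
(da0: mpt0:0:0:0): CAM status: CCB request terminated by the host
(da0: mpt0:0:0:0): Retrying command
(da0: mpt0:0:0:0): WRITE(10). CDB: 2a 00 29 43 c6 00 00 01 00 00
(da0: mpt0:0:0:0): CAM status: CCB request terminated by the host
(da0: mpt0:0:0:0): SCSI status: Check Condition
(da0: mpt0:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
"""

# Matches lines such as "(da0: mpt0:0:0:0): WRITE(10). CDB: 2a 00 ..."
CDB_RE = re.compile(
    r"^\((\w+):[^)]*\): (\S+)\. CDB: ((?:[0-9a-fA-F]{2} ?)+)$"
)


def count_cdbs(dmesg_text):
    """Count occurrences of each (device, CDB) pair in dmesg text."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        m = CDB_RE.match(line.strip())
        if m:
            device, _command, cdb = m.groups()
            counts[(device, cdb.strip().lower())] += 1
    return counts


def decode_rw10(cdb_hex):
    """Decode a 10-byte READ(10)/WRITE(10) CDB given as hex bytes."""
    raw = bytes.fromhex(cdb_hex)
    if len(raw) != 10:
        raise ValueError("expected a 10-byte CDB")
    return {
        "opcode": raw[0],                         # 0x28 READ(10), 0x2a WRITE(10)
        "lba": int.from_bytes(raw[2:6], "big"),   # bytes 2-5: logical block address
        "blocks": int.from_bytes(raw[7:9], "big"),  # bytes 7-8: transfer length
    }


if __name__ == "__main__":
    for (device, cdb), n in count_cdbs(LOG).items():
        info = decode_rw10(cdb)
        # Assumption: 512-byte logical sectors.
        offset = info["lba"] * 512
        print(f"{device}: {n}x CDB {cdb} -> LBA {info['lba']:#x}, "
              f"{info['blocks']} blocks, byte offset {offset}")
```

Feeding it the full dmesg output would show whether the failing writes cluster around a few LBAs (which would point at the disk) or are spread across the whole device (which would point more towards the link or controller).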

I found that a similar problem was discussed (http://christopher-technicalmusings.../freebsd-scsi-sense-errors-did-you-check.html), but there it turned out to be bad cables. In my case the second box has completely different cables and motherboard, so I don't think the cables are the issue.

Could this be caused by disks capable of SATA 3 (6 Gb/s) running at 1.5 Gb/s (the Ultra 40 supports 1.5 Gb/s only)?

Any recommendations?

sq8ijk · Oct 18, 2015

FYI, I am trying the jumpers on the WD Black, as I just found a post about forcing WD drives to a lower speed with jumpers, which possibly solves the issue. I also enabled spread spectrum for SATA in the BIOS, which may help. Will report.

sq8ijk · Oct 18, 2015

Hi,

I think the problem is resolved.

The solution was to set the jumper that forces the WD Black to 3.0 Gb/s. Now "smartctl -a /dev/da0" reports "current: 3.0 Gb/s"; before it was 1.5 Gb/s.

The other change was to update the BIOS so that SSC (Spread Spectrum Clocking) could be enabled for SATA devices. I believe this mattered more than the speed in getting around the issues.

If the BIOS doesn't allow enabling SSC, there is a jumper on the WD Black to disable this option.

As for the backup solution, I will listen to the experts (you) and will always run 2 HDDs in a mirror.
The third one will not be part of the pool itself, and I won't be swapping the HDDs anymore.
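The link-speed check mentioned above can be scripted. This is a rough Python sketch (not from the original post) that pulls the "current: N Gb/s" value out of saved "smartctl -a" output. The function name `current_link_speed` and the sample line are assumptions for illustration; only the "current: 3.0 Gb/s" fragment is quoted in the thread.

```python
import re

# Hypothetical sample line; only the "current: 3.0 Gb/s" part is quoted above.
SAMPLE = "SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)"

CURRENT_RE = re.compile(r"current:\s*([0-9.]+)\s*Gb/s")


def current_link_speed(smartctl_output):
    """Return the negotiated link speed in Gb/s, or None if not reported."""
    m = CURRENT_RE.search(smartctl_output)
    return float(m.group(1)) if m else None


if __name__ == "__main__":
    print(current_link_speed(SAMPLE))
```

Running this against output captured before and after changing the jumper would make it easy to confirm that the negotiated speed actually changed.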

sq8ijk · Oct 18, 2015

Just to confirm:
The problem is gone ;)

I copied over 1 TB of data from Disk A without a single CAM/SCSI error message.

cyberjock · Oct 19, 2015

A tip... if you are having to enable SSC to have a stable system, you've got problems. You, as a course of business, shouldn't *have* to enable SSC to have a stable system. Especially on a system as small as yours.

No clue what is wrong though. Could be your chassis isn't making a good Faraday cage, your PSU, almost anything.
