Storage is OFFLINE after wrong disk replacement

D4nt3

Cadet
Joined
Sep 29, 2022
Messages
9
Hi, I'm new here, nice to meet you!
I made a big mistake and I'm in trouble. I had a faulted hard drive to replace, and this is what I did:
- Storage > Pools > settings icon of the faulted hard disk (ada2)
- Opened my server's case and, with the system still turned on(!), detached the cables of the wrong HDD, then attached them back
- Removed the correct (faulted) hard drive and replaced it
- Rebooted the system

This is the message I got:

[Screenshot: the error message]


Here my hard drive list:

[Screenshot: hard drive list]


My RAID layout: 8 HDDs, 6 for data and 2 for fault tolerance.

Thank you in advance for your help,

Dante
 

D4nt3

OK, I discovered that one cable of the wrongly detached hard disk wasn't properly attached... I fixed it and booted up the TrueNAS system; now I see these:

[Screenshots (2): pool status during resilvering]


It's resilvering... But one HDD is "unavail" and another one is "offline". What should I do? Wait for the resilvering to complete and then...?

Dante
 

D4nt3

Quoting an earlier reply: "What do you see from zpool import at the shell?"

Thank you for your reply! I wrote another post, but it had to wait for approval by the administrators. Should I run that command now? It's resilvering...

Dante
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
From what I can see, you just need to online the offline drive, and it should resilver any changes.

Then re-evaluate the situation. You have an unavailable drive, one drive with lots of checksum errors, and 2 other drives with one checksum error each. How you go forward depends on exactly why that unavailable drive is unavailable.

But restore your offline drive first to regain some redundancy. Lose any block now and you would have data loss. Lose any additional disk and you lose the pool completely.
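Arwen's warning is plain parity arithmetic. A minimal Python sketch, assuming the layout from the first post (8 disks, 2 for fault tolerance) is a single RAIDZ2 vdev and that every UNAVAIL or OFFLINE member counts as missing; the function names are hypothetical:

```python
def remaining_redundancy(parity: int, missing: int) -> int:
    """How many more whole disks can fail before the vdev (and pool) is lost.

    Assumes a single RAIDZ vdev with `parity` parity disks (2 for RAIDZ2)
    and `missing` members that are UNAVAIL, OFFLINE or otherwise absent.
    A negative result means the vdev is already faulted.
    """
    return parity - missing


def describe(parity: int, missing: int) -> str:
    left = remaining_redundancy(parity, missing)
    if left < 0:
        return "faulted: not enough members left to reconstruct data"
    if left == 0:
        return "no redundancy: a bad block cannot be rebuilt, one more lost disk loses the pool"
    return f"degraded: can survive {left} more disk failure(s)"


# Situation in the thread (assumed RAIDZ2): one UNAVAIL + one OFFLINE disk.
print(describe(parity=2, missing=2))
# Onlining the OFFLINE disk brings back one disk of redundancy.
print(describe(parity=2, missing=1))
```

The sketch deliberately ignores drives that are still ONLINE but throwing checksum errors (like ada2 here), so the real risk is worse than the disk count alone suggests.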
 

D4nt3

Holy... If I can get out of this situation, I will replace the 3 failing hard drives. By the way, I was never notified of these errors before.
I have now set the hard disk that was OFFLINE back to ONLINE, and it's stuck here:

[Screenshots (2): progress after onlining the disk]



I remain patient and wait.

Dante
 

D4nt3

[Screenshot: pool status]



It didn't work, I think... Did it stop the resilvering? It's still scanning, by the way...

Dante
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
ada2 is in serious trouble. 30629 checksum errors is, um, not good.
However, it's probably, possibly, maybe a faulty or badly seated cable. If these are SATA cables, note that they don't survive many insertions, so they may need replacing.

The single checksum errors on ada3 and ada0 are hopefully just due to you messing around, as it's only 1 each.

Also, please post your hardware as per the forum rules, in particular how the disks are connected to the computer (via what device, etc.).
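To see at a glance which disks are accumulating errors, the READ/WRITE/CKSUM columns of `zpool status` can be pulled out with a few lines of Python. This is a sketch under assumptions: it expects exact integer counts (OpenZFS may print large counts in abbreviated form unless `zpool status -p` is used), and the pool name `tank` and the device names in the sample are hypothetical, with the counts borrowed from this thread.

```python
from typing import NamedTuple


class DeviceErrors(NamedTuple):
    name: str
    state: str
    read: int
    write: int
    cksum: int


def parse_zpool_status(text: str) -> list[DeviceErrors]:
    """Extract per-device error counters from the config table of `zpool status`."""
    devices: list[DeviceErrors] = []
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True  # header row: NAME STATE READ WRITE CKSUM
            continue
        if not in_table:
            continue
        if not fields:
            if devices:
                break  # blank line after the rows ends the table
            continue
        # Rows: name state read write cksum [optional note, e.g. "was /dev/..."]
        if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            read, write, cksum = (int(f) for f in fields[2:5])
            devices.append(DeviceErrors(fields[0], fields[1], read, write, cksum))
    return devices


def devices_with_errors(devices: list[DeviceErrors]) -> list[DeviceErrors]:
    """Devices with any non-zero counter, worst checksum count first."""
    flagged = [d for d in devices if d.read or d.write or d.cksum]
    return sorted(flagged, key=lambda d: d.cksum, reverse=True)


# Hypothetical sample modelled on this thread (not real output from Dante's box).
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            ada0p2  ONLINE       0     0     1
            ada2p2  ONLINE       0     0 30629
            ada3p2  ONLINE       0     0     1
            ada4p2  ONLINE       0     0     0

errors: No known data errors
"""

for d in devices_with_errors(parse_zpool_status(SAMPLE)):
    print(f"{d.name}: {d.cksum} checksum errors ({d.state})")
```

Rows whose counters are not plain integers (for example an UNAVAIL entry showing a note instead of numbers) are simply skipped, which keeps the sketch conservative.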
 

D4nt3

Hi! I've updated my signature with my server's hardware. The hard drives are connected directly to the motherboard.
Resilvering/scanning has finished; this is the current status:

Pool

[Screenshot: pool status]


I think the two "/dev/gptid/*" entries are 2 hard drives the system doesn't recognize.


Disks

[Screenshot: disk list]



It recognizes only 7 HDDs out of 8, and one of those 7 shows "N/A" in the Pool column...

What should I do now? Do I have to remove the new hard drive and put the old one back?

Dante
 

Arwen

Are your "WD Red 3TB" drives SMR?

SMR drives are problematic with ZFS and strongly discouraged, especially Western Digital's. They can work, but odd problems can crop up.

Next, do you have a new hard drive inserted?

If so, then run the standard disk replacement procedure for the first disk listed as "/dev/gptid/". See the documentation for the procedure, but be careful, as any mistake can cause problems.
 

NugentS

You might also have a PSU problem as follows:
SATA power connectors are not really designed to power much more than a single HDD, and you are powering 4 from a single connector. If it were a Molex connector, I wouldn't be concerned. This is a maybe.

Do you have a backup? Your pool has issues and you need to get to the bottom of them before going much further.
Which disk is the SMR?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
If you lose another disk, your pool goes bye-bye, and ada2 doesn't look promising. After you've checked the cables and, possibly, regained some redundancy, run a long SMART test on ada2.
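For reference, the long test Davvo suggests is usually started with smartmontools' `smartctl -t long <device>`, and its result read back with `smartctl -l selftest <device>`. A small Python sketch that only builds and, optionally, runs those commands; the `/dev/ada2` path follows the thread's naming, and the dry-run default is a deliberate choice so nothing touches the disk until you opt in:

```python
import shlex
import subprocess


def smart_long_test_cmd(device: str) -> list[str]:
    # Ask the drive firmware to start an extended ("long") self-test.
    return ["smartctl", "-t", "long", device]


def smart_selftest_log_cmd(device: str) -> list[str]:
    # Print the drive's self-test log; check it once the long test has finished.
    return ["smartctl", "-l", "selftest", device]


def run(cmd: list[str], dry_run: bool = True) -> int:
    """Print the command; only execute it when dry_run is False.

    smartctl's exit status is a bit mask, so the code is returned
    rather than treated as an error (check=False).
    """
    print(shlex.join(cmd))
    if dry_run:
        return 0
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    run(smart_long_test_cmd("/dev/ada2"))     # dry run: prints the command only
    run(smart_selftest_log_cmd("/dev/ada2"))  # dry run: prints the command only
```

The long test runs inside the drive itself, so `smartctl -t long` returns right away and the drive keeps serving I/O while it works; a non-zero exit code from smartctl is not automatically a failure.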
 

D4nt3

There is only 1 SMR hard drive, and it's the new one (the last one inserted); the other 7 use CMR recording. I don't have any backup... And I have 2 "unavail" HDDs, as shown in the screenshot, and I don't know how to handle them :-/
Of those 2 "unavail" HDDs, one is the new HDD that replaced the failed one, and the other is the one I detached by mistake. The "failed" HDD wasn't really failed: I only had the error "error count increased from 0 to 1". That's why I'm wondering whether I should put it back and remove the new (SMR) drive.

Dante
 

D4nt3

Quoting an earlier reply: "Well, do create one then."

It's impossible right now; I can only transfer files at 800 KB/s to 2 MB/s. I don't have enough money right now to buy a new machine dedicated to backups. I will plan this in the future; for now I would like to rescue my pool...

I'd rather not run a SMART test; I'm scared of stressing the hard drives even more... If I have to replace another hard disk, it will start resilvering again, and the hard drives will be stressed a lot again... What would be the best move now?

Dante
 

Arwen

If the drive you incorrectly removed is back in and usable, use it as the replacement drive.

The new SMR drive should be sent back and replaced with a CMR drive.
 

D4nt3

That's the point: the drive I incorrectly removed is back, but it's one of the 2 "unavail" drives. Why is it unavail?

Dante
 

Davvo

Offline it and re-online it, and see if that helps. Make sure you do so with the correct drive.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Quoting D4nt3: "It's impossible now, actually I transfer files at the speed of 800 KB/s to 2 MB/s. I don't have enough money right now to buy a new machine dedicated to backup data."
May I suggest you do the following with high priority:
  • Prioritize your data into levels 1-3
  • Get (borrow or purchase) USB drives and/or cloud storage
  • Copy your data over, starting with the most important one
You are really playing Russian roulette otherwise.

Frankly, the argument about transfer speed is nonsense. If it takes a week to copy your precious data over, so be it. My initial cloud backup (as a tertiary off-site storage) took 6 months. So what? I don't mean to be rude. But please think with a calm head :)
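ChrisRJ's point is easy to put in numbers. A back-of-the-envelope Python sketch, assuming decimal units (1 KB = 1000 bytes) and a constant transfer rate; the 1 TB total and the tier sizes below are hypothetical, since the thread doesn't say how much data is on the pool:

```python
KB, MB, GB, TB = 10**3, 10**6, 10**9, 10**12
SECONDS_PER_DAY = 86_400


def copy_days(data_bytes: int, rate_bytes_per_s: float) -> float:
    """Days needed to copy data_bytes at a constant rate."""
    return data_bytes / rate_bytes_per_s / SECONDS_PER_DAY


def tier_finish_days(tier_sizes: list[int], rate_bytes_per_s: float) -> list[float]:
    """Cumulative day on which each priority tier is fully copied (tier 1 first)."""
    elapsed = 0.0
    finished = []
    for size in tier_sizes:
        elapsed += copy_days(size, rate_bytes_per_s)
        finished.append(elapsed)
    return finished


for label, rate in [("800 KB/s", 800 * KB), ("2 MB/s", 2 * MB)]:
    print(f"1 TB at {label}: {copy_days(1 * TB, rate):.1f} days")
# 1 TB at 800 KB/s: 14.5 days
# 1 TB at 2 MB/s: 5.8 days

# Hypothetical priority levels 1-3: 100 GB, 400 GB, 500 GB, at the slow rate.
print([round(d, 1) for d in tier_finish_days([100 * GB, 400 * GB, 500 * GB], 800 * KB)])
# [1.4, 7.2, 14.5]
```

Even at the worst quoted speed, the most important tier in this example is safe after about a day and a half, instead of waiting for the whole copy.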
 

Davvo

I will also add that the Resilver Priority Task is your friend.
Oh, and since your pool occupancy is (way) over 80%, you are seeing a performance drop. It will only get worse over time.
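The 80% figure is a common rule of thumb on the forums rather than a hard ZFS limit. A minimal Python sketch for checking it and estimating how much data would have to move off the pool; the 16 TB / 15 TB figures are hypothetical, and "used" and "size" stand for whatever the TrueNAS UI or `zpool list` (ALLOC and SIZE, with the ratio shown as CAP) report for the pool:

```python
TB = 10**12


def occupancy_pct(used_bytes: int, size_bytes: int) -> float:
    """Percentage of the pool that is allocated."""
    return 100.0 * used_bytes / size_bytes


def bytes_to_free(used_bytes: int, size_bytes: int, threshold_pct: int = 80) -> int:
    """Bytes to delete or move elsewhere to get back under threshold_pct (0 if already under)."""
    target = size_bytes * threshold_pct // 100  # integer math avoids float rounding
    return max(0, used_bytes - target)


used, size = 15 * TB, 16 * TB  # hypothetical pool
print(f"{occupancy_pct(used, size):.1f}% full, free {bytes_to_free(used, size) / TB:.1f} TB")
# 93.8% full, free 2.2 TB
```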
 

Arwen

Quoting D4nt3: "That's the point, the drive incorrectly removed is back but is one of the 2 'unavail' drives. Why is it unavail?"
If you removed the drive improperly, ZFS considers it gone (aka UNAVAIL). ZFS is a "live" file system, constantly changing, even if the only thing changing is last-access times. So any removed drive goes out of date rather quickly.

Now you HAVE to perform the replacement procedure, including resyncing all data onto this drive. Just be careful to perform the task on the correct drive.


No disrespect intended, but this is exactly why I sometimes suggest that prospective new users of TrueNAS either read up on ZFS features and abilities and TrueNAS procedures, or find more user-friendly NAS software.

ZFS, while I think it is a great piece of software, has quirks and oddities that other NAS file systems don't have. Thus, a learning curve, sometimes a steep learning curve.
 