FreeNAS v9.2.1 lu_disk_lbwrite() Error after crash

Status
Not open for further replies.

kev2018

Cadet
Joined: Feb 24, 2018
Messages: 8
Hello All,

So I had a copy in progress on my iSCSI-mounted volume. Suddenly the performance dropped sharply and then traffic stopped.

This was a Windows-mounted iSCSI volume.

Things got so bad that I had to restart everything.

Now I'm consistently getting ***ERROR*** lu_disk_lbwrite() failed on the FreeNAS console, and it's filling up the /var/log/messages file.

I have checked the RAID volume in the volume manager and from the command line by running:

Code:
zpool status -v

It shows no errors.

No S.M.A.R.T. errors so far either, but I don't know the command to run a test manually.
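Update: I've since found that a manual test can be kicked off from the shell with smartctl. A minimal sketch, assuming the same ada device names I use later in this thread:

Code:
# Start a short self-test (runs in the drive's firmware, in the background)
smartctl -t short /dev/ada0
# Or a full-surface test (takes hours on big drives)
smartctl -t long /dev/ada0
# Read the results once it finishes
smartctl -l selftest /dev/ada0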

I am able to access the volume from Windows and it mounts fine, but when I try to write to the drive it acts up or the write does not complete.

I also got one error about the volume being 98% full, but that doesn't make sense considering it is a 5 TB volume and Windows is reporting over 1.7 TB of remaining space.

Is there any way to fix this volume or issue?

One thought I had was to check whether the USB drive I'm running FreeNAS from is full. Which files should I check to make sure this isn't happening? The messages log keeps filling up.
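A simple way to check from the shell (standard FreeBSD tools, nothing FreeNAS-specific):

Code:
# Show how full each mounted filesystem is, including the boot device
df -h
# See which log files are eating the space
du -sh /var/log/*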

I also tried running a Windows disk check; I was able to run an offline check and repair as well, but it isn't telling me anything.

The only thing left is to try unmounting the iSCSI volume from Windows and re-adding it.

Help appreciated.
 

kev2018

So I've discovered that every time I try to reconnect to this iSCSI target, Windows keeps trying to write to the volume and the errors reappear.
Windows is giving me:
Code:
The IO operation at logical block address 0x3771b681 for Disk 4 (PDO name: \Device\000000b7) was retried.

in the event log.

How do I kill this operation? The system has already been rebooted several times.

Is there a way to cancel all active writes or sessions from the Windows side?

It appears that a copy/write operation was stopped halfway through a large batch of files that were being copied over from another iSCSI SAN to the FreeNAS. This appears to still be hitting the target even though both hosts have been rebooted. I disconnected the session in Windows, but as soon as I reboot and reconnect the iSCSI target the problem starts happening again. Any commands or ideas on how to flush this failed write buffer? Rebooting doesn't seem to help for some reason. This is definitely a new one for me.
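One idea on the FreeNAS side (a sketch only, not something I've verified: istgt is the iSCSI target daemon on FreeNAS 9.x, and restarting it drops every initiator session, so disconnect clients first):

Code:
# Restart the iSCSI target daemon to tear down all active sessions
service istgt restart
# (use 'service istgt onerestart' if the rc system refuses because
# the service isn't flagged as enabled in rc.conf)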
 

kev2018

An update.

So the FreeNAS system continues to spit out these errors in the messages log and on the console:
Code:
[My Volume name] istgt[2759]: istgt_lu_disk.c:[**different number each time**]:istgt_lu_disk_execute: ***ERROR*** lu_disk_lbwrite() failed

Logs are filling up like crazy...

I have to keep running

Code:
cat /dev/null > /var/log/messages

to clear them (and any new message files it creates).
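A slightly tidier way to do the same thing (a sketch; newsyslog is FreeBSD's standard log rotator):

Code:
# Truncate the live log in place
: > /var/log/messages
# Force an immediate rotation of everything listed in /etc/newsyslog.conf
newsyslog -F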

So I did some more investigating.

Currently I'm running a manual scrub operation from the shell to see if there are any disk errors:

Code:
zpool scrub <name of your pool>
---------------------------------------------
After that I started looking at the SMART data from my drives, to check whether this is one of those cases where a drive is failing but even SMART doesn't know it's bad yet (I really hate those).

So I went into the shell and ran a few commands to check:

> zpool status - shows the current pool status; also good for watching progress during a scrub operation.
> smartctl -a /dev/ada0 - gives a detailed view of the drive you want to look at (ada = my SATA drives 1-4, which actually show up as ada0-ada3).

I ran that for each of the others too: ada1, ada2, ada3.

Found that all drives report SMART test = PASSED.

However

ada1 shows some uncorrectable errors (SMART attribute 187) in the attribute table the command spits out. The error log only lists the last 5 errors, but it turns out the drive has 44 of them.

I also checked the following SMART attributes:
5 - Reallocated_Sector_Count
187 - Reported_Uncorrectable_Errors
188 - Command_Timeout
197 - Current_Pending_Sector_Count
198 - Offline_Uncorrectable

For all of these, any raw value other than 0 is bad news.
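A quick way to eyeball just those attributes across all drives at once (a sketch; adjust the device list to your hardware):

Code:
# Print only the worrying SMART attributes for each SATA drive
for d in ada0 ada1 ada2 ada3; do
  echo "== /dev/$d =="
  smartctl -A /dev/$d | awk '$1==5 || $1==187 || $1==188 || $1==197 || $1==198'
done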

All the rest of my drives are fine.

So.....

I need to manually replace /dev/ada1 as soon as the zpool scrub is complete, then see if I get any more errors.

I'll keep you all posted!
 

kev2018

Another clue...
So ZFS file systems and volumes are new to me. I've just read that filling them past 80% is not a good idea, and that ideally you shouldn't even go past 50%. What genius came up with this idea? That doesn't sound like great filesystem overhead...

I'm running a ZFS volume with a Windows NTFS GPT volume on top of it.
It is a 5 TB volume: 4 x 2 TB drives in RAIDZ (RAID 5 for us normal folks).
Currently filled to 70% (as of recently).

I tried canceling my zpool scrub but got this message:
Code:
cannot cancel scrubbing FREENAS-ARCHIVE-VOLUME: out of space

How can it be out of space? Windows reports over 1.57 TB of available space.
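For comparison, these commands show what ZFS itself thinks of the space situation (standard commands; the pool name here is mine):

Code:
# Raw pool capacity, allocated and free space
zpool list FREENAS-ARCHIVE-VOLUME
# Per-dataset/zvol usage as ZFS accounts for it
zfs list -o name,used,available,referenced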

Everything checks out when running zpool status, and all volumes are healthy and online.

I can only conclude that there is a disconnect somewhere, or that the one bad hard drive is causing some kind of temporary issue.

I'll know more once I take the drive with the SMART errors offline from the pool.

I'll update you soon.
 

kev2018

One more quick thing...

Error messages in the GUI showed this as one of the errors...
Code:
The capacity for the volume 'FREENAS-ARCHIVE-VOLUME' is currently at 98%, while the recommended value is below 80%.

Clearly Windows Disk Management is either not reporting the size properly, or there is a serious disconnect between what Windows sees and what the ZFS volume manager sees...
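One way to dig into the ZFS side of that disconnect from the shell (a sketch; POOLNAME/zvolname is a placeholder for your actual zvol path):

Code:
# How much space the zvol itself occupies and reserves in the pool
zfs get used,referenced,volsize,refreservation POOLNAME/zvolname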

Is it possible this is a bug in version 9.2.1.9 (x86) of FreeNAS?

I don't think I can upgrade to the x64 build, because the hardware I'm running FreeNAS on isn't 64-bit (at least I don't think it is). It's roughly 13-year-old hardware (from 2005)...
 

kev2018

Another update, and a tutorial on how to replace a drive that hasn't failed yet but clearly has issues:

NOTE: This has not solved my lu_disk_write errors yet; the investigation is still ongoing.
-----------------------------------------------------------------

So I successfully completed the scrub of the Z volume. No errors.
**Apparently when you have a Windows-mounted iSCSI volume, the system won't let you cancel a scrub session because the zvol has taken up all the space. Seems weird, but that's probably the reason I couldn't cancel mine before.**

I then proceeded to replace the drive which had SMART errors but was still labeled PASSED by SMART.

The following commands were used, in order:

1. First, show the status of the pool:

Code:
zpool status -v

2. Check each SATA drive to get its serial number, so you can figure out which physical drive sits on which SATA device (numbering starts at 0; in my case I have 4 physical drives, so ada0 to ada3). Record which serial number belongs to which ada#:

Code:
smartctl -a /dev/ada[replace with your # - usually 0-3 depending on how many drives you have] | grep ^Serial

3. Read through each drive's SMART error log to see if there are any problems, checking the SMART attributes I posted earlier. Even if the drive says PASSED on the SMART test, if there are any errors or issues, consider replacing that drive:

Code:
smartctl -a /dev/ada1

(as an example)

4. Record the drive that potentially has issues or errors, then list the gptid of each drive in the Z volume alongside its corresponding /dev/ada# SATA device (minus the p2 or whatever at the end):

Code:
glabel status

5. Copy into a notepad the gptid of the problem disk from the left-hand column, along with the ada# of the drive and the serial number of the physical hard disk.

6. Make sure no volume operations or scrubbing are taking place. On slower hardware I recommend taking the unit completely offline and having no users write to it while you do the next bit.

7. Take the drive offline from the zpool (safer than just yanking a drive):

Code:
zpool offline [name of your volume] [gptid of the drive you want offline]

8. Shut down the server.

9. Physically check the hard drives and replace the physical drive whose serial number matches the ada# and gptid that was at fault.
Optional: label your SATA connectors and the hard drives themselves (I label mine 1-4, and a replaced drive gets an R, e.g. 4R; if I replace it again it becomes 4R2 to show it has been replaced twice). Make sure not to accidentally reattach the wrong one if you pulled out more than one!

10. Power the server back on.

11. Run the replace command for the new drive, copying back in the old gptid from the failed drive and the same ada# port you plugged the new drive into. The resilvering process will then start:

Code:
zpool replace -f <zpool volume name> <gptid of the previously failed drive> <new device on the same SATA port>
Example: zpool replace -f POOLNAME gptid/3f28db65-94a6-11e5-a20b-6805ca01c4d8 /dev/ada1

12. Check the zpool status to see the current progress of the resilvering (there's a condensed recap of the whole sequence after this list):

Code:
zpool status

**Users can access the system throughout this process, but performance will be slower.
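Condensed recap of the sequence above (the pool name and gptid are just the example values from step 11; substitute your own):

Code:
zpool status -v                  # check pool health first
glabel status                    # map gptids to ada devices
zpool offline POOLNAME gptid/3f28db65-94a6-11e5-a20b-6805ca01c4d8
# shut down, swap the physical drive, power back on, then:
zpool replace -f POOLNAME gptid/3f28db65-94a6-11e5-a20b-6805ca01c4d8 /dev/ada1
zpool status                     # watch the resilver progress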

I am currently in the middle of a resilvering process myself. I will let you know when it is complete.

Note: The resilvering process takes a VERY long time compared to a scrub operation. I have a 7 TB volume and it's taking approx. 34 hours on an old Pentium D SATA 3 machine. Before that, the scrub took about 26 hours. Your mileage may vary greatly.

Note: If I can't find the cause of the error, I plan to try upgrading to FreeNAS version 11 to see if the newer version works better for me.

I'll keep you posted!
 

kev2018

The console is still showing two errors:

Code:
[the current date & time] VOLUMENAME istgt[2752]: istgt_lu_disk.c:[source line number, different each time]:istgt_lu_disk_execute: ***ERROR*** lu_disk_lbwrite() failed


and

Code:
[the current date & time] VOLUMENAME istgt[2752]: istgt_lu_disk.c:[source line number, different each time]:istgt_lu_disk_lbwrite: ***ERROR*** lu_disk_write() failed


Any thoughts?
 

kev2018

Okay an update.

So I performed the upgrade to FreeNAS version 11.1. The upgrade was 100% successful!

However, it did not fix my volume problem.

The system kept thinking that the volume was full, and thus I could not even upgrade the volume to version 11.1 (it wouldn't allow me to add the upgrade flags).

So something went really, really wrong with the volume, most likely while that partially failed SMART drive was corrupting data... I don't actually know.

The good news is that the data was still accessible for everything that was not being written at the time of the original crash, so I was able to recover and copy back over the important stuff before I upgraded to version 11.1.

I've now deleted the volume and zvol info entirely, created a brand-new one under version 11.1, and reattached it to my iSCSI settings.

Things are now working smoothly. I am also enjoying the new "Sparse" option for my zvols, because it will let me increase my volume size later on when I replace my drives with bigger ones. For now I have it set to the maximum size available under my ZFS volume (in my case, 5.1 TB available out of a total of 7.1 TB).
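For reference, the shell equivalent of that option looks something like this (a sketch; the pool and zvol names are placeholders):

Code:
# Create a sparse (thin-provisioned) 5 TB zvol; space is only
# allocated in the pool as data is actually written
zfs create -s -V 5T POOLNAME/zvolname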

The other thing I noticed is speed!! The new volume is about 150 to 200% faster on average than before, on the same hardware! That's a welcome change. It's possible the old volume had some form of compression on; I'm not sure.
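If you want to check that on an existing pool (standard ZFS properties):

Code:
# Show whether compression is enabled and the ratio actually achieved
zfs get compression,compressratio POOLNAME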

So the bottom line is: if you get the errors I had above, your volume is toast. Move your good data off if you can, then upgrade and create a new zvol and volume.

Hope that helps someone. It's been a hell of a process to go through, but I finally have answers.
 