Critical error in "freenas-boot" immediately after installation

RefereeBeau

Cadet
Joined
Aug 26, 2020
Messages
9
Short version: Immediately after installing 11.1-U7, I boot and run "zpool scrub freenas-boot", which finds errors. Repeatable across multiple reinstalls. Boot SSD verified good.

Not clear where to look next. Help/pointers appreciated.

Longer description:

After installing 11.1-U7, upgrading to 11.3-U4, creating mirrored pool for backups from clients, getting SMB config and ACLs right so that client backups ran smoothly, I noticed a critical alert on the web console. The alert reported a degraded boot pool. Web searches generally said some variant on "bad boot hardware, replace the USB and reinstall".

The server was installed with a brand new WD internal SSD as the boot drive. No USB. "dmesg" showed no errors related to any disk or SSD.

First step: backed up the config, then reinstalled 11.1-U7, booted and scrubbed. Multiple errors reported. Repeated several times.

Second step: boot and run WD diagnostics from a bootCD. Zero errors on quick SMART test, extended SMART test, and writing 0's to entire SSD. Drive appears to be perfect.

Third step: download 11.1-U7 ISO again. Binary compare with original. No corruption in ISO image. Burn new installation DVD just in case.
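
For anyone repeating the ISO check from the third step, here is a minimal sketch of a binary comparison in Python. It assumes both downloads are ordinary files on disk; the function names and the file names in the usage comment are illustrative only, not anything shipped with FreeNAS.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_image(first, second):
    """True when both files have the same size and the same SHA-256 digest."""
    first, second = Path(first), Path(second)
    if first.stat().st_size != second.stat().st_size:
        return False
    return sha256_of(first) == sha256_of(second)


# Hypothetical usage (file names are placeholders):
# same_image("FreeNAS-11.1-U7.iso", "FreeNAS-11.1-U7-redownload.iso")
```

If the download page lists a SHA-256 checksum for the image, comparing sha256_of() against that value is a stronger check than comparing two downloads with each other, since both downloads could be damaged in the same way.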

Fourth step: install 11.1-U7 from DVD again. Final message was (again) "Installation completed. No errors reported."

Fifth step: Booted FreeNAS. No messages during the boot indicating disk errors. Immediately started shell on console and ran "zpool scrub freenas-boot" then "zpool status". Again, the same errors!

status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
...
scan: scrub repaired 3.50K in 0 days 00:00:04 with 14 errors on <date>
...
errors: 9 data errors, use "-v" for a list

and "zpool status -v" showed:

<metadata>:<0x24>
<metadata>:<0x3f>
freenas-boot/ROOT/default:<0x0>
//
freenas-boot/ROOT/default@2020-09-06-00:56:07:<0x0>

plus four files in the python3.6 "__pycache__" directory.

The errors are not exactly the same each time I redo the install, and the error count fluctuates slightly (11, 14, 12, ...). But so far the files affected are always related to python.
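
To compare the error lists between installs, a small parser for the "zpool status -v" listing can help. This is only a sketch: it assumes the listing has been saved as text, the two .pyc paths in the sample are made-up placeholders (the real file names are not reproduced in this post), and the grouping rules are my own simplification rather than anything defined by ZFS.

```python
import re

# Entries such as "<metadata>:<0x24>" or "pool/dataset:<0x0>"
OBJECT_RE = re.compile(r"^(?P<dataset>.+):<0x[0-9a-fA-F]+>$")


def classify_errors(listing):
    """Group error entries into metadata objects, dataset objects and file paths."""
    groups = {"metadata": [], "object": [], "file": []}
    for raw in listing.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = OBJECT_RE.match(line)
        if match and match.group("dataset") == "<metadata>":
            groups["metadata"].append(line)
        elif match:
            groups["object"].append(line)
        elif line.startswith("/"):
            groups["file"].append(line)
    return groups


def python_related(paths):
    """Return the paths that look like they belong to a Python installation."""
    return [p for p in paths if "python" in p or "__pycache__" in p]


# First three entries are from the output above; the two paths are hypothetical.
SAMPLE = """
<metadata>:<0x24>
<metadata>:<0x3f>
freenas-boot/ROOT/default:<0x0>
/usr/local/lib/python3.6/__pycache__/example_a.cpython-36.pyc
/usr/local/lib/python3.6/__pycache__/example_b.cpython-36.pyc
"""

groups = classify_errors(SAMPLE)
print({kind: len(entries) for kind, entries in groups.items()})
print(python_related(groups["file"]))
```

Running this against the listing saved after each reinstall makes it easy to see whether the damaged files really are always Python ones.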

I'm stumped. Help? Suggestions for next steps to debug (or for a fix)?

Thanks.
 

RefereeBeau

Cadet
Joined
Aug 26, 2020
Messages
9
Another clue (?): From the web admin after the most recent clean install, "System->Update->Verify Install" raises this error:


Request Method: POST
Request URL:
Software Version: FreeNAS-11.1-U7 (b45bfcf29)
Exception Type: MiddlewareError
Exception Value: [MiddlewareError: [Errno 5] Input/output error]
Exception Location: ./freenasUI/system/views.py in update_verify, line 1781
Server time: Sat, 5 Sep 2020 19:14:00 -0700

Traceback


Environment:

Software Version: FreeNAS-11.1-U7 (b45bfcf29)
Request Method: POST
Request URL: http://jamesgang-nas/system/update/verify/

Traceback:

File "/usr/local/lib/python3.6/site-packages/django/core/handlers/exception.py" in inner
  42. response = get_response(request)

File "/usr/local/lib/python3.6/site-packages/django/core/handlers/base.py" in _legacy_get_response
  249. response = self._get_response(request)

File "/usr/local/lib/python3.6/site-packages/django/core/handlers/base.py" in _get_response
  178. response = middleware_method(request, callback, callback_args, callback_kwargs)

File "./freenasUI/freeadmin/middleware.py" in process_view
  162. return login_required(view_func)(request, *view_args, **view_kwargs)

File "/usr/local/lib/python3.6/site-packages/django/contrib/auth/decorators.py" in _wrapped_view
  23. return view_func(request, *args, **kwargs)

File "./freenasUI/system/views.py" in update_verify
  1781. raise MiddlewareError(handler.error)

Exception Type: MiddlewareError at /system/update/verify/
Exception Value: [MiddlewareError: [Errno 5] Input/output error]


Request information
GET
No GET data
POST

Variable    Value
__form_id   'form_str'
FILES
No FILES data
COOKIES

Variable                 Value
csrftoken                '********'
fntreeSaveStateCookie    'root%2Croot%2F8%2Croot%2F59%2Croot%2F59%2F60'
sessionid                'nk9m1u6g197qxz4lisr0bbzz67t6xlp8'
META

Variable    Value
 

RefereeBeau

Cadet
Joined
Aug 26, 2020
Messages
9
And on the off chance that doing the upgrade to 11.3 STABLE might replace and thus repair the damage, I ran that update (as I had done originally).

No such luck.

The update did run to completion and the system rebooted. The first reboot appears to have run several python scripts that generated a lot of console output, then the system rebooted again and came up saying that it is now running 11.3.

Logging in on the web admin GUI shows the same alert:

CRITICAL
Boot pool status is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected..
Sat, 5 Sep 2020 07:30:50 PM (America/Los_Angeles)

and "zpool status -v" shows the same errors as before the upgrade. All related to python.

Uploaded the config that I had saved at the beginning of this whole debug process. Ran a test backup from one client, which succeeded.

I appear to be in the same state: the backup server is working fine for backups to the backup pool, but it's still quite unhappy with the boot pool.

Again, any pointers to what the heck is really wrong (and/or the best way to fix it) will be very much appreciated. Thanks!
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Is there any particular reason you are using version 11.1-U7?

I've had good luck with 11.2-U8 -- I run it on servers both at work and here at home.
 

RefereeBeau

Cadet
Joined
Aug 26, 2020
Messages
9
11.1-U7 is the last version that used the Grub boot manager, which supports booting from CD. This particular system does not support booting from USB. It's old hardware but has been in use for years as a backup server, running a Fedora release. But the old boot hard disk failed. Because I needed to reinstall everything anyway, I installed an SSD boot disk and am trying to switch to FreeNAS.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
11.1-U7 is the last version that used the Grub boot manager, which supports booting from CD. This particular system does not support booting from USB. It's old hardware but has been in use for years as a backup server, running a Fedora release. But the old boot hard disk failed. Because I needed to reinstall everything anyway, I installed an SSD boot disk and am trying to switch to FreeNAS.
Ah! Makes sense.
 

anmnz

Patron
Joined
Feb 17, 2018
Messages
286
The server was installed with a brand new WD internal SSD as the boot drive.
What model? It pays to be specific about hardware around here.

Your problem sounds like the one described in the following links, and at least one of the reports concerns a "WD Green" SSD, so it may be worth trying a different SSD.

https://jira.ixsystems.com/browse/NAS-100276

https://forums.freenas.org/index.php?threads/freenas-installer-sandisk-ssd-checksum-errors.64049/

https://forums.freenas.org/index.php?threads/transcend-ssd-boot-disk-zfs-checksum-errors.64321/

https://forums.freenas.org/index.php?threads/boot-disk.63829/
 

RefereeBeau

Cadet
Joined
Aug 26, 2020
Messages
9
The server was installed with a brand new WD internal SSD as the boot drive.


What model? It pays to be specific about hardware around here.
Sorry: it's a WD Green - WDS240G2G0A. Ordered/received/installed in the last week of August.

camcontrol: sending ATA READ_NATIVE_MAX_ADDRESS48 with timeout of 5000 msecs
camcontrol: sending ATA GET_NATIVE_MAX_ADDRESS_EXT with timeout of 30000 msecs
pass0: <WDC WDS240G2G0A-00JH30 UF510000> ACS-2 ATA SATA 3.x device
pass0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 512bytes)

protocol ACS-2 ATA SATA 3.x
device model WDC WDS240G2G0A-00JH30
firmware revision UF510000

Thank you for those references. They do describe very similar issues.

Ordering another new SSD now.
 

RefereeBeau

Cadet
Joined
Aug 26, 2020
Messages
9
Replaced WD-Green with Kingston A400:
root@jamesgang-nas:~ # camcontrol identify /dev/ada0 | head
pass0: <KINGSTON SA400S37120G 03150002> ACS-3 ATA SATA 3.x device
pass0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 512bytes)

protocol ACS-3 ATA SATA 3.x
device model KINGSTON SA400S37120G
firmware revision 03150002
serial number 50026B76837CA8D8
WWN 502b2a201d1c1b1a
Installed 11.1-U7, scrubbed boot pool, upgraded to 11.3-U4, scrubbed boot pool, restored saved config, scrubbed boot pool and backup pool. Zero errors so far. Keeping my fingers crossed ...
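
Since the same scrub gets checked after every step, a small helper that reads the "zpool status" text and reports whether the scrub came back clean may be handy. A sketch only: it assumes the status text was captured to a string, it matches the "scan:" and "errors:" line formats quoted earlier in this thread, and the GOOD sample is a made-up clean result, not output copied from this server.

```python
import re

SCAN_RE = re.compile(
    r"scan:\s+scrub repaired (?P<repaired>\S+) in (?P<duration>.+?) "
    r"with (?P<errors>\d+) errors"
)
ERRORS_RE = re.compile(r"^\s*errors:\s*(?P<summary>.+)$", re.MULTILINE)


def scrub_summary(status_text):
    """Extract the scrub result from zpool status text, or None if no scrub line is found."""
    scan = SCAN_RE.search(status_text)
    if scan is None:
        return None
    errors = ERRORS_RE.search(status_text)
    return {
        "repaired": scan.group("repaired"),
        "scrub_errors": int(scan.group("errors")),
        "errors_line": errors.group("summary").strip() if errors else None,
    }


def is_clean(summary):
    """True only when the scrub found no errors and the pool reports no known data errors."""
    return (
        summary is not None
        and summary["scrub_errors"] == 0
        and summary["errors_line"] == "No known data errors"
    )


# BAD uses the lines quoted earlier in the thread; GOOD is a hypothetical clean result.
BAD = """\
scan: scrub repaired 3.50K in 0 days 00:00:04 with 14 errors on <date>
errors: 9 data errors, use "-v" for a list
"""
GOOD = """\
scan: scrub repaired 0 in 0 days 00:00:04 with 0 errors on <date>
errors: No known data errors
"""

print(is_clean(scrub_summary(BAD)))
print(is_clean(scrub_summary(GOOD)))
```

Both conditions are checked on purpose: the error count on the scan line and the count on the "errors:" line are reported separately and did not match in the output from the WD Green.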

It does look like, with the WD-Green SSD boot drive, I ran into bug #35065.

The final comment in:
https://jira.ixsystems.com/browse/NAS-100276
says:
Alexander Motin added a comment - 31/Jul/20 10:22 PM

It seems the affected SSDs left the wide market before I got to this,
Apparently not quite. Mine just came from Amazon.

Thanks again for the quick and very helpful pointers.
 