[build] SSD raidz2 20 partitions linux

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
Hi!

It's not a typo - I really mean twenty. And of course it is not a NAS.

The reason for such a crazy setup is that my SSD had h2testw errors. That means the SSD controller (or some other component) failed to detect the errors and passed corrupted data to the OS as if it were correct.

My employer bought a few new desktop PCs as a replacement for our old laptops. The PCs are equipped with an SSD + an HDD. My PC turned out to have both disks faulty, so I asked our IT department to RMA them. They did, but the PC came back with only the HDD replaced; the SSD is still the same one, and I have to use it. This is my reason for ZFS.

h2testw has since stopped failing, but I don't know why - the bad sectors might (I guess) have been temporarily remapped somewhere. I also had an incident with the SSD: it overheated once for a short while. The HDD was not affected by the overheating.

General specifications:
CPU: i7-7700
Motherboard: ASUS H110M
RAM: 16GB Crucial CT16G4DFD824A
GPU: MSI GeForce 710 1GB
SSD: ADATA ULTIMATE SU800 512GB
HDD: Toshiba 2TB DT01ACA200
DVD-RW
OS: Fedora 28
Code:
$ uname -r
4.17.19-200.fc28.x86_64


Before the build I read:
https://github.com/zfsonlinux/zfs/wiki/Fedora
https://github.com/zfsonlinux/zfs/wiki/Ubuntu-18.04-Root-on-ZFS
https://rudd-o.com/linux-and-free-software/installing-fedora-on-top-of-zfs
https://www.csparks.com/BootFedoraZFS/index.md

And then I roughly followed the last one (https://www.csparks.com/BootFedoraZFS/index.md)

EDIT: My partitions
Code:
$ lsblk
NAME		 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0		  7:0	0	40G  0 loop
└─veracrypt1 253:0	0	40G  0 dm   /media/veracrypt1
sda			8:0	0   477G  0 disk
├─sda1		 8:1	0   100M  0 part
├─sda2		 8:2	0	16M  0 part
├─sda3		 8:3	0   512M  0 part /boot/efi
├─sda4		 8:4	0   900M  0 part
├─sda11		8:11   0  23.8G  0 part
├─sda12		8:12   0  23.8G  0 part
├─sda13		8:13   0  23.8G  0 part
├─sda14		8:14   0  23.8G  0 part
├─sda15		8:15   0  23.8G  0 part
├─sda16	  259:0	0  23.8G  0 part
├─sda17	  259:1	0  23.8G  0 part
├─sda18	  259:2	0  23.8G  0 part
├─sda19	  259:3	0  23.8G  0 part
├─sda20	  259:4	0  23.8G  0 part
├─sda21	  259:5	0  23.8G  0 part
├─sda22	  259:6	0  23.8G  0 part
├─sda23	  259:7	0  23.8G  0 part
├─sda24	  259:8	0  23.8G  0 part
├─sda25	  259:9	0  23.8G  0 part
├─sda26	  259:10   0  23.8G  0 part
├─sda27	  259:11   0  23.8G  0 part
├─sda28	  259:12   0  23.8G  0 part
├─sda29	  259:13   0  23.8G  0 part
└─sda30	  259:14   0  23.8G  0 part
sdb			8:16   0   1.8T  0 disk
├─sdb1		 8:17   0   300M  0 part
├─sdb2		 8:18   0	65G  0 part [SWAP]
├─sdb3		 8:19   0	65G  0 part
├─sdb4		 8:20   0	50G  0 part
├─sdb5		 8:21   0	 1G  0 part
├─sdb6		 8:22   0   840G  0 part /run/media/jadis/linuxhome
└─sdb7		 8:23   0 841.7G  0 part
sr0		   11:0	1  1024M  0 rom
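
For anyone curious, the twenty equal data partitions can be created with a loop roughly like this (only a sketch, assuming sgdisk and the common BF01 Solaris/ZFS type code; the size below is rounded, the real slices came out at about 23.8G each):
Code:
# sketch: carve 20 equal ZFS partitions (numbers 11-30) out of the remaining SSD space
for i in $(seq 11 30); do
    sgdisk --new=${i}:0:+23G --typecode=${i}:BF01 /dev/sda
done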

My pool
Code:
$ zpool list
NAME	   SIZE  ALLOC   FREE  EXPANDSZ   FRAG	CAP  DEDUP  HEALTH  ALTROOT
narniarp   472G  17.1G   455G		 -	 0%	 3%  1.00x  ONLINE  -

Code:
$ zpool status
  pool: narniarp
 state: ONLINE
  scan: scrub repaired 0B in 0h0m with 0 errors on Tue Sep 11 16:16:35 2018
config:

	NAME													 STATE	 READ WRITE CKSUM
	narniarp												 ONLINE	   0	 0	 0
	  raidz2-0											   ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part11  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part12  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part13  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part14  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part15  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part16  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part17  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part18  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part19  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part20  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part21  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part22  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part23  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part24  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part25  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part26  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part27  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part28  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part29  ONLINE	   0	 0	 0
		/dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part30  ONLINE	   0	 0	 0

errors: No known data errors

Code:
$ zfs list
NAME				   USED  AVAIL  REFER  MOUNTPOINT
narniarp			  15.2G   391G   256K  none
narniarp/ROOT		 11.6G   391G   256K  none
narniarp/ROOT/fedora  11.6G   391G  9.85G  /
narniarp/home		 1.48G   391G  1.30G  /home
narniarp/home/root	 714K   391G   448K  /root
narniarp/srv		   256K   391G   256K  /srv
narniarp/var		  2.06G   391G   256K  none
narniarp/var/cache	1.47G   391G   838M  /var/cache
narniarp/var/log	   127M   391G  58.9M  legacy
narniarp/var/spool	 243M   391G   215M  /var/spool
narniarp/var/tmp	   234M   391G   170M  legacy

My datasets
Code:
$ zfs get -s local all narniarp
NAME	  PROPERTY			  VALUE				  SOURCE
narniarp  mountpoint			none				   local
narniarp  atime				 off					local
narniarp  aclinherit			passthrough			local
narniarp  canmount			  off					local

$ zfs get -s local all narniarp/ROOT
NAME		   PROPERTY			  VALUE				  SOURCE
narniarp/ROOT  atime				 off					local

$ zfs get -s local all narniarp/ROOT/fedora
NAME				  PROPERTY			  VALUE				  SOURCE
narniarp/ROOT/fedora  mountpoint			/					  local
narniarp/ROOT/fedora  atime				 off					local

$ zfs get -s local all narniarp/home
NAME		   PROPERTY			  VALUE				  SOURCE
narniarp/home  mountpoint			/home				  local
narniarp/home  compression		   lz4					local
narniarp/home  atime				 off					local
narniarp/home  setuid				off					local

$ zfs get -s local all narniarp/home/root
NAME				PROPERTY			  VALUE				  SOURCE
narniarp/home/root  mountpoint			/root				  local
narniarp/home/root  atime				 off					local

$ zfs get -s local all narniarp/srv
NAME		  PROPERTY			  VALUE				  SOURCE
narniarp/srv  mountpoint			/srv				   local
narniarp/srv  compression		   lz4					local
narniarp/srv  atime				 off					local

$ zfs get -s local all narniarp/var
NAME		  PROPERTY			  VALUE				  SOURCE
narniarp/var  compression		   lz4					local
narniarp/var  atime				 off					local
narniarp/var  exec				  off					local
narniarp/var  setuid				off					local
narniarp/var  canmount			  off					local

$ zfs get -s local all narniarp/var/cache
NAME				PROPERTY			   VALUE				  SOURCE
narniarp/var/cache  mountpoint			 /var/cache			 local
narniarp/var/cache  compression			lz4					local
narniarp/var/cache  atime				  off					local
narniarp/var/cache  com.sun:auto-snapshot  false				  local

$ zfs get -s local all narniarp/var/log
NAME			  PROPERTY			  VALUE				  SOURCE
narniarp/var/log  mountpoint			legacy				 local
narniarp/var/log  compression		   lz4					local
narniarp/var/log  atime				 off					local
narniarp/var/log  xattr				 sa					 local
narniarp/var/log  acltype			   posixacl			   local

$ zfs get -s local all narniarp/var/spool
NAME				PROPERTY			  VALUE				  SOURCE
narniarp/var/spool  mountpoint			/var/spool			 local
narniarp/var/spool  compression		   lz4					local
narniarp/var/spool  atime				 off					local

$ zfs get -s local all narniarp/var/tmp
NAME			  PROPERTY			   VALUE				  SOURCE
narniarp/var/tmp  mountpoint			 legacy				 local
narniarp/var/tmp  compression			lz4					local
narniarp/var/tmp  atime				  off					local
narniarp/var/tmp  exec				   on					 local
narniarp/var/tmp  com.sun:auto-snapshot  false				  local

Code:
$ zfs list -t snapshot
NAME														USED  AVAIL  REFER  MOUNTPOINT
narniarp@2018-08-24-15-00-before-update					   0B	  -   256K  -
narniarp@2018-09-12-after-kern-upd-to-4.17.19				 0B	  -   256K  -
narniarp/ROOT@2018-08-24-15-00-before-update				  0B	  -   256K  -
narniarp/ROOT@2018-09-12-after-kern-upd-to-4.17.19			0B	  -   256K  -
narniarp/ROOT/fedora@firstboot-but-emergency				464M	  -  9.16G  -
narniarp/ROOT/fedora@2018-08-24-right-after-datasets	   11.7M	  -  8.74G  -
narniarp/ROOT/fedora@2018-08-24-15-00-before-update		49.8M	  -  8.76G  -
narniarp/ROOT/fedora@2018-09-12-after-kern-upd-to-4.17.19   141M	  -  8.87G  -
narniarp/home@2018-08-24-right-after-datasets			   192K	  -   277K  -
narniarp/home@2018-08-24-15-00-before-update			   2.62M	  -  4.45M  -
narniarp/home@2018-09-12-after-kern-upd-to-4.17.19		  189M	  -   428M  -
narniarp/home/root@2018-08-24-right-after-datasets		  267K	  -   437K  -
narniarp/home/root@2018-08-24-15-00-before-update			 0B	  -   448K  -
narniarp/home/root@2018-09-12-after-kern-upd-to-4.17.19	   0B	  -   448K  -
narniarp/srv@2018-08-24-right-after-datasets				  0B	  -   256K  -
narniarp/srv@2018-08-24-15-00-before-update				   0B	  -   256K  -
narniarp/srv@2018-09-12-after-kern-upd-to-4.17.19			 0B	  -   256K  -
narniarp/var@2018-08-24-right-after-datasets				  0B	  -   256K  -
narniarp/var@2018-08-24-15-00-before-update				   0B	  -   256K  -
narniarp/var@2018-09-12-after-kern-upd-to-4.17.19			 0B	  -   256K  -
narniarp/var/cache@2018-08-24-right-after-datasets		  156M	  -   337M  -
narniarp/var/cache@2018-08-24-15-00-before-update		   117M	  -   436M  -
narniarp/var/cache@2018-09-12-after-kern-upd-to-4.17.19	 170M	  -   662M  -
narniarp/var/log@2018-08-24-right-after-datasets		   17.5M	  -  21.1M  -
narniarp/var/log@2018-08-24-15-00-before-update			20.0M	  -  23.8M  -
narniarp/var/log@2018-09-12-after-kern-upd-to-4.17.19	  30.0M	  -  36.6M  -
narniarp/var/spool@2018-08-24-right-after-datasets		  299K	  -  1.29M  -
narniarp/var/spool@2018-08-24-15-00-before-update		   427K	  -  28.6M  -
narniarp/var/spool@2018-09-12-after-kern-upd-to-4.17.19	 469K	  -  31.6M  -
narniarp/var/tmp@2018-08-24-right-after-datasets			341K	  -   427K  -
narniarp/var/tmp@2018-08-24-15-00-before-update			 363K	  -   448K  -
narniarp/var/tmp@2018-09-12-after-kern-upd-to-4.17.19	  62.8M	  -   167M  -
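
For the record, these are just manual recursive snapshots taken before and after anything risky, i.e. something like:
Code:
zfs snapshot -r narniarp@2018-09-12-after-kern-upd-to-4.17.19
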
The past h2testw result
Code:
Warning: Only 457906 of 487368 MByte tested.
The media is likely to be defective.
447.1 GByte OK (937791277 sectors)
105.5 KByte DATA LOST (211 sectors)
Details:0 KByte overwritten (0 sectors)
0 KByte slightly changed (< 8 bit/sector, 0 sectors)
105.5 KByte corrupted (211 sectors)
0 KByte aliased memory (0 sectors)
First error at offset: 0x000000005f386000
Expected: 0x000000005f386000
Found: 0x0000000200000060
H2testw version 1.3
Writing speed: 112 MByte/s
Reading speed: 167 MByte/s
H2testw v1.4

Edit: Present SMART result for the SSD
Code:
$ sudo smartctl -a /dev/disk/by-id/ata-ADATA_SU800_2I1020023348
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.17.19-200.fc28.x86_64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:	 ADATA SU800
Serial Number:	2I1020023348
LU WWN Device Id: 5 707c18 1006252ab
Firmware Version: Q0913A
User Capacity:	512,110,190,592 bytes [512 GB]
Sector Size:	  512 bytes logical/physical
Rotation Rate:	Solid State Device
Form Factor:	  2.5 inches
Device is:		Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Tue Sep 18 16:10:16 2018 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection:		 (	0) seconds.
Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0002)	Does not save SMART data before
					entering power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time:	 (   2) minutes.
Extended self-test routine
recommended polling time:	 (  30) minutes.
Conveyance self-test routine
recommended polling time:	 (   2) minutes.
SCT capabilities:			(0x0035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x0000   100   100   000	Old_age   Offline	  -	   0
  5 Reallocated_Sector_Ct   0x0000   100   100   000	Old_age   Offline	  -	   1
  9 Power_On_Hours		  0x0000   100   100   000	Old_age   Offline	  -	   1330
 12 Power_Cycle_Count	   0x0000   100   100   000	Old_age   Offline	  -	   87
148 Unknown_Attribute	   0x0000   100   100   000	Old_age   Offline	  -	   124
149 Unknown_Attribute	   0x0000   100   100   000	Old_age   Offline	  -	   12
150 Unknown_Attribute	   0x0000   100   100   000	Old_age   Offline	  -	   2
151 Unknown_Attribute	   0x0000   100   100   000	Old_age   Offline	  -	   8
159 Unknown_Attribute	   0x0000   100   100   000	Old_age   Offline	  -	   0
160 Unknown_Attribute	   0x0000   100   100   000	Old_age   Offline	  -	   16
161 Unknown_Attribute	   0x0000   100   100   000	Old_age   Offline	  -	   40
163 Unknown_Attribute	   0x0000   100   100   000	Old_age   Offline	  -	   19
164 Unknown_Attribute	   0x0000   100   100   000	Old_age   Offline	  -	   10961
165 Unknown_Attribute	   0x0000   100   100   000	Old_age   Offline	  -	   44
166 Unknown_Attribute	   0x0000   100   100   000	Old_age   Offline	  -	   5
167 Unknown_Attribute	   0x0000   100   100   000	Old_age   Offline	  -	   23
169 Unknown_Attribute	   0x0000   100   100   000	Old_age   Offline	  -	   100
177 Wear_Leveling_Count	 0x0000   100   100   050	Old_age   Offline	  -	   0
181 Program_Fail_Cnt_Total  0x0000   100   100   000	Old_age   Offline	  -	   0
182 Erase_Fail_Count_Total  0x0000   100   100   000	Old_age   Offline	  -	   0
192 Power-Off_Retract_Count 0x0000   100   100   000	Old_age   Offline	  -	   17
194 Temperature_Celsius	 0x0000   100   100   000	Old_age   Offline	  -	   32
196 Reallocated_Event_Count 0x0000   100   100   016	Old_age   Offline	  -	   191
199 UDMA_CRC_Error_Count	0x0000   100   100   050	Old_age   Offline	  -	   1
232 Available_Reservd_Space 0x0000   100   100   000	Old_age   Offline	  -	   100
241 Total_LBAs_Written	  0x0000   100   100   000	Old_age   Offline	  -	   139900
242 Total_LBAs_Read		 0x0000   100   100   000	Old_age   Offline	  -	   218566
245 Unknown_Attribute	   0x0000   100   100   000	Old_age   Offline	  -	   394596
250 Read_Error_Retry_Rate   0x0000   100   100   000	Old_age   Offline	  -	   0
251 Unknown_Attribute	   0x0000   100   100   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
Invalid Error Log index = 0x0b (T13/1321D rev 1c Section 8.41.6.8.2.2 gives valid range from 1 to 5)

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	  1311		 -
# 2  Short offline	   Completed without error	   00%	  1283		 -
# 3  Short offline	   Completed without error	   00%	  1246		 -
# 4  Conveyance offline  Completed without error	   00%	   582		 -
# 5  Extended offline	Completed without error	   00%	   558		 -
# 6  Short offline	   Completed without error	   00%	   558		 -
# 7  Short offline	   Completed without error	   00%	   208		 -
# 8  Conveyance offline  Completed without error	   00%	   208		 -
# 9  Extended offline	Completed without error	   00%		69		 -
#10  Conveyance offline  Completed without error	   00%		63		 -
#11  Short offline	   Completed without error	   00%		63		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Steps taken so far (roughly):
  1. RAM, HDD, SSD and CPU burn-in
  2. Install Fedora to the HDD
  3. Create the pool on the SSD (a rough sketch of the command follows this list)
  4. Move the OS to the pool and make the rig boot from the SSD
  5. Add more datasets and make use of them
  6. Update the OS to a newer kernel (the one above, i.e. 4.17.19-200)
  7. Add scrub and smartctl tasks to crontab
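For step 3, the pool creation looked roughly like this (a sketch reconstructed from the zpool status / zfs get output above - the ashift value in particular is an assumption, not the literal command I ran):
Code:
# sketch: raidz2 across the 20 SSD partitions, addressed by their stable by-id names
zpool create -o ashift=12 \
    -O mountpoint=none -O canmount=off -O atime=off -O aclinherit=passthrough \
    narniarp raidz2 \
    /dev/disk/by-id/ata-ADATA_SU800_2I1020023348-part{11..30}
# datasets were then added one by one, e.g.:
zfs create -o mountpoint=/home -o compression=lz4 -o atime=off -o setuid=off narniarp/home
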
Steps planned:
  • Remove smartctl from crontab and use smartd instead (sketched after this list)
  • Remove snapshots from some datasets (like tmp etc.)
  • Set up automatic snapshots (also sketched below)
  • Run a Windows 10 VM when I need some Windoze-specific software (occasionally)
  • More read/write benchmarks in the future (to monitor performance)
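For the smartd and automatic-snapshot items, what I have in mind is roughly the following (a sketch only - the schedules and snapshot naming are made up for illustration):
Code:
# /etc/smartd.conf - replaces the smartctl cron jobs:
# monitor everything, short self-test nightly at 02:00, long self-test Saturdays at 03:00
/dev/disk/by-id/ata-ADATA_SU800_2I1020023348 -a -o on -S on -s (S/../.././02|L/../../6/03)

# crontab - a daily recursive snapshot of the pool (note the % signs must be escaped in cron)
0 1 * * * /usr/sbin/zfs snapshot -r narniarp@auto-$(date +\%Y-\%m-\%d)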

Steps under consideration:
  • Some kind of backup to the HDD (a rough sketch follows this list)
  • Enabling the unsupported ZFS_DEBUG_MODIFY flag
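For the backup idea, the rough shape would be a single-disk pool on the spare HDD partition plus incremental zfs send/receive (a sketch only; "backup" is a hypothetical pool name, sdb7 is the spare partition from the lsblk above, and a by-id path would be preferable in practice):
Code:
# one-time: create a plain single-disk pool on the spare HDD partition
zpool create -O mountpoint=none backup /dev/sdb7

# then periodically: replicate the whole SSD pool incrementally
zfs snapshot -r narniarp@backup-2018-10-01
zfs send -R -i @backup-2018-09-01 narniarp@backup-2018-10-01 | zfs receive -u -d -F backup
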
Benchmarks for now
Code:
$ sudo mkdir /mnt/fs
[sudo] password for jadis:
$ sudo fio --loops=5 --size=64000m --filename=/mnt/fs/fiotest.tmp --stonewall --ioengine=libaio   --name=Seqread --bs=1m --rw=read   --name=Seqwrite --bs=1m --rw=write   --name=512Kread --bs=512k --rw=randread   --name=512Kwrite --bs=512k --rw=randwrite   --name=4kQD32read --bs=4k --iodepth=32 --rw=randread   --name=4kQD32write --bs=4k --iodepth=32 --rw=randwrite
Seqread: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
Seqwrite: (g=1): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
512Kread: (g=2): rw=randread, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=libaio, iodepth=1
512Kwrite: (g=3): rw=randwrite, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=libaio, iodepth=1
4kQD32read: (g=4): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
4kQD32write: (g=5): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.3
Starting 6 processes
Seqread: Laying out IO file (1 file / 64000MiB)
^Cbs: 1 (f=1): [_(4),r(1),P(1)][15.3%][r=6924KiB/s,w=0KiB/s][r=1731,w=0 IOPS][eta 11h:59m:01s]
fio: terminating on signal 2

Seqread: (groupid=0, jobs=1): err= 0: pid=30851: Tue Sep 18 15:11:43 2018
   read: IOPS=435, BW=435MiB/s (456MB/s)(313GiB/735051msec)
	slat (usec): min=226, max=108121, avg=2290.98, stdev=4019.15
	clat (nsec): min=525, max=412918, avg=2616.08, stdev=2864.27
	 lat (usec): min=227, max=108131, avg=2294.57, stdev=4020.41
	clat percentiles (nsec):
	 |  1.00th=[  1032],  5.00th=[  1272], 10.00th=[  1400], 20.00th=[  1592],
	 | 30.00th=[  1768], 40.00th=[  1944], 50.00th=[  2160], 60.00th=[  2416],
	 | 70.00th=[  2768], 80.00th=[  3376], 90.00th=[  4192], 95.00th=[  4960],
	 | 99.00th=[  7840], 99.50th=[ 10816], 99.90th=[ 25216], 99.95th=[ 38656],
	 | 99.99th=[101888]
   bw (  KiB/s): min=210410, max=468292, per=80.38%, avg=358329.14, stdev=68489.60, samples=1469
   iops		: min=  205, max=  457, avg=349.45, stdev=66.91, samples=1469
  lat (nsec)   : 750=0.10%, 1000=0.74%
  lat (usec)   : 2=42.08%, 4=45.17%, 10=11.33%, 20=0.42%, 50=0.12%
  lat (usec)   : 100=0.02%, 250=0.01%, 500=0.01%
  cpu		  : usr=0.25%, sys=29.08%, ctx=129113, majf=11, minf=264
  IO depths	: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
	 submit	: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
	 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
	 issued rwts: total=320000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
	 latency   : target=0, window=0, percentile=100.00%, depth=1
Seqwrite: (groupid=1, jobs=1): err= 0: pid=3703: Tue Sep 18 15:11:43 2018
  write: IOPS=227, BW=228MiB/s (239MB/s)(313GiB/1404580msec)
	slat (usec): min=123, max=214947, avg=4372.21, stdev=1640.54
	clat (nsec): min=457, max=18901k, avg=7149.55, stdev=46963.23
	 lat (usec): min=124, max=214951, avg=4381.84, stdev=1643.04
	clat percentiles (nsec):
	 |  1.00th=[   1544],  5.00th=[   2992], 10.00th=[   3824],
	 | 20.00th=[   4640], 30.00th=[   5216], 40.00th=[   5472],
	 | 50.00th=[   5728], 60.00th=[   6112], 70.00th=[   6496],
	 | 80.00th=[   7008], 90.00th=[   7648], 95.00th=[   8384],
	 | 99.00th=[  21120], 99.50th=[  42240], 99.90th=[ 321536],
	 | 99.95th=[ 518144], 99.99th=[1564672]
   bw (  KiB/s): min=18256, max=1088412, per=86.93%, avg=202796.49, stdev=49712.27, samples=2809
   iops		: min=   17, max= 1062, avg=197.55, stdev=48.54, samples=2809
  lat (nsec)   : 500=0.01%, 750=0.05%, 1000=0.13%
  lat (usec)   : 2=1.81%, 4=10.07%, 10=85.84%, 20=1.03%, 50=0.64%
  lat (usec)   : 100=0.18%, 250=0.12%, 500=0.09%, 750=0.02%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu		  : usr=1.73%, sys=16.40%, ctx=2874712, majf=0, minf=10
  IO depths	: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
	 submit	: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
	 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
	 issued rwts: total=0,320000,0,0 short=0,0,0,0 dropped=0,0,0,0
	 latency   : target=0, window=0, percentile=100.00%, depth=1
512Kread: (groupid=2, jobs=1): err= 0: pid=5509: Tue Sep 18 15:11:43 2018
   read: IOPS=621, BW=311MiB/s (326MB/s)(313GiB/1029018msec)
	slat (usec): min=81, max=105048, avg=1605.45, stdev=472.77
	clat (nsec): min=435, max=23578, avg=903.66, stdev=363.32
	 lat (usec): min=82, max=105055, avg=1606.68, stdev=472.90
	clat percentiles (nsec):
	 |  1.00th=[  660],  5.00th=[  724], 10.00th=[  756], 20.00th=[  788],
	 | 30.00th=[  812], 40.00th=[  836], 50.00th=[  860], 60.00th=[  884],
	 | 70.00th=[  916], 80.00th=[  964], 90.00th=[ 1064], 95.00th=[ 1176],
	 | 99.00th=[ 1528], 99.50th=[ 2192], 99.90th=[ 5088], 99.95th=[ 7520],
	 | 99.99th=[13632]
   bw (  KiB/s): min=12224, max=258527, per=73.92%, avg=235394.44, stdev=14038.36, samples=2057
   iops		: min=   23, max=  504, avg=459.25, stdev=27.42, samples=2057
  lat (nsec)   : 500=0.06%, 750=9.00%, 1000=75.75%
  lat (usec)   : 2=14.61%, 4=0.36%, 10=0.18%, 20=0.04%, 50=0.01%
  cpu		  : usr=0.16%, sys=12.13%, ctx=700033, majf=1, minf=133
  IO depths	: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
	 submit	: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
	 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
	 issued rwts: total=640000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
	 latency   : target=0, window=0, percentile=100.00%, depth=1
512Kwrite: (groupid=3, jobs=1): err= 0: pid=7605: Tue Sep 18 15:11:43 2018
  write: IOPS=533, BW=267MiB/s (280MB/s)(313GiB/1198865msec)
	slat (usec): min=63, max=1492.5k, avg=1859.68, stdev=2942.37
	clat (nsec): min=518, max=5245.2k, avg=5024.18, stdev=27601.79
	 lat (usec): min=64, max=1492.5k, avg=1866.41, stdev=2943.80
	clat percentiles (nsec):
	 |  1.00th=[	964],  5.00th=[   1656], 10.00th=[   2288],
	 | 20.00th=[   2832], 30.00th=[   3216], 40.00th=[   3536],
	 | 50.00th=[   3888], 60.00th=[   4256], 70.00th=[   4704],
	 | 80.00th=[   5408], 90.00th=[   5984], 95.00th=[   6752],
	 | 99.00th=[  15168], 99.50th=[  35072], 99.90th=[ 236544],
	 | 99.95th=[ 366592], 99.99th=[1056768]
   bw (  KiB/s): min= 2522, max=876180, per=85.46%, avg=233570.61, stdev=119610.40, samples=2395
   iops		: min=	4, max= 1711, avg=455.70, stdev=233.62, samples=2395
  lat (nsec)   : 750=0.10%, 1000=1.11%
  lat (usec)   : 2=5.93%, 4=46.62%, 10=44.91%, 20=0.51%, 50=0.44%
  lat (usec)   : 100=0.15%, 250=0.12%, 500=0.06%, 750=0.02%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
  cpu		  : usr=1.78%, sys=16.32%, ctx=2971021, majf=0, minf=7
  IO depths	: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
	 submit	: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
	 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
	 issued rwts: total=0,640000,0,0 short=0,0,0,0 dropped=0,0,0,0
	 latency   : target=0, window=0, percentile=100.00%, depth=1
4kQD32read: (groupid=4, jobs=1): err= 0: pid=1783: Tue Sep 18 15:11:43 2018
   read: IOPS=1759, BW=7038KiB/s (7207kB/s)(22.9GiB/3415393msec)
	slat (usec): min=2, max=175930, avg=566.42, stdev=339.97
	clat (usec): min=2, max=1576.9k, avg=17615.90, stdev=5211.94
	 lat (usec): min=788, max=1611.6k, avg=18182.56, stdev=5365.74
	clat percentiles (msec):
	 |  1.00th=[   15],  5.00th=[   16], 10.00th=[   16], 20.00th=[   17],
	 | 30.00th=[   17], 40.00th=[   18], 50.00th=[   18], 60.00th=[   18],
	 | 70.00th=[   19], 80.00th=[   19], 90.00th=[   20], 95.00th=[   20],
	 | 99.00th=[   21], 99.50th=[   27], 99.90th=[   39], 99.95th=[   57],
	 | 99.99th=[  132]
   bw (  KiB/s): min=   45, max= 7446, per=84.36%, avg=5936.91, stdev=683.28, samples=6829
   iops		: min=   11, max= 1861, avg=1483.85, stdev=170.83, samples=6829
  lat (usec)   : 4=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=97.58%, 50=2.35%
  lat (msec)   : 100=0.05%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu		  : usr=0.34%, sys=9.27%, ctx=5355010, majf=0, minf=39
  IO depths	: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
	 submit	: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
	 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
	 issued rwts: total=6009758,0,0,0 short=0,0,0,0 dropped=0,0,0,0
	 latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=313GiB (336GB), run=735051-735051msec

Run status group 1 (all jobs):
  WRITE: bw=228MiB/s (239MB/s), 228MiB/s-228MiB/s (239MB/s-239MB/s), io=313GiB (336GB), run=1404580-1404580msec

Run status group 2 (all jobs):
   READ: bw=311MiB/s (326MB/s), 311MiB/s-311MiB/s (326MB/s-326MB/s), io=313GiB (336GB), run=1029018-1029018msec

Run status group 3 (all jobs):
  WRITE: bw=267MiB/s (280MB/s), 267MiB/s-267MiB/s (280MB/s-280MB/s), io=313GiB (336GB), run=1198865-1198865msec

Run status group 4 (all jobs):
   READ: bw=7038KiB/s (7207kB/s), 7038KiB/s-7038KiB/s (7207kB/s-7207kB/s), io=22.9GiB (24.6GB), run=3415393-3415393msec

Run status group 5 (all jobs):


Questions so far:
  • I stopped the benchmark with ^C - was that too early? Does that make the results moot?
  • Would you recommend (and why?) or welcome results from some other benchmark methods?
  • How often would you recommend running pool read/write benchmarks?
  • Could you please give some advice on the ZFS_DEBUG_MODIFY flag?
  • Should I expect a sudden performance drop because of the many partitions before I hit the 80% capacity limit?
  • Or should I rather expect gradual, step-by-step performance degradation because of the many partitions?
Finally, I guess you know better than I do what can be posted here :)
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,600
Sorry, I did not completely read your epic post...

In general, if an SSD / USB flash drive returns the wrong result too many times, it's probably bad.
Get it replaced, even if it's brand new or a recent purchase.
In rare cases a firmware update will solve the problem (but so may returning it for replacement).

Further, some people are selling fake flash with falsified sizes, meaning they claim it's 32GB when it's really only 8GB. Such a drive just keeps the most recent writes available; the oldest are lost / overwritten. Some even use compression to fake larger sizes. I routinely test all my flash drives with a Linux tool, f3write & f3read. I had one fail and got a refund. Now I only buy from reputable brands and stores.
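
If anyone wants to run the same check, the basic usage is just (the mount point below is hypothetical - point it at wherever the drive is mounted):
Code:
# fill the mounted flash drive with test files, then read them back and verify
f3write /run/media/$USER/FLASHDRIVE/
f3read  /run/media/$USER/FLASHDRIVE/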

I wish flash drives (whether they use a SATA, SAS or USB interface) were clear about their error detection and correction. Hard drives use a sector scheme of:
  • Sector header
  • Sector data
  • Sector CRC
  • Error correction code
  • Sector trailer
  • Sector gap
In one case I read about many years ago, they stated how many errors the ECC code was good for, something like 9 bits in a row. I worry that flash drives are a lot stupider, meaning they don't include dedicated space (or not as much) for the ECC part of the sectors. If I were designing a flash drive, I'd have two sections: the sector data, and the sector CRC & ECC.
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
Should anyone be interested (or rather find this by accident or through google-fu): the command below didn't work for me (I needed it after one of the kernel updates).
$ sudo grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg failed with a weird
Code:
/usr/sbin/grub2-probe: error: failed to get canonical path of `/dev/ata-ADATA_SU800_2I1020023348-part11'.

Instead I ran it directly as root, which worked: # grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
need an old priest and a young priest
Are you sure? Maybe I'll check if my SSD started spinning, first? But how? ;)
if an SSD / USB flash drive returns the wrong result too many times
Thanks, I'll monitor the scrub results and/or learn if I can change the firmware.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Are you sure? Maybe I'll check if my SSD started spinning, first? But how? ;)
Well, zvol defaults for SSDs show an RPM of 1, so ...

Seriously, this definitely merits being in "Off Topic" because this is way, way outside the expected use case. Using a bajillion partitions and RAIDZ2 to work around bad blocks on an HDD might work in theory, but on an SSD it falls apart since they do LBA remapping on the fly and you won't actually be writing to the same physical NAND pages you initially partitioned.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
but on an SSD it falls apart since they do LBA remapping on the fly
HDDs do the same for bad sectors, so I really doubt it would help there. As you say, on an SSD there's zero doubt that it will not work.
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
on an HDD might work in theory, but on an SSD it falls apart since they do LBA remapping on the fly and you won't actually be writing to the same physical NAND pages you initially partitioned.
on an SSD there's zero doubt that it will not work.
I'm not quite following you.

I'm too shy to ask for more elaboration yet, but maybe you could validate my reasoning (below), please?

I can imagine two scenarios, and they both seem to work out for me.

For the sake of the two scenarios, let's assume the SSD has two evil sectors (a & b). Evil sectors = sectors reported as healthy but corrupting data. Healthy sectors: c-z.

Scenario no.1:
  1. Initially all data is spread across sectors c-y.
  2. I write something and it gets written and mapped to evil sector a.
  3. I run a scrub; it detects a checksum error and, since it's raidz, uses parity to reconstruct and write the proper data.
  4. The data gets written and remapped to a healthy sector z.
  5. Sector a is no longer in use.
  6. I've just recovered from corrupted data.
Scenario no.2:
  1. Initially all data is spread across sectors c-y (same as above).
  2. I write something and it gets written and mapped to evil sector a.
  3. I run a scrub; it detects a checksum error and, since it's raidz, uses parity to reconstruct and write the proper data.
  4. The data gets written to evil sector a or b (no remapping, or remapping to the other evil sector).
  5. I run a scrub again; it detects a checksum error and uses parity to reconstruct and write the proper data.
  6. The data gets written to evil sector a or b (no remapping, or remapping to the other evil sector) again.
  7. The same happens a few more times, but ZFS is still able to detect the checksum errors and reconstruct the proper data.
  8. Finally the data gets written and remapped to a healthy sector z.
  9. Sector a is no longer in use.
  10. I've just recovered from corrupted data.
What am I missing?
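
In practical terms, the repair loop I imagine in both scenarios is nothing more than:
Code:
zpool scrub narniarp
# after the scrub finishes, check the CKSUM column and the "errors:" line
zpool status -v narniarp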
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
let's assume the SSD has two evil sectors (a & b). Evil sectors = sectors reported as healthy but corrupting data.

The problem with that assumption is that the logical LBAs of your "evil sectors (a & b)" will be mapped to different physical NAND locations over time as the SSD's controller does wear leveling, so which LBA lives on an evil sector will keep changing. Your attempt to resilver away from "evil sector a" could land in "evil sector b", or you might hit three failed reads at once, because there's no consistency.

That's also a bit of a big assumption, since that sort of failure is what SMART is intended to detect.

And finally, when you're getting to the point of hardware that's assumed to be maliciously evil and ruining your data, you don't band-aid around it with a bunch of RAIDZ2 partitions - you take it out behind the woodshed and give it the Old Yeller treatment.
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
FWIW: looks like I've just successfully upgraded to Fedora 29.

the Old Yeller treatment.
:D

At first I thought you meant Old Yeller eating meat behind the shed :D and imagined a disk being ground up in the dog's muzzle, but then I figured you meant the part where the dog had rabies...
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
FWIW: looks like I've just successfully upgraded to Fedora 29.
This is roughly how the command-line upgrade process went:
0. Kernel was 4.18.x, zfs 0.7.13
1. dnf system-upgrade download
2. dnf system-upgrade reboot
3. Booted into the 4.18 kernel
4. During that boot the packages got upgraded, e.g. zfs -> 0.8.1
5. The system booted into kernel 4.18 and the zfs libs already seemed to be 0.8.1. Kernel 5.2.7 was prepared somehow.
6. I had to run dracut because the new kernel's initramfs was missing the zfs module (commands sketched below)
7. I had to run grub2-mkconfig because the new kernel had no entry in grub.cfg
8. Reboot into the new kernel
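
The fix-ups from steps 6 and 7 were along these lines (a sketch from memory; the exact kernel release string will differ on your box):
Code:
# pick the newest installed kernel, e.g. 5.2.7-...
kver=$(ls /lib/modules | sort -V | tail -n1)
# rebuild the initramfs for the new kernel so it includes the zfs module
dracut --force /boot/initramfs-${kver}.img ${kver}
# regenerate the GRUB menu so the new kernel gets an entry (run as root, see my earlier post)
grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg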

That's all folks

 