IPMI / FreeNAS GUI / FreeNAS SSH suddenly unreachable, but existing SSH login keeps on working?

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
Hi everyone,

I have a weird problem... I was running the solnet-array-test script and during the parallel array seek test, after a couple days of running, suddenly
  • My GUI became unreachable on both interfaces
1582372314491.png

  • My IPMI became unreachable (ping timeout)
  • My SSH login also stopped working on both interfaces
Code:
C:\Users\m4st4>ssh -vvv root@192.168.0.10
OpenSSH_for_Windows_7.7p1, LibreSSL 2.6.5
debug3: Failed to open file:C:/Users/m4st4/.ssh/config error:2
debug3: Failed to open file:C:/ProgramData/ssh/ssh_config error:2
debug2: resolve_canonicalize: hostname 192.168.0.10 is address
debug2: ssh_connect_direct: needpriv 0
debug1: Connecting to 192.168.0.10 [192.168.0.10] port 22.
debug1: Connection established.
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_rsa error:2
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_rsa.pub error:2
debug1: key_load_public: No such file or directory
debug1: identity file C:\\Users\\m4st4/.ssh/id_rsa type -1
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_rsa-cert error:2
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_rsa-cert.pub error:2
debug1: key_load_public: No such file or directory
debug1: identity file C:\\Users\\m4st4/.ssh/id_rsa-cert type -1
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_dsa error:2
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_dsa.pub error:2
debug1: key_load_public: No such file or directory
debug1: identity file C:\\Users\\m4st4/.ssh/id_dsa type -1
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_dsa-cert error:2
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_dsa-cert.pub error:2
debug1: key_load_public: No such file or directory
debug1: identity file C:\\Users\\m4st4/.ssh/id_dsa-cert type -1
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_ecdsa error:2
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_ecdsa.pub error:2
debug1: key_load_public: No such file or directory
debug1: identity file C:\\Users\\m4st4/.ssh/id_ecdsa type -1
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_ecdsa-cert error:2
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_ecdsa-cert.pub error:2
debug1: key_load_public: No such file or directory
debug1: identity file C:\\Users\\m4st4/.ssh/id_ecdsa-cert type -1
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_ed25519 error:2
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_ed25519.pub error:2
debug1: key_load_public: No such file or directory
debug1: identity file C:\\Users\\m4st4/.ssh/id_ed25519 type -1
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_ed25519-cert error:2
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_ed25519-cert.pub error:2
debug1: key_load_public: No such file or directory
debug1: identity file C:\\Users\\m4st4/.ssh/id_ed25519-cert type -1
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_xmss error:2
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_xmss.pub error:2
debug1: key_load_public: No such file or directory
debug1: identity file C:\\Users\\m4st4/.ssh/id_xmss type -1
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_xmss-cert error:2
debug3: Failed to open file:C:/Users/m4st4/.ssh/id_xmss-cert.pub error:2
debug1: key_load_public: No such file or directory
debug1: identity file C:\\Users\\m4st4/.ssh/id_xmss-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_for_Windows_7.7
debug1: Remote protocol version 2.0, remote software version OpenSSH_8.0-hpn14v15
debug1: match: OpenSSH_8.0-hpn14v15 pat OpenSSH* compat 0x04000000
debug2: fd 3 setting O_NONBLOCK
debug1: Authenticating to 192.168.0.10:22 as 'root'
debug3: hostkeys_foreach: reading file "C:\\Users\\m4st4/.ssh/known_hosts"
debug3: record_hostkey: found key type ECDSA in file C:\\Users\\m4st4/.ssh/known_hosts:3
debug3: load_hostkeys: loaded 1 keys from 192.168.0.10
debug3: Failed to open file:C:/Users/m4st4/.ssh/known_hosts2 error:2
debug3: Failed to open file:C:/ProgramData/ssh/ssh_known_hosts error:2
debug3: Failed to open file:C:/ProgramData/ssh/ssh_known_hosts2 error:2
debug3: order_hostkeyalgs: prefer hostkeyalgs: ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521
debug3: send packet: type 20
debug1: SSH2_MSG_KEXINIT sent
debug3: receive packet: type 20
debug1: SSH2_MSG_KEXINIT received
debug2: local client KEXINIT proposal
debug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha256,diffie-hellman-group14-sha1,ext-info-c
debug2: host key algorithms: ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,ssh-ed25519-cert-v01@openssh.com,ssh-rsa-cert-v01@openssh.com,ssh-ed25519,rsa-sha2-512,rsa-sha2-256,ssh-rsa
debug2: ciphers ctos: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: ciphers stoc: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: compression ctos: none
debug2: compression stoc: none
debug2: languages ctos:
debug2: languages stoc:
debug2: first_kex_follows 0
debug2: reserved 0
debug2: peer server KEXINIT proposal
debug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group14-sha256,diffie-hellman-group14-sha1
debug2: host key algorithms: rsa-sha2-512,rsa-sha2-256,ssh-rsa,ecdsa-sha2-nistp256,ssh-ed25519
debug2: ciphers ctos: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com,aes128-cbc,none
debug2: ciphers stoc: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com,aes128-cbc,none
debug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: compression ctos: none
debug2: compression stoc: none
debug2: languages ctos:
debug2: languages stoc:
debug2: first_kex_follows 0
debug2: reserved 0
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug3: send packet: type 30
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug3: receive packet: type 31
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:ObBF/vvqMsbX8ltJx+asuBqgEceleflY+OksclB1HHQ
debug3: hostkeys_foreach: reading file "C:\\Users\\m4st4/.ssh/known_hosts"
debug3: record_hostkey: found key type ECDSA in file C:\\Users\\m4st4/.ssh/known_hosts:3
debug3: load_hostkeys: loaded 1 keys from 192.168.0.10
debug3: Failed to open file:C:/Users/m4st4/.ssh/known_hosts2 error:2
debug3: Failed to open file:C:/ProgramData/ssh/ssh_known_hosts error:2
debug3: Failed to open file:C:/ProgramData/ssh/ssh_known_hosts2 error:2
debug1: Host '192.168.0.10' is known and matches the ECDSA host key.
debug1: Found key in C:\\Users\\m4st4/.ssh/known_hosts:3
debug3: send packet: type 21
debug2: set_newkeys: mode 1
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug3: receive packet: type 21
debug1: SSH2_MSG_NEWKEYS received
debug2: set_newkeys: mode 0
debug1: rekey after 134217728 blocks
debug3: unable to connect to pipe \\\\.\\pipe\\openssh-ssh-agent, error: 2
debug1: pubkey_prepare: ssh_get_authentication_socket: No such file or directory
debug2: key: C:\\Users\\m4st4/.ssh/id_rsa (0000000000000000)
debug2: key: C:\\Users\\m4st4/.ssh/id_dsa (0000000000000000)
debug2: key: C:\\Users\\m4st4/.ssh/id_ecdsa (0000000000000000)
debug2: key: C:\\Users\\m4st4/.ssh/id_ed25519 (0000000000000000)
debug2: key: C:\\Users\\m4st4/.ssh/id_xmss (0000000000000000)
debug3: send packet: type 5
debug3: receive packet: type 7
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,ssh-rsa,rsa-sha2-256,rsa-sha2-512,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521>
debug3: receive packet: type 6
debug2: service_accept: ssh-userauth
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug3: send packet: type 50
debug3: receive packet: type 51
debug1: Authentications that can continue: publickey,password
debug3: start over, passed a different list publickey,password
debug3: preferred publickey,keyboard-interactive,password
debug3: authmethod_lookup publickey
debug3: remaining preferred: keyboard-interactive,password
debug3: authmethod_is_enabled publickey
debug1: Next authentication method: publickey
debug1: Trying private key: C:\\Users\\m4st4/.ssh/id_rsa
debug3: no such identity: C:\\Users\\m4st4/.ssh/id_rsa: No such file or directory
debug1: Trying private key: C:\\Users\\m4st4/.ssh/id_dsa
debug3: no such identity: C:\\Users\\m4st4/.ssh/id_dsa: No such file or directory
debug1: Trying private key: C:\\Users\\m4st4/.ssh/id_ecdsa
debug3: no such identity: C:\\Users\\m4st4/.ssh/id_ecdsa: No such file or directory
debug1: Trying private key: C:\\Users\\m4st4/.ssh/id_ed25519
debug3: no such identity: C:\\Users\\m4st4/.ssh/id_ed25519: No such file or directory
debug1: Trying private key: C:\\Users\\m4st4/.ssh/id_xmss
debug3: no such identity: C:\\Users\\m4st4/.ssh/id_xmss: No such file or directory
debug2: we did not send a packet, disable method
debug3: authmethod_lookup password
debug3: remaining preferred: ,password
debug3: authmethod_is_enabled password
debug1: Next authentication method: password
debug3: failed to open file:C:/dev/tty error:3
debug1: read_passphrase: can't open /dev/tty: No such file or directory
root@192.168.0.10's password:
debug3: send packet: type 50
debug2: we sent a password packet, wait for reply
debug3: receive packet: type 52
debug1: Authentication succeeded (password).
Authenticated to 192.168.0.10 ([192.168.0.10]:22).
debug1: channel 0: new [client-session]
debug3: ssh_session2_open: channel_new: 0
debug2: channel 0: send open
debug3: send packet: type 90
debug1: Requesting no-more-sessions@openssh.com
debug3: send packet: type 80
debug1: Entering interactive session.
debug1: pledge: network
debug1: console supports the ansi parsing
debug3: Successfully set console output code page from:850 to 65001
debug3: Successfully set console input code page from:850 to 65001
debug3: receive packet: type 80
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug3: receive packet: type 91
debug2: channel_input_open_confirmation: channel 0: callback start
debug2: fd 3 setting TCP_NODELAY
debug2: client_session2_setup: id 0
debug2: channel 0: request pty-req confirm 1
debug3: send packet: type 98
debug2: channel 0: request shell confirm 1
debug3: send packet: type 98
debug2: channel_input_open_confirmation: channel 0: callback done
debug2: channel 0: open confirm rwindow 0 rmax 32768
debug3: receive packet: type 99
debug2: channel_input_status_confirm: type 99 id 0
debug2: PTY allocation request accepted on channel 0
debug2: channel 0: rcvd adjust 65536
debug3: receive packet: type 99
debug2: channel_input_status_confirm: type 99 id 0
debug2: shell request accepted on channel 0
debug2: client_check_window_change: changed
debug2: channel 0: request window-change confirm 0
debug3: send packet: type 98
debug3: receive packet: type 98
debug1: client_input_channel_req: channel 0 rtype keepalive@openssh.com reply 1
debug3: send packet: type 100
debug3: receive packet: type 98
debug1: client_input_channel_req: channel 0 rtype keepalive@openssh.com reply 1
debug3: send packet: type 100
debug3: receive packet: type 98
debug1: client_input_channel_req: channel 0 rtype keepalive@openssh.com reply 1
...


But I could hear that the seek-array test was still running, as my HDDs were still rumbling... So I waited a few more days, and today the script finally completed and the SSH window that ran it returned the prompt to me.

I wonder how I can debug this... I didn't reboot yet, as I'd like to investigate this issue a bit further first...
  • How can I restore / restart my GUI?
  • Perhaps I can restart one of my two interfaces (the one on which I'm not logged in atm)?
  • Which log files should I check?

fyi,
I'm not sure if it is related, but although the solnet-array-test completed without errors, its performance was weird... This is the second time I've run it (the first time with a zpool, the second time without one). The first time, the parallel seek-array test took between 3d4h and 4d4h per HDD, which I already found a big variance.
But on the second run it became even crazier... One HDD took 3d6h, 4 HDDs took more than 4 days and 3 HDDs took more than 5 days (up to 5d11h!)

When the GUI was still working, I was able to take some HDD IO screenshots and there it is very clear when it becomes slower
1582371995408.png

1582371958001.png

1582371972423.png

da4 is the only fast / normal one. With da5 and da6 you can see performance drop to around 40MB/sec near the end and sometimes recover a little later. The other HDDs, of which I didn't take a screenshot yet (because they hadn't completed yet), show a much longer performance drop to 40MB/sec... Again, not sure if this is related, but I do find this very weird...
 

Mastakilla

Some more debug:
Code:
root@FreeNAS[/]# service django status
django is running as pid 1325.
root@FreeNAS[/]# service nginx status
nginx is running as pid 1255.
root@FreeNAS[/]# ifconfig
ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: 10Gbit
        options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether d0:50:99:d3:fd:fe
        hwaddr d0:50:99:d3:fd:fe
        inet 192.168.10.10 netmask 0xffffff00 broadcast 192.168.10.255
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (10Gbase-T <full-duplex,rxpause,txpause>)
        status: active
ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: 1Gbit
        options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether d0:50:99:d3:fd:ff
        hwaddr d0:50:99:d3:fd:ff
        inet 192.168.0.10 netmask 0xffffff00 broadcast 192.168.0.255
        nd6 options=9<PERFORMNUD,IFDISABLED>
        media: Ethernet autoselect (1000baseT <full-duplex,rxpause,txpause>)
        status: active
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x3
        inet 127.0.0.1 netmask 0xff000000
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        groups: lo
root@FreeNAS[/]#
 

Mastakilla

Aha... After doing "cd /var/log", the SSH window that I still had open hangs... So it seems to be filesystem related...
Code:
root@FreeNAS[/]# cd /var/log

^C^C




^C^C


Could it be that my USB stick has died? It already had some errors in the past, but I didn't care very much, as I was still testing and assumed the OS was running in memory after booting. Before going live, I wanted to install FreeNAS on my faithful Intel SSD (which currently still contains Windows 10 and Linux for testing purposes).
But now that I think of it, it probably makes sense for FreeNAS to write to the USB stick regularly as well, as RAM doesn't store logs and configs very well ;)

So can a broken FreeNAS USB stick cause
  • IPMI / GUI / SSH to become unreachable
  • Performance of other HDDs to drop
?

Edit:
Just discovered that /var is stored in RAM:
Code:
root@FreeNAS[/var/log]# df -h |grep var
tmpfs 5.3T 34M 5.3T 0% /var

And just before that I was able to list files in /etc/rc.d (which is on the USB stick). So it seems like there was no issue with the USB stick after all?

Edit2:
Scrub of the USB stick also found no issues
Code:
root@FreeNAS[~]# zpool status -v freenas-boot
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:12:35 with 0 errors on Sat Feb 22 17:00:52 2020
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da8p2     ONLINE       0     0     0

errors: No known data errors
root@FreeNAS[~]#
 

Mastakilla

As I had no way into the system anymore, I've powered off the NAS completely (incl. IPMI). After turning it on again, there is nothing in the IPMI log, there are no new warnings in FreeNAS, and a zpool scrub of the USB stick didn't find any errors.
I've also just gone through all log files and didn't really find anything relevant at all, except perhaps messages like

1582377782179.png


1582377798720.png


But these are just SMART warnings, no? Is there any consequence to these kinds of warnings? Does FreeNAS throttle performance if temps reach these thresholds?
When comparing the graphs of the temps and the I/O, I don't really see a direct correlation with the performance drops. Before the performance dropped, those HDDs had already been running at 45°C+ for many hours, and the performance drop didn't occur after a temperature peak.

Edit:
I've just done some more searching in the log files and found the entries below, which I had missed earlier because the logs had rolled over
Code:
Feb 20 12:41:20 FreeNAS kernel: ix0: link state changed to DOWN
Feb 20 12:41:20 FreeNAS kernel: ix0: link state changed to DOWN
Feb 20 12:42:45 FreeNAS kernel: ix0: link state changed to UP
Feb 20 12:42:45 FreeNAS kernel: ix0: link state changed to UP

Maybe that is when I lost the connection?
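
To judge whether this link flap lines up with the moment the connection was lost, the length of the outage can be worked out from the log timestamps. A minimal Python sketch, assuming the syslog lines above are pasted in as text and that the year is 2020 (syslog does not record it); `link_outages` is a hypothetical helper written for this illustration, not a FreeNAS tool:

```python
from datetime import datetime

LOG = """\
Feb 20 12:41:20 FreeNAS kernel: ix0: link state changed to DOWN
Feb 20 12:41:20 FreeNAS kernel: ix0: link state changed to DOWN
Feb 20 12:42:45 FreeNAS kernel: ix0: link state changed to UP
Feb 20 12:42:45 FreeNAS kernel: ix0: link state changed to UP
"""

def link_outages(log_text, year=2020):
    """Return (interface, went_down, came_up, seconds) for each DOWN -> UP pair."""
    down_since = {}   # interface -> time of the first unmatched DOWN
    outages = []
    for line in log_text.splitlines():
        if "link state changed to" not in line:
            continue
        # "Feb 20 12:41:20" is the first 15 characters of a syslog line.
        stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        iface = line.split("kernel: ")[1].split(":")[0]
        state = line.rsplit(" ", 1)[1]
        if state == "DOWN":
            down_since.setdefault(iface, stamp)   # ignore duplicated lines
        elif state == "UP" and iface in down_since:
            start = down_since.pop(iface)
            outages.append((iface, start, stamp, int((stamp - start).total_seconds())))
    return outages

for iface, start, end, secs in link_outages(LOG):
    print(f"{iface}: down {start:%b %d %H:%M:%S} -> up {end:%H:%M:%S} ({secs} s)")
```

For the four lines above this reports a single outage on ix0 (the 10Gbit interface) lasting 85 seconds. The excerpt contains no matching entries for ix1, so by itself it does not account for both interfaces staying unreachable.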

Does anyone know what else I can investigate, please? I really don't like completely losing control of my NAS without any indication of what may have caused it...
 

Mastakilla

Is there anything I can do to get any help / advice at all? (Provide more / less / different info? Have more patience?)

Thanks!
 

Mastakilla

Patience didn't really help much either ;) Anyone please?
 

Mastakilla

Can the weird performance inconsistency perhaps be caused by the HBA overheating? My HBA gets airflow, but only very little. It sits about 13cm from the PSU fan (+-600rpm) and 30cm from the front fan (+-1200rpm, with HDDs in between).
Does an LSI SAS controller throttle when overheating or does it burn itself to death?

If the performance inconsistency is indeed caused by an overheating HBA, can that cause these other weird "unresponsive" issues with FreeNAS?
 

Mastakilla

Because of my suspicion of the overheating HBA, I wanted to replace its heatsink with a bigger one. But while removing the card (unplugging those cables is hard btw!) I actually damaged my HBA and broke off a resistor :'(

I did try to solder it back on, but my soldering skills are really bad and I'm afraid the pad to which the resistor should attach might also be ripped off. So I'm not sure if my extremely ugly solder job actually reconnected the resistor...
The card does still work... But now I trust it even less than before... So I didn't proceed with the heatsink swap and I'm searching for a replacement HBA now.... Finding a genuine one is pretty hard over here :(

Before all of this (and probably overheating the HBA)
1586701427144.png


My extremely bad solder job on the broken off resistor (I think maybe because of a missing pad)
1586701575992.png


I also created something out of wood to direct a fan straight at the HBA
1586701760872.png

The fan is a low rpm (+-1000rpm) 120mm fan. Would this be sufficient without replacing the heatsink? Or should I still replace the heatsink as well in case I don't want any high rpm or small fans in my case?
 

Mastakilla

I have a follow up for my monologue here ;)

As I didn't really trust my LSI controller anymore, I bought a new / second hand Dell H310. I replaced the thermal paste of it and placed a 120mm fan right at it (as in the picture above).

Then I re-ran the solnet array test script and although things have improved compared to the LSI, there are still some weird things happening...

The good:
  • Heatsink temp during a couple days non-stop stress seek testing is "ok to hold your finger on" (just), so very ok.
  • IOstat serial read = 228-242MB/sec, so every HDD has a normal / consistent speed. Also FreeNAS I/O report shows a nice and normal declining line. da1/da5 are slowest and da0/da4/da7 are fastest.
  • dd parallel read = 228-244MB/sec, so every HDD has a normal / consistent speed. Also FreeNAS I/O report shows a nice and normal declining line. da1/da5 are slowest and da0/da4/da7 are fastest.
  • So normal read speeds are very consistent for each HDD
  • The system didn't crash and didn't become unreachable

The bad:
  • Nothing really bad this time.

The ugly:
  • dd parallel seek-stress read has improved (a little?), but there still is weird behaviour and large inconsistency (30%).
  • Although the time for the parallel read is correctly measured by the solnet array test script (when comparing them to the FreeNAS I/O reporting graphs), it seems like something is wrong with the seek-stress read time measurement. The slowest HDD, according to the script, took 289132 seconds (3d8h) to complete, while that test actually took more than 5 days (see output below). Also the FreeNAS I/O reporting graphs confirm this huge and weird difference.
Code:
Performing initial parallel seek-stress array read
Sat Apr 25 14:33:46 CEST 2020
...
Awaiting completion: initial parallel seek-stress array read
Thu Apr 30 15:47:22 CEST 2020
Completed: initial parallel seek-stress array read

  • The parallel seek-stress read, according to the solnet array test script timers per HDD
    • The test per HDD took between 2d21h and 3d8h
    • da1/da4 are slowest and da0/da6 are fastest during dd parallel seek-stress read. (so different from normal read)
    • It still marks an HDD as "fast" and as "slow" in the output. If I understand the script correctly, it only does this when results "jump out" and are "not normal"?
    • Compared to the "LSI-run" from some months ago, it ran almost 25% faster.
  • The parallel seek-stress read, according to FreeNAS I/O reporting graphs per HDD
    • The test per HDD took between 3d22h and 5d2h (so an insane difference)
    • da5/da6/da3 are slowest and da4/da2/da0 are fastest during dd parallel seek-stress read. (so very different from all above)
    • Compared to the "LSI-run" from some months ago, it only ran about 5% faster.
    • The graph starts out pretty normal and consistent, but near the end there is again quite some variance with both I/O drops and I/O peaks. Exactly the same as with the LSI, but perhaps a little less extreme (it took a little less long and minimal speed was higher).
As I've now replaced the HBA and solved the temperature issues, I was actually hoping for even more consistency. It is better, but it still looks a little problematic, no?
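
The gap between the script's per-disk timer and the wall clock can be quantified from the two timestamps in the output above. A small Python sketch of that arithmetic; the variable names are only for illustration, and it says nothing about why the two disagree:

```python
from datetime import datetime, timedelta

# Timestamps printed by the script around the seek-stress pass (both CEST).
started = datetime(2020, 4, 25, 14, 33, 46)
finished = datetime(2020, 4, 30, 15, 47, 22)

# Slowest per-disk time reported by the script for the same pass.
slowest_reported_s = 289132

def fmt(seconds):
    """Format a number of seconds as e.g. '3d 8h 18m' (seconds are dropped)."""
    td = timedelta(seconds=seconds)
    hours, rem = divmod(td.seconds, 3600)
    return f"{td.days}d {hours}h {rem // 60}m"

wall_clock_s = int((finished - started).total_seconds())
print("wall clock      :", wall_clock_s, "s =", fmt(wall_clock_s))
print("slowest reported:", slowest_reported_s, "s =", fmt(slowest_reported_s))
print("unaccounted     :", fmt(wall_clock_s - slowest_reported_s))
```

This gives roughly 5d 1h of wall-clock time against 3d 8h reported for the slowest disk, so about 1d 17h of the pass is not covered by the per-disk timer.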

Below is the full output of the script, followed by screenshots of the FreeNAS I/O reporting graphs (you may ignore the first peak before 25 April, as this was from an aborted run).
Code:
Performing initial serial array read (baseline speeds)
Fri Apr 24 23:12:06 CEST 2020
Fri Apr 24 23:30:10 CEST 2020
Completed: initial serial array read (baseline speeds)

Array's average speed is 237.013 MB/sec per disk

Disk    Disk Size  MB/sec %ofAvg
------- ---------- ------ ------
da0      9537536MB    241    102
da1      9537536MB    228     96
da2      9537536MB    237    100
da3      9537536MB    240    101
da4      9537536MB    241    102
da5      9537536MB    231     97
da6      9537536MB    236     99
da7      9537536MB    242    102
Performing initial parallel array read
Fri Apr 24 23:30:10 CEST 2020
The disk da0 appears to be 9537536 MB.
Disk is reading at about 242 MB/sec
This suggests that this pass may take around 656 minutes

                   Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da0      9537536MB    241    244    101
da1      9537536MB    228    228    100
da2      9537536MB    237    240    101
da3      9537536MB    240    240    100
da4      9537536MB    241    241    100
da5      9537536MB    231    233    101
da6      9537536MB    236    236    100
da7      9537536MB    242    242    100

Awaiting completion: initial parallel array read
Sat Apr 25 14:33:46 CEST 2020
Completed: initial parallel array read

Disk's average time is 52391 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
da0        10000831348736   51051     97
da1        10000831348736   54216    103
da2        10000831348736   52273    100
da3        10000831348736   52096     99
da4        10000831348736   51460     98
da5        10000831348736   53648    102
da6        10000831348736   52694    101
da7        10000831348736   51690     99

Performing initial parallel seek-stress array read
Sat Apr 25 14:33:46 CEST 2020
The disk da0 appears to be 9537536 MB.
Disk is reading at about 227 MB/sec
This suggests that this pass may take around 699 minutes

                   Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da0      9537536MB    241    228     95
da1      9537536MB    228    209     92
da2      9537536MB    237    220     93
da3      9537536MB    240    225     94
da4      9537536MB    241    222     92
da5      9537536MB    231    213     92
da6      9537536MB    236    224     95
da7      9537536MB    242    224     93

Awaiting completion: initial parallel seek-stress array read
Thu Apr 30 15:47:22 CEST 2020
Completed: initial parallel seek-stress array read

Disk's average time is 267933 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
da0        10000831348736  248448     93
da1        10000831348736  288345    108 --SLOW--
da2        10000831348736  263132     98
da3        10000831348736  266320     99
da4        10000831348736  289132    108 --SLOW--
da5        10000831348736  270470    101
da6        10000831348736  252665     94
da7        10000831348736  264953     99
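
The per-disk figures in this last seek-stress table can be cross-checked with a few lines of Python. This is only a sketch that recomputes the average and each disk's share of it from the numbers in the table; it does not reproduce the script's own logic for flagging a disk as --SLOW--, which is not shown in this thread:

```python
# Per-disk seconds from the seek-stress table above.
seconds = {
    "da0": 248448, "da1": 288345, "da2": 263132, "da3": 266320,
    "da4": 289132, "da5": 270470, "da6": 252665, "da7": 264953,
}

mean = sum(seconds.values()) / len(seconds)
fastest = min(seconds, key=seconds.get)
slowest = max(seconds, key=seconds.get)

print(f"average: {mean:.0f} s per disk")
for disk, secs in seconds.items():
    print(f"{disk}  {secs:7d} s  {100 * secs / mean:5.1f}% of average")

# Spread between the fastest and slowest disk, relative to the fastest.
spread = 100 * (seconds[slowest] - seconds[fastest]) / seconds[fastest]
print(f"{fastest} -> {slowest}: {spread:.1f}% longer")
```

The recomputed average matches the 267933 seconds in the output, and by the script's own timers the slowest disk (da4) took about 16% longer than the fastest (da0).
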
1588322387806.png

1588322395255.png

1588322400830.png

1588322409312.png

1588322417113.png

1588322423924.png

1588322429220.png

1588322434142.png
 

dak180

Patron
Joined
Nov 22, 2017
Messages
310
Mastakilla said:
"As I didn't really trust my LSI controller anymore, I bought a new / second hand Dell H310. I replaced the thermal paste of it and placed a 120mm fan right at it (as in the picture above)."
You may want to check out my setup for cooling your HBA: specifically, the XSPC Wire Sensor 10K, the PCI Slot Bracket Three Fan Rack Mount Set and a Noctua NF-A9x14 29.7 CFM 92mm fan, combined with my Fan Control Tool, should give you dynamic control of the fans, including the one servicing the HBA. Your board supports an additional temp sensor; if you get the one I used and plug it in, you should be able to put the sensor in the gap between the heatsink and the chip, and it should be able to use that to get a good read on the temp.

This will also allow you to control your other case fans based on HD temp rather than CPU temp or steady state, though since you have a slightly different board than I do, you may have to adjust the IPMI raw commands in the config file; if this is the case, you should contact ASRock Rack support for info on the raw commands for controlling the fans.

With this setup I have not had to worry about temp issues (with very few exceptions) despite not having AC even in the summer.
 

Mastakilla

Thanks for the advice!

I can't really spot a header for attaching a temp sensor on my board though. Am I overlooking it?

Also, it isn't really clear to me how that fan mount set works or why it would be better / more stable than my current solution.
 

dak180

I had thought that the extra temp sensor was standard on all their server boards, but you are right, it is not on that one (maybe just on the Intel ones?), in which case you will need an external fan controller like a Corsair Commander Pro, even if it is a more complicated setup.
 