Hello. We have a 2-node file server cluster (2 virtual machines running WS2016 — I know, not officially supported by the HIT Kit yet) running on a WS2016 Hyper-V host. Both VMs connect directly to a couple of EQL LUNs via iSCSI (2 iSCSI vNICs per VM), and they use MPIO via HIT Kit 4.9.
We're experiencing frequent random disconnects from the EQL. The file server usually keeps working thanks to MPIO, but twice it has failed (I assume both paths dropped at the same time, so the disk wasn't visible to the file server node and the resource failed).
Events like this one are logged on the EQL group:
Info 30.1.2017 10:31:16 EQL-PS6110X 7.2.15 | 7.2.24 | 7.2.29 iSCSI session to target '192.168.130.135:3260, iqn.2001-05.com.equallogic:0-1cb196-df4f63705-aa40060ffcc582ac-<LUN_name>' from initiator '192.168.130.165:63706, iqn.1991-05.com.microsoft:<computer_FQDN>' was closed. | iSCSI initiator connection failure. | No response on connection for 6 seconds.
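For anyone trying to correlate how often these drops happen and to which initiators, a quick tally from an exported event log can help. This is a hypothetical sketch, not an EQL tool — the regex follows the layout of the sample line above and may need adjusting for your export format:

```python
import re
from collections import Counter

# Matches the initiator IQN in EQL "session closed" events.
# Field layout mirrors the sample event line above (adjust as needed).
PATTERN = re.compile(
    r"from initiator '(?P<ip>[\d.]+):(?P<port>\d+), (?P<iqn>[^']+)' was closed"
)

def tally_disconnects(lines):
    """Count closed-session events per initiator IQN."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            counts[m.group("iqn")] += 1
    return counts

sample = [
    "iSCSI session to target '192.168.130.135:3260, "
    "iqn.2001-05.com.equallogic:0-1cb196-df4f63705-aa40060ffcc582ac-<LUN_name>' "
    "from initiator '192.168.130.165:63706, "
    "iqn.1991-05.com.microsoft:<computer_FQDN>' was closed.",
]
print(tally_disconnects(sample))
```

Feeding it the full exported log would show whether the drops cluster on one vNIC/initiator or are spread across both paths.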
About 20 seconds later, the session is re-established.
This would indicate a networking problem, but it also happens while I'm logged on to the virtual server and pinging the group iSCSI IP without any packet loss.
Some of the sessions haven't been interrupted for a couple of days (since the last reboot); others drop every few minutes or hours.
Are there any special requirements when connecting to an EQL via iSCSI from virtual machines? Btw, we observed similar behavior on a WS2012R2 virtual file server cluster, so it's not just the WS2016 one.
No, there are no special requirements for a VM vs. a physical server. That "no response" means the VM did not respond to a keep-alive packet. Per the iSCSI spec, when those repeatedly fail, the connection must be torn down and re-established.
Sometimes a lack of flow control on the switch can cause this problem.
I would strongly suggest you open a support case. They'll need the array diags and switch configuration information.
Social Media and Community Professional | #IWork4Dell
Get Support on Twitter - @dellcarespro
Hm, AFAIK flow control is enabled on the switches, and none of the physical hosts is experiencing these issues — it's only the VMs.
I would suggest you open a support case to better triage this issue.