SMB client errors after a cluster node reboots

In a hyper-converged cluster built on Dell EMC Microsoft Storage Spaces Direct Ready Nodes with Dell EMC PowerEdge R740xd servers and Mellanox CX4 LX adapters for storage traffic, you may see SMB client errors (event ID 30803) in Windows Event Viewer (Applications and Services Logs -> Microsoft -> Windows -> SMBClient -> Connectivity) when a cluster node reboots.


Figure 1 - SMB client errors in Windows Event Viewer

While this is normal in a failover cluster during a node reboot, you may occasionally see these errors reappear on the cluster nodes at regular intervals even after all cluster nodes are fully functional. This behavior is caused by a failure to create the SMB listeners for every storage interface on the node that restarted. The errors appear on the surviving nodes in the cluster, not on the node that restarted. The error description names the server to which the SMB client is trying to connect, and the Server Address in the description identifies the node that just restarted.
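
To check whether the errors keep recurring on a surviving node, the Connectivity log can also be queried from PowerShell. The following is a minimal sketch, not part of the original procedure. It assumes that the Event Viewer path above corresponds to the channel name Microsoft-Windows-SmbClient/Connectivity; the function name and the 24-hour look-back window are illustrative choices.

```powershell
# Hypothetical helper (not from the original article): lists recent
# SMB client event ID 30803 entries on the local node.
function Get-SmbClientConnectError {
    param(
        [int]$LookbackHours = 24   # arbitrary default look-back window
    )

    $filter = @{
        # Assumed channel name for "SMBClient -> Connectivity" in Event Viewer
        LogName   = 'Microsoft-Windows-SmbClient/Connectivity'
        Id        = 30803
        StartTime = (Get-Date).AddHours(-$LookbackHours)
    }

    # Get-WinEvent writes an error when no events match; suppress it so an
    # empty result simply returns nothing.
    Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue |
        Sort-Object TimeCreated |
        Select-Object TimeCreated, Id, Message
}
```

Running Get-SmbClientConnectError on a surviving node and comparing the TimeCreated values gives a quick view of whether the errors repeat at a regular interval.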

When the cluster nodes are functioning normally after a node reboot, running netstat -xan shows an IPv4 and an IPv6 listener for every storage interface on the node. The following netstat.exe output was gathered on a node with two storage adapters.

Active NetworkDirect Connections, Listeners, SharedEndpoints
  Mode    IfIndex  Type        Local Address                       Foreign Address       PID
  Kernel        4  Connection  10.128.100.101:445                  10.128.100.100:61476  0
  Kernel        4  Connection  10.128.100.101:445                  10.128.100.100:62244  0
  Kernel        4  Connection  10.128.100.101:445                  10.128.100.100:61988  0
  Kernel        4  Connection  10.128.100.101:445                  10.128.100.100:62756  0
  Kernel        4  Connection  10.128.100.101:12541                10.128.100.100:445    0
  Kernel        4  Connection  10.128.100.101:12797                10.128.100.100:445    0
  Kernel        4  Connection  10.128.100.101:14077                10.128.100.100:445    0
  Kernel        4  Connection  10.128.100.101:14333                10.128.100.100:445    0
  Kernel       14  Connection  10.128.100.133:445                  10.128.100.132:27454  0
  Kernel       14  Connection  10.128.100.133:445                  10.128.100.132:27198  0
  Kernel       14  Connection  10.128.100.133:2375                 10.128.100.132:445    0
  Kernel       14  Connection  10.128.100.133:62535                10.128.100.132:445    0
  Kernel       14  Connection  10.128.100.133:62791                10.128.100.132:445    0
  Kernel       14  Connection  10.128.100.133:64071                10.128.100.132:445    0
  Kernel       14  Connection  10.128.100.133:64327                10.128.100.132:445    0
  Kernel        4  Listener    [fe80::4cae:cb05:4932:f226%4]:445   NA                    0
  Kernel        4  Listener    10.128.100.101:445                  NA                    0
  Kernel       14  Listener    10.128.100.133:445                  NA                    0
  Kernel       14  Listener    [fe80::5180:55b6:c0f0:ae8d%14]:445  NA                    0

However, when the SMB client errors start appearing in the cluster, the node that rebooted may be missing the listeners for one or more of its storage interfaces. In the following output, interface index 14 has no listener.

Active NetworkDirect Connections, Listeners, SharedEndpoints
  Mode    IfIndex  Type        Local Address                       Foreign Address       PID
  Kernel        4  Connection  10.128.100.101:445                  10.128.100.100:61476  0
  Kernel        4  Connection  10.128.100.101:445                  10.128.100.100:62244  0
  Kernel        4  Connection  10.128.100.101:445                  10.128.100.100:61988  0
  Kernel        4  Connection  10.128.100.101:445                  10.128.100.100:62756  0
  Kernel        4  Connection  10.128.100.101:12541                10.128.100.100:445    0
  Kernel        4  Connection  10.128.100.101:12797                10.128.100.100:445    0
  Kernel        4  Connection  10.128.100.101:14077                10.128.100.100:445    0
  Kernel        4  Connection  10.128.100.101:14333                10.128.100.100:445    0
  Kernel       14  Connection  10.128.100.133:2375                 10.128.100.132:445    0
  Kernel       14  Connection  10.128.100.133:62535                10.128.100.132:445    0
  Kernel       14  Connection  10.128.100.133:62791                10.128.100.132:445    0
  Kernel       14  Connection  10.128.100.133:64071                10.128.100.132:445    0
  Kernel       14  Connection  10.128.100.133:64327                10.128.100.132:445    0
  Kernel        4  Listener    [fe80::4cae:cb05:4932:f226%4]:445   NA                    0
  Kernel        4  Listener    10.128.100.101:445                  NA                    0

Therefore, in the above example, an SMB client that attempts to connect to interface index 14 is eventually refused, which results in the RDMA-related SMB client errors (event ID 30803) shown in Figure 1.
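
Comparing listeners against connections by eye is error-prone on nodes with many connections. The following is a minimal sketch of how the comparison could be scripted; the function name Get-InterfaceWithoutListener is hypothetical, and the parsing assumes the netstat -xan row layout shown in the outputs above. It only considers interface indexes that appear in the output, so an interface with neither connections nor listeners is not reported.

```powershell
# Hypothetical helper (not from the original article): returns interface
# indexes that appear in `netstat -xan` output without any Listener row.
function Get-InterfaceWithoutListener {
    param(
        [string[]]$NetstatLine   # raw output lines from netstat -xan
    )

    $seen      = @{}   # ifIndex -> $true for every Connection or Listener row
    $listening = @{}   # ifIndex -> $true for every Listener row

    foreach ($line in $NetstatLine) {
        # Data rows look like: "Kernel  14 Listener  10.128.100.133:445  NA  0"
        if ($line -match '^\s*\S+\s+(\d+)\s+(Connection|Listener)\s') {
            $ifIndex = [int]$Matches[1]
            $seen[$ifIndex] = $true
            if ($Matches[2] -eq 'Listener') { $listening[$ifIndex] = $true }
        }
    }

    # Indexes that were seen but never had a Listener row, in ascending order
    $seen.Keys | Where-Object { -not $listening.ContainsKey($_) } | Sort-Object
}

# Usage on the node that restarted:
#   Get-InterfaceWithoutListener -NetstatLine (netstat -xan)
```

For the second netstat listing above, where interface index 14 has connections but no listener, this returns 14.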

Impact

The Dell EMC Microsoft Ready Node network architecture recommends two storage adapters per cluster node, so cluster functionality is not disrupted when this issue occurs. The adapter that is missing a listener can still be used to send RDMA traffic. However, because there is no listener on that adapter, writes using RDMA cannot be performed on it, and it falls back to TCP for writes and for receiving traffic. This may result in lower write performance, depending on the workload. There is no data loss or loss of functionality when this issue occurs.

Where is the issue?

This has been identified as a bug in the Mellanox CX4 LX WinOF-2 driver, version 1.70 and earlier.

Steps to remediate

The SMB listener can be recreated by restarting the virtual storage adapter that has no associated SMB listener after a reboot. You can identify the right virtual adapter to restart by following the steps outlined below.

Identify the adapter based on the interface index

The netstat -xan output shows that a listener is missing for one of the storage adapters. To find the adapter that corresponds to the affected interface index, use the Get-NetAdapter cmdlet.

PS C:\> Get-NetAdapter
Name                    InterfaceDescription                    ifIndex Status       MacAddress         LinkSpeed
----                    --------------------                    ------- ------       ----------         ---------
vEthernet (Storage2)    Hyper-V Virtual Ethernet Adapter #3          14 Up           00-15-5D-09-C4-02  10 Gbps
vEthernet (Storage1)    Hyper-V Virtual Ethernet Adapter #2           4 Up           00-15-5D-09-C4-01  10 Gbps
vEthernet (Management)  Hyper-V Virtual Ethernet Adapter             10 Up           00-15-5D-09-C4-00  10 Gbps
Ethernet                Remote NDIS Compatible Device                 9 Not Present  50-9A-4C-A7-F9-DF  0 bps
NIC2                    Intel(R) Ethernet 10G X710 rNDC               6 Disconnected 24-6E-96-52-CC-A4  10 Gbps
NIC4                    Intel(R) I350 Gigabit Network Connec...      15 Disconnected 24-6E-96-52-CC-C3  0 bps
NIC3                    Intel(R) I350 Gigabit Network Conn...#2       8 Disconnected 24-6E-96-52-CC-C2  0 bps
NIC1                    Intel(R) Ethernet 10G 4P X710/I350 rNDC      13 Disconnected 24-6E-96-52-CC-A2  10 Gbps
SLOT 1 Port 2           Mellanox ConnectX-4 Lx Ethernet Ad...#2       2 Up           24-8A-07-59-4C-69  10 Gbps
SLOT 1 Port 1           Mellanox ConnectX-4 Lx Ethernet Adapter      11 Up           24-8A-07-59-4C-68  10 Gbps

Identify and restart the interface with no associated listener

In the netstat -xan output shown above, the interface with index 14 has no listener associated with it. The Get-NetAdapter output shows that interface index 14 is the virtual adapter vEthernet (Storage2).

Note: This network adapter name may be different based on how you have named storage adapters in the management OS.
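
Instead of scanning the full adapter list, the lookup can be narrowed to a single interface index. The following is a minimal sketch; the function name Get-AdapterRdmaByIndex is hypothetical, and the index passed to it (14 in this example) is whatever index netstat -xan showed without a listener on your node.

```powershell
# Hypothetical helper (not from the original article): resolves an interface
# index from netstat -xan to the adapter name and its current RDMA setting.
function Get-AdapterRdmaByIndex {
    param(
        [uint32]$IfIndex
    )

    # Fails with a terminating error if no adapter has this interface index
    $adapter = Get-NetAdapter -InterfaceIndex $IfIndex -ErrorAction Stop

    Get-NetAdapterRdma -Name $adapter.Name |
        Select-Object Name, InterfaceDescription, Enabled
}

# Usage:
#   Get-AdapterRdmaByIndex -IfIndex 14
```

The Name value in the result is the adapter name to use in the remediation commands.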

You can now disable and then re-enable RDMA on the interface with the missing listener.

Disable-NetAdapterRdma -Name 'vEthernet (Storage2)'
Enable-NetAdapterRdma -Name 'vEthernet (Storage2)'

Once this process is complete, run netstat -xan to confirm that the listener has been created. This may take a few minutes. Once the listener exists, the cluster nodes communicate normally over RDMA and new SMB client errors stop appearing in the event viewer.
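
Because the listener can take a few minutes to appear, the check can be repeated in a loop rather than by hand. The following is a minimal sketch; the function name Wait-RdmaListener, the 300-second timeout, and the 10-second polling interval are assumptions chosen for illustration, and the pattern assumes the netstat -xan row layout shown earlier.

```powershell
# Hypothetical helper (not from the original article): polls netstat -xan
# until a Listener row exists for the given interface index, or times out.
function Wait-RdmaListener {
    param(
        [int]$IfIndex,
        [int]$TimeoutSeconds = 300,   # assumed upper bound for "a few minutes"
        [int]$PollSeconds    = 10,    # assumed polling interval
        # Source of netstat output; replaceable for dry runs
        [scriptblock]$GetOutput = { netstat -xan }
    )

    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    $pattern  = "^\s*\S+\s+${IfIndex}\s+Listener\s"

    do {
        # -match on an array returns the matching lines; any match means success
        if (@(& $GetOutput) -match $pattern) { return $true }
        Start-Sleep -Seconds $PollSeconds
    } while ((Get-Date) -lt $deadline)

    return $false
}

# Usage after the Disable/Enable-NetAdapterRdma commands:
#   Wait-RdmaListener -IfIndex 14
```

The function returns $true as soon as a listener for the interface is present and $false if none appears before the timeout, in which case the output of netstat -xan should be reviewed manually.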