Replication issue (routing I think) - General Discussion (Retired/Read Only) - TechCenter Extras - Dell Community


  • We have three PS4000s, one in each of our three offices, and we are trying to set up replication.
    The offices are linked over VPN, using WatchGuard firewalls.

    Office 1:
    Computer/VM network: 10.0.0.0/255.255.0.0
    Storage Network: 192.168.99.0/255.255.255.192

    Office 2:
    Computer/VM network: 10.3.0.0/255.255.0.0
    Storage Network: 192.168.99.64/255.255.255.192

    Office 3:
    Computer/VM network: 10.4.0.0/255.255.0.0
    Storage Network: 192.168.99.192/255.255.255.192


    There are tunnels set up between the offices for the storage networks.
    If I log in to my ESXi boxes, I can ping the hosts in the other offices.

    I set up replication between office 1 and office 2, and it worked fine for a couple of weeks. Data replicated without issue. It was one-way only, from office 1 to office 2.
    One night, at 6am, I got an email alert saying replication had failed.
    Errors in the event logs from office1 say:
    Partner crcphgroup: No connection could be established. Verify that the partner IP address is correct.

    Errors in the log in office2:
    Partner CRCGroup: iSCSI: login timed out. Make sure the partner IP address is correct and reachable.

    If I SSH into the EqualLogic in office 1, I can ping the group IP (192.168.99.102) of the unit in office 2.
    If I SSH into the EqualLogic in office 2, I can NOT ping the group IP (192.168.99.6) of the unit in office 1. Traceroute says no route found.

    On unit 1, the default gateway is set to 192.168.99.50, which is the IP of the WatchGuard.
    On unit 2, the default gateway is set to 192.168.99.125, which is the IP of the WatchGuard.

    I tried manually adding a route on the EqualLogic in office 2 (route -add -net 192.168.99.0 -netmask 255.255.255.192 192.168.99.125). It appears to add, but when I run route show, I don't see it.

    Any thoughts? I don't know why it broke suddenly.
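
As a sanity check on the addressing in the post above, here is a minimal Python sketch that confirms each quoted group IP and gateway sits inside its own office's storage subnet and that the three /26 ranges do not overlap. The subnets and addresses are taken from the post; the function names (`owning_office`, `overlapping_pairs`) are made up for illustration and are not part of any EqualLogic or WatchGuard tooling.

```python
import ipaddress
from itertools import combinations

# Storage subnets exactly as quoted in the post (netmask 255.255.255.192 = /26).
STORAGE_NETS = {
    "office1": ipaddress.ip_network("192.168.99.0/255.255.255.192"),
    "office2": ipaddress.ip_network("192.168.99.64/255.255.255.192"),
    "office3": ipaddress.ip_network("192.168.99.192/255.255.255.192"),
}

# Group IPs and default gateways quoted in the post (office 3 is not given).
ADDRESSES = {
    "office1": {"group_ip": "192.168.99.6", "gateway": "192.168.99.50"},
    "office2": {"group_ip": "192.168.99.102", "gateway": "192.168.99.125"},
}


def owning_office(ip):
    """Return the office whose storage subnet contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    for office, net in STORAGE_NETS.items():
        if addr in net:
            return office
    return None


def overlapping_pairs():
    """Return every pair of offices whose storage subnets overlap."""
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(STORAGE_NETS.items(), 2)
        if net_a.overlaps(net_b)
    ]


if __name__ == "__main__":
    for office, addrs in ADDRESSES.items():
        for role, ip in addrs.items():
            print(office, role, ip, "is in", owning_office(ip))
    print("overlapping subnets:", overlapping_pairs())
```

If every address maps to its own office and no subnets overlap, the addressing plan itself is consistent, which points toward routing state rather than a subnetting mistake.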
  • The array uses a feature of the iSCSI specification called redirection, which defines how iSCSI targets connect to each other. With replication, the connections are made member to member. When Group A requests the initial connection, the request is sent to Group B's WKA (Well Known Address, a.k.a. the Group IP address). Once the request is acknowledged, Group B redirects the connection to one of the physical ETH ports in Group B.

    For redirection to work, both groups must be able to ping (and traceroute) the Group IP address AND ALL the physical ETH port IP addresses of all the members in the group at both locations.

    Verify this from each member in Group A and each member in Group B. Note, you need to test each interface from Group A to each interface in Group B.

    Usage: ping "-I <sourceIP> <destinationIP>"
    The sourceIP is the IP address of a specific array ETH port. This is done from a group prompt after logging into the array. The quotes are needed from the group prompt.

    Example from one network '10.1' to another '10.3' for checking the connectivity for replication:
    groupname>ping " -I 10.1.20.11 10.3.20.100"

    Example of a local test, pinging member to member in the same group:
    groupname>ping " -I 10.1.20.11 10.1.20.100"
    Note: the switch above is a capital "I" (eye). Ping without -I will only use eth0.

    Trace Route:
    To actually traceroute to an address:
    GrpName>support traceroute "192.168.3.5"

    To traceroute out of a specific ETH port interface add a switch to choose the interface IP as shown below:
    GrpName>support traceroute -s [ETH port source IP] [destination IP to traceroute to]
    Note, do this for each source interface to each destination interface.

    Also, ensure you have ICMP and TCP Port 3260 (iSCSI) configured to be routable.
    -Joe

    Social Media and Community Professional
    #IWork4Dell
    Get Support on Twitter - @dellcarespro

    Follow me on Twitter: @joesatdell 
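
Testing every source interface against every destination address gets tedious by hand, so here is a minimal Python sketch that just prints the commands to paste at the group prompt. It only builds strings in the `ping "-I ..."` and `support traceroute -s ...` forms quoted in the reply above; it does not talk to the array. The two group IPs come from the thread, while the ETH port addresses in the example are hypothetical placeholders to be replaced with your own.

```python
from itertools import product


def connectivity_commands(source_ips, dest_ips):
    """Build one ping and one traceroute command per (source, destination) pair.

    source_ips: ETH port IPs on the group you are logged in to.
    dest_ips:   group IP plus every ETH port IP on the partner group.
    """
    commands = []
    for src, dst in product(source_ips, dest_ips):
        commands.append(f'ping "-I {src} {dst}"')
        commands.append(f"support traceroute -s {src} {dst}")
    return commands


if __name__ == "__main__":
    # Hypothetical ETH port IPs for the office 1 member (not from the thread).
    office1_ports = ["192.168.99.11", "192.168.99.12"]
    # Office 2 group IP (from the thread) plus hypothetical ETH port IPs.
    office2_targets = ["192.168.99.102", "192.168.99.111", "192.168.99.112"]

    for cmd in connectivity_commands(office1_ports, office2_targets):
        print(cmd)
```

Run it once per direction, swapping the source and destination lists, since in this thread the failure was visible only from office 2 toward office 1.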

  • As a follow-up, it turned out to be an issue with the EqualLogic PS4000 in one of the offices.
    After trying all the above and other steps, I tried restarting the PS4000, and it immediately started working.

    My suspicion is that when we had a switch go bad in office 2 a while back, it somehow corrupted the routing table on the PS4000 in office 1. Even though the faulty switch was removed, the PS4000 still held the wrong routing info. Restarting it wiped that, and replication resumed immediately.
  • Joe - Sorry to resurrect an old thread. I have a very similar issue: I changed the default gateway for a group, but the members keep talking to the old gateway. EqualLogic support told me the only supported way to update the gateway was to reboot the members...

    Rebooting the members would take me hours of work, and impact performance more than the problem is causing now...

    Any thoughts would be more than welcome

  • The members do require a controller restart to refresh the changed route (this is akin to a controller failover), and it needs to be done for each member. A failover typically takes only 15-30 seconds per member.

    Provided that each host's iSCSI disk timeouts are set up properly, the hosts should "ride out" the failover without issue. To check your settings, go to the support site and select the firmware version currently running on your array(s); you should see a link for "iSCSI Initiator and Operating System Considerations". This document lists the iSCSI disk timeout values for all operating systems.

    Even with that said, I would plan on this during low I/O, or a maintenance period.

    -joe
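
To put the 15-30 seconds per member in perspective, here is a rough Python sketch of the arithmetic, assuming members are restarted one at a time. The failover range comes from the reply above; the member count and the disk timeout passed in are hypothetical inputs, and the real recommended timeout values should be taken from the "iSCSI Initiator and Operating System Considerations" document rather than from this example.

```python
# Typical controller failover time per member, in seconds (from the reply above).
FAILOVER_SECONDS = (15, 30)


def failover_window(member_count):
    """Return (best, worst) total seconds spent in failover across all members,
    assuming the members are restarted one after another."""
    low, high = FAILOVER_SECONDS
    return member_count * low, member_count * high


def timeout_rides_out_failover(disk_timeout_seconds):
    """True if a host's disk timeout exceeds the worst-case single failover."""
    return disk_timeout_seconds > FAILOVER_SECONDS[1]


if __name__ == "__main__":
    members = 4          # hypothetical group size
    host_timeout = 60    # hypothetical host disk timeout in seconds
    print("total failover time (s):", failover_window(members))
    print("host rides out a failover:", timeout_rides_out_failover(host_timeout))
```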

  • Joe - I was hoping you wouldn't say that. Our policy is to make sure there are no connections into a member before performing any work on it, so restarting a member is a big deal.

    That said, I'd be more than happy to shut down the old gateway for a few minutes during maintenance. That would be much less disruptive. Would the members update their tables upon discovering that the old gateway no longer exists?

    Thanks,

        Ed