Replication "Partner Down" / Packet Errors on Mgmt Interface


Hello,

I'm seeing strange replication issues at two customer installations.

Installation 1:

- PS6110XS replicating to a PS4110E using Smart Replicas (HIT/VE)
- Both EQLs running firmware 6.0.1
- Both EQLs connected to a PC8024F switch stack, configured per Dell best practices (flow control, jumbo frames, PortFast, STP off)
- The Group Manager web interface suddenly shows "partner down" for some replications. The last replication is stuck "in progress"; other replications (to the same member) work fine.
- After a manual controller failover, "partner down" changes back to "in progress", but no data is transferred.

Installation 2:

- PS6110XV replicating to another PS6110XV using Smart Replicas (HIT/VE)
- Both EQLs running firmware 6.0.1
- Each EQL is connected to its own PC8024 switch stack; the stacks are connected via a 2x 10 Gbit LAG (~300 meters), configured per Dell best practices (flow control, jumbo frames, PortFast, STP off)
- The Group Manager web interface suddenly shows "partner down" for some replications. The last replication is stuck "in progress"; other replications (to the same member) work fine.
- After a manual controller failover, "partner down" changes back to "in progress", but no data is transferred.
- Massive packet errors (around 100 errors/minute) on all management interfaces, even after a manual controller failover. I changed the switch port to 100/full duplex; no change. The switches (HP 8212 and HP 2810) show no errors. I can't believe all four cables are broken.

Any Ideas?

Marcel Mertens

All Replies
  • You need to open a case, but the first thing is to make sure that DCB is disabled on the 8024 switch. That switch doesn't support the iSCSI features needed to fully support DCB on iSCSI SANs.

    -don

  • The case is already open, but at the moment it doesn't look like support has a clue.

    DCB is disabled on both the switch and the EQLs.

    We don't know whether the packet errors on the management interface are causing the "partner down" error.

    I changed the switch, the duplex settings, and the cables -> still packet errors.

    I also updated the replication site to 6.0.2. Still packet errors.

  • Packet errors on the mgmt interface won't impact replication at all. Replication is an iSCSI connection between members. Are you using the default VLAN on the 8024's? A WAN accelerator? Support should provide you with a script to ping FROM all member ports TO all member ports to make sure all ports are accessible.

    -don
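The port-to-port test support provides can be sketched as a small script that just generates the ping commands to run; the addresses below are hypothetical placeholders, and the flags are Linux-style (`-I` binds the source address, `-M do` sets don't-fragment so an MTU mismatch shows up as a failure):

```python
# All EQL eth-port addresses in both groups (hypothetical values -- replace
# with the real port IPs shown in Group Manager).
ports = ["10.0.10.11", "10.0.10.12", "10.0.10.21", "10.0.10.22"]

def ping_cmd(src, dst, payload=8972):
    """Build a jumbo-frame, don't-fragment ping from src to dst.
    An 8972-byte payload plus 28 bytes of headers makes one full
    9000-byte jumbo frame, so a path with a smaller MTU will drop it."""
    return f"ping -I {src} -M do -s {payload} -c 2 {dst}"

# Full mesh: every port pings every other port.
for src in ports:
    for dst in ports:
        if src != dst:
            print(ping_cmd(src, dst))
```

If any pair fails while others succeed, that points at a per-port or per-path problem (VLAN membership, MTU, LAG hashing) rather than a general outage.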

  • No, iSCSI is running on VLAN 10. No WAN accelerators. The replication site (~400 meters away) is connected via a 20 Gbit LAG (two PC8024 stacks, one two-switch stack per site).

    There are some strange things:

    Some replications work fine while others run into "partner down". I have to delete the complete replica set and start over. Sometimes it works for a few replications until it runs into "partner down" again.

    The replication destination always shows that the replication with "partner down" was successful.

    As you can see (source site), the replication from 14:31 is still "in progress" / replication status "partner down":


    Destination site shows that this replication is completed:

  • Have you disabled DCB on the 8024's?   You need to be at the most recent version to do this.  In the EQL GUI, is DCB enabled there and the VLAN set to 10?

    -don

  • DCB is disabled on the EQL.

    AFAIK there is no "DCB off" switch on the PC8024. Firmware is 5.0.0.4.

    PFC (priority flow control) is inactive.

  • There is a way to turn it off on the switch. Worst case, disable LLDP. Is the VLAN set to 10 in the EQL GUI?

    (I know it seems counter-intuitive when DCB is turned off.)
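Since DCBX is exchanged over LLDP, turning LLDP off on the storage-facing ports is the fallback. A rough sketch for the PC8024 CLI follows; the exact prompts and syntax vary by firmware, and the interface range here is only an example:

```
console# configure
console(config)# interface range tengigabitethernet 1/0/1-24
console(config-if)# no lldp transmit
console(config-if)# no lldp receive
console(config-if)# exit
```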

    -don

  • iSCSI VLAN 10 is untagged on all storage ports.

    See pictures:

    LLDP is active on all storage ports.

  • Hi guys,

    Did you ever find a resolution to this? We are seeing something similar.

    Replication works fine on particular volumes, but on others it stops with the message "partner down". We have logged a call with Dell, who at the moment are going through network troubleshooting tasks, but if some volumes are working it can't be a networking issue.

    Interestingly, if I cancel the replica creation and start it again, it stops at EXACTLY the same point each time.

  • Yep,

    very simple (in our case):

    On the destination site there must be a small amount of free space (at least 20-30 GB) left for the incoming replication. I had configured all of the array's space as delegated space. After reducing the delegated space by 20-30 GB, so that a small amount of free space remained, everything was fine.
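A minimal sketch of that adjustment, using a hypothetical destination-array capacity:

```python
# Hypothetical numbers (GB). Originally ALL usable space was delegated,
# leaving zero free pool space for incoming replication to use.
usable = 10_000        # usable capacity of the destination array (assumed)
headroom = 30          # small free reserve to leave outside delegated space

delegated = usable - headroom
print(delegated)   # 9970
```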

    You won't find this in the manual; it is in the release notes of the V6 firmware. It took a while for EQL support to figure this out.

    I hope this will help you...

  • They say 5%, or 100 GB to 200 GB, of free pool space is needed as a best practice, right? You must also have plenty of space to test failover of all volumes. If HQ is 14.4 TB raw, figure a bare minimum of >=28.8 TB at the DR site to fire up cloned volumes while leaving the original replicas alone when using VMware SRM. I'd give the DR site 3x more raw capacity than the HQ site, factoring in dedicated SRM volumes, etc.
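As a quick worked example of those rules of thumb (the 14.4 TB figure is from the post above; the multipliers are the thread's suggestions, not an official sizing formula):

```python
primary_raw_tb = 14.4                     # raw capacity at the HQ site

# Bare minimum at DR: room for the replicas plus cloned test-failover volumes.
dr_minimum_tb = 2 * primary_raw_tb
# More comfortable rule of thumb: 3x raw at the DR site (SRM volumes, growth).
dr_recommended_tb = 3 * primary_raw_tb

print(round(dr_minimum_tb, 1), round(dr_recommended_tb, 1))   # 28.8 43.2
```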

  • You need more raw space at the recovery site than at the primary; how much more depends on your restore points and replication schedule, but double the primary site's size is a good deal.

    None of our customers is currently using replication over a WAN link. The most common use case is two datacenters on the same campus, replicating between them.

    The application servers (mostly VMware hosts) are spread across both datacenters, so in case of a failover you just have to bring up the replicas. No use case for SRM.

  • Resolved issues have always been listed in the firmware release notes for each version, although the fix list is not complete and not every symptom may be listed. 6.0.4 and later have addressed replication issues you should review.

    A "partner down" message can also mean just that: you don't have great communication between the groups, or there is too much going on.

    You should review your logs for replication start and completion times. You may need to spread out your replication schedules.