Lost connection to MD3000 when performing a controller replacement - PowerVault Storage Forum - Storage - Dell Community


  • Hi All,

    I have an MD3000 with two controllers. A while ago I checked the health status in Modular Disk Storage Manager and saw that one controller needed replacement, so I replaced it today and rebooted the system because the error persisted.

    After the reboot I completely lost connection to the device.

    These are the observations I made:

    1. When the array powers up, ping to the management interface of one of the controllers is OK.

    2. After some time this management interface stops answering pings.

    3. I tried with the replacement controller removed as well, with the same result.

    I can ping the iSCSI interfaces, but it looks like this software will not let me add the array manually using those IP addresses, and of course not using the management interface either.

    So practically I have a huge amount of data completely lost, because the array is inaccessible, what a mess!!

    Is there any way I can still access this data??

    PLEASE HELP:...

    Thanks! javi

  • Hello Javi,

    When you added the replacement controller did you let it sync with the controller that was in your MD3000? If you use the serial cable that comes with your MD3000, then we can see if the controller is able to boot fully, or if it is getting stopped at some point.

    Start up a terminal emulation program like PuTTY, Tera Term, minicom or HyperTerminal using these terminal settings (115200-8-n-1).

    Pull the controller from the system and wait about a minute, then insert it back into your MD3000i and you should see the controller's boot process.

    Please let us know if you have any other questions.

    DELL-Sam L
    Dell | Social Outreach Services - Enterprise
    Download the Dell Quick Resource Locator app today to access PowerEdge support content on your mobile device! (iOS, Android, Windows)
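
For anyone who prefers to script the capture instead of using a terminal emulator, the settings string above (115200-8-n-1) can be read as baud rate, data bits, parity and stop bits. The following is only a minimal sketch: `SerialSettings`, `parse_settings` and `open_console` are hypothetical names made up for illustration, and `open_console` assumes the third-party pyserial package is installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SerialSettings:
    baudrate: int
    bytesize: int
    parity: str
    stopbits: float


def parse_settings(spec: str) -> SerialSettings:
    """Split a 'baud-databits-parity-stopbits' string such as '115200-8-n-1'."""
    baud, bits, parity, stop = spec.split("-")
    return SerialSettings(int(baud), int(bits), parity.upper(), float(stop))


def open_console(port: str, spec: str = "115200-8-n-1"):
    """Open the serial console. Assumes the third-party pyserial package."""
    import serial  # imported lazily so parse_settings works without pyserial

    s = parse_settings(spec)
    return serial.Serial(
        port,
        baudrate=s.baudrate,
        bytesize=s.bytesize,
        parity=s.parity,
        stopbits=s.stopbits,
        timeout=1,
    )


if __name__ == "__main__":
    print(parse_settings("115200-8-n-1"))
```

The port name passed to `open_console` depends on the machine (for example a COM port on Windows or a tty device on Linux), so it is left to the caller here.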

  • Hi! Thanks for the help. Every time I boot up the controller the same thing happens. As you can see, at the end of the log there is this message: "Exception: Data Abort cpsr:  60000013   pc:  0x". After this, there is no connection to the management interface. Here is the full log:

    -=<###>=-
    Attaching interface lo0... done

    Adding 9767 symbols for standalone.
    Error
    09/20/17-10:11:21 (GMT) (tRootTask): NOTE:  I2C transaction returned 0x0423fe00




    Reset, Power-Up Diagnostics - Loop 1 of 1
    3600 Processor DRAM
         01 Data lines                                                  Passed
         02 Address lines                                               Passed
    3300 NVSRAM
         01 Data lines                                                  Passed
    5900 Ethernet 91c111 #1
         01 Register read                                               Passed
         02 Register test                                               Passed
    3A00 NAND Flash
         06 Bad Blocks Test                                             Passed
    2310 Application Accelerator Unit
         01 AAU Register Test                                           Passed
    6D00 LSI SAS 1068 IOC--Base Board
         01 IOC Register Read Test                                      Passed
         02 IOC Register Address Lines Test                             Passed
         03 IOC Register Data Lines Test                                Passed
    6F01 QLOGIC EP4032 CHIP 0
         01 Register Read Test                                          Passed
         02 Register Address Lines Test                                 Passed
         03 Register Data Lines Test                                    Passed
    3900 Real-Time Clock
         01 RT Clock Tick                                               Passed
    Diagnostic Manager exited normally.


    Current date: 09/20/17  time: 02:10:07

    Send <BREAK> for Service Interface or baud rate change
    09/20/17-10:11:40 (GMT) (tRAID): NOTE:  Set Powerup State
    09/20/17-10:11:40 (GMT) (tRAID): NOTE:  SOD Sequence is Normal, 0
    09/20/17-10:11:40 (GMT) (tRAID): NOTE:  SOD: removed SAS host from index 0
    09/20/17-10:11:40 (GMT) (tRAID): NOTE:  In iscsiIOQLIscsiInitDq.  iscsiIoFstrBas
    e = 0x0
    09/20/17-10:11:40 (GMT) (tRAID): NOTE:  Turning on tray summary fault LED
    09/20/17-10:11:42 (GMT) (tRAID): NOTE:  SYMBOL: SYMbolAPI registered.
    09/20/17-10:11:42 (GMT) (tRAID): NOTE:  lost persistent dq data because buffer w
    as modified or size changed.
    esmc0: LinkUp event
    09/20/17-10:11:43 (GMT) (tNetCfgInit): NOTE:  Network Ready
    09/20/17-10:11:46 (GMT) (tRAID): NOTE:  Initiating Drive channel: ioc:0 bringup
    09/20/17-10:11:48 (GMT) (tRAID): NOTE:  IOC Firmware Version: 00-24-63-00
    09/20/17-10:11:56 (GMT) (tSasEvtWkr): NOTE:  sasIocPhyUp: chan:1 phy:0 prevNumAc
    tivePhys:2 numActivePhys:2
    09/20/17-10:11:56 (GMT) (tSasEvtWkr): NOTE:  sasIocPhyUp: chan:1 phy:1 prevNumAc
    tivePhys:2 numActivePhys:2
    09/20/17-10:12:06 (GMT) (tRAID): NOTE:  IonMgr: Drive Interface Enabled
    09/20/17-10:12:06 (GMT) (tRAID): NOTE:  SOD: Instantiation Phase Complete
    09/20/17-10:12:06 (GMT) (tRAID): WARN:  No attempt made to open Inter-Controller
     Communication Channels
    09/20/17-10:12:06 (GMT) (tRAID): NOTE:  Failing The Alternate Controller
    09/20/17-10:12:06 (GMT) (tRAID): WARN:  Alt Ctl Reboot:
                                    Reboot CompID: 0x401
                                    Reboot reason: 0x6
                                    Reboot reason extra: 0x0
    09/20/17-10:12:06 (GMT) (tRAID): NOTE:  holding alt ctl in reset
    09/20/17-10:12:06 (GMT) (tRAID): NOTE:  LockMgr Role is Master
    09/20/17-10:12:06 (GMT) (tRAID): WARN:  FBM:validateSubModel: Exception - Alt co
    ntroller not ready
    09/20/17-10:12:06 (GMT) (tSasDiscCom): NOTE:  SAS Discovery complete task spawne
    d
    09/20/17-10:12:07 (GMT) (tRAID): NOTE:  spmEarlyData: No data available
    09/20/17-10:12:07 (GMT) (sasCheckExpanderSet): NOTE:  Expander Firmware Version:
     0116-e05c
    09/20/17-10:12:07 (GMT) (sasCheckExpanderSet): NOTE:  Expander SAS address: Hi =
     x5a4badb4 Low = x4e0f0f10
    09/20/17-10:12:12 (GMT) (tSasDiscCom): WARN:  SAS: Initial Discovery Complete Ti
    me: 30 seconds
    09/20/17-10:12:12 (GMT) (tRAID): NOTE:  WWN baseName 0004a4ba-db4e0c98 (valid==>
    SigMatch)
    09/20/17-10:12:12 (GMT) (tRAID): NOTE:  IonMgr: Host Interface Enabled
    09/20/17-10:12:12 (GMT) (tRAID): NOTE:  SOD: Pre-Initialization Phase Complete
    09/20/17-10:12:13 (GMT) (tRAID): WARN:  BID: initialize(): Power latched!
    09/20/17-10:12:23 (GMT) (tRAID): NOTE:  ACS: Icon ping to alternate failed: -2,
    resp: 0
    09/20/17-10:12:23 (GMT) (tRAID): NOTE:  ACS: autoCodeSync(): Process start. Comm
     Mode: 0, Status: 0
    09/20/17-10:12:23 (GMT) (tRAID): WARN:  ACS: autoCodeSync(): Skipped since alt n
    ot communicating.
    09/20/17-10:12:23 (GMT) (tRAID): NOTE:  SOD: Code Synchronization Initialization
     Phase Complete
    09/20/17-10:12:23 (GMT) (tRAID): NOTE:  Caught IconSendInfeasibleException Error
     in iop::requestAltIopDelay
    09/20/17-10:12:24 (GMT) (tRAID): NOTE:  CheckInMonitor: Check-in failed (IconSen
    dInfeasibleException Error)
    09/20/17-10:12:24 (GMT) (NvpsPersistentSyncM): NOTE:  NVSRAM Persistent Storage
    updated successfully
    09/20/17-10:12:24 (GMT) (tRAID): NOTE:  USM Mgr initialization complete with 0 r
    ecords.
    09/20/17-10:12:24 (GMT) (tRAID): WARN:  Received IconSendInfeasibleException Err
    or adding small edr records from alt controller
    09/20/17-10:12:25 (GMT) (tRAID): WARN:  spm: unable to exchange features, assumi
    ng none
    09/20/17-10:12:25 (GMT) (tRAID): NOTE:  SPM acquireObjects exception: IconSendIn
    feasibleException Error
    09/20/17-10:12:25 (GMT) (tRAID): NOTE:  DBRead               0.176 secs
    09/20/17-10:12:25 (GMT) (tRAID): NOTE:  sas: Peering Disabled (Alt Unavailable)
    09/20/17-10:12:26 (GMT) (tRAID): NOTE:  QLStartFw: Downloading Driver's FW image
     03.00.01.47 from 0058c2e0 4c0c8 bytes , result 0
    09/20/17-10:12:53 (GMT) (tRAID): WARN:  QLMailboxCommand: Cmd = 0069, completion
     timeout
    09/20/17-10:12:53 (GMT) (tRAID): WARN:  QLMailboxCommand: command completion tim
    eout, cmd = 0x69
    09/20/17-10:12:54 (GMT) (tRAID): NOTE:  Qlogic coredump file written to 'H2BFR4J
    :/tmp/QLogic_Coredump_port_0_H2BFR4J',rc 204E50, expected 204E50
    09/20/17-10:12:54 (GMT) (tRAID): WARN:  Qlogic coredump file write failed.fclose
     returned -1

    09/20/17-10:12:54 (GMT) (tRAID): NOTE:  QLProcessSystemError: Restart RISC
    09/20/17-10:12:54 (GMT) (tRAID): ERROR: QLGetFwState: MBOX_CMD_GET_FW_STATE fail
    ed.  Stat f000
    09/20/17-10:12:54 (GMT) (tRAID): NOTE:  QLRebootTimer: Status after Get FW State
     4543
    09/20/17-10:12:54 (GMT) (tRAID): NOTE:  QLRebootTimer: QLGetFwState failed
    09/20/17-10:12:55 (GMT) (tRAID): NOTE:  QLStartFw: Downloading Driver's FW image
     03.00.01.47 from 0058c2e0 4c0c8 bytes , result 0
    09/20/17-10:13:23 (GMT) (tRAID): WARN:  QLMailboxCommand: Cmd = 0069, completion
     timeout
    09/20/17-10:13:23 (GMT) (tRAID): WARN:  QLMailboxCommand: command completion tim
    eout, cmd = 0x69
    09/20/17-10:13:23 (GMT) (tRAID): NOTE:  Qlogic coredump file written to 'H2BFR4J
    :/tmp/QLogic_Coredump_port_0_H2BFR4J',rc 204E50, expected 204E50
    09/20/17-10:13:23 (GMT) (tRAID): WARN:  Qlogic coredump file write failed.fclose
     returned -1

    09/20/17-10:13:23 (GMT) (tRAID): NOTE:  QLProcessSystemError: Restart RISC
    09/20/17-10:13:23 (GMT) (tRAID): ERROR: QLGetFwState: MBOX_CMD_GET_FW_STATE fail
    ed.  Stat f000
    09/20/17-10:13:23 (GMT) (tRAID): NOTE:  QLRebootTimer: Status after Get FW State
     4543
    09/20/17-10:13:23 (GMT) (tRAID): NOTE:  QLRebootTimer: QLGetFwState failed
    09/20/17-10:13:25 (GMT) (tRAID): NOTE:  QLStartFw: Downloading Driver's FW image
     03.00.01.47 from 0058c2e0 4c0c8 bytes , result 0
    09/20/17-10:13:52 (GMT) (tRAID): WARN:  QLMailboxCommand: Cmd = 0069, completion
     timeout
    09/20/17-10:13:52 (GMT) (tRAID): WARN:  QLMailboxCommand: command completion tim
    eout, cmd = 0x69
    09/20/17-10:13:53 (GMT) (tRAID): NOTE:  Qlogic coredump file written to 'H2BFR4J
    :/tmp/QLogic_Coredump_port_0_H2BFR4J',rc 204E50, expected 204E50
    09/20/17-10:13:53 (GMT) (tRAID): WARN:  Qlogic coredump file write failed.fclose
     returned -1

    09/20/17-10:13:53 (GMT) (tRAID): NOTE:  QLProcessSystemError: Restart RISC
    09/20/17-10:13:53 (GMT) (tRAID): ERROR: QLGetFwState: MBOX_CMD_GET_FW_STATE fail
    ed.  Stat f000
    09/20/17-10:13:53 (GMT) (tRAID): NOTE:  QLRebootTimer: Status after Get FW State
     4543
    09/20/17-10:13:53 (GMT) (tRAID): NOTE:  QLRebootTimer: QLGetFwState failed
    09/20/17-10:13:54 (GMT) (tRAID): NOTE:  QLStartFw: Downloading Driver's FW image
     03.00.01.47 from 0058c2e0 4c0c8 bytes , result 0
    09/20/17-10:14:21 (GMT) (tRAID): WARN:  QLMailboxCommand: Cmd = 0069, completion
     timeout
    09/20/17-10:14:21 (GMT) (tRAID): WARN:  QLMailboxCommand: command completion tim
    eout, cmd = 0x69
    09/20/17-10:14:22 (GMT) (tRAID): NOTE:  Qlogic coredump file written to 'H2BFR4J
    :/tmp/QLogic_Coredump_port_0_H2BFR4J',rc 204E50, expected 204E50
    09/20/17-10:14:22 (GMT) (tRAID): WARN:  Qlogic coredump file write failed.fclose
     returned -1

    09/20/17-10:14:22 (GMT) (tRAID): NOTE:  QLProcessSystemError: Restart RISC
    09/20/17-10:14:22 (GMT) (tRAID): ERROR: QLGetFwState: MBOX_CMD_GET_FW_STATE fail
    ed.  Stat f000
    09/20/17-10:14:22 (GMT) (tRAID): NOTE:  QLRebootTimer: Status after Get FW State
     4543
    09/20/17-10:14:22 (GMT) (tRAID): NOTE:  QLRebootTimer: QLGetFwState failed
    09/20/17-10:14:23 (GMT) (tRAID): NOTE:  QLStartFw: Downloading Driver's FW image
     03.00.01.47 from 0058c2e0 4c0c8 bytes , result 0
    09/20/17-10:14:50 (GMT) (tRAID): WARN:  QLMailboxCommand: Cmd = 0069, completion
     timeout
    09/20/17-10:14:50 (GMT) (tRAID): WARN:  QLMailboxCommand: command completion tim
    eout, cmd = 0x69
    09/20/17-10:14:51 (GMT) (tRAID): NOTE:  Qlogic coredump file written to 'H2BFR4J
    :/tmp/QLogic_Coredump_port_0_H2BFR4J',rc 204E50, expected 204E50
    09/20/17-10:14:51 (GMT) (tRAID): WARN:  Qlogic coredump file write failed.fclose
     returned -1

    09/20/17-10:14:51 (GMT) (tRAID): NOTE:  QLProcessSystemError: Restart RISC
    09/20/17-10:14:51 (GMT) (tRAID): ERROR: QLGetFwState: MBOX_CMD_GET_FW_STATE fail
    ed.  Stat f000
    09/20/17-10:14:51 (GMT) (tRAID): NOTE:  QLRebootTimer: Status after Get FW State
     4543
    09/20/17-10:14:51 (GMT) (tRAID): NOTE:  QLRebootTimer: QLGetFwState failed
    09/20/17-10:14:52 (GMT) (tRAID): WARN:  QLStartAdapter: ControllerErrorCount exc
    eeds threshold.
    09/20/17-10:14:52 (GMT) (tRAID): ERROR: QLInitializeDevice: QLStartAdapter faile
    d
    09/20/17-10:14:52 (GMT) (tRAID): ERROR: QLAddDevice: controller/device/chip init
    ialization failed.
    09/20/17-10:14:52 (GMT) (tRAID): ERROR: qlgEnableHostInterface: QLInitializeDevi
    ce failed.
    09/20/17-10:14:52 (GMT) (tRAID): NOTE:  ****************************************
    ****************************************
    09/20/17-10:14:52 (GMT) (tRAID): NOTE:    QLogic Target Application, Version 2.0
    1.08 6-13-2005 (W2K)
    09/20/17-10:14:52 (GMT) (tRAID): NOTE:          iSCSI Target Application
    09/20/17-10:14:52 (GMT) (tRAID): NOTE:   ***************************************
    *****************************************

    Exception: Data Abort
    cpsr:  60000013   pc:  0x

  • Hello Javi,

    Thanks for the serial capture as it helps. So after looking at the capture I see the following error:

    09/20/17-10:14:51 (GMT) (tRAID): NOTE:  QLProcessSystemError: Restart RISC
    09/20/17-10:14:51 (GMT) (tRAID): ERROR: QLGetFwState: MBOX_CMD_GET_FW_STATE fail
    ed.  Stat f000
    09/20/17-10:14:51 (GMT) (tRAID): NOTE:  QLRebootTimer: Status after Get FW State
     4543
    09/20/17-10:14:51 (GMT) (tRAID): NOTE:  QLRebootTimer: QLGetFwState failed
    09/20/17-10:14:52 (GMT) (tRAID): WARN:  QLStartAdapter: ControllerErrorCount exc
    eeds threshold.

    When I see that error, it normally means that the controller is dead. I know you stated that you have already replaced the controller once. The controller can’t get through its own POST, which is why you are getting this error. What you will need to determine is whether this is a slot issue or a controller issue. If you put the controller in the other slot and it reports the same error, then it is the controller.

    Please let us know if you have any other questions.

    DELL-Sam L
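
The serial capture above is hard to scan because the terminal wrapped every entry at 80 columns. The sketch below is one possible way to rejoin the wrapped lines and count entries by severity. It rests on assumptions rather than on any Dell tool: the forum indentation has been stripped, the wrap width is 80, and a 79-character line is treated as wrapped with its trailing space trimmed. The names `unwrap`, `severity_counts` and `SAMPLE` are hypothetical; the sample lines are copied from the capture.

```python
import re
from collections import Counter

WRAP = 80  # assumed terminal width of the capture

# Matches entries such as:
# 09/20/17-10:14:51 (GMT) (tRAID): NOTE:  QLRebootTimer: QLGetFwState failed
ENTRY = re.compile(
    r"^(\d\d/\d\d/\d\d-\d\d:\d\d:\d\d) \(GMT\) \((\w+)\): (NOTE|WARN|ERROR):\s+(.*)$"
)


def unwrap(lines):
    """Rejoin physical lines that the terminal wrapped at WRAP columns."""
    entries = []
    prev_full = False
    for raw in lines:
        line = raw.rstrip()
        if prev_full and entries and not ENTRY.match(line):
            entries[-1] += line  # continuation of the previous physical line
        else:
            entries.append(line)
        prev_full = len(line) >= WRAP - 1
        if len(line) == WRAP - 1:
            entries[-1] += " "  # heuristic: restore a trimmed trailing space
    return [e.rstrip() for e in entries]


def severity_counts(entries):
    """Count NOTE/WARN/ERROR entries; lines that do not parse are skipped."""
    counts = Counter()
    for entry in entries:
        m = ENTRY.match(entry)
        if m:
            counts[m.group(3)] += 1
    return counts


SAMPLE = [
    "09/20/17-10:14:51 (GMT) (tRAID): ERROR: QLGetFwState: MBOX_CMD_GET_FW_STATE fail",
    "ed.  Stat f000",
    "09/20/17-10:14:51 (GMT) (tRAID): NOTE:  QLRebootTimer: QLGetFwState failed",
    "09/20/17-10:14:52 (GMT) (tRAID): WARN:  QLStartAdapter: ControllerErrorCount exc",
    "eeds threshold.",
    "09/20/17-10:12:23 (GMT) (tRAID): NOTE:  ACS: Icon ping to alternate failed: -2,",
    "resp: 0",
]

if __name__ == "__main__":
    for entry in unwrap(SAMPLE):
        print(entry)
    print(dict(severity_counts(unwrap(SAMPLE))))
```

Filtering the unwrapped entries for `ERROR` or for the `QLStartFw` retries makes the repeated firmware download loop before the Data Abort easier to see.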

  • Hi Sam,

    Thanks for the help. We had another replacement controller, and when we inserted it with no other controller installed in the array we were able to boot the storage up.

    Can you tell me if there is a specific procedure for adding a new controller to an MD3000 that is currently running on only one controller? Recovery Guru just says to attach it and that's it.

    Cheers,

    Javi

  • Hello Javi,

    If the system was running in dual-controller mode before, then yes, you insert the controller and wait about 10 minutes. The 10 minutes allow the replacement controller to sync with the current controller and gather all the information. Once that is done, all you need to do is bring the controller online in MDSM.

    If your MD3000 is running in simplex mode then you will need to do the conversion to duplex mode. Here is a guide that explains how that is done. http://downloads.dell.com/manuals/all-products/esuprt_ser_stor_net/esuprt_powervault/powervault-md3000i_user%27s%20guide5_en-us.pdf

    Please let us know if you have any other questions.

    DELL-Sam L

  • Hi Sam! Thanks for the information. I have always had two controllers, but I do not know whether I am in simplex or duplex mode. How can I check it? Does this information say that I am running in duplex mode?

    Ethernet port:              1                  
                Link status:             Up                 
                MAC address:             a4:ba:....
                Negotiation mode:        Manual setting     
                   Port speed:           100 Mbps           
                   Duplex mode:          Full duplex        
                Network configuration:   Static             

  • Hello javiervila,

    If you have had 2 controllers then you are already in duplex mode. So I would just insert the controller and give it about 10 minutes to sync with the active controller. Once that is complete then you will want to online it in MDSM.

    Please let us know if you have any other questions.

    DELL-Sam L

  • Hi Sam,

    I was able to replace the controller and now I have the controller online. However, I have two errors in Recovery Guru:

    1)

    Storage array:  STORAGE_PFN_2
    Component reporting problem:     Thermal sensor  
      Status:     Not available
      Location:  Expansion enclosure 0
      Component requiring service:  Temperature sensor

    2)

    Storage array:  STORAGE_PFN_2
    Component reporting problem:     Host Board Left
      Status:     Not available
      RAID Controller Module:  Slot 0
      Service action (removal) allowed:  No
        Service action LED on component:  No

    Are these critical errors? How can I solve them?

    Thanks,

    Javier

  • Hello Javier,

    When you replace a controller it is not uncommon to see these errors come up. Once the controller has been replaced I would give the MD3000 about 5 minutes then run the check in the Recovery Guru again to see if the errors are still present.

    Please let us know if you have any other questions.

    DELL-Sam L

  • Thanks Sam,

    Two days after the replacement, these problems are still present and I have no IP connection to the replaced controller.

    I can also see that clock dates are not synced, and when trying to sync both controllers I get error 1009.

    Any thoughts?

    Thanks,

    Javier

  • Hello Javier,

    Can I get you to gather a new support log from your MD3000 so that I can review it? I will send you an email that you can reply back to with the log.

    Please let us know if you have any other questions.

    DELL-Sam L