PE 2500: Fatal Error: Controller Monitor Failed


  • My main file server (PowerEdge 2500) locked up over the weekend, and when I reboot it, it hangs on "Waiting for Array Controller #0 to start...".

    Eventually it gives me the error

    Fatal Error: Controller Monitor Failed  

    Array Controller not started

    I found the following messages on a Linux list (see below), but I'm not sure what to do with them.  I am running Windows 2000, and of course, my last good backup was three days ago.  I don't want to lose the data on the 3 disks (RAID 5).

    I have tried unplugging the drives (all at once, and one at a time) with the same error resulting.

    I have replaced the backplane entirely at the suggestion of Dell Tech Support, with no better luck (I also replaced the SCSI cable connecting the backplane to the motherboard).

    When I unplug the backplane from the motherboard, it no longer hangs on waiting for the Array Controller, it seems to start the controller, but of course with no disks attached.

    Any suggestions are greatly appreciated!

    Ari

    Message 1:

    ok, I'm shaken now....

    I just reboot PE4600 machine..
    then this message shows up
    even before booting to OS
    ( Debian with 2.4.19 kernel, built-in aacraid driver )

    Array Controller #0

    Fatal Error : Controller Monitor Failed
    Array Controller not started


    I do not have any clue about this...
    it's working just fine before...

    help .. ?

    Message 2:

    Ok,
    we've rebuilt the container with Dell Utility setup
    again...luckily it's was still a fresh install...
    no data on it yet..just OS.

    Then after some time, I heard three ticking noise from the machine
    then it said REBUILD complete..
    The light turns blue again...
    the message is gone...

    both said Ok now :

    Container#0-RAID5 4Gb     Ok
    Container#0-RAID5 63.4Gb  Ok


    Can I depend on this ?
    or there's a possibility it will happen again ?

    Shall We just replace Disk 0 ?

  • Hi Arigluck,

      With the controller monitor not starting, it sounds like you are using the PERC 3/Di or perhaps the PERC 2 (quad channel) controller. Try reseating the DIMM and ROMB key for the 3/Di. You have already replaced the backplane and SCSI cable, so your remaining options are the controller and the power system. Most likely it is the controller, but if you are buying parts, the power distribution board is cheap compared to replacing the system board.

    Call tech support and have them provide you with the part numbers (assuming you are out of warranty).

  • Thanks GaryS,

        I'm still fully covered by warranty, and in fact support is sending me (get this) a new motherboard, backplane, cables, DIMM and ROMB key. It's a PERC 3/Di.  Here's my follow-up question for you (and thank you very much for the response).  I tested with the backplane power plugged in and with it unplugged, and it appears the power to the backplane/PERC is working OK.  Is there a chance that, even with the slew of new parts, I will still be missing the issue?

    Also, the tech support person had me remove the ROMB key and DIMM and reboot, and now the BIOS doesn't recognize the RAID card at all - I only have Off or SCSI options for the card.  I'm assuming that with the new motherboard etc. I don't need to worry about the current motherboard's BIOS, but I am afraid of losing data (the latest backup is about 4 days old because of the Thanksgiving holiday and a full tape).

    I'm wondering if all the new hardware will recognize my RAID 5 OK (realizing I need to be sure it's set to RAID in the BIOS), so that I don't lose those drives.

    Any thoughts, suggestions, feedback, or warnings appreciated.

    Ari

     

  • ari,

      Do not have the drives attached to the server while you are enabling the ROMB in the BIOS. The controller will read the configuration from the drives.

    The parts that are being sent SHOULD resolve your issue; the most likely culprit is the M/B itself. Power is the underlying basis for any system, and can cause problems that most people would not expect. The parts ordered have a better than 95% chance of resolving the "controller monitor failed to start" issue.

     

    DELL_GaryS

  • I had just about finished up typing my thanks for the advice, and to say that the motherboard seems to have done the trick, but my server just went down again.

    The Dell technician came out with all new pieces and replaced the motherboard, ROMB key, backplane, and cables. When we rebooted, the controller found the RAID. It took a long time for the server to start up (15 minutes instead of the normal 5), but when it did, everything was in place and looked great.

    It's been up and running all day, and then about 45 minutes ago, it locked up again.  The drives were flashing amber, and when I rebooted I got a series of different errors.

    The first reboot gave me this message:

    System Parity Error Interrupt at F000:AB84  and then asked if I wanted to shut off NMI, reboot, or press any key.

    I thought perhaps it was a memory problem, so I took out one of the memory pairs and rebooted, and then it showed one of the drives as bad.  I rebooted again, and it came up fine, but when I logged in, it crashed with "Unknown Hard Error" in Services.exe.

    I tried re-seating the drives and memory, and as of the last reboot, I'm back to exactly where I was yesterday: waiting for the array controller to start, and it fails.

    ** Quick update:  As I try to narrow down the possibilities, I continue to get completely different results.  Just from re-seating the drives or rebooting, I get a variety of responses from the array controller.  I have seen: no response from the controller, unknown containers, no containers, and the controller recognizing the container but telling me that one drive (not always the same one) is missing from the array.

    Any suggestions?

    thanks.

    Ari

     

    Message Edited by arigluck on 12-03-2003 08:30 PM

  • Hi Ari,

      Time to look at power. If the server is on a UPS, move it to the wall outlet, and check the P/S's for any amber lights. Call tech support and ask for a H/W escalation, as two dispatches have failed to eliminate the problem. I would replace the power subsystem at this point.

    DELL-GaryS

  • Thanks Gary,

       Here's where I stand now, and I think I have a handle on what happened, although I could use a second opinion before proceeding. 

    I spoke with another technician, and after doing some diagnostics, she felt it was likely that hardware had corrupted the array, since the RAID container utility was seeing all the drives but showing members missing.  She had me do a Ctrl+R to restore the RAID (after warning me I'd lose my data, which would be unfortunate, but my backups were recent enough that it's not critical), and it worked well enough to bring the machine online to get the most recent data.  It crashed after a few hours, and remains unstable.

    Her logic was that the problem was likely caused by hardware, but that in the process the RAID array was corrupted, and even though the hardware was replaced, it was too late for the RAID.  It makes sense to me, but the one nagging doubt I have is that I still get somewhat inconsistent results when rebooting, including waiting for the array controller until it fails to start.

    Her suggestion was to rebuild the containers, rebuild the server, and restore the data.  I'm willing to do that, but I'll be extremely frustrated if I go through that and continue to have problems with the controller.  I'd still rather have the RAID array back, but at this point I'm trying to cut my losses and get the server rebuilt, since I believe I have the most recent data.

    Thanks once more for your expert advice,

    Ari

     

  • Hi Ari,

     

      You have my advice above. Rebuilding the O/S MIGHT fix it, but I expect not. Was this the escalation tech's advice? In my opinion the symptoms are not consistent with a damaged O/S being the only problem.

     

    DELL-GaryS