PE SC1435 SAS 5iR adaptor failure?

This question is answered

Hi, folks

I am pretty certain that the SAS 5ir Adaptor in our PE server has failed. I have been quoted £250 to replace it. Because of this I would like to make absolutely sure that this is the component that has failed. Here is what happened:

Last Sunday night, our PowerEdge SC1435 (<ADMIN NOTE:Service tag removed per privacy policy>) shut down because of a hardware error. The message on the screen was:

*** Hardware Malfunction
Call your hardware vendor for support
*** The system has halted ***

Restarting the system allows Windows 2003 R2 to boot but the system soon shuts down again. The longest it stayed up for was about 1.5hrs which was enough time for me to run DSET and copy some data off the server.
 
One of the last reboots I tried failed with the following on the BIOS screen:

PCIe Fatal Error interrupt at 9B82:41F6

Pressing 'R' to reboot the system resulted in the system rebooting to Windows but it only lasted a few minutes before the hardware failure kicked in. I have not restarted it since.

The DSET report shows, under Storage > SAS 5_iR Adaptor Embedded, that the State is Degraded:

ID                                   0
Name                                 SAS 5/iR Adapter
State                                Degraded
Firmware Version                     00.10.49.00.06.12.02.00
Minimum Required Firmware Version    00.10.51.00.06.12.05.00
Driver Version                       1.25.05.00
Minimum Required Driver Version      1.28.03.01
Storport Driver Version              5.2.3790.3959
Number of Connectors                 1
Rebuild Rate                         Not Applicable
BGI Rate                             Unknown
Check Consistency Rate               Unknown
Reconstruct Rate                     Unknown
Security Capable                     Not Applicable
Security Key Present                 Not Applicable
SCSI Initiator ID                    Not Applicable
Cache Memory Size                    MB
Patrol Read Mode                     Disabled
Patrol Read State                    Unknown
Patrol Read Rate                     %
Patrol Read Iterations               Unknown
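
Editor's sketch: the report lists both the installed and the minimum required firmware and driver versions, and both installed versions are lower. The short Python snippet below shows one way to check that mechanically. It assumes the dotted fields can be compared numerically from left to right, which is an assumption about the version scheme rather than something the report states, and the helper names are made up for illustration. Outdated firmware and drivers may be part of why the controller is flagged as Degraded, so this check alone does not prove a hardware fault.

```python
def parse_version(text):
    """Split a dotted version string into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def is_below_minimum(installed, minimum):
    """True if the installed version sorts before the minimum (assumed left-to-right numeric order)."""
    return parse_version(installed) < parse_version(minimum)


# (installed, minimum required) pairs copied from the DSET report above.
report = {
    "firmware": ("00.10.49.00.06.12.02.00", "00.10.51.00.06.12.05.00"),
    "driver": ("1.25.05.00", "1.28.03.01"),
}

# Collect every component whose installed version is below the stated minimum.
outdated = [
    name
    for name, (installed, minimum) in report.items()
    if is_below_minimum(installed, minimum)
]
print(outdated)
```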


I tried installing the latest firmware update available from the Dell Support site but when I ran Flash.bat from a command prompt several messages appeared stating that it was unable to find the required files.

By the time I tried installing the latest driver the system was not staying up long enough for me to even run the driver installation package.

According to the DSET report, everything else seems OK.
 
There are no errors reported in the event logs before the failure occurs. Device Manager shows everything is fine.

I am a little confused about the terminology used in the DSET report, particularly the 'embedded' description. As far as I understand it, the SAS 5iR is an adaptor card and is shown as 'adaptor' in Device Manager. I understand that the SAS 5iR comes in two forms: an adaptor card and embedded in the system. I assume our SC1435 contains the card and it certainly looks that way - the drives are connected by two leads that terminate in one connector that is attached to the card which sits on a PCI riser.

 
One other thing I am uncertain about is the PCIe error message. I don't know if this relates to the SAS card or to the PCI Riser card which the SAS adaptor is connected to or perhaps another PCI connection on the motherboard.
 
Because there are no errors in the Windows Event Logs, because Device Manager shows everything as being OK, and because the DSET report shows the adaptor's status as degraded, I assume that the SAS adaptor has failed.
 
What do you people think? Anything I could try to make certain the card is the point of failure? £250 is a lot to spend if the problem lies elsewhere.
 
Thanks!
 
Mark
All Replies
  • The PCI error would be indicative of a device on the PCI bus not operating properly. One of the first and easiest things to do is a simple reseat of the internal components, because a loose connection can cause these errors as well. Here is some documentation on where devices are located and how they are removed and put back in. Please make sure to reseat the memory, data and power connectors, riser, and PCI cards.

    Once the reseat is done, try powering the server back on and see if there are any status changes.

    We also have some diagnostics you can boot the server to. You can run an express test or an extended test over all the hardware, or do a custom test and select specific devices to test. Here is the download link.

    support.dell.com/.../download.aspx

    You can use any errors reported by the 32-bit diagnostics to further isolate whether the card is the issue.

    Let us know the outcome.

    Thanks.


  • Hello, Daniel

    Many thanks for replying.

    I reseated everything on the motherboard.

    I started the server (the first time since last week), and ran the 32bit diags from a bootable CD and it displayed the following failure:

    Test results : Fail

    Device : IPMI

    Test : IPMI_System_Event_Log_Check

    Error Code : 2900:0221

    Msg : IPMI - Oct 02 22:53:22 2011 : System Firmware :: Critical interrupt sensor (PCIE Fatal Err) Bus Fatal Error

    The hard drives and SAS controller were not listed in the available tests.

    A restart resulted in another PCIe fatal error - F000:E891. After pressing (r) to reboot the system it booted normally.

    Ran the PowerEdge Diagnostics and opted to test everything (SAS 5ir and drives are listed). All tests passed. All entries under the Configuration tab have a green tick mark beside them. Test took 2hrs 22mins - the majority of that was the hard drive tests.

    The system has been up for nearly three hours now. Presumably, reseating the components did not help as the PCIe fatal error occurred again.

  • The IPMI error is a cached error. Its date stamp is Oct 02 22:53:22 2011, so it records the original failure rather than a new one.

    However, if you have reseated the card and it is still causing the system to hang or crash, then you may need to proceed with replacement.
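
Editor's sketch: the point about the cached error comes down to comparing the date stamp in the IPMI message with the time the diagnostics were run. The Python snippet below shows one way to do that. The message text is copied from the diagnostics output earlier in the thread; the diagnostics start time is hypothetical, since the thread does not give the exact date, and the function names are made up for illustration.

```python
import re
from datetime import datetime

# Message text as reported by the 32-bit diagnostics earlier in the thread.
SEL_MESSAGE = (
    "IPMI - Oct 02 22:53:22 2011 : System Firmware :: "
    "Critical interrupt sensor (PCIE Fatal Err) Bus Fatal Error"
)


def sel_timestamp(message):
    """Extract the 'Mon DD HH:MM:SS YYYY' date stamp from a log message."""
    match = re.search(r"[A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} \d{4}", message)
    if match is None:
        raise ValueError("no date stamp found in message")
    return datetime.strptime(match.group(0), "%b %d %H:%M:%S %Y")


def is_cached(message, test_started):
    """True if the logged event happened before the diagnostics run began."""
    return sel_timestamp(message) < test_started


# Hypothetical start time for the diagnostics run (not stated in the thread).
test_started = datetime(2011, 10, 10, 9, 0, 0)
print(sel_timestamp(SEL_MESSAGE), is_cached(SEL_MESSAGE, test_started))
```

An entry that predates the test run only shows that the log still holds the old event; it says nothing about whether the fault has gone away.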


  • The Perc5 ir should not cost that much money! My company sells those with a 1-year warranty for under $100. (Aventis Systems)

    It is kind of rare to see these controllers fail, but when they do, they usually give a PCI-E training error. You will usually also see drives randomly dropping/picking back up. This could explain the rebooting.

  • Thanks for the feedback. The server is still running so I'll leave it and see if it lasts a few days.

    @IcanBENCHurCAT:

    Do you have a UK office? I was gob-smacked when I was quoted £250. Even buying it from you guys and getting it shipped across the big pond would be cheaper.

  • Agreed ... you can get a PERC 5 (much better controller) for half the quoted cost of the SAS 5.

  • No UK office, but we have pretty decent shipping rates. If you have your own account we can use that, too. Just need to call in for a sales rep.

  • I have received a replacement PERC 5 i/R card and have placed it in the system and connected the drives. Problem is, I cannot see how to configure it. During boot the following is displayed:

    SAS 6 Host Bus Adaptor BIOS

    MPT-6.22.03.00

    Copyright 2000 2008 LSI

    Initialising...

    Vol 00:130 is currently in state inactive/optimal

    Enter SAS configuration utility to investigate

    The previous card's BIOS displayed a CTRL+key combination to use to launch the configuration utility but nothing is displayed.

    Can anyone help with this, please?

    Thanks

  • OK...

    Saw the CTRL+C prompt (Whoops)

    Now I have a different problem. I used the menu to do the following:

    CTRL+C  displays a screen on which SAS6IR is listed. Is this correct? The invoice states it is a PERC 5i/R.

    Anyway from the menu I drilled down:

    SAS6IR > RAID Properties > Manage Array > Activate Array

    I chose to activate the array and exited the utility.

    The server reboots and after Initializing.. I see Vol (00:000) is currently in state RESYNCHING

    The drive is then identified and listed (two drives in RAID 1 configuration).

    After the server BIOS finishes loading, a white progress meter rapidly completes across the bottom of a black screen, the Windows Server 2003 splash screen appears and then the server reboots. I get the same when trying to start the OS in safe mode.

    Presumably I need to delete the array and start from scratch because the wrong driver is loaded. I am quite happy to do that, but would prefer to avoid it if possible. Is there a way to install the driver without re-installation?

    Cheers!

  • Yeah, your driver must be updated. The only way I know how is to take a backup with software like Acronis. Then, you can specify drivers to add when you put the backup back on to the array. Takes about the same amount of time as reinstalling.

    Maybe you could boot from a Windows CD, run a repair and add the driver that way. Or maybe boot a live CD and add the driver from there?

  • Thanks. I'll reinstall from scratch

  • Right, a little more help, if I may please :)

    I can't find a driver for the adaptor. When I search Dell's site for 'SAS 6/ir driver' all I see are drivers for integrated controllers, not for a SAS 6/iR adaptor, which I assume is different.

    Anyone know where I can get one for Windows 2003, please?

    Thanks!

  • Google is your friend:

    support.us.dell.com/.../format.aspx

    However, looking at the page you have no idea what this is for. Looking at the text file tells you.

  • Thanks to everyone who contributed.

    I saved £100 by buying from the US. This saving was reduced by a £30 customs charge when it arrived at our office, but it still meant a saving of £70!

    The new card is in - had to use the Dell USB F6 Utility to format my USB flash drive before Windows Setup would recognise it. Everything seems to be OK :)

    I appreciated everyone's help :)

  • This is an additional note on this thread, for everyone that may read it.

    The quickest way to identify whether a card on the expansion bus (usually a riser in a 1U chassis) has failed is simply to remove the suspected card from the PCI slot. You don't need to detach any of its other cables; just lay it carefully aside so it won't short against itself or anything else, and won't end up in a fan or block something important. Then see if the system boots past the point of failure without problems.

    If it gets this far, you've found the card that is the problem.
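
Editor's sketch: the pull-one-card-and-retest approach above is a simple elimination loop. The Python snippet below models it with a simulated boot check. The card names and the `simulated_boot` function are purely illustrative stand-ins for physically removing a card and powering the server on; nothing here talks to real hardware.

```python
def find_faulty_card(cards, boots_cleanly):
    """Return the first card whose removal lets the system boot cleanly, else None."""
    for suspect in cards:
        # Try booting with every card except the current suspect installed.
        remaining = [card for card in cards if card != suspect]
        if boots_cleanly(remaining):
            return suspect
    return None


def simulated_boot(installed):
    """Stand-in for a real boot attempt: fails whenever the (assumed) bad card is present."""
    return "SAS adapter" not in installed


# Hypothetical list of cards on the riser.
cards = ["network card", "SAS adapter"]
print(find_faulty_card(cards, simulated_boot))
```

If no single removal gives a clean boot, the function returns `None`, which corresponds to the fault lying somewhere else, such as the riser or the motherboard.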

    Some history: once you pull a card, look at the little barrel-shaped components with XXXXuF written on the side. These are called capacitors. Dell was a victim of a capacitor scam (the great 'Capacitor Plague' of the early to mid 2000s, which continued to cause failures up until 2010 and beyond), and that may be why the card has failed.

    You may also see a black smudge across the top of a capacitor, where it should normally be silver. This is electrolyte that has vented onto the top of the capacitor, which is another sign of failure. There are many other signs, but this is likely the cause of the card failure: the cap either overheated due to poor ventilation or failed due to poor quality. Suspect the former first, then ask Dell about the latter; they may be able to see if you are due a replacement due to an OEM defect.