I am pretty certain that the SAS 5ir Adaptor in our PE server has failed. I have been quoted £250 to replace it. Because of this I would like to make absolutely sure that this is the component that has failed. Here is what happened:
The IPMI error is a cached error: it was logged at Oct 02 22:53:22 2011.
But if you have reseated the card and it is still causing the system to hang or crash, then you may need to proceed with replacement.
The Perc5 ir should not cost that much money! My company sells those with a 1-year warranty for under $100. (Aventis Systems)
It is kind of rare to see these controllers fail, but when they do, they usually give a PCI-E training error. You will usually also see drives randomly dropping/picking back up. This could explain the rebooting.
The PCI error would be indicative of a device on the PCI bus not operating properly. One of the first and easiest things to do is a simple reseat of the internal components. A simple loose connection can cause these errors as well. Here is some documentation on where devices are located and how they are removed and put back in. Please make sure you reseat the memory, data and power connectors, riser, and PCI cards.
Once the reseat is done, try powering it back on again and see if there are any status changes.
We also have some diagnostics you can boot the server to. You can run an express test or an extended test over all the hardware, or do a custom test and select specific devices to test. Here is the download link.
You can use the 32-bit diagnostics errors to further isolate the card as the issue.
Let us know the outcome.
Many thanks for replying.
I reseated everything on the motherboard.
I started the server (the first time since last week), and ran the 32bit diags from a bootable CD and it displayed the following failure:
Test results : Fail
Device : IPMI
Test : IPMI_System_Event_Log_Check
Error Code : 2900:0221
Msg : IPMI - Oct 02 22:53:22 2011 : System Firmware :: Critical interrupt sensor (PCIE Fatal Err) Bus Fatal Error
The hard drives and SAS controller were not listed in the available tests.
A restart resulted in another PCIe fatal error - F000:E891. After pressing (r) to reboot the system it booted normally.
Ran the PowerEdge Diagnostics and opted to test everything (SAS 5ir and drives are listed). All tests passed. All entries under the Configuration tab have a green tick mark beside them. Test took 2hrs 22mins - the majority of that was the hard drive tests.
The system has been up for nearly three hours now. Presumably, reseating the components did not help as the PCIe fatal error occurred again.
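For anyone wanting to check whether an IPMI failure like the one above is just an old System Event Log entry rather than a fresh fault, one rough approach is to compare the timestamp in the Msg line with the time the diagnostics were started. The Python sketch below is only an illustration: the function names, the five-minute tolerance and the example run time are my own assumptions (the thread does not say exactly when the test was run), and it is not a Dell tool.

```python
import re
from datetime import datetime, timedelta

# Msg line copied from the diagnostics output in this thread.
MSG = ("IPMI - Oct 02 22:53:22 2011 : System Firmware :: "
       "Critical interrupt sensor (PCIE Fatal Err) Bus Fatal Error")

# Matches a timestamp such as "Oct 02 22:53:22 2011".
STAMP = re.compile(r"[A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} \d{4}")


def sel_entry_time(msg):
    """Return the timestamp embedded in a log message line, or None."""
    match = STAMP.search(msg)
    if match is None:
        return None
    # %b parses English month abbreviations under the default C locale.
    return datetime.strptime(match.group(0), "%b %d %H:%M:%S %Y")


def is_cached(msg, test_started, tolerance=timedelta(minutes=5)):
    """True if the logged event predates the test run by more than tolerance."""
    logged = sel_entry_time(msg)
    return logged is not None and test_started - logged > tolerance


# Hypothetical start time for the diagnostics run (not given in the thread).
example_run = datetime(2011, 10, 10, 9, 0, 0)
print(sel_entry_time(MSG))
print(is_cached(MSG, example_run))
```

If the entry turns out to be old, the failed IPMI_System_Event_Log_Check is reporting history, not a fault that occurred during the test itself.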
Thanks for the feedback. The server is still running so I'll leave it and see if it lasts a few days.
Do you have a UK office? I was gob-smacked when I was quoted £250. Even buying it from you guys and getting it shipped across the big pond would be cheaper.
Agreed ... you can get a PERC 5 (much better controller) for half the quoted cost of the SAS 5.
No UK office, but we have pretty decent shipping rates. If you have your own account we can use that, too. Just need to call in for a sales rep.
I have received a replacement PERC 5 i/R card and have placed it in the system and connected the drives. Problem is, I cannot see how to configure it. During boot the following is displayed:
SAS 6 Host Bus Adaptor BIOS
Copyright 2000-2008 LSI
Vol 00:130 is currently in state inactive/optimal
Enter SAS configuration utility to investigate
The previous card's BIOS displayed a CTRL+key combination to use to launch the configuration utility but nothing is displayed.
Can anyone help with this, please?
Saw the CTRL+C prompt (Whoops)
Now I have a different problem. I used the menu to do the following:
CTRL+C displays a screen on which SAS6IR is listed. Is this correct? The invoice states it is a PERC 5i/R.
Anyway from the menu I drilled down:
SAS6IR > RAID Properties > Manage Array > Activate Array
I chose to activate the array and exited the utility.
The server reboots and after Initializing.. I see Vol (00:000) is currently in state RESYNCHING
The drive is then identified and listed (two drives in RAID 1 configuration).
After the server BIOS finishes loading, a white progress meter rapidly completes across the bottom of a black screen, the Windows Server 2003 splash screen appears and then the server reboots. I get the same when trying to start the OS in safe mode.
Presumably I need to delete the array and start from scratch because the wrong driver is loaded. I am quite happy to do that, but would prefer to avoid it if possible. Is there a way to install the driver without re-installation?
Yeah, your driver must be updated. The only way I know how is to take a backup with software like Acronis. Then, you can specify drivers to add when you put the backup back on to the array. Takes about the same amount of time as reinstalling.
Maybe you could boot from a Windows CD, run a repair and add the driver. Or maybe boot a live CD and add the driver?
Thanks. I'll reinstall from scratch
Right, a little more help, if I may please :)
I can't find a driver for the adaptor. When I search Dell's site for 'SAS 6/ir driver' all I see are drivers for integrated controllers, not for a SAS 6/iR adaptor, which I assume is different.
Anyone know where I can get one for Windows 2003, please?
Google is your friend:
However, looking at the page you have no idea what this is for. Looking at the text file tells you.
Thanks to everyone who contributed.
I saved £100 by buying from the US. This saving was reduced by a £30 customs charge when it arrived at our office, but it still meant a saving of £70!
The new card is in - had to use the Dell USB F6 Utility to format my USB flash drive before Windows Setup would recognise it. Everything seems to be OK :)
I appreciated everyone's help :)
This is an additional note on this thread, for everyone that may read it.
The quickest way to identify whether a card on the expansion bus (usually a riser, if in a 1U chassis) has failed is simply to remove the suspected card from the PCI bus. You don't need to detach any of its other cables; just lay it carefully aside so it won't short against itself or anything else, and won't end up in a fan or block something important. Then see if the system boots past the point where it previously failed without problems.
If it gets this far, you've found the card that is the problem.
Some history: once you pull a card, look at the little barrel-shaped components with XXXXuF written on the side. These are called capacitors. Dell was a victim of a capacitor scam (the great 'Capacitor Plague' of the early/mid 2000s, which continued to cause failures up until 2010 and beyond), and that may be why the card has failed.
You may also see a black smudge across the top, where it should normally be silver. This is because electrolyte has been vented onto the top of the capacitor, which is another sign of failure. There are many other signs, but this is likely the cause of the card failure: the cap overheated due to poor venting, or failed due to poor quality. Suspect the former first, then ask Dell about the latter; they may be able to see if you are due a replacement because of an OEM defect.