md3000i Degraded Physical Disk Channel - DELL|EMC Storage Forum - Storage - Dell Community

Hi,

I have an md3000i with an attached md1000. We had an HDD fail, and the hot spare took over with no downtime or issues.

I have since replaced the failed drive, and it was put back into the array. Once that completed, the array started showing Degraded Physical Disk Channel on channels 0,1 (I'm assuming that's the LUN number, since LUNs 0 and 1 are part of the RAID group with the failed disk).

It also seems to have created a Virtual Disk Not On Preferred Path issue. This md3000i has dual controllers and the ESX hosts are all multipathed, so I'm not sure why all of this is coming up after replacing a failed disk.

Thanks,
Josh.

All Replies
  • Hello, Josh.

    So, there are a number of things going on here. We'll start with the message "Degraded Physical Disk Channel 0,1". It's telling us that physical disk channels 0 and 1 (drive channels, not LUNs) are marked as degraded by the controllers. This is probably because of the failed disk. The disk more than likely failed because it hit its error limit, and on the way there it caused chatter down the channels. It's an easy fix, though.

    You'll need to run some commands to clear the errors from the channels (it's like resetting a counter back to zero).

    Here are the commands to do so:

    show allPhysicalDiskChannels stats;

    clear allPhysicalDiskChannels stats;

    set physicalDiskChannel [0] status=optimal;

    set physicalDiskChannel [1] status=optimal;

    set physicalDiskChannel [2] status=optimal;

    set physicalDiskChannel [3] status=optimal;

    show allPhysicalDiskChannels stats;

    To get access to the command line, open a Command Prompt in Windows and navigate to the SMcli folder: C:\Program Files (x86)\Dell\MD Storage Manager\client or C:\Program Files\Dell\MD Storage Manager\client

    (Which one depends on whether the version is 32-bit or 64-bit.)

    Then, start your commands with:

    >smcli -n "NameOfArray" -c "set physicalDiskChannel [1] status=optimal;"
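For anyone who would rather not type each line by hand, here is a minimal sketch of how the commands listed above could be wrapped in SMcli calls, one per script command. It only uses the `-n` and `-c` options shown in this thread. The helper names `build_smcli_argv` and `run_all` and the `dry_run` flag are hypothetical, and the sketch assumes `SMcli` is on the PATH or that you run it from the client folder. Adjust the channel numbers to whatever the `show` output reports for your array.

```python
import subprocess

# Script commands quoted from the reply above, in the order given there.
SCRIPT_COMMANDS = [
    "show allPhysicalDiskChannels stats;",
    "clear allPhysicalDiskChannels stats;",
    "set physicalDiskChannel [0] status=optimal;",
    "set physicalDiskChannel [1] status=optimal;",
    "set physicalDiskChannel [2] status=optimal;",
    "set physicalDiskChannel [3] status=optimal;",
    "show allPhysicalDiskChannels stats;",
]


def build_smcli_argv(array_name, script_command, smcli="SMcli"):
    """Return the argument list for one SMcli call (hypothetical helper)."""
    return [smcli, "-n", array_name, "-c", script_command]


def run_all(array_name, commands, dry_run=True):
    """Print each command line, or execute it when dry_run is False."""
    for command in commands:
        argv = build_smcli_argv(array_name, command)
        if dry_run:
            # list2cmdline applies Windows-style quoting for display only.
            print(subprocess.list2cmdline(argv))
        else:
            # check=True stops at the first command that exits non-zero.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run: prints the command lines and executes nothing.
    run_all("NameOfArray", SCRIPT_COMMANDS)
```

Leaving `dry_run=True` lets you review the exact command lines before anything is sent to a live array.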

    As for the Virtual Disk Not On Preferred Path issue, run:

    SMcli -n "NameOfArray" -c "reset storageArray virtualDiskDistribution;"

    Run this command AFTER you've cleared the channel error counters, and let me know if it stays good.

    I know I've given a lot to do here, so let me know if you have any questions.

    Have a great rest of the week!

  • Thanks for the informative answer. This is a live SAN; are there any issues with running those commands? It looks like you're just clearing counters and resetting the sensors. Is this safe to do?

    Thanks,

    Josh.

  • Absolutely safe to run these, yes. That's exactly what you're doing. I should also tell you that, once in a great while, this won't clear it. Sometimes the 'message' is just traded between the controllers and "sticks" in the GUI. If these commands don't clear the message, you'll need to reboot the SAN (not fun, I know).

    But, the chances are good that these commands are all you need.

    Let me know!

  • What will this command do to live data?

    SMcli -n "NameOfArray" -c "reset storageArray virtualDiskDistribution;"

  • It doesn't touch data. It "redistributes" the ownership of virtual disks, moving each one back to its preferred controller. If you have multipath drivers installed (MDSM GUI and host access tools) and both RAID controllers are cabled, then you *shouldn't* see any sort of disconnect. The transfer of ownership from one controller to the other *shouldn't* take longer than the configured timeouts.

    Still, if you'd feel more comfortable waiting for an open maintenance window, then do that.

  • Daniel,

    I'm seeing lots of RAID Controller Module errors. Please see the output of the command below:

    DRIVE CHANNELS----------------------------

      SUMMARY

         CHANNEL  PORT              STATUS

         1        In,Out,Expansion  Degraded

         2        In,Out,Expansion  Degraded

      DETAILS

         DRIVE CHANNEL 1

            Port: In, Out, Expansion

               Status: Degraded

                  Reason: Error threshold exceeded

               Max. Rate: 3 Gbps

               Current Rate: 3 Gbps

               Rate Control: Switched

               DRIVE COUNTS

                  Total # of attached physical disks: 29

                  Connected to: A (left), Port In

                     Attached physical disks: 14

                        Expansion enclosure: 1 (14 physical disks)

                  Connected to: 0, Port Expansion

                     Attached physical disks: 15

                        Expansion enclosure: 0 (15 physical disks)

               CUMULATIVE ERROR COUNTS

                  RAID Controller Module 0

                     Baseline time set:                       11/18/14 5:32:52 PM

                     Sample period (days, hh:mm:ss):          371 days, 20:04:01

                     RAID Controller Module detected errors:  0

                     Physical Disk detected errors:           3485767

                     Timeout errors:                          0

                     Total I/O count:                         757848036

                  RAID Controller Module 1

                     Baseline time set:                       11/18/14 5:32:52 PM

                     Sample period (days, hh:mm:ss):          598 days, 12:09:58

                     RAID Controller Module detected errors:  948

                     Physical Disk detected errors:           5993184

                     Timeout errors:                          73

                     Total I/O count:                         2629457758

               CAPTURED INTERVAL ERROR COUNTS

               RAID Controller Module 1

                  Start time: {0}                          11/18/14 10:23:41 PM

                  End time: {0}                            6/5/16 6:42:17 AM

                  RAID Controller Module detected errors:  916

                  Physical Disk detected errors:           5642849

                  Timeout errors:                          25

                  Total I/O count:                         1835496439

         DRIVE CHANNEL 2

            Port: In, Out, Expansion

               Status: Degraded

                  Reason: Error threshold exceeded

               Max. Rate: 3 Gbps

               Current Rate: 3 Gbps

               Rate Control: Switched

               DRIVE COUNTS

                  Total # of attached physical disks: 29

                  Connected to: B (right), Port In

                     Attached physical disks: 14

                        Expansion enclosure: 1 (14 physical disks)

                  Connected to: 1, Port Expansion

                     Attached physical disks: 15

                        Expansion enclosure: 0 (15 physical disks)

               CUMULATIVE ERROR COUNTS

                  RAID Controller Module 0

                     Baseline time set:                       11/18/14 5:32:52 PM

                     Sample period (days, hh:mm:ss):          371 days, 20:04:01

                     RAID Controller Module detected errors:  129

                     Physical Disk detected errors:           3810970

                     Timeout errors:                          2

                     Total I/O count:                         344810553

                  RAID Controller Module 1

                     Baseline time set:                       11/18/14 5:32:52 PM

                     Sample period (days, hh:mm:ss):          598 days, 12:09:58

                     RAID Controller Module detected errors:  1655

                     Physical Disk detected errors:           5509740

                     Timeout errors:                          33

                     Total I/O count:                         82661965

               CAPTURED INTERVAL ERROR COUNTS

                  RAID Controller Module 0

                     Start time: {0}                          11/18/14 5:32:52 PM

                     End time: {0}                            11/8/15 3:57:26 PM

                     RAID Controller Module detected errors:  65

                     Physical Disk detected errors:           3643402

                     Timeout errors:                          2

                     Total I/O count:                         189870495

    Script execution complete.

    SMcli completed successfully.

    To the untrained eye that looks bad, with over 1600 errors on module 1. Granted, that's over a sample period of about 600 days. I'm also trying to implement a 2nd md3000i/md1000 but having problems with the speed. I'll create another post for that one, I think.

    Anyway, are all those errors something I need to worry about? I haven't reset the stats yet.

    Thanks again for all your help!

  • Hey, Josh.

    These are historical errors (accumulated over roughly the life of the array), so there's nothing to worry about in the present. Definitely open a new post on the speed problem, so we have a case for each issue. :)

    Have a happy holiday!
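
To put the cumulative counters from the output above into perspective, here is a rough sketch that turns a counter and its sample period into an average per-day rate. The figures are copied from Drive Channel 2, RAID Controller Module 1; the function names `parse_sample_period` and `errors_per_day` are hypothetical. An average like this says nothing about when the errors actually occurred, so treat it as a sanity check rather than a diagnosis.

```python
import re
from datetime import timedelta

# Matches the "Sample period (days, hh:mm:ss)" value, e.g. "598 days, 12:09:58".
_PERIOD = re.compile(r"(\d+) days?, (\d+):(\d+):(\d+)")


def parse_sample_period(text):
    """Convert a sample-period string from the stats output to a timedelta."""
    match = _PERIOD.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised sample period: {text!r}")
    days, hours, minutes, seconds = (int(group) for group in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def errors_per_day(error_count, period):
    """Average number of errors per day over the whole sample period."""
    return error_count / (period.total_seconds() / 86400)


# Values taken from the posted output (Drive Channel 2, Module 1).
period = parse_sample_period("598 days, 12:09:58")
counters = [
    ("RAID Controller Module detected errors", 1655),
    ("Timeout errors", 33),
]
for label, count in counters:
    print(f"{label}: {errors_per_day(count, period):.2f} per day")
```

Once the stats are cleared, the baseline restarts, so re-running `show allPhysicalDiskChannels stats;` a few days later shows whether errors are still accumulating.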