Lost blocks and lost configuration/array on PS6000 - Crisis


This question is answered

Lost blocks and lost configuration on PS6000

After power loss and restart, the following symptoms:

No Ethernet links active
Logging in over the serial console cable reports lost blocks
Old grpadmin password is remembered, but no group info present, so the member is asking to initialise the array. All data would be lost.

Crisis! Is there any way to recover the array, the group info and any data, even if unreliable and incomplete?

Verified Answer
  • Caution: Before using the clearlostdata command, call your support provider.

    Clears lost data blocks on an array.

    Use the clearlostdata command when blocks have been lost but the array does not know which blocks have been lost. In this case, the array will not be able to boot or will no longer be accessible from the network.

    Once connected to the array (either through a serial connection or a network connection to an IP address for a network interface), press the <Enter> key and log in to a group administration account that has read-write permission (for example, the grpadmin account). The console will describe the problem and recommend that you use the clearlostdata command to try to correct the lost block problem. At the CLI prompt, enter the clearlostdata command.

    The clearlostdata command may or may not be able to correct the entire lost blocks problem. In some cases, the problem may be corrected enough to boot the array. If volumes remain offline, you may not be able to recover the data on them. In these cases, you must delete the volumes, recreate them, and recover the data from a backup or replica.

    If the array is a member of a group and the clearlostdata command does not resolve the problem, the array will not boot and you might have to delete the member from the group. If you delete the member, the array will be reset and returned to its factory defaults. Only delete an offline member in extreme circumstances because resetting the array will remove any data on the array. See member on page 153.
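The outcomes described in the two paragraphs above can be sketched as a small decision helper. This is only an illustration of the documented guidance, not a Dell tool: the names `Outcome` and `next_step` are hypothetical, and the branch for an array that is not a group member is an assumption, since the text above does not cover that case.

```python
from enum import Enum


class Outcome(Enum):
    """Possible states after running clearlostdata (hypothetical labels)."""
    BOOTS_VOLUMES_ONLINE = "array boots, all volumes online"
    BOOTS_VOLUMES_OFFLINE = "array boots, some volumes remain offline"
    DOES_NOT_BOOT = "array still does not boot"


def next_step(outcome: Outcome, is_group_member: bool) -> str:
    """Map an outcome to the follow-up action described in the text above."""
    if outcome is Outcome.BOOTS_VOLUMES_ONLINE:
        return "no further recovery action"
    if outcome is Outcome.BOOTS_VOLUMES_OFFLINE:
        # Data on offline volumes may not be recoverable.
        return ("delete and recreate the offline volumes, "
                "then restore from a backup or replica")
    if is_group_member:
        # Extreme circumstances only: this resets the array to factory defaults.
        return ("last resort: delete the member from the group "
                "(resets the array, all data on it is lost)")
    # Assumption: the source does not describe this case.
    return "contact your support provider"


if __name__ == "__main__":
    print(next_step(Outcome.DOES_NOT_BOOT, is_group_member=True))
```

In every branch the caution at the top still applies: call your support provider before acting.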

    Format

    clearlostdata

    Example

    The following example shows how to log in to the grpadmin account from a console terminal and run the clearlostdata command on the array.

    PS Series Storage Arrays

    Unauthorized Access Prohibited

    login: grpadmin

    Password:

    Welcome to Group Manager

    Data loss has occurred. The array will not initialize until the error condition has been cleared.

    > clearlostdata

    cleaning luns

    raid firing scan complete

     

    -joe

    Follow me on Twitter: @joesatdell

     

All Replies

  • Thanks Joe!

    Gulp.

    OK, typed that in and hit Enter. No visible output yet. How long should I expect it to take? (sata 16x1GB)

    And maybe of more concern: what causes this, and how likely is it to hit my other group/members?

  • And can I re-enable the network connection for telnet/ssh/ftp? Collecting diag files over serial port is not fun!

  • Per the instructions posted: you called support, and you also saw the prompt "Data loss has occurred. The array will not initialize until the error condition has been cleared" before running the command, right?

    It typically takes only a short time, say a minute or two, to complete.

    When the array isn't shut down properly (i.e., by issuing the shutdown command), several things could cause this state. That's why the first thing we instruct is to call support, who can help identify the actual cause. Some possible causes are:

    • We either have no data in cache, or the data that is in cache doesn't match what we think should be in cache
    • We have data in cache, but we don't have the corresponding disks that the data belongs to
    • Also a piece of hardware could have been damaged (for example through a voltage spike)

    Until the array has recovered from this state and completed initialization (a clean boot), the only way to communicate with the member is through the serial connection.

    -joe


  • Thanks Joe.

    clearlostdata isn't doing anything for me except producing a diagnostic text dump, and the member is out of maintenance. So I decided to bite the bullet and reset the array. The data's not mission-critical, fortunately. One disk is reporting faulty, which may have contributed to the problem, even though there were 2 hot spares.

    I am worried about what caused the data loss, and about the ongoing danger to the other group and its members. They're production/mission-critical and I'm very nervous about this recurring.

  • I fully understand your concerns.

    Just to be clear, the clearlostdata command is only for the specific situation where the array displays "Data loss has occurred. The array will not initialize until the error condition has been cleared".
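As a rough illustration of that precondition, a captured serial console transcript can be checked for the exact message before anyone considers the command. This is a minimal sketch, assuming the transcript has been saved as text; the function name `banner_present` is hypothetical, and the check does not replace calling support.

```python
# Message quoted in this thread; clearlostdata applies only when it is shown.
DATA_LOSS_BANNER = (
    "Data loss has occurred. The array will not initialize "
    "until the error condition has been cleared"
)


def banner_present(console_text: str) -> bool:
    """Return True if the data-loss message appears in the captured console text.

    Whitespace is collapsed first because a serial capture may wrap the
    message across several lines.
    """
    normalized = " ".join(console_text.split())
    return DATA_LOSS_BANNER in normalized


if __name__ == "__main__":
    sample = """login: grpadmin
Password:
Welcome to Group Manager
Data loss has occurred. The array will not
initialize until the error condition has been cleared.
"""
    print(banner_present(sample))
```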

    Even though you didn't have a support contract for that member, in some of these "worst case scenarios" support would still have assisted you in your time of need. So next time please make sure you call us; this may have saved you some time, and you may not have needed to reset the array.

    Since the array was reset, I'm sorry to say that we can't determine the exact cause now. But my guess is that when the array booted up, the CM that became active believed there was unsaved data in the cache, and that data was lost before or during the most recent system start-up (after a sudden power loss or controller failure). It is possible that the data was still available in the cache on the second controller (if one was installed), so we could have tried to use that data to restart the array.

    Regarding reliability (and this is not a sales pitch), tens of thousands of customers use the arrays daily in mission-critical environments.

    -joe
