The watchdog timer expired.

We have several Dell PowerEdge T620 servers in remote locations throughout our Enterprise.  Each of them is randomly throwing the following event and thus far I've found no information about the message or how to resolve it.  I'm hoping someone here can help me figure this out.


Event Message: The watchdog timer expired.

Severity: Critical

Detailed Description: The operating system or potentially an application failed to communicate to the baseboard management controller (BMC) within the timeout period.

Recommended Action: Check the operating system, application, hardware, and system event log for exception events.

Message ID: ASR0000

System Model: PowerEdge T620

Power State: ON

Operating System: Microsoft Windows Server 2012, Standard x64 Edition

While I've been working in the desktop support world for a very long time, I'm fairly new to Dell servers.  I'm trying to help another highly overworked, overstressed administrator with this issue.  If someone can spare the time to help me learn where to look to gain more insight into what might be going on, I'd really appreciate it.  I'm willing and able to learn so I can help take some work off a co-worker's plate.  Thanks.

Verified Answer
  • I apologize for the delay in responding.  After reviewing the errors and doing some research, we found that the error is coming from Dell's OpenManage software v7.2.  Our recommendation is to update your OpenManage to version 7.3 and monitor.  This version should address the timeout error in the service that is triggering the watchdog error.

    Regards,

    Geoff P
    Dell | Social Outreach Services - Enterprise

All Replies
  • The watchdog timer is used to monitor the status of a component. It operates by monitoring responses. When it stops getting a heartbeat from a component that it is monitoring then the timer expires, and you receive an error in the log. When the timer expires it will initiate whatever action is set. If the operating system stops responding then the timer will expire and restart the server if it is set to perform that action.

    The above error doesn't tell us why the timer expired, so you will need to review your hardware and operating system logs to find out what happened when the timer expired.

    Regards,

    Geoff P
    Dell | Social Outreach Services - Enterprise
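For readers new to the concept, the heartbeat-and-timeout behaviour described in the reply above can be sketched in a few lines of Python. This is only a toy software model, not how the BMC or OpenManage implements it; the `Watchdog` class, its method names, and the injectable `clock` parameter are all hypothetical names chosen for the illustration.

```python
import time


class Watchdog:
    """Toy watchdog: expires if no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout, on_expire, clock=time.monotonic):
        self.timeout = timeout        # seconds allowed between heartbeats
        self.on_expire = on_expire    # configured action (log, reboot, ...)
        self.clock = clock            # injectable so the sketch can be tested
        self.last_beat = clock()
        self.fired = False

    def heartbeat(self):
        """Called by the monitored component to show it is still alive."""
        self.last_beat = self.clock()
        self.fired = False

    def poll(self):
        """Check the timer; run the configured action once when it expires."""
        if not self.fired and self.clock() - self.last_beat > self.timeout:
            self.fired = True
            self.on_expire()
        return self.fired
```

If the monitored process crashes, it stops calling `heartbeat()`, so the next `poll()` after the timeout runs the configured action. The expiry alone says nothing about why the heartbeats stopped, which is why the other logs still need to be checked.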



  • Just prior to the error, the following events occurred:

    8/23/2013 11:10:06 PM

    Faulting application name: dsm_sa_datamgr64.exe, version: 7.2.0.3801, time stamp: 0x50c769ae
    Faulting module name: dciemp64.dll, version: 7.2.0.3999, time stamp: 0x50c77d73
    Exception code: 0xc0000005
    Fault offset: 0x0000000000004038
    Faulting process id: 0x8e0
    Faulting application start time: 0x01ce9f07bdc36762
    Faulting application path: C:\Program Files\Dell\SysMgt\dataeng\bin\dsm_sa_datamgr64.exe
    Faulting module path: C:\Program Files\Dell\SysMgt\omsa\bin\dciemp64.dll
    Report Id: acddebe1-0c6a-11e3-93f9-001018f63d67
    Faulting package full name:
    Faulting package-relative application ID:

    Followed by...

    08/23/2013 11:10:06 PM

    Fault bucket , type 0
    Event Name: APPCRASH
    Response: Not available
    Cab Id: 0

    Problem signature:
    P1: dsm_sa_datamgr64.exe
    P2: 7.2.0.3801
    P3: 50c769ae
    P4: dciemp64.dll
    P5: 7.2.0.3999
    P6: 50c77d73
    P7: c0000005
    P8: 0000000000004038
    P9:
    P10:

    Attached files:
    C:\Windows\Temp\WER1075.tmp.appcompat.txt
    C:\Windows\Temp\WER10D4.tmp.WERInternalMetadata.xml
    C:\Windows\Temp\WER10D5.tmp.hdmp
    C:\Windows\Temp\WER13C2.tmp.dmp

    These files may be available here:
    C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_dsm_sa_datamgr64_6033d1f5754645d6f47ce76327e3cf9364ed73_cab_094e147b

    Analysis symbol:
    Rechecking for solution: 0
    Report Id: acddebe1-0c6a-11e3-93f9-001018f63d67
    Report Status: 96
    Hashed bucket:


    And finally...

    08/23/2013 11:10:08 PM

    Fault bucket , type 0
    Event Name: APPCRASH
    Response: Not available
    Cab Id: 0

    Problem signature:
    P1: dsm_sa_datamgr64.exe
    P2: 7.2.0.3801
    P3: 50c769ae
    P4: dciemp64.dll
    P5: 7.2.0.3999
    P6: 50c77d73
    P7: c0000005
    P8: 0000000000004038
    P9:
    P10:

    Attached files:
    C:\Windows\Temp\WER1075.tmp.appcompat.txt
    C:\Windows\Temp\WER10D4.tmp.WERInternalMetadata.xml
    C:\Windows\Temp\WER10D5.tmp.hdmp
    C:\Windows\Temp\WER13C2.tmp.dmp

    These files may be available here:
    C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_dsm_sa_datamgr64_6033d1f5754645d6f47ce76327e3cf9364ed73_cab_094e147b

    Analysis symbol:
    Rechecking for solution: 0
    Report Id: acddebe1-0c6a-11e3-93f9-001018f63d67
    Report Status: 4
    Hashed bucket:

    Does that help at all?  If not, where specifically should I be looking for logs?  I've checked the iDRAC7 and it had less data than the original message.  The three events above were found in the Windows Event Viewer.
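The hexadecimal fields in the first Application Error event above can be decoded with a short script. The following is a minimal sketch, assuming the event text has been copied out of Event Viewer as plain "Key: value" lines; the function names and the one-entry status table are illustrative only. It relies on two well-known Windows conventions: exception code 0xC0000005 is STATUS_ACCESS_VIOLATION, and the application start time is a FILETIME (100-nanosecond ticks since 1601-01-01 UTC).

```python
from datetime import datetime, timedelta, timezone

# Tiny lookup table; only the code seen in this thread is listed.
NTSTATUS_NAMES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

# A few lines from the first event quoted above.
SAMPLE = r"""
Faulting application name: dsm_sa_datamgr64.exe, version: 7.2.0.3801, time stamp: 0x50c769ae
Faulting module name: dciemp64.dll, version: 7.2.0.3999, time stamp: 0x50c77d73
Exception code: 0xc0000005
Fault offset: 0x0000000000004038
Faulting application start time: 0x01ce9f07bdc36762
Faulting module path: C:\Program Files\Dell\SysMgt\omsa\bin\dciemp64.dll
"""


def filetime_to_datetime(value):
    """Convert a FILETIME tick count to an aware UTC datetime."""
    return FILETIME_EPOCH + timedelta(microseconds=value // 10)


def parse_fault_record(text):
    """Split each 'Key: value' line at the first colon into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def summarize(text):
    """Pull out the fields that identify the crash."""
    f = parse_fault_record(text)
    code = int(f["Exception code"], 16)
    return {
        "module": f["Faulting module name"].split(",")[0],
        "exception": NTSTATUS_NAMES.get(code, hex(code)),
        "offset": int(f["Fault offset"], 16),
        "started_utc": filetime_to_datetime(
            int(f["Faulting application start time"], 16)
        ),
    }


if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(f"{key}: {value}")
```

Read this way, the event says the Data Manager process hit an access violation inside dciemp64.dll, which points at the OpenManage software rather than at the hardware.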

  • Dell-Geoff P,


    I was at one of our facilities yesterday, so I went ahead and ran the latest SUU on the server and got it caught up on all firmware and driver updates.  That included the OpenManage Server Administrator upgrade to 7.3.0.  We'll monitor the server over the next few days and I'll report back with my findings.


    Thank you,

    Geoff

  • So far I've not seen this message return on the one server upgraded.  I will be upgrading a second of seven servers tomorrow.  I'll update you afterward.  Thank you for your patience while we work to get fully updated.  It should go faster after tomorrow's work.

  • It appears that upgrading to Dell OpenManage 7.3 has resolved this issue.  Thanks for your help!

  • Actually, I have the same error, but mine is a brand-new server loaded with OM 7.3.

    Any thoughts?

    -------------------------------

    System Host Name: JCMS8BDC01
    Event Message: The watchdog timer expired.
    Date/Time: Mon Oct 14 2013 16:37:08
    Severity: Critical
    
    Detailed Description: The operating system or potentially an application failed to communicate to the baseboard management controller (BMC) within the timeout period.
    Recommended Action: Check the operating system, application, hardware, and system event log for exception events. 
    Message ID: ASR0000

    ------------------------
    Windows log reads
    ------------------------

    Faulting application name: dsm_sa_datamgr64.exe, version: 7.3.0.350, time stamp: 0x51b23742
    Faulting module name: dsm_sa_datamgr64.exe, version: 7.3.0.350, time stamp: 0x51b23742
    Exception code: 0xc0000005
    Fault offset: 0x0000000000014c77
    Faulting process id: 0x5c0
    Faulting application start time: 0x01cec924561cbd3a
    Faulting application path: C:\Program Files\Dell\SysMgt\dataeng\bin\dsm_sa_datamgr64.exe
    Faulting module path: C:\Program Files\Dell\SysMgt\dataeng\bin\dsm_sa_datamgr64.exe


  • I've had the same error on 4 Windows 2008 R2 PE blades with OM 7.3 after installing this month's Microsoft patches, which included numerous .NET updates.  After the reboot, the DSM SA Data Manager service does not start.  Manually starting the service works.  After a second reboot, the service starts on its own.

    Faulting application name: dsm_sa_datamgr64.exe, version: 7.3.0.350, time stamp: 0x51b23742
    Faulting module name: dsm_sa_datamgr64.exe, version: 7.3.0.350, time stamp: 0x51b23742
    Exception code: 0xc0000005
    Fault offset: 0x0000000000014c77
    Faulting process id: 0x780
    Faulting application start time: 0x01cec9552d6c3a8b
    Faulting application path: C:\Program Files\Dell\SysMgt\dataeng\bin\dsm_sa_datamgr64.exe
    Faulting module path: C:\Program Files\Dell\SysMgt\dataeng\bin\dsm_sa_datamgr64.exe
    Report Id: 9040c032-3548-11e3-8a30-e0db55230842

  • Just installed the MS updates on an R620 running Windows 2008 R2 with OM 7.3 and had an unexpected ASR watchdog reboot.

    Faulting application name: dsm_sa_datamgr64.exe, version: 7.3.0.350, time stamp: 0x51b23742
    Faulting module name: dsm_sa_datamgr64.exe, version: 7.3.0.350, time stamp: 0x51b23742
    Exception code: 0xc0000005
    Fault offset: 0x0000000000014c77
    Faulting process id: 0x584
    Faulting application start time: 0x01cec9d288eeaa00
    Faulting application path: C:\Program Files\Dell\SysMgt\dataeng\bin\dsm_sa_datamgr64.exe
    Faulting module path: C:\Program Files\Dell\SysMgt\dataeng\bin\dsm_sa_datamgr64.exe
    Report Id: e3ac4279-35c5-11e3-9bd5-b8ca3af5c99a

    7.3 is definitely not the solution here. Anyone any ideas?
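One way to compare the crash reports posted in this thread is to group them by faulting module and offset. The sketch below does that with values copied by hand from the events above; the tuple layout and function name are assumptions made for the illustration. A shared signature suggests, but does not prove, that the 7.3 crashes have a common cause that differs from the 7.2 crash.

```python
from collections import Counter

# (application version, faulting module, fault offset) from the posts above.
REPORTS = [
    ("7.2.0.3801", "dciemp64.dll", 0x4038),
    ("7.3.0.350", "dsm_sa_datamgr64.exe", 0x14C77),
    ("7.3.0.350", "dsm_sa_datamgr64.exe", 0x14C77),
    ("7.3.0.350", "dsm_sa_datamgr64.exe", 0x14C77),
]


def describe(reports):
    """Return one line per distinct crash signature, most frequent first."""
    lines = []
    for (version, module, offset), count in Counter(reports).most_common():
        lines.append(f"{count}x {module}+{offset:#x} (version {version})")
    return lines


if __name__ == "__main__":
    print("\n".join(describe(REPORTS)))
```

The three 7.3 reports collapse into a single signature inside dsm_sa_datamgr64.exe itself, while the 7.2 crash was in dciemp64.dll.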

     

  • Encountering the same error from iDRAC on an R710 server running VMware ESXi:

    Event: The watchdog timer expired.
    Date/Time: Sat Jan 18 2014 10:24:31
    Severity: Critical
    Model: PowerEdge R710
    Service Tag: F931Z4J
    BIOS version: 6.3.0
    Hostname: left blank intentionally 
    OS Name: VMware ESXi 5.1.0 build-1065491.0 build-106549
    iDrac version: 1.85

  • The problem was solved by upgrading the iDRAC firmware directly from 1.85 to 1.96.

  • The same critical iDRAC alert reappeared two weeks later, even with iDRAC upgraded to the latest version, 1.96.

    Message: 
    Event: The watchdog timer expired.
    Date/Time: Tue Feb 04 2014 01:45:48
    Severity: Critical
    Model: PowerEdge R710
    Service Tag: H4SY75J
    BIOS version: 6.3.0
    Hostname: 
    OS Name: VMware ESXi 5.1.0 build-1065491.0 build-106549
    iDrac version: 1.96

  • We are getting a LOT of these alerts now, every time we reboot a server after a WSUS update (i.e., monthly). All the boxes are on 7.3.

    Is there a way to disable this function/alert??

    Thx,

    John Bradshaw