System Alert from SP of (REBOOT (watchdog reset)) CRITICAL | NetApp

I ran into this recently while upgrading the SP on an older FAS3270 HA pair running Data ONTAP 7.3.6. I ran software install, then sp upgrade. That’s when a cluster takeover occurred:

Mon Aug 22 08:59:47 MDT [cf.fsm.firmwareStatus:info]: Cluster monitor: partner dumping core  
Mon Aug 22 08:59:49 MDT [cf.ic.xferTimedOut:error]: wafl interconnect transfer timed out  
Mon Aug 22 08:59:50 MDT [cf.ic.xferTimedOut:error]: ofw interconnect transfer timed out  
Mon Aug 22 08:59:50 MDT [netif.linkDown:info]: Ethernet c0a: Link down, check cable.  

I have seen this before, and after checking the SP I confirmed that its uptime was indeed over 500 days.

SP> sp uptime  
 15:02:17 up 759 days, 23:14, load average: 1.26, 1.27, 1.12
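
The check above is easy to script if you collect the sp uptime output from many controllers. Here is a minimal sketch in Python. It assumes the output line looks like the one shown above (standard uptime format), and the function names and the 500-day constant are my own, not anything from NetApp tooling.

```python
import re

# Threshold taken from the KB summary: roughly 500 days of SP uptime.
UPTIME_LIMIT_DAYS = 500


def sp_uptime_days(uptime_line: str) -> int:
    """Return the whole number of days from an `sp uptime` output line.

    Returns 0 when the line has no "N days" field (uptime under one day).
    """
    match = re.search(r"up\s+(\d+)\s+days?", uptime_line)
    return int(match.group(1)) if match else 0


def needs_sp_reboot(uptime_line: str, limit: int = UPTIME_LIMIT_DAYS) -> bool:
    """True when the SP has been up for at least `limit` days."""
    return sp_uptime_days(uptime_line) >= limit


line = " 15:02:17 up 759 days, 23:14, load average: 1.26, 1.27, 1.12"
print(sp_uptime_days(line), needs_sp_reboot(line))
```

You may want a lower limit than 500 in practice, so that the SP gets rebooted well before it reaches the danger zone.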

To fix this, I did a giveback, rebooted both SPs, and was then able to perform the SP upgrade successfully. You are prompted to reboot the SP after a successful upgrade, and I know I’ve been able to do this in the past without issue, but lesson learned: it’s better to reboot the SP beforehand than to risk a takeover.

More from the KB article:

Service Processor Can Trigger Watchdog Reset After 500+ Days of Uptime:
Summary

The Service Processor (SP) contains a memory resource leak affecting the IPMI software, eventually causing the SP to become unresponsive and possibly triggering a watchdog reset of Data ONTAP®. In order to trigger this reset, the SP must be running for approximately five hundred (500) days or longer.

Symptom

When the SP is close to failing, the following errors are reported regularly on the Data ONTAP console:

[hostname: statd: sp.network.link.down:warning]: Service Processor (SP) network port link down due to cable or network errors.
[hostname: statd: sp.network.link.down:warning]: Service Processor (SP) network port link down due to cable or network errors.
[hostname: statd: sp.network.link.down:warning]: Service Processor (SP) network port link down due to cable or network errors.
[hostname: orftp_rcv_file_from: sp.orftp.failed:warning]: SP communication error, receiver could not locate file on SP.
[hostname: statd: spmgmt.driver.hourly.stats:warning]: The software driver for the Service Processor (SP) detected a problem: Configuration Error (1). 

The following messages are reported in the SP’s Event Log proximate to the watchdog reset:

[IPMI.notice]: 1705 | c0 | OEM: ffff70005000 | ManufId: 150300 | SP Reset Externally
[IPMI.notice]: 1805 | c0 | OEM: fcff70000000 | ManufId: 150300 | POS Register: Unexpected Reset
[IPMI.notice]: 1905 | c0 | OEM: ffff70005000 | ManufId: 150300 | SP Reset Externally
[IPMI.notice]: 1a05 | c0 | OEM: fcff70000000 | ManufId: 150300 | POS Register: Unexpected Reset
[IPMI.notice]: 1b05 | c0 | OEM: ffff70005000 | ManufId: 150300 | SP Reset Externally
[IPMI.notice]: 1c05 | c0 | OEM: fcff70000000 | ManufId: 150300 | POS Register: Unexpected Reset
[IPMI.notice]: 1d05 | c0 | OEM: ffff70005000 | ManufId: 150300 | SP Reset Externally
[IPMI.notice]: 1e05 | c0 | OEM: fcff70000000 | ManufId: 150300 | POS Register: Unexpected Reset
[IPMI.notice]: 1f05 | c0 | OEM: ffff70005000 | ManufId: 150300 | SP Reset Externally
[IPMI.notice]: 2005 | c0 | OEM: fcff70000000 | ManufId: 150300 | POS Register: Unexpected Reset
[IPMI.notice]: 2105 | 02 | EVT: 6f01ffff | Partner_IO_Pre | Assertion Event, "Absent"
[IPMI.notice]: 2205 | 02 | EVT: 0301ffff | System_Fault | Assertion Event, "State Asserted"
[IPMI.notice]: 2305 | 02 | EVT: 0301ffff | Controller_Fault | Assertion Event, "State Asserted"
[IPMI.notice]: 2405 | 02 | EVT: 0300ffff | System_Fault | Assertion Event, "State Deasserted"
[IPMI.notice]: 2505 | 02 | EVT: 0301ffff | PSU1_Input_Type | Assertion Event, "State Asserted"
[IPMI.notice]: 2605 | 02 | EVT: 0301ffff | PSU2_Input_Type | Assertion Event, "State Asserted"
[IPMI.notice]: 2705 | 02 | EVT: 0301ffff | System_Fault | Assertion Event, "State Asserted"
[IPMI Event.critical]: NMI
[IPMI Event.critical]: L2 watchdog timeout hard reset

The messages in the Event Log occur so close to the watchdog reset that they cannot be used to predict failure. If a system restarts due to a watchdog reset, this signature can be used to verify whether this issue caused the reset.
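
If you have the SP Event Log saved as text, the after-the-fact verification can be sketched in Python. The marker strings are copied from the excerpt above. The rule that all three must be present is my own assumption about what counts as "this signature", not something the KB spells out.

```python
# Marker strings copied from the SP Event Log excerpt in the KB.
RESET_MARKERS = ("SP Reset Externally", "POS Register: Unexpected Reset")
WATCHDOG_MARKER = "L2 watchdog timeout hard reset"


def matches_watchdog_signature(event_log: str) -> bool:
    """True when the log contains both reset markers and the watchdog line.

    Assumption: the signature is "repeated SP resets plus an L2 watchdog
    timeout"; order and repeat counts are not checked.
    """
    lines = event_log.splitlines()
    has_watchdog = any(WATCHDOG_MARKER in entry for entry in lines)
    has_resets = all(
        any(marker in entry for entry in lines) for marker in RESET_MARKERS
    )
    return has_watchdog and has_resets


sample = "\n".join([
    "[IPMI.notice]: 1705 | c0 | OEM: ffff70005000 | ManufId: 150300 | SP Reset Externally",
    "[IPMI.notice]: 1805 | c0 | OEM: fcff70000000 | ManufId: 150300 | POS Register: Unexpected Reset",
    "[IPMI Event.critical]: NMI",
    "[IPMI Event.critical]: L2 watchdog timeout hard reset",
])
print(matches_watchdog_signature(sample))
```

A match only suggests this issue was the cause. Confirm the SP uptime at the time of the reset before drawing conclusions.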

Workaround

Run the SP CLI command sp uptime to determine how long the SP has been running since the last reset.

Reboot the SP:

A reboot of the SP will clear any existing memory resource issues. This action is nondisruptive to data operations on the storage system. Once restarted, the SP will be able to run undisturbed for approximately the next five hundred days. The rate and severity of the leak are well characterized, so this approach is a reasonable workaround until the SP is updated with a fixed firmware release.

Caution: If the SP Firmware of your system is < 1.3 (FAS32xx/62xx platforms) or < 2.1 (FAS22xx platforms), you should be aware of BUG ID: 546048: Service Processor (SP) fails to come up after “sp reboot” command. Plan for a potential maintenance window to power-cycle the system, if needed.
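
Before rebooting an SP, it is worth checking whether the caution above applies to it. Here is a small Python sketch of that comparison. The version thresholds come from the caution text. The platform-family keys, the function names, and the assumption that SP firmware versions look like 1.3.3P2 are mine.

```python
import re

# Minimum SP firmware versions from the caution, keyed by platform family.
# Below these versions, BUG 546048 may affect "sp reboot".
MIN_SAFE_VERSION = {
    "FAS32xx": (1, 3),
    "FAS62xx": (1, 3),
    "FAS22xx": (2, 1),
}


def parse_version(version: str) -> tuple:
    """Turn '1.3.3P2' into (1, 3, 3); a patch suffix such as 'P2' is ignored."""
    numbers = re.match(r"\d+(?:\.\d+)*", version)
    if not numbers:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in numbers.group(0).split("."))


def reboot_bug_may_apply(family: str, sp_version: str) -> bool:
    """True when the SP firmware is older than the family's threshold."""
    return parse_version(sp_version)[:2] < MIN_SAFE_VERSION[family]


print(reboot_bug_may_apply("FAS32xx", "1.2.3"))    # older than 1.3
print(reboot_bug_may_apply("FAS32xx", "1.3.3P2"))  # at or above 1.3
```

If the function returns True for a controller, plan the maintenance window mentioned in the caution before issuing sp reboot.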

Solution

Install an SP firmware release that mitigates this issue, once one is available.

In my case, for a FAS3270 running Data ONTAP Release 7.3.x, the SP firmware release would be 1.3.3P2, which is what I installed.

More on the KB here: https://kb.netapp.com/support/index?page=content&id=7010141&locale=en_US
