VMware SRM Error: “Failed to create snapshots of replica devices”

Today I encountered the VMware SRM error “Failed to create snapshots of replica device Cause: SRA command ‘testFailoverStart’ failed. Storage port not found Either Storage port information provided in NFS list is incorrect else Verify the “isPv4” option in ontap_config file matches the ipaddress in NFS field.”


I found the solution in KB article 000016026: https://kb.netapp.com/support/s/article/ka11A0000001BN5QAM/sra-command-testfailoverstart-failed-storage-port-not-found-either-storage-port-information-provided-in-nfs-list-is-incorrect?language=en_US

The cause appears to be the firewall-policy of the SVM data LIFs. They were set to “mgmt”, and according to the KB article the SRA does not detect LIFs with that policy.

To change the firewall-policy from “mgmt” to “data”:
net int modify -vserver [vserver_name] -lif [data_lif_name] -firewall-policy data  
To list LIFs by firewall-policy:
net int show -fields firewall-policy  
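
If there are many LIFs, the output of the show command can be filtered with a small script. This is only a rough sketch: it assumes the output was pasted into a string and that each data row has three whitespace-separated columns (vserver, lif, firewall-policy). The function names and the sample rows are made up for illustration.

```python
def lifs_with_policy(show_output, policy="mgmt"):
    """Return (vserver, lif) pairs whose firewall-policy column equals `policy`."""
    matches = []
    for line in show_output.splitlines():
        fields = line.split()
        # Assumption: data rows have exactly three columns
        # (vserver, lif, firewall-policy). Header and separator rows
        # never have a policy name in the third column, so they are skipped.
        if len(fields) == 3 and fields[2] == policy:
            matches.append((fields[0], fields[1]))
    return matches


def modify_commands(pairs):
    """Build the modify command from this post for each (vserver, lif) pair."""
    return [
        f"net int modify -vserver {vserver} -lif {lif} -firewall-policy data"
        for vserver, lif in pairs
    ]


# Hypothetical sample output, not taken from a real system.
SAMPLE_OUTPUT = """\
vserver lif       firewall-policy
------- --------- ---------------
svm1    nfs_lif1  mgmt
svm1    nfs_lif2  data
svm1    svm1_mgmt mgmt
"""

if __name__ == "__main__":
    for command in modify_commands(lifs_with_policy(SAMPLE_OUTPUT)):
        print(command)
```

The filter cannot tell a data LIF from a real management LIF, which is supposed to keep the “mgmt” policy, so review the list before running any of the generated commands.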

The article also advises checking the ontap_config file on the SRM server to “ensure that the NFS IP address on the controller is correct and the IP address format mentioned in the NFS address field matches the value set for the isipv4 option in the ontap_config file.”

By default, the configuration file is located at install_dir\Program Files\VMware\VMware vCenter Site Recovery Manager\storage\sra\ONTAP\ontap_config.txt. Look for the “isPv4” option there (the KB article spells it isipv4).
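
The address-format check the KB describes can be sketched in a few lines of Python. This is my own illustration, not something from the KB: the function name is hypothetical, and it does not read ontap_config.txt itself (you pass in the NFS address and whatever value the IPv4 option is set to).

```python
import ipaddress


def nfs_address_matches_flag(nfs_address, is_ipv4):
    """Return True when the address family agrees with the IPv4 setting.

    nfs_address: the IP address entered in the NFS address field.
    is_ipv4:     True if the config option says IPv4, False otherwise.
    Raises ValueError if nfs_address is not a valid IP address at all.
    """
    address = ipaddress.ip_address(nfs_address.strip())
    return (address.version == 4) == is_ipv4


if __name__ == "__main__":
    # Example addresses are placeholders.
    print(nfs_address_matches_flag("192.168.10.5", True))
    print(nfs_address_matches_flag("fd00::5", True))
```

A mismatch here (for example an IPv6 address with the option set for IPv4) is the second condition the error message asks you to verify.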

NetApp OCUM Error: “Cluster cannot be deleted when discovery is in progress.”

I opened a case recently because a cluster in OnCommand Unified Manager was no longer polling. When I tried to delete and re-add it, I received the error “Cluster cannot be deleted when discovery is in progress.” I am running version 7.0.

It turns out I had hit a BURT. The BURT number is 1053008, and you can view it by logging into the support site with your credentials.

The fix is here: https://kb.netapp.com/support/s/article/An-incomplete-removal-of-a-cluster-via-the-UM-dashboard-prevents-further-collection-when-the-cluster-is-re-added

The link doesn’t seem to be working right now, but I copied the details below.

Note: for vApps you will need to use the diag shell. Instructions can be found here: https://kb.netapp.com/support/s/article/ka31A00000012qfQAA/How-to-access-the-OnCommand-Virtual-Machine-DIAG-shell

Symptom
• Error unable to discover cluster, Cluster already exists.
• When a cluster is added to the OnCommand Unified Manager (UM) Dashboard, this event gets logged:

Failed to add cluster 172.16.42.16. An internal error has occurred. Contact technical support. Details: Cannot update server (com.netapp.oci.server.UpdateTaskException [68-132-983])

• When the user attempts to remove the cluster, the removal fails, indicating that the cluster is being acquired.
• Navigating to Health, Settings, Manager datasources shows that the datasource is failing.
• For UM, the ocumserver-debug log may contain:
2016-12-19 08:06:14,583 DEBUG [oncommand] [reconcile-0] [c.n.dfm.collector.OcieJmsListener] OCIE JMS notification message received: {DatasourceName=Unknown, DatasourceID=-1, ClusterId=3387647, ChangeType=ADDED, UpdateTime=1482152623430, MessageType=CHANGE}
• When a cluster is added to the UM Dashboard this message may be displayed indicating that the issue is in the OnCommand Performance Manager (OPM) database:
Cluster in a MetroCluster configuration is added only to Unified Manager. Cluster add failed for Performance Manager.

Note: The MetroCluster part of the message is not relevant but is included as that full message is possible.
• For OPM, the ocfserver-debug log may contain:
2016-12-19 09:15:00,013 ERROR [system] [taskScheduler-5] [o.s.s.s.TaskUtils$LoggingErrorHandler] Unexpected error occurred in scheduled task.
com.netapp.ocf.collector.OcieException: com.onaro.sanscreen.acquisition.sessions.AcquisitionUnitException [35]
Failed to getById id:-1
<>
Caused by: com.onaro.sanscreen.acquisition.sessions.AcquisitionUnitException: Failed to getById id:-1
<>

Cause
This is under investigation in Documented Issue 1053008.

The cluster is successfully removed from the datasource tables but not from the inventory tables, so when the cluster is re-added there is a disconnect between the two sets of tables. Attempts to re-add the inventory and performance data fail because the new entries duplicate values tied to the old objects in the database, and those values must be unique.

Solution
1. Shut down the OPM host.
2. Shut down the UM host.
3. Take a VMware snapshot or other backup per your company policy.
4. Boot UM.
5. When the UM WebUI is accessible, boot OPM.
6. Check MySQL to determine which hosts have the invalid datasource ID.
For vApps
Use KB 000030068 to get to the diag shell.
diag@OnCommand:~# sudo mysql -e "select datasourceId, name, managementIp from netapp_model.cluster where datasourceId = -1;"
+--------------+-------------+--------------+
| datasourceId | name        | managementIp |
+--------------+-------------+--------------+
|           -1 | clusterName | 10.0.0.2     |
+--------------+-------------+--------------+
diag@OnCommand:~#

For RHEL
diag@OnCommand:~# sudo mysql -e "select datasourceId, name, managementIp from netapp_model.cluster where datasourceId = -1;"
+--------------+-------------+--------------+
| datasourceId | name        | managementIp |
+--------------+-------------+--------------+
|           -1 | clusterName | 10.0.0.2     |
+--------------+-------------+--------------+
diag@OnCommand:~#

For Windows
A. Open a Windows Command Prompt window.
B. Browse to the MySQL\MySQL Server 5.6\bin directory.
EXAMPLE: >cd "Program Files\MySQL\MySQL Server 5.6\bin"
Authenticate MySQL to access the database: >mysql -u -p
C. When you press [ENTER] the system will prompt you to enter the user’s password.
a. NOTE: The user and password were created when MySQL was first installed on the Windows host. There is no NetApp default user that can be used to authenticate MySQL.
MySQL>select datasourceId, name, managementIp from netapp_model.cluster where datasourceId = -1;
+--------------+-------------+--------------+
| datasourceId | name        | managementIp |
+--------------+-------------+--------------+
|           -1 | clusterName | 10.0.0.1     |
+--------------+-------------+--------------+

  7. Download the attached script appropriate for the version of UM and OPM.
    For vApps
    A. Use an application such as FileZilla or WinSCP to upload the script to the /upload directory on the vApp.
    B. Use KB 000030068 to get to the diag shell.
    C. Add the execute attribute to the script.
    a. Syntax: # sudo chmod +x /jail/upload/BURT1053008_um_70_2016-12-27.sh
    For RHEL
    A. Be sure to have sudo or root access to the host.
    B. Move the script to /var/logs/ocum/
    C. Add the execute attribute to the script.
    a. Syntax: # sudo chmod +x /var/logs/ocum/BURT1053008_um_70_2016-12-27.sh
    For Windows (UM only)
    A. Browse to the directory where MySQL is installed: MySQL\MySQL Server 5.6\bin
    a. EXAMPLE: C:\Program Files\MySQL\MySQL Server 5.6\bin
    B. Save a copy of the following script in the ‘MySQL Server 5.6\bin’ directory: BURT1053008_um_70_Windows

  8. Execute the script.
    A. For vApps: sudo /jail/upload/scriptname
    B. For RHEL: sudo /var/logs/ocum/scriptname
    C. For Windows: MYSQL.exe -u -p MySQL\MySQL Server 5.6\bin >Script_name

  9. Confirm that the datasources with ID -1 are gone:
    vApps and RHEL: diag@OnCommand:~# sudo mysql -e "select datasourceId, name, managementIp from netapp_model.cluster where datasourceId = -1;"
    Windows:
    A. Authenticate MySQL as per the steps outlined above, then run Syntax:>
    select datasourceId, name, managementIp from netapp_model.cluster where datasourceId = -1;
    B. Type: Exit to exit MySQL
  10. Reboot the host.
  11. Once UM is back up, perform this same process with OPM.
  12. Once UM and OPM have been corrected, perform a discovery of the cluster and verify that it is showing up within the WebUI.
  13. If the same failure occurs, please contact NetApp Technical Support for further assistance.

After I ran the scripts provided in the BURT on both the OnCommand Unified Manager and OnCommand Performance Manager servers, the cluster no longer showed up in inventory and I could add it successfully.
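
If you want to double-check the datasourceId = -1 query before and after running the script, the boxed table that the mysql client prints can be parsed with a few lines of Python. This is just a sketch of my own, not part of the KB: the function names are hypothetical, and it assumes the output was pasted in as text. When mysql output is piped rather than shown on a terminal it is tab-separated unless you pass -t (--table), so the parser below only fits the boxed form shown in the steps above.

```python
def parse_mysql_table(text):
    """Parse the boxed table printed by the mysql client into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +-----+ border lines, prompts and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first | ... | line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def stale_clusters(text):
    """Return the rows whose datasourceId is -1."""
    return [row for row in parse_mysql_table(text) if row.get("datasourceId") == "-1"]


# Sample output in the same shape as the KB example.
SAMPLE = """\
+--------------+-------------+--------------+
| datasourceId | name        | managementIp |
+--------------+-------------+--------------+
|           -1 | clusterName | 10.0.0.2     |
+--------------+-------------+--------------+
"""

if __name__ == "__main__":
    for row in stale_clusters(SAMPLE):
        print(row["name"], row["managementIp"])
```

An empty result after the script has run corresponds to the "datasources with ID -1 are gone" check in the procedure.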