Recovery :: Windows 2008 R2 AlwaysOn Cluster - Migrate DR Node
Oct 8, 2015
I have a Windows 2008 R2 AlwaysOn cluster with 3 nodes (two in the primary site and one in the DR site).
Primary Site:
-Primary Site Server1
-Primary Site Server2
DR Site 1 (to be decommed):
-DR Site Server1
Our company is planning on decommissioning the DR site. But before we do this, we want to add a 4th node at a new DR site to the cluster, migrate the data, and then decommission the original DR site.
Is it possible to have this configuration:
Primary Site:
-Primary Site Server1
-Primary Site Server2
DR Site 1 (to be decommed):
-DR Site Server1
DR Site 2 (NEW DR Site):
-DR Site Server1
If this is possible, do I simply add the new DR site to the existing cluster (the same steps as adding the first DR node when the cluster was originally configured), or are there special steps?
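For reference, if the topology allows it, the SQL-side part is usually the same as for the first DR replica: join the new node to the WSFC, enable the AlwaysOn feature on its instance, then add it as a replica. A minimal sketch only, with a hypothetical AG name and server/endpoint names:
[code]
-- Run on the current primary. [MyAG], NEWDR-SQL1 and the endpoint URL are placeholders.
ALTER AVAILABILITY GROUP [MyAG]
    ADD REPLICA ON N'NEWDR-SQL1' WITH (
        ENDPOINT_URL      = N'TCP://newdr-sql1.contoso.local:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = MANUAL,
        SECONDARY_ROLE (ALLOW_CONNECTIONS = NO)
    );

-- Then, on the new DR instance itself:
-- ALTER AVAILABILITY GROUP [MyAG] JOIN;
-- and for each database (after restoring it there WITH NORECOVERY):
-- ALTER DATABASE [MyDatabase] SET HADR AVAILABILITY GROUP = [MyAG];
[/code]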
I also need to change the authentication mode from Windows to mixed; it's a 4-node setup participating in AlwaysOn. Will this break or impact AlwaysOn in any way? I know I have to restart the SQL instance.
Our network guys have to carry out an IBM Flex Chassis move at our data centre, which will affect the primary replica of one of our SQL 2012 AlwaysOn Availability Group nodes (the secondary replica won't be affected).
They have suggested using vMotion to migrate the primary replica to another virtual host, which will result in a very brief period of network outage for the node.
I've done some reading and have seen a few potential issues regarding Stun During Page Send (SDPS) and increasing thresholds within WSFC. Unfortunately, we're not able to test this prior to the migration, so I have a few questions...
Would it be necessary to failover to the secondary replica node before performing the vMotion (and back again afterwards)?
We have a requirement to build a SQL environment which will give us local high availability and disaster recovery to a second site. We have two sites, Site A and Site B. We are planning to have two nodes at Site A and two nodes at Site B; all four nodes will be part of the same Windows failover cluster. We will build two SQL clustered instances: InstanceA will be clustered between the nodes at Site A, and InstanceB will be clustered between the nodes at Site B. We will enable AlwaysOn between InstanceA and InstanceB, with InstanceA as the primary owner where data will be written and replicated to InstanceB. URL.... Now we also want an InstanceC at Site B, where data written by the application available at Site B will be replicated to an instance at Site A as a replica.
How do I add my second (secondary) node to my AlwaysOn Availability Group after adding my head node, given that the secondary node is a virtual machine? Based on the attached file, is this the correct way?
Cluster services give the high availability needed - that is great. But I have never seen any discussion about what happens when a node fails - what do you do to get everything back to the active-passive tandem? I imagine there is not much difference in the recovery procedure for either the active or the passive node, so I'm just going to make up a scenario that we have encountered. The system hard drive (not the shared disk) on the primary node fails. The cluster fails over to the passive node. Following are the problems I have at hand:
-After installing Windows, I need to install drivers and configure the permissions to access the SAN. There is no way I could do this, since the secondary node has exclusive access to the disks.
-Imagine I got that working: is there any way to install SQL so SQL would know this server used to be the primary node and attach the DB and translog automatically?
-Finally, there is no proper way to apply SQL 2000 Service Pack 3a. Originally, when the cluster was fully functional, the service pack was applied to the active node and that automatically upgraded the passive node. Now we have a machine without 3a and a machine with 3a already installed. See any problem?
Consider all of the above as this one big question: what is the proper procedure to restore a cluster when one of the nodes goes down, whether it's the active or the passive node?
We have 2 clusters, 1 running SQL 2008 on Windows 2008 R2 Server and 1 running SQL 2000 on Windows 2003 Server. Because of a disaster with the disks, each of the passive nodes had to be rebuilt, and I've been asked to install SQL on the nodes.
I've not done this before. Does this mean simply adding a new node to the cluster through the wizard? Or do I need to reinstall the entire cluster?
I think SQL 2000 is too risky as it's unsupported, so I'm going to resist that. But how should I approach the SQL 2008 instance?
My environment has a 4-node cluster, 2 in the primary and 2 in the secondary DC. Storage is separate for both.
I need to set up AlwaysOn for 4 instances on the 2 nodes of the primary DC. Is there any restriction in setting up AlwaysOn for multiple instances on a cluster?
I have had a serious issue with a production AlwaysOn cluster whereby the service did not successfully transition to the secondary node and I cannot find the root cause of the issue.
Some details: it is a 2-node cluster (same datacenter) with a shared disk quorum, Windows Server 2012, and both are virtual machines running on VMware vSphere 5.5. The SQL Server version is 2012 Enterprise SP2 CU6.
The failover occurred because of a network incident (a spanning tree recalculation caused a connection timeout between both nodes). Initial entries in the SQL Log look normal for this event, for example:
05/08/2015 11:18:06: A connection timeout has occurred on a previously established connection to availability replica 'FIN-IE-PA078' with id [6910F4A9-87E7-4836-BA79-0F41BE90266D]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.
05/08/2015 11:18:06: AlwaysOn Availability Groups connection with secondary database terminated for primary database 'UserManagement' on the availability replica with Replica ID: {6910f4a9-87e7-4836-ba79-0f41be90266d}. This is an informational message only. No user action is required.
[code]....
My interpretation of this is that the cluster failover attempts failed, because the network condition still persisted. The network interruption lasted approximately 2 minutes, and I would have expected the cluster to come back online at this point, after the restart delay period as suggested in the last entry in the error log. However this did not happen.
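For post-incident checks like this, here is a hedged sketch of the kind of query that can be run on a surviving replica to see how SQL Server views the replica and database states; it uses only the standard AlwaysOn DMVs and nothing specific to this cluster:
[code]
-- Replica roles, connection state and database synchronization health per AG.
SELECT  ag.name AS ag_name,
        ar.replica_server_name,
        ars.role_desc,
        ars.operational_state_desc,
        ars.connected_state_desc,
        drs.synchronization_state_desc,
        drs.synchronization_health_desc
FROM    sys.availability_groups                  AS ag
JOIN    sys.availability_replicas                AS ar  ON ar.group_id = ag.group_id
JOIN    sys.dm_hadr_availability_replica_states  AS ars ON ars.replica_id = ar.replica_id
LEFT JOIN sys.dm_hadr_database_replica_states    AS drs ON drs.replica_id = ar.replica_id;
[/code]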
We have to build a high availability SQL 2012 cluster for VDI, and we have two options. One option is to build a server cluster with a combination of failover clustering and mirroring, and the other option is to build a failover cluster with AlwaysOn. We are not sure which option to choose. We have contacted Microsoft support to provide us some documents and instructions for the failover/mirroring combination, but they have sent us instructions for the AlwaysOn option.
What would be the best way to build a high availability cluster for VDI, given that the first option is very complicated?
In the case of an unrecoverable hardware issue, I have two MSDN articles which state different things.
The first one claims you remove the node from MSCS.
[URL]
The second one claims you should remove it using SQL Server Setup, and it links to the first article, which says you should do it from MSCS:
[URL]
Then this third article invalidates the second article. "To remove a node from an existing SQL Server failover cluster, you must run SQL Server Setup on the node that is to be removed from the SQL Server failover cluster instance."
[URL]
It is a hardware failure where the secondary node is inaccessible.
So what is the proper way to evict a node you cannot access due to a hardware failure?
Note: I don't plan on adding the failed node back after removing it, i.e. I am only interested in the removal part.
We are running a 2-node Windows cluster with three SQL instances on it.
OS: Windows Server 2008 R2 SP1; SQL: SQL Server 2008 R2 (10.50.6529)
Currently both nodes have 256 GB of memory, and we have multiple resources set up for automatic failover. What will be the best practice for OS memory reservation (OS + tools) so that we can set the SQL max memory settings accordingly?
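As a rough illustration only: with three instances that could all land on one 256 GB node after failover, one common approach is to reserve memory for the OS and tools (the 16 GB below is an assumed figure, not a rule) and split the remainder across the instances, then cap each instance:
[code]
-- Example values only: ~16 GB assumed for OS/tools, the remaining ~240 GB split
-- evenly across the three instances (80 GB = 81920 MB each). Run per instance.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 81920;
RECONFIGURE;
[/code]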
We have two locations in the US, and I am thinking of having a 2-node SQL cluster for Lync 2010. I already have one DB server running in one location; now we have a new site where we are planning to have one more DB server for redundancy.
I have a 3-node AlwaysOn cluster (Windows Server 2008 R2 SP1 + SQL Server 2012 RTM) with Node Majority quorum; the quorum vote for each node is 1.
Today the AlwaysOn AG suddenly went down because the cluster service on node 1 stopped and can't be started.
The errors in the event log are:
The cluster database could not be loaded. The file may be missing or corrupt. Automatic repair might be attempted. The Cluster Service service terminated unexpectedly. It has done this 2 time(s). The following corrective action will be taken in 120000 milliseconds: Restart the service.
The failover cluster database could not be unloaded. If restarting the cluster service does not fix the problem, please restart the machine.
The Cluster Service service terminated with service-specific error The system cannot find the file specified..
The errors in the cluster log are:
0000156c.000008f8::2012/09/05-08:09:36.057 INFO [DM] Key \Registry\Machine\Cluster.restored does not appear to be loaded (status STATUS_OBJECT_NAME_NOT_FOUND(c0000034))
0000156c.000008f8::2012/09/05-08:09:36.057 WARN [DM] Node 1: Failed to unload restored hive from the registry with error STATUS_INVALID_PARAMETER(c000000d)
0000156c.000008f8::2012/09/05-08:09:36.057 INFO [DM] Node 1: loading local hive
0000156c.000008f8::2012/09/05-08:09:36.057 ERR [DM] Node 1: failed to unload cluster hive, error 2.
Now the cluster service can't be started on node 1, error code 2. It looks like the CLUSDB file in C:\Windows\Cluster is missing or corrupted. How do I restore the CLUSDB file? And how do I prevent this from happening again?
All nodes were well patched; the AlwaysOn and cluster related hotfixes were all installed. [URL] .... doesn't work.
I'm getting an error adding a replica to a SQL AlwaysOn failover cluster in the New Availability Group wizard. When I enter the name of the target node (secondary replica) server and press Connect, I get the following:
A network-related or instance-specific error occurred while establishing a connection to SQL Server.
The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) (Microsoft SQL Server, Error: 2) The system cannot find the file specified
The SQL Browser service is up and running on the target. I am using an Azure VM for my SQL instance. This cluster spans geographies from our on-premise site to Azure via a VPN. This is a multi-subnet cluster. I'm attempting to create a new AG from the primary replica node and the target is a node on Azure called SSASNodeAz03.
[URL]
Full error:
Connect to Server Cannot connect to ssasnodeaz03
Additional information: A network-related or instance-specific error occurred while establishing a connection to SQL Server.
The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) (Microsoft SQL Server, Error: 2) The system cannot find the file specified
We are planning to change all IPs of a PRODUCTION failover cluster setup. In my cluster setup we have 2 physical nodes running Windows 2008; the roles are MSDTC and SQL 2008 R2.
IP change for:
1. Both nodes (physical)
2. MSDTC
3. SQL Server
4. Windows cluster
So almost all IPs are going to change.
I'm the DBA here, and I need to take care of the SQL cluster and MSDTC, but I haven't performed this activity before. So I'm worried about the impacts and consequences of this change, and about the steps: how should I perform this activity?
Server: Windows Server 2008; DB Server: SQL Server 2008 (SP1)
Here are the series of events which happened.
1.) Event ID: 1135 Cluster node 'XYZ' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
2.) Event ID: 1049 Cluster IP address resource 'SQL IP Address 1 (XYZ)' cannot be brought online because a duplicate IP address '10.9.8.113' was detected on the network. Please ensure all IP addresses are unique.
3.) Event ID: 1069 Cluster resource 'SQL IP Address 1 (XYZ)' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.
4.) Event ID: 1049 Cluster IP address resource 'Cluster IP Address' cannot be brought online because a duplicate IP address '10.9.8.112' was detected on the network. Please ensure all IP addresses are unique.
5.) Event ID: 1069 Cluster resource 'Cluster IP Address' in clustered service or application 'Cluster Group' failed.
6.) Event ID: 1066 Cluster disk resource 'Cluster Disk 25' indicates corruption for volume '\\?\Volume{88552e6f-aea2-11df-9790-0026b92fffa7}'. Chkdsk is being run to repair problems. The disk will be unavailable until Chkdsk completes. Chkdsk output will be logged to file 'C:\Windows\Cluster\Reports\ChkDsk_ResCluster Disk 25_Disk16Part1.log'. Chkdsk may also write information to the Application Event Log.
7.) Event ID: 1066 Cluster disk resource 'Cluster Disk 26' indicates corruption for volume '\\?\Volume{88552e05-aea2-11df-9790-0026b92fffa7}'. Chkdsk is being run to repair problems. The disk will be unavailable until Chkdsk completes. Chkdsk output will be logged to file 'C:\Windows\Cluster\Reports\ChkDsk_ResCluster Disk 26_Disk4Part1.log'. Chkdsk may also write information to the Application Event Log.
8.) Event ID: 1049 (Same message as point 2)
9.) Event ID: 1069 (Same message as point 3)
10.) Event ID : 1049 (same message as point 4)
11.) Event ID :1069 (same message as point 5)
12.) Event ID :1205 The Cluster service failed to bring clustered service or application 'Cluster Group' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.
13.) Event ID: 1069 Cluster resource 'Cluster Disk 17' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.
14.) Event ID: 1049 (same message as point 2)
15.) Event ID: 1069 Cluster resource 'SQL IP Address 1 (XYZ)' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.
16.) Event ID : 1205 The Cluster service failed to bring clustered service or application 'SQL Server (MSSQLSERVER)' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.
First of all, I went through all the logs and could not find the reason the failover was initiated. There should be something logged explaining why the failover happened. Secondly, after the failover the service was not coming online due to the duplicate IP address detection.
Later, when we tried to manually bring the service online from Cluster Management, it came online successfully. I don't understand how the duplicate IP address would get resolved when we start it manually.
Lastly, we see a few errors related to physical disk resources between the failover retries; could these be correlated to the failover error?
I came across this scenario in an AlwaysOn Availability Group (two nodes): the file share witness times out, RHS terminates, and this causes the cluster node to reboot. The file share witness is there to support failover, and if the resource is unavailable my expectation was that it should simply go offline and not impact the server or SQL Server. But instead it is rebooting the cluster node to rectify the issue.
Configuration: Windows Server 2012 R2 (VMs), SQL Server 2012 SP2. Errors:
A component on the server did not respond in a timely fashion. This caused the cluster resource 'File Share Witness' (resource type 'File Share Witness', DLL 'clusres2.dll') to exceed its time-out threshold. As part of cluster health detection, recovery actions will be taken. The cluster will try to automatically recover by terminating and restarting the Resource Hosting Subsystem (RHS) process that is running this resource. Verify that the underlying infrastructure (such as storage, networking, or services) that are associated with the resource are functioning correctly.
The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
I am getting issues when creating a listener for AlwaysOn. The errors are shown below:
Cannot bring the Windows Server Failover Cluster (WSFC) resources online (Error code 5942). The WSFC service may not be running or may not be accessible in its current state, or the WSFC resources may not be in a state that could accept the request.
For information about this error code, see "System Error Codes" in the Windows development documentation.
The attempt to create the network name and IP address for the listener failed. The WSFC service may not be running or may not be accessible in its current state, or the values provided for the network name and IP address may be incorrect. Check the state of the WSFC cluster and validate the network name and IP address with the network administrator. (Microsoft SQL Server, Error: 41066) ...
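For reference, error 41066 is often tied to the cluster name object lacking rights to create the listener's computer object in Active Directory, or to the WSFC core resources being in a bad state. Once that is resolved, the listener can also be created with T-SQL; a sketch with placeholder names and addresses:
[code]
-- Placeholder AG name, DNS name, IP and subnet mask; run on the primary replica.
ALTER AVAILABILITY GROUP [MyAG]
    ADD LISTENER N'MyAgListener' (
        WITH IP ((N'10.0.0.50', N'255.255.255.0')),
        PORT = 1433
    );
[/code]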
I ran into a problem while trying to install SQL Server 2008 to join another node to a failover cluster; you will find the configuration doc and installation details here. URL...
I know now that the AlwaysOn feature HAS to be installed/configured in a Windows clustering environment, BUT do the secondary replicas, like a disaster recovery replica residing in a different data center, also HAVE to be in a Windows clustering environment, or can one reside on a SINGLE SQL Server INSTANCE?
1. In an AlwaysOn failover cluster, once failover to the secondary replica occurs, what will happen to the sessions connected to the primary node? Can the sessions fail over to the secondary seamlessly, or do they need to re-login? What happens to committed transactions which have not yet been written to disk?
2. Assume I have an AlwaysOn cluster with three nodes; if the primary fails, how does the second node switch to read/write mode?
3. After failover to the second (secondary) node is done, what mode is production in (read-only or read-write)?
4. How do we fail back to the production primary? Will the data changed on the secondary get updated on the primary? (See the sketch below.)
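A hedged illustration of what the failover and failback flow looks like in T-SQL, under the assumption of a synchronous-commit secondary in a SYNCHRONIZED state; the AG and database names are placeholders:
[code]
-- Run on the replica that is to become the new primary. With a synchronized,
-- synchronous-commit secondary this is a planned, no-data-loss failover; the new
-- primary becomes read-write and the old primary becomes a secondary.
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;

-- Failing back to the original primary is the same command, run on it, once it has
-- rejoined and resynchronized. If data movement was suspended during the outage,
-- it may need to be resumed per database:
-- ALTER DATABASE [MyDatabase] SET HADR RESUME;
[/code]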
But I'm not sure if I have to install SQL Server on node 2 first and then add it to the cluster, or whether adding it to the cluster also installs the software.
I'm contemplating running two availability groups on a two node WSFC. The WSFC is setup with a file share witness (i.e. no shared storage). Can I safely run 1 AG on one primary node, and the other AG on the other node (as primary). Each AG would have replicas on the passive node. This would effectively allow both servers to be in use at the same time. In a failover event, I understand that both workloads would transfer to a single server - so the box needs to be sized appropriately.
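For what it's worth, a minimal sketch of that crossed layout, assuming hypothetical server names NODE1/NODE2 and a placeholder database; each AG is created on the node that should start as its primary:
[code]
-- Created on NODE1, which therefore starts as the primary of AG1.
CREATE AVAILABILITY GROUP [AG1]
FOR DATABASE [SalesDb]
REPLICA ON
    N'NODE1' WITH (ENDPOINT_URL = N'TCP://node1.contoso.local:5022',
                   AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                   FAILOVER_MODE = AUTOMATIC),
    N'NODE2' WITH (ENDPOINT_URL = N'TCP://node2.contoso.local:5022',
                   AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                   FAILOVER_MODE = AUTOMATIC);
-- [AG2] (for its own databases) is created the same way from NODE2, so each node is
-- primary for one AG and a secondary for the other.
[/code]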
We are in the process of building a 3 node SQL Server Cluster (Server 2012/ SQL Server 2012), and we have configured the quorum so that all 3 nodes have a vote (no file share witness as we already have an odd number of nodes).
As I understand it, this should allow the cluster to run as long as 2 of the nodes remain online.
However, the validation report states that 2 node failures would be acceptable and, when we tested this by powering off two of the nodes, the cluster did indeed continue to run on a single node.
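One way to sanity-check what the cluster members and their votes look like from the SQL side, assuming the SQL Server 2012 HADR cluster DMVs return rows on this instance (they may not in every configuration):
[code]
-- Lists WSFC members and their quorum votes as SQL Server sees them.
SELECT member_name,
       member_type_desc,
       member_state_desc,
       number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;
[/code]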
I have configured a Windows 2003 R2 and SQL 2005 two-node cluster. When I move the cluster resource from one node to another node, it takes around 30 seconds to come online, so if any query is running during that time it stops responding.
During the installation of 'Add node to a SQL Server failover cluster' (on the passive node) we are getting an error like: 'The MOF compiler could not connect with the WMI server. This is either because of a semantic error such as an incompatibility with the existing WMI repository or an actual error such as the failure of the WMI server to start.' We ran the commands below but didn't get any resolution and got the same error.
1st Method…
1. Open console command (Run->CMD with administrator privileges).
2. net stop winmgmt
3. Rename the folder %windir%\System32\Wbem\Repository to something else for backup purposes (for example, _Repository).
I invoke the xp_cmdshell proc from inside a stored procedure on a 2-node active/passive SQL 2005 SP2 Standard cluster. Depending on which server xp_cmdshell gets executed on, I need to pass different arguments in the shell command. I thought I could use the host_name() function to get the runtime process server; however, I am finding that it's not behaving correctly. In one example I know my active node is server2, but the host_name() function is returning server1. The only way I could possibly explain this is that the MSDTC cluster group is not always on the same active node as the SQL Server group, and in the case I am talking about, the cluster groups are in this mode (different nodes). Does xp_cmdshell get executed by the SQL active node or the MSDTC active node? And what is the best way to find out which server is going to run my xp_cmdshell?
Thanks.
Edit:
Perhaps another by-product of this is that if I run select host_name() from the Management Studio query window, I get different results depending on which server I am running Management Studio on. On server1 I get server1 and on server2 I get server2, all the while server2 is the active node. I need a different function that will always let me determine the correct server that'll be running the xp_cmdshell...
Edit 2: I guess I could determine the running host inside the command shell itself, but I am curious to see if I can do it (more cleanly) from SQL.
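One option worth noting: HOST_NAME() reports the client workstation that issued the query, which would explain the differing results from Management Studio, whereas SERVERPROPERTY('ComputerNamePhysicalNetBIOS') (available since SQL Server 2005) returns the physical node currently hosting the clustered instance, which is also the node xp_cmdshell runs on. A sketch with placeholder server names and commands:
[code]
-- Returns the NetBIOS name of the node the clustered instance is running on.
SELECT SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS active_node;

-- Using it to pick the xp_cmdshell argument (SERVER1/SERVER2 and the commands
-- are placeholders for illustration only):
DECLARE @node sysname;
SET @node = CONVERT(sysname, SERVERPROPERTY('ComputerNamePhysicalNetBIOS'));
IF @node = N'SERVER1'
    EXEC master.dbo.xp_cmdshell 'echo running on server1';
ELSE
    EXEC master.dbo.xp_cmdshell 'echo running on server2';
[/code]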
I have set up a 2-node availability group to take advantage of using the secondary node as a read-only replica. I am actually having two issues. The first is that I can connect to the primary node using the listener DNS name and IP address, but can no longer connect via its actual host name or IP address. I can ping the address with no problem, but I can't connect to port 1433 using the actual host name or IP address.
I have no problem connecting to the secondary node using its host name, but I cannot get to it through the listener using ApplicationIntent=ReadOnly. Eventually I would like everything to connect through the listener name, but for now I still need to connect via the server's host name and don't understand why it doesn't work; everything I read says the primary node should be reachable via both the host name and the listener name.
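For the ApplicationIntent=ReadOnly part specifically: read-intent redirection through the listener only works once read-only routing is configured, which is easy to miss. A sketch with placeholder AG, server and port values, assuming the secondaries are set to allow read-only connections (run on the primary):
[code]
-- Routing URLs: where read-intent connections are sent when each server is a secondary.
ALTER AVAILABILITY GROUP [MyAG] MODIFY REPLICA ON N'NODE1'
    WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY,
                          READ_ONLY_ROUTING_URL = N'TCP://node1.contoso.local:1433'));
ALTER AVAILABILITY GROUP [MyAG] MODIFY REPLICA ON N'NODE2'
    WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY,
                          READ_ONLY_ROUTING_URL = N'TCP://node2.contoso.local:1433'));

-- Routing lists: where each server, when it is primary, sends read-intent connections.
ALTER AVAILABILITY GROUP [MyAG] MODIFY REPLICA ON N'NODE1'
    WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (N'NODE2', N'NODE1')));
ALTER AVAILABILITY GROUP [MyAG] MODIFY REPLICA ON N'NODE2'
    WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (N'NODE1', N'NODE2')));
[/code]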
I have a 3-node 2014 AlwaysOn setup. The primary and secondary are set for automatic failover. The third node, of course, is manual (until 2016). The 2 nodes that are automatic are sitting in one datacenter; the third is in another. If the first datacenter were to go down, would I manually have to fail over to the third node? What's the normal process here for having two datacenters and ensuring the availability group is always available?
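Failover to the manual (third) replica would indeed be a manual step in that scenario, and if the whole first datacenter (and with it cluster quorum) is lost, it typically involves forcing quorum at the WSFC level and then forcing the AG online on the DR replica, accepting possible data loss on an asynchronous secondary. A hedged sketch with a placeholder AG name:
[code]
-- Run on the DR replica after WSFC quorum has been re-established (e.g. forced) there.
-- Forced failover may lose transactions not yet received by this replica.
ALTER AVAILABILITY GROUP [MyAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;

-- When the original datacenter returns, its databases usually need data movement resumed:
-- ALTER DATABASE [MyDatabase] SET HADR RESUME;
[/code]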