Scripting Failovers for SQL Server Maintenance

As much as we’d all like to set up our SQL instances and never again let anyone touch them, that’s just not possible in real life. We live in a world where entities, for good or for evil, are constantly exposing vulnerabilities both in the SQL software and Windows itself. For that reason it is the responsibility of good DBA’s to keep their systems updated, and that means failovers.

It is typical in many environments to have a separation of duties, often keeping DBA’s from having control over the Windows Failover Cluster itself. In my particular case I actually have access to the cluster. However, our OCC (Operations Command Center) is tasked with carrying out Windows Updates on a regular basis. Letting them fail instances over means the servers get updated regularly (and also that I have to get up in the middle of the night for one less thing). I needed a way to allow their team members, regardless of experience level, to carry out Windows Updates on our cluster. Enter PowerShell scripting.

The code below is a modified version of what our OCC uses to carry out Windows Updates on our Failover Cluster. (There are options for Availability Groups too, but that’s a post for another day.) In the simplified example below there are six instances (SQL Instance 1-6) balanced across six nodes (sqlnode1-6). To keep things simple, a few assumptions are made in the script:

1. All instances are named with a particular convention. In this case that is “SQL Instance #”.
2. The instance numbers have their permanent homes on nodes of the same number (ie SQL Instance 1 belongs on sqlnode1).
3. The script is run on one of the nodes of the cluster.
4. The script is run in PowerShell ISE section by section, not all at once.

With that in mind I present the code to you below. It is written in a manner such that a Systems Administrator would be reading it as an instruction sheet and carrying out the tasks on your behalf. I am sure there are more elegant and efficient ways to accomplish this task, but this has been successful for us and I hope it can help you in some way. If you have any questions please feel free to comment or contact me on Twitter. Enjoy!

#### 1. Import Cluster Module ####
Import-Module FailoverClusters

#### 2. Fail all instances over to nodes 04-06. ####
Move-ClusterGroup "SQL Instance 1" -Node sqlnode4
Move-ClusterGroup "SQL Instance 2" -Node sqlnode5
Move-ClusterGroup "SQL Instance 3" -Node sqlnode6
Move-ClusterGroup "SQL Instance 4" -Node sqlnode4
Move-ClusterGroup "SQL Instance 5" -Node sqlnode5
Move-ClusterGroup "SQL Instance 6" -Node sqlnode6


#### 3. Remove nodes 1-3 as possible owners and add 4-6.. ####
$ClusterGroups = Get-ClusterGroup

foreach ($Service in $ClusterGroups) 
    {  
      
      If ($Service -like "SQL Instance*")
      {
        Set-ClusterOwnerNode -Group  $Service.Name -Owners sqlnode4,sqlnode5,sqlnode6
        Get-ClusterOwnerNode -Group $Service.Name
      }  
      
    }

#### 4. Install updates on nodes 1-3. ####

#### 5. Remove nodes 4-6 as possible owners and add 1-3. ####
$ClusterGroups = Get-ClusterGroup

foreach ($Service in $ClusterGroups) 
    {  
      
      If ($Service -like "SQL Instance*")
      {
        Set-ClusterOwnerNode -Group  $Service.Name -Owners sqlnode1,sqlnode2,sqlnode3
        Get-ClusterOwnerNode -Group $Service.Name
      }  
      
    }

#### 6. Fail all instances over to nodes 1-3. ####
Move-ClusterGroup "SQL Instance 1" -Node sqlnode1
Move-ClusterGroup "SQL Instance 2" -Node sqlnode2
Move-ClusterGroup "SQL Instance 3" -Node sqlnode3
Move-ClusterGroup "SQL Instance 4" -Node sqlnode1
Move-ClusterGroup "SQL Instance 5" -Node sqlnode2
Move-ClusterGroup "SQL Instance 6" -Node sqlnode3


#### 7. Install updates on 04-06. ####

#### 8. Add all nodes as possible owners. ####
#### Note: This list is mostly random. It only has two goals.
#### Goal 1 is to set the permanent home at the first of the list. ####
#### Goal 2 is to make sure each of the cluster nodes is listed and that they are somewhat randomized. ####
Set-ClusterOwnerNode -Group "SQL Instance 1"  -Owners sqlnode1,sqlnode5,sqlnode2,sqlnode4,sqlnode3,sqlnode6
Set-ClusterOwnerNode -Group "SQL Instance 2"  -Owners sqlnode2,sqlnode4,sqlnode5,sqlnode3,sqlnode1,sqlnode6
Set-ClusterOwnerNode -Group "SQL Instance 3"  -Owners sqlnode3,sqlnode6,sqlnode1,sqlnode2,sqlnode5,sqlnode4
Set-ClusterOwnerNode -Group "SQL Instance 4"  -Owners sqlnode4,sqlnode3,sqlnode6,sqlnode5,sqlnode1,sqlnode2
Set-ClusterOwnerNode -Group "SQL Instance 5"  -Owners sqlnode5,sqlnode2,sqlnode4,sqlnode1,sqlnode3,sqlnode6
Set-ClusterOwnerNode -Group "SQL Instance 6"  -Owners sqlnode6,sqlnode1,sqlnode3,sqlnode2,sqlnode4,sqlnode5

#### 9. Fail displaced instances back over to their permanent nodes. ####


Move-ClusterGroup "SQL Instance 4" -Node sqlnode4
Move-ClusterGroup "SQL Instance 5" -Node sqlnode5
Move-ClusterGroup "SQL Instance 6" -Node sqlnode6

Is Optimizing for AdHoc Queries Worth It?

As you probably know, there is a server configuration option called “optimize for ad hoc workloads”. For more details on that option check out the MSDN article. The real question is, how do I know the extent to which my environment uses AdHoc queries? Well, as usual there is a DMV for that. Use the query below to get an idea of how many plan caches are being stored but only used once. The number changes often, so it’s a good idea to baseline your typical number. For instance, I’ve created a SQL Agent job to record the number to a table every ten minutes.  If you maintain high numbers then it might be a good idea to optimize for ad hoc workloads. As always, however, I would advise testing thoroughly first. Happy hunting.

USE Diagnostics

INSERT INTO Diagnostics.dbo.AdHocTracking
	SELECT COUNT (*) FROM sys.dm_exec_cached_plans
	WHERE objtype = 'Adhoc'
	AND usecounts = '1'

Credit to the Red Gate blog for my initial background information.

Managing CDC Jobs on an Availability Group

So, here’s the deal. You’ve got Change Data Capture going in your environment and want to move your database to an Availability Group on SQL Server 2012. That’s great, lots of advantages there. But what happens to CDC when your database fails over to a different node? Since the capture process runs through SQL Agent jobs you will no longer be getting the CDC information post-failover. This is the scenario I ran into recently. In order to mitigate the problem I’ve created a simple script to handle the enabling and disabling of jobs on both replicas. Of course, the jobs will need to be created on each replica before it will work. But this is an easy way of making sure you can take advantage of the benefits of AG’s while still keeping up your CDC captures. I have this running every five minutes on each node. Feel free to adapt/improve/delete it to your heart’s content. Note that I have hard-coded the replica_id. I know, not the most elegant method. But good enough for a first version.

****Update****

Through painful experience I’ve discovered that the replica id can change if you remove the replica and add it back in. A slightly better method of checking utilizes sys.dm_hadr_availability_group_states. In this case we’re checking against the instance name, which will always remain the same.

IF (select primary_replica from sys.dm_hadr_availability_group_states) = ‘YourReplicaName’

****Update 2****

For SQL Server 2014 and above you are not allowed to even run the USE statement against a database that is on a read-only replica (ie your secondary replica used for reporting). For that reason I’ve altered it to use sp_start_job instead. The updated version below also simplifies other syntax a bit.

--Check to see if this instance is the Primary Replica.
IF (select primary_replica from sys.dm_hadr_availability_group_states) = 'YourReplicaName'
	BEGIN
		--If this is the Primary and the capture job is disabled, enable and start it.
		IF (select enabled from MSDB.dbo.sysjobs where name = 'cdc.YourJobName_capture') = '0'
			BEGIN
				EXEC msdb.dbo.sp_update_job @job_name = N'cdc.YourJobName_capture', @enabled = 1;
				EXEC msdb.dbo.sp_start_job N'cdc.YourJobName_capture' ;  
			END
		--If this is the Primary and the cleanup job is disabled, enable it.
		IF (select enabled from MSDB.dbo.sysjobs where name = 'cdc.YourJobName_cleanup') = '0'
			BEGIN
				EXEC msdb.dbo.sp_update_job @job_name = N'cdc.YourJobName_cleanup', @enabled = 1;
			END
	END
--Actions to take if this instance is not the Primary Replica.
ELSE
	BEGIN
		--If this is not the Primary and the capture job is enabled, disable it.
		IF (select enabled from MSDB.dbo.sysjobs where name = 'cdc.YourJobName_capture') = '1'
			BEGIN
				EXEC msdb.dbo.sp_update_job @job_name = N'cdc.YourJobName_capture', @enabled = 0;
			END
		--If this is not the Primary and the capture job is enabled, disable it.
		IF (select enabled from MSDB.dbo.sysjobs where name = 'cdc.YourJobName_cleanup') = '1'
			BEGIN
				EXEC msdb.dbo.sp_update_job @job_name = N'cdc.YourJobName_cleanup', @enabled = 0;
			END
	END

****Update 3****

Check out this post to learn about firing off jobs based on SQL alerts. This would be a better approach than checking on a timed interval.