In my previous post, I outlined the preparations we undertook to migrate a large SQL Server 2008R2 instance to SQL Server 2016. This post details migration day.Continue reading “A Monumental Migration to SQL Server 2016 – Part 2”
A bit over a year ago, I blogged about my experience migrating a test SQL Server instance from a VM to a physical machine with a little help from my friends. That migration went well and the instance has been running trouble-free ever since. But it’s small potatoes. A modest instance, it’s only about 5% the size of production. With SQL Server 2008R2’s EOL looming, it was time to migrate production to SQL Server 2016.Continue reading “A Monumental Migration to SQL Server 2016 – Part 1”
A few weeks ago, I teased good news.
just got some good news. can't wait to share it— Andy Levy (@ALevyInROC) July 20, 2018
One person hypothesized that I’m joining Microsoft (it seems to be the thing to do lately) and another jumped to the conclusion that I must be pregnant. Both creative responses, but not quite correct.
I’ll be at PASS Summit 2018!
So much to do!
- Pick some sessions
- Make my checklist of #sqlfamily I need to see
- Find a way to pack lighter (I think the iPad will stay home this time)
- Up my selfie game
- Get back into shape for #sqlrun
- Print up some more dbatools ribbons
- Figure out the social media photo situation (see above, “Up my selfie game”)
If you’re attending Summit, let’s meet up! I’ll be on Twitter, Slack, and Instagram (@alevyinroc across the board) all week and roaming the convention center & various evening events so ping me there to find out where I am.
Because Summit starts on Election Day here in the USA, be sure to either get to your local polling place that morning or follow your state’s process to request & submit an absentee ballot. Every election is more important than the one before it (that’s the most political I’ll get on this blog, I swear).
Triggers can be really useful in your database but you have to be careful with them. We often use them to record history to a separate table (at least, until we can implement temporal tables) or keep track of when a record was last updated. It’s usually implemented via an
AFTER UPDATE trigger.
Several times a day, I’m alerted by our monitoring suite that two processes were deadlocked. This is a (mostly) normal event, but one that we’d like to avoid if we can. More than a few times, I observed this pattern:
- Insert into table
- Update a field in the row that was just inserted.
In my case, the
LastModified update is implemented as an
AFTER INSERT, UPDATE trigger. Which means that as I’m inserting the record into the table I’m turning around and immediately updating it. We can see this in both the execution plan and the IO stats provided by
SET STATISTICS IO ON.
Here’s the setup. Just a pair of basic tables, one looks like like the other except one uses a default constraint and the other an
AFTER INSERT trigger. We’ll set up the tables and then insert one row into each.
Aside: Yes, I realize that I didn’t name my default constraints. In real life, I totally would and do.
And here’s the data on each.
Great, the data matches what we expect! But how much work did SQL Server do? Let’s check out the execution plan and IO stats. First for the trigger:
SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 0 ms. Table 'triggertest'. Scan count 0, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server parse and compile time: CPU time = 13 ms, elapsed time = 13 ms. Table 'triggertest'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 7 ms. SQL Server Execution Times: CPU time = 15 ms, elapsed time = 104 ms.
And now with the default constraint:
SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 0 ms. Table 'defaulttest'. Scan count 0, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 6 ms.
We want to spend our SQL Server licensing dollars wisely, so why ask it to do unnecessary work? More CPU time, more IO, and our client is waiting longer for the process to complete (I don’t like making anyone wait).
There’s a second “gotcha” with the
AFTER INSERT method that applies to only some use cases. Let’s say you’re loading some historical data into the table, and that data includes the
LastModified date. Or maybe your application has a very specific date that should be inserted into the table for this field.
And here’s what the tables look like after the inserts.
AFTER INSERT trigger just clobbered real data!
To alleviate this, I’m migrating these tables to use a default constraint of
GETDATE() on the
LastModified fields, and changing the trigger to
AFTER UPDATE (in the cases where I do need to write to a history table upon insert, I’m removing the
LastModified portion from the
AFTER INSERT trigger).
The result is the same –
LastModified is set to the insertion date when first creating the record, and updated when updating the record. But the inserts are less work for SQL Server, and I won’t clobber my data.
This week I had a user come to me asking about how fields were defined on a few tables he was using in writing some reports. Long story short, he’s been tasked with writing some new reports and updating existing ones, but he doesn’t have visibility to the database itself so he’s left with the “ok, let’s try this” approach and then reading error messages to debug when things go sideways. Very tedious.
I asked him for a list of the tables he was most interested in, and while he worked on that I set to work on (of course) a quick & dirty PowerShell script to collect the data he needed – field names, types, and whether they’re nullable.
With the help of
dbatools and Doug Finke’s
ImportExcel module, the resulting script to retrieve the requisite data and export it to an Excel workbook with one sheet per table is very short. Each table is on its own worksheet, with the header row bold and frozen so it stays in view while scrolling.
Line by line:
- Fetch the list of tables from a file provided to me
- Gather a collection of SMO objects for the tables from the server
- Loop over each table and extracting the relevant data about each
- Write the table’s information out to a worksheet named for the table in an Excel workbook
The only tricky part was fetching the field lengths for the
nvarchar fields. I had to go spelunking into each field’s
Properties property to get that. The result:
But I can tell this isn’t disco enough for you, so I’m gonna put it right here.
– Justin Hammer, Iron Man 2
You can do this in a one-liner too. It’s not as pretty, but it’s quite doable and the resulting Excel file is identical.
End to end, I turned this around in about 20 minutes and the the requestor’s life got easier in a big hurry. Best of all, when he comes back with another batch of tables for his next set of reports, I can turn the request around in less than five minutes.
I’ve been a proponent of dbatools for close to a year now and even contributed to the project, but surprisingly haven’t been a heavy user of it. Mostly due to a lack of opportunity. I’m aware of many of the functions by virtue of working on the built-in documentation and following the project and presentations about it.
So when the need arose to move a development/test instance of SQL Server from a VM onto a physical server, I knew exactly what I wanted to do. I was warned that the contents of this instance had been moved once before and it resulted in over a week of work and a bunch of trouble. I can’t speculate on why this was as I wasn’t there to see it, but I wasn’t going to let that happen on my watch. So, with equal parts hubris and stubbornness (and a dash of naïveté), I dove in. We have the technology. We will migrate this thing.
The advertising for
Start-DbaMigration makes it look so easy. Source, destination, your method of moving the data, and you’re done. Right? Well, sure – in a small, controlled sandbox. This one was neither. About 150 databases. Two dozen Agent jobs. User account cleanup. Different drive letters and sizes. And when it was all over, the server name, instance name, and IP of the new box had to match the old one so that we didn’t disrupt production or the developers.
Of course we’re going to rehearse this. But with the destination being a physical machine, I didn’t have the luxury of rolling back a snapshot each time, or restarting from a golden image. And I couldn’t do everything because it wasn’t an isolated environment – I couldn’t test all the Agent jobs (don’t want emails going out in error) and couldn’t reconfigure the IP or server name. Which meant that my script had to clean up any artifacts from previous runs before doing the migration. Each time.
I also wanted to bring the new instance up in a controlled fashion as opposed to just moving everything and letting it go, so that I could check things out before letting them break. I also had to work in checkpoints so the network/server admin could do his pieces. Which meant that after the migration, everything on the old server had to be stopped, and Agent jobs on the new one disabled (but with a record of what was enabled/disabled on the source, so I could replicate it).
I rehearsed what I could about a half-dozen times. Each time through took about 4 hours (having multiple tests helps build confidence in your elapsed time estimates), primarily because of the amount of data that had to be moved (about 700GB). Each time, I found another tweak needed. Maybe not entirely necessary, but I was out to prove something. I didn’t want this migration to be “good enough, a little rough around the edges” – this had to work right, right away.
This is truly standing on the shoulders of giants. Without the thousands of person-hours put in by Chrissy and the rest of the team, a short script like this to do a mountain of work simply is not possible. It’s not just having the huge amount of code to build on – it’s the suite of tests they run with every pull request that tells me that I can trust it’ll work right.
Looking back on it, there’s definitely a few things I’d change in this script, and more dbatools functions I could have used. But after successfully testing a couple times, I didn’t want to break what was working.
When the migration was complete, I did a brief checkout and then gave my server admin the green light. He flipped names & IPs around, and then I ran
Repair-DbaServerName which I had just discovered a few days earlier. I was expecting to do it manually but I trust the dbatools crew and their test suite more than myself on this one as I’ve never done this before. When that was complete, I had a grand total of three issues (that I could find):
- Database owners weren’t set appropriately. I was able to resolve this via
- Outgoing dbmail didn’t work. Turns out the SMTP relay on the new server wasn’t started. Easy fix.
- I had a Linked Server on my production instance which was unable to communicate to the new test server. This took me the longest to figure out. We checked everything – SQL Server Configuration Manager, the network itself, and then finally my colleague suggested testing something outside SQL Server – mapping a drive from production to test. This last test succeeded, which pointed us at the SQL Server connection specifically. The root cause: I had two firewall rules on the new server that blocked connections from all but servers on the local subnet. The production server isn’t on the local subnet.
None of these are total showstoppers. I had workarounds (or quick solutions) for them and as this is a test instance we could live with minor inconvenience for a day or two. One or two final tests, and I was satisfied that everything was working properly so I went ahead and enabled my Agent jobs. Some of them still have incorrect owners but I can fix that later – they were wrong on the source instance too.
I consider this migration a huge success. We had 95% functionality by 9am. By 3pm, the last real problems were resolved (and only that late due to a series of meetings keeping me away from my desk). Most importantly, it was achieved with minimal downtime for the development and QA teams. I’m now one week post-migration and everything is still running smoothly on the new instance.
One of the (many) fun things to do at PASS Summit is to check out the ribbons people have attached to their badges. Some are witty or goofy, others informational, others technical, and still more that let you express how you identify with a community within the community.
To celebrate dbatools and the awesome team & community around it, two limited edition badges will be available from/distributed by me and a handful of other folks all week at Summit. Check ’em out:
Be on the lookout for these badges and talk to us about dbatools! What you like, what you’d like to see changed, new feature ideas, questions about how to use functions, anything at all. Even if you’ve never used dbatools, we love talking about it and showing people the awesome things they can do with it so please, introduce yourself!
For this month, I want you peers to write about those important lessons that you learned the hard way, for instance something you did and put your systems down or maybe something you didn’t do and took your systems down. It can be also a bad decision you or someone else took back in the day and you’re still paying for it…
- In the stress/performance testing portion of an upgrade of a critical system, we were short on disk space. So, rather than having a separate set of VMs for the performance testing (as we needed to be able to get back to functional testing quickly), we decided to just take VM snapshots of all the servers. Testing was delayed a day or two – but we didn’t switch off the snapshots. Then we started testing and performance was terrific…for about five minutes. Then everything came to a screeching halt. Panicked, we thought we were going to need a pile of new hardware until the VMWare admin realized that our disks were getting hammered and we still had those active snapshots.
Lesson learned: If you take VM-level snapshots of your database server and let them “soak” for an extended period, you’re gonna have a bad time. Unless you need to take a snapshot of the host OS or instance configuration itself, use a database snapshot instead of a VM-level snapshot.
A couple of times, I’ve had under-performing VMs running SQL Server. As I hadn’t been involved in the configuration, I thought everything had been provisioned properly. Turns out…not so much. Memory reservations, storage configuration, power profiles, all set up for suboptimal performance.
Lesson learned: Ask your VMWare admin if they’ve perused the best practices guide and review things yourself before going down the rabbit hole of SQL Server configuration & query tuning. If the underlying systems aren’t configured well, you’ll spin your wheels for a long time.
In doing a configuration review of a rather large (production) instance, I noted that at least one configuration option was still set to the default value – Cost Threshold for Parallelism was stuck at 5. Running
sp_BlitzCache, I found that I had quite a few simple queries going parallel and huge
CXPACKETisn’t bad per se, but if you’ve got a low-cost query that’s going parallel and waiting on threads where it could be running faster overall single-threaded (verified this was the case for several of the top offenders), increasing the cost threshold can help. I did some checking, verified that it was a configuration change I could make on the fly, and set the value to
And then everything. Slowed. Down.
When I made this configuration change on the test instance, it wasn’t much a problem. But that was a much smaller instance, with much less traffic. What I failed to fully comprehend was the impact of this operation. I overlooked that changing this setting (and a number of others I wasn’t aware of) blows out the plan cache. In the case of this instance, about 26Gb of plan cache. Not only was performance impacted while the plan cache was re-filled, we took a hit while all the old plans were being evicted from cache.
Lesson learned: Even if it seemed OK in test, those “low impact” changes can have a much larger impact on production unless you can make test mirror production in every way. So plan when you make these changes accordingly.
We learn the most from our mistakes. We can learn almost as much from the mistakes of others. Learn from mine.
A few weeks ago I was looking at a query and got tripped up by the network and my own forgetfulness. It was a pretty simple query with a simple-looking execution plan. It didn’t even do that much work. About 20K logical reads and returned 200K records. For a server as large as the one I was working with, this should have been nothing. Instead, was waiting three minutes to get my results. My first thought was “aha, this query must need tuning, maybe a new index.”
I started experimenting in a non-production copy of the database on the instance. Added a couple indexes, adjusted the query itself, even a few query/index hints just to see what worked. Each change definitely had an impact on the query. The execution plan shape changed a bit, and the costs shifted around between a few operators. But still I was waiting minutes for the complete results.
The Plot Thickens
I tried my query against the development server. Production has 24X as much RAM, 3X as many CPU cores, and faster disks when compared to the dev box, so I expected even worse performance. Nope. Near-instant results, even for the original query with no additional indexes, hints or any other trickery.
OK, something’s definitely up now. I hopped back to production and while running my query, I ran
sp_BlitzFirst to peek at what was going on. Lo and behold, the top wait stat was
ASYNC_NETWORK_IO. I’ve seen lots of this in the past, attributed to poorly-written applications that just couldn’t consume the data as fast as SQL Server could churn it out. But this is SQL Server Management Studio, it’s absorbing the data as fast as it can receive it.
And that’s when it hit me. SSMS was consuming the data as quickly as it received it. The development server is in the office. The production server is in a datacenter 1/3 of a continent away and the connection between that datacenter and the office sometimes gets bogged down. The problem wasn’t that SSMS couldn’t drink from the firehose – the firehose had a clamp around it!
To test this idea, I made one last change to the query – I selected the data into a temp table instead of sending the results to SSMS. Round-trip time: two seconds. Conclusion: the network itself was causing the slowdown and this query needs no tuning!
As it turns out, SSMS has a way to discard query results so that you don’t have to resort to dumping results to a temp table – Discard query results after execution. It’s similar to a
/dev/null for result sets. If you only need an actual execution plan but no query results (like I did here), check it out.
For years I’ve looked at various queries online in sample code, diagnostic queries using DMVs, and the like and seen
CROSS APPLY in the
FROM clauses. But I’ve never really managed to comprehend what it was for or how it worked because I never saw a case where it was directly applied to something I was doing.
Finally, this week I had a breakthrough. I was working on updating a bunch of data but it was breaking on a small subset of that data. In this case, I was attempting to
JOIN two tables on fields that should have been
INTs, but in a very small number of cases one side was using a comma-delimited string. The user told me that someone else had done these updates in the past and didn’t encounter the problem I was having (so I knew that it was something i was doing “wrong”), but given that it was only a handful of broken updates she was OK with manually doing the updates (we were scripting it because we were updating potentially tens of thousands of records).
I am not OK with manually fixing this in the future. I wanted to know how the other DBA had done it before. I dug into some history and found
CROSS APPLY. My nemesis. I was determined to figure out how to use it this time.
Setting the Stage
Let’s set up three simple tables to keep track of airports and what state each airport is in. But our developer doesn’t totally get database design and in his state-to-airport mapping table, he allows for a comma-separated list of airports associated with each state.
CREATE TABLE #States
([Id] INT IDENTITY(1, 1),
StateName NVARCHAR(30) NOT NULL
CREATE TABLE #Airports
([Id] INT IDENTITY(1, 1),
IATACode CHAR(3) NOT NULL
CREATE TABLE #StateAirports
(StateId INT PRIMARY KEY NOT NULL,
This makes getting a list of airports and their associated state names tricky at best if we don’t know about
CROSS APPLY. With
CROSS APPLY, it’s pretty straightforward.
Here’s the finished query.
FROM #StateAirports SA1
CROSS APPLY string_split(SA1.airports, ‘,’) AS SA2
JOIN #Airports A ON A.Id = SA2.value
JOIN #states S ON S.Id = SA1.stateid
string_split() is a Table Valued Function which we finally got in SQL Server 2016 after far too many years of having to write (or, let’s face it, copy from someone’s blog post) inefficient string splitting functions. Important note: even if your database engine is SQL Server 2016, the database you’re operating in must be at
Breaking it down
If we take
CROSS APPLY and break it down into its parts, it finally starts to make sense.
string_split() function to the
Airports field of the
* Append the each row of
string_split()‘s output to the corresponding row on
#StateAirports (similar to a
CROSS JOIN but not exactly)
So now I have
N rows for each
N is the number of values in the comma-separated field. And
JOINed to each row is one of the rows from the output of
FROM #StateAirports SA1
CROSS APPLY string_split(SA1.airports, ‘,’) AS SA2
From there, the query is pretty normal otherwise,
JOINing to the other two tables to translate the state & airport ID numbers to their text values.
Hopefully this helps others get a handle on
CROSS APPLY and find useful places for it. This had been a head-scratched for me for years, but only because I didn’t have an example that clearly broke down how to use it and what was going on. In hindsight, I probably could have used it in some analysis I did at a previous job but instead resorted to parsing & processing comma-separated data in a PowerShell script.