Q&A: Dealing with Thousands of Databases

This is part one of a three-part series.

I’ve mentioned in various places, including in blog posts on occasion, that my production SQL Server instance hosts several thousand (nearly 9000 as of this writing) databases. People are usually surprised to hear this and it often leads to interesting conversation.

Jon Shaulis (blog | twitter) asked me about it on Twitter recently.

And I realized that I haven’t ever sat down to address this in detail. I’ve spoken about it on the SQL Data Partners Podcast and written little bits here and there on the blog in the context of “here’s something that tripped me up,” but I’ve never written specifically about the topic.

I invited Jon to send in some questions for me to answer here. He sent nine, and in the course of writing up the answers I got a bit carried away, so I’ve split them into three posts for readability.

Would you consider the tools, resources, and methodology of how you manage 1000 databases similar to managing 100? 500?

Once you reach the scale of thousands of databases on a single instance, you find the places where your tried-and-true tools and methods start getting flaky or breaking down entirely. SMO (the library under SQL Server Management Studio as well as the sqlserver and dbatools PowerShell modules) doesn’t play well with thousands of databases, for example. The SSMS Object Explorer gets very slow. The Use Database drop-down on the toolbar just locks the whole application up. Server.Databases.Refresh() times out unless you’re very selective about the initialization fields you set for the connection. Some monitoring tools/platforms fall over entirely.
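One concrete example of working around the SMO problem: by default, SMO will fetch far more metadata than you need for every database it enumerates. Below is a minimal sketch, not our exact tooling, assuming the SqlServer module is installed and using a placeholder instance name, of narrowing the default initialization fields so that walking the Databases collection stays cheap.

```powershell
# Minimal sketch: restrict which Database properties SMO pulls back up front.
# 'MyInstance' is a placeholder, not a real server name.
Import-Module SqlServer

$server = New-Object Microsoft.SqlServer.Management.Smo.Server 'MyInstance'

# Without this, SMO lazy-loads most properties with a separate round trip per
# database, which is brutal when there are thousands of them.
$server.SetDefaultInitFields([Microsoft.SqlServer.Management.Smo.Database], 'Name', 'Status', 'RecoveryModel')

# This enumeration now issues one narrow query instead of thousands of wide ones.
$server.Databases | Select-Object Name, Status, RecoveryModel
```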

You start looking at every operation that touches multiple databases with an eye toward finding a way to multi-thread it. An operation that takes one second per database is not much of an issue if you run it across 500 databases in series. With 8000 databases, that’s now a two-hour (or longer) slog. We’re fortunate that our databases are relatively small and have few indexes (that’s another topic), otherwise our weekly index maintenance would take forever.
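To make that concrete, here is the general shape of the parallelization I mean. This is a hedged sketch rather than any of our actual jobs: it assumes PowerShell 7+ (for ForEach-Object -Parallel) and dbatools, and the instance name and the sp_updatestats call are placeholders standing in for whatever per-database work you need to do.

```powershell
# Sketch only: fan a per-database task out across a handful of threads instead
# of walking ~8000 databases one at a time. Placeholder names throughout.
Import-Module dbatools

$instance  = 'MyInstance'
$databases = Get-DbaDatabase -SqlInstance $instance -ExcludeSystem |
    Select-Object -ExpandProperty Name

$databases | ForEach-Object -Parallel {
    # At one second per database, 8000 databases in series is a 2+ hour slog;
    # a throttle of 8-16 threads brings that back down to well under half an hour.
    Invoke-DbaQuery -SqlInstance $using:instance -Database $_ -Query 'EXEC sp_updatestats;'
} -ThrottleLimit 8
```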

You need more rigor around keeping track of the databases. The fact that a database exists doesn’t tell you whether it’s in use or, if it is, how current the data (and, depending on how you deploy changes, the schema) in it is. Can this group of databases be archived? Which ones are the top priorities to be recovered in case of a disaster? What are the critical business hours for the customer associated with a given database? How do you know that your latest schema change got deployed to every database? Who does the database belong to, and who needs to be notified if there’s an issue with it?
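None of the object names below come from anything we actually run, but as an illustration of the kind of rigor I mean: if you keep a separate inventory (here a hypothetical dbo.DatabaseInventory table on a hypothetical admin server), you can at least mechanically reconcile it against what really exists on the instance.

```powershell
# Illustrative sketch with made-up names: reconcile an inventory table against
# the databases that actually exist on the instance.
Import-Module dbatools

$onInstance  = Get-DbaDatabase -SqlInstance 'MyInstance' -ExcludeSystem |
    Select-Object -ExpandProperty Name

$inInventory = (Invoke-DbaQuery -SqlInstance 'AdminServer' -Database 'DBAAdmin' `
    -Query 'SELECT DatabaseName FROM dbo.DatabaseInventory;').DatabaseName

# '<=' side: tracked in the inventory but missing from the instance.
# '=>' side: exists on the instance but nobody is tracking it.
Compare-Object -ReferenceObject $inInventory -DifferenceObject $onInstance
```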

Maintaining a Disaster Recovery environment that can be brought up quickly isn’t just a matter of shipping backups and logs to a secondary site, because restoring all those backups may take a day or more.

For that matter, how do you manage tight Recovery Point and Recovery Time Objectives? If you’ve got an RPO of 15 minutes but it takes 30 minutes to cycle through all the databases to take transaction log backups, what do you do? If it takes 15 seconds to restore a database and you have to bring 2,000 back online within 20 minutes, time is not on your side.
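The arithmetic on that restore scenario is sobering. A quick back-of-the-envelope calculation using the numbers above (15 seconds per restore, 2,000 databases, a 20-minute window):

```powershell
# Back-of-the-envelope math for the restore scenario above.
$databases    = 2000
$secondsPerDb = 15
$rtoMinutes   = 20

$serialHours   = $databases * $secondsPerDb / 3600
$streamsNeeded = [math]::Ceiling(($databases * $secondsPerDb) / ($rtoMinutes * 60))

"Serial restore: {0:N1} hours; roughly {1} concurrent restore streams needed to hit the RTO" -f $serialHours, $streamsNeeded
# Serial restore: 8.3 hours; roughly 25 concurrent restore streams needed to hit the RTO
```

And that assumes your storage and restore target can actually sustain that many simultaneous restores, which is its own problem.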

What paradigm shifts did you undergo moving from few databases to many? Did this change how you use technology or what technology you used? How different was the culture and mindset between managing a few and managing many?

I came into this environment after it was already well-established, probably around 7000 databases at the time I joined the company. I moved from an environment made up almost entirely of software sold by ISVs to one where it’s all custom, in-house development (because we’re a software company). In that previous position, I was sort of a development DBA (I wasn’t officially declared a DBA at the time but was involved with a lot of database work day-to-day) and only had maybe two dozen databases per instance, a far cry from even my current test environment. So I went through a number of shifts personally in about a week and a half.

The shift for me was larger in scope, mostly getting my head around RPO and RTO requirements that were less stringent than I was used to. I was also accustomed to having development and staging environments which almost exactly matched production. With thousands of databases, mirroring that for non-production environments is costly in both time and hardware, and it’s something a lot of shops don’t want to deal with. So you make a best effort with your test systems and extrapolate to make estimates about how things will work in production.

What was the most difficult challenge faced initially with a large environment and how does that challenge relate to now?

For me personally, it was just getting a handle on how to deal with this many databases because I didn’t “grow up” with the system. I walked into an environment with a lot of established tools and procedures for performing tasks and had to learn how those all fit together while also not breaking anything. You don’t want to be the person who walks in the door, says “why are you doing things like this, you should be doing it this other way” and then falls victim to hubris. If something seems unusual, there’s probably a reason for that and you need to understand the “why” before trying to change anything.

After 30 months, I’m still learning some of the nooks and crannies of the system. I’ve also had the opportunity to learn a couple of new tools (Red Gate’s SQL Source Control, SQL Compare, and SQL Multi Script, as well as a product called Double Take, which has been through a few changes over the years and I think is now under the Carbonite Availability branding).

Challenges for the whole environment?

When I started, it took about 24 hours to complete a full backup of all our databases, and due to space constraints we could only keep 2 days of full backups on disk (we keep more on tape to satisfy our retention requirements). Today, we’re able to keep 2+ weeks on disk because we switched to weekly fulls plus daily diffs. dbatools was a big help here, giving us an easier way to restore something other than a single full backup. Weekly fulls complete in less than 6 hours and daily diffs in less than 2 hours, thanks to running multiple concurrent jobs.
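On the restore side, the appeal of dbatools is that it builds the restore chain for you. A rough sketch of the idea (the instance, path, and database name are placeholders, not our actual layout):

```powershell
# Sketch only: point Restore-DbaDatabase at a folder containing the weekly full
# and the subsequent diffs, and let it work out the correct restore chain.
Import-Module dbatools

Restore-DbaDatabase -SqlInstance 'DR-Instance' `
    -Path '\\backupshare\sql\CustomerDb0001' `
    -DatabaseName 'CustomerDb0001' `
    -WithReplace
```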

Performance monitoring and management used to be another challenge. When I started, there was no dedicated SQL Server monitoring in place and we experienced occasional major performance degradation incidents. Since implementing SentryOne Monitor, we’ve identified root causes for those incidents and a number of other performance hotspots. Using this information (and new features in newer releases of SQL Server), we’ve made changes that have allowed us to maintain or even reduce our SQL Server resource allocation and improve performance, even as the business continues to grow.

Stay tuned for Part 2 coming soon!