Redgate SQL Provision, SQL Clone, Data Masker Part2


A long time ago in a galaxy far, far away…

This post is over 24 months old, and that is a lifetime in tech! Please be mindful of that when reading this post, young Padawan, as it could be outdated. I try to keep things up to date as much as possible, so if you think something needs updating, please let me know in the comments.

Recap

In my first post on Redgate SQL Provision, I ran some basic tests against the AdventureWorksDW2017 database to demonstrate how SQL Provision works.

The headline was that a database which took 49 seconds to restore conventionally could be cloned in 2 seconds, a saving of 47 seconds. Nothing to write home about. So to really prove how cool this product is, it needs to run against something much, much larger…if I still get the same sort of results, then that would be pretty cool.

The Stack Overflow Database

The Stack Overflow database is periodically published as an XML data dump. Brent Ozar uses it as part of their performance tuning courses and, being the super helpful person they are, carves it up into several different-sized backup files. What I find most helpful about this is that, because the structure is consistent, I can test stuff out at speed against one of the smaller databases before unleashing it on one of the larger datasets. There are tonnes of versions available, but the ones I have used for this test can be found on this page.


So the idea is to run some preliminary tests against the StackOverflow2010 database. I would then duplicate the masking file, switch the database connection and run a full test against the full StackOverflow dataset.

I have decided to mask a few columns within the Users table to imitate masking a customer table. Here is a screenshot of the first 10 rows;

10 rows of data from the customer table

Setting up the masking rules

There are not a huge number of columns to choose from, but we only need a few for the test, because what we are interested in isn’t how quickly my machine can process a load of update statements, but how quickly it can complete an end-to-end process and make multiple copies available to my (imaginary) team. I chose to mask the following columns;

  • AboutMe
  • DisplayName
  • Location
  • WebsiteURL

There are 9,737,247 users in this dataset, which posed a problem when trying to give each user a unique name. I didn’t really have to do this, but I figured I would keep it reasonably accurate. The predefined datasets weren’t really viable since none of them were going to provide enough unique values: there are fewer than a quarter of a million words in the English dictionary #gofigure, and nowhere near enough person names, so I had to configure DisplayName differently.

So the first masking step covered AboutMe, Location and WebsiteURL with some basic substitution. I set the AboutMe column to produce some random words up to a maximum of 50 characters;

shows the AboutMe field being adjusted, options are max word length, and minimum & maximum characters in paragraphs

I applied a predefined list of country names to the location column;

shows the location field being adjusted, options are use of 2 letter abbreviation or full name

And I configured the generation of random URLs as shown below;

shows the url being adjusted, options are top level domain options, whether domains are real words or random text, and whether to add folders and filenames to the urls.

For DisplayName, there is a large selection of options, from scrambling to regular expressions, but I opted for a custom calculation based on the user ID, since the user ID is unique. So I went for this;

'User_' + RIGHT('0000000000' + CAST(ID AS VARCHAR), 10)

This will result in a consistent pattern of “User_” followed by 10 digits consisting of the user ID plus some leading zeros. So ID 1 will be User_0000000001, ID 9999 will be User_0000009999 and so on. That kinda fits nicely with my OCD, as they’ll all be a consistent length. So the rule looks like this;

screenshot shows me setting up the custom ID

But for good measure I excluded the bot (which has an ID of -1) from the rule because User_00000000-1 just wouldn’t look right. You can apply filters on the WHERE clause tab;

screenshot shows me applying a custom filter
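
Jumping ahead slightly: once the masked image has been built and a clone created from it (both covered below), a quick spot check along these lines should show both rules in action. This is only a rough sketch, using Invoke-Sqlcmd from the SqlServer module (which is not part of SQL Clone) and assuming a clone named StackOverflow_clone_Developer001;

# Rough sketch: spot-check the DisplayName rule and the bot exclusion on a clone
# Assumes Invoke-Sqlcmd (SqlServer module) and a clone called StackOverflow_clone_Developer001
$connection = @{
    ServerInstance = 'DESKTOP-H4M4K2E'
    Database       = 'StackOverflow_clone_Developer001'
}
# The bot (Id = -1) should keep its original DisplayName; everyone else should be 'User_' plus 10 digits
Invoke-Sqlcmd @connection -Query 'SELECT Id, DisplayName FROM dbo.Users WHERE Id IN (-1, 1, 9999);'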

Building an image

I ran tests against the 2010, 2013 and full datasets; below is a walkthrough of the full dataset…

To build an image, I need a backup file (or an existing database) and a file share for the image to be hosted on. I then wrote a PowerShell script to build the image, which looks like this;

Connect-SqlClone -ServerUrl 'http://localhost:14145'
$SqlServerInstance = Get-SqlCloneSqlServerInstance -MachineName DESKTOP-H4M4K2E -InstanceName ""
$imageDestination = Get-SqlCloneImageLocation -Path '\\DESKTOP-H4M4K2E\Image'
$imageName = "StackOverflow"
$myMaskingModification = New-SqlCloneMask -Path 'C:\Code\demos-redgate-sql-provision\stackoverflow\stackoverflow.dmsmaskset'
$imageOperation = New-SqlCloneImage -Name $imageName -SqlServerInstance $sqlServerInstance -BackupFileName @('G:\Backups\StackOverflow.bak') -Destination $imageDestination -Modifications $myMaskingModification
$imageOperation | Wait-SqlCloneOperation

This process will initially build an image, then mask four columns in two steps for just under 10 million rows. I have cut down the output of the log to the salient points;

14:33:42 Operation started  
15:10:04 Attach database  
15:10:04 Starting Data Masking Script  
15:38:44 Finished Data Masking Script  
15:39:21 Operation succeeded with a duration of 01:05:39

It restored the database in ~37 minutes, with masking taking ~28 minutes. Final footprint of the image on disk is ~329GB. The original database was ~323GB.

Send in the clones

So this is where SQL Provision really comes into its own. I set up a PowerShell script to imitate provisioning databases for 10 developers, 3 test databases and 1 pre-prod database. The PowerShell script looks like this;

Connect-SqlClone -ServerUrl 'http://localhost:14145'
$SqlServerInstance = Get-SqlCloneSqlServerInstance -MachineName DESKTOP-H4M4K2E -InstanceName ""
$imageName = "StackOverflow"
$image = Get-SqlCloneImage -Name $imageName
$image | New-SqlClone -Name 'StackOverflow_clone_Developer001' -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name 'StackOverflow_clone_Developer002' -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name 'StackOverflow_clone_Developer003' -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name 'StackOverflow_clone_Developer004' -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name 'StackOverflow_clone_Developer005' -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name 'StackOverflow_clone_Developer006' -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name 'StackOverflow_clone_Developer007' -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name 'StackOverflow_clone_Developer008' -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name 'StackOverflow_clone_Developer009' -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name 'StackOverflow_clone_Developer010' -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name 'StackOverflow_clone_Test001' -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name 'StackOverflow_clone_Test002' -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name 'StackOverflow_clone_Test003' -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name 'StackOverflow_clone_PreProd' -Location $sqlServerInstance | Wait-SqlCloneOperation
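
Since all fourteen clones follow the same naming pattern, the same work could be done in a loop rather than one line per clone. Here is a minimal sketch, not taken from my original script, using the same cmdlets;

# Minimal sketch: create the 10 developer, 3 test and 1 pre-prod clones in a loop
Connect-SqlClone -ServerUrl 'http://localhost:14145'
$sqlServerInstance = Get-SqlCloneSqlServerInstance -MachineName DESKTOP-H4M4K2E -InstanceName ""
$image = Get-SqlCloneImage -Name 'StackOverflow'

# Build the list of clone names: Developer001-010, Test001-003 and PreProd
$developerClones = 1..10 | ForEach-Object { 'StackOverflow_clone_Developer{0:D3}' -f $_ }
$testClones = 1..3 | ForEach-Object { 'StackOverflow_clone_Test{0:D3}' -f $_ }
$cloneNames = @($developerClones) + @($testClones) + 'StackOverflow_clone_PreProd'

foreach ($cloneName in $cloneNames)
{
    # Each New-SqlClone call returns an operation; wait for it to finish before starting the next
    $image | New-SqlClone -Name $cloneName -Location $sqlServerInstance | Wait-SqlCloneOperation
}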

Each clone build is a separate command, so there is no end-to-end log and I don’t have the patience to collate them for this post! Here is the output from one clone build;

20:13:40 Operation started
20:13:44 Started to attach database StackOverflow_clone_Developer001
20:13:45 Finished to attach database StackOverflow_clone_Developer001
20:13:45 Operation succeeded with a duration of 00:00:04.588

In under 5 seconds it provisioned a dedicated database! In total the process took just over a minute to complete, with a footprint of ~640MB for all 14 databases.

I can now apply changes directly to a single database, independently of the rest of my team;

screenshot shows the output from two clone databases: an update statement has been run against the first clone, setting all votes to zero; both result sets show the same table with masked data, but the first database now has zero votes whereas the second shows the original votes

In the screenshot above, you can see that I have updated the Users table in the Developer001 clone. When I run the same query against Developer001 and Developer002, you can see that all information is returned consistently, except for the changes applied to Developer001.
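
The same comparison can be scripted. Below is a rough sketch using Invoke-Sqlcmd from the SqlServer module (not part of SQL Clone); it assumes the UpVotes column of the Users table is the one being zeroed, since the screenshot only shows “votes”;

# Rough sketch: zero the votes in Developer001 only, then compare the two clones
# Assumes Invoke-Sqlcmd (SqlServer module) and that UpVotes is the column updated in the screenshot
$server = 'DESKTOP-H4M4K2E'
$query = 'SELECT TOP 10 Id, DisplayName, UpVotes FROM dbo.Users ORDER BY Id;'

# Update only the Developer001 clone
Invoke-Sqlcmd -ServerInstance $server -Database 'StackOverflow_clone_Developer001' -Query 'UPDATE dbo.Users SET UpVotes = 0;'

# Developer001 returns zeroed votes, Developer002 still returns the original values
Invoke-Sqlcmd -ServerInstance $server -Database 'StackOverflow_clone_Developer001' -Query $query
Invoke-Sqlcmd -ServerInstance $server -Database 'StackOverflow_clone_Developer002' -Query $query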

Result

This test is a bit more like it! I have taken a ~300GB database, securely masked it and provisioned independent copies of it for 10 developers, three test tracks and pre-prod testing in under 70 minutes! As more changes are applied to these clones their footprint will increase, but when a developer can be given the ability to drop their clone and re-create it in seconds, or even create more copies, this tool adds so much flexibility to development teams!
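
That drop-and-recreate reset is itself only a few lines of PowerShell. Here is a minimal sketch, assuming the Get-SqlClone and Remove-SqlClone cmdlets from the same SQL Clone PowerShell module;

# Minimal sketch: reset a developer's clone by dropping it and re-creating it from the masked image
Connect-SqlClone -ServerUrl 'http://localhost:14145'
$sqlServerInstance = Get-SqlCloneSqlServerInstance -MachineName DESKTOP-H4M4K2E -InstanceName ""
$image = Get-SqlCloneImage -Name 'StackOverflow'

# Drop the existing clone...
Get-SqlClone -Name 'StackOverflow_clone_Developer001' | Remove-SqlClone | Wait-SqlCloneOperation
# ...and re-create it from the image; this takes seconds rather than another full restore
$image | New-SqlClone -Name 'StackOverflow_clone_Developer001' -Location $sqlServerInstance | Wait-SqlCloneOperation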

Here are the stats from each load;

Footprint

Database   Database Size   Image Size   Clone Size
2010       ~9GB            ~9GB         616MB
2013       ~51GB           ~48GB        616MB
Full       ~323GB          ~329GB       616MB

Timings

Database   Time to build image   Time to create clones
2010       00:02:19              00:01:01
2013       00:14:35              00:00:53
Full       01:11:39              00:01:04

Note: reading this back whilst moving it, I am disputing the size of the clones! Once I have had time to clean up the new site I will re-read this properly. My immediate reaction is that I believe the clones should be smaller!

Summary

All in all, I am really impressed with this process. It can be run securely: the backup can be hosted on a restricted file share, the image is only accessible to the service account while it is being constructed, and once complete, developers can only obtain a clone of the de-sensitised image.

Next steps

There are changes I could make to the masking process to optimise the image load time. Tips and techniques are covered in the Redgate University videos, but since the main test was about how quickly I could spin up multiple copies of the database, I didn’t spend a huge amount of time on this. All scripts used for this blog post can be found on GitHub.

Further reading

You can find more information on the SQL Provision product page and Redgate have set up some useful videos at the Redgate University.

#mtfbwy


