Stop Using Production Data For Development

A common software development practice is to take data from a production system and restore it to a different environment, often called “test”, “development”, “staging”, or even “QA”. This lets support teams troubleshoot issues without making changes to the true production environment. It also lets development teams build new versions and features of existing products in a non-production environment. Using production to refresh development is just one of those things everyone accepts and does, without question.

Of course, the idea of testing in a non-production environment isn’t anything new. Consider haggis. No way someone thought to themselves “let me just shove everything I can into this sheep’s stomach, boil it, and serve it for dinner tonight.” You know they first fed it to the neighbor nobody liked. Probably right after they shoved a carton of milk in their face and asked “does this smell bad to you?”

For decades software development has made it a standard practice to create copies of production data and restore them to non-production environments. The practice was not without issues, however. For example, as data sizes grew, so did the length of time needed to do a restore. The restores also clogged network bandwidth, not to mention the costs associated with storage.

And then there is this:

If you read that tweet and thought “yeah, what’s your point?” then you are part of the problem.

As an industry we focus on access to specific environments, but not the assets in the environments. This is wrong. The royal family knows where the Crown Jewels are stored but if they are moved to another location you know the Jewels are heavily guarded at all times. Access to the jewels is important no matter where the jewels are located. The same should be true of your production data.

Data is the most critical asset your company owns. If you make efforts to lock down production but allow production data to flow to less-secure environments, then you haven’t locked down production.

It is ludicrous to think about the billions of dollars spent to lock down physical access to data centers only to allow junior developers to stuff customer data on a laptop they will then leave behind on a bus. Or senior developers leaving S3 buckets open. Or forgetting they pushed credentials to a GitHub repo.

If you are still moving production data between environments you are a data breach waiting to happen. I don’t care what the auditors say, you are at an elevated and unnecessary risk. Like when Obi-Wan decides to protect baby Luke by keeping his name and taking him to Darth Vader’s home planet. Nice job, Ben, no way this ends up with you dying, naked, in front of a few dozen onlookers.

I think what frustrates me most is this entire system is unnecessary. You have options when moving production data. You can use data masking, obfuscation, and encryption in order to reduce your risk. But the best method is to not move your data at all.
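If you do have to move data, masking can be as simple as swapping sensitive columns for stable tokens. Here is a minimal sketch, assuming a hypothetical customer row with `name` and `email` columns and a made-up `MASKING_KEY`; none of these names come from a real system, and a keyed hash is only one of several ways to do this.

```python
import hashlib
import hmac

# Hypothetical secret for illustration; keep the real one out of non-production.
MASKING_KEY = b"replace-with-a-secret-key"


def pseudonymize(value: str, key: bytes = MASKING_KEY, length: int = 12) -> str:
    """Return a stable token derived from a keyed hash of the value."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def mask_email(email: str, key: bytes = MASKING_KEY) -> str:
    """Replace an email address with a token at the reserved example.com domain."""
    return f"user_{pseudonymize(email.lower(), key)}@example.com"


def mask_row(row: dict) -> dict:
    """Mask the sensitive columns of a (hypothetical) customer row."""
    masked = dict(row)  # copy so the original row is left untouched
    masked["name"] = f"Customer {pseudonymize(row['name'], length=8)}"
    masked["email"] = mask_email(row["email"])
    return masked
```

Because the same input always produces the same token, joins between masked tables still line up, which is usually what developers need from the data anyway.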

After years of being told “don’t test in production” it’s time to think about testing in production. Continuous integration and continuous delivery/deployment (CI/CD) allow you to achieve this miracle. And for those who say “No, you dummy, CI/CD is what you do in test before you push to production,” I offer the following.

Use dummy data.

You don’t need production data, you need data that looks like production data. You don’t need actual customer names and addresses, you need similar names and addresses. And there are ways to simulate the statistics in your database, too, so your query plans have the same shape as production without the actual volume of data.
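Generating names and addresses that merely look real does not take much. The sketch below is one way to do it with nothing but the Python standard library; the name lists, the `fake_customer` and `generate_customers` helpers, and the row layout are all made up for illustration, and it covers only the look-alike rows, not the database statistics.

```python
import random

# Small, invented vocabularies; swap in whatever resembles your own data.
FIRST_NAMES = ["Alex", "Blair", "Casey", "Devon", "Emery", "Finley"]
LAST_NAMES = ["Archer", "Brooks", "Castillo", "Dalton", "Ellis", "Foster"]
STREETS = ["Main St", "Oak Ave", "Pine Rd", "Maple Dr", "Cedar Ln"]
CITIES = ["Springfield", "Riverton", "Fairview", "Lakeside"]


def fake_customer(rng: random.Random, customer_id: int) -> dict:
    """Build one synthetic customer row that has the shape of a real one."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    street_number = rng.randint(1, 9999)
    return {
        "id": customer_id,
        "name": f"{first} {last}",
        "email": f"{first}.{last}{customer_id}@example.com".lower(),
        "address": f"{street_number} {rng.choice(STREETS)}, {rng.choice(CITIES)}",
    }


def generate_customers(count: int, seed: int = 42) -> list:
    """Generate `count` rows; a fixed seed makes every refresh reproducible."""
    rng = random.Random(seed)
    return [fake_customer(rng, i) for i in range(1, count + 1)]
```

Calling `generate_customers(1000)` gives every developer the same thousand customers on every refresh, and not one of them is a real person.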

It’s possible for you to develop software code against simulated production data, as opposed to actual production data. But doing so requires more work, and nobody likes more work.

Until you are breached, of course. Then the extra work won’t be optional.

5 thoughts on “Stop Using Production Data For Development”

  1. Without knowing all the intricacies of the development/production environment, project objective, type of data, etc., making a blanket statement like “Stop Using Production Data For Development” is just ridiculous.

  2. Yeah, as a data engineer I’m getting a bit sick of hearing this tbh. It’s impossible to account for every data quality situation in a data pipeline and if you can’t test on a replica of all the production data, it just means you’re going to end up debugging your pipelines in production, which nobody wants.

    The only people arguing against using production data in development environments are IT security teams which don’t want to make life more difficult for themselves.
