Using Production Data in Continuous Integration Tests

Handling test data is a challenge for anyone testing a service or application that interacts with a database. One of the services my team is responsible for is already used by several games in production. So any changes have to be validated against the real production data. In this post, I’ll describe how I am able to get the data I need for continuous integration testing, without putting undue stress on our production databases or our DBA team.

The gold standard for test data is whatever is used in production, but write access to production is usually very restricted. Since many tests rely on injecting test data into the database, they cannot run in a read-only environment. Even reading production data can cause issues if slow queries hold locks and cause slowdowns for actual customers.

The obvious solution is just to copy the production database into the test environment, but depending on the size of the tables this can be very time-consuming, especially if the data includes customer information, which would need to be masked in some way. For a test suite that runs multiple times a day in a continuous integration environment, copying the whole production database every time just isn't practical.

The way we got around this problem was to use mysqldump to extract the schema from the production database along with *only* the data needed to run tests. This data is mainly static tables containing the game, configuration, and product information. It is a much smaller set than the constantly expanding and potentially sensitive customer data.

Step 1 – Create Your Test Database

We are fortunate enough to have a spare machine available in the dev environment that can host the test versions of our database. The CI build job is also configured to point the test service to this host. Once you have a host, the temptation is to name the test database the same as the production database, but it is better to give it a unique name based on who is running the test. So if the production database is called STORE, then the CI version would be called CI_TEST_STORE, my personal instance would be AMBER_TEST_STORE, and so on. The advantage of this is that multiple test runs can happen at once on the same host without conflicting.
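
As a minimal sketch (assuming the same {test_host} credentials used in the steps below), creating a per-user database is a one-liner:

mysql -h{test_host} -u{test_user} -p{test_password} -e "CREATE DATABASE IF NOT EXISTS AMBER_TEST_STORE"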

Step 2 – Getting the Schema

To get just the table information, without any data at all, run the following command:

mysqldump --no-data --lock-tables=false -h{prod_host} -u{prod_user} -p{prod_password} {prod_dbname} --result-file=schema.sql

Then use the schema.sql file to create the production schema in your test database.

mysql -h{test_host} -u{test_user} -p{test_password} {test_dbname} < schema.sql

Note that it is possible to pipe these commands together without saving to a file, but I would recommend doing it in two steps so that you have the schema file to look at if something goes wrong. Plus, having the file means you can reload the test database without hitting production again.
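
For reference, the piped version looks like this:

mysqldump --no-data --lock-tables=false -h{prod_host} -u{prod_user} -p{prod_password} {prod_dbname} | mysql -h{test_host} -u{test_user} -p{test_password} {test_dbname}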

Step 3 – Getting Only the Data You Need

Now run mysqldump again, this time including data, but ignoring the tables you don't need. The ignored tables are generally going to be tables containing specific customer or user information, like orders. This customer-generated data can instead be created on the fly by your tests.

mysqldump --no-create-info --complete-insert --lock-tables=false --ignore-table={dbname}.{TABLE1} --ignore-table={dbname}.{TABLE2} -h{dbhost} -u{dbuser} -p{dbpassword} {dbname} --result-file=data.sql

Repeat the --ignore-table option as many times as you need to filter out all the tables you don't want to copy; note that mysqldump expects the fully qualified {dbname}.{table} form for this option. The --no-create-info and --complete-insert flags will result in a script that does not duplicate the table creation, but instead marshals the existing data into INSERT statements. Then simply run the data.sql file against your test database to insert all that delicious production data.

mysql -h{test_host} -u{test_user} -p{test_password} {test_dbname} < data.sql

EDIT: I had a bit here about the duplication of tables, but I realized mysqldump does support fetching just the data. (See above.)
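
One last note on this step: if the ignore list gets long, it is tidier to build the flags in a small wrapper script. Here is a minimal sketch, assuming a bash shell and some hypothetical customer-data table names:

#!/bin/bash
# Hypothetical customer-generated tables to skip
IGNORED_TABLES="orders order_items customers"
IGNORE_FLAGS=""
for t in $IGNORED_TABLES; do
  IGNORE_FLAGS="$IGNORE_FLAGS --ignore-table={dbname}.$t"
done
# Dump only the data for the remaining tables
mysqldump --no-create-info --complete-insert --lock-tables=false $IGNORE_FLAGS -h{dbhost} -u{dbuser} -p{dbpassword} {dbname} --result-file=data.sql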

Step 4 – Apply Schema Changes

Sometimes a new feature involves new database tables or column updates. One major advantage of testing against the production database schema is that you can test any proposed schema changes. Simply run the update script against your test database before starting the test run. It occasionally happens that developers update the schema in lower environments without notifying the test team or creating a specific update script. In this case, the test cases for that feature will fail, highlighting the need for a production-level database update.
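
In practice this is just one more mysql invocation before the tests kick off; schema_update.sql here is a stand-in for whatever migration script accompanies the feature:

mysql -h{test_host} -u{test_user} -p{test_password} {test_dbname} < schema_update.sql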

Conclusion

This is just a brief overview of a strategy my team has found useful for testing updates to a production-level service. By limiting the actual data imported from production, we get the confidence of using the production schema without the hassle of dealing with an avalanche of production data. I'll try to post an actual script to GitHub soon; I just need to clean it up for general consumption. Update: it's now on GitHub as pickyport!
