Hiking through Synthetic Data
Imagine you’re a conspiracy theorist. You hike for hours past the Grimm Lake salt flats in Nevada, and, for the very first time, glimpse the storied Area 51. Imagine that some federal agent (let’s call her Agent Scully) is responsible for ensuring that your view of Area 51 is somehow obfuscated, so as to protect government secrets.
Agent Scully might have covertly swapped your sunglasses the night before your hike, replacing them with special decoys that blur your vision and hide whatever secrets are otherwise visible. Alternatively, she might have created a “mock” Area 51 near Grimm Lake, one that looks quite a lot like the original Area 51, and is good enough to satisfy your curiosity without having to expose any secret projects!
But how does this have anything to do with differential privacy, or machine learning? More on that soon!
Differential Privacy and a Motivating Question
If you’re new to the idea of differential privacy, or just want a refresher, I suggest you read Part 1 of this series. If you’ve read Part 1, you’ll recall the analogy of a bowl of Skittles containing one errant M&M. We wanted a scheme where, upon releasing information about our bowl of Skittles, we could promise that a hypothetical bowl of Skittles containing an M&M is, to any malicious snacker, virtually indistinguishable from a bowl of Skittles without any M&M’s at all.
In other words, we conceptualized differential privacy as fulfilling a “promise”: namely, that an individual’s data is anonymized to a specific degree, and we can tune that degree with the variable epsilon, which controls the amount of random noise we add for privatization.
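To make epsilon concrete: the classic way to privatize a numeric query is the Laplace mechanism, which adds noise drawn from a Laplace distribution with scale sensitivity / epsilon. For a count query (say, "how many candies are in the bowl?"), adding or removing one person's candy changes the answer by at most 1, so the sensitivity is 1. Here is a minimal sketch in plain Python; the function names are my own, not from any particular library:

```python
import random


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale for the Laplace mechanism: sensitivity / epsilon."""
    return sensitivity / epsilon


def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release true_value with Laplace noise calibrated to (sensitivity, epsilon)."""
    scale = laplace_scale(sensitivity, epsilon)
    # The difference of two i.i.d. exponentials with rate 1/scale
    # is Laplace-distributed with mean 0 and scale `scale`.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_value + noise


# A count query over the bowl has sensitivity 1; smaller epsilon = more noise.
private_count = laplace_mechanism(true_value=100.0, sensitivity=1.0, epsilon=0.5)
```

Note how epsilon appears in the denominator of the scale: halving epsilon (a stronger privacy promise) doubles the expected magnitude of the noise.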
Hopefully you have at least some intuition for the DP promise after Part 1. Still, you might now be saying to yourself, “The theory is all well and good, but how do I (actually) use differential privacy in an applied scenario? For example, in a large-scale machine learning pipeline?”
That’s certainly a question we want to answer here at the Microsoft AI Development Acceleration program (MAIDAP), and I was lucky enough to have directly contributed to an effort in this field during a MAIDAP rotation!
Scenario 1: “Direct Learning” DP Approaches
For the sake of separating our applied methods into two simple categories, let’s call any approach that directly modifies the machine learning model a “direct learning approach”.
By “modifies the machine learning model,” I mean a method that adds noise to a supervised learning model during training. In a differentially private direct learning approach, we don’t preprocess the data, and we don’t postprocess the results of the model: we apply noise to some part of the optimization step, like during gradient descent.
There are a lot of tools we can use to apply a direct learning approach to simple tasks. For example, if you have a supervised learning task like the canonical Adult Income Classification Task, you can go ahead and use IBM’s diffprivlib (specifically, try their DP Logistic Regression and Naïve Bayes). These simple tools run quickly and efficiently, and the results are usually good enough for scenarios with simple data.
In the case of more complex data and tasks (for example, image classification), we need a general approach that allows us to use a neural architecture (like a deep neural network or a convolutional neural network) while still ensuring our DP promise. In these scenarios, the Opacus library is a great tool for implementing a generalized direct learning approach. The setup takes a bit more time, and the learning curve is steeper, but you’ll be able to apply the tool to more scenarios.
The direct learning approach with Opacus makes sense in a lot of scenarios: for example, when your data is eyes-on but you still want to protect it. If you’ve already implemented a model in PyTorch, but you have reservations about using the model due to the sensitive nature of your data, you can simply apply the Opacus PrivacyEngine directly to your optimizer!
If you’re interested in exploring the direct learning approach in greater detail, you can clone the Smartnoise medical imaging notebook and play around with it yourself! We present a flow where we:
- Acquire and prepare x-ray images
- Develop a Convolutional Neural Network for image classification
- Train a standard (non-differentially private) model as a baseline
- Use differential privacy to train a privacy-preserving model
- Compare the performance of both models
The notebook in its entirety can be found here.
Differentially private direct learning approaches are the most common method of using DP in supervised learning scenarios. However, there are many other methods of ensuring the DP promise!
Back to Agent Scully: Differentially Private Synthetic Data
Now I’m going to discuss a different approach to ensuring the DP promise, one that involves preprocessing your original data to create an entirely new, privatized dataset.
Recall the odd tale of Agent Scully and the fake Area 51 from the beginning of this post. The decoy sunglasses employed by Agent Scully are like the DP “direct learning” approaches discussed earlier. They don’t directly modify the “landscape” of the data; they instead add noise to the viewer’s perception of the incoming information in order to make it private.
Building an entire decoy site, on the other hand, certainly modifies the “landscape” (quite literally in our analogy), and so the viewer is welcome to take a look from any angle, sunglasses or not, while the real landscape remains safely hidden. It is, however, a lot more effort. This is analogous to replacing real data with synthetic data, and it falls under the category of differentially private “indirect learning.”
An “indirect learning” synthetic data approach pre-processes the original data, and generates a synthetic replacement that mimics the data distribution without risking the privacy of any individual sample.
Scenario 2: The Synthetic Data Approach and QUAIL at MAIDAP
The first step for the synthetic data approach is generating the synthetic dataset. During data generation, we apply the DP promise, thus making the synthetic data itself differentially private. Then, you may train whatever model you want on that resulting privately generated data.
In contrast, recall that when we used Opacus to enforce our DP promise, we imposed privacy during supervised learning model training. This produced a model that was differentially private, but it did not produce data that was differentially private.
When creating DP synthetic data, we instead impose privacy during synthesizer training, which produces a synthetic data generator. This generator does not classify data; it simply produces it.
From our synthetic data generator, we are able to produce an entirely new synthetic data set, with the added bonus that the data itself is now differentially private. On this DP synthetic data, we can train whatever model we want, and perform as much experimentation/hyperparameter tuning as we like, without incurring privacy costs beyond the initial generator training.
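To make the two-phase workflow concrete, here is a deliberately tiny DP synthesizer: a Laplace-perturbed histogram over a single categorical column, written in plain numpy. This is a sketch of the idea, not the Smartnoise implementation, and it assumes add/remove-one neighboring datasets (so each count has sensitivity 1). Note that all of the privacy budget is spent once, when fitting the noisy histogram; sampling afterwards is pure post-processing and costs nothing:

```python
import numpy as np


class HistogramSynthesizer:
    """Toy DP synthesizer: Laplace-noised histogram of one categorical column."""

    def __init__(self, epsilon: float):
        self.epsilon = epsilon

    def fit(self, values, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        self.categories, counts = np.unique(values, return_counts=True)
        # Adding or removing one record changes one count by 1, so the
        # sensitivity is 1 and the Laplace scale is 1 / epsilon.
        # This is the only step that touches the real data.
        noisy = counts + rng.laplace(scale=1.0 / self.epsilon, size=counts.shape)
        noisy = np.clip(noisy, 0.0, None)
        self.probs = noisy / noisy.sum()
        return self

    def sample(self, n, rng=None):
        # Sampling from the fitted distribution is post-processing:
        # it incurs no additional privacy cost, no matter how often we call it.
        if rng is None:
            rng = np.random.default_rng()
        return rng.choice(self.categories, size=n, p=self.probs)
```

Once fit, you can draw as many synthetic rows as you like and run any amount of model training or hyperparameter tuning on them, which is exactly the flexibility described above. Real synthesizers in Smartnoise model joint distributions over many columns, but the budget accounting follows the same shape.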
The added benefit of flexibility may come at the cost of data utility; a supervised learning model trained on a synthetic dataset often performs worse than a direct learning approach.
However, if you know the supervised learning task beforehand, you can “boost” your synthetic data utility using a method called QUAIL.
QUAIL was developed by MAIDAP during a rotation with AzureML and Microsoft’s Responsible AI team. We worked directly on contributing to an open-source project named Smartnoise, which is part of a collaborative effort between Harvard and Microsoft called OpenDP.
Synthetic data generation with DP synthesizers and QUAIL has been shown to improve the synthetic data utility in a number of machine learning scenarios. The added flexibility of then having a synthetic dataset as an output to play with makes this approach appealing. If you are interested in looking at some of our results, you can read our paper on the subject here.
If you’d like a more applied scenario to play around with, you can follow our sample synthetic data approach in greater detail on the Smartnoise GitHub; please clone this notebook and play around with it yourself.
Congratulations! You now have a good sense of what tools exist for tackling machine learning problems that rely on sensitive data, and how differential privacy can help.
If you have any questions or issues concerning the use of our differentially private Smartnoise tools, please get in touch via our open source GitHub – we also always love to hear about any specific use cases of the package. And if you’re interested in learning more about MAIDAP and our work in Responsible AI, please check out our homepage. Every fall we hire a new cohort of university graduates studying AI/ML, so if you’re interested in building with us, please check out the Microsoft University Hiring website.