Putting differential privacy into practice to use data responsibly

Data can help businesses, organizations and societies solve difficult problems, but some of the most useful data contains personal information that can’t be used without compromising privacy. That’s why Microsoft Research spearheaded the development of differential privacy, which safeguards the privacy of individuals while making useful data available for research and decision making. Today, I am excited to share some of what we’ve learned over the years and what we’re working toward, as well as to announce a new name for our open source platform for differential privacy – a major part of our commitment to collaborate around this important topic.

Differential privacy consists of two components: statistical noise and a privacy-loss budget. Statistical noise masks the contribution of individual data points within a dataset without significantly affecting the accuracy of aggregate results, while a privacy-loss budget keeps track of how much information has been revealed through various queries to ensure that the combination of many aggregate queries doesn’t inadvertently reveal private information.
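To make these two components concrete, here is a minimal sketch in Python of a counting query answered with Laplace noise while a budget object tracks cumulative privacy loss. It is an illustration only and does not use the SmartNoise API: the names PrivacyBudget, laplace_noise and noisy_count are hypothetical, the budget uses basic sequential composition (epsilons simply add up), and the floating-point noise sampling is not hardened the way a production library would need to be.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two exponential draws with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


class PrivacyBudget:
    """Hypothetical tracker: total privacy loss is the sum of the epsilons spent."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy-loss budget exhausted")
        self.spent += epsilon


def noisy_count(values, predicate, epsilon, budget, rng):
    # Charge the budget first, so an over-budget query never touches the data.
    budget.spend(epsilon)
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one person changes a count by at most 1,
    # so Laplace noise with scale 1/epsilon is sufficient.
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # seeded only to make the example repeatable
ages = [23, 35, 41, 52, 29, 64, 38, 47]
budget = PrivacyBudget(total_epsilon=1.0)

first = noisy_count(ages, lambda a: a >= 40, 0.5, budget, rng)
second = noisy_count(ages, lambda a: a < 30, 0.5, budget, rng)
try:
    noisy_count(ages, lambda a: a > 60, 0.5, budget, rng)
    third_allowed = True
except RuntimeError:
    third_allowed = False  # the budget of 1.0 was already used up

print(first, second, third_allowed)
```

A smaller epsilon per query means more noise but leaves more of the budget for later questions, which is the trade-off the budget exists to manage.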

Since differential privacy was created, Microsoft has conducted research and has developed and deployed technologies with the goal of enabling more people to participate in, contribute to and benefit from differential privacy. Last year, we partnered with Harvard’s Institute for Quantitative Social Science (IQSS) and School of Engineering and Applied Sciences (SEAS) to announce the OpenDP Initiative, and earlier this year we released the initial version of our open source platform. We chose to develop differential privacy technologies in the open to enable increased participation in the creation of tools that empower a larger group of people to benefit from differential privacy.

Introducing SmartNoise

In June, we announced that we would be renaming our open source platform to avoid any potential misunderstanding of our intentions for this project and the community. Language and symbols matter, especially when you are trying to build an inclusive community and responsibly enable AI systems.

I’m thrilled to share that this platform will be renamed SmartNoise. The SmartNoise Platform, powered by OpenDP, captures an essential step in the differential privacy process and follows best practices of renaming terms like whitelist and blacklist to allowlist and blocklist.

By using SmartNoise, researchers, data scientists and others will be able to derive new and deeper insights from datasets that have the potential to help solve the most difficult societal problems in health, the environment, economics and other areas.

How we’re using SmartNoise and differential privacy today at Microsoft

As we apply differential privacy to our own products and begin to work with customers to do so, we’re learning a lot about what works and what we need to explore further.

Our first production use of differential privacy in reporting and analytics at Microsoft was in Windows, where we added noise to users’ telemetry data, enabling us to understand overall app usage without revealing information tied to a specific user. This aggregated data has been used to identify possible issues with applications and improve user experience.
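As an illustration of adding noise on each user's side before aggregation, the sketch below uses randomized response, a classic local technique. This post does not say which mechanism Windows uses, so randomized response is substituted here purely as an example; the function names and the simulated usage rate are assumptions.

```python
import math
import random


def randomize_bit(bit, epsilon, rng):
    # Report the true bit with probability e^eps / (e^eps + 1), otherwise flip it.
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if rng.random() < p_truth else 1 - bit


def estimate_rate(reports, epsilon):
    # Invert the known flipping probability to de-bias the observed average:
    # E[observed] = rate * (2p - 1) + (1 - p).
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(1)  # seeded only to make the example repeatable
true_rate = 0.3         # hypothetical share of users who opened an app
users = [1 if rng.random() < true_rate else 0 for _ in range(20000)]
reports = [randomize_bit(b, epsilon=1.0, rng=rng) for b in users]
estimate = estimate_rate(reports, epsilon=1.0)
print(round(estimate, 3))
```

No single report reveals much about its sender, yet the aggregate estimate stays close to the true usage rate once enough users contribute.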

Since then, we’ve applied differential privacy in similar ways to understand data that benefits our customers and helps us improve our products. We’ve learned that differential privacy works best in cases where a query or dataset with a limited set of computations will be refreshed on an ongoing basis – in these cases the work required to apply differential privacy pays off because you can spend the time to optimize it and then reuse that work. An example of this is the Insights for People Managers within Workplace Analytics. These insights enable managers to understand how the people in their team are doing and to learn how to drive change by using aggregated collaboration data without sharing any information about individuals.

One application of differential privacy that has limited parameters but still enables interactivity is advertiser queries on LinkedIn. Advertisers can get differentially private answers to their top-k queries (where k is the number of results the advertiser wants from the query). Each advertiser is allotted a limited number of queries, which helps to ensure that multiple queries can’t be combined to deduce private information. So, for example, an advertiser could find out which articles are being read by software engineers or employees of a particular company, but wouldn’t be able to determine which individual users were reading them.
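A rough sketch of how a private top-k query with a per-advertiser limit might look is shown below. It is not LinkedIn's implementation. It uses a textbook "peeling" approach: k rounds of report-noisy-max, each spending epsilon / k. The names noisy_top_k and AdvertiserQuota, the article counts, and the assumption that one member changes each count by at most 1 are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    # Difference of two exponential draws with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_top_k(counts, k, epsilon, rng):
    # Peeling: pick the noisy maximum k times, with fresh noise each round.
    # Assumes one member changes every count by at most 1 (sensitivity 1).
    remaining = dict(counts)
    eps_round = epsilon / k
    winners = []
    for _ in range(min(k, len(remaining))):
        noisy = {item: c + laplace_noise(2.0 / eps_round, rng)
                 for item, c in remaining.items()}
        best = max(noisy, key=noisy.get)
        winners.append(best)      # only the item name is released, not its count
        del remaining[best]
    return winners


class AdvertiserQuota:
    """Hypothetical per-advertiser cap on the number of private queries."""

    def __init__(self, max_queries):
        self.max_queries = max_queries
        self.used = {}

    def top_k(self, advertiser, counts, k, epsilon, rng):
        if self.used.get(advertiser, 0) >= self.max_queries:
            raise RuntimeError("query limit reached for " + advertiser)
        self.used[advertiser] = self.used.get(advertiser, 0) + 1
        return noisy_top_k(counts, k, epsilon, rng)


rng = random.Random(7)  # seeded only to make the example repeatable
article_reads = {"article-a": 100000, "article-b": 50000,
                 "article-c": 10, "article-d": 5}
quota = AdvertiserQuota(max_queries=2)
top_two = quota.top_k("advertiser-1", article_reads, k=2, epsilon=1.0, rng=rng)
print(top_two)
```

Capping the number of queries bounds the total privacy loss per advertiser at max_queries times epsilon under basic composition.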

Another key application area for differential privacy is in machine learning, where the goal is to produce a machine learning model that protects the information about the individual datapoints in the training dataset.

For example, in Office suggested replies, we use differential privacy to narrow the set of responses to ensure that the model doesn’t learn from any replies that might violate an individual user’s privacy.

During the training of a machine learning model, the training algorithm can add differentially private noise and manage the privacy budget across iterations. These algorithms often take longer to train, and often require tuning for accuracy, but this effort can be worth it for the more rigorous privacy guarantees that differential privacy enables.
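The idea of adding noise during training and splitting a budget across iterations can be sketched with a toy example: noisy gradient descent for a one-parameter linear model. This is a simplified illustration, not a production algorithm. It assumes the classic Gaussian-mechanism calibration (valid for a per-step epsilon below 1), basic composition across steps (real systems typically use tighter accounting), and a dataset size that is treated as public. The function name and all parameter values are hypothetical.

```python
import math
import random


def clip(value, bound):
    # Limit any single example's influence on the gradient to [-bound, bound].
    return max(-bound, min(bound, value))


def private_fit_slope(xs, ys, total_epsilon, total_delta, steps, clip_norm, lr, rng):
    # Split the budget evenly across steps (basic composition, deliberately loose).
    eps_step = total_epsilon / steps
    delta_step = total_delta / steps
    if not 0 < eps_step < 1:
        raise ValueError("this Gaussian calibration assumes 0 < epsilon per step < 1")
    sigma = clip_norm * math.sqrt(2.0 * math.log(1.25 / delta_step)) / eps_step
    w, spent, n = 0.0, 0.0, len(xs)
    for _ in range(steps):
        # Per-example gradient of (w*x - y)^2, clipped before summing.
        grad_sum = sum(clip(2.0 * (w * x - y) * x, clip_norm)
                       for x, y in zip(xs, ys))
        noisy_mean = (grad_sum + rng.gauss(0.0, sigma)) / n
        w -= lr * noisy_mean
        spent += eps_step
    return w, spent


rng = random.Random(42)  # seeded only to make the example repeatable
xs = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
ys = [3.0 * x for x in xs]  # synthetic data with true slope 3
w, spent = private_fit_slope(xs, ys, total_epsilon=8.0, total_delta=1e-5,
                             steps=20, clip_norm=1.0, lr=0.5, rng=rng)
print(round(w, 2), round(spent, 2))
```

Clipping and noise slow convergence, which is one reason differentially private training often takes longer and needs extra tuning.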

To take this scenario further, we are also exploring the potential for synthetic data in machine learning, which is currently only an option if we know the specific task or question the algorithm needs to understand. The idea behind synthetic data is that it preserves all the key statistical attributes of a dataset but doesn’t contain any actual private data. Using the original dataset, we would apply a differential privacy algorithm to generate synthetic data specifically for the machine learning task. This means the model creator doesn’t need access to the original dataset and can instead work directly with the synthetic dataset to develop their model. The synthetic data generation algorithm can use the privacy budget to preserve the key properties of the dataset while adding more noise in less essential places.
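One simple way to picture differentially private synthetic data is a noisy histogram: count the records in each category, add noise to the counts, then sample new records from the noisy distribution. The sketch below shows only this basic idea, not the task-specific generators described above; the function name, the fixed public list of categories, and the example data are assumptions.

```python
import random


def laplace_noise(scale, rng):
    # Difference of two exponential draws with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def synthesize(records, domain, epsilon, n_synthetic, rng):
    # `domain` must be a fixed, public list of categories, not derived from the data.
    counts = {value: 0 for value in domain}
    for record in records:
        counts[record] += 1
    # Adding or removing one record changes one bin by 1, so scale 1/epsilon suffices.
    # Clamping at zero is post-processing and costs no extra privacy.
    noisy = [max(0.0, counts[v] + laplace_noise(1.0 / epsilon, rng)) for v in domain]
    if sum(noisy) == 0:
        noisy = [1.0] * len(domain)  # fall back to uniform if every bin was clamped
    # Sampling from the noisy histogram never touches the original records again.
    return rng.choices(list(domain), weights=noisy, k=n_synthetic)


rng = random.Random(3)  # seeded only to make the example repeatable
original = ["A"] * 9000 + ["B"] * 1000
synthetic = synthesize(original, domain=["A", "B", "C"], epsilon=1.0,
                       n_synthetic=5000, rng=rng)
share_a = synthetic.count("A") / len(synthetic)
print(round(share_a, 2))
```

The synthetic records keep the overall proportions of the original data, while a rare category such as "C" may be dropped or slightly over-represented because of the noise.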

SmartNoise and differential privacy going forward

We have learned so much about differential privacy, and we’re only scratching the surface of what’s possible – and starting to understand the barriers and limitations that exist.

We continue to make investments in our tools, develop new ones and innovate with new practices and research. On the technical side, there are a few areas we will pursue further. Most production applications use a known, limited set of computations, so we’ll have to go further in making differential privacy work well for a larger set of queries. We will further enable interactivity, which means dynamically optimizing so queries work well without hand-tuning. We will develop a robust budget tracking system that allows many different people to use the data. And we will adopt security measures that allow an untrusted analyst to query and use the data without having full access to the dataset.
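To picture what budget tracking for many users could involve, here is a small hypothetical ledger that divides one dataset-level budget among several analysts. It is a sketch of the bookkeeping only, not a description of any existing SmartNoise component. It assumes the worst case in which analysts might combine their results, so individual allocations must sum to no more than the dataset's total.

```python
class SharedBudgetLedger:
    """Hypothetical ledger: per-analyst allocations drawn from one dataset budget."""

    def __init__(self, dataset_epsilon):
        self.dataset_epsilon = dataset_epsilon
        self.allocated = {}
        self.spent = {}

    def allocate(self, analyst, epsilon):
        # Allocations may never exceed the dataset-wide total.
        if sum(self.allocated.values()) + epsilon > self.dataset_epsilon + 1e-12:
            raise RuntimeError("allocation exceeds the dataset budget")
        self.allocated[analyst] = self.allocated.get(analyst, 0.0) + epsilon
        self.spent.setdefault(analyst, 0.0)

    def charge(self, analyst, epsilon):
        # Called before each query; refuses once the analyst's share is used up.
        if analyst not in self.allocated:
            raise KeyError("no budget allocated to " + analyst)
        if self.spent[analyst] + epsilon > self.allocated[analyst] + 1e-12:
            raise RuntimeError("budget exhausted for " + analyst)
        self.spent[analyst] += epsilon

    def remaining(self, analyst):
        return self.allocated[analyst] - self.spent[analyst]


ledger = SharedBudgetLedger(dataset_epsilon=1.0)
ledger.allocate("alice", 0.6)
ledger.allocate("bob", 0.4)
ledger.charge("alice", 0.25)
print(round(ledger.remaining("alice"), 2))
```

How large each analyst's share should be is a policy decision that code like this cannot answer; it can only enforce whatever split is chosen.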

There are also policy, governance and compliance questions that need to be addressed. For example, if we are allocating budget for a dataset across a diverse set of users and potential projects, how do we decide how much budget each researcher accessing the data gets? Going forward, we will strive to answer these important questions with the help of the open source differential privacy community.

Synthetic data is also a particularly exciting area for exploration because anyone could access and use it without privacy ramifications. However, there are still many research questions on how to effectively implement differential privacy – while still providing accurate results – when we don’t know what the analysis will look like in advance.

Many questions remain, and we know we will need help from the community to answer them. With the OpenDP Initiative and SmartNoise project, we announced our commitment to developing differential privacy technologies in the open to enable more people to participate and contribute, and we look forward to collaborating with and learning from all of you.

Gary King, director of the Institute for Quantitative Social Science at Harvard, had this to say: “We created OpenDP to build a far more secure foundation for efforts to ensure privacy for people around the world. We are proud to release SmartNoise with Microsoft and hope to build an active and vibrant community of researchers, academics, developers and others to fully unlock the power of data to help address some of the most pressing challenges we face.”

If you want to get involved in OpenDP and SmartNoise, find us on GitHub. We will also continue to openly share our technical and non-technical learnings as we deploy differential privacy in production across the company.

Sarah Bird is a principal program manager and the Responsible AI lead for Cognitive Services within Azure AI. Sarah’s work focuses on research and emerging technology strategy for AI products in Azure. Sarah works to accelerate the adoption and impact of AI by bringing together the latest innovations in research with the best of open source and product expertise to create new tools and technologies.