Peter Wayner
Contributing writer

Differential privacy: Pros and cons of enterprise use cases

Feature
Jan 04, 2021 | 11 mins
Data Privacy | Encryption | Security

Hiding sensitive data in a sea of noise might have more value than encryption in some use cases. Here are the most likely differential privacy applications and their trade-offs.


In the past, the pursuit of privacy was an absolute, all-or-nothing game. The best way to protect our data was to lock it up with an impregnable algorithm like AES behind rock-solid firewalls guarded with redundant n-factor authentication. 

Lately, some are embracing the opposite approach by letting the data go free but only after it’s been altered or “fuzzed” by adding a carefully curated amount of randomness. These algorithms, which are sometimes called “differential privacy,” depend on adding enough confusion to make it impossible or at least unlikely that a snoop will be able to pluck an individual’s personal records from a noisy sea of data.

The strategy is motivated by the reality that data locked away in a mathematical safe can’t be used for scientific research, aggregated for statistical analysis or analyzed to train machine learning algorithms. A good differential privacy algorithm can open the possibility of all these tasks and more. It makes sharing simpler and safer (at least until good, efficient homomorphic algorithms appear).

Protecting information by mixing in fake entries or fudging the data has a long tradition. Map makers, for instance, added “paper towns” and “trap streets” to catch plagiarists. The area formally called “differential privacy” began in 2006 with a paper by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam D. Smith that offered a much more rigorous approach to folding in the inaccuracies.

One of the simplest algorithms in differential privacy’s quiver can be used to estimate how many people would answer “yes” or “no” to a question without recording each person’s true preference. Instead of blithely reporting the truth, each person flips two coins. If the first coin is heads, the person answers honestly. If the first coin is tails, though, the person looks at the second coin and answers “yes” if it’s heads or “no” if it’s tails. This approach is commonly called “randomized response.”

The process ensures that about 50% of the people are hiding their answers and injecting noise into the survey. It also allows enough truthful answers into the count to recover an accurate estimate of the overall rate after a simple correction. If someone tries to spy on an individual’s answer, it’s impossible to know whether that particular “yes” or “no” happened to be truthful, but aggregate statistics like the overall proportion can still be calculated accurately.
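
Here is a minimal Python sketch of that two-coin scheme; the survey size, the 30% “yes” rate and the helper names are made up for illustration. Because a reported “yes” happens with probability 0.5 times the true rate plus 0.25, a simple inversion recovers the population’s rate without exposing any one respondent.

    import random

    def randomized_response(truth: bool) -> bool:
        # First coin: heads means answer honestly.
        if random.random() < 0.5:
            return truth
        # Tails: ignore the truth and answer with the second coin.
        return random.random() < 0.5

    # Simulate a survey where 30% of people would truthfully answer "yes".
    population = [random.random() < 0.30 for _ in range(100_000)]
    reported = [randomized_response(answer) for answer in population]

    # Reported rate = 0.5 * true rate + 0.25, so invert that to estimate the truth.
    reported_rate = sum(reported) / len(reported)
    estimated_true_rate = 2 * (reported_rate - 0.25)
    print(f"reported {reported_rate:.3f}, estimated true rate {estimated_true_rate:.3f}")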

Interest in these algorithms is growing because new toolkits are appearing. Google, for instance, recently shared a collection of differential privacy algorithms in C++, Go and Java. Microsoft has open-sourced a Rust-based library with Python bindings called SmartNoise to support machine learning and other forms of statistical analysis. TensorFlow, one of the most popular machine learning tools, offers algorithms that guard privacy when training on sensitive data sets. The SmartNoise work is part of OpenDP, a larger drive to create an integrated collection of tools under an open-source umbrella with broad governance.

Some high-profile projects are using the technology. The answers to the 2020 US Census, for instance, must remain private for 72 years by law and tradition. However, many people want to use the Census data for planning, budgeting, and making decisions like where to put a new chain restaurant. So, the Census Bureau distributes its statistical summaries. This year, to protect the privacy of people in small blocks, it will inject noise using its “Disclosure Avoidance System.”

All this work means it’s easier than ever for developers and enterprise teams to add the approach to their stack. Deciding whether the extra layer of noise and code makes sense, though, requires balancing the advantages against the costs and limitations. To simplify the debate, here are the pros and cons of differential privacy use cases, interleaved. Was a randomized response algorithm involved? You decide.

Sharing and collaboration

Pro: Sharing is essential. More and more projects depend on collaboration. More and more computing is done in the cloud. Finding good algorithms for protecting our privacy makes it possible for more people and partners to work with data without leaking personal information. Adding a layer of noise also adds a bit more safety.

Con: Is sharing bad data a good solution? Sure, it’s nice to share data, but is sharing the wrong information any help? Differential privacy algorithms work because they add noise, which is a nice way of saying “error.” For some calculations, like computing the mean, the errors can cancel each other out and still yield accurate results. More complex algorithms aren’t so lucky. And when the data sets are small, the effects of the fuzzing can be much more dramatic, leading to the potential for big distortions.
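
A rough sketch of why small data sets suffer more, assuming a simple Laplace-noised mean over values clipped to a known range (the function name, epsilon value and sample sizes are illustrative, not taken from any particular toolkit): the same absolute noise that vanishes across 20,000 records can noticeably distort the mean of 20.

    import numpy as np

    rng = np.random.default_rng(0)

    def noisy_mean(values, epsilon, lo=0.0, hi=1.0):
        # A sum over values clipped to [lo, hi] changes by at most (hi - lo)
        # when one person is added or removed, so Laplace noise with scale
        # (hi - lo) / epsilon hides any individual's contribution.
        clipped = np.clip(values, lo, hi)
        noisy_sum = clipped.sum() + rng.laplace(scale=(hi - lo) / epsilon)
        return noisy_sum / len(values)

    epsilon = 0.5
    small = rng.uniform(0, 1, size=20)        # tiny data set
    large = rng.uniform(0, 1, size=20_000)    # much larger data set
    print("small n:", round(small.mean(), 3), "->", round(noisy_mean(small, epsilon), 3))
    print("large n:", round(large.mean(), 3), "->", round(noisy_mean(large, epsilon), 3))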

Controlling trade-offs between privacy and accuracy

Pro: Good algorithms control the trade-offs. Differential privacy algorithms don’t just add noise. They illustrate and codify the trade-offs between accuracy and privacy, giving us a knob to adjust the fuzzing so it meets our needs. The algorithms let us set a privacy budget and then spend it as necessary through the various stages of data processing. For those who remember calculus, the name is apt: the math tracks how much the result can shift when a single person’s data changes, much like a derivative measures the slope of a function.

Many differential privacy algorithms denote this privacy parameter with the Greek letter epsilon and apply it in an inverse way: large values of epsilon lead to almost no change in the data, while small values of epsilon add large amounts of noise. The inverse relationship can make the number counterintuitive.
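
To see the inverse relationship in action, here is an illustrative snippet that adds Laplace noise to a single count (sensitivity 1, since one person can change a count by at most 1) at several epsilon values; the specific numbers are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    true_count = 1_000   # e.g., the number of "yes" answers in a survey
    sensitivity = 1      # one person can change a count by at most 1

    for epsilon in (0.01, 0.1, 1.0, 10.0):
        noise = rng.laplace(scale=sensitivity / epsilon, size=5)
        noisy = np.round(true_count + noise).astype(int)
        print(f"epsilon={epsilon:>5}: sample noisy counts {noisy}")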

Con: Epsilon is still just a number. All the mathematical gloss and complex equations, though, just cover up the fact that someone must choose a number. Is 2 better than 1? Which number is appropriate? How much is enough? How about 1.4232? There is no easy guide and the best practices haven’t evolved yet. Even when they do, can you be sure that the best number for, say, the hamburger stand down the street is the right value of epsilon for your garden tool business?

Setting the value can be complex, especially when the data sets are less predictable. Algorithms try to suss out the sensitivity of the data, defined by how much one person’s record can shift the result of a query. The ideal noise will blur the distinctions among people, making it impossible for an attacker to identify any one of them. Sometimes the data cooperates, and sometimes it can be hard to find a single good value of epsilon.

“There’s no theory for how to set it. Policy makers don’t have anywhere to start,” said one scientist mired in the process. “It’s put them in the lap of policy makers and that’s appropriate, but the policy makers have no theoretical help to choose epsilon properly.” It’s best to say that the search for this number is an area of very active research.

Enabling machine learning

Pro: Machine learning needs data. If we want to explore the potential of machine learning and artificial intelligence, we need to feed these monsters with plenty of data. They have a voracious appetite for bits and the more you feed them, the better they do. Differential privacy may be the only choice if we want to ship big collections of data across the web to some special processor optimized for machine learning algorithms.
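
In practice, libraries such as TensorFlow Privacy usually add the noise during training rather than to the raw data, following the DP-SGD recipe: clip each example’s gradient, then add noise before the update. Below is a bare-bones numpy sketch of one such step for a linear model; the function name, learning rate, clipping bound and noise multiplier are made-up illustrations rather than any library’s API.

    import numpy as np

    rng = np.random.default_rng(2)

    def dp_sgd_step(weights, X, y, lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
        # In the spirit of DP-SGD: clip each example's gradient so no single
        # person can dominate the update, then add Gaussian noise scaled to
        # that clipping bound before averaging.
        grads = []
        for xi, yi in zip(X, y):
            residual = xi @ weights - yi               # squared-error loss, linear model
            g = residual * xi
            g = g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
            grads.append(g)
        noisy_sum = np.sum(grads, axis=0) + rng.normal(
            scale=noise_multiplier * clip_norm, size=weights.shape)
        return weights - lr * noisy_sum / len(X)

    # Toy usage: fit y ~ 2x on synthetic data and watch the weight approach 2.
    X = rng.normal(size=(256, 1))
    y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=256)
    w = np.zeros(1)
    for _ in range(200):
        w = dp_sgd_step(w, X, y)
    print("learned weight:", w)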

Con: Noise can have unknown effects. Machine learning algorithms can seem like magic, and just like real magicians, they often refuse to reveal the secret of their tricks or explain why a model filled with magic numbers makes the decisions it does. The mystery is compounded when the algorithms are fed fuzzed data, because it’s often impossible to know just how the changes in the data affected the outcome. Some simple algorithms like finding the mean are easy to control and understand, but not the ones inside magical black boxes.

Some researchers are already reporting that differential privacy results can compound the errors. Sometimes it might not matter. Perhaps the signal will be strong enough that a bit of noise won’t get in the way. Sometimes we can compensate, but it can make the job that much more challenging. Doing this efficiently and accurately is also an area of active exploration.

Reduced liability due to deniability

Pro: Differential privacy offers deniability. People can relax when sharing their data because the approach gives them deniability. The algorithms, like the randomized response, give them a cover story. Perhaps that information was just a random lie concocted by the algorithm.

Con: Deniability may not be enough. Just because some of the data might be random or wrong doesn’t make it easier to answer some questions truthfully, and differential privacy algorithms require some answers to be accurate. It’s not clear how people feel about truthful information leaking out, even if it’s not immediately obvious whose it is. Emotional responses may not be logical, but humans are not always logical, and their feelings about privacy are not easy to translate into algorithms.

New ways to protect data

Pro: Differential privacy is a philosophical approach. It’s not a particular algorithm. There are dozens of algorithms and researchers are tweaking new ones each day. Some meet the precise mathematical definition and some come close and offer a form that some researchers call “almost differential privacy.” Each algorithm can offer slightly different guarantees, so there are many opportunities to explore for protecting your data.

Con: No guarantees. The differential privacy vision doesn’t offer firm guarantees, just statistical ones: the difference any individual’s data can make to what is released is bounded by a threshold governed by epsilon. So, some real information will leak out and the noisy version can often be close to the truth, but at least we have mathematical bounds on just how much information is leaking.
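
That bound can be written in one line. For any two data sets D and D′ that differ in a single person’s record, and any set of possible outputs S, a mechanism M satisfies epsilon-differential privacy when:

    Pr[M(D) in S] <= exp(epsilon) * Pr[M(D') in S]

Small epsilon forces the two probabilities to be nearly identical, which is exactly the deniability described above; large epsilon loosens the bound and lets more information through.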

Pro: Differential privacy algorithms are built to be chained. The theoretical foundations of differential privacy include a good explanation of how multiple differential privacy algorithms can be layered on top of each other. If one consumes a privacy budget of alpha and another consumes beta, running both consumes at most alpha plus beta. In the best cases, the algorithms can be joined like Lego bricks.
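
A toy sketch of what that accounting can look like in code (the class and method names are invented for illustration; real libraries in the OpenDP orbit do something far more sophisticated):

    class PrivacyBudget:
        # A minimal ledger for sequential composition: each query's epsilon
        # is added to the running total, which must stay under the cap.
        def __init__(self, total_epsilon):
            self.total = total_epsilon
            self.spent = 0.0

        def charge(self, epsilon):
            if self.spent + epsilon > self.total:
                raise RuntimeError("privacy budget exhausted")
            self.spent += epsilon

    budget = PrivacyBudget(total_epsilon=1.0)
    budget.charge(0.4)        # first noisy release
    budget.charge(0.4)        # second noisy release
    try:
        budget.charge(0.4)    # 1.2 > 1.0, so this one is refused
    except RuntimeError as err:
        print(err)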

The OpenDP project, for instance, wants to deliver a broad collection of algorithms that can work together while offering some understanding about how much privacy is preserved when they’re chained together. It aims to offer “end-to-end differential privacy systems” along with strong theoretical understanding of their limits.

Con: Some slow leaks are dangerous. Not all differential privacy algorithms fit the wide-open model of the internet. Some differential privacy queries, for instance, will leak a small, very manageable amount of information. If an attacker is able to repeat similar queries, however, the total loss could be catastrophic because the leaks compound. This doesn’t mean the approach is bad, only that the architects must pay close attention to the model for releasing data so the leaks can’t compound. The theory offers a good starting point for understanding just how privacy degrades with each step.
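
The classic illustration is an averaging attack, sketched below with made-up numbers: each individual noisy answer looks safe, but repeating the same query lets the noise cancel out while the total epsilon spent keeps climbing.

    import numpy as np

    rng = np.random.default_rng(3)
    secret_count = 42
    epsilon_per_query = 0.1

    for repeats in (1, 10, 1_000):
        answers = secret_count + rng.laplace(scale=1 / epsilon_per_query, size=repeats)
        print(f"{repeats:>5} queries: average answer {answers.mean():8.2f}, "
              f"total epsilon spent {repeats * epsilon_per_query:g}")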

The deepest shifts are philosophical

In the past, protecting privacy required thinking like a doctor with a mandate to take any extreme measure and defend against the release of data at any cost. The differential privacy philosophy requires thinking like a general defending a city. There are manageable and acceptable losses of information. The goal is to limit loss as much as possible while still enabling the use of the data.

The biggest challenge for enterprise developers will be working with a rapidly evolving mathematical understanding. The idea of adding noise is a clever one with great potential, but the details are still being actively explored. Some of the algorithms are well understood, but many remain the focus of active research into their limitations.

The greatest challenges may be political because the algorithm designers will often throw up their hands and say that the amount of slow leakage, the value of epsilon, must be decided by the leadership. Differential privacy offers plenty of opportunities for being more open with data, but only when the people receiving the data are able to tolerate the extra noise.