Preserving the privacy of large data sets: Lessons learned from the Australian census

Preserving the privacy of large data sets is hard, as the Australian Bureau of Statistics found out. These are the big takeaways for the upcoming U.S. census and others dealing with large amounts of personal data.


Who needs hackers when the government puts sensitive information about every person in the country online and invites the internet to look at it? That's what happened last year in Australia, and it serves as a warning of what not to do during the upcoming U.S. Census 2020.

The Australian Bureau of Statistics published data from its last census online, but anonymized the data so poorly that it is vulnerable to a database reconstruction attack, researchers at Macquarie University in Sydney tell CSO. The database contains highly sensitive information about Australian residents, including address, age, ethnicity, salary, marital status, religious affiliation, number of children and so forth.

The ABS does not publish the census data in bulk as one database, but instead allows researchers, journalists and businesses to query it. A hostile nation-state adversary would have the time, patience and resources to query the entire database over the course of several months and reconstruct it, the researchers warn. Doing so would nullify the ABS's anonymization algorithm, which adds noise to the database in an attempt to protect citizens from identification.

Nor is simply unplugging the database from the internet a good option. The value of this data to journalists, academic researchers and enterprises in the energy, health and agricultural sectors makes it untenable to flick a switch and stop sharing.

Finding the right balance in the privacy-utility tradeoff is a hard, unsolved problem, Dali Kaafar, professor of computing at Macquarie University and scientific director of the Macquarie Cyber Security Hub, tells CSO. "In a perfect world we would offer 100 percent utility and 100 percent privacy, but that's not possible."

The attack that the researchers developed proves mathematically that the algorithm the ABS is using is broken. Worse, the attack does more than de-anonymize a small percentage of individuals. It makes possible a database reconstruction attack, the holy grail for an attacker, which completely removes all privacy protections for the entire database, in this case covering every person residing in Australia.

"No one’s privacy has ever been compromised through the use of the ABS TableBuilder tool," the ABS said in a statement posted on their website, but didn't clarify how they can be sure, given the mathematically provable attack Kaafar and Asghar demonstrated.

"It's very difficult to deny the existence of this particular vulnerability," Kaafar says. "Just 200 well-crafted queries are enough to find any sort of attribute you are interested in with a high degree of accuracy. The attack is mathematically proven. There is a vulnerability in their algorithm."

How a reconstruction attack works

An attacker could easily automate 200 queries for each of the roughly 20 million residents in Australia and distribute those queries over time and geography by renting cloud servers around the world. An ABS spokesperson referred CSO to the public statement, writing "The ABS has no further comment at this time."
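To give a sense of scale, the arithmetic behind that scenario can be sketched in a few lines of Python. The 200 queries per person and the 20 million residents come from the figures above. The 90-day campaign length and the 1,000 rented servers are illustrative assumptions, not numbers from the researchers.

```python
# Back-of-the-envelope scale of a full reconstruction campaign.
QUERIES_PER_PERSON = 200   # figure quoted by Kaafar
RESIDENTS = 20_000_000     # "roughly 20 million" residents
CAMPAIGN_DAYS = 90         # assumption: "several months" taken as about three
WORKERS = 1_000            # assumption: hypothetical number of rented cloud servers

total_queries = QUERIES_PER_PERSON * RESIDENTS
campaign_seconds = CAMPAIGN_DAYS * 24 * 60 * 60

overall_rate = total_queries / campaign_seconds   # queries per second, all servers
per_worker_rate = overall_rate / WORKERS          # queries per second, one server

print(f"total queries: {total_queries:,}")
print(f"overall rate: {overall_rate:.0f} queries/second")
print(f"per server: {per_worker_rate:.2f} queries/second")
```

Under these assumptions the total comes to 4 billion queries, or about one query every two seconds per server, a rate that would be hard to tell apart from ordinary use.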

"Our biggest concern," Kaafar says, "this is not just about the re-identification risk.... this attack is about reconstructing the entire census database."

The attack works as follows. The ABS perturbs data before returning an answer to a user query. "It's a random perturbation within a fixed range," Hassan Jameel Asghar, assistant professor at Macquarie University, explains. "The reason to do that is to ensure if the count is too low, if the original count is only two or three, if only a couple people have those characteristics, that could be quite privacy intrusive."

The problem arises because a user can query the census data hundreds or thousands of times, and by doing so remove the data perturbation and unmask Australian residents one by one.
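The general idea of averaging away bounded noise can be illustrated with a toy Python sketch. This is not the ABS algorithm or the researchers' actual attack, neither of which is detailed here. It assumes, purely for illustration, that every answer carries fresh, independent integer noise in the range -2 to +2, and the function names are hypothetical.

```python
import random

NOISE_RANGE = 2  # assumption: perturbation is an integer in [-2, 2]


def noisy_count(true_count, rng):
    """Return the true count plus fresh bounded random noise (toy model)."""
    return true_count + rng.randint(-NOISE_RANGE, NOISE_RANGE)


def averaging_attack(true_count, num_queries, rng):
    """Estimate the hidden count by averaging many noisy answers.

    The noise is symmetric around zero, so it cancels out in the mean
    and rounding the mean recovers the original count with high probability.
    """
    answers = [noisy_count(true_count, rng) for _ in range(num_queries)]
    return round(sum(answers) / num_queries)


rng = random.Random(2019)
secret = 3  # a small count of the kind perturbation is meant to hide
print("one noisy answer:", noisy_count(secret, rng))
print("estimate after 200 answers:", averaging_attack(secret, 200, rng))
```

A single answer can be off by as much as two in either direction, which is enough to mask a count of two or three. The average of a few hundred answers is not, which is why bounded noise alone offers little protection against a patient attacker.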

Kaafar and Asghar alerted the ABS to the flaw in mid-2018, but the agency has failed to deploy new protections that adequately defend against their attack, the researchers say. Instead, it appears to be relying on wagging a finger at users with its terms and conditions, including stern warnings that "As a TableBuilder guest user I will: not attempt to identify particular persons, households, or organisations to which the data relates [and] immediately report to the ABS any possible identification or disclosure that may have occurred."


The ABS terms and conditions ask people not to misuse the census data

It's not clear how the terms and conditions will be enforced against a hostile nation-state engaged in a database reconstruction attack, however.

U.S. Census Bureau looks to prevent a reconstruction attack

The failure of the Australian Bureau of Statistics to properly protect the privacy of Australian residents will be on the minds of the folks running U.S. Census 2020, who Kaafar tells CSO are working on deploying stronger protections in the form of differential privacy.

"[The U.S. Census Bureau] are planning to adopt differential privacy and the main reason is that they have internally carried out... a database reconstruction attack and found it's possible to reconstruct the census data," says Kaafar.
