Approaches to storing, managing, analyzing and mining Big Data are new, introducing security and privacy challenges within these processes. Big Data transmits and processes an individual's PII as part of a mass of data–millions to trillions of entries–flowing swiftly through new junctions, each with its own vulnerabilities.
Deidentification masks PII, separating information that identifies someone from the rest of his or her data. The hope is that this process protects people's privacy, keeping information that would kindle biases and other misuse under wraps.
Reidentification science, which pieces PII back together and reattaches it to the individual, thwarts the deidentification approaches meant to protect Big Data. That makes it unrealistic to believe that deidentification can really maintain the security and privacy of personal information in Big Data scenarios.
Vulnerabilities, Exposure and Deidentification
Enterprises manage Big Data using large, complex systems that must execute hand-offs from system to system. "Typically an ETL procedure (extract, transform, load) loads Big Data from a traditional RDBMS data warehouse onto a Hadoop cluster. Since most of that data is unstructured, the system runs a job in order to structure the data. Then the system hands it off to a relational database to serve it up, to a BI analyst, or to another data warehouse running Hadoop for storage, reference, and retrieval," explains Brian Christian, CTO, Zettaset. Any Big Data hand-offs or moves cross vulnerable junctions.
Creators of Big Data solutions never intended many of them to do what they do today. Take MapReduce, for example. "Google invented MapReduce to store public links so people can search them," says Christian. There were no worries about security because these were public links. Now enterprises use MapReduce and NoSQL systems on medical and financial records, which should remain private. Because security is not inherent, enterprises and vendors have to retrofit these systems with security. "That's a big problem," says Christian. "Vendors did not design firewalls and IDS for distributed computing architectures." These architectures tend to scale up to extremes beyond what traditional firewalls and IDS can natively address.
According to the Stanford Law Review article, vulnerabilities that expose PII subject people to scrutiny, raising concerns about profiling, discrimination and exclusion based on an individual's demographics. These abuses can lead to loss of control for the individual. While brands use PII to market to customers to their benefit, those same brands, as well as law enforcement, government agencies and other third parties, could also interpret and apply that personal data to the individual's detriment.
To prevent that, organizations charged with protecting private data have traditionally used deidentification methods including anonymization, pseudonymization, encryption, key-coding and data sharding to distance PII from real identities, according to the Stanford Law Review article. While anonymization protects privacy by removing names, addresses and social security numbers, pseudonymization replaces this information with nicknames, pseudonyms and artificial identifiers. Key-coding encodes the PII and establishes a key for decoding it. Data sharding breaks off part of the data in a horizontal partition, providing enough data to work with but not enough to reidentify an individual.
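As a rough illustration of how three of these methods differ, here is a minimal Python sketch. It is not taken from the Stanford Law Review article or from any vendor's product. The field names, the choice of which fields count as PII, the sample record, and the use of HMAC-SHA256 to derive a pseudonym are all assumptions made for the example.

```python
import hashlib
import hmac
import secrets

# Hypothetical field names; which fields count as PII is an assumption.
PII_FIELDS = ("name", "address", "ssn")


def anonymize(record):
    """Anonymization: drop the PII fields outright."""
    return {k: v for k, v in record.items() if k not in PII_FIELDS}


def pseudonymize(record, secret):
    """Pseudonymization: replace PII with an artificial identifier.

    A keyed hash (HMAC-SHA256) gives the same person the same
    pseudonym on every record without storing the name itself.
    """
    identity = "|".join(str(record[f]) for f in PII_FIELDS)
    digest = hmac.new(secret, identity.encode("utf-8"), hashlib.sha256)
    out = anonymize(record)
    out["pseudonym"] = digest.hexdigest()[:16]
    return out


def key_code(record, key_table):
    """Key-coding: swap PII for a random code and keep the
    code-to-PII mapping in a separate table (the 'key')."""
    code = secrets.token_hex(8)
    key_table[code] = {f: record[f] for f in PII_FIELDS}
    out = anonymize(record)
    out["code"] = code
    return out


# Made-up sample record, for illustration only.
record = {
    "name": "Jane Doe",
    "address": "1 Main St",
    "ssn": "000-00-0000",
    "diagnosis": "asthma",
    "zip": "02139",
}
```

Calling `anonymize(record)` on the sample leaves only the `diagnosis` and `zip` fields. In the key-coded version, anyone holding `key_table` can look up the code and recover the original name, which is why the table would have to be stored apart from the data.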
However, computer scientists have shown they can use data that is not PII to reconstitute the associated person's identity. "There are many ways to piece data back together once you have even one type of data to work with," says Keith Carter, Adjunct Professor, The Business School of the National University of Singapore. If a brand or government acquired a list of GPS records covering one year, it could use that to learn a lot about a person or persons including their identities.
"You would easily be able to find out who they are by identifying the address they regularly come from at seven or eight in the morning. You would be able to see the school or office where they then show up. You would be able to learn where they went back to in the evening," says Carter, a speaker at the "Big Data World Asia 2013" conference.
From that, someone could get their name and address with a high degree of accuracy using a public address lookup tool. Having the family name, they could determine which family member it is by where they end up once they leave home in the morning, whether at a primary or secondary school or at a certain place of work.
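The inference Carter describes can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a method attributed to Carter: the function names, the hour windows, the rounding precision and the sample coordinates are all invented for the example, and real location data would be far noisier.

```python
from collections import Counter
from datetime import datetime


def most_common_place(pings, hours, precision=3):
    """Return the most frequent rounded (lat, lon) among pings whose
    hour of day falls in `hours`, or None if no ping qualifies."""
    counts = Counter(
        (round(lat, precision), round(lon, precision))
        for ts, lat, lon in pings
        if ts.hour in hours
    )
    return counts.most_common(1)[0][0] if counts else None


def infer_anchor_points(pings):
    """Guess a likely home and daytime location from GPS pings
    that carry no name or other direct identifier."""
    # Assumed windows: at home before leaving, then school or office hours.
    home = most_common_place(pings, hours=range(5, 8))
    daytime = most_common_place(pings, hours=range(9, 17))
    return home, daytime


# Synthetic pings: (timestamp, latitude, longitude). Values are made up.
pings = [
    (datetime(2013, 1, 7, 6, 30), 1.30012, 103.80021),
    (datetime(2013, 1, 7, 10, 0), 1.29660, 103.77640),
    (datetime(2013, 1, 8, 6, 45), 1.30008, 103.80019),
    (datetime(2013, 1, 8, 14, 0), 1.29655, 103.77642),
    (datetime(2013, 1, 8, 20, 0), 1.30400, 103.83200),
]

home, daytime = infer_anchor_points(pings)
```

With a year of records rather than five, the two most frequent locations would stand out much more sharply, and each one could then be fed into an address lookup.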
But this assumes that governments and businesses had faith in anonymization in the first place, according to Carter, who has had roles with Accenture, Goldman Sachs and Estee Lauder. There is also an assumption here that businesses and governments have spent a lot of money on something that doesn't deliver business value, Carter notes. In fact, what governments and businesses have done is to give themselves safe harbor by using deidentification/anonymization. And even when companies don't use deidentification, the legal repercussions are a slap on the wrist, Carter confirms.
The truth is there may never be an adequate solution for Big Data privacy concerns, affordable or otherwise. There may only be solutions that protect enterprises and other entities from liability while pacifying people whose data is at risk. Unfortunately, for the individual, this means that abuses will indeed go on, regardless of the solution at hand.