With Facebook in the news, a lot of people have started asking how much online privacy they have and how it affects them and their families. Here's an excellent non-technical summary of Facebook's issues and challenges by Zeynep Tufekci, and a bit of work by the Pew Research Center on Americans' Attitudes about Privacy, Security and Surveillance.
Much has been written pointing out that Facebook users aren't Facebook's customers, but rather the product being marketed. That isn't exactly right: Facebook mines and sells user attention, employing psychological techniques to keep users engaged in any way possible. This piece of the business model goes well beyond Facebook. YouTube, Instagram, and most other forms of "free" social media play the same game.
People have been losing trust at a rate that could have a chilling impact on the growth of Internet-connected services. We hear about evil examples, but even well-meaning data collection can be problematic. A decade ago Netflix ran a competition to find a better way to recommend movies. They "anonymized" a dataset by stripping out user identification. Unfortunately it turned out to be possible to re-identify users, along with their political affiliation and even sexual identity, by adding a tiny amount of additional information.1
Strap in - let's dive into differential privacy.
Differential Privacy (DP from here on out) is a privacy definition that has been around for a bit more than a decade. In the past two years the term has become well known because Apple uses it to protect user privacy. A simple statement is something like:
Consider two nearly identical databases: one contains your information and the other doesn't, but they are otherwise identical. DP says the probability that a query produces any given result is about the same no matter which database is used.
DP tells you whether your data has a significant impact on the outcome of a query; if it doesn't, it is safe for you to contribute. Note that just tallying up exact results leaks a bit of information about you and does not satisfy DP to the letter of the law. Each time a query is made, a little information about you leaks out, and this can add up, giving clever (and even not so clever) folks a foot in the door.
One way DP gets around this is to inject a bit of noise that masks the contribution of any specific person.2 This muddies the results of the query a bit, but you can calculate how much noise to inject to still get an accurate enough result.
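To make the noise idea concrete, here's a minimal sketch of the classic Laplace mechanism applied to a counting query. The dataset, the `laplace_count` helper, and the epsilon values are made up for illustration; this is not anyone's production mechanism:

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Answer a counting query with epsilon-DP Laplace noise.

    A count changes by at most 1 when one person is added or
    removed (sensitivity = 1), so the noise scale is 1/epsilon.
    Smaller epsilon = more privacy = more noise.
    """
    true_count = sum(1 for row in data if predicate(row))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical dataset: ages of eight participants.
ages = [34, 29, 51, 42, 38, 45, 60, 27]

# "How many participants are 40 or older?" (true answer: 4)
noisy_answer = laplace_count(ages, lambda a: a >= 40, epsilon=0.5)
```

Any single answer is noisy, but the noise averages out: the expected value of the noisy answer is the true count, which is exactly why repeated queries are dangerous for privacy.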
There is a tradeoff between accuracy and privacy!
We're getting to the part that makes DP "interesting" (in a challenging way). The more information you extract from the database, the more noise has to be injected to keep the privacy leakage small. This loss of accuracy can be a big deal - particularly for machine learning training that may hit the database thousands or millions of times. Once the data has leaked, it's game over: user privacy has been compromised, and you'd have to destroy the database and start over. So you have to figure out a privacy budget - how many queries can you afford?
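A privacy budget can be tracked with nothing more than an accumulator. This toy sketch shows basic sequential composition, where the epsilon costs of successive queries simply add up until the budget is spent; the `PrivacyBudget` class and its numbers are hypothetical:

```python
class PrivacyBudget:
    """Track cumulative epsilon under basic sequential composition.

    Under basic composition, answering queries at privacy levels
    e1, e2, ... costs e1 + e2 + ... in total. Once the total budget
    is spent, no further queries may be answered.
    """

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        # Refuse the query rather than exceed the budget.
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.4)  # first query: fine
budget.charge(0.4)  # second query: fine, 0.2 remains
# budget.charge(0.4)  # third query would raise RuntimeError
```

Real systems use tighter composition theorems that waste less budget, but the bookkeeping discipline is the same: every query pays, and the meter never resets.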
Apple isn't in the business of selling raw or processed data to other parties - they're not mining your attention. They do collect a lot of information to run their services and make heavy use of DP. To make it safer, and to give them something of a firewall against certain types of spying from, say, a government, information is randomized on the device before being sent over the network. Additionally, applications are not allowed to harvest information from other applications and databases on the phone without explicit user approval and Apple's blessing.3 Even Apple can't de-anonymize internally.
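Apple's actual on-device mechanisms are more sophisticated, but the idea of randomizing before data leaves the device can be illustrated with classic randomized response, a local-DP technique. This sketch and its numbers are illustrative, not Apple's implementation:

```python
import random

def randomized_response(truth: bool) -> bool:
    """Report the truth with probability 3/4, the opposite with 1/4.

    With probability 1/2 answer honestly; otherwise answer with a
    fair coin. No single report reveals the true answer - this is
    epsilon-DP with epsilon = ln(3).
    """
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

def estimate_fraction(reports):
    # E[report is True] = 1/4 + p/2, so invert to recover p.
    mean = sum(reports) / len(reports)
    return 2 * (mean - 0.25)

random.seed(1)
truths = [i % 10 < 3 for i in range(10_000)]   # exactly 30% true
reports = [randomized_response(t) for t in truths]
estimate = estimate_fraction(reports)           # close to 0.3
```

Each individual's report is deniable, yet the aggregate statistic is still recoverable - exactly the property a server-side collector wants when it shouldn't be trusted with raw answers.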
For Apple's purposes DP makes a lot of sense, although they need to worry about the privacy budget constantly. The big attention-harvesting social media platforms are another story. It's fascinating to think about what form a privacy-preserving social media platform would take and whether there is any way to migrate current SM users to such platforms. I don't think the current Facebook or YouTube business models would be viable with even modest privacy protection. That leads to an interesting question for regulators - what is the acceptable tradeoff between business model and privacy?
Thinking this through for other services is essential - the Internet of Things, for example.
In about a month the EU's General Data Protection Regulation takes effect. It has some teeth, but also loopholes. It's a start, but people are talking about more serious protections, and one can imagine these will vary from region to region. I'm hoping that anything done here (or anywhere, for that matter) has serious input from independent privacy and security experts and very little from the companies directly involved. One can imagine intense lobbying from the richest companies on Earth.
Note: This is way too short. I give myself an hour and don't want to bore people. Expect it to be part of a major society and technology discussion going forward.
__________
1 A classic paper by Narayanan and Shmatikov
2 There are a lot of ways to do this. Usually the noise has a Gaussian or Laplacian distribution; for the purpose of this note, just think noise. The amazing thing is that how much noise is needed can be determined without explicit knowledge of the contents of the database.
3 This is in stark contrast with Android - read Zeynep's piece for a high-level description. Sadly, at this point I wouldn't recommend using an Android device if you value privacy. Then again, I wouldn't suggest using Facebook either. We all have lines we don't cross.