Udacity Course

Differential Privacy

Types of DP

Local DP

Coin flip jaywalking example

Each person is now protected with plausible deniability

If we collect many samples and 60% answer yes, we can estimate the true rate as 70%: the observed 60% is the average of the true rate and the coin's 50% bias, so true = 2 × 60% − 50% = 70%.

NB: This privacy technique comes at the cost of accuracy, especially when we only have a few samples. The greater the privacy protection (plausible deniability), the less accurate the results.
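The coin-flip (randomized response) mechanism and the de-skewing step can be sketched as follows. This is a minimal simulation, not the course's code; the 70% true jaywalking rate and the function names are assumptions for illustration.

```python
import random

def randomized_response(truth: bool) -> bool:
    # First coin flip: heads -> answer honestly
    if random.random() < 0.5:
        return truth
    # Tails -> answer according to a second, unbiased coin flip
    return random.random() < 0.5

def estimate_true_rate(answers) -> float:
    # observed = 0.5 * true + 0.5 * 0.5, so true = 2 * observed - 0.5
    observed = sum(answers) / len(answers)
    return 2 * observed - 0.5

# Simulate a population where 70% truly jaywalk (assumed rate)
random.seed(0)
population = [random.random() < 0.7 for _ in range(10_000)]
answers = [randomized_response(p) for p in population]
estimate = estimate_true_rate(answers)  # close to 0.7 with many samples
```

Rerunning with a small population (say 100 people) shows the accuracy cost: the estimate can miss the true rate badly, matching the NB above.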


Types of Noise

How much noise to add?

Laplacian Noise
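In the Laplace mechanism, the amount of noise is set by the query's sensitivity divided by the privacy budget epsilon: more sensitivity or a smaller epsilon (stronger privacy) means more noise. A minimal sketch, with the counting query and epsilon value chosen for illustration:

```python
import numpy as np

def laplace_mechanism(true_answer: float, sensitivity: float, epsilon: float) -> float:
    # Noise scale b = sensitivity / epsilon
    b = sensitivity / epsilon
    return true_answer + np.random.laplace(loc=0.0, scale=b)

# A counting query ("how many people jaywalk?") has sensitivity 1:
# removing any one person changes the count by at most 1
np.random.seed(0)
db = np.random.rand(100) < 0.7  # assumed toy database
noisy_count = laplace_mechanism(float(db.sum()), sensitivity=1.0, epsilon=0.5)
```

The noisy count stays close to the true count in expectation, while any single person's presence or absence is masked by the noise.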


Perfect Privacy (AI model)

Training a model on a dataset should return the same model even if we remove any single person from the training set.

Training a model is kind of like querying a database
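If training is like querying a database, the database-side idea of sensitivity carries over: how much can the output change when any one person is removed? A small sketch of empirically measuring sensitivity (the brute-force remove-one-row approach; function names are my own):

```python
import numpy as np

def sensitivity(query, db: np.ndarray) -> float:
    # Maximum change in the query's output when any single person is removed
    full = query(db)
    return max(abs(full - query(np.delete(db, i))) for i in range(len(db)))

db = np.array([0, 1, 1, 0, 1])
sum_sens = sensitivity(np.sum, db)    # 1.0: a count changes by at most 1
mean_sens = sensitivity(np.mean, db)  # 0.15: removing a 0 shifts the mean most
```

A query with low sensitivity needs less noise for the same privacy guarantee, which is exactly why the amount of Laplacian noise is tied to sensitivity.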

Two points of complexity


Hospital Scenario

Steps

  1. Ask each hospital to train a model on their own dataset (10 models generated)
  2. Use each model to predict on your own local dataset, generating 10 labels for each datapoint
  3. Perform a DP query to generate the final (DP) label for each datapoint: count the votes for each label across the 10 predictions, add Laplacian noise to the counts, then take the max (most frequent) noisy label
  4. Retrain a new model on your local dataset which now has DP labels
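Step 3 (the noisy-max vote over the hospitals' predictions) can be sketched as below. This assumes the teacher predictions for one datapoint are collected into an array; the label set, epsilon, and function name are illustrative assumptions.

```python
import numpy as np

def noisy_max_label(teacher_preds: np.ndarray, num_labels: int, epsilon: float) -> int:
    # teacher_preds: one predicted label per teacher (hospital) model
    counts = np.bincount(teacher_preds, minlength=num_labels).astype(float)
    # Add Laplacian noise to each vote count before taking the argmax,
    # so the winning label is differentially private
    counts += np.random.laplace(loc=0.0, scale=1.0 / epsilon, size=num_labels)
    return int(np.argmax(counts))

# 10 teacher models vote on one datapoint (votes are made up for illustration)
np.random.seed(0)
votes = np.array([2, 2, 2, 2, 2, 2, 1, 1, 0, 2])
dp_label = noisy_max_label(votes, num_labels=3, epsilon=1.0)
```

The student model in step 4 is then trained only on these noisy labels, so it never touches the hospitals' raw data.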