Privacy Preserving Data Mining Techniques

Literature Review on "Privacy Preserving Data Mining Techniques: Current Scenario and Future Prospects"

This is a quick summary from a literature review of the reference: Privacy Preserving Data Mining Techniques: Current Scenario and Future Prospects. Generally, the paper surveys and compares existing PPDM techniques, discusses their performance, and gives a brief introduction to soft computing techniques that adapt to broad and varied scenarios.

Introduction

Privacy preserving data mining (PPDM) deals with protecting the privacy of individual data or sensitive knowledge without sacrificing the utility of the data.

  1. No privacy preserving algorithm exists that outperforms all others on all possible criteria. (Choose tools and techniques according to their performance under the current scenario.)
  2. Data mining does not require intruding on privacy, although privacy intrusion may still occur. (Data mining can be performed without intruding on privacy.)
  3. The goal of PPDM is to lower the risk of misuse of data while producing results equivalent to those obtained in the absence of such privacy preserving techniques.

PPDM Framework

A framework for PPDM

PPDM Techniques

Classification

The majority of the existing approaches can be classified into two broad categories:

  1. (Precaution) methodologies that protect the sensitive data itself in the mining process
  2. (Anti-extraction) methodologies that protect the sensitive data mining results

Distributed Data Mining (DDM)

A simple approach to data mining over multiple sources that will not share data is to run existing data mining tools at each site independently and combine the results.

-> Issues that cause a disparity between local and global results include:

PPDM Techniques Dimensions

  1. Data distribution
  2. Data modification
  3. Data mining algorithms
  4. Data or rule hiding
  5. Privacy preservation

PPDM Technique Categories Based on the Dimensions

  1. Anonymization based PPDM
  2. Perturbation based PPDM
  3. Randomized Response based PPDM
  4. Condensation approach based PPDM
  5. Cryptography based PPDM

Anonymization based PPDM

The attributes in the data take four basic forms:

  1. Explicit Identifiers (identify a record owner explicitly)
  2. Quasi Identifiers (potentially identify a record owner when combined with other attributes)
  3. Sensitive Attributes (contains sensitive person specific information)
  4. Non-Sensitive Attributes (no problem if revealed)

k-anonymity

Definition: K-anonymity is the property that the information for each person contained in the release cannot be distinguished from the information of at least k-1 other individuals who also appear in the release.

Examples:

  1. A release keyed by Explicit Identifiers has only 1-anonymity (every record is unique).
  2. After anonymization, the Quasi Identifiers should provide at least 2-anonymity.

Procedure

Definition: Replacing a value with a less specific but semantically consistent value is called generalization, while suppression involves blocking (withholding) the values altogether.

Examples:

This example below is quoted from Wikipedia.

Patients Record (Original) (Height unit: cm, Weight unit: kg)

| Name | Age | Gender | Height | Weight | State of domicile | Religion | Disease |
|---|---|---|---|---|---|---|---|
| Ramsha | 30 | Female | 165 | 72 | Tamil Nadu | Hindu | Cancer |
| Yadu | 24 | Female | 162 | 70 | Kerala | Hindu | Viral infection |
| Salima | 28 | Female | 170 | 68 | Tamil Nadu | Muslim | Tuberculosis |
| Sunny | 27 | Male | 170 | 75 | Karnataka | Parsi | No illness |
| Joan | 24 | Female | 165 | 71 | Kerala | Christian | Heart-related |
| Bahuksana | 23 | Male | 160 | 69 | Karnataka | Buddhist | Tuberculosis |
| Rambha | 19 | Male | 167 | 85 | Kerala | Hindu | Cancer |
| Kishor | 29 | Male | 180 | 81 | Karnataka | Hindu | Heart-related |
| Johnson | 17 | Male | 175 | 79 | Kerala | Christian | Heart-related |
| John | 19 | Male | 169 | 82 | Kerala | Christian | Viral infection |

  • Explicit Identifiers: Name
  • Quasi Identifiers: Age, Gender, State of domicile, and Religion
  • Sensitive Attributes: Disease
  1. Suppression.
    Certain values of the attributes are replaced by an asterisk “*”. E.g., we have replaced all the values in the Name attribute and the Religion attribute with a “*”.
  2. Generalization.
    Individual values of attributes are replaced with a broader category. E.g., the value “19” of the attribute Age may be replaced by “≤ 20”, etc.

Patient Record (Anonymized) (Height unit: cm, Weight unit: kg)

| Name | Age | Gender | Height | Weight | State of domicile | Religion | Disease |
|---|---|---|---|---|---|---|---|
| * | (20, 30] | Female | 165 | 72 | Tamil Nadu | * | Cancer |
| * | (20, 30] | Female | 162 | 70 | Kerala | * | Viral infection |
| * | (20, 30] | Female | 170 | 68 | Tamil Nadu | * | Tuberculosis |
| * | (20, 30] | Male | 170 | 75 | Karnataka | * | No illness |
| * | (20, 30] | Female | 165 | 71 | Kerala | * | Heart-related |
| * | (20, 30] | Male | 160 | 69 | Karnataka | * | Tuberculosis |
| * | [0, 20] | Male | 167 | 85 | Kerala | * | Cancer |
| * | (20, 30] | Male | 180 | 81 | Karnataka | * | Heart-related |
| * | [0, 20] | Male | 175 | 79 | Kerala | * | Heart-related |
| * | [0, 20] | Male | 169 | 82 | Kerala | * | Viral infection |

This data has 2-anonymity with respect to the attributes Age, Gender and State of domicile, since for any combination of these attributes found in any row of the table there are always at least 2 rows with those exact attributes.
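
The same suppression and generalization steps can be sketched in code. Below is a minimal Python sketch (not from the paper; the `anonymize` helper and the age buckets are illustrative assumptions) that suppresses Name and Religion, generalizes Age, and then computes the k-anonymity level of the release with respect to the quasi-identifiers Age, Gender, and State of domicile.

```python
from collections import Counter

# Rows from the original table (Height, Weight, and Disease omitted for brevity):
# (Name, Age, Gender, State of domicile, Religion)
records = [
    ("Ramsha", 30, "Female", "Tamil Nadu", "Hindu"),
    ("Yadu", 24, "Female", "Kerala", "Hindu"),
    ("Salima", 28, "Female", "Tamil Nadu", "Muslim"),
    ("Sunny", 27, "Male", "Karnataka", "Parsi"),
    ("Joan", 24, "Female", "Kerala", "Christian"),
    ("Bahuksana", 23, "Male", "Karnataka", "Buddhist"),
    ("Rambha", 19, "Male", "Kerala", "Hindu"),
    ("Kishor", 29, "Male", "Karnataka", "Hindu"),
    ("Johnson", 17, "Male", "Kerala", "Christian"),
    ("John", 19, "Male", "Kerala", "Christian"),
]

def anonymize(record):
    name, age, gender, state, religion = record
    # Suppression: the explicit identifier (Name) and Religion become "*"
    # Generalization: the exact Age is replaced by a range
    age_range = "[0, 20]" if age <= 20 else "(20, 30]"
    return ("*", age_range, gender, state, "*")

anonymized = [anonymize(r) for r in records]

# The k-anonymity level is the size of the smallest group of rows sharing
# the same quasi-identifier values (Age, Gender, State of domicile).
groups = Counter((age, gender, state) for _, age, gender, state, _ in anonymized)
print("k-anonymity level:", min(groups.values()))  # prints 2 for this table
```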

Pros and Cons

Pros:

Cons:

Perturbation based PPDM

Procedure

Definition: In perturbation the original values are replaced with synthetic data values so that the statistical information computed from the perturbed data does not differ significantly from the statistical information computed from the original data.

Examples:

Individual incomes (k$) (Original)

[40, 50, 60, 70, 80, 90, 100]

We add random noise (synthetic data) to each income value, e.g., a random value between -5 and +5. The noise obscures individual details while keeping the statistical properties similar.

Individual incomes (k$) (Perturbed)

[43, 48, 62, 73, 78, 92, 98]

Aggregated Statistics:

Assume the original data $X$ has mean $E[X]$ and variance $Var[X]$, and the noise $Y$ has mean $E[Y]$ and variance $Var[Y]$:

  1. Total mean: \(E[X+Y]=E[X]+E[Y].\)
  2. Total variance: \(Var[X+Y]=Var[X]+Var[Y]+2 Cov(X,Y).\)

Since the noise is generated independently of the data, $Cov(X,Y)=0$, so the original statistical information can be recovered as long as the distribution of the random noise is known.
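
As a concrete (and purely illustrative) sketch of this perturbation scheme, the Python snippet below adds independent uniform noise in [-5, +5] to the incomes above and then estimates the original mean and variance back from the perturbed data, assuming the noise distribution is known.

```python
import random
import statistics

random.seed(0)

incomes = [40, 50, 60, 70, 80, 90, 100]  # original values (k$)

# Add independent uniform noise Y ~ U(-5, 5) to each income value.
perturbed = [x + random.uniform(-5, 5) for x in incomes]

# Individual values are obscured, but the aggregates stay close:
print("original mean:    ", statistics.mean(incomes))
print("perturbed mean:   ", round(statistics.mean(perturbed), 2))

# Since the noise is independent of the data (Cov(X, Y) = 0) and E[Y] = 0,
# E[X] ≈ E[X+Y] and Var[X] ≈ Var[X+Y] - Var[Y], with Var[Y] = (b - a)^2 / 12.
noise_var = (5 - (-5)) ** 2 / 12
print("original variance: ", round(statistics.pvariance(incomes), 2))
print("estimated variance:", round(statistics.pvariance(perturbed) - noise_var, 2))
```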

Pros and Cons

Pros:

Cons:

Randomized Response based PPDM

Procedure

Definition: In Randomized response, the data is scrambled in such a way that the central place cannot tell with probabilities better than a pre-defined threshold, whether the data from a customer contains truthful information or false information.

  1. In the first step, the data providers randomize their data and transmit the randomized data to the data receiver.
  2. In the second step, the data receiver reconstructs the original distribution of the data by employing a distribution reconstruction algorithm.

Examples:

The Bayes estimator is one of the algorithms used to reconstruct the distribution in randomized response.

  1. Scenario: survey about a sensitive topic, such as whether they have ever used illegal drugs.
  2. Randomized Response Technique: Each respondent is given a randomization device (like a coin or dice) that determines how they answer.

Suppose we ask a person whether they have had sex with a prostitute this month. (Assume the true probability of event $A$ is $q$.)

Before answering, they flip a coin (event $B$):

  • If the coin comes up tails (with probability $p$), they answer “yes” regardless of the truth.
  • If the coin comes up heads (with probability $1-p$), they answer truthfully based on their actual experience.

Only they know whether their answer reflects the coin toss or their true behavior.

  • We assume that people who get heads will answer truthfully.
  • We assume that the events $A$ and $B$ are independent, so we have joint probability:
\[P(A \cap B) = P(A) \cdot P(B)\]

Here is the table of joint probability of two events:

| $P(\cdot)$ | $A$ | $\bar{A}$ |
|---|---|---|
| $B$ | $p \cdot q$ | $p \cdot (1-q)$ |
| $\bar{B}$ | $(1-p) \cdot q$ | $(1-p) \cdot (1-q)$ |

Since $A \cap B$, $\bar{A} \cap B$, and $A \cap \bar{B}$ all yield a “yes” answer (with an observed ratio $k$) and only $\bar{A} \cap \bar{B}$ yields a “no” answer (with an observed ratio $1-k$), we can calculate the target probability $q$ from the known probability $p$ and the observed ratio $k$:

\[\begin{aligned} P(\bar{A} \cap \bar{B}) &= (1-p) \cdot (1-q) = 1-k,\\ &\implies q = 1- \frac{1-k}{1-p}. \end{aligned}\]

For example, if observed $k=55\%$ and we know $p=50\%$, we could get $q=10\%$.
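
To make the reconstruction step concrete, here is a small Python simulation of the coin-flip protocol (an illustrative sketch; the values of $p$, the true $q$, and the sample size are arbitrary assumptions) that recovers $q$ from the observed ratio of “yes” answers using $q = 1 - \frac{1-k}{1-p}$.

```python
import random

random.seed(1)

p = 0.5        # probability that the coin forces a "yes" (tails)
q_true = 0.10  # true (hidden) proportion of the sensitive behaviour
n = 100_000    # number of respondents

yes = 0
for _ in range(n):
    tails = random.random() < p       # event B: the coin comes up tails
    truth = random.random() < q_true  # event A: the respondent's true status
    if tails or truth:                # "no" only when heads AND not A
        yes += 1

k = yes / n                    # observed ratio of "yes" answers
q_est = 1 - (1 - k) / (1 - p)  # reconstruct q from k and p
print(f"observed k = {k:.3f}, estimated q = {q_est:.3f}")
```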

Pros and Cons

Pros:

Cons:

Condensation approach based PPDM

Procedure

Definition: Condensation approach constructs constrained clusters in dataset and then generates pseudo data from the statistics of these clusters.

Examples:

Student Record (original)

| Name | Age | Gender | Major | GPA |
|---|---|---|---|---|
| Alice | 20 | Female | Biology | 3.8 |
| Bob | 21 | Male | Chemistry | 3.5 |
| Carol | 19 | Female | Physics | 3.9 |
| David | 22 | Male | Math | 4.0 |
| Eve | 20 | Female | Biology | 3.7 |
| Frank | 21 | Male | Chemistry | 3.6 |
| Grace | 19 | Female | Physics | 3.8 |
| Harry | 22 | Male | Math | 3.9 |

Attributes:

  • Explicit identifier: Name
  • Quasi identifiers: Age, Gender, and Major
  • Sensitive attribute: GPA
  1. Clustering: We group the records into clusters based on their quasi-identifiers. Each cluster has a size of at least k, where k is the desired anonymity level. For example, if k = 2, each cluster should have at least two records. Here’s one possible clustering result:
    • Cluster 1: Alice, Eve
    • Cluster 2: Bob, Frank
    • Cluster 3: Carol, Grace
    • Cluster 4: David, Harry
  2. Statistics: We compute the statistics of each cluster, such as the mean, median, mode, standard deviation, etc. These statistics capture the essential information of the original data. Here’s an example of the cluster statistics:
    • Cluster 1: Mean Age = 20, Mode Gender = Female, Mode Major = Biology, Mean GPA = 3.75
    • Cluster 2: Mean Age = 21, Mode Gender = Male, Mode Major = Chemistry, Mean GPA = 3.55
    • Cluster 3: Mean Age = 19, Mode Gender = Female, Mode Major = Physics, Mean GPA = 3.85
    • Cluster 4: Mean Age = 22, Mode Gender = Male, Mode Major = Math, Mean GPA = 3.95
  3. Synthesis: We generate synthetic data from the cluster statistics using any data generation method, such as random sampling, regression, or generative models. The synthetic data should have the same format and distribution as the original data (see the sketch after the table below). Here’s an example of the synthetic data:

Student Record (condensed)

| Name | Age | Gender | Major | GPA |
|---|---|---|---|---|
| Anna | 20 | Female | Biology | 3.8 |
| Beth | 20 | Female | Biology | 3.7 |
| Chris | 21 | Male | Chemistry | 3.6 |
| Dan | 21 | Male | Chemistry | 3.5 |
| Emma | 19 | Female | Physics | 3.9 |
| Fred | 19 | Female | Physics | 3.8 |
| George | 22 | Male | Math | 4.0 |
| Henry | 22 | Male | Math | 3.9 |
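
The three condensation steps can also be sketched in code. The Python snippet below is an illustrative assumption rather than the paper's algorithm: clusters are formed simply by grouping on the quasi-identifiers, and pseudo GPAs are drawn from a Gaussian fitted to each cluster's statistics.

```python
import random
import statistics
from collections import defaultdict

random.seed(2)

# Original records: (Name, Age, Gender, Major, GPA)
students = [
    ("Alice", 20, "Female", "Biology", 3.8), ("Eve", 20, "Female", "Biology", 3.7),
    ("Bob", 21, "Male", "Chemistry", 3.5), ("Frank", 21, "Male", "Chemistry", 3.6),
    ("Carol", 19, "Female", "Physics", 3.9), ("Grace", 19, "Female", "Physics", 3.8),
    ("David", 22, "Male", "Math", 4.0), ("Harry", 22, "Male", "Math", 3.9),
]

# 1. Clustering: group records by the quasi-identifiers (Age, Gender, Major).
clusters = defaultdict(list)
for name, age, gender, major, gpa in students:
    clusters[(age, gender, major)].append(gpa)

# 2. Statistics + 3. Synthesis: for each cluster, fit the mean/std of the
# sensitive attribute and generate the same number of pseudo records.
synthetic = []
for (age, gender, major), gpas in clusters.items():
    mu, sigma = statistics.mean(gpas), statistics.pstdev(gpas)
    for _ in range(len(gpas)):
        synthetic.append((age, gender, major, round(random.gauss(mu, sigma), 2)))

for row in synthetic:
    print(row)
```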

Pros and Cons

Pros:

Cons:

Cryptography based PPDM

Procedure

Definition: Cryptographic techniques are ideally suited to scenarios where multiple parties collaborate to compute results or share non-sensitive mining results, thereby avoiding the disclosure of sensitive information.

Examples:

Almost all of these methods are based on a cryptographic protocol known as Secure Multiparty Computation (SMC).

Yao, Andrew C., “How to generate and exchange secrets”, Proceedings of the 27th IEEE Symposium on Foundations of Computer Science, 162-167, 1986, discussed a problem in which two millionaires want to know who is richer without either revealing their net worth. From this problem SMC was coined and developed. SMC defines two basic adversarial models, namely (i) the Semi-Honest model and (ii) the Malicious model.

See more detailed protocols in Yao’s Millionaires’ problem.
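
SMC covers a family of protocols (Yao's garbled circuits, secret sharing, homomorphic encryption, and others). As a minimal illustration of the idea only, and not Yao's protocol itself, the Python sketch below uses additive secret sharing so that three parties learn the sum of their private inputs without revealing any individual value.

```python
import random

random.seed(3)
MOD = 2 ** 61 - 1  # all arithmetic is done modulo a large prime

private_values = [42, 17, 99]  # each party's secret input
n_parties = len(private_values)

# Each party splits its secret into n random shares that sum to the secret
# (mod MOD) and hands one share to every party.
shares_received = [[] for _ in range(n_parties)]
for value in private_values:
    shares = [random.randrange(MOD) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MOD)
    for party, share in enumerate(shares):
        shares_received[party].append(share)

# Each party publishes only the sum of the shares it holds ...
partial_sums = [sum(shares) % MOD for shares in shares_received]

# ... and the total is recovered without exposing any single input.
print("secure sum:", sum(partial_sums) % MOD)  # 42 + 17 + 99 = 158
```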

Pros and Cons

Pros:

Cons:

Introducing Soft Computing Techniques

Soft computing techniques may be used to handle the different challenges posed by data mining. -> high MIQ (Machine Intelligence Quotient)

See related concepts: Fuzzy set, Rough set.

Evaluation Metrics

  1. Performance: Performance of a mining algorithm is measured in terms of the time required to achieve the privacy criteria.
  2. Data Utility: Data utility is essentially a measure of information loss, i.e., the loss in the functionality of the data for producing the results that could be generated in the absence of PPDM algorithms.
  3. Uncertainty level: It is a measure of uncertainty with which the sensitive information that has been hidden can still be predicted.
  4. Resistance: Resistance is a measure of tolerance shown by PPDM algorithm against various data mining algorithms and models.

Thanks for reading.