This is a quick summary from a literature review of the reference: Privacy Preserving Data Mining Techniques: Current Scenario and Future Prospects. In general, the paper surveys and compares existing PPDM techniques, discusses their performance, and gives a brief introduction to soft computing techniques that can adapt to broad and varied scenarios.
Privacy preserving data mining (PPDM) deals with protecting the privacy of individual data or sensitive knowledge without sacrificing the utility of the data.
The majority of the existing approaches can be classified into two broad categories:
A simple approach to data mining over multiple sources that will not share data is to run existing data mining tools at each site independently and combine the results.
-> Issues that cause a disparity between local and global results include:
The basic form of the data:
Definition: K-anonymity is the property that the information for each person contained in the release cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release.
Examples:
Definition: Replacing a value with a less specific but semantically consistent value is called generalization, while suppression involves blocking (withholding) the value entirely.
Examples:
The example below is quoted from Wikipedia.
Patient Records (Original) (Height unit: cm, Weight unit: kg)
Name | Age | Gender | Height | Weight | State of domicile | Religion | Disease |
---|---|---|---|---|---|---|---|
Ramsha | 30 | Female | 165 | 72 | Tamil Nadu | Hindu | Cancer |
Yadu | 24 | Female | 162 | 70 | Kerala | Hindu | Viral infection |
Salima | 28 | Female | 170 | 68 | Tamil Nadu | Muslim | Tuberculosis |
Sunny | 27 | Male | 170 | 75 | Karnataka | Parsi | No illness |
Joan | 24 | Female | 165 | 71 | Kerala | Christian | Heart-related |
Bahuksana | 23 | Male | 160 | 69 | Karnataka | Buddhist | Tuberculosis |
Rambha | 19 | Male | 167 | 85 | Kerala | Hindu | Cancer |
Kishor | 29 | Male | 180 | 81 | Karnataka | Hindu | Heart-related |
Johnson | 17 | Male | 175 | 79 | Kerala | Christian | Heart-related |
John | 19 | Male | 169 | 82 | Kerala | Christian | Viral infection |
Patient Records (Anonymized) (Height unit: cm, Weight unit: kg)
Name | Age | Gender | Height | Weight | State of domicile | Religion | Disease |
---|---|---|---|---|---|---|---|
* | (20, 30] | Female | 165 | 72 | Tamil Nadu | * | Cancer |
* | (20, 30] | Female | 162 | 70 | Kerala | * | Viral infection |
* | (20, 30] | Female | 170 | 68 | Tamil Nadu | * | Tuberculosis |
* | (20, 30] | Male | 170 | 75 | Karnataka | * | No illness |
* | (20, 30] | Female | 165 | 71 | Kerala | * | Heart-related |
* | (20, 30] | Male | 160 | 69 | Karnataka | * | Tuberculosis |
* | [0, 20] | Male | 167 | 85 | Kerala | * | Cancer |
* | (20, 30] | Male | 180 | 81 | Karnataka | * | Heart-related |
* | [0, 20] | Male | 175 | 79 | Kerala | * | Heart-related |
* | [0, 20] | Male | 169 | 82 | Kerala | * | Viral infection |
This data has 2-anonymity with respect to the attributes Age, Gender and State of domicile, since for any combination of these attributes found in any row of the table there are always at least 2 rows with those exact attributes.
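To make the two operations concrete, here is a minimal Python sketch (using pandas, which the paper does not prescribe; the helper names are my own assumptions) that applies generalization to Age and suppression to Name and Religion on a few of the rows above, then measures the resulting k over the quasi-identifiers:

```python
import pandas as pd

def generalize_age(age: int) -> str:
    """Generalization: replace an exact age with a less specific, semantically consistent interval."""
    return "[0, 20]" if age <= 20 else "(20, 30]" if age <= 30 else "(30, inf)"

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["Name"] = "*"                             # suppression: block the identifying value
    out["Religion"] = "*"                         # suppression
    out["Age"] = out["Age"].map(generalize_age)   # generalization
    return out

def k_of(df: pd.DataFrame, quasi_ids) -> int:
    """k = size of the smallest group of rows sharing the same quasi-identifier values."""
    return int(df.groupby(quasi_ids).size().min())

records = pd.DataFrame({
    "Name": ["Ramsha", "Yadu", "Salima", "Joan"],
    "Age": [30, 24, 28, 24],
    "Gender": ["Female", "Female", "Female", "Female"],
    "State of domicile": ["Tamil Nadu", "Kerala", "Tamil Nadu", "Kerala"],
    "Religion": ["Hindu", "Hindu", "Muslim", "Christian"],
})

anonymized = anonymize(records)
print(anonymized)
print("k =", k_of(anonymized, ["Age", "Gender", "State of domicile"]))  # k = 2
```

With these four rows the result is k = 2, matching the 2-anonymity of the full table above.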
Pros:
Cons:
Definition: In perturbation, the original values are replaced with synthetic data values so that the statistical information computed from the perturbed data does not differ significantly from the statistical information computed from the original data.
Examples:
Individual incomes (k$) (Original)
[40, 50, 60, 70, 80, 90, 100]
We add random noise (synthetic values) to each income value, e.g. a random value between -5 and +5. The noise obscures individual details while keeping the statistical properties similar.
Individual incomes (k$) (Perturbed)
[43, 48, 62, 73, 78, 92, 98]
Aggregated Statistics:
Assume the original data $X$ has mean $E[X]$ and variance $Var[X]$, and the noise $Y$ is characterized by mean $E[Y]$ and variance $Var[Y]$:
\[E[X+Y] = E[X] + E[Y], \qquad Var[X+Y] = Var[X] + Var[Y] \ \text{(for independent noise)}.\]
The statistical information therefore remains consistent (and recoverable) as long as the distribution of the random noise is known.
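As a quick sanity check, here is a small Python sketch of additive-noise perturbation on the income values above (the uniform ±5 noise range comes from the example; everything else is an illustrative assumption):

```python
import random
from statistics import mean

original = [40, 50, 60, 70, 80, 90, 100]             # individual incomes (k$)
noise = [random.uniform(-5, 5) for _ in original]    # zero-mean noise with a known distribution
perturbed = [x + e for x, e in zip(original, noise)]

# Individual values are obscured, but aggregate statistics stay close, since
# E[X+Y] = E[X] + E[Y] and (for independent noise) Var[X+Y] = Var[X] + Var[Y].
print("original mean :", mean(original))             # 70
print("perturbed mean:", round(mean(perturbed), 2))  # approximately 70
```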
Pros:
Cons:
Definition: In randomized response, the data is scrambled in such a way that the central collector cannot tell, with probability better than a pre-defined threshold, whether the data from a customer contains truthful information or false information.
Examples:
The Bayes estimator is one example of a randomized response algorithm.
Suppose we ask a person whether they have had sex with a prostitute this month. (Assume the true probability of event $A$ is $q$.)
Before answering, they flip a (possibly biased) coin (event $B$, occurring with probability $p$):
- If the coin comes up on the $B$ side (probability $p$), they answer "yes" regardless of the truth.
- Otherwise (event $\bar{B}$, probability $1-p$), they answer truthfully.
Only they know whether their answer reflects the coin toss or their true behavior.
Here is the table of joint probabilities of the two events:
$P(\cdot)$ | $A$ | $\bar{A}$ |
---|---|---|
$B$ | $p \cdot q$ | $p \cdot (1-q)$ |
$\bar{B}$ | $(1-p) \cdot q$ | $(1-p) \cdot (1-q)$ |
Since $A \cap B$, $\bar{A} \cap B$, and $A \cap \bar{B}$ all yield a "yes" answer (observed with ratio $k$) and $\bar{A} \cap \bar{B}$ yields a "no" answer (observed with ratio $1-k$), we can calculate the target probability $q$ from the known design probability $p$ and the observed ratio $k$:
\[\begin{aligned} P(\bar{A} \cap \bar{B}) &= (1-p) \cdot (1-q) = 1-k,\\ &\implies q = 1- \frac{1-k}{1-p}. \end{aligned}\]For example, if observed $k=55\%$ and we know $p=50\%$, we could get $q=10\%$.
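The estimator is easy to verify by simulation. The Python sketch below assumes a true rate $q = 10\%$ and a coin bias $p = 50\%$ purely for illustration; it generates randomized answers and then inverts the formula above to recover $q$:

```python
import random

p, q, n = 0.5, 0.10, 100_000   # coin bias, (unknown) true rate, number of respondents

answers = []
for _ in range(n):
    truth = random.random() < q          # event A: the respondent really has the sensitive attribute
    forced_yes = random.random() < p     # event B: the coin forces a "yes" answer
    answers.append(forced_yes or truth)  # "yes" unless both the coin and the truth say "no"

k = sum(answers) / n                     # observed ratio of "yes" answers
q_hat = 1 - (1 - k) / (1 - p)            # invert (1-p)(1-q) = 1-k
print(f"observed k = {k:.3f}, estimated q = {q_hat:.3f}")  # roughly 0.55 and 0.10
```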
Pros:
Cons:
Definition: The condensation approach constructs constrained clusters in the dataset and then generates pseudo-data from the statistics of these clusters.
Examples:
Student Record (original)
Name | Age | Gender | Major | GPA |
---|---|---|---|---|
Alice | 20 | Female | Biology | 3.8 |
Bob | 21 | Male | Chemistry | 3.5 |
Carol | 19 | Female | Physics | 3.9 |
David | 22 | Male | Math | 4.0 |
Eve | 20 | Female | Biology | 3.7 |
Frank | 21 | Male | Chemistry | 3.6 |
Grace | 19 | Female | Physics | 3.8 |
Harry | 22 | Male | Math | 3.9 |
Attributes:
Student Record (condensed)
Name | Age | Gender | Major | GPA |
---|---|---|---|---|
Anna | 20 | Female | Biology | 3.8 |
Beth | 20 | Female | Biology | 3.7 |
Chris | 21 | Male | Chemistry | 3.6 |
Dan | 21 | Male | Chemistry | 3.5 |
Emma | 19 | Female | Physics | 3.9 |
Fred | 19 | Female | Physics | 3.8 |
George | 22 | Male | Math | 4.0 |
Henry | 22 | Male | Math | 3.9 |
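A rough Python sketch of the idea, where grouping by (Gender, Major) stands in for a size-constrained clustering step, and the pseudo names plus a small GPA jitter are illustrative assumptions (the paper itself only describes generating pseudo-data from cluster statistics):

```python
import random
from collections import defaultdict

students = [
    ("Alice", 20, "Female", "Biology", 3.8), ("Bob", 21, "Male", "Chemistry", 3.5),
    ("Carol", 19, "Female", "Physics", 3.9), ("David", 22, "Male", "Math", 4.0),
    ("Eve", 20, "Female", "Biology", 3.7),   ("Frank", 21, "Male", "Chemistry", 3.6),
    ("Grace", 19, "Female", "Physics", 3.8), ("Harry", 22, "Male", "Math", 3.9),
]

# Step 1: form small clusters of similar records (here: same Gender and Major).
clusters = defaultdict(list)
for name, age, gender, major, gpa in students:
    clusters[(gender, major)].append((age, gpa))

# Step 2: regenerate pseudo records from each cluster's statistics, one per original record.
pseudo = []
for (gender, major), rows in clusters.items():
    mean_age = sum(a for a, _ in rows) / len(rows)
    mean_gpa = sum(g for _, g in rows) / len(rows)
    for i in range(len(rows)):
        pseudo.append((f"pseudo-{gender}-{major}-{i}",   # pseudo name, not a real person
                       round(mean_age),
                       gender, major,
                       round(mean_gpa + random.uniform(-0.1, 0.1), 1)))

for row in pseudo:
    print(row)
```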
Pros:
Cons:
Definition: Cryptographic techniques are ideally suited to scenarios where multiple parties collaborate to compute results or share non-sensitive mining results, thereby avoiding the disclosure of sensitive information.
Examples:
Almost all of these methods are based on a special cryptographic protocol known as Secure Multiparty Computation (SMC).
Yao, in "How to generate and exchange secrets" (Proceedings of the 27th IEEE Symposium on Foundations of Computer Science, pp. 162-167, 1986), discussed a problem in which two millionaires want to know who is richer without either revealing their net worth. This is the setting in which SMC was coined and developed. SMC defines two basic adversarial models, namely (i) the semi-honest model and (ii) the malicious model.
See more detailed protocols in Yao’s Millionaires’ problem.
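As a toy illustration of the SMC idea (this is additive secret sharing for a secure sum, not Yao's garbled-circuit protocol; all names and values are assumptions), the sketch below lets several parties learn the total of their private inputs without any single party's value being revealed:

```python
import random

MODULUS = 2**61 - 1  # arithmetic is done modulo a large prime

def share(secret: int, n_parties: int) -> list:
    """Split a secret into n_parties random shares that sum to the secret (mod MODULUS)."""
    shares = [random.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

# Three parties with private inputs (e.g. net worths in millions):
inputs = [40, 75, 60]
n = len(inputs)

# Each party splits its input into n shares and sends the j-th share to party j.
all_shares = [share(x, n) for x in inputs]

# Party j locally sums the shares it received; a single share reveals nothing about any input.
partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]

# Publishing only the partial sums reveals the total, but not the individual inputs.
total = sum(partial_sums) % MODULUS
print(total)  # 175
```

Each party only ever sees random-looking shares, yet the published partial sums add up to the true total.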
Pros:
Cons:
Soft computing techniques may be used to handle the various challenges posed by data mining. -> high MIQ (Machine Intelligence Quotient)
See related concepts: Fuzzy set, Rough set.
Thanks for reading.