This is a quick summary from a literature review of the reference paper "Privacy Preserving Data Mining Techniques: Current Scenario and Future Prospects". In general, the paper surveys and compares existing PPDM techniques, discusses their performance, and gives a brief introduction to soft computing techniques that adapt to broad and varied scenarios.
Introduction
Privacy preserving data mining (PPDM) deals with protecting the privacy of individual data or sensitive knowledge without sacrificing the utility of the data.
No privacy preserving algorithm exists that outperforms all others on all possible criteria. (Choose tools and techniques according to their performance under the current scenario.)
There is no inherent need to intrude on privacy for data mining, although privacy intrusion can still occur. (Data mining can be performed without intruding on privacy.)
The goal of PPDM is to lower the risk of data misuse while producing results that are the same as those produced in the absence of such privacy preserving techniques.
PPDM Framework
Level 1, data transform (DS -> DW)
The raw data from a single database or multiple databases is transformed into a format that is well suited for analytical purposes. Privacy concerns need to be taken care of even at this stage.
Level 2, data sanitization (DW -> PPDM)
The data from data warehouses is subjected to various processes that sanitize it so that it can be revealed even to untrustworthy data miners. The processes applied at this stage include blocking, suppression, perturbation, modification, generalization, sampling, etc.
Level 3, data revealing (PPDM -> result)
The information/knowledge revealed by the data mining algorithms is checked for its sensitivity with respect to disclosure risk.
PPDM Techniques
Classification
The majority of the existing approaches can be classified into two broad categories:
(Precaution) methodologies that protect the sensitive data itself in the mining process
(Anti-extraction) methodologies that protect the sensitive data mining results
Distributed Data Mining (DDM)
A simple approach to data mining over multiple sources that will not share data is to run existing data mining tools at each site independently and combine the results.
-> Issues that cause a disparity between local and global results (illustrated by the small sketch after this list) include:
failure to detect cross-site correlations
results that are over-weighted because duplicate records are not eliminated
geographic or demographic distinctions hidden because each site's population is treated as homogeneous
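As a toy illustration of the duplicate-record issue above (an assumed example, not taken from the paper), the sketch below shows how naively pooling per-site results over-weights an individual who appears at both sites:

```python
# A toy example: "alice" is stored at both sites, so naive pooling of the local
# data over-weights her record, while deduplicating by identifier does not.
site_a = [("alice", 100), ("bob", 200)]
site_b = [("alice", 100), ("carol", 400)]        # "alice" is duplicated here

naive_pool = [value for _, value in site_a] + [value for _, value in site_b]
naive_mean = sum(naive_pool) / len(naive_pool)   # 200.0 (alice counted twice)

deduplicated = {name: value for name, value in site_a + site_b}
true_mean = sum(deduplicated.values()) / len(deduplicated)  # ~233.33

print(naive_mean, true_mean)
```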
PPDM Techniques Dimensions
Data distribution
Data modification
Data mining algorithms
Data or rule hiding
Privacy preservation
PPDM Techniques Categories basing on Dimensions
Anonymization based PPDM
Perturbation based PPDM
Randomized Response based PPDM
Condensation approach based PPDM
Cryptography based PPDM
Anonymization based PPDM
The basic form of the data:
Explicit Identifiers (identify a record owner explicitly)
Quasi Identifiers (can potentially identify a record owner when combined with other attributes)
Sensitive Attributes (contain sensitive person-specific information)
Non-Sensitive Attributes (no problem if revealed)
k-anonymity
Definition: A data release has k-anonymity if the information for each person contained in the release cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release.
Examples:
A record released with its Explicit Identifiers has only 1-anonymity (it is unique in the release).
After removing explicit identifiers, every combination of Quasi Identifiers should match at least 2 records, i.e. at least 2-anonymity.
Procedure
Definition: Replacing a value with a less specific but semantically consistent value is called generalization; suppression involves blocking out the values entirely.
Examples:
The example below is quoted from Wikipedia.
Patient Records (Original) (Height in cm, Weight in kg)

| Name | Age | Gender | Height | Weight | State of domicile | Religion | Disease |
|---|---|---|---|---|---|---|---|
| Ramsha | 30 | Female | 165 | 72 | Tamil Nadu | Hindu | Cancer |
| Yadu | 24 | Female | 162 | 70 | Kerala | Hindu | Viral infection |
| Salima | 28 | Female | 170 | 68 | Tamil Nadu | Muslim | Tuberculosis |
| Sunny | 27 | Male | 170 | 75 | Karnataka | Parsi | No illness |
| Joan | 24 | Female | 165 | 71 | Kerala | Christian | Heart-related |
| Bahuksana | 23 | Male | 160 | 69 | Karnataka | Buddhist | Tuberculosis |
| Rambha | 19 | Male | 167 | 85 | Kerala | Hindu | Cancer |
| Kishor | 29 | Male | 180 | 81 | Karnataka | Hindu | Heart-related |
| Johnson | 17 | Male | 175 | 79 | Kerala | Christian | Heart-related |
| John | 19 | Male | 169 | 82 | Kerala | Christian | Viral infection |
Explicit Identifiers: Name
Quasi Identifiers: Age, Gender, State of domicile, and Religion
Sensitive Attributes: Disease
Suppression.
Certain values of the attributes are replaced by an asterisk “*”. E.g., we have replaced all the values in the Name attribute and the Religion attribute with a “*”.
Generalization.
Individual values of attributes are replaced with a broader category. E.g., the value “19” of the attribute Age may be replaced by “≤ 20”, etc.
Patient Records (Anonymized) (Height in cm, Weight in kg)

| Name | Age | Gender | Height | Weight | State of domicile | Religion | Disease |
|---|---|---|---|---|---|---|---|
| * | (20, 30] | Female | 165 | 72 | Tamil Nadu | * | Cancer |
| * | (20, 30] | Female | 162 | 70 | Kerala | * | Viral infection |
| * | (20, 30] | Female | 170 | 68 | Tamil Nadu | * | Tuberculosis |
| * | (20, 30] | Male | 170 | 75 | Karnataka | * | No illness |
| * | (20, 30] | Female | 165 | 71 | Kerala | * | Heart-related |
| * | (20, 30] | Male | 160 | 69 | Karnataka | * | Tuberculosis |
| * | [0, 20] | Male | 167 | 85 | Kerala | * | Cancer |
| * | (20, 30] | Male | 180 | 81 | Karnataka | * | Heart-related |
| * | [0, 20] | Male | 175 | 79 | Kerala | * | Heart-related |
| * | [0, 20] | Male | 169 | 82 | Kerala | * | Viral infection |
This data has 2-anonymity with respect to the attributes Age, Gender and State of domicile, since for any combination of these attributes found in any row of the table there are always at least 2 rows with those exact attributes.
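A minimal sketch of these suppression and generalization steps, together with a k-anonymity check for the table above, is given below. It assumes pandas (not mentioned in the paper) and uses only Age, Gender and State of domicile as quasi identifiers:

```python
# A minimal sketch, assuming pandas (not mentioned in the paper). Height and
# Weight are omitted for brevity; they are not used as quasi identifiers here.
import pandas as pd

records = pd.DataFrame({
    "Name":     ["Ramsha", "Yadu", "Salima", "Sunny", "Joan",
                 "Bahuksana", "Rambha", "Kishor", "Johnson", "John"],
    "Age":      [30, 24, 28, 27, 24, 23, 19, 29, 17, 19],
    "Gender":   ["Female", "Female", "Female", "Male", "Female",
                 "Male", "Male", "Male", "Male", "Male"],
    "State":    ["Tamil Nadu", "Kerala", "Tamil Nadu", "Karnataka", "Kerala",
                 "Karnataka", "Kerala", "Karnataka", "Kerala", "Kerala"],
    "Religion": ["Hindu", "Hindu", "Muslim", "Parsi", "Christian",
                 "Buddhist", "Hindu", "Hindu", "Christian", "Christian"],
    "Disease":  ["Cancer", "Viral infection", "Tuberculosis", "No illness",
                 "Heart-related", "Tuberculosis", "Cancer", "Heart-related",
                 "Heart-related", "Viral infection"],
})

anonymized = records.copy()

# Suppression: replace every value of the Name and Religion attributes with "*".
anonymized["Name"] = "*"
anonymized["Religion"] = "*"

# Generalization: replace the exact Age with a coarser interval.
anonymized["Age"] = anonymized["Age"].apply(
    lambda a: "[0, 20]" if a <= 20 else "(20, 30]")

def k_anonymity(df, quasi_identifiers):
    """Smallest group size over all combinations of the quasi identifiers."""
    return df.groupby(quasi_identifiers).size().min()

print(k_anonymity(anonymized, ["Age", "Gender", "State"]))  # -> 2
```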
Pros and Cons
Pros:
Ensures that the transformed data remains truthful (values are generalized or suppressed, never fabricated)
The anonymization method is easy to perform
Cons:
Suffers heavy information loss
In practice, not immune to homogeneity attacks and background knowledge attacks
Perturbation based PPDM
Procedure
Definition: In perturbation, the original values are replaced with synthetic data values so that the statistical information computed from the perturbed data does not differ to a large extent from the statistical information computed from the original data.
Examples:
Individual incomes (k$) (Original)
[40, 50, 60, 70, 80, 90, 100]
We add random noise (synthetic values) to each income value. The noise obscures individual details while keeping the statistical properties similar, e.g. a random value between -5 and +5.
Individual incomes (k$) (Perturbed)
[43, 48, 62, 73, 78, 92, 98]
Aggregated Statistics:
Assume the original data $X$ has mean $\mu_X$ and variance $\sigma_X^2$, and the noise $N$ is characterized by mean $\mu_N$ and variance $\sigma_N^2$:
Total mean: $\mu_{X+N} = \mu_X + \mu_N$ (equal to $\mu_X$ for zero-mean noise)
Total variance: $\sigma_{X+N}^2 = \sigma_X^2 + \sigma_N^2$ (assuming the noise is independent of the data)
The statistical information remains consistent, if the distribution of random noise is known.
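A minimal sketch of this additive-noise perturbation, assuming numpy (not part of the paper), is shown below; zero-mean uniform noise in [-5, 5] keeps the mean nearly unchanged and shifts the variance by roughly the known noise variance:

```python
# A minimal sketch of additive-noise perturbation (numpy assumed). Uniform noise
# in [-5, 5] has mean 0, so the perturbed mean stays close to the original mean,
# and the perturbed variance is roughly the original variance plus the noise
# variance (independence assumed).
import numpy as np

rng = np.random.default_rng(0)
incomes = np.array([40, 50, 60, 70, 80, 90, 100], dtype=float)

noise = rng.uniform(-5, 5, size=incomes.shape)   # mean 0, variance (10**2)/12
perturbed = incomes + noise

print(incomes.mean(), perturbed.mean())          # close to each other
print(incomes.var(), perturbed.var())            # differ by roughly noise variance
```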
Pros and Cons
Pros:
Perturbed records cannot be linked back to specific real-world individuals
The original values are hard to recover from the perturbed data
Cons:
Techniques such as multivariate decision tree algorithms cannot work with perturbed data, because the perturbation approach treats different attributes independently
Distribution-based data mining algorithms suffer an inherent loss of the implicit information available in multidimensional records
Randomized Response based PPDM
Procedure
Definition: In randomized response, the data is scrambled in such a way that the central place cannot tell, with a probability better than a pre-defined threshold, whether the data from a customer contains truthful or false information.
In the first step, the data providers randomize their data and transmit the randomized data to the data receiver.
In the second step, the data receiver reconstructs the original distribution of the data by employing a distribution reconstruction algorithm.
Examples:
The Bayes estimator is one of the algorithms used for randomized response reconstruction.
Scenario: a survey about a sensitive topic, such as whether respondents have ever used illegal drugs.
Randomized Response Technique: Each respondent is given a randomization device (like a coin or dice) that determines how they answer.
Suppose we ask a person whether they have had sex with a prostitute this month. (Assume the true probability of this sensitive event $A$ is $p = P(A)$.)
Before answering, they flip a coin (event $C$):
If the coin comes up tails (with probability $q$), they answer "yes" regardless of the truth.
If the coin comes up heads (with probability $1-q$), they answer truthfully based on their actual experience.
Only they know whether their answer reflects the coin toss or their true behavior.
We assume that people who get heads answer truthfully, and that the events $A$ and $C$ are independent, so the joint probabilities of the two events (and the resulting answers) are:

| | $A$ is true (prob. $p$) | $A$ is false (prob. $1-p$) |
|---|---|---|
| Heads (prob. $1-q$) | $(1-q)\,p$ → "yes" | $(1-q)(1-p)$ → "no" |
| Tails (prob. $q$) | $q\,p$ → "yes" | $q\,(1-p)$ → "yes" |

Since "yes" is received with observed ratio $\lambda = P(\text{yes}) = q + (1-q)\,p$ and "no" with ratio $1-\lambda = (1-q)(1-p)$, we can calculate the target probability $p$ from the known coin probability $q$ and the observed ratio $\lambda$:

$$p = \frac{\lambda - q}{1 - q}$$

For example, if we observe $\lambda = 0.6$ and we know $q = 0.5$, we get $p = (0.6 - 0.5)/0.5 = 0.2$.
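The following sketch (assuming numpy; the parameter values are illustrative, not from the paper) simulates the coin-flip scheme above and recovers $p$ from the observed "yes" ratio using $p = (\lambda - q)/(1 - q)$:

```python
# A minimal sketch of the coin-flip randomized response scheme described above
# (numpy assumed). Each respondent answers "yes" on tails and truthfully on
# heads; the collector recovers the population rate p from the observed ratio.
import numpy as np

rng = np.random.default_rng(1)
n, true_p, q = 100_000, 0.2, 0.5        # q = P(tails) = probability of a forced "yes"

truth = rng.random(n) < true_p          # each respondent's actual (sensitive) status
tails = rng.random(n) < q               # outcome of the private coin flip
answers = np.where(tails, True, truth)  # "yes" on tails, truthful on heads

observed_yes = answers.mean()           # lambda = P(yes) = q + (1 - q) * p
estimated_p = (observed_yes - q) / (1 - q)
print(round(estimated_p, 3))            # close to 0.2
```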
Pros and Cons
Pros:
The randomization method can be implemented at data collection time. It does not require a trusted server to contain all the original records in order to perform the anonymization process.
Cons:
It treats all records equally, irrespective of their local density. This makes outlier records more susceptible to adversarial attacks than records in denser regions of the data.
Condensation approach based PPDM
Procedure
Definition: The condensation approach constructs constrained clusters in the dataset and then generates pseudo-data from the statistics of these clusters.
Examples:
Student Records (Original)

| Name | Age | Gender | Major | GPA |
|---|---|---|---|---|
| Alice | 20 | Female | Biology | 3.8 |
| Bob | 21 | Male | Chemistry | 3.5 |
| Carol | 19 | Female | Physics | 3.9 |
| David | 22 | Male | Math | 4.0 |
| Eve | 20 | Female | Biology | 3.7 |
| Frank | 21 | Male | Chemistry | 3.6 |
| Grace | 19 | Female | Physics | 3.8 |
| Harry | 22 | Male | Math | 3.9 |
Attributes:
Explicit identifier: Name
Quasi identifiers: Age, Gender, and Major
Sensitive attribute: GPA
Clustering: We group the records into clusters based on their quasi-identifiers. Each cluster has a size of at least k, where k is the desired anonymity level. For example, if k = 2, each cluster should have at least two records. Here's one possible clustering result:
Cluster 1: Alice, Eve
Cluster 2: Bob, Frank
Cluster 3: Carol, Grace
Cluster 4: David, Harry
Statistics: We compute the statistics of each cluster, such as the mean, median, mode, standard deviation, etc. These statistics capture the essential information of the original data. Here’s an example of the cluster statistics:
Cluster 1: Mean Age = 20, Mode Gender = Female, Mode Major = Biology, Mean GPA = 3.75
Cluster 2: Mean Age = 21, Mode Gender = Male, Mode Major = Chemistry, Mean GPA = 3.55
Cluster 3: Mean Age = 19, Mode Gender = Female, Mode Major = Physics, Mean GPA = 3.85
Cluster 4: Mean Age = 22, Mode Gender = Male, Mode Major = Math, Mean GPA = 3.95
Synthesis: We generate synthetic data from the cluster statistics using any data generation method, such as random sampling, regression, or generative models. The synthetic data should have the same format and distribution as the original data. Here's an example of the synthetic data (a code sketch follows the table):
Student Records (Condensed)

| Name | Age | Gender | Major | GPA |
|---|---|---|---|---|
| Anna | 20 | Female | Biology | 3.8 |
| Beth | 20 | Female | Biology | 3.7 |
| Chris | 21 | Male | Chemistry | 3.6 |
| Dan | 21 | Male | Chemistry | 3.5 |
| Emma | 19 | Female | Physics | 3.9 |
| Fred | 19 | Female | Physics | 3.8 |
| George | 22 | Male | Math | 4.0 |
| Henry | 22 | Male | Math | 3.9 |
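A minimal sketch of the condensation idea for this example, assuming pandas and numpy (neither is part of the paper), is given below; it clusters the records, keeps only per-cluster statistics, and samples pseudo records from those statistics:

```python
# A minimal sketch of condensation (pandas/numpy assumed): group records into
# clusters of size >= k, keep only per-cluster statistics, then draw pseudo
# records from those statistics instead of releasing the originals.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
students = pd.DataFrame({
    "Age":    [20, 21, 19, 22, 20, 21, 19, 22],
    "Gender": ["Female", "Male", "Female", "Male",
               "Female", "Male", "Female", "Male"],
    "Major":  ["Biology", "Chemistry", "Physics", "Math",
               "Biology", "Chemistry", "Physics", "Math"],
    "GPA":    [3.8, 3.5, 3.9, 4.0, 3.7, 3.6, 3.8, 3.9],
})

# Clustering on quasi-identifiers; here each group happens to have size k = 2.
clusters = students.groupby(["Gender", "Major"])

synthetic_rows = []
for (gender, major), group in clusters:
    # Per-cluster statistics: only these (not the raw rows) are used further.
    age_mean, age_std = group["Age"].mean(), group["Age"].std(ddof=0)
    gpa_mean, gpa_std = group["GPA"].mean(), group["GPA"].std(ddof=0)
    for _ in range(len(group)):      # generate as many pseudo records as originals
        synthetic_rows.append({
            "Age":    round(rng.normal(age_mean, age_std)),
            "Gender": gender,
            "Major":  major,
            "GPA":    round(rng.normal(gpa_mean, gpa_std), 2),
        })

synthetic = pd.DataFrame(synthetic_rows)
print(synthetic)
```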
Pros and Cons
Pros:
The use of pseudo-data provides an additional layer of protection, as it becomes difficult to perform adversarial attacks on synthetic data.
It works without redesigning data mining algorithms and is very effective for data stream problems, where the data is highly dynamic.
Cons:
Data mining results are affected because a large amount of information is lost when a large number of records is condensed into a single statistical group entity.
Cryptography based PPDM
Procedure
Definition: Cryptographic techniques are ideally meant for scenarios where multiple parties collaborate to compute results or share non-sensitive mining results while avoiding disclosure of sensitive information.
Examples:
Almost all of these methods are based on a special encryption protocol known as Secure Multiparty Computation (SMC).
In Yao, Andrew C., "How to generate and exchange secrets", Proceedings of the 27th IEEE Symposium on Foundations of Computer Science, 162-167, 1986, a problem was discussed where two millionaires want to know who is richer without either revealing their net worth. From this problem, SMC was coined and developed. SMC defines two basic adversarial models, namely (i) the semi-honest model and (ii) the malicious model.
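The following toy sketch illustrates the flavour of SMC with additive secret sharing for a secure sum; it is not Yao's garbled-circuit protocol from the cited paper, and the party values are purely illustrative:

```python
# A toy sketch of secure multiparty computation via additive secret sharing
# (not Yao's protocol): three parties learn the sum of their private values
# without any single party seeing another party's input.
import random

PRIME = 2_147_483_647            # arithmetic is done modulo a public prime
secrets = [120, 45, 300]         # each party's private value (illustrative)

# Each party splits its secret into one random share per party.
shares = []
for s in secrets:
    parts = [random.randrange(PRIME) for _ in range(len(secrets) - 1)]
    parts.append((s - sum(parts)) % PRIME)
    shares.append(parts)

# Party j sums the j-th share of every secret; only these partial sums are shared.
partial_sums = [sum(shares[i][j] for i in range(len(secrets))) % PRIME
                for j in range(len(secrets))]

# Adding the partial sums reveals only the total, not the individual inputs.
total = sum(partial_sums) % PRIME
print(total)                     # -> 465
```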
Evaluation Parameters of PPDM Algorithms
Performance: The performance of a mining algorithm is measured in terms of the time required to achieve the privacy criteria.
Data Utility: Data utility is basically a measure of information loss, or the loss in the functionality of the data in providing the results that could be generated in the absence of PPDM algorithms.
Uncertainty Level: A measure of the uncertainty with which the sensitive information that has been hidden can still be predicted.
Resistance: A measure of the tolerance shown by a PPDM algorithm against various data mining algorithms and models.