
Chapter 20 Masking HR data

The ability to quickly anonymise data for test and development environments has an important place in the toolbox of every HR analytics expert: you want to be able to protect confidential data from inappropriate access.

Ensure all needed libraries are installed
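A minimal setup sketch; the exact package choice is an assumption, but tidyverse (data wrangling) and randomNames (fake name generation) cover everything that follows.

# Package choice is an assumption: tidyverse for wrangling, randomNames for fake names
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")
if (!requireNamespace("randomNames", quietly = TRUE)) install.packages("randomNames")
library(tidyverse)
library(randomNames)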

20.0.1 Whitehouse dataset

Let us load a dataset with first and last names and preview it.
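A loading sketch; the file name whitehouse_staff.csv and its column layout (Name, Salary) are assumptions.

# File name and column layout are assumptions
whitehouse <- read_csv("whitehouse_staff.csv")
head(whitehouse)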

Let us replace original names with fake names
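One way to do this, assuming the randomNames package loaded above; the seed value is also an assumption.

# Overwrite the Name column with generated fake names
set.seed(123)
whitehouse$Name <- randomNames(nrow(whitehouse), which.names = "both")
head(whitehouse)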

Let us replace original names with random numbers
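A sketch using base R sampling; the six-digit ID range is an assumption.

# Replace names with unique six-digit random IDs
set.seed(123)
whitehouse$Name <- sample(100000:999999, nrow(whitehouse))
head(whitehouse)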

Let us anonymise the salary data

We can use four different methods to anonymise data:

* Rounding: rounding Salary to the nearest ten thousand.
* Top coding: bringing data above an upper limit down to that limit.
* Bottom coding: bringing data below a lower limit up to that limit.
* Transformation: applying a mathematical transformation, described next.

Data analysis offers various types of transformations (reciprocal, logarithm, cube root, square root, and square); in the following we demonstrate only the square root transformation, as sketched below.
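A sketch of the four methods applied to an assumed Salary column; the cap and floor values are assumptions.

# Method 1: round Salary to the nearest ten thousand
whitehouse$Salary_rounded <- round(whitehouse$Salary / 10000) * 10000
# Method 2: top coding, cap at an upper limit (value is an assumption)
whitehouse$Salary_top <- pmin(whitehouse$Salary, 150000)
# Method 3: bottom coding, floor at a lower limit (value is an assumption)
whitehouse$Salary_bottom <- pmax(whitehouse$Salary, 40000)
# Method 4: square root transformation
whitehouse$Salary_sqrt <- sqrt(whitehouse$Salary)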

20.0.2 Fertility dataset

Let us explore a deeper anonymisation option: generating synthetic data. We will do so on the basis of the Fertility dataset available from the UCI website: https://archive.ics.uci.edu/ml/datasets/Fertility


The dataset needs to be loaded first; let us then preview it.
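A loading sketch; the raw file URL under the UCI archive is an assumption, and the column names follow the preview below.

# Load the Fertility dataset from UCI (file URL is an assumption)
fertility <- read_csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/00244/fertility_Diagnosis.txt",
  col_names = c("Season", "Age", "Child_Disease", "Accident_Trauma",
                "Surgical_Intervention", "High_Fevers", "Alcohol_Consumption",
                "Smoking_Habit", "Hours_Sitting", "Diagnosis"))
fertility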

# A tibble: 100 x 10
   Season     Age Child_Disease Accident_Trauma Surgical_Interven~ High_Fevers Alcohol_Consumpt~ Smoking_Habit Hours_Sitting Diagnosis
    <dbl>   <dbl>         <dbl>           <dbl>              <dbl>       <dbl>             <dbl>         <dbl>         <dbl> <chr>    
 1  -0.33 0.69000             0               1                  1           0               0.8             0          0.88 N        
 2  -0.33 0.94                1               0                  1           0               0.8             1          0.31 O        
 3  -0.33 0.5                 1               0                  0           0               1              -1          0.5  N        
 4  -0.33 0.75                0               1                  1           0               1              -1          0.38 N        
 5  -0.33 0.67                1               1                  0           0               0.8            -1          0.5  O        
 6  -0.33 0.67                1               0                  1           0               0.8             0          0.5  N        
 7  -0.33 0.67                0               0                  0          -1               0.8            -1          0.44 N        
 8  -0.33 1                   1               1                  1           0               0.6            -1          0.38 N        
 9   1    0.64                0               0                  1           0               0.8            -1          0.25 N        
10   1    0.61                1               0                  0           0               1              -1          0.25 N        
# ... with 90 more rows

Let us examine the dataset.

More on the data attributes is available here: https://archive.ics.uci.edu/ml/datasets/Fertility
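A sketch of the exploratory queries behind the output below; that the mean/sd summary refers to Age is an assumption.

# Surgical interventions per diagnosis group
fertility %>% group_by(Diagnosis) %>%
  summarise(Surgical_Intervention = sum(Surgical_Intervention))
# Mean and standard deviation of Age (variable choice is an assumption)
fertility %>% summarise(mean = mean(Age), sd = sd(Age))
# Frequency of the High_Fevers categories
fertility %>% count(High_Fevers)
# Cross-tabulation of childhood disease and accident/trauma
fertility %>% count(Child_Disease, Accident_Trauma)
# Proportion of respondents with a childhood disease
fertility %>% summarise(Child_Disease = mean(Child_Disease))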

# A tibble: 2 x 2
  Diagnosis Surgical_Intervention
  <chr>                     <dbl>
1 N                            44
2 O                             7
# A tibble: 1 x 2
   mean       sd
  <dbl>    <dbl>
1 0.669 0.121319
# A tibble: 3 x 2
  High_Fevers     n
        <dbl> <int>
1          -1     9
2           0    63
3           1    28
# A tibble: 4 x 3
  Child_Disease Accident_Trauma     n
          <dbl>           <dbl> <int>
1             0               0    10
2             0               1     3
3             1               0    46
4             1               1    41
# A tibble: 1 x 1
  Child_Disease
          <dbl>
1          0.87

In the following we will generate synthetic data by sampling from a normal distribution, through these steps (sketched in the code below):

1. Create a new dataset called "fert" after applying a log transformation to the hours sitting variable.
2. Calculate the average and the standard deviation.
3. Set a seed for reproducibility.
4. Generate new, normally distributed data for the hours sitting variable.
5. Transform the log variable back using the exponential.
6. Hard-bound any data falling outside the valid range.
7. Recheck the range.
8. Substitute the synthetic data back into the initial fert dataset.
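A sketch of the eight steps; the seed value is an assumption.

# Step 1: log-transform Hours_Sitting into a new dataset
fert <- fertility %>% mutate(Hours_Sitting = log(Hours_Sitting))
# Step 2: average and standard deviation of the transformed variable
m <- mean(fert$Hours_Sitting)
s <- sd(fert$Hours_Sitting)
# Step 3: seed for reproducibility (value is an assumption)
set.seed(123)
# Step 4: draw synthetic values from a normal distribution
synth <- rnorm(nrow(fert), mean = m, sd = s)
# Step 5: transform back with the exponential
synth <- exp(synth)
# Step 6: hard-bound to the range of the original variable
synth <- pmin(pmax(synth, min(fertility$Hours_Sitting)), max(fertility$Hours_Sitting))
# Step 7: recheck the range
range(synth)
# Step 8: substitute the synthetic values back
fert$Hours_Sitting <- synth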

# A tibble: 1 x 2
      mean       sd
     <dbl>    <dbl>
1 -1.01224 0.504779
[1] 0.0815 1.0000


Let us now introduce the concept of differential privacy, a mathematical framework used by organisations such as Google, the US Census Bureau, and Apple. Why differential privacy? It quantifies privacy loss via a privacy budget, called epsilon, and it assumes the worst-case scenario about the data intruder. A smaller privacy budget means less information is released, i.e. a noisier answer; however, epsilon cannot be zero or lower.

Global sensitivity of common queries, where n is the total number of observations, a is the lower bound of the data, and b is the upper bound:

* Counting: 1
* Proportion: 1 / n
* Mean: (b - a) / n

A small global sensitivity results in less noise; a large global sensitivity results in more noise.

# Number of observations
n <- nrow(fertility)
# Global sensitivity of counts
gs.count <- 1
# Global sensitivity of proportions
gs.prop <- 1 / n
# Lower bound
a <- 0
# Upper bound
b <- 1
# Global sensitivity of the mean
gs.mean <- (b - a) / n
# Global sensitivity of the variance
gs.var <- (b - a)^2 / n

# A tibble: 1 x 1
  Child_Disease
          <dbl>
1            87

The double exponential distribution is also commonly referred to as the Laplace distribution. The following is the plot of the double exponential probability density function.

Plot of the double exponential (Laplace) probability density function.
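A sketch of the Laplace mechanism applied to the childhood-disease count; the sampler is hand-rolled via inverse transform sampling, and the privacy budget value and seed are assumptions.

# Hand-rolled Laplace sampler (inverse transform)
rlaplace <- function(n, scale) {
  u <- runif(n, -0.5, 0.5)
  -scale * sign(u) * log(1 - 2 * abs(u))
}
# Privacy budget (value is an assumption)
eps <- 0.1
# True count of childhood disease, then a noisy release
true.count <- sum(fertility$Child_Disease)
set.seed(123)
noisy.count <- round(true.count + rlaplace(1, gs.count / eps))
true.count
noisy.count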

[1] 87
[1] 85

Sequential Composition

Suppose a set of privacy mechanisms is performed sequentially on a dataset, each providing its own epsilon privacy guarantee.

Sequential composition determines the privacy guarantee for such a sequence of differentially private computations: when a set of randomized mechanisms has been performed sequentially on the same data, the final privacy guarantee is the sum of the individual privacy budgets.

With two sequential queries under a single budget, the privacy budget must therefore be divided by two, as sketched below.
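A sketch of budget splitting, reusing the rlaplace() helper and gs.prop from above; the epsilon value and the choice of query are assumptions.

# Two sequential queries share the budget, so each gets eps / 2
eps <- 0.1
eps.each <- eps / 2
# Proportion of non-smokers (query choice is an assumption)
true.prop <- mean(fertility$Smoking_Habit == -1)
set.seed(123)
true.prop + rlaplace(1, gs.prop / eps.each)   # first noisy answer
true.prop + rlaplace(1, gs.prop / eps.each)   # second noisy answer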

[1] 0.37
[1] 0.247

Parallel Composition

Suppose instead that a set of privacy mechanisms is performed on disjoint subsets of a dataset, each providing its own epsilon privacy guarantee.

Parallel composition covers this case: because every record is touched by at most one mechanism, the final privacy guarantee is determined by the maximum of the individual privacy budgets, not their sum.

The privacy budget does not need to be divided: the query with the largest epsilon sets the budget for the whole dataset.
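A sketch of parallel composition, reusing rlaplace(), a, b, and eps from above; splitting by Diagnosis is an assumption. Because the two groups are disjoint, each query may spend the full budget.

# Disjoint subsets: normal vs. altered diagnosis
grp.N <- fertility %>% filter(Diagnosis == "N")
grp.O <- fertility %>% filter(Diagnosis == "O")
# Sensitivity of the mean within each subset
gs.mean.N <- (b - a) / nrow(grp.N)
gs.mean.O <- (b - a) / nrow(grp.O)
set.seed(123)
mean(grp.N$Hours_Sitting) + rlaplace(1, gs.mean.N / eps)
mean(grp.O$Hours_Sitting) + rlaplace(1, gs.mean.O / eps)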

# A tibble: 1 x 1
  Hours_Sitting
          <dbl>
1      0.393297
# A tibble: 1 x 1
  Hours_Sitting
          <dbl>
1      0.543333
[1] 0.37
[1] 0.568

Prepping the data
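A sketch of this step, assuming the target queries are the counts per Smoking_Habit category; the categories are disjoint, so each noisy count may use the full budget (rlaplace() and eps reused from above).

# True counts per smoking category, then noisy releases
smoking.counts <- fertility %>% count(Smoking_Habit)
smoking.counts
set.seed(123)
round(smoking.counts$n + rlaplace(3, gs.count / eps))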

# A tibble: 3 x 2
  Smoking_Habit     n
          <dbl> <int>
1            -1    56
2             0    23
3             1    21
[1] 46
[1] 37
[1] 17

Impossible and Inconsistent Answers

Laplace noise can produce answers that are impossible (a negative count, or a count larger than the number of respondents) or inconsistent (category counts that no longer sum to the total). Such answers are repaired by post-processing, as sketched below.
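A post-processing sketch; the clamp.count helper is hypothetical, clamping a noisy count to the feasible range [0, n] and rounding.

# Clamp a noisy count into [0, n] and round (hypothetical helper)
clamp.count <- function(x, n) round(pmin(pmax(x, 0), n))
clamp.count(-79, 100)   # an impossible negative count becomes 0
clamp.count(146, 100)   # a count above n is capped at 100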

[1] -79
[1] 0
[1] 100
# A tibble: 3 x 2
  Smoking_Habit     n
          <dbl> <int>
1            -1    56
2             0    23
3             1    21
[1] 46.1 37.2 44.7
[1] 36 29 35