There are situations in which we need to share sensitive or strategic information with third parties, whether to develop a system or test new functionalities. To guarantee the confidentiality of the information contained in this data set, we use de-characterization techniques.
In this tip you will learn how to do this, shuffling information from fields and records.
In our example, we will use a dataset with 3 fields and 10 records. They are intentionally ordered so that you can compare the data before and after the shuffling process,
Our initial dataset is:
Name Age Salary
Alan 10 10000
Bernardo 20 20000
Carlos 30 30000
Daniel 40 40000
Eduardo 50 50000
Flavio 60 60000
Gilberto 70 70000
Hugo 80 80000
Ian 90 90000
Jorge 100 100000
Step 1: Generate a flow for each field in your dataset
For each field, create an independent stream, each with 2 columns: Record ID and the field in question. In our example, we will create 3 streams, containing the columns Record ID + Name, Record ID + Age and Record ID + Salary, respectively.
Step 2: Create a random field (in all 3 streams)
Use the Formula tool to generate a random number. It must be large enough to reduce the chance of repetition. There are several ways to generate a random number; the formula we chose for this example is RandInt (1000) * RandInt (1000) + RandInt (1000).
Create a new field and call it Random.
From this point, each step must be repeated in each of the streams. You can copy and paste the tool and adjust as needed.
Step 3: Sort the dataset
Use the Sort tool to sort the dataset by the Random field generated in the previous step.
Step 4: Create a new column with a Sequential Number
With the data sorted, create a new index. Call it NewID
Step 5: Remove unnecessary fields
Use the Select tool to keep only the NewID and the flow’s own field in the data set. In the example below, the Name field.
Step 6: Join the fields again
Use the Multi-Join tool to regroup the fields in your data set. The key should be the New ID field.
Remember that to perform this step, the previous steps must have been completed for all fields.
In the output, choose only the fields that are part of the initial set, deselecting the NewID fields from each input.
Step 7: Check the result
Finally, connect a Browse tool to the output and check how the data is randomly shuffled.
Great! Now you can use Alteryx to shuffle your data!