PATH | Alteryx Tips – Using Alteryx to shuffle a dataset

Posted on Posted in dicas alteryx

There are situations in which we need to share sensitive or strategic information with third parties, whether to develop a system or test new functionalities. To guarantee the confidentiality of the information contained in this data set, we use de-characterization techniques.

In this tip you will learn how to do this, shuffling information from fields and records.

In our example, we will use a dataset with 3 fields and 10 records. They are intentionally ordered so that you can compare the data before and after the shuffling process,

Our initial dataset is:

Name        Age           Salary
Alan        10          10000
Bernardo    20          20000
Carlos       30          30000
Daniel       40          40000
Eduardo     50          50000
Flavio       60          60000
Gilberto    70          70000
Hugo        80          80000
Ian         90          90000
Jorge       100         100000

Step 1: Generate a flow for each field in your dataset

For each field, create an independent stream, each with 2 columns: Record ID and the field in question. In our example, we will create 3 streams, containing the columns Record ID + Name, Record ID + Age and Record ID + Salary, respectively.

Step 2: Create a random field (in all 3 streams)

Use the Formula tool to generate a random number. It must be large enough to reduce the chance of repetition. There are several ways to generate a random number; the formula we chose for this example is RandInt (1000) * RandInt (1000) + RandInt (1000).

Create a new field and call it Random.

From this point, each step must be repeated in each of the streams. You can copy and paste the tool and adjust as needed.

Step 3: Sort the dataset

Use the Sort tool to sort the dataset by the Random field generated in the previous step.

Step 4: Create a new column with a Sequential Number

With the data sorted, create a new index. Call it NewID

Step 5: Remove unnecessary fields

Use the Select tool to keep only the NewID and the flow’s own field in the data set. In the example below, the Name field.

Step 6: Join the fields again

Use the Multi-Join tool to regroup the fields in your data set. The key should be the New ID field.

Remember that to perform this step, the previous steps must have been completed for all fields.

In the output, choose only the fields that are part of the initial set, deselecting the NewID fields from each input.

Step 7: Check the result

Finally, connect a Browse tool to the output and check how the data is randomly shuffled.

Great! Now you can use Alteryx to shuffle your data!