Make Your Data FLow – Alteryx IT!

WALK-THROUGH 1.0

This is a part 1 of many blog posts coming up on Alteryx workflows, weekly challenges and walk-through – as I learn it. Beginning with a brief overview of Alteryx and it’s use cases, I will conclude with a workflow walk-through.

After spending enough time on data visualization tools and mediums like Tableau, RShiny, Dash and PowerBI I found an ardent need to explore ways through which I could connect only relevant data to my visualization media. Through my internship I came to terms with the fact that data can be voluminous and way more messier than you can imagine.

With data lakes storing large volume of data dumps the process of value creation from this resource begins with creation of defined pipelines to analyze that data – a process some call the Extract Transform Load (ETL) process.

I could go out on a limb and call myself an Alteryx fanboy – I have reasons though.

As a practice, before I get my hands on to any tool I like to check the Gartner Report – as the report here speaks volumes about what Alteryx is!

My go-to before Alteryx was SSIS however some quality time with Alteryx changes everything you know about convoluted and time-consuming analytic and data engineering process.

Talk about companies that use Alteryx? – IBM, Amazon, Microsoft, Unilever, The Home Depot, McKinsey and Co. , P&G to name a few (Click here for the complete list)

Onward to the Weekly Challenge walk-through:

A weekly challenge is typically structured to make the participant drive the input data-sets towards what the output demands.

Here’s the problem statement for this challenge #167:

‘ This week’s challenge gears up for the 2019 Grand Prix by taking a look at race data from 1950-2018. In Formula 1, points are awarded to the top 10 drivers after each race. After all the races in the year, the driver with the most points is named champion. Drivers also represent their teams in the “Constructor’s Championship”. (Constructor = Team) Use the provided data to find the youngest and oldest ever champions, and what team they drove for during that season.

The input consists of 5 files as follows:

 

We have to create a workflow take helps us narrow down to two racers – the youngest and the oldest player player (at time of winning) to win a championship . The output looks like this:

The tools I’ve used to design the workflow are as follows:

I usually begin with exploring the input data-sets using select tool; identifying what attributes connect those data-sets (if multiple) using the browse tool. Also check for the data types – The first thing i observed here was that the column for date (dob) is a string which needs to be converted to a date format since the steps to follow will see us computing the age of these drivers.

DateTime Tool

On exploring the columns in these data-sets I decided to create a master table that packs all the attributes I’ll need to compute the output and drop the columns I don’t need in the process of creating that master table. For that purpose I used the join tool which in addition to joining tables also enables you to drop/rename columns and examine their datatypes.

Here, i joined “Driver Standings” and “Race Results” on the raceid and driverid. This is where your experience with SQL comes in handy. In addition, I also dropped some columns that were irrelevant to me or would lead to duplication.

Next I joined the inner join result of the above join with the output of time conversion we did in step 1 (from “Driver Profile” table) on driverId.

Here, again dropping columns i don’t need or the ones which are duplicates.

I followed it up by joining the tables “Team Details” with the inner join output of the above join.

Finally we have one last table left which will complete the creation of our master table. I joined “Race Details” table with the inner join output of the above performed join.

We now have the master table. Age of the drivers at the given championship is central to this workflow. It can be easily computed using the Year of the race and their date of birth. However if you observe in the above image – ‘year’ has the datatype V_String. In order to compute the age we need to convert ‘year’ to Date-Time format. We use the datetime tool again.

Now we have all the columns we need in the correct format. I followed this up by creating a new column ‘Age’ using the Formula Tool.

At this point for ease of operation I sorted the table in the ascending order of ‘year’ and descending order of ‘points’. This gave me a brief idea about the leader board from the inception of this championship.

At this point you can use a browse tool to examine how the table looks. I often use multiple browse tools at each stage to examine how my data looks after the operation I perform. It’s a good way to debug before you proceed with your workflow.

For the next step we can extract the leader from every year to create a subset of this table with only the winners from every year. To that end I created a new column using a simple logic- if the points a driver have is greater than the driver preceding him (remember, we have sorted the table in descending order of points), assign a value 1 in the new column else assign zero using multi-row formula tool.

Here, we compared points for a driver with the driver preceding him and not following him to ensure that our first row does not give an erroneous entry in the new column we are creating.

In order to create a table with only the leaders we could simply use the filter tool to filter out drivers with entry “1” in the ‘Standing’ column we created above.

At this point you can sort the data in this table by ascending order of ‘Age’ and descending order of ‘points’.

Here, the first and last record of this resultant table has out data of interest. We use the Select Records tool to extract these rows out of this data-set.

You can find out the expanse of you data in the results window. That is where I got this number – 68.

We now have the result ready however, we need some more information here. We need to add the Sponsor name and remove columns that don’t appear in the output.

We use a join tool to join our pseudo result with “Team Details” table on constructorId. Un-check the column that you don’t want to appear in the output and rename the columns as they appear in the output.

Our current output looks like this –

If you compare it with the output we need, you’ll observe that the youngest champion appears first. We will conclude this workflow with a very silly operation using the sort tool.

And that concludes our Weekly Challenge #167! Here’s how the workflow looks in the end –

There are more than one ways to approach this problem.

This is my maiden attempt at a walk-through and I welcome suggestions for the many walk-through’s to follow. Consider subscribing to my blog for a weekly update!

 

Advertisement

Published by

viraj1596

Permanent β | Analyst | Philosopher | Learner

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s