Data Cleanup: Are We Done Yet?

Jon MacMillan
Senior Data Analyst

People, and by people I mean data nerds, say that 80% of the work in predictive modeling lies in data preparation. Oftentimes, the person building the model is wearing many hats and is likely also the one doing the tedious work of preparing the data. But data cleanup finds its way into all aspects of our work, especially when it comes to creating reports.

If we are doing our jobs, though, it shouldn’t be evident how much data preparation was required to produce a report. To end users, all that matters is that they see a concise report with information they can act on immediately. All of the different data sources, merges, calculations, and cleanses are hidden from view, without a trace. What takes only minutes to digest was the product of hours, weeks, or months of preparation.

This was the case when I created a dashboard around some Transportation Security Administration (TSA) claims data. The data includes every claim filed with the TSA between 2002 and 2017 in which an airline passenger was injured or their property was lost or damaged during the screening process. For the sake of everyone who generates these types of reports, I’ll walk you through the process that I, and many other data scientists, go through each time we take on a new project. I’ll explain the steps that must be taken, and drill down into each chart to show how much work actually goes into every one.

To start, let’s take a look at the TSA claims dashboard I created in Tableau.

This dashboard allows you to select any airport in the US and see how that airport fares (terrible pun intended) when it comes to the number of TSA claims filed against it. While the data is fun to look at, let’s walk through each chart to understand why it is there, how it came to be, and what major roadblocks I hit along the way.

TSA Data Mapped

Why? Simple: people like maps. The map also acts as an easy filtering option for the rest of the dashboard; you can quickly locate the airport you are interested in and select it to see more information about that airport.

How? I had to compile all of the different files of TSA claims data. This involved converting the 2016 and 2017 data from PDF to Excel and using Veera Construct to clean up issues with the conversion. Then I had to append the data from all years together and merge it with some additional airport location data that I tracked down, since it wasn’t included in the TSA data.

Major roadblocks? The PDF conversion wasn’t simple and required extensive data cleanup after the fact. Even if you aren’t working with PDF files, data access is always a roadblock; finding and accessing the right data is often one of the most time-consuming parts of data prep.

Time? Two hours.
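For anyone curious what that step could look like outside of Veera Construct, here’s a minimal pandas sketch of the append-and-merge. The file names and columns (“Airport Code”, “Latitude”) are hypothetical stand-ins, not the actual TSA field names.

```python
import glob
import pandas as pd

# One (hypothetical) Excel file per year of TSA claims, after PDF conversion.
claim_files = sorted(glob.glob("tsa_claims_*.xlsx"))

# Append all years into a single table.
claims = pd.concat(
    (pd.read_excel(f) for f in claim_files),
    ignore_index=True,
)

# Merge in the separately sourced airport locations so claims can be mapped.
locations = pd.read_csv("airport_locations.csv")
claims = claims.merge(locations, on="Airport Code", how="left")

# Claims with no matching airport are worth a look before they hit the map.
print(claims["Latitude"].isna().sum(), "claims are missing a location")
```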

Total Passenger Data Cleanup and Standardization

Why? The number of claims matters, but it is also important to look at total passenger traffic to get a better idea of the ratio of claims to passengers. Also, it kind of looks like a video game, which is pretty cool.

How? I had to combine the previously cleaned-up claims data with some additional airport traffic data in order to find the total passenger traffic between 2002 and 2016.

Major roadblocks? The traffic data I found could only be downloaded as individual files, one per year. This meant I had to append all of those files together and then merge the result with the already cleaned-up TSA and airport location data.

Time? Two hours.
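A rough sketch of that yearly append plus merge in pandas might look like the following; the per-year file names, the “Passengers” column, and the join key are assumptions rather than the real source layout.

```python
import glob
import pandas as pd

# One (hypothetical) traffic file per year, since the source only offered
# per-year downloads.
traffic_files = sorted(glob.glob("passenger_traffic_*.csv"))
traffic = pd.concat(
    (pd.read_csv(f) for f in traffic_files),
    ignore_index=True,
)

# Total passengers per airport across all of the downloaded years.
totals = (
    traffic.groupby("Airport Code", as_index=False)["Passengers"]
           .sum()
           .rename(columns={"Passengers": "Total Passengers"})
)

# Attach the totals to the already cleaned-up claims data.
claims = pd.read_csv("tsa_claims_clean.csv")
claims = claims.merge(totals, on="Airport Code", how="left")
```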

Data Cleanup for Claim Type and Claim Site

Why? I wanted to show the yearly trends on both the monetary value of claimed items and the total number of claims. This allows you to see if TSA claims are increasing or decreasing, by both types of claim and location.

How? The fields used here were in the original TSA data, so nothing needed to be added to the dataset.

Major roadblocks? Both the claim site and the claim type fields contained a lot of irrelevant noise. I had to go through and cleanse both fields, and also set filters to remove some of the clutter.

Time? One hour.
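The cleansing itself boils down to trimming and standardizing the two fields and dropping the junk values. A minimal pandas sketch, with made-up placeholder values:

```python
import numpy as np
import pandas as pd

claims = pd.read_csv("tsa_claims_clean.csv")

# Strip stray whitespace, then turn placeholder junk values into missing values.
for col in ["Claim Site", "Claim Type"]:
    claims[col] = (
        claims[col]
        .astype(str)
        .str.strip()
        .replace({"-": np.nan, "": np.nan, "nan": np.nan})
    )

# Filter out the clutter before charting the yearly trends.
claims = claims.dropna(subset=["Claim Site", "Claim Type"])
```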

TSA Data for the Curious

Why? Let’s be honest, we’re all (a little, or very) nosy. I wanted to show the most common types of items claimed at each airport, as well as their total monetary value.

How? All of this data was in the original TSA claims data source as well.

Major roadblocks? This data was also very messy. For example, there was one category for computers and accessories and another for computer – laptop. This happens all the time, because there is no universal key for categorizing lists like these. I had to cleanse this field in order to visualize it effectively.

Time? One hour.
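The consolidation can be as simple as a few keyword rules. Here’s a sketch with a hypothetical “Item” column and invented labels; the real descriptions are messier:

```python
import pandas as pd

claims = pd.read_csv("tsa_claims_clean.csv")

def consolidate_item(item: str) -> str:
    """Collapse near-duplicate item labels into one umbrella category."""
    label = str(item).lower()
    if "computer" in label or "laptop" in label:
        return "Computers & Accessories"
    if "camera" in label:
        return "Cameras"
    if "jewelry" in label or "jewellery" in label:
        return "Jewelry"
    return str(item).strip()

# "Item" is an assumed column name for the claimed item description.
claims["Item Category"] = claims["Item"].apply(consolidate_item)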

Data Cleanup for a Box and Whisker Plot

Why? A box and whisker plot is a good way to quickly determine which airports have the lowest number of claims relative to their overall traffic. Any airport below the median line at the center of the box has a comparatively good claims-to-passenger ratio.

How? This chart relies on the additional airport traffic data merged in Veera Construct, the same combination of TSA claims and airport traffic behind the video game-esque plane chart mentioned above.

Major roadblocks? A combination of the roadblocks previously mentioned. I had to calculate a new field (‘Claims Per 1M Passengers’), and I also had to remove some outliers: a few very small airports with only a handful of TSA claims ended up with claims-per-1M-passenger values above 100, which skewed the entire chart.

Time? 30 minutes.
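Here’s roughly how the calculated field and the outlier filter might look in pandas, assuming hypothetical column names (“Claim Number” as a claim identifier, plus the “Total Passengers” field merged in earlier):

```python
import pandas as pd

claims = pd.read_csv("tsa_claims_with_traffic.csv")

# Count claims per airport and carry along its total passenger traffic.
per_airport = claims.groupby("Airport Code", as_index=False).agg(
    Claims=("Claim Number", "count"),
    Passengers=("Total Passengers", "first"),
)

# The new calculated field used in the box and whisker plot.
per_airport["Claims Per 1M Passengers"] = (
    per_airport["Claims"] / per_airport["Passengers"] * 1_000_000
)

# Drop the tiny airports whose ratio blows past 100 and skews the whole chart.
per_airport = per_airport[per_airport["Claims Per 1M Passengers"] <= 100]
```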

Overall time to create a clean dataset

All in all, this report took me five and a half hours to create. That doesn’t even include the steps I had to take in Tableau to format and clean up titles and other small details. As I mentioned earlier, at a quick glance (and in a perfect world), the end user should be able to digest the information in no more than a couple of minutes. It’s also inevitable that after they finish absorbing (and maybe dissecting) the dashboard, they will ask for more data, or even a different chart. It’s crucial that end users understand what those requests actually entail, and how much time goes into creating a “clean” dataset.

In the end, the person who is acting on the data sees this:
However, the person preparing the dataset sees this:

Hopefully this explains why, when I get a request for five different flavors of an existing report or visualization, I get that sad look on my face. Though, to be fair, data cleanup could easily take someone days rather than hours. Data cleansing tools like Veera Construct dramatically speed up this tedious process, and, just like that, the sad face is gone.