Detection and Correction: Data Prep Pitfalls

Big Data is the talk of the town these days, but at this point we aren’t nearly as concerned with Clean Data.  According to David Kirkpatrick, founder of Techonomy, we create as much information in two days as we did from the dawn of man through 2003.  How much of that explosion is due to user-generated content, selfies, and Rick Roll videos can be left up for debate, but if we don’t start implementing tools and strategies to prepare and cleanse our data appropriately, we may end up buried under a heap of worthless data.

When it comes to data prep, knowledge of the data itself is the most important tool you can arm yourself with.  With that knowledge you can identify which relevant data you need to access.  Bringing all of the different information together can be a strenuous process and leaves us with many challenges when preparing the data for reporting or predictive modeling.  In another blog entry we discussed ways to identify data errors; here we address some of the common challenges that come up in data exploration.

Monotonic Variables
A monotonic variable is one that increases without bound, such as a date, a social security number, or a customer ID.  It is the transformation of these variables that allows us to glean information about the individuals in our data set.  For instance, if we have a customer’s last interaction date, we can create a new variable representing the number of days since that interaction.  With customer IDs, simply adding a text component ensures that the variable is not later mistaken for a continuous one.
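Both transformations can be sketched in a few lines of pandas.  The column names and values here are purely illustrative, not from any real data set:

```python
import pandas as pd

# Hypothetical customer table; column names are illustrative.
df = pd.DataFrame({
    "customer_id": [1043, 1044, 1045],
    "last_interaction": pd.to_datetime(["2024-01-05", "2024-02-20", "2023-12-01"]),
})

# Turn the raw date into an analysis-ready measure:
# days elapsed since each customer's last interaction.
as_of = pd.Timestamp("2024-03-01")
df["days_since_interaction"] = (as_of - df["last_interaction"]).dt.days

# Prefix the numeric ID with text so that no downstream tool
# ever treats it as a continuous variable.
df["customer_id"] = "CUST-" + df["customer_id"].astype(str)
```

The days-since measure preserves the information in the date while giving the model a number it can actually use.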

Outliers
An outlier is a single, or very low frequency, occurrence of a value that is far away from the bulk of the values of that variable.  As a general rule of thumb, if you can positively identify it as a mistake, you can try to correct the entry or treat it as a missing value.  However, the data miner must assume that an outlier is a legitimate value until it is positively identified as an error.
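One common way to surface candidate outliers for review is the interquartile-range rule.  This is a sketch with made-up donation amounts, and the 1.5 × IQR threshold is a convention, not a verdict: flagged rows still need a human look before you correct, blank, or keep them.

```python
import pandas as pd

# Illustrative donation amounts; 9999 may be a data-entry error
# or a genuine major gift -- flag it, don't delete it.
amounts = pd.Series([25, 40, 35, 50, 30, 45, 9999])

q1, q3 = amounts.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)

# Review the flagged rows by hand before deciding what to do.
flagged = amounts[is_outlier]
```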

Sparsely Populated Variables
If a variable is sparsely populated, collapsing it, through binning or other forms of cleansing, may be the best solution where it is possible at all.  Another way to handle these is to create a dummy variable that marks whether or not the information exists.  If, for instance, your dataset contained emails for all individuals but the column happened to be sparsely populated, the mere fact that an email exists might inform you of the type of relationship you have with that individual.
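The dummy-variable approach is a one-liner in pandas.  The email addresses below are placeholders used only to show the shape of the transformation:

```python
import pandas as pd

# Hypothetical, sparsely populated email column.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cam", "Dee"],
    "email": ["ann@example.com", None, None, "dee@example.com"],
})

# A sparse column often carries more signal as a presence flag
# than as raw values.
df["has_email"] = df["email"].notna().astype(int)
```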

Data Consistency
When working with data sets it is important to review the data for consistency.  Manually entered text fields often contain numerous unique values that in fact represent a single entry.  It is important to cleanse these variations so that they are represented as a single entry.  If you are working with binary variables, make sure they are represented uniformly.
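A mapping table is the usual fix for both cases.  This sketch assumes hypothetical state-name variants and yes/no flags; real data would need a mapping built from the actual values observed:

```python
import pandas as pd

# Manually entered variants that all mean the same state.
df = pd.DataFrame({"state": ["CT", "Conn.", "Connecticut", "ct", "NH"]})

# Normalize case and whitespace, then map variants onto one
# canonical value per state.
variants = {"conn.": "CT", "connecticut": "CT", "ct": "CT", "nh": "NH"}
df["state_clean"] = df["state"].str.strip().str.lower().map(variants)

# Binary fields get the same treatment: one uniform representation.
flags = pd.Series(["Y", "Yes", "1", "N", "no"])
flags_clean = flags.str.lower().map(
    {"y": 1, "yes": 1, "1": 1, "n": 0, "no": 0}
)
```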

Data Pollution
One of the most common forms of data pollution occurs when users stretch a system beyond its means.  For instance, in Higher Ed, where no separate column existed to record ACT scores, they were entered in the SAT column.  Parsing this data apart is essential.
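In this particular case the split is feasible because the score ranges do not overlap: SAT totals run 400 to 1600 while ACT composites run 1 to 36.  A sketch, assuming a hypothetical mixed column:

```python
import pandas as pd

# Hypothetical column where ACT scores were keyed into the SAT field.
df = pd.DataFrame({"sat": [1280, 31, 1450, 24, 1100]})

# SAT totals are 400-1600; ACT composites are 1-36, so the ranges
# do not overlap and the values can be pulled apart cleanly.
df["act_score"] = df["sat"].where(df["sat"] <= 36)
df["sat_score"] = df["sat"].where(df["sat"] >= 400)
```

Most pollution problems are messier than this; the general lesson is to find whatever rule distinguishes the mixed populations and split them into separate columns.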

Zip Codes
Zip codes are a frequent issue when preparing data.  The relationship between a zip code and other variables cannot be identified from the code itself.  In Rapid Insight Software there is an easy way to calculate the distance between zip codes, returning a value that we can then use.
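The underlying idea is to convert each zip code to a centroid latitude and longitude and take the great-circle (haversine) distance between them.  This sketch uses a two-entry lookup with approximate coordinates; a real workflow would use a full zip-to-coordinate reference table, and this is not how Rapid Insight implements it internally:

```python
from math import radians, sin, cos, asin, sqrt

# Hypothetical centroid lookup with approximate coordinates.
centroids = {
    "03431": (42.95, -72.29),  # Keene, NH (approximate)
    "02108": (42.36, -71.07),  # Boston, MA (approximate)
}

def zip_distance_miles(zip_a, zip_b):
    """Great-circle distance between two zip-code centroids, in miles."""
    lat1, lon1 = map(radians, centroids[zip_a])
    lat2, lon2 = map(radians, centroids[zip_b])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 3956 * asin(sqrt(a))  # 3956 = Earth's radius in miles

dist = zip_distance_miles("03431", "02108")
```

The resulting distance is a continuous variable a model can actually use, unlike the zip code itself.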

Missing Values
First, it is important to differentiate, if possible, between values that are missing and those that are empty.  Missing values were not entered but most certainly exist; gender is a good example.  Empty values are ones that do not actually exist.

The real question is how to replace the missing values.  Unfortunately there isn’t a single catch-all answer; it depends entirely on the variable and your knowledge of it.  You could fill with the variable mean, fill with zero, or exclude the entire record, and there are also several methods for estimating an appropriate value to plug in.  Identify which relationships are important to preserve, and that will indicate how to find the estimate for those missing values.  The key is that replacing missing values must not introduce a pattern that doesn’t already exist in the natural data.
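Two of the simplest options, mean-fill and record exclusion, look like this in pandas.  The ages are made up, and the "was missing" flag is a common companion practice so that any pattern introduced by the fill stays visible downstream:

```python
import pandas as pd

# Hypothetical column with two missing ages.
df = pd.DataFrame({"age": [34, None, 51, 28, None, 45]})

# Option 1: fill with the variable mean.
df["age_mean_filled"] = df["age"].fillna(df["age"].mean())

# Companion flag: record which values were imputed, so the
# imputation itself does not masquerade as real data.
df["age_was_missing"] = df["age"].isna().astype(int)

# Option 2: exclude records with a missing value entirely.
complete_only = df.dropna(subset=["age"])
```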


How can Rapid Insight Veera™ help?

Although these challenges always risk being present in data clean up, there are tools to ease the pain that make the process smoother, faster, and way more fun! Sometimes finding the right software for the job can be a challenge in itself. Rapid Insight Veera™ makes that job a piece of cake. This data blending and preparation tool has a user-friendly, drag-and-drop interface that will make you a data superhero, especially with repeatable processes and automated reporting. With Veera, users can cut their data clean up down to a fraction of the time and then automate the process for convenience. Don’t let your Big Data mess get the best of you.

Remember, keep the seven challenge points above in mind and find the right method for you, and big data clean up should be as easy as 1, 2, 3.

– Jon MacMillan, Rapid Insight Data Analyst



Experience Rapid Insight

No risk, all reward! Download a free, fully functional 14-day trial of Veera Workstation and Rapid Insight Analytics today, and a member of our analyst team will help you get the most out of your trial. 
