Data Prep for Machine Learning: Improve Your Training Data

Alex Ziko
Data Analyst

Following up on my previous blog post on data prep for machine learning, I want to unpack some of the concepts I only had time to mention briefly. In that post, I discussed the very basics of what machine learning is and some of the difficulties involved in preparing data for it. In this post, I cover the finer details of the data prep process, such as handling duplicate records. Identical or nearly identical records can be particularly troublesome, and they are a frequent occurrence when collecting data to train machine learning algorithms. Before we get into the specific tasks of data prep, let's talk about why training data matters in the first place.


Spice Up Data Prep

Data is to machine learning as ingredients are to a dish. It is an abundant raw material, and when it is prepared the right way with the right tools, it can produce incredible outcomes in the form of predictive models and automated processes. However, as any cook knows, if your ingredients are awful, you'll never create a desirable dish. Sometimes you need to choose your ingredients carefully; other times they need to be cut up, cleaned, and specially prepared before they can be used.

Photo credit: Cassie Kozyrkov

When a machine learning model is tested, its results reflect the data that went into crafting it, just as a good recipe is tested repeatedly with its ingredients. After enough testing, the recipe is declared good enough to use, until perhaps a better recipe comes along to make an even better dish, and the process repeats itself.

Duplicate records are common and can misrepresent your training data. Exact duplicates simply take up space: once the algorithm has learned from a record, seeing an identical copy doesn't improve the model in any way. You might also have nearly identical records where one copy carries a mistake, such as a wrong zip code. One of the first steps we can take is to isolate the cohort we want to clean and then look for duplicate records. If we identify patterns and flaws within that cohort, those flaws can guide further data prep so that we don't have to throw records out.
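
Veera Construct handles this step visually, but the idea translates directly to code. Here is a minimal sketch in Python with pandas; the file name and column names are hypothetical stand-ins for your own data:

```python
import pandas as pd

# Hypothetical file and column names, stand-ins for your own data
df = pd.read_csv("students.csv")

# Isolate the cohort we want to clean, e.g. a single entry term
cohort = df[df["entry_term"] == "Fall 2020"].copy()

# Look for flaws within that cohort, e.g. malformed zip codes we could repair
bad_zip = ~cohort["zip"].astype(str).str.fullmatch(r"\d{5}(-\d{4})?")
print(cohort.loc[bad_zip, "zip"].value_counts())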

Making Training Data Appetizing

Within Veera Construct, you can use the FindDup node to spot identical records easily. For example, take a dataset of just over 20,000 mock student records: of the total records in the original cohort, 1,781 of them have duplicate values.
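
A rough pandas equivalent of that FindDup check, continuing the sketch from the previous section (this is an illustration, not the node itself), might look like:

```python
# FindDup-style check: flag every record whose values appear more than once
# keep=False marks all members of each duplicate group, not just the extra copies
flagged = cohort.duplicated(keep=False)
print(f"{flagged.sum()} of {len(cohort)} records have duplicate values")  # e.g. 1,781
```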

We can then use a DeDup node to explore the flagged records and collapse each duplicate group down to a single record. In this instance, subtracting the number of records that survive deduplication from the 1,781 flagged records shows that 941 of them were redundant, truly identical copies.
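
In pandas terms, the same DeDup-style arithmetic could be sketched like this (the counts in the comments come from the mock dataset above):

```python
# DeDup-style step: collapse each duplicate group to a single record
found = int(cohort.duplicated(keep=False).sum())   # records flagged above, e.g. 1,781
deduped = cohort.drop_duplicates()                 # keeps the first record of each group
removed = len(cohort) - len(deduped)               # redundant identical copies, e.g. 941
print(f"{found} flagged, {removed} removed, {found - removed} unique records kept")
```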

If a duplicate value needs to be changed rather than dropped, other nodes can identify and correct it. That could mean standardizing street addresses so that near-duplicate records become exact matches, using a Cleanse node to make string alterations, or applying a Filter to remove the duplicate records completely.
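
As a rough sketch of that cleanse-then-filter idea in pandas (the column names are again hypothetical), normalizing strings first often turns near-duplicates into exact duplicates that the earlier checks can catch:

```python
# Cleanse-style string fixes: normalize fields so near-duplicates become exact
for col in ["street_address", "city"]:             # hypothetical columns
    cohort[col] = (
        cohort[col]
        .str.strip()                               # trim stray whitespace
        .str.upper()                               # consistent casing
        .str.replace(r"\s+", " ", regex=True)      # collapse repeated spaces
    )

# Filter-style removal: drop the now-exact duplicates entirely
cohort = cohort.drop_duplicates()
```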

Knowing what to do with those duplicate records is important, and being able to defend the algorithm you are crafting is even more important. Mastering your data and understanding exactly where improvements can be made also pays off long term, since future algorithms may draw on parts of the same data.

In addition to nodes that fix data mistakes, there are nodes that summarize and transform data as you explore it for further areas of improvement. Exploring your data efficiently and effectively can be a huge time saver, especially since this is an iterative process. For example, aggregating a variable and computing summary values for key variables are great ways to quickly get a 30,000-foot view of your data landscape. You can also sort, rank, and filter so that only the values you want make it into your training data. And since Veera Construct is built to replicate processes easily, the more you explore, the faster you cover new ground: you build on what you have already created instead of reinventing the wheel every time you want to query and investigate.
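
For a sense of what that 30,000-foot view looks like in code, here is a hedged pandas sketch of aggregating, sorting, ranking, and filtering, with hypothetical grouping and summary columns:

```python
# Aggregate a variable for a quick 30,000-foot view of the whole dataset
summary = (
    df.groupby("entry_term")                      # hypothetical grouping variable
      .agg(students=("student_id", "count"),      # named aggregations for key variables
           avg_gpa=("gpa", "mean"))
)

# Sort, rank, and filter so only the values you want reach the training data
summary = summary.sort_values("students", ascending=False)
summary["rank"] = summary["students"].rank(ascending=False)
print(summary[summary["students"] >= 100])
```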

With Veera Construct, it is easy to decide what to do with records that need cleaning: remove them completely, or prep them further and make them their own cohort to use elsewhere in your work. Machine learning algorithms do a great job of rapidly churning data into predictive models, but that's not the only place the value lies. Your training data may be more useful than its current state suggests, which means you can improve it, which will in turn enhance any algorithm that uses it. You can download a trial of Veera Construct, the automated self-serve data prep tool, to explore and remove duplicate records from your own dataset.