7 Data Cleanup Terms Explained Visually

During a recent data conference, a coworker and I realized that there was a bit of a language barrier between “data people” and “non-data people”. For us, this was most apparent when we tried to describe data cleanup operations. While there are technical terms for types of data cleanup, we’ve found that many of them can be described just as easily in plain English. Even with both a technical and a less technical definition in hand, though, it can still be hard to visualize what these terms actually do.

So we asked ourselves, “What does data cleanup look like?”, hoping that a visual answer might be another, less technical way to communicate the meaning of these terms. Below, we’ve created a visual glossary to answer that question and explain some commonly used data cleanup operations.

 

Aggregating

Aggregating is grouping data and then expressing each group in a summary form, such as a count, a sum, or an average.


Here we have a list of Nobel Peace Prize winners and the countries they hail from, which we’ve aggregated to get the total number of Peace Prize winners from each country.
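For readers who prefer code to pictures, the same idea can be sketched in a few lines of Python. This is only a rough illustration: the six winners below are a small sample we picked for the example, not the full list, and the variable names are our own.

```python
from collections import Counter

# Illustrative sample only: (winner, country) pairs, not the full list.
winners = [
    ("Martin Luther King Jr.", "United States"),
    ("Nelson Mandela", "South Africa"),
    ("Desmond Tutu", "South Africa"),
    ("Jimmy Carter", "United States"),
    ("Barack Obama", "United States"),
    ("Malala Yousafzai", "Pakistan"),
]

# Group the records by country and summarize each group as a count.
winners_per_country = Counter(country for _, country in winners)

# Print the summary, largest group first.
for country, total in winners_per_country.most_common():
    print(country, total)
```

Swapping the count for a sum or an average would still be aggregation; only the summary changes.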

 

Filtering

Filtering a dataset narrows it down to just a specific group of records that we’re interested in.


Here we have a list of elements and their element types, which we’ve filtered down to just the metal elements.
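A filter like this might look as follows in Python. The five elements and the "type" labels are an assumed sample for illustration, so treat it as a sketch rather than the exact dataset in the image.

```python
# Illustrative sample only; the field names are assumptions for this sketch.
elements = [
    {"element": "Iron", "type": "metal"},
    {"element": "Oxygen", "type": "nonmetal"},
    {"element": "Gold", "type": "metal"},
    {"element": "Silicon", "type": "metalloid"},
    {"element": "Copper", "type": "metal"},
]

# Keep only the records whose type is "metal"; the original list is unchanged.
metals = [row for row in elements if row["type"] == "metal"]

for row in metals:
    print(row["element"])
```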

 

Merging

When your data is scattered in multiple datasets, merging allows you to combine the relevant parts of those datasets to create a new file to work with.


Here we have two datasets: one contains a column for state and a column for state capital, and the other contains a column for state and a column for state population. We’ve merged these datasets together on the shared state column to create one dataset that contains state, state capital, and state population.
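In code, a merge usually means matching rows on a shared key, which here is the state. The sketch below assumes three states and uses rounded, approximate population figures in millions purely for illustration. It keeps only states that appear in both datasets, which is one of several possible ways to merge.

```python
# Illustrative sample only; populations are rounded, approximate values.
capitals = [
    {"state": "California", "capital": "Sacramento"},
    {"state": "New York", "capital": "Albany"},
    {"state": "Texas", "capital": "Austin"},
]
populations = [
    {"state": "Texas", "population_millions": 29.1},
    {"state": "California", "population_millions": 39.5},
    {"state": "New York", "population_millions": 20.2},
]

# Index the second dataset by the shared key so rows can be looked up by state.
population_by_state = {row["state"]: row for row in populations}

# Combine the matching rows; states missing from either dataset are dropped.
merged = [
    {**row, **population_by_state[row["state"]]}
    for row in capitals
    if row["state"] in population_by_state
]

for row in merged:
    print(row)
```

Note that the two datasets do not need to be in the same order, because rows are matched by state rather than by position.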

 

Appending

To append two datasets is to stack them to create one larger dataset. Usually when appending data, the datasets contain the same (or very similar) fields.


Here we have two datasets; one contains all of the superhero movies released in 2013 and the other contains all of the superhero movies released in 2014. We’ve appended these datasets together to create one stacked dataset that contains all of the superhero movies released in 2013 and 2014.
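Appending is the simplest of these operations to sketch in Python, since it amounts to stacking one list of rows on top of another. The titles below are a partial sample chosen for illustration, not every superhero movie from those years.

```python
# Illustrative, partial samples; both datasets share the same fields.
movies_2013 = [
    {"title": "Iron Man 3", "year": 2013},
    {"title": "Man of Steel", "year": 2013},
    {"title": "Thor: The Dark World", "year": 2013},
]
movies_2014 = [
    {"title": "Captain America: The Winter Soldier", "year": 2014},
    {"title": "Guardians of the Galaxy", "year": 2014},
    {"title": "X-Men: Days of Future Past", "year": 2014},
]

# Stack the second dataset under the first to get one longer dataset.
all_movies = movies_2013 + movies_2014

for row in all_movies:
    print(row["year"], row["title"])
```

Unlike a merge, no rows are matched against each other: the result simply has as many rows as the two inputs combined.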

 

Deduping

To dedupe is to remove duplicates from a dataset.


Here we have a list of emails from our Mickey Mouse Club newsletter, which includes a couple of duplicates (highlighted). We’ve deduped this list so that we have one single entry for each person (or mouse, or duck, or dog) on our list.
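A minimal way to dedupe a list in Python is shown below. The addresses are made-up placeholders on example.com, and the sketch only catches exact duplicates. Real lists often need extra cleanup first, such as ignoring capitalization.

```python
# Hypothetical placeholder addresses, with two exact duplicates.
emails = [
    "mickey@example.com",
    "minnie@example.com",
    "donald@example.com",
    "mickey@example.com",
    "goofy@example.com",
    "donald@example.com",
]

# dict.fromkeys keeps the first occurrence of each value, in original order.
deduped = list(dict.fromkeys(emails))

for email in deduped:
    print(email)
```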

 

Transforming

To transform a column is to perform an operation on or with that column that produces a new result: either an entirely new variable or a different version of the input column.


Here we have a dataset that contains the first and last names of our earliest presidents. We’ve combined the two columns to create a brand new column, “Full Name”.
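As a rough Python sketch, this transformation reads two existing columns and writes a third. The four presidents are a sample, and the column names are assumed to match the ones described above.

```python
# Illustrative sample of early presidents; column names are assumptions.
presidents = [
    {"First Name": "George", "Last Name": "Washington"},
    {"First Name": "John", "Last Name": "Adams"},
    {"First Name": "Thomas", "Last Name": "Jefferson"},
    {"First Name": "James", "Last Name": "Madison"},
]

# Build the new "Full Name" column from the two existing columns.
for row in presidents:
    row["Full Name"] = f'{row["First Name"]} {row["Last Name"]}'

for row in presidents:
    print(row["Full Name"])
```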

 

Data Cleansing

To cleanse a column is to clean up the values within that column, commonly by replacing them.


Here we have a gender column that we’ve pulled out of our database. We’ve noticed that our entries for gender are not uniform — for example, “female”, “fem”, and “F” can all represent “female”, so we’ve cleansed the data to make the entries more consistent within the column.
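One common way to cleanse a column like this is with a lookup table that maps each known variant to a single standard value. In the sketch below, the "female" variants come from the example above, while the "male" variants and the names CANONICAL and cleanse are our own assumptions for illustration.

```python
# Lookup table from known variants (lowercased) to one standard value.
# The "male" entries are assumed variants added for this sketch.
CANONICAL = {
    "female": "female",
    "fem": "female",
    "f": "female",
    "male": "male",
    "m": "male",
}

def cleanse(value):
    """Return the standard form of a value, or the value unchanged if unknown."""
    key = value.strip().lower()
    return CANONICAL.get(key, value)

raw = ["female", "fem", "F", "M", "male", "Male "]
cleansed = [cleanse(value) for value in raw]

print(cleansed)
```

Leaving unrecognized values untouched, rather than guessing, makes it easy to spot entries that still need a rule.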

*

If you’re looking for further explanations of data, analytics, and modeling, I’d recommend checking out our video series, which outlines the modeling process from raw data through to reporting results — in 5-minute bite-sized chunks.

