7 Data Cleanup Terms Explained in Plain English

Photo Courtesy Wikimedia Commons

Photo Courtesy Wikimedia Commons

I spent a couple of days at the TDWI Conference in Boston last week and had a lot of great conversations with data analysts from all over the world. I realized while talking with many of them that, for those of us working with data all day long, there is a lot of terminology that gets used when cleaning up data and preparing it for analysis.  I thought it would be helpful to provide some explanation of some of the terms we often use when we talk about cleaning data.

Cleansing

To cleanse a column is to clean up the values within that column, commonly by replacing them.

For example, if my dataset contains a ‘State’ field which is sometimes expressed as a two letter abbreviation and sometimes fully spelled out, I would cleanse the values in that columns so that they all matched the same criteria – ie, so that they were all two letter abbreviations or all the fully spelled out version of state name.

Aggregating

Aggregating is sorting data and then expressing the data in a summary form.

For example, if I have a dataset that contains purchase history where each purchase was its own row, and we wanted to know how many purchases a given customer had made, I could aggregate the data by a customer ID field and sum the number of purchases associated with each ID, giving me total number of purchases organized by customer.

Filtering

Filtering a dataset narrows it down to just a specific group of records that you’re interested in.

For example, if I have a dataset containing information about all the customers my company has in the United States, but I am only interested in customers who live in California, I can filter that dataset to focus on just customers who live in California.

Merging

When your data is scattered in multiple datasets, whether it be in spreadsheets, text files or databases, merging allows you to combine the relevant parts of those datasets to create a new file to work with.

For example, if I keep purchase history and client information in two separate datasets, but I am interested in creating a report containing information from both datasets, I would first merge the two datasets together (using a field like ‘Client ID’), then build the report from the resulting dataset.

Appending

It’s easy to think of appending as stacking – to append two datasets is to stack them to create one larger dataset. Usually when appending data, the datasets contain the same (or very similar) fields.

For example, if I am tracking daily sales in separate datasets, but eventually I want to combine them to track monthly sales, I could append each of the daily sales datasets together to get one larger monthly dataset.

Deduping

To dedupe is to remove duplicates from a dataset.

Duplicate values are fairly common. For example, if I’m collecting email address at multiple events, a given person might enter my master email address list several times. If I plan on emailing everyone on my list, I’ll want to dedupe it first to be sure that I’m not emailing the same person more than once.

Transforming

To transform a column is to perform an operation on or using that column that results in a new outcome – this could be a new variable entirely, or a different version of the inputted column.

For example, if I am trying to measure SAT scores for students but find some students in my dataset who have taken the ACT but not the SAT, I might try applying a transformation to their ACT score to approximate their SAT score.

 

For more data-related content, stay tuned — next week, I will provide some visual examples of these data cleanup terms.  Also, we have a webinar coming up on streamlining the data preparation process.  You can register for it here. If you miss it or are unable to attend, we will have an on-demand version available too. Just register and we’ll send you an email with a link to the recording.

 

Experience Rapid Insight

No risk, all reward! Download a free, fully functional 14-day trial of Veera Workstation and Rapid Insight Analytics today, and a member of our analyst team will help you get the most out of your trial. 

 

DOWNLOAD FREE TRIAL

Experience Rapid Insight

No risk, all reward! Download a free, fully functional 14-day trial of Veera Workstation and Rapid Insight Analytics today, and a member of our analyst team will help you get the most out of your trial. 

 

DOWNLOAD FREE TRIAL