7 Data Cleanup Terms Explained in Plain English
I spent a couple of days at the TDWI Conference in Boston last week and had a lot of great conversations with data analysts from all over the world. I realized while talking with many of them that, for those of us working with data all day long, there is a lot of terminology that gets used when cleaning up data and preparing it for analysis. I thought it would be helpful to provide some explanation of some of the terms we often use when we talk about cleaning data.
To cleanse a column is to clean up the values within that column, commonly by replacing them.
For example, if my dataset contains a ‘State’ field which is sometimes expressed as a two letter abbreviation and sometimes fully spelled out, I would cleanse the values in that columns so that they all matched the same criteria – ie, so that they were all two letter abbreviations or all the fully spelled out version of state name.
Aggregating is sorting data and then expressing the data in a summary form.
For example, if I have a dataset that contains purchase history where each purchase was its own row, and we wanted to know how many purchases a given customer had made, I could aggregate the data by a customer ID field and sum the number of purchases associated with each ID, giving me total number of purchases organized by customer.
Filtering a dataset narrows it down to just a specific group of records that you’re interested in.
For example, if I have a dataset containing information about all the customers my company has in the United States, but I am only interested in customers who live in California, I can filter that dataset to focus on just customers who live in California.
When your data is scattered in multiple datasets, whether it be in spreadsheets, text files or databases, merging allows you to combine the relevant parts of those datasets to create a new file to work with.
For example, if I keep purchase history and client information in two separate datasets, but I am interested in creating a report containing information from both datasets, I would first merge the two datasets together (using a field like ‘Client ID’), then build the report from the resulting dataset.
It’s easy to think of appending as stacking – to append two datasets is to stack them to create one larger dataset. Usually when appending data, the datasets contain the same (or very similar) fields.
For example, if I am tracking daily sales in separate datasets, but eventually I want to combine them to track monthly sales, I could append each of the daily sales datasets together to get one larger monthly dataset.
To dedupe is to remove duplicates from a dataset.
Duplicate values are fairly common. For example, if I’m collecting email address at multiple events, a given person might enter my master email address list several times. If I plan on emailing everyone on my list, I’ll want to dedupe it first to be sure that I’m not emailing the same person more than once.
To transform a column is to perform an operation on or using that column that results in a new outcome – this could be a new variable entirely, or a different version of the inputted column.
For example, if I am trying to measure SAT scores for students but find some students in my dataset who have taken the ACT but not the SAT, I might try applying a transformation to their ACT score to approximate their SAT score.
For more data-related content, stay tuned — next week, I will provide some visual examples of these data cleanup terms. Also, we have a webinar coming up on streamlining the data preparation process. You can register for it here. If you miss it or are unable to attend, we will have an on-demand version available too. Just register and we’ll send you an email with a link to the recording.
Harness the power of many.
Create and share reports and datasets across the enterprise, and put analytical power in the hands of everyone. Veera creates a truly data-driven culture. Try it for yourself today.