Seven Essential Data Prep Tasks for Self-Service Analytics
Data analytics is an important component of strategic decision making for nearly every industry. Because insightful reports are only as good as the quality of data that feeds them, efficiently sourcing and then prepping the data are critical first steps. The data collection landscape has changed with the growth of business tools like SalesForce and QuickBooks. These tools allow for easy on-demand access to customer information and live data on transactions and business processes. Lag time associated with data entry is being reduced and general access to data is increasing due to the ease of use and adoption of such tools.
Furthermore, legacy data storage products have been given new life with the increased availability of analytics tools. Large databases systems like Oracle, SQL, or MySQL that have been used for years for collecting data points on customers and business activity can now be integrated with modern data analytics tools, making it easier to analyze that data in new ways that are less cumbersome than previous methods.
Sourcing and then analyzing your data may sound complicated enough to consider hiring an outside firm. However, the process of self-service data analytics is becoming relatively simple. Previous data analytics hurdles like using complex coding languages or dense database management procedures are becoming less necessary with new tools that help users build reports and discover insights with little technical knowledge.
In most cases, there are a variety of tasks that need to be completed with the data prior to analysis. These tasks are referred to as data prep because they help prepare and then transform the data from its raw state to a more optimal state for observing actions and patterns within the data. This post will explore seven essential data prep tasks a typical user will encounter when preparing to analyze a dataset.
Aggregating is sorting data and then expressing the data in summary form.
Here we have a list of Nobel Peace Prize Winners that includes the country they hail from which we’ve aggregated to get the total number of Peace Prize Winners from each country.
Filtering a dataset narrows it down to just a specific group of records that we’re interested in.
Here we have a list of elements and their element type which we’ve filtered down to just the metal elements.
When data is scattered in multiple datasets, merging allows you to combine the relevant parts of those datasets to create a new file to work with.
Here we have two datasets: one contains a column for state and state capital; the other dataset contains a column for state population. We’ve merged these datasets together to create one dataset that contains state, state capital, and state population.
To append two datasets is to stack them to create one larger dataset. Usually when appending data, the datasets contain the same (or very similar) fields.
Here we have two datasets: one contains all of the superhero movies released in 2013 and the other contains all of the superhero movies released in 2014. We’ve appended these datasets together to create one stacked dataset that contains all of the superhero movies released in 2013 and 2014.
To dedupe is to remove duplicates from a dataset.
Here we have a list of emails from our Mickey Mouse Club newsletter, which includes a couple of duplicates (highlighted). We’ve deduped this list so that we have one single entry for each person (or mouse, or duck, or dog) on our list.
To transform a column is to perform an operation on a column or using that column that results in a new outcome. This could be a new variable entirely or a different version of the inputted column.
Here we have a dataset that contains the first and last names of our earliest presidents. We’ve combined the two columns to create a brand new column, “Full Name”.
To cleanse a column is to clean up the values within that column, commonly by replacing them.
Here we have a gender column that we’ve pulled out of our database. We’ve noticed that our entries for gender are not uniform — for example, “female”, “fem”, and “F” can all represent “female”, so we’ve cleansed the data to make the entries more consistent within the column.
Data analytics is quickly becoming a form of information gathering that can bring a high degree of precision decision making without the need for advanced technical skills. By combining some of the simple data prep tasks above, an organization can easily use the data they already have to gain further insight into their operation in order to make strategic decisions. Data prep tools like Veera Construct can help in-house analysts easily source and blend data, mitigating the need for outside firms or expensive analytics packages—helping companies like yours save time and money.
For more information on the basics of data prep, watch an on-demand video of our Data Prep 101 webinar. During the presentation, I break down the basics of data prep with a review of common tasks required for reporting, predictive modeling, and ad hoc analysis. I also discuss best practices and tools for tackling data prep challenges, including a live demonstration of real-world projects in Veera Construct.