Data preparation is the most important part of any data project, which is why it often takes up the majority of the time spent on a project.
The time you spend on data prep translates directly into the quality of the final output, and that cuts both ways: more time spent on data prep typically means a better result, while skimping on it means a worse one.
You may have heard the phrase “garbage in, garbage out”, or “GIGO”, before. It’s a popular phrase that means something produced from low-quality inputs will also be low quality. If you try to build a database, data model, report, or dashboard based on poorly prepared data, the output will suffer.
Think about presenting a report to a client and kicking off the presentation with the following insight:
Analyst: “Month over month, our marketing spend increased 100% (up to $200K from $100K), with an ROAS of 3:1. This performance improvement is due to increased spend in paid search campaigns. Based on this performance, I’d recommend continuing with the same spend levels in the next period”
Client: “That’s great to hear about the performance increase, but our total marketing budget for the year is $200K and we didn’t spend $100K last month. How are these spend numbers possible?”
Analyst: “That’s a good question, I’ll have to look into it”
From this point on, can the client trust what the analyst is presenting? Should the client trust the analyst?
The presentation ends and you head back to your desk to dig in and understand what happened. After a QA of the raw data, you realize that the dataset contains duplicate values which artificially increased the spend numbers.
With a proper data prep process, this is something that could have been caught prior to presenting the report.
So, let’s talk about the data prep process. Data prep is all about creating clean, quality data for analysis, and typically involves a combination of three main steps:
Quality Assurance (QA)
Data Profiling
Data Engineering & ETL
For the remainder of this post, I’d like to focus on the first data prep process, Quality Assurance, and discuss some of the key issues you may encounter (we’ll dig into Data Profiling and Data Engineering & ETL in separate posts).
Let’s dig in!
Issue 1: Incorrect data types
Solution: In this example, we can see that the Postal Code and Telephone Number columns contain incorrect data types. Some of the postal code values are missing their leading zero, and the telephone numbers are stored as an integer data type instead of text.
To fix this, you can reformat the columns or re-encode (change the data type of) the fields to ensure the data types are accurate and consistent.
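If you happen to be doing this step in Python with pandas, a minimal sketch might look like the following (the column names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical example: postal codes read in as integers lose their leading zeros,
# and telephone numbers end up stored as integers instead of text
df = pd.DataFrame({
    "PostalCode": [2116, 60614, 84101],  # 02116 became 2116
    "Telephone": [6175550123, 3125550198, 8015550147],
})

# Re-encode both fields as text, then pad postal codes back out to 5 digits
df["PostalCode"] = df["PostalCode"].astype(str).str.zfill(5)
df["Telephone"] = df["Telephone"].astype(str)

print(df.dtypes)
print(df)
```

The same idea applies in any tool: decide on the correct data type first, then convert and reformat so every value in the column follows it.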
Issue 2: Duplicate records
Solution: In this example, we can see duplicate records for Joseph Martinez. Duplicate records should typically be avoided because they can cause artificial inflation or create bias in the dataset.
That said, there are times when you may want to keep duplicate records, but only when they serve a specific and valid purpose, like up-sampling (not covered in this post).
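As a rough sketch, here’s how de-duplication might look in pandas, using made-up records in place of the example above:

```python
import pandas as pd

# Hypothetical records with one fully duplicated row
df = pd.DataFrame({
    "Customer": ["Joseph Martinez", "Joseph Martinez", "Amy Chen"],
    "Spend": [150.0, 150.0, 200.0],
})

# Count exact duplicates before removing anything
print(df.duplicated().sum())  # 1

# Keep the first occurrence of each duplicated record
deduped = df.drop_duplicates(keep="first")
print(deduped)
```

It’s worth counting the duplicates before dropping them; a surprisingly high count is often a sign of a join or load problem upstream rather than a few stray rows.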
Issue 3: Inconsistent categorical values
Solution: Here, we have a mix of full state names (New York & Utah) and abbreviated state names (VT, IL, NJ). To solve this, there are a couple of different approaches we could take.
1. Determine a standard and use find and replace tools to update the naming convention used in the column. For example, always use the full state name or always use the abbreviated state name
2. Create a new column or table to preserve the original source data, and add a new, standardized version for analysis
Either approach will get you to the same spot: a clean set of values that can be used for analysis. Which approach is best will likely vary based on the business requirements.
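As an illustration of the second approach in pandas (the state values and mapping below are just for demonstration):

```python
import pandas as pd

# Hypothetical mix of full and abbreviated state names
df = pd.DataFrame({"State": ["New York", "VT", "IL", "Utah", "NJ"]})

# Keep the source column and add a standardized version using abbreviations
abbrev_map = {"New York": "NY", "Utah": "UT"}  # extend as needed
df["State_Std"] = df["State"].replace(abbrev_map)

print(df)
```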
Issue 4: Empty values
Solution: Empty values, woof. What’s the best way to handle these during QA? As you may be thinking, it’s a giant “it depends”, because the right answer hinges on the data type and the context of the column. One of the worst approaches is to simply fill the empty values with zeros, unless you’re confident the zeros won’t skew the data.
Here are three ways to handle empty values:
1. Keep them – If you’re certain that the empty values are accurate and meaningful, keep them! Sometimes an empty value is just that, empty, and doesn’t need to be filled in with a value
2. Impute them – Impute just means to substitute the empty values with values that make sense. This is a great technique to use when you can accurately populate or estimate the empty data points
3. Remove them – Use this approach when you have a large volume of data and can’t impute the empty values
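Here’s a minimal pandas sketch of all three options, using made-up order data with hypothetical Discount and Region columns:

```python
import pandas as pd

# Hypothetical order data with missing values
df = pd.DataFrame({
    "OrderID": [1, 2, 3, 4],
    "Discount": [5.0, None, None, 10.0],
    "Region": ["East", "West", None, "East"],
})

# Option 1: keep them (leave the NaNs alone if "no discount recorded" is meaningful)

# Option 2: impute them, e.g. fill missing discounts with the median discount
df["Discount_Imputed"] = df["Discount"].fillna(df["Discount"].median())

# Option 3: remove them, dropping rows missing a value you can't reasonably estimate
df_with_region = df.dropna(subset=["Region"])

print(df)
print(df_with_region)
```

Whichever option you go with, document the choice so anyone using the data downstream knows how the gaps were handled.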
Issue 5: Outlier values
Solution: Whoa! Nice work Beverly 💪🏻 Besides bragging about your household income, what’s the best way to handle outliers like this in a dataset? Do you keep, delete, or transform them?
Let’s think about this for a minute. In this case, the average household income would be $2,400,000 when we include the outlier. Does this feel like an accurate representation to you?
Let’s look at four ways to handle outliers:
1. Keep them – In cases where the outlier is valid, and you want to capture its impact on the data, you should keep it
2. Impute them – Similar to empty value errors, impute just means to substitute the outlier with a value that makes sense. When dealing with outliers, you can choose to substitute the values with a cap (maximum value) or replace them with a different number
3. Remove them – Removing outlier values is a good approach when the value is impossible, or when it significantly skews the analysis
4. Transform them – Maybe you want to keep the value but reduce the effect it has on the analysis. In these cases, you can transform the outlier value to reduce its effect
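To make those options concrete, here’s an illustrative pandas sketch. The income values are made up, chosen so the average lands at the $2,400,000 figure mentioned above, and the cap is a hypothetical business-defined ceiling:

```python
import numpy as np
import pandas as pd

# Hypothetical household incomes with one extreme outlier
income = pd.Series([90_000, 105_000, 120_000, 85_000, 11_600_000])
print(income.mean())  # 2,400,000: heavily inflated by one record

# Impute (cap): replace anything above a chosen ceiling with the ceiling itself
cap = 250_000  # hypothetical business-defined maximum
income_capped = income.clip(upper=cap)

# Remove: drop the outlier entirely
income_trimmed = income[income <= cap]

# Transform: keep the value but dampen its influence, e.g. with a log scale
income_logged = np.log10(income)

print(income_capped.mean(), income_trimmed.mean())
```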
--
Alright, so those are the core quality assurance issues you’ll encounter during data prep, and each one requires a slightly different approach to solve. The key thing to understand is that the approach you take should always be aligned with the business outcomes and objectives.
In the next installment of this Data Prep series, we’ll cover Data Profiling – stay tuned!
Learn on 🤘🏻-Aaron
SUPER EARLY BIRD IS HERE!
For a limited time, save 25% on our upcoming Python & Power BI immersive programs!
Explore how our immersive programs with direct instructor access, weekly live sessions, and collaborative environments can elevate your skills and accelerate your career.
Aaron Parry
Partner, CCO & Lead Power BI Instructor
Aaron is a professional analytics consultant and Microsoft Power BI expert, with 10+ years working in business intelligence and marketing analytics. He's an instructor, coach and mentor for aspiring analysts, and has deep experience helping companies develop and implement full-stack BI solutions.