What a great question!
This one came up recently during one of the weekly virtual sessions we have with our bootcamp students. It is particularly insightful because there’s no easy answer and it’s one that almost every analyst will face at some point during their data career. The discussion we had as a group was really thoughtful and left me thinking a lot about cars and tires.
I’ve recently spent my weekends watching Formula 1 racing with my son, and I find the pit stops fascinating. It’s almost mesmerizing to watch the choreography between the car, the driver, and the pit crew during each stop.
If you haven’t seen this, the entire process of changing all four tires can happen in under 2 seconds! In fact, the total time a car is stationary is measured down to hundredths of a second. So, even to someone new to watching F1 racing like me, it quickly becomes apparent that these pit stops are critically important to the strategy and success of the team.
In F1 racing, podium positions are decided by fractions of a second, and a clean pit stop really can make a difference. Teams invest significant resources – manpower, tools, and (I’d presume) practice – to make each pit stop as clean as possible, exactly because it is an essential component of their overall success…and that’s why discussing cleaning data during bootcamp made me think about cars and tires!
When I’m faced with the question of how much time to spend cleaning the data, I always try to check in and ask:
What elements of cleaning this data are essential to our success?
How you answer this question will change from project to project, and can even change over time within a single project.
For example: early in my data career, when I worked in a lab setting, ensuring the source data was error-free was of critical importance.
The overall tolerance for errors in a lab needs to be low and, not unlike an F1 pit crew, significant resources are invested and sophisticated processes are developed to make the checking as efficient and effective as possible. Data is not only checked but often double- or triple-checked before going into analysis and reporting. Definitely an F1-level approach to data cleaning!
And this can apply to other areas, too. Think of any scenario where the stakes are high and/or the tolerance for risk or error is low; there, if you uncover significant issues in the source data, the next right step may be to advocate for taking the time to clean the data before moving into analysis and reporting. The investment of time, resources, and/or effort in these scenarios may be critical to support the outcomes the business is striving for.
In the example we discussed in our bootcamp, the situation came from a project data set where an error was detected in 3 rows out of a total of almost 200,000.
The group talked through whether these 3 rows would change any of the conclusions from subsequent analysis (unlikely), and then went one step further to discuss how much effort the correction would take (not much). Ultimately, we decided there were many valid approaches to dealing with these particular errors, but the conversation itself was what proved invaluable.
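To make that kind of impact check concrete, here’s a minimal sketch in Python with pandas. The file name, the order_total column, and the negative-value error rule are all hypothetical placeholders for whatever applies to your own data set:

```python
import pandas as pd

# Hypothetical project data; swap in your own file and columns.
df = pd.read_csv("project_data.csv")  # ~200,000 rows in our bootcamp example

# Flag the rows matching the suspected error (here: a negative order total).
bad = df["order_total"] < 0
print(f"Rows affected: {bad.sum()} of {len(df)} ({bad.mean():.4%} of the data)")

# Compare a key metric with and without the flagged rows to see whether
# the errors could plausibly change your conclusions.
metric_all = df["order_total"].mean()
metric_clean = df.loc[~bad, "order_total"].mean()
print(f"Mean with errors:    {metric_all:,.2f}")
print(f"Mean without errors: {metric_clean:,.2f}")

# If the difference is negligible relative to the decisions the analysis
# supports, dropping (or simply documenting) the rows may be enough.
df_clean = df.loc[~bad].copy()
```

The point isn’t this specific code; it’s that a few lines of quantification turn “should we clean this?” from a gut feeling into a conversation grounded in numbers.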
So, when faced with the question of how much time to spend cleaning data, know there is no “one-size-fits-all” answer. But if you focus first on building a solid understanding of how the analysis will be used and the importance of the outcomes it will drive, you’ll be in a much better position to gauge the “right fit” for your particular scenario.
Be passionate. Seek mastery. Learn with humility.
Stacy
Stacy Giroux
Cohort Learning Lead
Stacy is a former Cohort Learning Lead for Maven Analytics, helping to design, manage, and facilitate immersive bootcamp experiences for aspiring data professionals.