Daniel Lee: Former Google Data Scientist Shares “Must-Know Tips” for Building Effective Data Teams
Meet the Guest
Dan is the founder of datainterview.com, which helps aspiring and seasoned data professionals prepare for interviews by offering practice problems, prep courses, and one-on-one coaching. Prior to founding datainterview.com, Dan served as a data scientist at a variety of companies, including Google and PayPal, where he worked on a wide array of projects, ranging from financial forecasting to fraud detection.
Top 3 Key Insights from Daniel Lee
There is a large gap between what universities teach and the skills needed to perform tasks in analytics roles.
Employers need to upskill employees not only to improve their performance but also to retain them; otherwise, employees will look for roles that offer the technologies and work they find appealing.
Data teams need to think critically about how to templatize their work and share that work with others to improve organizational efficiency and performance.
READ THE TRANSCRIPT
So nice to meet you, Dan. You founded datainterview.com. What a great URL. I had a chance to take a look at your site and I think it's an outstanding product. I took about a six-month break in my data science career about three years ago now, and I ended up pivoting into the education space where we both now work. But I really wish I had found a resource like yours while I was upskilling and reskilling. Even after working in the industry for several years, I was working with a somewhat older tech stack and facing the reality, particularly after relocating to the Bay Area, that these tech stacks are constantly progressing. The focus has often shifted from pure model building to MLOps, which is a huge skill that folks need to have if they want to get those top-level roles, and which you address a lot in your content.
Thank you. You know, I started datainterview.com back in 2021. I was just hunting for the right domain, and I was surprised datainterview.com was wide open. Even though it was $3,000, I'm like, oh, it's gonna pay for itself. It's a very memorable domain name. So I decided, you know, I'm going to scoop it up. And so yeah, I think the field of data science, and data itself, has evolved over the course of the past 10 years. I jumped into the field in 2016 after studying statistics in college, and over the course of the past seven years, I’ve really seen the transformation from the 2016 tech stack to the 2023 tech stack. And now we have the emergence of AI and a bunch of different tools. So what do you need to learn, right? Just when you’ve finished a course, there’s a new thing that’s just emerged. So it’s like, how do you keep up with all these things?
Absolutely. And you mentioned you got into the field in 2016. You studied statistics at Virginia Tech. What made you decide to jump into statistics from the get-go and the broader data field as a career?
That’s a good question. And I’ll be very frank, I actually didn’t jump into statistics right off the bat. I changed my major multiple times. I have this sort of ADHD thing where I’m like, you know what, this seems interesting, let me pursue this. Then I get bored, and then it’s something else. I hopped around a little bit. I went in thinking I’m going to be a doctor because my parents are like, you’re either a doctor or a lawyer or you’re a loser. And then I was like, you know what? I don’t really wanna be pre-med. And then I’m like, oh, finance sounds interesting, so I pursued finance for a little bit. And then I realized I don’t think I really want to do this. And then I changed my major to creative writing, then psychology, and I was doing a minor in statistics. This was around 2014, and I heard about data science, which HBR was calling the sexiest job of the 21st century, so I’m like, OK, let me just jump on that. And the rest is history. I really fell in love with the profession. I love the idea of taking this highly unstructured, random jumble of numbers and trying to make sense of it. You try to build a model out of it. And I thoroughly enjoy that process from beginning to end. I took a data science course at Virginia Tech, and then I’m like, you know what? This is the profession I really wanna pick. And so in 2016, I applied to a couple of jobs. There was one company, Booz Allen Hamilton, a consulting firm in DC that mostly does government contracting. And they were gracious enough to offer an interview. I got in and worked there for the first few years of my career.
I had a similar path. I was an econ major as an undergrad, and I kept finding that so many of the courses I wanted to take had stats prerequisites. I didn’t end up with a minor, which I regret since I was one course away and I took a pretty easy course load senior year. It was a strong interest of mine, but the field wasn’t quite as hyped up when I graduated in 2010. I ended up taking a circuitous route into supply chain. I had a similar mindset of not knowing quite what I wanted to major in. I liked the idea of tackling a diverse array of subjects, and data analytics and data science is one of those domains where you have a really transferable set of skills that lets you peek into different domains, so you’re not really tied to a specific industry or problem type, which is great. I was starting to look into supply chain master’s programs when I saw that one of the schools had an analytics master’s, and I was like, what is this? This is way better. I want a chance not to have to stay in supply chain forever.
Let’s turn to your experience at Virginia Tech and what you’re seeing from a lot of your clients. Maybe they’ve done some undergrad training and are looking to upskill a bit more to get industry-ready.
What are some of the biggest gaps in terms of what you see universities teaching vs. what companies are looking for?
That’s a really good question. I would say that right now the biggest hurdle sort of depends on the main focus you’re pursuing, right? Data science is in a way a rebranding of classical statistical methods, where you work with a limited data set, you might run some simple statistical inference, create some pretty-looking visualizations, and then showcase that, right? But in real life, there’s a lot more complexity around this. The data you work with is often not really ideal. It’s not this perfect CSV file. You might receive a dump of 20 to 25 different CSV files. Sometimes you can’t even open them in Excel. Sometimes it might not be something you can load on your computer because you don’t have enough RAM or even storage.
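One common way to cope with a dump that will not fit in memory is to stream each file row by row and keep only running aggregates. The following is a minimal sketch of that idea using only the Python standard library; the file names, the `region` and `amount` columns, and the helper names are hypothetical stand-ins, not anything described in the interview.

```python
import csv
import glob
import os
import tempfile
from collections import defaultdict


def total_by_key(paths, key_column, value_column):
    """Sum value_column per key_column across many CSV files, one row at a time."""
    totals = defaultdict(float)
    for path in paths:
        with open(path, newline="") as handle:
            # DictReader yields a single row per iteration, so memory use
            # depends on the number of distinct keys, not on file size.
            for row in csv.DictReader(handle):
                totals[row[key_column]] += float(row[value_column])
    return dict(totals)


def write_demo_files(directory):
    """Create two tiny stand-in files; a real dump would be far larger."""
    rows_per_file = {
        "part_01.csv": [("east", "10.5"), ("west", "4.0")],
        "part_02.csv": [("east", "1.5"), ("north", "7.0")],
    }
    for name, rows in rows_per_file.items():
        with open(os.path.join(directory, name), "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["region", "amount"])  # hypothetical columns
            writer.writerows(rows)


with tempfile.TemporaryDirectory() as demo_dir:
    write_demo_files(demo_dir)
    demo_paths = sorted(glob.glob(os.path.join(demo_dir, "part_*.csv")))
    demo_totals = total_by_key(demo_paths, "region", "amount")

print(demo_totals)
```

The same pattern scales to 20 or 25 files by changing only the glob pattern, since no file is ever held in memory as a whole.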
Not only that, but sometimes the objective itself is not clear. You’re not given a homework assignment saying, OK, you’re going to be rated on whether you can answer A, B, and C. In real-life data science projects, you often don’t even know what the key objective is, because the stakeholders themselves say they want a model but don’t necessarily know the main objective they want to achieve. So the ambiguity around the project is also one aspect of it. Then there is the modeling process. In school, they’ll teach you dozens and dozens of different algorithms.
They talk about all the variations of regression models, they talk about maybe SVM and all the intricacies of neural network algorithms, but truthfully speaking, on a day-to-day basis, you only use a handful of models: K-means, XGBoost, OLS. It’s a fairly finite set. And in the field, the main focus is the modeling process: feature selection, data cleaning, feature engineering, and picking out the right evaluation metric. The other aspect is model deployment.
So to me, the large gap is that schools are often too focused on the theoretical side of things, the theory of data science, and only a fraction of that actually has practical value. And although the statistics curriculum did train me in the fundamentals of statistical theory, I would say 80 to 90% of what I do on the day-to-day job I learned on the job. There were a lot of struggles in that, and I feel schools need to really understand what a day in the life of a data scientist looks like and design a curriculum specifically for that, rather than teaching esoteric concepts that don't have value. I think that’s where schools need to fill in the gap.
I remember my first crash course on what an enterprise database actually looks like, just learning how to navigate massive star diagrams and learning that even the underlying data of these production tables could be very incorrect, and none of that is really taught. I certainly wish I had a course that let us tap into a production database and ask questions.
Could you talk about the differences in cultures and tech stacks across your time in government consulting, startups, and big tech companies like PayPal and Google?
So when it comes to the government contracting and consulting side of things, there’s definitely a much more rigorous process in place, starting with just being able to get the right level of clearance for the project you’re working on. If the data is really sensitive, you might need something like a secret or top-secret clearance, and then it’s very formal.
I would have to say, honestly, in terms of how you engage your stakeholders, of course you want to be creative with your solutions, but you also have to be very cognizant of the limitations on the technology that you can provide to the stakeholders. So there’s not much wiggle room around the projects you might work on in the government consulting side of the world, based on what I’ve experienced. In terms of the team process, it’s fairly standardized.
In one of the analytics projects I worked on, for example, we had a data science manager, business manager, data engineer, and data scientist. This was a three-month analytics project where we were trying to understand the behaviors around suicide among US Marines, and we were deployed at the Pentagon. We had access to the Marine data and were doing analysis on some of the driving factors of suicide. So this was an end-to-end process. We always had to be really cognizant of what the stakeholder needs were and what the etiquette was when working with military leadership, because we’re talking about very high-ranking members of the command. So there were a lot of things I needed to be mindful of in terms of how I engaged with stakeholders.
Transitioning from consulting to a startup, which later got acquired by PayPal, was very scrappy. It was a complete contrast to what I saw in consulting from the get-go. The post for this role was, what programming language do you use? We don’t care, as long as you get the job done. So I went from this very formal, structured, rigid process to something a lot more nimble, where you’re free to work however you need to, but you just need to get the job done. I think the main difference was that it’s a total hustle culture, because startups are burning a ton of cash and need to prove their value quickly, and they really expect that from each employee. So there was definitely a lot of late-night and weekend work where I was delivering multiple projects at once.
I was working with a large bank delivering their fraud model and there were constant fire drills. You build a model solution and then the production environment crashes because of a bug. You can’t just wait on someone to get that fixed. You have to start diagnosing and trying to figure out ‘what the heck happened to Kafka?’ And it’s not that you even necessarily know how to fix this, but you at least have to make the effort to get the initial diagnosis and reach out to the engineer to fix this ASAP. These types of fire drills would happen a lot, which was a major contrast to the more comfortable, less fire-drill-heavy culture I saw in consulting. Then at Google, I’d say there was a bit of a mix of both.
At Google, I was hired onto the Google Finance team. It was a new team that was being built out within the Finance org. There are a lot of FP&A teams, as Google is a conglomeration of all these different business orgs. You have YouTube, Google Cloud, Google Maps, and so on, and each one has a different FP&A team. They’re doing financial accounting and similar tasks, but there was no data science capability around that. So the finance team at Google thought, hey, we need a dedicated data science team that helps FP&A folks, and it’s going to function as an internal consulting group slash startup. And they happened to like my profile because I had done both. At Google, a lot of the process was already established. There’s already a tech stack. The goal was really to build out a capability from scratch. We had to think about, OK, as a finance data science team, what is the business proposition to those FP&A folks, what are some of the modeling methodologies, and what are our processes as a team in order to be effective? These are the things our team needed to figure out as we went.
So these are some of the processes and experiences I’ve gathered across my roles.
Yeah, I like the Google-level resources with the startup-type project landscape, where you really get a lot of chances to evaluate different opportunities, and even finding a single univariate split in data can add value because no one’s looked at it yet. That’s where I’d like to be if I were going back into the industry. I also had a similar shift from insurance, which is slow-moving, to insurtech, which had a much more advanced tech stack. Switching from a more traditional industry to a more modern one means you’re drinking out of a firehose for a bit.
Do you think organizations, whether enterprise or academia, are keeping pace with really upskilling or reskilling when it comes to analytics and data science?
I would say that in the interview process, there’s a bit of a gap between what’s being taught at school in terms of how you actually prepare for these interviews versus what is often covered in the actual interviews. As an example, say you’re pursuing statistics or data science with the focus of becoming a data scientist or ML engineer. A lot of startups and tech companies, and I’m not saying all companies, have a fairly standard process in terms of what will be asked. There’s going to be a component where they assess candidates on statistics, machine learning, and coding, and coding is often the part that a lot of schools don’t really cover enough because of the focus on statistical or ML theory.
So having said that, you’ve probably seen the Venn diagram of the data science role, where it’s a combination of math and statistics, coding, and domain expertise. There’s quite a bit of emphasis on the stats and math coursework, but there are not enough courses to help students learn more about coding and coding best practices. They might pick up some Python 101 courses and learn some Pandas and NumPy data manipulation, but they don’t really learn software engineering best practices: following the DRY (don’t repeat yourself) principle, version control, containerization, and how to optimize your code. These are things candidates are assessed on, even in some entry-level positions. I think schools need to teach students more about that aspect, and not just have them take out a sheet of paper and solve a statistical inference problem.
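As a small illustration of the DRY principle mentioned here, the sketch below defines a cleaning rule once and reuses it for every column, rather than copy-pasting the same string handling per column. The function names, the record, and its fields are hypothetical examples, not taken from the interview.

```python
def clean_amount(raw):
    """Single place that defines how a raw amount string becomes a float."""
    return float(raw.strip().replace("$", "").replace(",", ""))


def clean_columns(record, columns):
    """Apply the same cleaning rule to every listed column of one record."""
    cleaned = dict(record)  # copy so the caller's record is left untouched
    for column in columns:
        cleaned[column] = clean_amount(record[column])
    return cleaned


# Hypothetical raw record with inconsistently formatted amounts.
raw_record = {"id": "a1", "revenue": " $1,200.50 ", "cost": "$300"}
clean_record = clean_columns(raw_record, ["revenue", "cost"])
print(clean_record)
```

If the formatting rule changes later (say, a new currency symbol), only `clean_amount` needs to be edited, which is the practical payoff of not repeating yourself.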
That makes sense, and I guess in addition to fresh graduates, how many of your clients are folks that have a couple of years of professional experience and aren’t satisfied in their role? Do you think there’s anything their employers could do instead of losing their employees to new opportunities?
So this question is about retention. From what I’ve seen with clients in one-on-one coaching, one of the contributing factors in why a person might leave is stagnation in their growth, and the fact that the work we do can sometimes feel repetitive in nature. Or they don’t feel they have enough ownership, because maybe there is already an established process and you’re just providing one piece, right?
You might have the title data scientist or data analyst and you want to build models. But, and this is something I’ve seen in very complex and bureaucratic institutions, your only ownership of a product is data or feature engineering or exploratory data analysis, and they’re never going to let you do an actual modeling exercise. I think that’s a missed opportunity. If you have someone who actually wants to learn something like that, there are services like Maven Analytics, for example, that do the upskilling side of things to help people learn that and apply it in their jobs. Employers could help their employees learn these skills and apply them in projects. I think that kind of career growth is something employers need to acknowledge and actually support in order to increase retention.
“One of the contributing factors in terms of why a person (data scientist) might leave their current role is stagnation in their growth. The work in data science can sometimes feel repetitive in nature, so the individual can lose interest over time. Or they don’t feel they have enough ownership because maybe there is already an established process and you’re just providing one piece.”
Yeah, I left a role because I was very nervous about my personal tech stack. I was trained on SAS, and around when I joined the field, SAS really started to fall off the face of the earth with the rise of R and Python in particular. It was very scary seeing job postings that I was interested in, because while I held a Principal Data Scientist title, I knew I was going to get completely blown out of the water in interviews without the right skills. I was able to move to a team within the same company that was working with Python so I could build skills there.
What do you think it takes to build a mature data organization and an impactful team?
I’ve seen the beginning of a new data science team at Google, along with the process of it maturing, and I think there’s a business aspect along with the technical aspect. When I’m talking about the business aspect, it's being able to find strategic partners outside of the data science team, because data science is an auxiliary function. You don’t start off by saying ‘I’m going to use data science’; you start by finding actual business problems that need to be solved, and then it just so happens that instead of using Excel, you can use data science as a way to really bring value to the business. So that process for data science needs to be there from the start.
Then you can start building a team around that. So you have some stakeholders, the right sponsorship, and then you mature. You start off by thinking about which capabilities you need to build. I think teams need to figure out, ‘OK, what is our value proposition? Are we just using data science because it’s an awesome catchphrase, or is there an actual segment of problems we could potentially solve?’
As a concrete example, one of the things we realized from the get-go is that we needed to look at what processes were being done manually. And we saw that a lot of the FP&A folks were using Google Sheets for simple heuristics like a moving average, or maybe linear regression, for forecasting. We knew right then that we could definitely standardize the process. Here we could bring out the big guns like machine learning and help them generate the forecast, build the dashboard, and really improve their business KPIs.
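For readers who want to see what those two spreadsheet heuristics amount to, here is a minimal sketch in plain Python. The series `monthly_spend`, the function names, and the window size are hypothetical; this shows the heuristics themselves, not the machine learning forecasts the team went on to build.

```python
from statistics import mean


def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return mean(series[-window:])


def linear_trend_forecast(series):
    """Fit y = intercept + slope * t by ordinary least squares, then extrapolate one step."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    t_values = list(range(n))
    t_mean = mean(t_values)
    y_mean = mean(series)
    covariance = sum((t - t_mean) * (y - y_mean) for t, y in zip(t_values, series))
    variance = sum((t - t_mean) ** 2 for t in t_values)
    slope = covariance / variance
    intercept = y_mean - slope * t_mean
    return intercept + slope * n  # t = n is the next, unobserved period


# Hypothetical monthly figures with a steady upward trend.
monthly_spend = [100.0, 110.0, 120.0, 130.0]
ma_forecast = moving_average_forecast(monthly_spend, window=3)
trend_forecast = linear_trend_forecast(monthly_spend)
print(ma_forecast, trend_forecast)
```

On this trending series the three-month moving average predicts 120 while the linear fit predicts 140, which shows how a moving average lags behind a steady trend and why standardizing the method across teams matters.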
So that was one of the capabilities we built out, or templatized, which leads to the point that once you figure out some patterns in what the stakeholders are doing, you think about how to build an assembly line around that. You can’t constantly build customized solutions, because that’s just not going to scale.
Another example is anomaly detection. We started out by building an anomaly detection tool for one of the Google Ads teams. There are various teams within the Ads org, but we started by proving our capability with one of the teams, then templatized it for other peers to use, along with documentation and other items. Once we built that, we could take the solution and apply it to different cases, replicating the process. With that kind of templatization from one project to the next, you always want to make sure you can reuse it, thereby improving efficiency.
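To make the idea of templatizing concrete, the sketch below keeps the detection logic in one shared function and lets each team supply only a small configuration. It uses a simple z-score rule as a stand-in; the interview does not say which method the Google team used, and `AnomalyConfig`, `find_anomalies`, the metric names, and the data are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class AnomalyConfig:
    """Per-team settings; the detection logic itself is shared."""
    metric_name: str
    z_threshold: float = 3.0


def find_anomalies(values, config):
    """Return indices of points whose absolute z-score exceeds the threshold."""
    if len(values) < 2:
        return []
    center = mean(values)
    spread = stdev(values)  # sample standard deviation
    if spread == 0:
        return []  # a constant series has no outliers under this rule
    return [
        index
        for index, value in enumerate(values)
        if abs(value - center) / spread > config.z_threshold
    ]


# Two hypothetical teams reuse the same template with their own settings.
# A threshold of 2.0 is used because very short series cap how large a
# z-score can get.
clicks_config = AnomalyConfig(metric_name="daily_clicks", z_threshold=2.0)
spend_config = AnomalyConfig(metric_name="daily_spend", z_threshold=2.0)

daily_clicks = [100, 102, 98, 101, 99, 100, 250, 97, 103, 100]
daily_spend = [20.0, 21.0, 19.5, 20.5, 20.0, 20.2]

click_anomalies = find_anomalies(daily_clicks, clicks_config)
spend_anomalies = find_anomalies(daily_spend, spend_config)
print(click_anomalies, spend_anomalies)
```

Onboarding another team then means writing a new `AnomalyConfig` rather than a new detector, which is the reuse the templatization argument is about.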
But you also have to think about some of the processes that could be further improved. Maybe you see that many data science projects need a dedicated feature store. Instead of having an engineer or data scientist building a completely different set of feature stores, there is some commonality where we can build that out. So as a team, you want to think about how to find overlapping work and find a dedicated single feature store or code repository.
These are things you always want to think about as a team: “Are there ways in which we can cut costs in terms of time and effort? And are there ways in which we can prove out additional capability?” When data science teams are cognizant of these questions and actually put in the effort, the impact is significant.
Makes sense! Let’s talk about your business for a minute. What made you take the leap from industry into interview prep? I looked at your platform and it’s incredible. It’s exactly what I would want if I were getting ready to apply for jobs.
So I started datainterview.com in 2021 because I personally saw pain points in the interview process for data roles: data engineering, data analytics, data science, and ML engineering. I’ve been a candidate multiple times since 2016, about every other year, and when I was applying for a job, I couldn’t really find any good resources. I’d look up the top 100 interview questions for data science and see the top 5 questions being “what is pandas” or “who is your favorite data scientist?” I’m like, are you kidding me? Companies don’t ask this. “Who is your favorite data scientist” doesn’t really say anything about whether the candidate is good for the job or not. I thought, OK, why don’t I try to walk the talk here? I’m going to create a system for myself that actually works, and then build a business around it. So in 2018-2019, I was going through multiple interviews, and along the way I was collecting questions. I was creating a system around them, which I was using myself. And then I got into Google and realized, I think there’s value in this, right?
So I started this as a side hustle where I’m just creating some YouTube videos here and there and marketing my business. And then it started gaining a lot of traction, and that’s when I realized, you know what? Let me jump off board, and actually pursue this full-time, and that’s what I did in the summer of 2021.
I have dedicated coaching, courses, boot camps, and other things, and I’m building out more month to month. I started with a simple website, but you always want to refine your process. There were a lot of issues with using WordPress as a front end, for example. So I picked up full-stack development in the winter of last year, crammed a lot of React.js and Next.js, and just decided to go full-stack entrepreneur mode, build it myself, and revamp the website and sales process using this tech stack.
It’s incredible, and your SQL IDE with the practice questions asked by various companies is a great place for candidates to start. I spent a lot of time on Glassdoor looking for the kind of resources you’ve collected there, and it’s nicely packaged. I wish I had your site when I was interviewing.
Thanks for your time today, Dan!