In this project, I clean up a database and perform exploratory data analysis on it.
I start off by looking at a snapshot of our data. After first impressions, I create a duplicate of the table and store the original away, to be referred to whenever needed.
Using ROW_NUMBER() OVER(), partitioned by every column, I can flag rows whose values repeat across all columns, meaning they are duplicates.
I create a second duplicate table, populate it with the data, and delete the flagged duplicates so that only unique rows remain.
We proceed with this second duplicate table as our working dataset.
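The dedup flow above can be sketched in SQLite (window functions require SQLite 3.25+). The table and column names here are illustrative stand-ins, not the actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE layoffs (company TEXT, industry TEXT, total_laid_off INTEGER);
INSERT INTO layoffs VALUES
  ('Acme', 'Retail', 100),
  ('Acme', 'Retail', 100),   -- exact duplicate of the row above
  ('Beta', 'Finance', 50);

-- Number each row within its group of identical rows; any row_num > 1
-- is a duplicate of some earlier row.
CREATE TABLE layoffs2 AS
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY company, industry, total_laid_off
       ) AS row_num
FROM layoffs;

DELETE FROM layoffs2 WHERE row_num > 1;
""")

rows = con.execute("SELECT company FROM layoffs2 ORDER BY company").fetchall()
print(rows)  # [('Acme',), ('Beta',)]
```

In the real dataset the PARTITION BY clause would list every column, so only rows that match on all of them get flagged.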
On to standardization: the “company” column contains words and spaces. The spaces sometimes come before or after the words, creating inconsistencies that may affect future analysis. I use the TRIM function to solve this issue.
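A minimal sketch of the TRIM step, again with an assumed schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE layoffs2 (company TEXT)")
con.executemany("INSERT INTO layoffs2 VALUES (?)",
                [(" Acme",), ("Beta ",), ("Gamma",)])

# TRIM strips leading and trailing whitespace, so ' Acme' and 'Acme'
# collapse to the same value.
con.execute("UPDATE layoffs2 SET company = TRIM(company)")

companies = [r[0] for r in
             con.execute("SELECT company FROM layoffs2 ORDER BY rowid")]
print(companies)  # ['Acme', 'Beta', 'Gamma']
```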
I move to the next column, “industry”. The industry column has entries under both “Crypto” and “Crypto Currency”, which I decide refer to the same thing. Since “Crypto” appears more frequently, I standardize on that. Industry also has some null values and blanks.
Next, I look at cases where a company in the same location appears more than once but has a null or blank industry in one of the entries. To make things easier, I first set the blanks to nulls, then populate the nulls with industry values inferred from the other entries.
After confirming that all inferable entries have been filled, I move to the next column, “country”. There are two different “United States” values because one has a period at the end. I remove the trailing period so the system recognizes them as one.
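A sketch of the trailing-period fix. SQLite's two-argument RTRIM strips the given characters from the right end; in MySQL the equivalent is TRIM(TRAILING '.' FROM country):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE layoffs2 (country TEXT)")
con.executemany("INSERT INTO layoffs2 VALUES (?)",
                [("United States",), ("United States.",), ("India",)])

# Strip any trailing '.' so both United States spellings collapse to one
con.execute("UPDATE layoffs2 SET country = RTRIM(country, '.')")

countries = sorted({r[0] for r in con.execute("SELECT country FROM layoffs2")})
print(countries)  # ['India', 'United States']
```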
Checking the data types of the columns, I discover that the date column is stored as text. I convert it to the more appropriate date data type.
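In MySQL this conversion is typically STR_TO_DATE(`date`, '%m/%d/%Y') followed by ALTER TABLE ... MODIFY COLUMN `date` DATE. SQLite has no native DATE type, so this sketch rewrites the text into ISO 8601 (YYYY-MM-DD), which sorts and compares correctly; the date format is an assumption:

```python
import sqlite3
from datetime import datetime

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE layoffs2 (date TEXT)")
con.executemany("INSERT INTO layoffs2 VALUES (?)",
                [("3/11/2020",), ("3/6/2023",)])

# Parse a US-style M/D/YYYY string and emit ISO YYYY-MM-DD
def to_iso(us_date):
    return datetime.strptime(us_date, "%m/%d/%Y").strftime("%Y-%m-%d")

# Expose the Python helper to SQL so the UPDATE can call it
con.create_function("to_iso", 1, to_iso)
con.execute("UPDATE layoffs2 SET date = to_iso(date)")

dates = [r[0] for r in con.execute("SELECT date FROM layoffs2 ORDER BY date")]
print(dates)  # ['2020-03-11', '2023-03-06']
```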
My analysis will focus on the numeric side of the dataset, so null values in the numeric columns will be problematic.
Checking for nulls in our two numeric columns, I find that 348 rows have both columns null, accounting for about 15% of the original data. Unfortunately, I cannot infer these values from the existing data, nor can I confidently input a calculated estimate. Since leaving these rows in would affect my analysis, I opt to remove them. The row_num column I used to find the duplicates is also no longer needed, so I remove that as well.
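These last two cleanup steps can be sketched as a DELETE plus a DROP COLUMN (the latter needs SQLite 3.35+). The numeric column names are assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE layoffs2 (company TEXT, total_laid_off INTEGER,
                       percentage_laid_off REAL, row_num INTEGER);
INSERT INTO layoffs2 VALUES
  ('Acme', 100, 0.1, 1),
  ('Beta', NULL, NULL, 1),   -- both numeric columns NULL: uninformative
  ('Gamma', NULL, 0.05, 1);  -- one value present: keep

-- Drop only rows where neither numeric column carries information
DELETE FROM layoffs2
WHERE total_laid_off IS NULL AND percentage_laid_off IS NULL;

-- The helper column from the dedup step is no longer needed
ALTER TABLE layoffs2 DROP COLUMN row_num;
""")

rows = con.execute("SELECT company FROM layoffs2 ORDER BY company").fetchall()
cols = [d[1] for d in con.execute("PRAGMA table_info(layoffs2)")]
print(rows, cols)
```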
Feeling content with the cleaning for now, I move on to my exploratory data analysis. Here is what the questions I was curious about turned up:
The data set runs from 11th March 2020 to 6th March 2023, starting at the onset of the worldwide coronavirus outbreak and covering the period after it.
At the top of the list of industries we have retail, food and finance.
Large companies like Amazon, Google and Meta laid off the most employees during the period. However, these companies also have disproportionately large workforces to begin with.
Consumer and retail are leading industries in layoffs.
The USA, contributing the most companies to the database, naturally also has the highest number of layoffs.
Layoffs were lowest in 2020, but subsequent years saw higher numbers in the aftermath of the pandemic. Though 2023 shows a reduction in layoffs, only the first 3 months of the year are accounted for.
The list of the top 5 companies laying off employees changed considerably from year to year over the period.
Finally, companies in Post-IPO stage laid off the most during the period.
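The per-year layoff totals behind observations like these come from a simple aggregation. A sketch, assuming ISO-formatted dates and illustrative data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE layoffs2 (company TEXT, total_laid_off INTEGER, date TEXT);
INSERT INTO layoffs2 VALUES
  ('Acme',  100, '2020-05-01'),
  ('Beta',   40, '2022-11-15'),
  ('Gamma',  60, '2022-01-20'),
  ('Delta',  30, '2023-02-02');
""")

# Total layoffs per year: strftime('%Y', ...) extracts the year from
# an ISO date string, then GROUP BY sums within each year.
rows = con.execute("""
SELECT strftime('%Y', date) AS yr, SUM(total_laid_off)
FROM layoffs2
GROUP BY yr
ORDER BY yr
""").fetchall()
print(rows)  # [('2020', 100), ('2022', 100), ('2023', 30)]
```

The same GROUP BY pattern, swapping the year expression for industry, country, or company stage, produces the other rankings discussed above.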