__STYLES__
Parental Leave is a type of leave that allows certain employees to take time off from work to care for and bond with their newborn or newly adopted child. It may be available to all parents, and either paid or unpaid. The duration of the leave can vary from a few days to several months, depending on the policy or legal requirements.
Parental Leave is considered an important benefit for employees and their families for several reasons:
The U.S. Department of Labor enforces the Family and Medical Leave Act (FMLA) *1, which provides eligible employees up to 12 weeks of unpaid, job-protected leave per year for family and medical reasons. The eligible employees are defined as employees who have worked for at least 12 months, at least 1,250 hours of service in the past 12 months, under a covered employer. FMLA is designated to help employees balance their work and family responsibilities and promote equal employment opportunities for men and women. Parental leave policies and entitlements can vary by country, state, or company.
In the context of this background, the following objectives were established:
In addition to these objectives, an impact visual for a dashboard is required from the Maven Family Leave Challenge.
A CSV file named parental_leave.csv was obtained from the Data Playground of Maven Analytics. It contains crowdsourced parental leave data from 1,601 real companies across different industries. The data source is Failygodboss.
It is assumed that the paid and unpaid parental leave columns can be measured (numerical) and analyzed on a company or industry basis, including sub-industry.
The following steps were followed for the data exploration:
Imported the CSV file into Excel using Power Query
Removed four empty columns found in the dataset
Examined the table content
I confirmed the absence of errors or missing data.
The editor showed:
The data types of company and industry were found to be text, and the paid maternity leave column was decimal. The unpaid maternity leave, paid paternity leave, and unpaid paternity leave columns were classified as text because of the presence of not applicable values. This issue will be addressed later.
The CSV table record is on a company basis. Thus, among the columns, only the company column should not have duplicates. However, when looking at the column, the number of distinct and unique was different. That means there were duplicates. From the column profile information, it was found that the company "Collins Aerospace" has two records.
That revealed that the distributions were uneven, with a few values accounting for the majority of the corresponding columns. The detailed distributions of each column were as follows:
The top 3 frequently appeared values (12, 8, and 6) accounted for almost 50 % of the paid maternity leave column.
The top 3 frequently appeared values (0, 12, and 6) accounted for 56.71 % of the unpaid maternity leave column.
The paid paternity leave column had a noticeable percentage of not applicable values (81.95%).
The unpaid paternity leave column had an outstanding percentage of not applicable values (96%).
Based on the findings, the paid paternity leave and unpaid paternity leave columns were removed from the dataset as more than 80% of their values were missing, which could mislead insights and data-driven decisions and not achieve the second objective. This decision was made in reference to an article in DataCamp *2.
The following SQL commands were executed to create a database named MavenChallenge and a schema named FamilyLeave:
USE master
GO
CREATE DATABASE MavenChallenge
GO
CREATE SCHEMA FamilyLeave
GO
The table data, with empty columns removed, was exported in CSV format from Excel and imported into a table named tblParentaLeave using SQL Server. One of the duplicated records of the "Collins Aerospace" company was removed because Fairygodboss reported that the company offers 12 weeks of paid maternity leave and 4 weeks of paid paternity leave.
USE MavenChallenge
GO
--**************************************************
-- Remove duplicates
--**************************************************
-- Remove duplicates from the company column
SELECT Company, COUNT(*)
FROM dbo.tblParentalLeave
GROUP BY Company
HAVING COUNT(*) > 1
SELECT *
FROM dbo.tblParentalLeave
WHERE Company = 'Collins Aerospace'
-- Remove another record
DELETE dbo.tblParentalLeave
WHERE Company = 'Collins Aerospace' AND Paid_Maternity_Leave = 4
--**************************************************
-- Remove invalid columns
--**************************************************
USE MavenChallenge
GO
ALTER TABLE dbo.tblParentalLeave
DROP COLUMN Paid_Paternity_Leave,
COLUMN Unpaid_Paternity_Leave
Missing values ('N/A') were found in the Industry and Unpaid Maternity Leave columns
3 records had 'N/A' in the column. The missing values were replaced with the correct ones using their official website as a reference.
SELECT *
FROMdbo.tblParentalLeave
WHERE Industry IS NULL OR Industry IN ('', 'N/A')
UPDATE dbo.tblParentalLeave
SET Industry = 'Technology: Manufacturing'
WHERE Company = 'ASML'
UPDATE dbo.tblParentalLeave
SET Industry = 'Advertising: Market Research'
WHERE Company = 'INK Communications Co.'
UPDATE dbo.tblParentalLeave
SET Industry = 'Technology: B2B Tech Services'
WHERE Company = 'Rokt'
107 records had 'N/A' values in the column, accounting for around 6% of the column. These rows were deleted because calculations of the mean or median value could make the result unreliable.
SELECT *
FROM dbo.tblParentalLeave
WHERE Paid_Maternity_Leave IS NULL OR Paid_Maternity_Leave IN ('', 'N/A')
DECLARE @Count INT
DECLARE @CountAll INT
SELECT @Count = COUNT(*)
FROM dbo.tblParentalLeave
WHERE Unpaid_Maternity_Leave = 'N/A'
PRINT(@Count) -- 107
SELECT @CountAll = COUNT(*)
FROM dbo.tblParentalLeave
PRINT(@CountAll) --1,601
PRINT(@Count*100.00/@CountAll) -- around 6%
DELETE dbo.tblParentalLeave
OUTPUT DELETED.*
WHERE Unpaid_Maternity_Leave = 'N/A'
-- Verify the rows in the table matching the WHERE clause have been deleted
SELECT COUNT(*) AS [Rows in Table]
FROM dbo.tblParentalLeave
WHERE Unpaid_Maternity_Leave = 'N/A'
The data cleaning was validated and confirmed to have been performed correctly as expected.
Using the cleaned tblParentalLeave table, I created the following dimension and fact tables:
If you are not familiar with dimension and fact tables, I recommend reading a Microsoft document *3.
--*************************************
--Create DimIndustry table
--*************************************
USE MavenChallenge
GO
IF OBJECT_ID('FamilyLeave.DimIndustry') IS NOT NULL
BEGIN
DROP TABLE FamilyLeave.DimIndustry
END
CREATE TABLE FamilyLeave.DimIndustry
(
ID INT CONSTRAINT PK_Industry PRIMARY KEY IDENTITY(1,1),
Industry VARCHAR(30) NOT NULL
)
INSERT INTO FamilyLeave.DimIndustry
SELECT Industry
FROM
(
SELECT LEFT(Industry, IIF(CHARINDEX(':',Industry) = 0, LEN(Industry), CHARINDEX(':',Industry)-1)) AS Industry
FROM dbo.tblParentalLeave
) AS i
GROUP BY Industry
SELECT *
FROM FamilyLeave.DimIndustry -- 48 industries
--*************************************
-- Create DimSubIndustry table
-- Split values of the industry column in dbo.tblParentalLeave into a industry and sub-industry
--*************************************
USE MavenChallenge
GO
IF OBJECT_ID('FamilyLeave.DimSubIndustry') IS NOT NULL
BEGIN
DROP TABLE FamilyLeave.DimSubIndustry
END
CREATE TABLE FamilyLeave.DimSubIndustry
(
ID INT CONSTRAINT PK_SubIndustry PRIMARY KEY IDENTITY(1,1),
SubIndustry VARCHAR(33) NOT NULL,
IndustryID INT CONSTRAINT FK_Industry FOREIGN KEY REFERENCES FamilyLeave.DimIndustry(ID) NOT NULL,
CONSTRAINT unqSubIndustry UNIQUE(SubIndustry, IndustryID)
)
INSERT INTO FamilyLeave.DimSubIndustry
SELECT SubIndustry, i.ID AS IndustryID
FROM
(
SELECT LEFT(Industry, IIF(CHARINDEX(': ',Industry) = 0, LEN(Industry), CHARINDEX(': ',Industry)-1)) AS Industry, IIF(CHARINDEX(': ',Industry) = 0, 'Other', RIGHT(Industry, LEN(Industry) - CHARINDEX(': ',Industry)-1)) AS SubIndustry
FROM dbo.tblParentalLeave
) AS si
JOIN FamilyLeave.DimIndustry AS i
ON si.Industry = i.Industry
GROUP BY si.SubIndustry, i.ID
SELECT *
FROM FamilyLeave.DimSubIndustry
ORDER BY ID --182 industry-subindustry relations
--*************************************
--Create DimCompany table
--*************************************
USE MavenChallenge
GO
IF OBJECT_ID('FamilyLeave.DimCompany') IS NOT NULL
BEGIN
DROP TABLE FamilyLeave.DimCompany
END
CREATE TABLE FamilyLeave.DimCompany
(
ID INT CONSTRAINT PK_Company PRIMARY KEY IDENTITY(1,1),
Company VARCHAR(51) NOT NULL
)
INSERT INTO FamilyLeave.DimCompany
SELECT Company
FROM dbo.tblParentalLeave
GROUP BY Company
SELECT *
FROM FamilyLeave.DimCompany -- 1,493 companies
--*************************************
--Create FactParentalLeave table
--Convert the data type of the industry column in the FactParentalLeave table into an integer for a foreign key
--*************************************
USE MavenChallenge
GO
IF OBJECT_ID('FamilyLeave.FactParentalLeave') IS NOT NULL
BEGIN
DROP TABLE FamilyLeave.FactParentalLeave
END
CREATE TABLE FamilyLeave.FactParentalLeave
(
ID INT CONSTRAINT PK_ParentalLeave PRIMARY KEY IDENTITY(1,1),
CompanyID INT CONSTRAINT FK_Company FOREIGN KEY REFERENCES FamilyLeave.DimCompany(ID),
SubIndustryID INT CONSTRAINT FK_SubIndustry FOREIGN KEY REFERENCES FamilyLeave.DimSubIndustry(ID) NOT NULL,
Paid_Maternity_Leave DECIMAL(3,1),
Unpaid_Maternity_Leave DECIMAL(3,1),
CONSTRAINT unqParentalLeave UNIQUE(CompanyID, SubIndustryID)
)
INSERT INTO FamilyLeave.FactParentalLeave
SELECT
(
SELECT c.ID
FROM FamilyLeave.DimCompany AS c
WHERE p.Company = c.Company
) AS CompnayID
, (
SELECT si.ID
FROM FamilyLeave.DimSubIndustry AS si
JOIN FamilyLeave.DimIndustry AS i
ON si.IndustryID = i.ID
WHERE (p.Industry = i.Industry + ': ' + si.SubIndustry) OR (p.Industry + ': Other' = i.Industry + ': ' + si.SubIndustry)
) AS SubIndustryID
, Paid_Maternity_Leave
, Unpaid_Maternity_Leave
FROM dbo.tblParentalLeave AS p
Before choosing the right charts, it's important to ask critical questions about the data. Based on an article written by Sara A. Metwalli *4, I considered the following five questions:
1. What story is the data trying to convey?
The objectives of this project are as follows:
Understand the difference between parental leave among industries
Identify the industries or companies with the highest and lowest parental leave
Examine the distribution of industries or companies by paid and unpaid parental leave
2. Who will be the audience for the results? Let's say I'm reporting the results to my wife, who is one of the people most interested in parental leave policies and gender equality that I know.
3. How big is the data?
The data includes
1,493 companies
48 industries
182 sub-industries
A range of paid maternity leave from 0 to 52 weeks
A range of unpaid maternity leave from 0 to 52 weeks
4. What is the data type?
According to an article by Great Leaning *5, data can be classified into 4 types:
Nominal data
Ordinal data
Discrete data
Continuous data
Source: The article by Great Learning*4
Based on this classification of the data, I categorized my data as follows:
According to 365 Careers Team *6,
Categorical data can be represented on:
Tables
Cross tables
Bar charts
Pie charts
Pareto diagrams
Numerical data can be represented on:
Bar charts
Histograms
Scatter plots
Box plots
Pie charts
Line graphs
5. How do the different elements of the data relate to each other?
The data shows parental leave by dimensions such as industry, sub-industry, and company. To visualize the data, I broke down the objectives of this project into smaller, more precise questions:
How many companies are included in individual industries?
What is the frequency (distributions) of parental leave in a certain interval among industries?
What is the difference between parental leave among industries?
What are the most and least parental leave adoption industries or companies?
Before determining and displaying charts, it's helpful to calculate the average and median of paid and unpaid maternity leave using SQL. To do this, I used the following query:
USE MavenChallenge
GO
SELECT Industry
, COUNT(CompanyID) AS Companies
, CONVERT(DECIMAL(3,1), MAX(AvgPaid)) AS AvgPaid
, CONVERT(DECIMAL(3,1),MAX(AvgUnpaid)) AS AvgUnpaid
, MAX(MdnPaid) AS MdnPaid
, MAX(MdnUnpaid) AS MdnUnpaid
, CONVERT(DECIMAL(3,1), ABS(MAX(AvgPaid) - MAX(MdnPaid))) AS DiffPaid
, CONVERT(DECIMAL(3,1), ABS(MAX(AvgUnpaid) - MAX(MdnUnpaid))) AS DiffUnpaid
FROM
(
SELECT s.IndustryID
, p.CompanyID
, i.Industry
, AVG(Paid_Maternity_Leave) OVER(PARTITION BY i.ID) AS AvgPaid
, AVG(Unpaid_Maternity_Leave) OVER (PARTITION BY i.ID) AS AvgUnpaid
, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Paid_Maternity_Leave) OVER (PARTITION BY i.ID) AS MdnPaid
, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Unpaid_Maternity_Leave) OVER (PARTITION BY i.ID) AS MdnUnpaid
FROM FamilyLeave.FactParentalLEave AS p
JOIN FamilyLeave.DimSubIndustry AS s
ON p.SubIndustryID = s.ID
JOIN FamilyLeave.DimIndustry AS i
ON s.IndustryID = i.ID
) AS t
GROUP BY Industry
ORDER BY Count(CompanyID) DESC
After analyzing the data, it became apparent that certain industries had only one company, making it difficult to compare the average paid and unpaid maternity leave among industries. It is because the numerical values of these industries could be outliers. Rather than deleting these industries from the dataset, I opted to highlight them in red on the dashboard to ensure transparency and accuracy in the data visualization.
3.1. On the whole, paid leave mostly ranges from 4 to 12 weeks across all industries, while unpaid leave falls within the 0 to 4 week range.
When considering the distribution of paid and unpaid maternity leave across all industries, it is most common for paid leave to range from 4 to 12 weeks, while unpaid leave falls within the 0 - 4 week range. Regarding industry rankings, Philanthropy provides the highest average number of paid maternity leave weeks, at 22.4 weeks, compared to other industries. In terms of unpaid maternity leave, the Printing industry ranks first; however, it should be noted that this is based on only one company, so it is insufficient to conclude that the industry has the most availability of leave. Accounting Services ranks second with 12 weeks, following Philanthropy. The industries providing the least amount of unpaid leave are Computer & Network Security, Engineering, and Sports. They provide no unpaid leave.
3.2. While the Family and Medical Leave Act (FMLA) promotes opportunities for unpaid leave, almost 62% of companies provide more paid maternity leave weeks than unpaid ones.
It was found that nearly 62% of companies (920 out of 1,493) provide more paid maternity leave weeks than unpaid ones.
*1. U.S. Department of Labor, Family and Medical Leave Act (FMLA)
*2. DataCamp, Data Cleaning Tutorial
*3. Microsoft, Understand star schema and the importance for Power BI
*4. Sara A. Metwalli, Data Visualization 101: How to Choose a Chart Type
*5. Great Learning, 4 Types Of Data – Nominal, Ordinal, Discrete and Continuous
*6. 356 Careers Team, Statistics for Data Science and Business Analysis