In this project, I carried out a simple linear regression analysis to explore the relationship between two continuous variables. As part of an analytics team focused on marketing and sales insights, I worked on a project centered on influencer marketing. The primary objective was to investigate the relationship between marketing promotional budgets and sales. Our company's decision-makers rely on this analysis to inform future marketing strategies and investments, so it was crucial to understand clearly how each promotional channel affects revenue.
This project allowed me to strengthen my knowledge of linear regression and hone my skills in evaluating regression results. The insights gained will be invaluable in providing data-driven business recommendations in the future.
To begin the analysis, I imported the necessary Python libraries (pandas, matplotlib.pyplot, and seaborn) to manipulate and visualize the data. Additionally, I used the statsmodels library to build and fit the linear regression model.
I started with exploratory data analysis (EDA) to familiarize myself with the dataset and prepare it for modeling. The dataset consisted of the following features:
Each row represented an independent marketing promotion, with investments made in TV, social media, and radio promotions to boost sales. The primary aim was to identify the feature that most strongly predicted sales.
To achieve this, I performed the following steps:
Next, I visualized the distribution of sales using a histogram, which showed that sales were evenly distributed between $25 million and $350 million.
In this step, I constructed a pairplot to visualize relationships between pairs of variables and identify the feature with the strongest linear relationship with sales. Based on the pairplot, I chose TV as the independent variable X for the simple linear regression model.
I then built and fitted the model using the statsmodels library. The linear equation for the model is:
Sales (in millions) = -0.1263 + 3.5614 * TV (in millions)
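To make the equation concrete, the short sketch below plugs a few TV budgets into it. The intercept and slope are the fitted values reported above; the function name and the example budgets are illustrative choices.

```python
# Coefficients from the fitted equation above (both in millions)
INTERCEPT = -0.1263
SLOPE = 3.5614


def predict_sales(tv_budget_millions: float) -> float:
    """Return predicted sales (in millions) for a given TV budget (in millions)."""
    return INTERCEPT + SLOPE * tv_budget_millions


# Example budgets chosen for illustration only
for tv_budget in (10, 50, 100):
    print(f"TV = {tv_budget}M -> predicted sales = {predict_sales(tv_budget):.4f}M")
```

For a TV budget of 10 million, this gives -0.1263 + 3.5614 * 10 = 35.4877 million in predicted sales.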
The R-squared value for the model was 0.999, indicating that 99.9% of the variation in sales could be explained by the TV promotional budget alone.
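A fit of this kind can be sketched with the statsmodels formula API, as below. This is not the original notebook code. The data is synthetic, generated around the reported equation with an assumed noise level, so the recovered coefficients and R-squared only approximate the figures quoted above.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in data (assumption): generated around the reported equation
rng = np.random.default_rng(42)
tv = rng.uniform(10, 100, size=500)
sales = -0.1263 + 3.5614 * tv + rng.normal(0, 3, size=500)
data = pd.DataFrame({"TV": tv, "Sales": sales})

# Ordinary least squares: Sales as a linear function of TV plus an intercept
model = ols(formula="Sales ~ TV", data=data).fit()

print(model.params)                       # Intercept and TV slope
print(f"R-squared: {model.rsquared:.3f}")
```

Calling model.summary() prints the full regression table, including the coefficients, their standard errors, p-values, and confidence intervals.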
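To evaluate the model, I checked the four assumptions of linear regression: linearity, independent observations, normality of the residuals, and homoscedasticity (constant variance of the residuals).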
Interpreting the model results, I found that each additional one million dollars in the TV promotional budget was associated with an estimated $3.5614 million increase in sales. The p-value for the TV coefficient was reported as 0.000 (that is, below 0.001), and its 95% confidence interval was [3.558, 3.565], indicating little uncertainty in the estimate.
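The p-value and confidence interval can be read from a fitted statsmodels result, as in the sketch below. The data is synthetic and the noise level is assumed, so the printed interval will not match the one reported above exactly.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in data (assumption), generated around the reported equation
rng = np.random.default_rng(42)
tv = rng.uniform(10, 100, size=500)
data = pd.DataFrame({"TV": tv, "Sales": -0.1263 + 3.5614 * tv + rng.normal(0, 3, size=500)})

model = ols(formula="Sales ~ TV", data=data).fit()

# 95% confidence interval for each coefficient (alpha = 0.05)
ci = model.conf_int(alpha=0.05)
tv_low, tv_high = ci.loc["TV"]

# p-value for the null hypothesis that the TV coefficient is zero
tv_p_value = model.pvalues["TV"]

print(f"TV coefficient: {model.params['TV']:.4f}")
print(f"95% CI: [{tv_low:.3f}, {tv_high:.3f}]")
print(f"p-value below 0.001: {tv_p_value < 0.001}")
```

A summary table prints very small p-values as 0.000 because it rounds to three decimals; the underlying value is small but not exactly zero.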
This project provided valuable insights into the relationship between marketing promotional budgets and sales. Key takeaways include the importance of EDA to identify suitable variables for regression, checking assumptions before interpreting results, and providing measures of uncertainty (p-values, confidence intervals) with coefficient estimates.
I would recommend that leadership at our organization prioritize increasing the TV promotional budget over other channels, because TV has the strongest positive linear relationship with sales. This recommendation is supported by the high R-squared value, the low p-value, and the narrow confidence interval for the TV coefficient, which together indicate high confidence in the estimated relationship between TV promotions and sales. As a next step, I would explore a model that uses both TV and radio as independent variables and create plots to communicate the results more clearly.
Overall, this project provided essential insights that will aid in making informed marketing investment decisions to drive revenue growth and maximize return on promotional spending.