The main objective of this notebook is to find which variables correlate most strongly with gross earnings in the movie industry.
The data covers 1986 to 2016 and is available on Kaggle: https://www.kaggle.com/datasets/danielgrijalvas/movies
The project uses basic Python functions to process, clean, analyze, and visualize the data.
The Python libraries used are: pandas, seaborn, numpy and matplotlib.
# Import Libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from matplotlib.pyplot import figure
import copy
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12,8) #Adjusts the configuration of the plots we will create
# Read the data
df = pd.read_csv(r'C:\Users\Documents\CARLOS\Datasets\Movie Industry\movies.csv')
# Data view
df.head()
for col in df.columns:
    # Percentage of missing values in this column
    pct_missing = np.mean(df[col].isnull()) * 100
    print('{} - {:.2f}%'.format(col, pct_missing))
It is convenient to drop the rows with missing values in the budget and gross columns: these columns have the highest percentage of missing data, and they are likely to be the two most important variables in the following analysis:
df.dropna(subset=['budget', 'gross'], inplace=True)
for col in df.columns:
    # Re-check the percentage of missing values after dropping rows
    pct_missing = np.mean(df[col].isnull()) * 100
    print('{} - {:.2f}%'.format(col, pct_missing))
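The same check can be written without an explicit loop. The following is a minimal sketch on a made-up toy frame (the `toy` data and its values are hypothetical, not taken from movies.csv); it relies on the fact that the mean of a boolean mask is the fraction of True values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the movies data
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "budget": [1.0e6, np.nan, 3.0e6, np.nan],
    "gross": [5.0e6, 2.0e6, np.nan, 1.0e6],
})

# Share of missing values per column, expressed as a percentage
pct_missing = toy.isnull().mean() * 100
print(pct_missing)

# Keep only the rows where both key columns are present
toy_clean = toy.dropna(subset=["budget", "gross"])
print(len(toy_clean), "rows remain")
```

Assigning the result of `dropna` to a new name, instead of using `inplace=True`, keeps the original frame available for comparison.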
The year column and the year in the released column do not always match, so the release year is stored in a new column.
To do that, the released column is converted to a string, and a regular expression (r'(\d{4})') extracts the first four-digit sequence, which is taken as the year:
# Create Year column
# expand=False returns a Series instead of a one-column DataFrame
df['yearcorrected'] = df['released'].astype(str).str.extract(r'(\d{4})', expand=False)
# del df['yearcorrected']
df.head()
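To illustrate the extraction step in isolation, here is a small sketch. The sample strings are invented and only assume that the release date contains a four-digit year; the conversion to a nullable integer type at the end is an optional extra, not part of the original notebook.

```python
import pandas as pd

# Hypothetical release strings; the real column format may differ
released = pd.Series([
    "June 13, 1980 (United States)",
    "July 2, 1986 (United States)",
    None,  # a missing release date
])

# Extract the first run of four digits; rows without a match become NaN
year_text = released.astype(str).str.extract(r"(\d{4})", expand=False)

# Optional: turn the extracted text into nullable integers
year_num = pd.to_numeric(year_text).astype("Int64")
print(year_num)
```

Because the extracted values are strings, a numeric conversion like this one is worth considering if the year is meant to be treated as a number in later steps.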
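# Show all rows when a DataFrame is displayed
pd.set_option('display.max_rows', None)
# Sort by gross earnings, highest first
df = df.sort_values(by='gross', ascending=False)
# Preview the top rows with exact duplicate rows removed (df itself is unchanged)
df.drop_duplicates().head()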
In this section the analysis is focused on finding the highest correlations to gross earnings. The following chunk of code converts categorical columns in the DataFrame 'df' into numerical values using label encoding and stores the result in the new DataFrame 'df_numerized':
# Label encoding
df_numerized = copy.deepcopy(df)
for col_name in df_numerized.columns:
    if df_numerized[col_name].dtype == 'object':
        df_numerized[col_name] = df_numerized[col_name].astype('category')
        df_numerized[col_name] = df_numerized[col_name].cat.codes
df_numerized.head()
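To see what the encoding does on a small scale, here is a sketch with an invented `genre` column (the values are hypothetical). It shows that pandas assigns the codes according to the sorted order of the category labels.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "genre": ["Drama", "Action", "Comedy", "Action"],
    "gross": [10, 40, 20, 35],
})

as_category = toy["genre"].astype("category")

# Integer code for each row
codes = as_category.cat.codes
# Lookup table from code back to label
mapping = dict(enumerate(as_category.cat.categories))

print(codes.tolist())   # [2, 0, 1, 0]
print(mapping)          # {0: 'Action', 1: 'Comedy', 2: 'Drama'}
```

Keeping a mapping like this makes it possible to translate codes back into labels when reading the encoded frame.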
After the label-encoding loop completes, 'df_numerized' contains the same data as 'df', but every text column has been replaced with integer codes. This is useful for methods that require numerical input, and it lets every column appear in a single correlation heatmap. The codes follow the sorted order of the category labels, so for unordered categories such as names, a coefficient partly reflects that arbitrary ordering and should be read with caution. The correlations are computed with the Pearson method:
correlation_matrix = df_numerized.corr(method='pearson')
#correlation_matrix = df_numerized.corr(method='kendall')
#correlation_matrix = df_numerized.corr(method='spearman')
sns.heatmap(correlation_matrix, annot=True)
plt.title('Correlation Matrix for Numeric Features')
plt.xlabel('Movie Features')
plt.ylabel('Movie Features')
plt.show()
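For reference, the Pearson coefficient behind each cell of the heatmap can be computed by hand. The sketch below uses invented numbers (not values from the dataset) and compares a manual calculation with the pandas result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values
toy = pd.DataFrame({
    "budget": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gross": [2.0, 4.1, 5.9, 8.2, 9.8],
})

x = toy["budget"].to_numpy()
y = toy["gross"].to_numpy()

# Pearson r: covariance of x and y divided by the product of their spreads
dx = x - x.mean()
dy = y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# The same coefficient as computed by pandas
r_pandas = toy["budget"].corr(toy["gross"], method="pearson")

print(round(r_manual, 4), round(r_pandas, 4))
```

The coefficient ranges from -1 to 1 and measures only linear association, which is why the commented-out 'kendall' and 'spearman' options can give different values on the same data.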
The larger the absolute value of the coefficient, the stronger the correlation. In the heatmap, 'budget' and 'votes' have the largest coefficients with 'gross', 0.74 and 0.61 respectively. Because the matrix contains many pairs, it is useful to list only the strongest correlations:
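correlation_mat = df_numerized.corr(method='pearson')
# Flatten the matrix into a Series indexed by (column, column) pairs
corr_pairs = correlation_mat.unstack()
sorted_pairs = corr_pairs.sort_values()
# Keep strong positive correlations; "< 1" excludes each column paired with itself
high_corr = sorted_pairs[(sorted_pairs > 0.5) & (sorted_pairs < 1)]
high_corr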
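Since the question is specifically about gross earnings, an alternative is to read only the 'gross' column of the correlation matrix. This is a sketch on invented data (the `toy` frame and its `runtime` values are hypothetical); it avoids listing each pair twice and also keeps strong negative correlations by filtering on the absolute value.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "budget": [1, 2, 3, 4, 5],
    "votes": [2, 1, 4, 3, 5],
    "runtime": [5, 3, 4, 1, 2],
    "gross": [2, 3, 5, 8, 9],
})

corr = toy.corr(method="pearson")

# Correlation of every other column with 'gross', strongest positive first
gross_corr = corr["gross"].drop("gross").sort_values(ascending=False)

# Keep coefficients whose magnitude exceeds 0.5, positive or negative
strong = gross_corr[gross_corr.abs() > 0.5]
print(strong)
```

The 0.5 threshold mirrors the one used above and is a judgment call rather than a fixed rule.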
After double-checking, 'budget' and 'votes' have the highest correlations with 'gross' earnings.
# Plot budget vs gross using seaborn
sns.regplot(x='budget', y='gross', data=df, scatter_kws={"color": "blue"}, line_kws={"color": "red"})
plt.title('Scatter Plot with Regression Line: Budget vs. Gross Earnings')
The scatter plot shows the raw data: how 'budget' and 'gross' earnings are distributed among movies. The regression line is a straight line that summarizes the overall trend between the two variables. Its positive slope indicates a positive correlation: as movie budgets increase, gross earnings tend to increase as well. This shows that higher budgets are associated with higher earnings.
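The slope of such a trend line can also be inspected numerically. The following sketch fits a least-squares straight line with `np.polyfit` on invented numbers (not dataset values); it is meant as an illustration of a linear fit, not as a reproduction of the exact line drawn in the plot.

```python
import numpy as np

# Hypothetical toy values
budget = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
gross = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Degree-1 polynomial fit: gross is approximated by slope * budget + intercept
slope, intercept = np.polyfit(budget, gross, deg=1)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

A positive slope corresponds to a positive Pearson coefficient, since both share the sign of the covariance between the two variables.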
# Plot votes vs gross using seaborn
sns.regplot(x='votes', y='gross', data=df, scatter_kws={"color": "blue"}, line_kws={"color": "red"})
plt.title('Scatter Plot with Regression Line: Votes vs. Gross Earnings')
The scatter plot shows how movies are distributed by number of votes and gross earnings. The regression line is a straight line that summarizes the overall trend between 'votes' and 'gross' earnings. It also has a positive slope, which indicates a positive correlation: as the number of votes increases, gross earnings tend to increase. Note that 'votes' counts how many users rated a movie, not how highly they rated it, so this suggests that movies attracting more votes tend to perform better at the box office.
The main conclusion is that 'budget' and 'votes' have the highest correlations with gross earnings. Both correlations are positive: as a movie's budget and number of votes increase, gross earnings tend to increase as well.
However, correlation does not imply causation. The scatter plots and regression lines show an association between the number of votes ('votes') and box office earnings ('gross'), and between 'budget' and 'gross', but further analysis would be needed to establish causation and to account for confounding variables that influence movie success.