Correlation Analysis in Movies Data - Python Project

About this project

Introduction

The main objective of this notebook project is to find which variables correlate most strongly with gross earnings in the movie industry.

The data covers 1986 to 2016 and is available on Kaggle: https://www.kaggle.com/datasets/danielgrijalvas/movies

The project includes some of the basic Python functions to process, clean, analyze and visualize data.

The Python libraries used are: pandas, seaborn, numpy and matplotlib.

Libraries Used and Data Processing

# Import libraries
import copy

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
plt.style.use('ggplot')
matplotlib.rcParams['figure.figsize'] = (12, 8)  # Default size for the plots created below

# Read the data
df = pd.read_csv(r'C:\Users\Documents\CARLOS\Datasets\Movie Industry\movies.csv')

# Data view
df.head()

Data Cleaning

Looking for missing data



for col in df.columns:
    # Fraction of missing values in the column, expressed as a percentage
    pct_missing = np.mean(df[col].isnull()) * 100
    print('{} - {:.2f}%'.format(col, pct_missing))

Rows with empty values in the budget and gross columns are dropped, because these columns have the highest percentage of missing data and are likely to be the two most important variables in the analysis that follows:

# Drop rows where budget or gross is missing, then re-check every column
df.dropna(subset=['budget', 'gross'], inplace=True)
for col in df.columns:
    pct_missing = np.mean(df[col].isnull()) * 100
    print('{} - {:.2f}%'.format(col, pct_missing))
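The same check can be written without a loop. The sketch below uses a small made-up DataFrame (the name `toy` and its values are assumptions for illustration, not the movies data) to show that the mean of a boolean mask is the fraction of missing values, and that `dropna` with `subset` removes only rows missing one of the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the movies data
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "budget": [1.0e6, np.nan, 3.0e6, np.nan],
    "gross": [5.0e6, 2.0e6, np.nan, 1.0e6],
})

# isnull() gives a boolean frame; its column mean is the fraction missing
pct_missing = toy.isnull().mean() * 100
print(pct_missing.round(2))

# Keep only the rows where both budget and gross are present
cleaned = toy.dropna(subset=["budget", "gross"])
print(len(cleaned))
```

Multiplying by 100 matters here: without it the printed number is a fraction between 0 and 1, not a percentage.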

Adding Year Column

The 'year' column and the year in the 'released' date don't always match, so the release year is extracted into a new column.

To do that, the 'released' column is converted to strings and the regular expression r'(\d{4})' is used to extract the year:

  • 'r' before the string indicates that it's a raw string literal in Python, which is often used with regular expressions to avoid unintended escape sequences.
  • '(' and ')' are used to create a capturing group. In this case, it captures the four digits that represent the year.
  • '\d' is a shorthand character class in regular expressions, which matches any digit (0-9).
  • '{4}' specifies that the previous \d should be repeated exactly four times. In other words, it matches exactly four consecutive digits.
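# Create the year column from the release date
# expand=False returns a Series (the single capture group) instead of a one-column DataFrame
df['yearcorrected'] = df['released'].astype(str).str.extract(r'(\d{4})', expand=False)
df.head()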
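To see how the pattern behaves, here is a minimal sketch on a few made-up release strings (their format is an assumption for illustration, not taken from the dataset). It shows that the first run of four digits is captured, that the result is a string rather than a number, and that rows without a match become NaN.

```python
import pandas as pd

# Hypothetical release strings; the real column may be formatted differently
released = pd.Series([
    "June 13, 1980 (United States)",
    "July 2, 1986",
    "unknown",
])

# expand=False returns a Series holding the single capture group
year = released.astype(str).str.extract(r"(\d{4})", expand=False)
print(year)

# Convert to numbers if the year is needed for arithmetic or sorting
year_numeric = pd.to_numeric(year, errors="coerce")
```

Because the extracted year is text, a step like `pd.to_numeric` is needed before treating it as a numeric variable.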

Sorting and Deleting Duplicates

pd.set_option('display.max_rows', None)
df = df.sort_values(by=['gross'], ascending=False)

# drop_duplicates returns a new DataFrame, so assign it back to keep the result
df = df.drop_duplicates()
df.head()
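One detail worth checking is that `drop_duplicates` returns a new DataFrame and leaves the original untouched unless the result is assigned. A minimal sketch on a made-up frame (the name `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame with one fully duplicated row
toy = pd.DataFrame({"name": ["A", "B", "A"], "gross": [10, 30, 10]})

# Calling the method without assignment does not change toy
toy.drop_duplicates()
rows_after_bare_call = len(toy)

# Assigning the result keeps the de-duplicated, sorted frame
deduped = toy.drop_duplicates().sort_values(by="gross", ascending=False)
print(rows_after_bare_call, len(deduped))
```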

Data Analysis

Highest Correlations to Gross Earnings

In this section the analysis is focused on finding the highest correlations to gross earnings. The following chunk of code converts categorical columns in the DataFrame 'df' into numerical values using label encoding and stores the result in the new DataFrame 'df_numerized':

# Label encoding
df_numerized = df.copy()

for col_name in df_numerized.columns:
    if df_numerized[col_name].dtype == 'object':
        # Replace each distinct value with its integer category code
        df_numerized[col_name] = df_numerized[col_name].astype('category').cat.codes
df_numerized.head()
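As a small illustration of the encoding step, the sketch below applies it to a made-up frame (the name `toy` and its values are assumptions). It selects non-numeric columns with `select_dtypes` rather than comparing the dtype to 'object', which is a variation on the loop above, and shows that codes follow the sorted order of the categories.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column
toy = pd.DataFrame({
    "genre": ["Drama", "Action", "Comedy", "Action"],
    "gross": [10, 40, 20, 35],
})

encoded = toy.copy()
for col in encoded.select_dtypes(exclude="number").columns:
    # Categories are sorted, so Action -> 0, Comedy -> 1, Drama -> 2
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

The codes reflect alphabetical order only, so a larger code does not mean a "larger" genre.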

After the 'for' loop completes, 'df_numerized' contains the same data as the original DataFrame 'df', but each categorical column has been replaced with integer codes. This transformation is useful for methods that require numerical inputs and cannot handle categorical data directly, such as many machine learning algorithms and the correlation calculation used here. Note that the codes are assigned in sorted category order and carry no inherent magnitude, so correlations involving encoded columns should be read with caution.

With every column numeric, all pairwise correlations can be shown in a heatmap. The correlations are computed with the Pearson method:

correlation_matrix = df_numerized.corr(method='pearson')
#correlation_matrix = df_numerized.corr(method='kendall')
#correlation_matrix = df_numerized.corr(method='spearman')

sns.heatmap(correlation_matrix, annot=True)

plt.title('Correlation Matrix for Numeric Features')
plt.xlabel('Movie Features')
plt.ylabel('Movie Features')

plt.show()

The closer a coefficient's absolute value is to 1, the stronger the linear correlation. In the heatmap, 'budget' and 'votes' have the largest coefficients with 'gross', 0.74 and 0.61 respectively. Because the matrix contains many pairs, it is useful to list only the highest correlations:

correlation_mat = df_numerized.corr(method='pearson')

corr_pairs = correlation_mat.unstack()

sorted_pairs = corr_pairs.sort_values()

high_corr = sorted_pairs[(sorted_pairs > 0.5) & (sorted_pairs < 1)]

high_corr

The filtered list confirms that 'budget' and 'votes' have the highest correlations with 'gross' earnings.
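Unstacking the full matrix lists every pair twice, once as (a, b) and once as (b, a). When only one target matters, an alternative is to take that target's column from the correlation matrix and sort it. The sketch below does this on a made-up frame (the name `toy` and all values are assumptions, not results from the movies data):

```python
import pandas as pd

# Hypothetical numeric frame; values are invented for illustration
toy = pd.DataFrame({
    "budget": [1, 2, 3, 4, 5],
    "votes": [2, 1, 4, 3, 5],
    "runtime": [5, 3, 4, 1, 2],
    "gross": [2, 4, 5, 8, 11],
})

# Take the 'gross' column of the matrix, drop its self-correlation of 1,
# and rank the remaining features from most to least correlated
corr_with_gross = (
    toy.corr(method="pearson")["gross"]
    .drop("gross")
    .sort_values(ascending=False)
)
print(corr_with_gross)
```

Sorting by `corr_with_gross.abs()` instead would also surface strong negative correlations, which a threshold such as `> 0.5` leaves out.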

Correlation between 'budget' and 'gross' earnings

# Plot budget vs gross using seaborn

sns.regplot(x='budget', y='gross', data=df, scatter_kws={"color": "blue"}, line_kws={"color": "red"})

plt.title('Scatter Plot with Regression Line: Budget vs. Gross Earnings')

The scatter plot shows the raw data: how 'budget' and 'gross' earnings are distributed among movies. The regression line summarizes the overall linear trend between the two variables. Its positive slope indicates a positive correlation: as movie budgets increase, gross earnings tend to increase as well. In other words, higher budgets are associated with higher earnings.
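The link between the regression line and the correlation coefficient can be checked numerically. By default, seaborn's `regplot` draws a least-squares straight line; for such a line, the slope equals the Pearson coefficient multiplied by the ratio of the standard deviations, so the slope and the coefficient always share the same sign. A minimal sketch with made-up numbers (the arrays below are assumptions, not values from the dataset):

```python
import numpy as np

# Hypothetical budget and gross values, invented for illustration
budget = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
gross = np.array([2.0, 4.0, 5.0, 8.0, 11.0])

# Least-squares straight line: gross is approximated by slope * budget + intercept
slope, intercept = np.polyfit(budget, gross, deg=1)

# Pearson correlation coefficient between the two arrays
r = np.corrcoef(budget, gross)[0, 1]

# For simple linear regression: slope = r * std(y) / std(x)
slope_from_r = r * gross.std() / budget.std()
print(slope, slope_from_r, r)
```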

Correlation between 'votes' and 'gross' earnings

# Plot votes vs gross using seaborn

sns.regplot(x='votes', y='gross', data=df, scatter_kws={"color": "blue"}, line_kws={"color": "red"})

plt.title('Scatter Plot with Regression Line: Votes vs. Gross Earnings')

The scatter plot shows how movies are distributed by number of votes and gross earnings. The regression line summarizes the overall linear trend between 'votes' and 'gross' earnings. It also has a positive slope, indicating a positive correlation: as the number of votes increases, gross earnings tend to increase. Note that 'votes' counts how many users voted, not how highly they rated the movie, so the plot indicates that movies attracting more votes tend to perform better at the box office.

Conclusions

The main conclusion is that 'budget' and 'votes' have the highest correlations with gross earnings. Both correlations are positive: as a movie's budget and number of votes increase, gross earnings tend to increase as well.

However, it's important to emphasize that correlation does not imply causation. While the scatter plots and regression lines show an association between the number of votes ('votes') and box office earnings ('gross'), and between 'budget' and 'gross', further analysis would be needed to establish causation and to account for potential confounding variables that influence movie success.
