Python 3.7.9, Jupyter Notebook 6.1.4, PostgreSQL 11.10, pgAdmin 4.29, SQLAlchemy
The purpose of this project is to extract, transform, and load data about various movies from separate sources into a movies SQL database for use in a hackathon. I used a Jupyter notebook to explore data scraped from the sidebars of movies' Wikipedia pages into a JSON file. I then performed similar cleaning tasks on a CSV file of movie metadata from MovieLens and merged the Kaggle and Wikipedia movies dataframes.
The image below shows the code used to convert datatypes in the Kaggle dataset and then merge the two dataframes:
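The datatype conversion and merge step can be sketched as follows. The column names (`budget`, `release_date`, `adult`, `wiki_title`) and sample values here are illustrative assumptions, not the project's actual schema:

```python
import pandas as pd

# Toy Kaggle metadata frame; numeric and date fields arrive as strings.
kaggle_metadata = pd.DataFrame({
    "imdb_id": ["tt0111161", "tt0068646"],
    "budget": ["25000000", "6000000"],
    "release_date": ["1994-09-23", "1972-03-24"],
    "adult": ["False", "False"],
})

# Convert string columns to usable dtypes.
kaggle_metadata["budget"] = kaggle_metadata["budget"].astype(int)
kaggle_metadata["release_date"] = pd.to_datetime(kaggle_metadata["release_date"])

# Keep only non-adult titles, then drop the now-constant column.
kaggle_metadata = kaggle_metadata[kaggle_metadata["adult"] == "False"].drop(columns="adult")

# Toy Wikipedia frame sharing the IMDb id key.
wiki_movies_df = pd.DataFrame({
    "imdb_id": ["tt0111161", "tt0068646"],
    "wiki_title": ["The Shawshank Redemption", "The Godfather"],
})

# Inner merge on the shared IMDb id; suffixes distinguish any overlapping columns.
movies_df = pd.merge(wiki_movies_df, kaggle_metadata,
                     on="imdb_id", suffixes=("_wiki", "_kaggle"))
```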
I cleaned the merged dataframes by consolidating data from similar columns and dropping redundant columns. Below is an image of the consolidation code, including a list of the redundant columns and, in the comments, the resolution chosen for each one.
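A common consolidation pattern for this step is to keep the Kaggle column, fill its gaps from the Wikipedia counterpart, and drop the Wikipedia column. The sketch below assumes hypothetical paired columns (`title_kaggle`/`title_wiki`, `runtime`/`running_time`); the real project's pairs may differ:

```python
import numpy as np
import pandas as pd

# Toy merged frame with redundant wiki/kaggle column pairs.
movies_df = pd.DataFrame({
    "title_kaggle": ["Jumanji", None],
    "title_wiki": ["Jumanji (1995 film)", "Heat"],
    "runtime": [np.nan, 170.0],
    "running_time": [104.0, np.nan],
})

def fill_missing_kaggle_data(df, kaggle_col, wiki_col):
    """Keep the Kaggle column, filling its gaps from the Wikipedia column,
    then drop the redundant Wikipedia column."""
    df[kaggle_col] = df.apply(
        lambda row: row[wiki_col] if pd.isna(row[kaggle_col]) else row[kaggle_col],
        axis=1,
    )
    df.drop(columns=wiki_col, inplace=True)

fill_missing_kaggle_data(movies_df, "title_kaggle", "title_wiki")
fill_missing_kaggle_data(movies_df, "runtime", "running_time")
```

Preferring the Kaggle value by default makes sense when that source is more consistently formatted, with Wikipedia filling only the holes.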
I then created a dataframe of ratings for each movie, pulled from a CSV file also compiled by MovieLens. This dataframe was merged into the existing dataframe of wiki and Kaggle data.
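One way to build per-movie rating counts and merge them in is to group by movie and rating value, pivot the counts into columns, and left-merge on the movie id. The sample data and the `kaggle_id` join key below are assumptions for illustration:

```python
import pandas as pd

# Toy ratings data; column names follow the MovieLens ratings.csv layout.
ratings = pd.DataFrame({
    "movieId": [1, 1, 2, 2, 2],
    "rating": [5.0, 4.0, 3.0, 3.0, 4.0],
})

# Count how many times each movie received each rating value,
# then pivot so each rating value becomes its own column.
rating_counts = (
    ratings.groupby(["movieId", "rating"]).size()
    .reset_index(name="count")
    .pivot(index="movieId", columns="rating", values="count")
)
rating_counts.columns = [f"rating_{c}" for c in rating_counts.columns]

movies_df = pd.DataFrame({"kaggle_id": [1, 2],
                          "title": ["Toy Story", "Jumanji"]})

# Left merge keeps movies with no ratings; fill their counts with zero.
movies_with_ratings_df = movies_df.merge(
    rating_counts, left_on="kaggle_id", right_index=True, how="left"
).fillna(0)
```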
Finally, I exported the merged movies_df and ratings.csv to a SQL database for use in the hackathon. The code used to connect to the database and export each dataframe to a table is shown below:
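The export step with SQLAlchemy can be sketched like this. The project used a PostgreSQL connection string of the form `postgresql://postgres:<password>@127.0.0.1:5432/movie_data`; an in-memory SQLite engine is substituted here so the sketch runs anywhere, and the table names are assumptions:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the project's PostgreSQL database;
# swap in a real connection string for actual use.
engine = create_engine("sqlite://")

movies_df = pd.DataFrame({"kaggle_id": [1, 2],
                          "title": ["Toy Story", "Jumanji"]})

# if_exists="replace" drops and recreates the table, so re-runs are idempotent.
movies_df.to_sql(name="movies", con=engine, if_exists="replace", index=False)

# ratings.csv is large, so it can be streamed to the database in chunks
# to limit memory use (commented out since the file isn't present here):
# for chunk in pd.read_csv("ratings.csv", chunksize=1_000_000):
#     chunk.to_sql(name="ratings", con=engine, if_exists="append", index=False)

# Read the table back to confirm the export worked.
round_trip = pd.read_sql_table("movies", con=engine)
```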