__STYLES__

Unraveling the World of Movies

Tools used in this project
Unraveling the World of Movies

Visualization Section

About this project

Introduction:

In this project, I embarked on a fascinating journey to clean and analyze movies using Microsoft SQL and Tableau. The objective was to gain insights into the world of movies by combining data from the IMDB database and my watched movies database from Letterboxd. By leveraging these datasets, I was able to explore various aspects of movies, including information about the movies themselves, the crews involved, and my viewing habits..

Challenges:

The need to analyze a large volume of movie data from different sources. The IMDB database provided comprehensive information about movies and crews, while my database from Letterboxd contained details about the movies I had watched, my ratings, and the dates I watched them. However, these datasets were not readily compatible, and the Letterboxd database has no unique ID, so I needed to find a way to integrate and analyze them effectively.

Task:

The project’s primary objectives were to clean and merge data from the IMDB and Letterboxd databases. By establishing a connection between these two databases, I aimed to link movie names across all datasets. This process involved data integration and synchronization. The ultimate purpose was to derive valuable insights into movie trends, identify popular actors and directors, and gain a deeper understanding of personal movie preferences.

Action:

To address the challenges at hand, I adopted a multi-step approach:

  • In the initial phase of the project, I obtained pertinent tables from the IMDB website and downloaded my Account data from the Letterboxd website. Subsequently, I imported all the acquired tables into the Microsoft SQL Server, consolidating them into a single database. To ensure data consistency and accuracy, the first step of the cleaning process involved converting all columns into their specific data types. This step lays the groundwork for further analysis and ensures that the data is appropriately structured for subsequent tasks.
  • IMDB datasets use the string value ‘\N’ when specifying a NULL value, so I used the following code to replace all values with similar values with NULL values.
UPDATE [title.basics]
SET
  isAdult = REPLACE(isAdult, '\N', ''),
  startYear = REPLACE(startYear, '\N', ''),
  endYear = REPLACE(endYear, '\N', ''),
  runtimeMinutes = REPLACE(runtimeMinutes, '\N', ''),
  genres = REPLACE(genres, '\N', '')
WHERE
  isAdult LIKE '%\N%' OR
  startYear LIKE '%\N%' OR
  endYear LIKE '%\N%' OR
  runtimeMinutes LIKE '%\N%' OR
  genres LIKE '%\N%'

This code is used on the [title.basics] table, similar code is used for tables: [name.basics] with columns [birthYear], [deathYear], and [knownForTitles]; and [title.principals] with columns [job] and [characters].

  • Since Letterboxd has about 1000 rows or more, I will merge them together for an easier analysis process. First, I will merge the [my_diary] table with the [my_ratings] table into a new table called [my_diary_ratings] as follows:
WITH unknown_date(list) AS (
  SELECT r.Name
  FROM [my_diary] AS d
  RIGHT JOIN [my_ratings] AS r
  ON d.Name = r.Name AND d.Year = r.Year
  WHERE d.Name IS NULL
)

SELECT *
INTO my_diary_ratings
FROM [my_diary] AS d
UNION
SELECT [Date]
      ,[Name]
      ,[Year]
      ,[Letterboxd URI]
      ,[Rating]
      ,NULL AS [Rewatch]
      ,NULL AS [Tags]
      ,NULL AS [Watched Date]
FROM [my_ratings] AS r, unknown_date
WHERE [Name] IN (SELECT * FROM unknown_date)

then I merge the new table with the [my_watched] table into the [my_list] table, and at the same time I will get rid of columns that won’t help in the analysis, also rename the remaining columns as follows:

SELECT w.Name AS title
      ,w.Year AS release_year
      ,r.Rating AS rate
      ,r.Rewatch AS rewatch
      ,[Watched Date] AS watched_date
INTO my_list
FROM [my_watched] AS w
LEFT JOIN [my_diary_ratings] AS r
ON w.Name = r.Name AND w.Year = r.Year
  • Convert the values in the rate column from 5/5 to 10/10 as follows:
UPDATE [my_list]
SET rate = rate * 2
  • Rename some columns in [my_watchlist]:
EXEC sp_rename '[my_watchlist].[Name]', 'title', 'COLUMN'
EXEC sp_rename '[my_watchlist].[Year]', 'release_year', 'COLUMN'
  • Split the [Genre] column in the [title.basics] table into 3 columns as follows:
ALTER TABLE [title.basics]
ADD 
  Genre1 NVARCHAR(255),
  Genre2 NVARCHAR(255),
  Genre3 NVARCHAR(255)

UPDATE [title.basics]
SET 
  Genre1 = NULLIF(PARSENAME(REPLACE(genres, ',', '.'), 1), ''),
  Genre2 = NULLIF(PARSENAME(REPLACE(genres, ',', '.'), 2), ''),
  Genre3 = NULLIF(PARSENAME(REPLACE(genres, ',', '.'), 3), '')
  • Unpivot the three columns of Genre into one column as follows:
SELECT tconst, Genre
INTO [genre]
FROM [title]
UNPIVOT (
        Genre FOR GenreNumber IN (Genre1, Genre2, Genre3)
        ) AS unpivoted_data
  • Now that it’s time to add an ID to [my_list], I will use 3 relationships to try to connect movies from IMDB datasets to the my_list table. By using the following query, I can check which movies didn’t connect successfully:
SELECT DISTINCT m.title,
                m.release_year
FROM [my_list] AS m
LEFT JOIN (SELECT * FROM [title.basics] WHERE titleType = 'movie') AS t
ON (m.title = t.primaryTitle OR m.title = t.originalTitle)
  AND m.release_year = t.year
WHERE t.primaryTitle IS NULL
  • Look and update all movies that couldn’t connect by using (UPDATE + SET) functions, some have different names while others have different years (by 1 year mostly for movies released at the end of the year), and the third type is where they are not in movies type (some as short while others as video and tvMiniSeries)
  • Finally, we can add the id column to my_list as follows:
ALTER TABLE [my_list]
ADD id VARCHAR(200)

UPDATE [my_list]
SET id = t.tconst
FROM (SELECT * FROM [title.basics] WHERE titleType = 'movie') AS t
WHERE (title = t.primaryTitle OR title = t.originalTitle)
  AND release_year = t.year

After that, I did the process again for titles that aren’t in the movie type, then explored my_list and checked if all the titles got the right id (by checking the [numVotes] from [title.ratings] and manually checking titles with low rates).

  • Now that everything is checked and ready, we have to export the data into CSV files (since Tableau Public doesn’t support connecting data from Microsoft SQL).

Result:

The actions taken throughout this project yielded significant results:

  1. The most-watched genres were comedy, drama, action, and adventure. However, dramas received the highest average ratings, while crime, thriller, and romance genres also garnered favorable ratings. undefined

  2. The average rating for rated movies was 7.35/10, while movies that were rewatched received an average rating of 9/10, This indicates that I tend to rewatch movies that I truly love and enjoy. undefined

  3. Movie preferences were focused on releases between 2000 and 2020, with a notable concentration of highly-watched movies released between 2005 and 2014. Interestingly, during this period of preference, older releases received higher average ratings compared to more recent ones. undefined

  4. Movie-watching habits showed a consistent trend, with a higher frequency of watching movies around March, September, and January. undefined

  5. Based on the Genre Trend analysis, it appears that I tend to watch movies of various genres during peak months when movie-watching is more frequent. However, during months when I watch fewer movies, I gravitate towards movies within my favorite genres. This observation suggests that my movie preferences may be influenced by the frequency of movie-watching, leading me to explore different genres when I watch movies more often and focus on my preferred genres when movie-watching is less frequent.undefined

  6. Correlations between personal ratings and IMDb ratings indicated a preference for higher-rated movies. However, movie duration did not strongly influence personal ratings. undefined

  7. Crew analysis provided valuable insights for building a suggestion model to recommend movies based on favorite crews and other preferences.

In summary, this project offers valuable insights into the movie landscape and my individual preferences. The journey doesn't end here; I can delve deeper and create a recommendation model for movies based on my preferences. Nonetheless, the project's purpose extends beyond enhancing my analytical skills; it has enriched my movie-watching experience and demonstrated my proficiency in leveraging Microsoft SQL and Tableau for data analysis from the IMDB and Letterboxd databases. The process of data cleaning, integration, and analysis has been instrumental in achieving these outcomes.

Resources:

  • Added "name.basics.tsv", "title.basics.tsv", "title.principals.tsv" and "title.ratings.tsv" from IMDB Datasets
  • I downloaded my Data from my account at Letterboxd, which includes the following tables: "my_diary.csv", "my_ratings.csv", "my_watched.csv" and "my_watchlist.csv"
Discussion and feedback(0 comments)
2000 characters remaining