__STYLES__
In this project, I embarked on a fascinating journey to clean and analyze movies using Microsoft SQL and Tableau. The objective was to gain insights into the world of movies by combining data from the IMDB database and my watched movies database from Letterboxd. By leveraging these datasets, I was able to explore various aspects of movies, including information about the movies themselves, the crews involved, and my viewing habits..
The need to analyze a large volume of movie data from different sources. The IMDB database provided comprehensive information about movies and crews, while my database from Letterboxd contained details about the movies I had watched, my ratings, and the dates I watched them. However, these datasets were not readily compatible, and the Letterboxd database has no unique ID, so I needed to find a way to integrate and analyze them effectively.
The project’s primary objectives were to clean and merge data from the IMDB and Letterboxd databases. By establishing a connection between these two databases, I aimed to link movie names across all datasets. This process involved data integration and synchronization. The ultimate purpose was to derive valuable insights into movie trends, identify popular actors and directors, and gain a deeper understanding of personal movie preferences.
To address the challenges at hand, I adopted a multi-step approach:
UPDATE [title.basics]
SET
isAdult = REPLACE(isAdult, '\N', ''),
startYear = REPLACE(startYear, '\N', ''),
endYear = REPLACE(endYear, '\N', ''),
runtimeMinutes = REPLACE(runtimeMinutes, '\N', ''),
genres = REPLACE(genres, '\N', '')
WHERE
isAdult LIKE '%\N%' OR
startYear LIKE '%\N%' OR
endYear LIKE '%\N%' OR
runtimeMinutes LIKE '%\N%' OR
genres LIKE '%\N%'
This code is used on the [title.basics] table, similar code is used for tables: [name.basics] with columns [birthYear], [deathYear], and [knownForTitles]; and [title.principals] with columns [job] and [characters].
WITH unknown_date(list) AS (
SELECT r.Name
FROM [my_diary] AS d
RIGHT JOIN [my_ratings] AS r
ON d.Name = r.Name AND d.Year = r.Year
WHERE d.Name IS NULL
)
SELECT *
INTO my_diary_ratings
FROM [my_diary] AS d
UNION
SELECT [Date]
,[Name]
,[Year]
,[Letterboxd URI]
,[Rating]
,NULL AS [Rewatch]
,NULL AS [Tags]
,NULL AS [Watched Date]
FROM [my_ratings] AS r, unknown_date
WHERE [Name] IN (SELECT * FROM unknown_date)
then I merge the new table with the [my_watched] table into the [my_list] table, and at the same time I will get rid of columns that won’t help in the analysis, also rename the remaining columns as follows:
SELECT w.Name AS title
,w.Year AS release_year
,r.Rating AS rate
,r.Rewatch AS rewatch
,[Watched Date] AS watched_date
INTO my_list
FROM [my_watched] AS w
LEFT JOIN [my_diary_ratings] AS r
ON w.Name = r.Name AND w.Year = r.Year
UPDATE [my_list]
SET rate = rate * 2
EXEC sp_rename '[my_watchlist].[Name]', 'title', 'COLUMN'
EXEC sp_rename '[my_watchlist].[Year]', 'release_year', 'COLUMN'
ALTER TABLE [title.basics]
ADD
Genre1 NVARCHAR(255),
Genre2 NVARCHAR(255),
Genre3 NVARCHAR(255)
UPDATE [title.basics]
SET
Genre1 = NULLIF(PARSENAME(REPLACE(genres, ',', '.'), 1), ''),
Genre2 = NULLIF(PARSENAME(REPLACE(genres, ',', '.'), 2), ''),
Genre3 = NULLIF(PARSENAME(REPLACE(genres, ',', '.'), 3), '')
SELECT tconst, Genre
INTO [genre]
FROM [title]
UNPIVOT (
Genre FOR GenreNumber IN (Genre1, Genre2, Genre3)
) AS unpivoted_data
SELECT DISTINCT m.title,
m.release_year
FROM [my_list] AS m
LEFT JOIN (SELECT * FROM [title.basics] WHERE titleType = 'movie') AS t
ON (m.title = t.primaryTitle OR m.title = t.originalTitle)
AND m.release_year = t.year
WHERE t.primaryTitle IS NULL
ALTER TABLE [my_list]
ADD id VARCHAR(200)
UPDATE [my_list]
SET id = t.tconst
FROM (SELECT * FROM [title.basics] WHERE titleType = 'movie') AS t
WHERE (title = t.primaryTitle OR title = t.originalTitle)
AND release_year = t.year
After that, I did the process again for titles that aren’t in the movie type, then explored my_list and checked if all the titles got the right id (by checking the [numVotes] from [title.ratings] and manually checking titles with low rates).
The actions taken throughout this project yielded significant results:
The most-watched genres were comedy, drama, action, and adventure. However, dramas received the highest average ratings, while crime, thriller, and romance genres also garnered favorable ratings.
The average rating for rated movies was 7.35/10, while movies that were rewatched received an average rating of 9/10, This indicates that I tend to rewatch movies that I truly love and enjoy.
Movie preferences were focused on releases between 2000 and 2020, with a notable concentration of highly-watched movies released between 2005 and 2014. Interestingly, during this period of preference, older releases received higher average ratings compared to more recent ones.
Movie-watching habits showed a consistent trend, with a higher frequency of watching movies around March, September, and January.
Based on the Genre Trend analysis, it appears that I tend to watch movies of various genres during peak months when movie-watching is more frequent. However, during months when I watch fewer movies, I gravitate towards movies within my favorite genres. This observation suggests that my movie preferences may be influenced by the frequency of movie-watching, leading me to explore different genres when I watch movies more often and focus on my preferred genres when movie-watching is less frequent.
Correlations between personal ratings and IMDb ratings indicated a preference for higher-rated movies. However, movie duration did not strongly influence personal ratings.
Crew analysis provided valuable insights for building a suggestion model to recommend movies based on favorite crews and other preferences.
In summary, this project offers valuable insights into the movie landscape and my individual preferences. The journey doesn't end here; I can delve deeper and create a recommendation model for movies based on my preferences. Nonetheless, the project's purpose extends beyond enhancing my analytical skills; it has enriched my movie-watching experience and demonstrated my proficiency in leveraging Microsoft SQL and Tableau for data analysis from the IMDB and Letterboxd databases. The process of data cleaning, integration, and analysis has been instrumental in achieving these outcomes.