Tools used in this project
Netflix Prize | R

About this project

The Data

The dataset is a 10-million-row subset of the MovieLens dataset, a database generated by the GroupLens research lab. Each row represents the rating given by one user to one movie. The algorithms were built in RStudio using the R language (skip to Attributes if you're not familiar with machine learning). You want to avoid testing an algorithm with the same data it was trained on, since that could make it seem to perform better than it actually will. So the first step was to use the code provided by the course to split the dataset into a train set (90% of the data) and a final test set (10% of the data). The train set was then similarly split for initial testing. I also split the Title column into Title and (Release) Year using the stringr package, converted Timestamp to Date using the lubridate package, and then added a column for (Rating) Age by subtracting the release Year from the year of the rating Date.
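To illustrate, here is a minimal sketch of that feature engineering on a toy data frame. It uses base R in place of stringr/lubridate so it is self-contained; the `ratings` data frame and its column names are illustrative, not the actual course data.

```r
# Toy stand-in for the MovieLens ratings data
ratings <- data.frame(
  title     = c("Toy Story (1995)", "Heat (1995)"),
  timestamp = c(838985046, 868245920)  # seconds since 1970-01-01 (Unix epoch)
)

# Split "Title (Year)" into a Title column and a Release Year column
ratings$release_year <- as.integer(sub(".*\\((\\d{4})\\)$", "\\1", ratings$title))
ratings$title        <- sub("\\s*\\(\\d{4}\\)$", "", ratings$title)

# Convert the Unix timestamp to a Date, then derive the rating's Age
ratings$date       <- as.Date(as.POSIXct(ratings$timestamp,
                                         origin = "1970-01-01", tz = "UTC"))
ratings$rating_age <- as.integer(format(ratings$date, "%Y")) - ratings$release_year
```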


There were columns for User Id, Movie Id, Rating, Timestamp, Title/Release Year, and Genre.

  • UserId: there were more than 9 million ratings in the train set from ~70K distinct users
  • MovieId: there were 10,677 distinct movies in the train set
  • Rating: ratings ranged from 0.5 to 5, in increments of 0.5
  • Timestamp: the oldest rating in the dataset was 9/1/95. The most recent was 9/14/05
  • Title/Release Year: the movies were released between 1915 and 2008
  • Genres: there were 797 genres, many of which were various combinations of 15 distinct genres


A "naive" model, in which every rating is simply predicted to be the average movie rating of 3.512, resulted in an RMSE of 1.06. The goal of the project was to create an algorithm with an RMSE below 0.8649 on the final test set.
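That naive baseline takes only a few lines in R. This is a sketch with toy vectors standing in for the real train and test sets:

```r
# RMSE: root mean squared error between actual and predicted ratings
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Toy stand-ins for the real train/test rating vectors
train_ratings <- c(4, 3.5, 5, 2, 4.5, 3)
test_ratings  <- c(3, 4.5, 4)

mu         <- mean(train_ratings)   # global average rating
naive_rmse <- rmse(test_ratings, mu)  # every prediction is just mu
```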

In the initial Exploratory Data Analysis, a few biases were evident. It was clear, for example, that some users consistently rate higher or lower than others, and that the more often a movie was rated, the higher its average rating tended to be.

[Figure: rate = number of ratings per year since release, release year through 2009]

That makes sense: those movies are more popular for a reason. There also appeared to be a minor effect of the rating Date on the Rating, and movies in genre categories that included Drama were the most highly rated.


I used a combination of Normalization with least squares and Regularization to build my final algorithm (skip to Conclusion if you're not that into the math behind the algorithms).

  • LEAST SQUARES: The least squares method starts from the fact that the predicted rating (Y) minus the average rating (µ) equals the error (ϵ):

Y_u,i − µ = ϵ_u,i   (Prediction − Avg = Error, where u = user and i = movie)

Rearranging, the predicted rating (Y) equals the average rating (µ) plus the error (ϵ):

Y_u,i = µ + ϵ_u,i   (Prediction = Avg + Error)

We can then account for biases like the movie effect (b_i) by adding them:

Y_u,i = µ + b_i + ϵ_u,i   (Prediction = Avg + Movie Effect + Error, where u = user and i = movie)

That's what I did for UserId, MovieId, Genres, Age, and Year.
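The standard way to estimate each bias term is as the mean residual within its group. Here is a sketch of that idea for just the movie and user effects, on illustrative toy data (base R `tapply` is used here; the variable names are my own, not the course code's):

```r
# Toy stand-in for the train set
train <- data.frame(
  userId  = c(1, 1, 2, 2, 3),
  movieId = c(10, 20, 10, 30, 20),
  rating  = c(4, 3, 5, 2, 4)
)

mu <- mean(train$rating)  # global average rating

# Movie effect b_i: average residual (rating - mu) per movie
b_i <- tapply(train$rating - mu, train$movieId, mean)

# User effect b_u: average residual after removing mu and b_i
resid_u <- train$rating - mu - b_i[as.character(train$movieId)]
b_u     <- tapply(resid_u, train$userId, mean)

# Predicted rating: mu + movie effect + user effect
pred <- mu + b_i[as.character(train$movieId)] + b_u[as.character(train$userId)]
```

Each additional effect (Genres, Age, Year) is layered on the same way: compute the mean of what is left over after subtracting all the effects estimated so far.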

  • REGULARIZATION: The predicted Top and Bottom 10 movies from the resulting algorithm were very obscure, and most were rated by very few users.


Fewer users means more uncertainty, which leads to larger errors and an increased RMSE. Regularization adds a penalty (scaled by lambda) to large estimates formed from small sample sizes, by minimizing this equation:

(1/N) Σ_u,i (y_u,i − µ − b_i)² + λ Σ_i b_i²   (mean squared error + a penalty term that increases when many b's are large)

[Figure: RMSE vs. lambda; optimal λ = 4.9]

Lambda was optimized by running a sequence of lambdas through cross-validation and selecting the one that yielded the lowest RMSE.
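For the movie effect, the regularized estimate divides the summed residuals by (n_i + λ) instead of n_i, so effects based on few ratings shrink toward zero. A sketch of that tuning loop, on toy data (the real project swept lambda with cross-validation on a held-out split):

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Toy stand-ins for the train and validation sets
train <- data.frame(movieId = c(10, 10, 20, 30),
                    rating  = c(5, 4, 3, 1))
test  <- data.frame(movieId = c(10, 20),
                    rating  = c(4.5, 3.5))

mu      <- mean(train$rating)
lambdas <- seq(0, 10, 0.25)

rmses <- sapply(lambdas, function(lambda) {
  # Regularized movie effect: sum of residuals / (n_i + lambda)
  b_i  <- tapply(train$rating - mu, train$movieId,
                 function(r) sum(r) / (length(r) + lambda))
  pred <- mu + b_i[as.character(test$movieId)]
  rmse(test$rating, pred)
})

best_lambda <- lambdas[which.min(rmses)]  # lambda with the lowest RMSE
```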

[Figure: regularized Best/Worst predicted movies. "Da Hip Hop Witch" was released in 2003, apparently]

After Regularization, the lists of best and worst predicted movies were well-known good and bad movies. I had personally never heard of "Da Hip Hop Witch," but it did star both Eminem and Vanilla Ice, so it stayed ranked in the bottom along with "Disaster Movie" and "From Justin to Kelly."

Note: many of the top finishers in the actual Netflix competition used a combination of Normalization, Neighborhood Models, Implicit Data, Matrix Factorization, Regression, RBMs, Temporal Effects, Regularization, and Ensemble Methods. I attempted to use linear regression with the caret package, but my computer didn't have enough RAM. I also tried my hand (remember, I'm new to this) at Ensemble methods and successfully built a very basic regression tree, but my computer crashed when I tried random forests.


🎉 The train set yielded an 18.6% improvement on the naive model's RMSE. The RMSE attained on the final test set by the final algorithm, which was both Regularized and optimized, was 0.8639, below the goal of 0.8649.

Discussion and feedback (1 comment)

Alice Zhao, 6 months ago:
Love seeing an R project on here! The EDA results were very interesting, with comedies getting rated so low, but I guess that makes sense given the variety of comedies out there. I also like how you tried a predictive model, but mention other techniques that were used in the competition. Well done!