The dataset is a 10-million-row subset of the MovieLens dataset, a database generated by the GroupLens research lab. Each row represents the rating given by one user to one movie. The algorithms were built in RStudio using the R language (skip to Attributes if you're not familiar with machine learning). You want to avoid testing an algorithm with the same data it was trained on, since that can make it look like it will perform better than it actually will. So the first step was to use the code provided by the course to split the dataset into a train set (90% of the data) and a final test set (10% of the data). The train set was then similarly split for initial testing. I also split the Title column into Title and (Release) Year using the stringr package, converted Timestamp to Date using the lubridate package, and then added a column for (Rating) Age by subtracting the release Year from the year of the rating Date.
There were columns for User Id, Movie Id, Rating, Timestamp, Title (which included the Release Year), and Genres.
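A minimal sketch of that wrangling step, assuming the course's standard edx-style data frame, where title looks like "Boomerang (1992)" and timestamp is seconds since the Unix epoch (the column names here are my own):

```r
library(dplyr)
library(stringr)
library(lubridate)

# Split "Title (Year)" into a clean title and a numeric release year,
# convert the Unix timestamp to a calendar date, and derive the rating age
edx <- edx %>%
  mutate(release_year = as.integer(str_extract(title, "(?<=\\()\\d{4}(?=\\)$)")),
         title        = str_remove(title, "\\s*\\(\\d{4}\\)$"),
         date         = as_date(as_datetime(timestamp)),
         age          = year(date) - release_year)
```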
A "naive" model, where a rating is basically just expected to be the average movie rating of 3.512, resulted in an RMSE of 1.06. The goal of the project was to create an algorithm with an RMSE below 0.8649 on the final test set.
During the initial exploratory data analysis, a few biases were evident. It was clear, for example, that some users consistently rate higher or lower than others. And the more often a movie was rated, the higher its average rating tended to be:
$$\text{rate} = \frac{\text{number of ratings}}{2009 - \text{release year}}$$
That makes sense: those movies are more popular for a reason. There also appeared to be a minor effect of the rating Date on the Rating itself, and movies in genre categories that included Drama were the most highly rated.
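As a sketch of that rating-rate calculation (reusing the wrangled columns from above; object names are mine):

```r
# Per-movie rating rate: ratings per year from release through 2009,
# alongside the movie's average rating
movie_rates <- train_set %>%
  group_by(movieId) %>%
  summarize(n    = n(),
            rate = n / pmax(2009 - first(release_year), 1),  # guard against zero years
            avg  = mean(rating))
```

Plotting rate against avg is what shows the upward trend described above.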
I used a combination of normalization with least squares and regularization to build my final algorithm (skip to Conclusion if you're not that into the math behind the algorithms).
Normalization starts from the fact that any single rating differs from the overall average by some error:

$$Y_{u,i} - \mu = \epsilon_{u,i}, \quad \text{where } u = \text{user and } i = \text{movie}$$

Since that's true, the predicted rating (Y) equals the average rating (µ) plus the error (ϵ):

$$Y_{u,i} = \mu + \epsilon_{u,i}$$

We can then account for biases, like the movie effect (b_i), by adding terms for them:

$$Y_{u,i} = \mu + b_i + \epsilon_{u,i}$$
That's what I did, in turn, for the User Id, Movie Id, Genres, Age, and Year effects, estimating each one from the residuals left over after the terms before it.
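Here's a sketch of the first two terms; each bias is the least-squares estimate, which works out to the mean residual for that group (object names are mine, and the Genres, Age, and Year effects follow the same pattern):

```r
# Movie effect: average residual per movie
b_i <- train_set %>%
  group_by(movieId) %>%
  summarize(b_i = mean(rating - mu))

# User effect: average residual per user, after removing the movie effect
b_u <- train_set %>%
  left_join(b_i, by = "movieId") %>%
  group_by(userId) %>%
  summarize(b_u = mean(rating - mu - b_i))

# Predictions on the held-out set
preds <- test_set %>%
  left_join(b_i, by = "movieId") %>%
  left_join(b_u, by = "userId") %>%
  mutate(pred = mu + b_i + b_u) %>%
  pull(pred)
```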
Movies rated by fewer users carry more uncertainty, which leads to larger errors and an inflated RMSE. Regularization adds a penalty (lambda) to large estimates formed from small sample sizes by minimizing the mean squared error plus a term that grows when many b's are large:

$$\frac{1}{N}\sum_{u,i}\left(y_{u,i} - \mu - b_i - b_u\right)^2 + \lambda\left(\sum_i b_i^2 + \sum_u b_u^2\right)$$
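Minimizing that expression has a convenient closed form: each effect is shrunk toward zero in proportion to how few ratings support it. For the movie effect, for example,

$$\hat{b}_i(\lambda) = \frac{1}{n_i + \lambda}\sum_{u=1}^{n_i}\left(y_{u,i} - \hat{\mu}\right)$$

where $n_i$ is the number of ratings for movie $i$. When $n_i$ is large, $\lambda$ barely matters; when it's small, the estimate shrinks toward zero.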
Lambda was optimized by running a sequence of candidate values through cross-validation and selecting the one that yielded the lowest RMSE; the optimum was λ = 4.9.
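A sketch of that tuning loop, reusing the RMSE helper and mu from above (the candidate range is my illustration, and the full script also regularizes the Genres, Age, and Year terms the same way):

```r
lambdas <- seq(3, 7, 0.1)

rmses <- sapply(lambdas, function(l) {
  # Regularized effects: divide by n + lambda instead of n
  b_i <- train_set %>%
    group_by(movieId) %>%
    summarize(b_i = sum(rating - mu) / (n() + l))
  b_u <- train_set %>%
    left_join(b_i, by = "movieId") %>%
    group_by(userId) %>%
    summarize(b_u = sum(rating - mu - b_i) / (n() + l))
  preds <- test_set %>%
    left_join(b_i, by = "movieId") %>%
    left_join(b_u, by = "userId") %>%
    mutate(pred = mu + b_i + b_u) %>%
    pull(pred)
  RMSE(test_set$rating, preds)
})

best_lambda <- lambdas[which.min(rmses)]  # 4.9 in my run
```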
"Da Hip Hop Witch" was released in 2003, apparently
After regularization, the lists of best- and worst-predicted movies were filled with well-known good and bad movies. I had personally never heard of "Da Hip Hop Witch," but it did star both Eminem and Vanilla Ice, so it stayed ranked at the bottom along with "Disaster Movie" and "From Justin to Kelly."
Note: many of the top finishers in the actual Netflix competition used a combination of normalization, neighborhood models, implicit data, matrix factorization, regression, RBMs, temporal effects, regularization, and ensemble methods. I attempted linear regression with the caret package, but my computer didn't have the RAM. I also tried my hand (remember, I'm new to this) at ensemble methods and successfully built a very basic regression tree, but my computer crashed when I tried random forests.
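For the curious, a very basic regression tree along those lines might look like the sketch below; the sampled subset, the formula, and treating the ids as numeric predictors are my illustration, not the original script:

```r
library(caret)
library(rpart)

# Fit a simple regression tree on a manageable sample of the ratings;
# crude, but it runs within modest memory limits
set.seed(1)
small <- train_set[sample(nrow(train_set), 1e5), ]

tree_fit <- train(rating ~ movieId + userId + release_year + age,
                  data = small, method = "rpart")
```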
🎉 On the train set, the final algorithm, both regularized and optimized, yielded an 18.6% improvement over the naive model's RMSE. On the final test set it attained an RMSE of 0.8639, below the goal of 0.8649.