Predicting Diamond Prices

About this project

Table of Contents

  1. Objective
  2. Exploratory Data Analysis
    1. Diamond Dataset
    2. Data Cleaning
    3. Feature Engineering
  3. Modeling
    1. Model Pipeline
    2. Model Scoring
    3. Random Forest
    4. Linear Regression

Objective

Build a model that will predict price for each diamond, given an input file with diamond features.

Exploratory Data Analysis

Diamond Dataset

The diamond dataset contains prices and other attributes of 40,000 diamonds. The raw data has 9 features: 3 ordinal categorical (cut, color and clarity) and 6 numerical (price, depth, table, x, y and z).

The features are described as follows:

ORDINAL CATEGORICAL

Cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)

Color: diamond color, from J (worst) to D (best)

Clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

NUMERICAL

Price: in US dollars

Depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y)

Table: width of top of diamond relative to widest point

X: length in mm

Y: width in mm

Z: depth in mm

Table 1: Initial diamond dataset

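| | price | cut | color | clarity | depth | table | x | y | z |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 326 | Ideal | E | SI2 | 61.5 | 55 | 3.95 | 3.98 | 2.43 |
| 1 | 326 | Premium | E | SI1 | 59.8 | 61 | 3.89 | 3.84 | 2.31 |
| 2 | 327 | Good | E | VS1 | 56.9 | 65 | 4.05 | 4.07 | 2.31 |
| 3 | 334 | Premium | I | VS2 | 62.4 | 58 | 4.2 | 4.23 | 2.63 |
| 4 | 335 | Good | J | SI2 | 63.3 | 58 | 4.34 | 4.35 | 2.75 |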

Data Cleaning

The dataset needed to be cleaned prior to model building. The descriptive statistics show that there are rows where at least one dimension of the diamond is 0, which is impossible for a three-dimensional object. The minimum values for the x, y and z dimensions are 0, as shown in Table 2.

Table 2: Descriptive statistics of dataset summarizing the central tendency, dispersion and shape of distribution

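| | price | depth | table | x | y | z |
|---|---|---|---|---|---|---|
| count | 40000 | 40000 | 40000 | 40000 | 40000 | 40000 |
| mean | 3927.02 | 61.7537 | 57.4608 | 5.72918 | 5.73174 | 3.53813 |
| std | 3982.23 | 1.43 | 2.23462 | 1.12113 | 1.12016 | 0.709047 |
| min | 326 | 43 | 43 | 0 | 0 | 0 |
| 25% | 949 | 61 | 56 | 4.71 | 4.72 | 2.91 |
| 50% | 2401 | 61.8 | 57 | 5.7 | 5.71 | 3.52 |
| 75% | 5313.25 | 62.5 | 59 | 6.54 | 6.54 | 4.0325 |
| max | 18823 | 79 | 95 | 10.14 | 31.8 | 31.8 |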

Since there were 40,000 data points, there was enough data that these bad rows could be dropped. Rows where any value was more than 3 standard deviations from the mean were dropped from the dataset. This method is preferred because it is productionizable: the cleaning function can be applied to any new dataframe and offers reproducible results. This left 38,371 rows, as shown in Table 3.

Table 3: Descriptive statistics summarizing the central tendency, dispersion and shape of distribution for cleaned dataset

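| | price | depth | table | x | y | z |
|---|---|---|---|---|---|---|
| count | 38371 | 38371 | 38371 | 38371 | 38371 | 38371 |
| mean | 3615.54 | 61.757 | 57.3739 | 5.6665 | 5.66966 | 3.50025 |
| std | 3473.15 | 1.26721 | 2.09609 | 1.06854 | 1.06103 | 0.660391 |
| min | 326 | 57.5 | 51 | 3.73 | 3.71 | 1.53 |
| 25% | 928 | 61.1 | 56 | 4.69 | 4.7 | 2.89 |
| 50% | 2316 | 61.8 | 57 | 5.66 | 5.67 | 3.5 |
| 75% | 5080 | 62.5 | 59 | 6.49 | 6.49 | 4.02 |
| max | 15873 | 66 | 64 | 9.08 | 9.01 | 5.65 |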
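
The 3 standard deviation rule described above can be sketched in plain Python. This is only an illustration, not the project's cleaning function: the name drop_outliers, the row-as-dictionary layout and the toy data are assumptions, and the sample standard deviation is used on the assumption that it matches the statistics in the tables.

```python
from statistics import mean, stdev


def drop_outliers(rows, columns, n_std=3.0):
    """Keep rows whose values lie within n_std standard deviations of each column mean."""
    bounds = {}
    for column in columns:
        values = [row[column] for row in rows]
        center = mean(values)
        spread = stdev(values)  # sample standard deviation (assumption)
        bounds[column] = (center - n_std * spread, center + n_std * spread)
    return [
        row
        for row in rows
        if all(bounds[c][0] <= row[c] <= bounds[c][1] for c in columns)
    ]


# Toy data (hypothetical): 20 plausible rows plus one with an implausible width.
rows = [{"price": 1000 + 50 * i, "y": 5.0 + 0.1 * i} for i in range(20)]
rows.append({"price": 2000, "y": 31.8})

cleaned = drop_outliers(rows, columns=["price", "y"])
print(len(rows), "rows before,", len(cleaned), "rows after")
```

Because the bounds are computed from the data passed in, the same function can be reused on any new set of rows.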

Feature Engineering

Now that the dataset has been cleaned of outliers, some features need to be engineered to prepare for model building. The ordinal categorical attributes have a defined order among their values. Based on this order, a mapping scheme was generated where each cut, color and clarity value was mapped to a number according to its position in the hierarchy. In this scheme, a higher number indicates a "better" value.

Table 4: Replacing categorical values with ordinal values

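| | price | cut | color | clarity | depth | table | x | y | z |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 326 | 5 | 6 | 2 | 61.5 | 55 | 3.95 | 3.98 | 2.43 |
| 1 | 326 | 4 | 6 | 3 | 59.8 | 61 | 3.89 | 3.84 | 2.31 |
| 2 | 327 | 2 | 6 | 5 | 56.9 | 65 | 4.05 | 4.07 | 2.31 |
| 3 | 334 | 4 | 2 | 4 | 62.4 | 58 | 4.2 | 4.23 | 2.63 |
| 4 | 335 | 2 | 1 | 2 | 63.3 | 58 | 4.34 | 4.35 | 2.75 |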
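
A minimal sketch of such a mapping scheme is given below. The rank values are inferred from the category orders listed in the feature descriptions and from the encoded values in Table 4; the names rank_map, ORDINAL_MAPS and encode_ordinals are hypothetical and not taken from the project code.

```python
# Category orders from worst to best, as listed in the feature descriptions.
CUT_ORDER = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR_ORDER = ["J", "I", "H", "G", "F", "E", "D"]
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]


def rank_map(ordered_values):
    """Map each category to its 1-based rank, so a higher number means 'better'."""
    return {value: rank for rank, value in enumerate(ordered_values, start=1)}


ORDINAL_MAPS = {
    "cut": rank_map(CUT_ORDER),
    "color": rank_map(COLOR_ORDER),
    "clarity": rank_map(CLARITY_ORDER),
}


def encode_ordinals(row):
    """Return a copy of the row with cut, color and clarity replaced by their ranks."""
    encoded = dict(row)
    for column, mapping in ORDINAL_MAPS.items():
        encoded[column] = mapping[row[column]]
    return encoded


# First row of Table 1.
first_row = {"price": 326, "cut": "Ideal", "color": "E", "clarity": "SI2"}
encoded = encode_ordinals(first_row)
print(encoded)
```

Applied to the first row of Table 1, this gives the cut, color and clarity values shown in the first row of Table 4.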

From the descriptions of each feature, it is obvious that depth is collinear with the dimensions, as depth is a function of x, y and z. Multicollinearity occurs when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy. To avoid it, this feature can be dropped from the dataset.

Another solution is to use decision tree or boosted tree algorithms, as they are largely immune to multicollinearity: a tree will choose only one of the perfectly correlated features at each split. I chose to implement both a tree-based model (random forest) and a linear regression model, and tested the accuracy of each.

Furthermore, the dimensions can be engineered into a volume by multiplying them to create a new feature. This greatly simplifies the model by reducing 3 features to 1.

Table 5: Dataframe with x, y and z reduced to volume

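| | price | cut | color | clarity | table | volume |
|---|---|---|---|---|---|---|
| 0 | 326 | 5 | 6 | 2 | 55 | 38.202 |
| 1 | 326 | 4 | 6 | 3 | 61 | 34.5059 |
| 3 | 334 | 4 | 2 | 4 | 58 | 46.7246 |
| 4 | 335 | 2 | 1 | 2 | 58 | 51.9172 |
| 5 | 336 | 3 | 1 | 6 | 57 | 38.694 |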
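
As a quick check of the volume feature, the sketch below multiplies the three dimensions of the first diamond. The function name add_volume and the row-as-dictionary layout are assumptions made for illustration.

```python
def add_volume(row):
    """Return a copy of the row with x, y and z replaced by their product."""
    reduced = {key: value for key, value in row.items() if key not in ("x", "y", "z")}
    reduced["volume"] = row["x"] * row["y"] * row["z"]
    return reduced


# Dimensions of the first diamond in Table 4.
row = {"price": 326, "table": 55, "x": 3.95, "y": 3.98, "z": 2.43}
reduced = add_volume(row)
print(sorted(reduced), round(reduced["volume"], 3))
```

The result agrees with the volume listed for the first row of Table 5.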

Finally, the distribution of the prices in our dataset is heavily right-skewed, with most prices concentrated near the low end of the range. Figure 1 shows a strip plot of the distribution of the prices, where the shading corresponds to the density of the data.

Figure 1: Strip plot of the raw price datapoints

Distributions such as this often indicate that a variable needs to be transformed. If we were to fit a model to predict these responses, we would expect our residuals (the difference between what we observe and what we predict) to be similarly skewed. Figure 2 shows studentized residuals for a model fit to the price.

Figure 2: Studentized residuals of the predicted price from a linear regression model

Skewed data can often be fixed by taking the logarithm of the variable. This gives a more even distribution of the price data, and it can be easily reversed when interpreting the model by exponentiating the responses after predicting. Figure 3 shows the distribution from Figure 1, transformed by taking a logarithm.

Figure 3: Strip plot of the logarithm of the price datapoints

We should now expect our residuals to be much more evenly centered around 0 in our studentized residual plot.

Figure 4: Studentized residuals of the predicted log price from a linear regression model

The same procedure was applied to the volume to evenly distribute the datapoints. The final dataframe used for modeling can be seen in Table 6.

Table 6: Final dataframe for model building

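| | log_price | color | cut | clarity | log_volume | table |
|---|---|---|---|---|---|---|
| 0 | 5.7869 | 6 | 5 | 2 | 3.64289 | 55 |
| 1 | 5.7869 | 6 | 4 | 3 | 3.54113 | 61 |
| 3 | 5.81114 | 2 | 4 | 4 | 3.84427 | 58 |
| 4 | 5.81413 | 1 | 2 | 2 | 3.94965 | 58 |
| 5 | 5.81711 | 1 | 3 | 6 | 3.65568 | 57 |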
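
The log transform and its reversal can be sketched in a few lines. The natural logarithm is assumed here because it reproduces the values in Table 6; the variable names are illustrative only.

```python
import math

# Price and volume of the first diamond (Tables 1 and 5).
price = 326
volume = 3.95 * 3.98 * 2.43

log_price = math.log(price)    # transformed response used for training
log_volume = math.log(volume)  # transformed feature

# After predicting on the log scale, exponentiate to get back to US dollars.
recovered_price = math.exp(log_price)

print(round(log_price, 4), round(log_volume, 4), round(recovered_price, 2))
```

The two logarithms are close to the log_price and log_volume values in the first row of Table 6, and exponentiating recovers the original price.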

Modeling

Model Pipeline

The above preprocessing and feature engineering were integrated into a scikit-learn pipeline. A pipeline sequentially applies a list of transforms and a final estimator; intermediate steps of the pipeline must implement fit and transform methods, while the final estimator only needs to implement fit. The code for the diamond model can be found here.
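
To make the pipeline idea concrete, here is a minimal sketch, not the project's actual code. It assumes the features are already ordinal-encoded and ordered as cut, color, clarity, table, x, y, z; the step names and the helper dimensions_to_log_volume are hypothetical. The log transform of the price is applied outside the pipeline for simplicity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def dimensions_to_log_volume(X):
    """Replace the trailing x, y, z columns with a single log-volume column."""
    X = np.asarray(X, dtype=float)
    log_volume = np.log(X[:, -3] * X[:, -2] * X[:, -1])
    return np.column_stack([X[:, :-3], log_volume])


# An intermediate transform followed by the final estimator.
pipeline = Pipeline(
    steps=[
        ("log_volume", FunctionTransformer(dimensions_to_log_volume)),
        ("regressor", LinearRegression()),
    ]
)

# Columns: cut, color, clarity, table, x, y, z (the rows of Table 4).
X_train = np.array([
    [5, 6, 2, 55, 3.95, 3.98, 2.43],
    [4, 6, 3, 61, 3.89, 3.84, 2.31],
    [2, 6, 5, 65, 4.05, 4.07, 2.31],
    [4, 2, 4, 58, 4.20, 4.23, 2.63],
    [2, 1, 2, 58, 4.34, 4.35, 2.75],
])
y_train = np.array([326, 326, 327, 334, 335], dtype=float)

# Fit on the log of the price, then exponentiate predictions back to dollars.
pipeline.fit(X_train, np.log(y_train))
predicted_price = np.exp(pipeline.predict(X_train))
print(predicted_price.round(2))
```

Five rows are far too few to train a meaningful model; they are used here only to show how the steps chain together.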

The script takes the following arguments:

data: a tab-delimited file with input data, either for model building or predictions

model_output_path: where to save the serialized model object file

model_input_path: serialized model object to use for predictions

output_file: where to save the model predictions file

mode: either train a new model or predict using an existing model

tree_model: if True, use the random forest model; if not specified, use the linear regression model

The above script can both train and predict price responses, based on the arguments provided. If training, either random forest regression or linear regression can be used to generate a serialized model output. If predicting, an input of either a serialized random forest or linear regression model is needed to generate a text file of the price responses.
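
A command-line interface with these arguments could be sketched with argparse as follows. The argument names come from the description above, but the leading dashes, the "train" and "predict" choices, the required flags and the file names in the example are assumptions rather than the script's actual interface.

```python
import argparse


def build_parser():
    """Build a parser for the arguments described above (flag spellings are assumed)."""
    parser = argparse.ArgumentParser(description="Train or apply a diamond price model.")
    parser.add_argument("--data", required=True,
                        help="tab-delimited input file for training or prediction")
    parser.add_argument("--mode", required=True, choices=["train", "predict"],
                        help="train a new model or predict with an existing one")
    parser.add_argument("--model_output_path",
                        help="where to save the serialized model (training)")
    parser.add_argument("--model_input_path",
                        help="serialized model to load (prediction)")
    parser.add_argument("--output_file",
                        help="where to save the predictions (prediction)")
    parser.add_argument("--tree_model", action="store_true",
                        help="use random forest instead of linear regression")
    return parser


# Hypothetical invocation: train a random forest and save it to model.pkl.
args = build_parser().parse_args(
    ["--data", "diamonds.tsv", "--mode", "train",
     "--model_output_path", "model.pkl", "--tree_model"]
)
print(args.mode, args.tree_model, args.model_output_path)
```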

When training, a PNG file of the predicted responses vs. the true responses is generated, as well as statistical metrics for model scoring.

Model Scoring

Mean absolute error (MAE), root mean squared error (RMSE), and R-squared (R2) are three of the most common metrics used to measure accuracy for continuous variables. MAE and RMSE are particularly useful, and depending on the model, the differences between them can be subtle or obvious.

Both MAE and RMSE measure the average magnitude of the errors in a set of predictions. MAE is the average of the absolute differences between predictions and true responses (where all individual differences have equal weight), while RMSE is the square root of the average of the squared differences between predictions and true responses. Both metrics range from 0 to ∞ and don't account for the direction of errors. MAE and RMSE are both negatively-oriented, meaning smaller values are better.

However, since errors are squared before they are averaged, RMSE gives a relatively high weight to large errors. This makes RMSE more informative for problems where large errors are particularly undesirable. Because RMSE reflects both the average error and the spread of the errors, MAE is usually the easier of the two to interpret.

R2 is related to RMSE, in that it is calculated as 1 minus the ratio of the MSE (average of the squares of the residuals) to the variance of the Y values. R2 represents the proportion of the variance for a dependent variable that is explained by independent variable(s) in a regression model. R2 is positively-oriented, meaning larger values are better, with a maximum of 1.
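
The three metrics can be written out directly from their standard definitions, with R2 computed as 1 minus the residual sum of squares divided by the total sum of squares. The sketch below uses made-up numbers, not the project's predictions, to show how one large error widens the gap between RMSE and MAE.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between predictions and true values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up example: three errors of 10 and one larger error of 40.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 360.0]

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
print(mae, round(rmse, 2), round(r2, 3))
```

In this example the single error of 40 pulls the RMSE above the MAE, which is the same effect discussed for the models below.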

Random Forest

Three scoring metrics for the random forest model are given below, along with the predicted responses vs. the true responses for the training and test data, as well as the importance of each feature in the model:

    mean absolute error =  277.11
    root mean squared error =  509.31
    R squared =  0.98

The mean price from this dataset was $3,615.93. The MAE is $277.11 while the RMSE is $509.31. Looking at Figure 5, it is clear why the RMSE is so much higher than the MAE: as the predicted prices get higher, the errors get larger. Since RMSE penalizes large errors, we expect it to be noticeably greater than the MAE for this model.

While the errors are large, the overall fit of the model is very good. The highest score for R2 is 1.0, so an R2 of 0.98 is excellent. However, the MAE is roughly 8% of the mean price, and the magnitude of the MAE and RMSE means that, although the fit is excellent, the model is subpar.

Figure 5: Predicted price vs. true price for random forest regression model

Table 7 shows the feature importance for each feature in the model. A random forest consists of a number of decision trees, where every node is a condition on a single feature, designed to split the dataset so that similar response values end up in the same set. When training a regression tree, each split on a feature therefore reduces the variance of the response within the resulting sets. For a forest, the variance decrease from each feature can be averaged across the trees, and the features are ranked according to how much they decrease the variance.

Table 7: Feature importance for random forest regression model

| feature | feature_importance |
|---|---|
| log_volume | 0.951494 |
| clarity | 0.0322812 |
| color | 0.0147471 |
| cut | 0.00106129 |
| table | 0.000416271 |

As shown in Table 7, the logarithm of the volume had the highest effect on the variance in training. It would seem that this is the most important feature; however, feature importance is not that simple. Correlations in the data can lead to the incorrect conclusion that one of the variables is a strong predictor while others are unimportant. There are ways to tease this out, but they are outside of the scope of this project.
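
The variance-based ranking can be illustrated on synthetic data with scikit-learn's RandomForestRegressor, which exposes the averaged importances as feature_importances_. The data, feature names and hyperparameters below are made up for the illustration and are not the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_samples = 300
X = rng.normal(size=(n_samples, 3))

# Synthetic response: dominated by the first feature, weakly affected by the
# second, and independent of the third.
y = 3.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# Rank features by their importance, highest first.
feature_names = ["strong", "weak", "noise"]
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(name, round(float(importance), 4))
```

The importances are normalized to sum to 1, which is also true of the values in Table 7.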

Linear Regression

Three scoring metrics for the linear regression model are given below, along with the predicted responses vs. the true responses for the training and test data, as well as the coefficient of each feature in the model:

    mean absolute error =  417.66
    root mean squared error =  779.50
    R squared =  0.95

The mean price from this dataset was $3,615.93. The MAE is $417.66 while the RMSE is $779.50. This model suffers from the same problems as the random forest model, since as the predicted prices get higher, the errors get larger. Recalling that RMSE penalizes for large errors, we expect it to be greater than the MAE for this model. Comparing Figure 5 to Figure 6, we also expect the RMSE and MAE to be higher for the linear regression model than for the random forest model.

While the errors are large, the R2 of 0.95 is very good, though not as good as the random forest model's 0.98. However, the magnitude of the MAE and RMSE means that, although the fit is good, the model is even more subpar than the random forest model.

Figure 6: Predicted price vs. true price for linear regression model

Table 8 shows the coefficients for each feature. The magnitude of these coefficients cannot be interpreted in the same way as the feature importances for the random forest model. Unless the features are standardized before training, the magnitudes of the coefficients are not comparable across features, because each one depends on the scale of its feature.

Table 8: Coefficients for linear regression model

| feature | value |
|---|---|
| log_volume | 1.88635 |
| clarity | 0.119566 |
| color | 0.0766382 |
| cut | 0.0195203 |
| table | 0.0039767 |

Comparing the above metrics to the random forest model, it is apparent that the fit of the random forest model is superior. However, the complexity of a random forest model means it is more computationally expensive than a linear regression model. Therefore, it is not always the better choice.
