I downloaded a dataset from Kaggle that contains information about California housing prices. The dataset has nine columns, eight of which are numerical and one of which is categorical. The categorical column, “ocean_proximity”, indicates the location of each house relative to the ocean. The target variable I want to predict with my model is the median house value (“median_house_value”); “ocean_proximity” is a categorical feature that needs to be encoded before modeling.
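As a minimal sketch, the data can be loaded and inspected with pandas (the filename “housing.csv” is an assumption; use whatever name the downloaded file has):

```python
import pandas as pd

# Load the Kaggle California housing data (filename is an assumption)
housing = pd.read_csv("housing.csv")

# Inspect the columns: numerical features plus the categorical "ocean_proximity"
housing.info()
print(housing["ocean_proximity"].value_counts())
```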
To prepare the data for modeling, I used the pandas function “get_dummies()” to convert the “ocean_proximity” column into four binary columns, one for each category. This way, models that expect numerical inputs can still make use of the categorical information.
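The encoding step looks roughly like this (a sketch, continuing with the “housing” DataFrame from above):

```python
import pandas as pd

# One-hot encode "ocean_proximity" into binary indicator columns,
# then replace the original categorical column with them
dummies = pd.get_dummies(housing["ocean_proximity"])
housing = pd.concat([housing.drop(columns="ocean_proximity"), dummies], axis=1)
```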
I first fit a linear regression model as a baseline and evaluated its performance. I used the coefficient of determination (R-squared) as the metric to measure how well the model explains the variance in the target variable. The linear regression model gave me an R-squared score of 0.65, which means that it accounts for 65% of the variation in median house value.
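A sketch of this baseline, assuming “median_house_value” is the target column; the 80/20 split and the handling of missing values are my assumptions:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Drop rows with missing values (a simplifying assumption)
data = housing.dropna()

# Separate features and target
X = data.drop(columns="median_house_value")
y = data["median_house_value"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the baseline model and report R-squared on held-out data
lin_reg = LinearRegression().fit(X_train, y_train)
print(lin_reg.score(X_test, y_test))
```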
To improve the model performance, I decided to use a random forest model, which is an ensemble method that combines multiple decision trees. Random forests are often more robust and accurate than linear models, especially for complex, non-linear relationships, because each tree can capture non-linear structure and averaging across trees reduces variance. The random forest model gave me an R-squared score of 0.814, a substantial improvement over the linear regression model.
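Continuing from the split above, a minimal version of this step (the settings shown are illustrative defaults, not necessarily the exact ones I used):

```python
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest on the same training data;
# random_state makes the run reproducible
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# R-squared on the test set
print(forest.score(X_test, y_test))
```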
However, I wanted to see if I could further optimize the random forest model by tuning its hyperparameters, such as the number of trees, the maximum depth of each tree, and the minimum number of samples required to split a node. To do this, I used GridSearchCV from sklearn.model_selection, which performs an exhaustive search over a specified grid of values for each hyperparameter and returns the combination that maximizes a given scoring function. I used R-squared as the scoring function, and after several trials the best combination gave me an R-squared score of 0.816, slightly better than the untuned random forest.
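A sketch of the search (the candidate values in the grid are illustrative; the exact ranges I tried may differ):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Candidate values for each hyperparameter (illustrative, not the exact grid)
param_grid = {
    "n_estimators": [100, 200, 300],      # number of trees
    "max_depth": [None, 10, 20],          # maximum depth of each tree
    "min_samples_split": [2, 5, 10],      # minimum samples to split a node
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="r2",  # R-squared as the scoring function
    cv=5,
)
grid_search.fit(X_train, y_train)

# Best combination found, and its performance on the held-out test set
print(grid_search.best_params_)
print(grid_search.best_estimator_.score(X_test, y_test))
```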
In conclusion, I was able to build a predictive model for California housing prices using a dataset from Kaggle. I encoded the categorical “ocean_proximity” feature into binary variables using “get_dummies()”, and then applied linear regression and random forest models to fit the data. I also used GridSearchCV to fine-tune the random forest model and achieve a higher R-squared score. The final model explains 81.6% of the variation in median house value.
I used the following libraries and modules as the tools for data analysis and modeling:
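Reconstructed from the steps above (the exact import list is an assumption):

```python
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
```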
For further information, see the project repository: https://github.com/Star-cj/california_house_price_model