Decoding the Grapes: Predictive Analysis of Wine Quality Using Machine Learning

About this project

"Decoding the Grapes: Predictive Analysis of Wine Quality Using Machine Learning" is a project that showcases my ability to utilize advanced data analytics and machine learning techniques to derive meaningful insights from complex datasets. Using a dataset from the UCI Machine Learning Repository, I applied various data cleaning and preprocessing techniques, including handling duplicate entries, missing values, and class imbalance using Synthetic Minority Oversampling Technique (SMOTE).

In the exploratory data analysis phase, I used data visualization tools to understand the distribution of wine quality and the correlation between different variables. I also identified key variables that significantly influence wine quality.

I then built and evaluated three machine learning models - Decision Tree, Random Forest, and Logistic Regression - to predict wine quality. This process involved tuning model parameters, assessing model performance using metrics like accuracy, precision, recall, F1 score, and AUC value, and interpreting the results to draw meaningful conclusions.

The project demonstrates my proficiency in data analysis, machine learning, data visualization, and statistical analysis, providing valuable insights that can guide decision-making in the wine industry.

Abstract

This study, titled "Decoding the Grapes: Predictive Analysis of Wine Quality Using Machine Learning," uses data from the UCI Machine Learning Repository to predict wine quality based on various chemical composition parameters. The dataset, comprising both red and white wine samples, was thoroughly cleaned and analyzed. Key variables such as alcohol content, density, chlorides, residual sugar, and total sulfur dioxide were identified as significant contributors to wine quality.

Three machine learning models - Decision Tree, Random Forest, and Logistic Regression - were employed to predict wine quality. The Random Forest model emerged as the most effective, providing the highest accuracy, precision, recall, and F1 scores. The study's findings offer valuable insights for stakeholders in the wine industry, enabling them to focus on the critical variables during the wine production process to enhance the quality of their products.

Data Cleaning

First, we looked for duplicate entries and found 240 duplicates in the red wine dataset and 937 in the white wine dataset. After removing these, we had 1,359 red wine entries and 3,961 white wine entries, totaling 5,320 unique rows. We also confirmed that there were no missing values in either dataset.
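
A minimal R sketch of this step (file names and object names are assumptions, not our exact script):

# Load the two UCI wine-quality files (semicolon-delimited in the original repository)
red   <- read.csv("winequality-red.csv", sep = ";")
white <- read.csv("winequality-white.csv", sep = ";")

# Count and drop exact duplicate rows
sum(duplicated(red))                 # 240 duplicates in our run
sum(duplicated(white))               # 937 duplicates in our run
red   <- red[!duplicated(red), ]     # 1,359 unique red wines remain
white <- white[!duplicated(white), ] # 3,961 unique white wines remain

# Confirm there are no missing values in either dataset
sum(is.na(red))
sum(is.na(white))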

Finally, we added a column called 'quality_level' as a categorical variable to classify wine quality as either 'bad' or 'good'. The classification was based on wine quality scores: wines scoring below 6.5 were labeled 'bad,' while those scoring 6.5 or higher were labeled 'good'. This new variable will help us better understand the factors that influence wine quality.
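
Continuing the hypothetical data frames above, the new column is a simple threshold on the quality score:

# Wines scoring below 6.5 are labeled 'bad'; 6.5 or higher are labeled 'good'
red$quality_level   <- factor(ifelse(red$quality >= 6.5, "good", "bad"))
white$quality_level <- factor(ifelse(white$quality >= 6.5, "good", "bad"))
table(red$quality_level)
table(white$quality_level)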

Exploratory Data Analysis

From the two plots, we can observe that the quality distributions of both red and white wine are close to normal. Most red wine quality scores fall between 4.5 and 5, while most white wine scores fall between 5.5 and 6.
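
A ggplot2 sketch of this kind of distribution plot (object and column names follow the hypothetical code above):

library(ggplot2)

# Quality-score distribution for red wine; the same call on `white` produces the second plot
ggplot(red, aes(x = quality)) +
  geom_histogram(binwidth = 1, fill = "darkred", colour = "white") +
  labs(title = "Red wine quality distribution", x = "Quality score", y = "Count")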

The histogram indicates that bad wines far outnumber good wines for both red and white wine, so we need to balance the classes to keep our models from simply favoring the majority class.

The figure above allows us to identify outliers in each column for both red and white wines. Given that these outliers fall within acceptable boundaries, we chose not to modify them. Additionally, the box plots reveal variations in the ranges and average values across certain columns for both types of wine.

Model Building and Analysis

Comparing the two correlation plots of red and white wine, we found that although there are some similarities (for example, alcohol is positively correlated with quality in both), the correlation structure differs noticeably between the two datasets. This indicates that combining the red and white wine data into a single predictive model would not be appropriate, so we built separate models for each.
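
A sketch of this comparison (the corrplot package is an assumption; any correlation heatmap works):

# Correlation matrices computed separately for red and white wine
red_cor   <- cor(red[ , sapply(red, is.numeric)])
white_cor <- cor(white[ , sapply(white, is.numeric)])

library(corrplot)
corrplot(red_cor,   method = "color")   # red wine correlation heatmap
corrplot(white_cor, method = "color")   # white wine correlation heatmap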

We employed three models to predict the quality of both red and white wine. Based on the earlier visualizations, the dataset is skewed: bad wines greatly outnumber good ones. To address this, we used the Synthetic Minority Oversampling Technique (SMOTE) to oversample the minority class. SMOTE generates synthetic samples from existing minority-class observations, which balances the class distribution and reduces bias toward the majority class. This helps the models generalize better, reveals patterns in the minority class that would otherwise be neglected, and ultimately produces more reliable predictions.
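
A sketch of the oversampling step, shown for red wine (the smotefamily package is an assumption; DMwR::SMOTE and themis::smote offer similar functionality):

library(smotefamily)

# Oversample the minority ('good') class using the numeric predictors only
predictors <- red[ , !(names(red) %in% c("quality", "quality_level"))]
sm <- SMOTE(X = predictors, target = red$quality_level, K = 5)

red_balanced <- sm$data              # original plus synthetic rows; labels are in the 'class' column
table(red_balanced$class)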

White Wine Decision Tree Model

We developed a decision tree model to assess the quality of white wines using factors such as alcohol content, residual sugar, pH, and density. The training set for this model consists of 4,489 wine samples. The analysis shows that the most important factor in determining wine quality is alcohol content: wines with an alcohol content of 10.8% or higher are more likely to be classified as good quality wines.

Furthermore, residual sugar and pH also contribute to the quality assessment of wines with higher alcohol content. Wines with a residual sugar level of at least 1.2 g/L and a pH value of 3.05 or higher tend to be classified as good quality wines. In addition, for wines with a pH value below 3.05, density becomes an important factor in determining quality.
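
A sketch of the tree fit, assuming white_train is a hypothetical training split containing the physicochemical predictors plus quality_level, and rpart as the tree implementation:

library(rpart)
library(rpart.plot)

set.seed(42)
white_tree <- rpart(quality_level ~ ., data = white_train, method = "class")
rpart.plot(white_tree)               # in our results, the first split is on alcohol (>= 10.8%)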

Red Wine Decision Tree Model

Our team also built a decision tree model to evaluate the quality of red wines using the same set of predictors as the white wine model. Here, too, alcohol content is the most important factor: wines with an alcohol content of 10.8% or higher are more likely to be classified as good quality wines. The variable importance table further shows that alcohol, sulphates, and volatile acidity are the strongest predictors of red wine quality.
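
The variable importance table can be read directly off the fitted rpart object (red_tree and red_train are hypothetical, built the same way as the white wine tree above):

red_tree <- rpart(quality_level ~ ., data = red_train, method = "class")
sort(red_tree$variable.importance, decreasing = TRUE)   # alcohol, sulphates, volatile.acidity lead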

Random Forest Models

We developed random forest models to predict wine quality levels. Each model consists of 100 decision trees, and we used them to make predictions on the test datasets.
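
A sketch of the random forest fit, shown for white wine (object names such as white_train and white_test are assumptions):

library(randomForest)

set.seed(42)
rf_white <- randomForest(quality_level ~ ., data = white_train,
                         ntree = 100, importance = TRUE)

# Predictions for the held-out test set
rf_pred <- predict(rf_white, newdata = white_test)

# Top 5 variables ranked by mean decrease in accuracy (type = 1)
varImpPlot(rf_white, type = 1, n.var = 5)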

The above plot indicates the top 5 most important variables for predicting wine quality levels in the random forest model. These variables are ranked based on the mean decrease in accuracy when each variable is excluded from the model. The higher the decrease in accuracy, the more important the variable is considered to be in the model.

The plot shows that the top 5 most important variables for predicting white wine quality levels are alcohol, density, chlorides, residual sugar, and total sulfur dioxide, while the top 5 for red wine are alcohol, sulphates, volatile acidity, total sulfur dioxide, and density. These variables have the greatest impact on the accuracy of the model, and we recommend prioritizing them in further analysis or model improvement efforts.

Logistic Regression Models

We also fit a logistic regression model to each dataset using all available predictor variables. The coefficient p-values indicate that fixed acidity, volatile acidity, residual sugar, chlorides, free sulfur dioxide, density, pH, sulphates, and alcohol are significant predictors of wine quality, while citric acid and total sulfur dioxide are not significant.
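
A sketch of the logistic regression fit (white wine shown; the red wine model is identical in form, and white_train is the hypothetical training split from above):

# quality_level is a two-level factor, so family = binomial models P(quality_level = 'good')
logit_white <- glm(quality_level ~ ., data = white_train, family = binomial)
summary(logit_white)                 # coefficient p-values identify the significant predictors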

The logistic regression model's performance in predicting wine quality levels is summarized by its AUC (Area Under the Curve) of 0.83. AUC values range from 0 to 1: an AUC of 0.5 indicates a model no better than random guessing, while an AUC of 1 indicates perfect prediction. Both the white wine and red wine models achieve an AUC of 0.83, indicating good predictive power; the model ranks a randomly chosen good wine above a randomly chosen bad wine 83% of the time. These findings suggest that logistic regression is effective at differentiating between good and bad wines.
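
A sketch of the ROC/AUC computation using the pROC package (an assumed but common choice), reusing the hypothetical logit_white and white_test objects from above:

library(pROC)

probs   <- predict(logit_white, newdata = white_test, type = "response")
roc_obj <- roc(response = white_test$quality_level, predictor = probs)

auc(roc_obj)                         # approximately 0.83 for both wine types in our results
plot(roc_obj)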

Models Results and Analysis

When evaluating a model's performance, it is crucial to consider the specific context, the required level of accuracy, and the trade-offs between various performance metrics. In the case of predicting wine quality, high accuracy, precision, recall, and F1 score are desirable. Additionally, a model with an AUC value closer to 1 would be preferred, as it indicates better performance in differentiating between classes.

It is important to strike a balance between precision and recall to avoid favoring one metric over the other. For our wine quality prediction project, we might consider a model with an Accuracy score above 0.7 and an AUC value greater than 0.8 as acceptable for our purposes.

However, in a skewed dataset, the F1 score might be lower due to the imbalance in class distribution. In such cases, accuracy might appear higher because the model is better at predicting the majority class, while the precision and recall values for the minority class suffer. This results in a lower F1 score.

In the context of our wine quality prediction project, the dataset is skewed toward bad wines, which means that the model may struggle to accurately predict good wines. As a result, the F1 score might be lower than desired. In such cases, it could be more appropriate to consider the accuracy of the model in addition to other metrics like precision, recall, and F1 score to better understand the overall performance of the model.
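
For reference, these metrics follow directly from the test-set confusion matrix, treating 'good' as the positive class (a sketch reusing the hypothetical rf_pred and white_test objects from the random forest code above; caret::confusionMatrix reports the same quantities):

cm <- table(predicted = rf_pred, actual = white_test$quality_level)

tp <- cm["good", "good"]; fp <- cm["good", "bad"]
fn <- cm["bad",  "good"]; tn <- cm["bad",  "bad"]

accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)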

Conclusions

After carefully analyzing the wine dataset, our team has gained valuable insights into the factors that contribute to high-quality wines. By developing and evaluating multiple models for predicting wine quality levels, we have determined that the Random Forest model delivers the best performance in terms of accuracy, precision, recall, and F1 scores.

For red wines, the most critical factors affecting quality are alcohol, sulphates, volatile acidity, total sulfur dioxide, and density, while for white wines the key variables are alcohol, density, chlorides, residual sugar, and total sulfur dioxide. By focusing on these variables during the production process, winemakers can increase the likelihood of creating top-quality wines of either type.

Recommendations

Our team has thoroughly examined the wine dataset and evaluated multiple models to predict wine quality levels. Based on our findings, we would like to provide the following recommendations for the stakeholders:

  1. Adopt the Random Forest model for predicting wine quality levels, as it consistently outperforms the Decision Tree and Logistic Regression models in terms of accuracy, precision, recall, and F1 scores. Additionally, the Random Forest model is less susceptible to overfitting, ensuring better generalization for future predictions.

  2. Pay close attention to the critical variables identified by the Random Forest model when producing both red and white wines.

  3. Continue to use the Synthetic Minority Oversampling Technique (SMOTE) to maintain a balanced class distribution in the dataset, which will contribute to improved model performance, reduced overfitting, and a better understanding of underlying patterns and relationships in the data.

Reference

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547–553. https://doi.org/10.1016/j.dss.2009.05.016

Debjeetdas. (2023). Red Wine Quality - EDA and classification. Kaggle. Retrieved April 23, 2023, from https://www.kaggle.com/code/debjeetdas/red-wine-quality-eda-and-classification/notebook

ggplot2. (n.d.). Create elegant data visualizations using the grammar of graphics. Retrieved April 23, 2023, from https://ggplot2.tidyverse.org/

This project is the result of a collaborative effort by a dedicated team of data analysts. I had the privilege of working alongside Ritulkumar Patel, Xiaoting Zheng, and Pengxiang Zhang. Each member brought unique skills and perspectives to the table, contributing significantly to the success of the project. Their insights, expertise, and commitment were invaluable, and this project would not have been possible without their collaborative efforts.
