AquaAudit: Predictive Analysis of Coastal Plastic Waste

About this project

In "AquaAudit: Predictive Analysis of Coastal Plastic Waste", I combine R, Python, and Tableau to examine the pressing issue of multilayer plastic waste pollution in the coastal regions of the U.S.

The data for this study was provided by the 5 Gyres Institute, an organization focused on reducing ocean pollution. The workflow begins with data cleaning and preparation in R, covering data manipulation, quality control, and transformation. Python is then used to build and refine a predictive model based on the Random Forest algorithm, with the Synthetic Minority Over-sampling Technique (SMOTE) applied to improve recall on the underrepresented class. This demonstrates proficiency in machine learning, predictive analytics, and imbalanced data handling. Tableau dashboards and reports then communicate the findings visually.

The insights gained through this comprehensive data science workflow guide a spectrum of recommendations. These span from policy adaptations and technological advancements to changes in business practices and consumer behavior, demonstrating strategic thinking and problem-solving abilities.

The project not only unearths the multifaceted problem of plastic pollution but also demonstrates a diverse skill set, a comprehensive approach, and a commitment to environmentally sustainable solutions.

ABSTRACT

Using a data-driven methodology, this research aims to better understand the multilayer plastic waste pollution problem in coastal regions of the United States, with a view to devising effective strategies for its reduction. The research uses data provided by the 5 Gyres Institute, an organization dedicated to combating pollution in the world's oceans. To gauge the scale and nature of the problem, an exploratory data analysis is conducted, highlighting the most significant contributors to plastic pollution and the roles they play.

In the Predictive Analytics section, we implemented a Random Forest model to predict layer types of waste materials. The model achieved an overall accuracy of 0.81. However, it struggled with 'multi-layer' items, having a recall of 0.51. After employing the Synthetic Minority Over-sampling Technique (SMOTE), the recall for 'multi-layer' items improved to 0.65, despite a slight decrease in overall accuracy to 0.80. Feature importance analysis indicated 'year' and 'product type' as the most influential features for predicting layer types.

Our findings underline the significance of the multilayer plastic waste issue and stress the urgent need for mitigation strategies. We advocate for concerted efforts in policy changes, technological advancements, business practice reforms, and consumer behavior modifications to tackle this environmental challenge effectively and sustainably.

EXPLORATORY DATA ANALYSIS

Our analysis of the dataset revealed some interesting findings. By tallying the unique event IDs, we learned that 48 distinct organizations led a total of 93 coastal community cleanup events across the United States. To gain more substantial insights, we segmented the data by selected variables and explored each segment in turn.

The analysis reveals that a predominant share of 21,821 instances falls under the 'unbranded' category. This suggests a considerable segment of plastic pollution originates from products without identifiable brand or parent company affiliations. Upon sifting the original dataset to include only entries where the 'parent_company_name' is 'unbranded', we engaged in an illustrative word cloud analysis. This analysis considered both the 'item_description' and 'total_counts' columns to bring clarity to the composition of this 'unbranded' category.
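The weighting behind a word cloud of this kind can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names 'parent_company_name', 'item_description', and 'total_counts' come from the text above, while the helper weighted_description_counts and the sample rows are hypothetical.

```python
from collections import Counter

def weighted_description_counts(rows, company="unbranded"):
    """Sum total_counts per item_description for one parent company."""
    weights = Counter()
    for row in rows:
        # Keep only rows for the requested parent company (case-insensitive).
        if row["parent_company_name"].strip().lower() != company:
            continue
        description = row["item_description"].strip().lower()
        weights[description] += int(row["total_counts"])
    return weights

# Hypothetical sample rows, for illustration only (not project data).
sample_rows = [
    {"parent_company_name": "unbranded", "item_description": "Cigarette butts", "total_counts": 40},
    {"parent_company_name": "Unbranded", "item_description": "bottle caps", "total_counts": 25},
    {"parent_company_name": "unbranded", "item_description": "cigarette butts", "total_counts": 10},
    {"parent_company_name": "Pepsico", "item_description": "bottle", "total_counts": 5},
]

weights = weighted_description_counts(sample_rows)
for description, weight in weights.most_common(2):
    print(description, weight)
```

Each description's summed count would then set the size of its term in the word cloud, so an item recorded many times dominates the picture even if it appears in only a few rows.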

Figure 1: Word cloud of item descriptions for unbranded parent companies

The outcome highlighted 'cigarette butts' and 'bottle caps' as the terms that recurred with the highest frequency in the item description where the parent company's name remained unidentified.

After 'unbranded', we see a significant drop in the count to the next highest contributors, which are 'Pepsico' and 'Nestlé', with 926 and 801 instances, respectively. However, it's important to note that the counts do not necessarily reflect the total amount of plastic pollution these companies are responsible for. The counts represent instances of pollution, not volume, and different products may contribute different amounts of plastic waste. For example, a single instance of a small plastic wrapper would be counted the same as a larger plastic bottle, even though the latter contributes more to total plastic volume.

Figure 2: Community Clean-up Events Dashboard

Milwaukee has the most community clean-up events, with 10. This could be due to a strong environmental awareness in the community, or a significant local pollution problem that has mobilized citizens. Los Angeles, a much larger city, comes in second with 8 events. This suggests a lower event-per-capita ratio than Milwaukee.

California (CA) leads by a significant margin with 47 cleanup events. This is not surprising, given that three of the top participating cities (Los Angeles, San Francisco, Long Beach) are in California. California's known focus on environmental issues, as well as its large population, may also be contributing factors.

The majority of these clean-up events occur in the third quarter, coinciding with the summer months when coastal visitation peaks.

It is important to note that raw counts of cleanup events may not provide a complete picture of community engagement. Factors like population size, geographic size, and the scale and impact of each cleanup event can all affect the interpretation of this data.

Figure 3: Bubble Chart Showing Most Frequent Word in Item Descriptions

The analysis of Figure 3 brings to light an alarming issue of pollution in our oceans. The terms "bottle caps" and "cigarette butts" emerged as the most frequently mentioned items in the descriptions of trash found during coastal community clean-up initiatives. This suggests a prevailing problem with single-use plastic and non-biodegradable items polluting our water bodies.

Bottle caps, generally composed of plastic materials, are non-biodegradable and often discarded carelessly after use. Similarly, cigarette butts, which are often misconstrued as harmless and biodegradable, actually contain plastic filters and harmful toxins. When these items make their way to our oceans, they contribute significantly to the plastic pollution problem.

The frequent occurrence of these items in clean-up records indicates that despite recycling and waste management efforts, a considerable amount of these items still end up in our oceans. This highlights the need for more stringent controls and better waste management practices to prevent these materials from entering our water bodies.

Figure 4: Dashboard of Plastic Waste Distribution

Figure 4 shifts focus to plastic waste sources, revealing that a significant portion is attributed to multi-layer plastics - often used in food and beverage packaging and notoriously difficult to recycle. The data show that this type of waste mainly falls within the "other" category of materials, indicating potential disposal issues.

Interestingly, the highest incidence of multi-layer plastic waste collection occurred in the third quarter of 2022, possibly reflecting the higher number of clean-up events during this period. At the city level, Milwaukee registered the highest volume of multi-layer plastic waste.

In the comparison of California and Wisconsin, California recorded a higher volume of multi-layer plastic waste, while Wisconsin had a greater percentage of this type of waste in its overall waste stream. This might imply California's more efficient collection of multi-layer plastic waste or a larger presence of this waste type in Wisconsin's total waste.

These dashboards offer essential insights into plastic pollution sources and their distribution across the United States, underscoring the need for better waste management strategies and advancements in multi-layer plastic recycling technologies.

PREDICTIVE ANALYSIS

Figure 5: Correlation plot of layer with selected variables

Figure 5 shows the association of various variables with the 'layer' variable in the dataset, as calculated by Cramér's V, a measure of association between two nominal variables that ranges from 0 (no association) to 1 (complete association).

Year - Layer Correlation (0.564264): This is the strongest correlation in the dataset, suggesting a significant association between the year and the layer. This could potentially mean that the layer of material used in products changes over the years, possibly due to changes in manufacturing technology, material availability, regulations, or consumer preferences.

Type_product - Layer Correlation (0.429576): This indicates a moderate correlation, suggesting that the type of product is somewhat associated with the layer. This might mean that certain types of products tend to use certain types of layers more than others.

Type_material - Layer Correlation (0.265813): This indicates a weak correlation, suggesting that the type of material is less strongly associated with the layer. This could mean that a variety of materials are used across different layers, and the choice of material does not necessarily determine the layer.

City - Layer Correlation (0.558221): This shows a strong correlation, close to the correlation between year and layer. It suggests that the layer might vary by city, possibly due to local manufacturing practices, regulations, or consumer preferences.

Province - Layer Correlation (0.390463): This shows a moderate correlation, indicating that there is some association between the province and the layer. As with the city-layer correlation, this could be due to regional differences in manufacturing, regulations, or consumer preferences.
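For reference, Cramér's V can be computed from the chi-squared statistic of the contingency table of two nominal variables. The sketch below is a minimal, uncorrected version written for illustration; the function name cramers_v and the toy inputs are hypothetical, and the source does not state which variant (with or without bias correction) produced the values above.

```python
import math
from collections import Counter

def cramers_v(xs, ys):
    """Cramér's V between two equally long sequences of nominal values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_totals = Counter(xs)
    y_totals = Counter(ys)
    min_dim = min(len(x_totals), len(y_totals)) - 1
    if n == 0 or min_dim == 0:
        return 0.0  # not defined when a variable has a single category
    chi2 = 0.0
    for x, x_count in x_totals.items():
        for y, y_count in y_totals.items():
            expected = x_count * y_count / n  # count expected under independence
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * min_dim))

# Toy inputs (not project data): 'layer' fully determined by 'year'.
years = ["2021", "2021", "2022", "2022"]
layers = ["single-layer", "single-layer", "multi-layer", "multi-layer"]
print(cramers_v(years, layers))
```

Because the statistic is symmetric and unsigned, it indicates how strongly two variables are associated but not the direction of the relationship or which variable drives the other.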

Random Forest Classifier

We used the Random Forest Classifier with the variables selected above, given its ability to handle high-dimensional data. Random Forest builds a collection of decision trees, each trained on a different random sample of the data, and combines their predictions. This reduces the risk of overfitting, a common issue with individual decision trees, and so improves the model's ability to generalize to unseen data.
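The idea of training many simple models on resampled data and letting them vote can be sketched in plain Python. This is a deliberately reduced illustration, not the classifier used in the project: one-level "stumps" stand in for full decision trees, and the function names and toy rows are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """A one-level 'tree': the majority label for each value of one feature."""
    counts_by_value = {}
    for row, label in zip(rows, labels):
        counts_by_value.setdefault(row[feature], Counter())[label] += 1
    return {value: counts.most_common(1)[0][0]
            for value, counts in counts_by_value.items()}

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Fit each stump on a bootstrap sample, using one randomly chosen feature."""
    rng = random.Random(seed)
    features = sorted(rows[0])
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.choice(features)
        stump = fit_stump([rows[i] for i in picks],
                          [labels[i] for i in picks], feature)
        forest.append((feature, stump))
    return forest

def predict(forest, row):
    """Majority vote over the stumps that saw this row's feature value."""
    votes = Counter()
    for feature, stump in forest:
        if row[feature] in stump:
            votes[stump[row[feature]]] += 1
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical toy rows (not project data).
rows = ([{"type_product": "food packaging", "year": "2021"}] * 4
        + [{"type_product": "smoking material", "year": "2022"}] * 4)
labels = ["multi-layer"] * 4 + ["single-layer"] * 4

forest = fit_forest(rows, labels)
print(predict(forest, {"type_product": "food packaging", "year": "2021"}))
```

A full Random Forest grows deeper trees and draws a random subset of features at every split, but the two ingredients shown here, bootstrap sampling and majority voting, are what reduce the variance of any single tree.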

The dataset was partitioned into a training set, which included 80% of the data, and a test set, comprising the remaining 20%. This split is standard practice as it strikes a balance, ensuring that the model has ample data to learn from while retaining a significant subset for performance evaluation.
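An 80/20 partition of this kind can be sketched as a shuffled index split. The helper below is hypothetical and assumes a simple random split with a fixed seed; the source does not say whether the split was stratified by class or which seed was used.

```python
import random

def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed for a reproducible split
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(100)
print(len(train_idx), len(test_idx))
```

With an imbalanced target such as 'layer', a stratified split that preserves the class proportions in both subsets is often preferred, so that the minority class is represented in the test set.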

After training the Random Forest Classifier on the training data, we used the model to predict outcomes on the test data. To assess its performance, we examined precision, recall, and F1-score for each class. The model reached an overall accuracy of 0.81, although accuracy alone can mask weak performance on individual classes.

Figure 6: Bar chart of the top 10 influential features and corresponding model performance metrics in a random forest classifier for predicting layer type

From Figure 6, we can see that the model does a relatively good job of classifying 'unsure' and 'single-layer' items (particularly in terms of recall for 'single-layer' and both precision and recall for 'unsure'), but it struggles more with 'multi-layer' items (with a recall of 0.51). This indicates that the model has more difficulties distinguishing 'multi-layer' items from the other classes.

The performance of the model for predicting the 'multi-layer' class is relatively low compared to the other classes. The model has a precision of 0.81 for 'multi-layer', which means that when it predicts an item to be 'multi-layer', it is correct 81% of the time. However, the recall is only 0.51, meaning that it only correctly identifies 51% of all true 'multi-layer' instances.
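The two metrics can be made concrete with a small sketch that counts true positives, false positives, and false negatives for one class. The function name and the toy labels are hypothetical and chosen only to reproduce the pattern described above (high precision, low recall); they are not the project's predictions.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, computed from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels (not project data): the model finds only half of the
# true 'multi-layer' items but never mislabels another class as 'multi-layer'.
y_true = ["multi-layer"] * 4 + ["single-layer"] * 4
y_pred = ["multi-layer"] * 2 + ["single-layer"] * 6

precision, recall = precision_recall(y_true, y_pred, "multi-layer")
print(precision, recall)
```

In this toy case every 'multi-layer' prediction is correct, yet half of the true 'multi-layer' items are missed, which is the same kind of gap the model shows between its precision of 0.81 and recall of 0.51.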

Improve the Random Forest Classifier Model

To improve the model's ability to identify multi-layer plastic waste, we applied SMOTE to the training set after splitting the data into training and test sets. This produced a balanced training set with an equal number of instances for every class. We then trained the Random Forest Classifier on this balanced set.

Although this method could potentially enhance the model's ability to predict the 'multi-layer' class, which was initially underrepresented, it could come at the expense of reduced performance in predicting the other classes.
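The core of SMOTE, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration on numeric vectors, not the library implementation used in the project; the function name, parameters, and toy points are hypothetical, and the source does not describe how the categorical features were handled during oversampling.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points, by Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (nb - b) for b, nb in zip(base, neighbour)])
    return synthetic

# Toy minority-class points (not project data).
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority_points, n_new=3)
print(len(new_points))
```

Oversampling is applied to the training set only, so that the test set keeps the original class distribution and the reported metrics reflect performance on unaltered data.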

Figure 7: Bar chart of the top 10 influential features and corresponding model performance metrics in the SMOTE random forest classifier for predicting layer type

When comparing this with the prior model, we observe an improved ability in predicting the 'multi-layer' class. The recall for 'multi-layer' saw an increase from 0.51 to 0.65, signaling that a larger portion of 'multi-layer' instances within the data is now correctly identified by the model. This significant enhancement implies that the model is now less likely to overlook 'multi-layer' instances.

In contrast, the precision for 'multi-layer' decreased from 0.81 to 0.69. Precision is the ratio of true positives (correct 'multi-layer' predictions) to all positive predictions. This drop suggests that the model now makes more false positive errors, predicting items from other classes as 'multi-layer'.

The F1-Score for 'multi-layer', which provides a balance between precision and recall, also saw improvement, moving from 0.62 to 0.67. This indicates a better overall performance in predicting this class.
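As a quick check, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the values quoted above. The helper name is hypothetical, and the inputs are the rounded figures from the text.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

before = f1_score(0.81, 0.51)  # original model, 'multi-layer' class
after = f1_score(0.69, 0.65)   # SMOTE model, 'multi-layer' class
print(f"{before:.3f} {after:.3f}")
```

This gives roughly 0.626 and 0.669. The second value matches the reported 0.67; the first is slightly above the reported 0.62, presumably because the precision and recall quoted in the text are themselves rounded.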

It's crucial to understand that while the model's overall accuracy experienced a slight decrease, moving from 0.81 to 0.80, this is not necessarily an unfavorable outcome. This could imply that the model's predictions have become more balanced across different classes, showing particular improvement in predicting the 'multi-layer' class.

Feature Importance of the model:

When OneHotEncoder is applied to categorical features, each category within a feature is converted into its own binary feature. The model treats each binary feature independently, so the importance of a binary feature reflects the contribution of that particular category to the model's predictions. The bar charts in Figures 6 and 7 show the top 10 feature importances for the original and the refined model.
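The encoding step, and one optional way to read per-category importances at the level of the original feature, can be sketched as follows. The helpers one_hot and importance_by_feature, the "feature=value" column naming, and the importance values are all hypothetical; summing importances by parent feature is an illustrative extra, not something the project reports doing.

```python
def one_hot(rows, features):
    """Expand categorical features into binary columns named 'feature=value'."""
    columns = sorted({f"{f}={row[f]}" for row in rows for f in features})
    encoded = []
    for row in rows:
        active = {f"{f}={row[f]}" for f in features}
        encoded.append([1 if col in active else 0 for col in columns])
    return columns, encoded

def importance_by_feature(columns, importances):
    """Sum per-category importances back to their parent feature."""
    totals = {}
    for col, importance in zip(columns, importances):
        parent = col.split("=", 1)[0]
        totals[parent] = totals.get(parent, 0.0) + importance
    return totals

# Toy rows and made-up importances (not project data or results).
toy_rows = [
    {"year": "2021", "type_product": "food packaging"},
    {"year": "2022", "type_product": "smoking material"},
]
columns, encoded = one_hot(toy_rows, ["year", "type_product"])
totals = importance_by_feature(columns, [0.1, 0.2, 0.3, 0.4])
print(columns)
print(encoded)
```

Reading importances per binary column shows which specific years or product types matter, while the summed view shows how much each original variable contributes overall.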

Figures 6 and 7 show the relative importance of the features included in the model. They identify 'year' and 'product type' as the most significant predictors of layer type, with specific years and product types having a particularly strong influence.

It is also worth noting that the features ranked ninth and tenth change order between the original and the enhanced model, with Low-Density Polyethylene holding a higher level of importance than Polyethylene Terephthalate within the type of material category.

Co-author: Vartika Joshi, who contributed significantly to the data analysis and interpretation.
