Full Project Description: LEGO Python Project
Project Description: The project aims to extract valuable insights from the LEGO dataset, exploring LEGO products, themes, prices and correlations between features.
Tools and Techniques:
Data Acquisition: The dataset was downloaded from the Maven Analytics website. It contains information about all LEGO sets and other products, including their Theme Groups, Themes, and Categories, as well as product characteristics such as Recommended Age, Release Year, number of Pieces, number of Minifigures, and prices.
Data Preparation: The dataset was carefully prepared to address data issues. Unused columns were removed, and missing values were either replaced or dropped to ensure data integrity throughout the analysis.
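The preparation steps described above could look roughly like the sketch below. The sample rows, the 'Image URL' column, and the choice to fill missing minifigure counts with zero are assumptions made for illustration only; the description does not say which columns were removed or how each gap was handled.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; column names follow the ones mentioned in this description.
raw = pd.DataFrame({
    "Name": ["Fire Truck", "Castle", "Key Chain", "Starship"],
    "Pieces": [250.0, np.nan, 1.0, 1200.0],
    "Minifigures": [2.0, 4.0, np.nan, 6.0],
    "Price, USD": [29.99, 99.99, np.nan, 159.99],
    "Image URL": ["a.png", "b.png", "c.png", "d.png"],  # assumed unused column
})

# 1. Remove columns that the analysis does not use.
prepared = raw.drop(columns=["Image URL"])

# 2. Replace missing values where a default is defensible
#    (assumption: a missing minifigure count means zero minifigures).
prepared["Minifigures"] = prepared["Minifigures"].fillna(0)

# 3. Drop rows that still lack a value the analysis needs (here: the price).
prepared = prepared.dropna(subset=["Price, USD"]).reset_index(drop=True)

print(prepared)
```

Whether to replace or drop a missing value depends on the column: a default such as zero only makes sense when absence has a natural meaning, while a missing price cannot be guessed.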
Summary Statistics: Python methods were employed to derive summary statistics.
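One common way to obtain such statistics is pandas' describe() method. The description does not name the exact methods used, so the sketch below is only an assumption, run on made-up values.

```python
import pandas as pd

# Hypothetical values, for illustration only.
df = pd.DataFrame({
    "Pieces": [100, 250, 500, 1200],
    "Price, USD": [9.99, 29.99, 49.99, 159.99],
})

# describe() returns count, mean, std, min, quartiles, and max per numeric column.
summary = df.describe()
print(summary.loc[["count", "mean", "min", "max"]])
```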
Univariate Analysis: This stage involved examining individual variables, including 'Price, USD', 'Pieces', 'Age', and 'Minifigures'. The analysis uncovered notable outliers, prompting the generation of boxplots for 'Pieces' and 'Price, USD' to visualize the data both with and without these outliers. Additionally, histograms were constructed for 'Minifigures' and 'Age' to further explore their distributions.
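The description does not state how outliers were identified. A minimal sketch, assuming the conventional 1.5 × IQR rule that boxplot whiskers are usually based on, and using made-up prices:

```python
import pandas as pd

# Hypothetical prices with one extreme value.
prices = pd.Series(
    [9.99, 14.99, 19.99, 24.99, 29.99, 34.99, 849.99],
    name="Price, USD",
)

# Interquartile range and the usual 1.5 * IQR fences.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences, then keep the rest.
is_outlier = (prices < lower) | (prices > upper)
without_outliers = prices[~is_outlier]

print("Outliers:", prices[is_outlier].tolist())
print("Max without outliers:", without_outliers.max())
```

Plotting both `prices` and `without_outliers` as boxplots gives the "with and without outliers" comparison; the second plot makes the bulk of the distribution readable because the axis is no longer stretched by the extreme value.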
Categorical Analysis: The analysis delved into categorical variables including 'Year', 'Name', 'Theme Group', 'Theme', 'Subtheme', and 'Category'. A word cloud was generated to showcase the most frequently occurring product names. Additionally, bar charts illustrated the top ten LEGO themes and theme groups by count, and a pie chart was constructed to display the distribution of categories.
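The counts behind a "top ten" bar chart and the shares behind a pie chart can both be computed with value_counts(). The sketch below uses made-up theme and category labels, and is an assumption about the approach rather than the project's actual code.

```python
import pandas as pd

# Hypothetical labels, for illustration only.
df = pd.DataFrame({
    "Theme": ["City", "City", "City", "Star Wars", "Star Wars",
              "Duplo", "Friends", "Friends", "Friends", "Friends"],
    "Category": ["Normal", "Normal", "Normal", "Normal", "Gear",
                 "Normal", "Normal", "Book", "Normal", "Gear"],
})

# Counts per theme, sorted in descending order; head(10) keeps the top ten.
top_themes = df["Theme"].value_counts().head(10)

# Relative share of each category, as a pie chart would display it.
category_share = df["Category"].value_counts(normalize=True)

print(top_themes)
print(category_share)
```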
Correlation Analysis: Using the corr() method, a correlation heatmap was generated to visualize the relationships between variables. A strong positive correlation (coefficient: 0.87) was identified between 'Price, USD' and 'Pieces'.
However, despite a correlation coefficient of 0.64, the regression line for 'Pieces' and 'Age' is not steep. Because 'Age' is discrete, data points cluster around specific age values. Replacing 'Pieces' with its average at each age value gave a clearer picture of the linear relationship between 'Pieces' and 'Age'.
A similar pattern was observed for 'Price, USD' and 'Age', with a correlation coefficient of 0.59. Replacing 'Price, USD' with its average likewise made the linear relationship between price and age clearer.
A moderate positive correlation (coefficient: 0.54) was observed between the number of Minifigs and 'Price, USD'.
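The two steps discussed in this section, computing the correlation matrix and averaging a variable per age value, can be sketched as follows. The data are made up and will not reproduce the coefficients reported above; reading "the average" as a per-age mean is also an assumption.

```python
import pandas as pd

# Hypothetical rows; 'Age' is discrete, so several sets share each value.
df = pd.DataFrame({
    "Age": [4, 4, 6, 6, 8, 8, 12, 12],
    "Pieces": [50, 90, 150, 250, 300, 500, 900, 1500],
    "Price, USD": [9.99, 14.99, 19.99, 29.99, 39.99, 59.99, 99.99, 149.99],
})

# Pairwise Pearson correlations (the default for DataFrame.corr()).
# This matrix is what a correlation heatmap displays.
corr_matrix = df.corr()

# One average per age value: the vertical stacks of points at each age
# collapse to a single point, which makes the trend easier to see.
mean_pieces_by_age = df.groupby("Age")["Pieces"].mean()
mean_price_by_age = df.groupby("Age")["Price, USD"].mean()

print(corr_matrix.round(2))
print(mean_pieces_by_age)
```

Averaging changes what is being plotted: the resulting line describes the typical set at each age, not the spread of individual sets, so it should be read alongside the original scatter rather than in place of it.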
Comparative Analysis: Box plots were created to compare the distributions of 'Price, USD', 'Pieces', and 'Minifigs' across theme groups, showing how each variable varies from one theme group to another.
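The numbers a grouped box plot draws (minimum, quartiles, median, maximum) can be inspected directly with groupby() and describe(). The theme group names and prices below are made up for illustration.

```python
import pandas as pd

# Hypothetical theme groups and prices.
df = pd.DataFrame({
    "Theme Group": ["Licensed", "Licensed", "Licensed",
                    "Pre-school", "Pre-school", "Pre-school"],
    "Price, USD": [19.99, 49.99, 159.99, 9.99, 19.99, 29.99],
})

# One row of summary statistics per theme group.
price_by_group = df.groupby("Theme Group")["Price, USD"].describe()

# The five values that define each box and its range.
print(price_by_group[["min", "25%", "50%", "75%", "max"]])
```

The same pattern applies to 'Pieces' and 'Minifigs' by swapping the selected column.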
Data Inquiries: Additional questions about the data were explored and visualized, such as: "What are the top 10 themes by years of presence?", "What are the years of presence for the LEGOLAND theme?", and "Which themes have the most minifigures?".
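As an illustration of the first two questions, the sketch below treats "years of presence" as the number of distinct release years in which a theme has at least one product. That definition is an assumption, and the rows are made up.

```python
import pandas as pd

# Hypothetical products with their theme and release year.
df = pd.DataFrame({
    "Theme": ["LEGOLAND", "LEGOLAND", "LEGOLAND", "City", "City", "City", "Duplo"],
    "Year": [1970, 1970, 1971, 2005, 2006, 2007, 1969],
})

# Count distinct years per theme, then keep the ten longest-running themes.
years_of_presence = (
    df.groupby("Theme")["Year"]
    .nunique()
    .sort_values(ascending=False)
    .head(10)
)

# The specific years in which one theme appears.
legoland_years = sorted(df.loc[df["Theme"] == "LEGOLAND", "Year"].unique().tolist())

print(years_of_presence)
print(legoland_years)
```

Using nunique() rather than a plain count matters here: a theme with many products in a single year would otherwise look longer-lived than it is.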
In conclusion, the analysis of the LEGO dataset has uncovered valuable insights into various aspects of LEGO products, themes, and correlations between features. Strong correlations were observed between 'Price, USD' and 'Pieces', along with moderate correlations between 'Age' and other variables. Additionally, the identification of outliers and the examination of distributions across different theme groups provided further understanding of the data. These findings can inform strategic decision-making and future research efforts within the realm of LEGO products.