This project builds a linear regression model that predicts a vehicle's fuel efficiency, measured in miles per gallon (mpg), from its attributes. The model will be developed in Python with pandas, scikit-learn (sklearn), and statsmodels.
The data will be sourced from a CSV file of vehicle attributes: cylinders, displacement, horsepower, weight, acceleration, model year, origin, and car name.
Load the CSV file using pandas.
Inspect and clean the data to handle missing values and correct inconsistent data formats.
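The loading and cleaning steps might look roughly like the sketch below. It reads from an in-memory string so that it runs on its own; the rows are made up for illustration and are not project data. The column names and the "?" missing-value marker are assumptions about the file, and `load_and_clean` is a hypothetical helper name.

```python
import io

import pandas as pd

# Stand-in for the real file: a few made-up rows, NOT actual project data.
# The column names and the "?" missing-value marker are assumptions.
SAMPLE_CSV = """mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
18.0,8,307.0,130,3504,12.0,70,1, Car A
15.0,8,350.0,?,3693,11.5,70,1,car b
24.0,4,113.0,95,2372,15.0,70,3,car c
26.0,4,97.0,46,1835,20.5,70,2,car d
"""


def load_and_clean(source):
    """Read the CSV and return a frame with no missing values."""
    # Treat "?" as missing so horsepower parses as a numeric column.
    df = pd.read_csv(source, na_values="?")
    # Normalize inconsistent text formatting in the name column.
    df["car_name"] = df["car_name"].str.strip().str.lower()
    # Drop rows with any missing value; imputation would be an alternative.
    return df.dropna().reset_index(drop=True)


df = load_and_clean(io.StringIO(SAMPLE_CSV))
print(df.dtypes)
print(df.shape)
```

With a real file, a path would be passed to `load_and_clean` in place of the `StringIO` object.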
Generate summary statistics and visualize the data distribution.
Examine correlations between features and the target variable (mpg).
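As a rough illustration of the two exploration steps above, the sketch below computes summary statistics and ranks features by their correlation with mpg. The data is randomly generated for the example (heavier cars are given lower mpg by construction), so the numbers say nothing about the real dataset. Plotting is left out to keep the example dependency-free.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data, NOT the project's dataset.
rng = np.random.default_rng(0)
weight = rng.uniform(1800, 4500, size=200)
acceleration = rng.uniform(8, 24, size=200)
mpg = 50 - 0.008 * weight + rng.normal(0, 1.5, size=200)
df = pd.DataFrame({"mpg": mpg, "weight": weight, "acceleration": acceleration})

# Summary statistics: count, mean, std, min, quartiles, and max per column.
summary = df.describe()

# Pearson correlation of each feature with the target, strongest first.
corr_with_mpg = (
    df.corr()["mpg"]
    .drop("mpg")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(summary.loc[["mean", "std"]])
print(corr_with_mpg)
```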
Select relevant features based on EDA insights.
Encode categorical variables when necessary.
Engineer new features if they can improve the model's predictive power.
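One possible shape for the feature steps above is sketched here. It assumes `origin` is stored as an integer code and treats it as categorical; the `power_to_weight` ratio is a hypothetical engineered feature chosen only for illustration, and the rows are made up.

```python
import pandas as pd

# Tiny made-up frame; column names and origin codes are assumptions.
df = pd.DataFrame({
    "mpg": [18.0, 24.0, 26.0, 15.0],
    "horsepower": [130.0, 95.0, 46.0, 165.0],
    "weight": [3504, 2372, 1835, 3693],
    "origin": [1, 3, 2, 1],
    "car_name": ["car a", "car c", "car d", "car b"],
})

# car_name is a near-unique label, so it is dropped rather than encoded.
features = df.drop(columns=["mpg", "car_name"])

# One-hot encode origin; drop_first removes one redundant, collinear column.
features = pd.get_dummies(features, columns=["origin"], drop_first=True, dtype=float)

# Hypothetical engineered feature: power-to-weight ratio.
features["power_to_weight"] = features["horsepower"] / features["weight"]

print(features.columns.tolist())
```

Dropping the first dummy column matters for the later multicollinearity check, since a full set of dummies plus an intercept is perfectly collinear.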
Split the data into training and testing sets.
Train a linear regression model using scikit-learn and statsmodels.
Perform a detailed statistical analysis of the model.
Evaluate the model's performance using metrics such as Mean Squared Error (MSE) and R-squared on the training dataset.
Validate the model with cross-validation techniques.
Check the linear regression assumptions: linearity, independence of errors, normality of residuals, no multicollinearity, and equal variance (homoscedasticity).
Score the model's performance on the test dataset.
Fit, evaluate, and score a Ridge Regression model as an alternative approach.
Choose the final model based on performance metrics and assumption checks.
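The comparison between the two models could be sketched as below, on synthetic data. The Ridge penalty `alpha=1.0`, the standardization step, and the choice of cross-validated MSE as the selection criterion are all assumptions made for the example, not settings taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, NOT the project's dataset.
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "weight": rng.uniform(1800, 4500, n),
    "horsepower": rng.uniform(50, 220, n),
})
y = 55 - 0.007 * X["weight"] - 0.04 * X["horsepower"] + rng.normal(0, 2, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Ridge penalizes coefficient size, so features are standardized first.
candidates = {
    "linear": LinearRegression(),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
}

results = {}
for name, estimator in candidates.items():
    # Cross-validated MSE on the training data (scorer returns negative MSE).
    cv_mse = -cross_val_score(
        estimator, X_train, y_train, cv=5, scoring="neg_mean_squared_error"
    ).mean()
    estimator.fit(X_train, y_train)
    results[name] = {"cv_mse": cv_mse, "test_r2": estimator.score(X_test, y_test)}

# Pick the candidate with the lowest cross-validated error.
best = min(results, key=lambda name: results[name]["cv_mse"])
print(results)
print("selected:", best)
```

Selecting on cross-validated error rather than on the test score keeps the test set as an unbiased final check.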