Mar 21, 2025

CO2 Emissions ML Prediction for Canadian Vehicles

Application URL: https://mlpredicationvehicles.streamlit.app/

This project explores the use of machine learning to predict CO2 emissions for vehicles in Canada. Using the CO2 Emissions_Canada.csv dataset, a Linear Regression model was developed to analyze the relationship between vehicle features (e.g., engine size, fuel consumption, and vehicle class) and CO2 emissions. Key steps included data preprocessing, one-hot encoding for categorical variables, and model evaluation using metrics like R², RMSE, and MAE. The model achieved a test R² of 0.9910, demonstrating high accuracy. This work highlights the potential of ML in environmental analysis and provides insights for reducing vehicle emissions.

1. Project Overview

Project Title: CO2 Emissions Prediction for Canadian Vehicles.

Objective: To predict CO2 emissions of vehicles in Canada based on features like engine size, fuel consumption, and vehicle class.

Dataset: CO2 Emissions_Canada.csv, which contains CO2 emission data for vehicles in Canada.

Target Variable: CO2 Emissions.

Model Type: Linear Regression.

Key Results:

Training R²: 0.9963, Test R²: 0.9910.

Training RMSE: 3.5694, Test RMSE: 5.6136.

2. Data Collection and Preprocessing

Data Sources

The dataset CO2 Emissions_Canada.csv was used, which contains vehicle CO2 emission data for Canada.

Data Description

The dataset includes the following features:

Categorical Variables:

Make: Manufacturer of the vehicle.

Model: Model of the vehicle.

Vehicle Class: Class of the vehicle (e.g., SUV, compact).

Transmission: Type of transmission (e.g., automatic, manual).

Fuel Type: Type of fuel used (e.g., gasoline, diesel).

Numerical Variables:

EngineSize: Size of the engine (in liters).

Cylinders: Number of cylinders.

Fuel_Consumption_comb: Combined fuel consumption (L/100 km).

Target Variable:

CO2 Emissions: CO2 emissions (in grams per kilometer).

Data Cleaning

No missing values were found in the dataset.

No duplicates were present.

The target variable (CO2 Emissions) was approximately Gaussian, so no data transformation (e.g., a log transformation) was applied.
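The cleaning checks described above can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the small DataFrame is an invented stand-in for CO2 Emissions_Canada.csv, and the skewness check is only one rough, assumed way to judge whether the target looks symmetric.

```python
import pandas as pd

# Toy stand-in for CO2 Emissions_Canada.csv (values are illustrative only).
df = pd.DataFrame({
    "EngineSize": [2.0, 2.4, 3.5, 1.5],
    "Cylinders": [4, 4, 6, 4],
    "Fuel_Consumption_comb": [8.5, 9.6, 11.1, 5.9],
    "CO2 Emissions": [196, 221, 255, 136],
})

# Count missing values per column, then in total.
missing_per_column = df.isna().sum()
missing_total = int(missing_per_column.sum())

# Count rows that exactly repeat an earlier row.
duplicate_count = int(df.duplicated().sum())

# Skewness near 0 is one rough sign of a symmetric, bell-like target.
target_skew = float(df["CO2 Emissions"].skew())

print(missing_total, duplicate_count, round(target_skew, 3))
```

With the real file, `df` would come from `pd.read_csv` instead of the inline dictionary, and a histogram of the target would complement the skewness number.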

Feature Engineering

Categorical Variables:

One-hot encoding was applied to the categorical variables: Make, Model, Vehicle Class, Transmission, and Fuel Type.

Numerical Variables:

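No standardization or normalization was applied to the numerical variables (EngineSize, Cylinders, Fuel_Consumption_comb) because there were no significant differences in their scales. Ordinary least squares predictions are also unaffected by rescaling the features, so standardization would not have changed the fit.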
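As a rough sketch of the encoding step, the snippet below one-hot encodes the categorical columns with scikit-learn and passes the numerical columns through unchanged. The rows, the make names, and the variable names (`preprocessor`, `categorical_cols`) are assumptions for illustration; only the column names come from the description above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Invented rows; only the column names follow the dataset description.
df = pd.DataFrame({
    "Make": ["MakeA", "MakeA", "MakeB", "MakeC"],
    "Vehicle Class": ["compact", "SUV", "SUV", "compact"],
    "Fuel Type": ["gasoline", "gasoline", "gasoline", "diesel"],
    "EngineSize": [2.0, 3.5, 2.3, 3.0],
    "Cylinders": [4, 6, 4, 6],
    "Fuel_Consumption_comb": [8.5, 11.1, 9.8, 10.4],
})
categorical_cols = ["Make", "Vehicle Class", "Fuel Type"]

# One-hot encode the categorical columns; leave numerical columns untouched.
preprocessor = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="passthrough",
)

encoded = preprocessor.fit_transform(df)
n_rows, n_cols = encoded.shape

# 3 makes + 2 vehicle classes + 2 fuel types + 3 numerical columns.
print(n_rows, n_cols)
```

Setting `handle_unknown="ignore"` means a category that appears only at prediction time is encoded as all zeros rather than raising an error, which matters for high-cardinality columns such as Model.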

Train-Test Split

The dataset was split into:

Training Set: 80% of the data.

Test Set: 20% of the data.

A random seed (random_state=42) was used for reproducibility.
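The split can be sketched as follows. The arrays are placeholders invented for the example; `test_size=0.2` and `random_state=42` match the settings stated above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix and target with 100 rows.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# 80/20 split with a fixed seed so the same rows are selected on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Repeating the call with the same seed reproduces the same test rows.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
```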

3. Model Development

Model Selection

A Linear Regression model was chosen for this project due to its simplicity and interpretability.

Libraries Used

Scikit-learn: For model training, evaluation, and one-hot encoding.

Statsmodels: For additional statistical insights and diagnostics.

Training Process

The model was trained on the training set (80% of the data).

One-hot encoding was applied to categorical variables before training.
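One way to combine encoding and training is a scikit-learn Pipeline, sketched below. This is an assumed arrangement rather than the notebook's confirmed code: the data are synthetic, the linear relationship used to generate the target is invented, and wrapping the encoder in a pipeline is a design choice that keeps the encoder fitted on training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 200

# Synthetic features; column names follow the dataset description.
X = pd.DataFrame({
    "Fuel Type": rng.choice(["gasoline", "diesel"], size=n),
    "EngineSize": rng.uniform(1.0, 6.0, size=n),
    "Fuel_Consumption_comb": rng.uniform(5.0, 20.0, size=n),
})

# Invented linear relationship plus noise, used only to have something to fit.
y = (
    23.0 * X["Fuel_Consumption_comb"]
    + 10.0 * (X["Fuel Type"] == "diesel")
    + rng.normal(0.0, 3.0, size=n)
)

model = Pipeline([
    ("preprocess", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["Fuel Type"])],
        remainder="passthrough",
    )),
    ("regressor", LinearRegression()),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fitting the pipeline fits the encoder and the regression on training data only.
model.fit(X_train, y_train)
test_predictions = model.predict(X_test)
test_r2 = model.score(X_test, y_test)

print(round(test_r2, 4))
```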

Evaluation Metrics

The following metrics were used to evaluate the model:

Mean Squared Error (MSE).

Root Mean Squared Error (RMSE).

Mean Absolute Error (MAE).

R-squared (R²).
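The four metrics can be computed with `sklearn.metrics`, as in this small sketch. The true and predicted values are made up so that the arithmetic is easy to follow; they are not taken from the project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up CO2 values (g/km); every prediction is off by exactly 10.
y_true = np.array([200.0, 250.0, 300.0, 180.0])
y_pred = np.array([210.0, 240.0, 310.0, 170.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = float(np.sqrt(mse))                 # square root of MSE, in g/km
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(mse, rmse, mae, round(r2, 4))
```

Because every error here is 10 g/km, MSE is 100 and both RMSE and MAE are 10. RMSE and MAE share the target's units, which makes them easier to interpret than MSE.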

Results

Training Metrics:

MSE: 12.7407.

RMSE: 3.5694.

MAE: 2.0185.

R²: 0.9963.

Test Metrics:

MSE: 31.5125.

RMSE: 5.6136.

MAE: 3.1762.

R²: 0.9910.

4. Results and Analysis

Performance Analysis

The model performed exceptionally well on both the training and test datasets, as indicated by the high R² values (0.9963 for training and 0.9910 for testing).

Error metrics rose from training to testing (RMSE from 3.5694 to 5.6136, MAE from 2.0185 to 3.1762), which points to mild overfitting; the test R² of 0.9910 indicates the model still generalizes well.
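A quick arithmetic check on the reported figures shows that each RMSE is the square root of the corresponding MSE and gives the size of the train-to-test gap as a ratio. The numbers are the ones reported in the Results section; the variable names are only for this sketch.

```python
import math

# Metrics as reported in the Results section.
train_mse, test_mse = 12.7407, 31.5125
train_rmse, test_rmse = 3.5694, 5.6136

# RMSE should equal sqrt(MSE) up to the four-decimal rounding in the report.
train_consistent = abs(math.sqrt(train_mse) - train_rmse) < 5e-5
test_consistent = abs(math.sqrt(test_mse) - test_rmse) < 5e-5

# How much larger the typical test error is than the typical training error.
rmse_ratio = test_rmse / train_rmse

print(train_consistent, test_consistent, round(rmse_ratio, 2))
```

The ratio is a simple way to express the generalization gap; how large a gap is acceptable depends on the application.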

Key Insights

The target variable (CO2 Emissions) followed a Gaussian distribution, so no transformation was needed.

One-hot encoding was effective in handling categorical variables.

The numerical variables (EngineSize, Cylinders, Fuel_Consumption_comb) did not require standardization, as their scales were already compatible with the linear regression model.

5. Challenges and Limitations

Challenges

The dataset contained a large number of unique values for categorical variables like Make and Model, which increased the dimensionality after one-hot encoding.

The model assumes a linear relationship between features and the target variable, which may not capture more complex patterns.

Colab Free Version Limitations: The free version of Google Colab has limited computational resources, which restricted more advanced steps such as hyperparameter tuning and computing VIF or BIC diagnostics.

Limitations

The dataset is specific to Canada, so the model may not perform well on data from other regions.

Future Improvements

Experiment with more advanced models (e.g., Random Forest, Gradient Boosting) to capture non-linear relationships.

Perform hyperparameter tuning using techniques like GridSearchCV or RandomizedSearchCV (still at the learning stage).

Calculate VIF to check for multicollinearity and BIC for model selection.

Include additional features like vehicle weight or driving conditions to improve accuracy.

Perform cross-validation to ensure robustness of the model.
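Two of the improvements listed above, cross-validation and a VIF check, could look roughly like the sketch below. It runs on synthetic data with an invented relationship between the columns, and the helper `vif` is a hypothetical function written for this example: it computes the variance inflation factor directly as 1 / (1 - R²), where R² comes from regressing one feature on the others.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 300

# Synthetic numerical features (not the real dataset).
engine_size = rng.uniform(1.0, 6.0, size=n)
# Built to correlate with engine size, so its VIF should be well above 1.
cylinders = 2.0 * engine_size + rng.normal(0.0, 1.0, size=n)
fuel_consumption = rng.uniform(5.0, 20.0, size=n)

X = np.column_stack([engine_size, cylinders, fuel_consumption])
y = 23.0 * fuel_consumption + 5.0 * engine_size + rng.normal(0.0, 3.0, size=n)

# 5-fold cross-validation: one R² score per held-out fold.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")


def vif(features, j):
    """Hypothetical helper: VIF of column j, from regressing it on the rest."""
    others = np.delete(features, j, axis=1)
    r2 = LinearRegression().fit(others, features[:, j]).score(others, features[:, j])
    return 1.0 / (1.0 - r2)


vifs = [vif(X, j) for j in range(X.shape[1])]

print(cv_r2.round(4), [round(v, 2) for v in vifs])
```

A small spread across the fold scores suggests the result does not depend on one lucky split. A VIF close to 1 means a feature is nearly uncorrelated with the others, while large values flag multicollinearity.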

6. Code and Reproducibility

Dependencies

The following Python libraries were used:

pandas, numpy, scikit-learn, statsmodels, matplotlib, seaborn

Notebooks

The project was developed in a Jupyter notebook, which includes:

Data loading and preprocessing.

Model training and evaluation.

Visualizations and analysis.

7. Conclusion

Achievements

Successfully built a linear regression model to predict CO2 emissions for Canadian vehicles.

Achieved high accuracy with a test R² of 0.9910.

Impact

This model can be used by policymakers and manufacturers to estimate and reduce CO2 emissions from vehicles.

Next Steps

Deploy the model as a web application for real-time predictions.

Expand the dataset to include vehicles from other regions for better generalization.

8. Appendix

References

Dataset: CO2 Emissions_Canada.csv.

Scikit-learn documentation: https://scikit-learn.org.

Statsmodels documentation: https://www.statsmodels.org.