Python machine learning price prediction project

The full project can be viewed on the GitHub page by clicking the button below:

My work on this business intelligence data science project involved looking for market trends, building a deeper understanding of consumer behaviour, and converting metrics so that a dataset with a multitude of different measurement conventions could be compared apples to apples.

Data science projects are crucial in this process as they provide valuable insights by analyzing large amounts of data, identifying patterns and trends, and making predictions based on that data. These insights can help companies make informed decisions and stay ahead of their competition.

These kinds of data science projects can inform companies how to market their products, what challenges or inconsistencies they may face, and more.

summary of project

The “Used Cars” project is a data science project by May Cooper that uses machine learning techniques to predict the sale price of used cars in India. The project’s objective is to provide statistical information on average sale prices and the used-car market in India, to help people make informed buying choices and get a good deal on a used vehicle.

To achieve this objective, the project required the use of complex data conversions to handle different metrics and machine learning techniques.

The project aims to minimize the difference between the actual price and the price estimated by the model, and it evaluates model performance using metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2).
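As a minimal sketch of how these metrics can be computed with scikit-learn (the price values below are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted selling prices (illustrative values only)
y_true = np.array([450000, 320000, 780000, 150000])
y_pred = np.array([430000, 350000, 745000, 180000])

mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same units as the target (price)
r2 = r2_score(y_true, y_pred)              # proportion of variance explained

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R2:   {r2:.3f}")
```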

Samples from the project

Exploratory Data Analysis (EDA)

Target Variable

Log of Target Variable

Taking the log of a target variable is a common data preprocessing technique for making the target more normally distributed. This matters because many machine learning algorithms assume a normally distributed target, and transforming it to be closer to normal can improve model performance.

In addition, taking the log of a target variable can also be useful when the target variable is highly skewed, meaning that it has a few large values and many small values. Taking the log of a skewed variable can make the variable more symmetric and more closely approximate a normal distribution.

In this step of the project, the logarithmic function is applied to the ‘selling_price’ column of the dataframe, and a histogram of the log-transformed variable is plotted to visualize the distribution of the data. The histogram shows that the transformed variable is closer to a normal distribution than the original.
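A minimal sketch of this transformation, using a synthetic dataframe in place of the project’s actual dataset (the column names here are assumptions):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the used-car dataframe; in the project,
# df would be loaded from the actual dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({"selling_price": rng.lognormal(mean=12.5, sigma=0.8, size=1000)})

# Log-transform the skewed target; log1p guards against zero prices
df["selling_price_log"] = np.log1p(df["selling_price"])

# Plot raw vs. log-transformed distributions side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df["selling_price"].hist(bins=50, ax=axes[0])
axes[0].set_title("selling_price (raw)")
df["selling_price_log"].hist(bins=50, ax=axes[1])
axes[1].set_title("selling_price (log)")
plt.tight_layout()
plt.show()
```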

Feature Engineering

Feature engineering is the process of transforming raw data into features that can be used to train a machine learning model. It is an essential step in the machine learning process as it can greatly impact the performance of a model.

There are several reasons why feature engineering is important:

  1. Improving model performance: By creating new features or transforming existing features, the model can be given more information to work with, which can lead to better performance.

  2. Handling missing data: Feature engineering can be used to fill in missing data or create new features from the existing data, which can help the model to work with incomplete data sets.

  3. Handling categorical data: Many machine learning algorithms are not able to work with categorical data directly. Feature engineering can be used to convert categorical data into numerical data, which can be used by the model.

  4. Reducing dimensionality: Feature engineering can be used to reduce the number of features in a data set, which can help to improve the model’s performance by reducing overfitting and computational costs.

  5. Identifying patterns: Feature engineering can help to identify patterns in the data that may not be obvious, which can be used to create new features and improve the model’s performance.

In summary, feature engineering is an important step in the machine learning process that can be used to improve the performance of a model, handle missing data and categorical data, reduce dimensionality, and identify patterns in the data.
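As a brief illustration of a few of these ideas, here is a sketch on a synthetic dataframe; the column names (‘year’, ‘fuel’, ‘mileage’) are assumptions for illustration, not necessarily those of the project’s dataset:

```python
import pandas as pd

# Synthetic stand-in for a few raw used-car rows
df = pd.DataFrame({
    "year": [2014, 2018, 2011],
    "fuel": ["Petrol", "Diesel", "Petrol"],
    "mileage": ["23.4 kmpl", "19.7 kmpl", None],
})

# 1. Create a new feature: car age is often more predictive than model year
df["car_age"] = 2023 - df["year"]

# 2. Handle messy string metrics: strip units so 'mileage' becomes numeric
df["mileage"] = df["mileage"].str.replace(" kmpl", "", regex=False).astype(float)

# 3. Fill missing values with a simple statistic (the median)
df["mileage"] = df["mileage"].fillna(df["mileage"].median())

# 4. Convert categorical data into numeric indicator columns
df = pd.get_dummies(df, columns=["fuel"], drop_first=True)

print(df)
```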

Machine Learning Module

Creating ML Model

This code is creating a list of machine learning models (regressors) to be tested on a dataset. The models included in the list are Random Forest Regressor, Linear Regression, Support Vector Machine (SVR), Decision Tree Regressor, XGBoost Regressor, and Gradient Boosting Regressor.

It then creates two dataframes, one for storing the predictions made by each model and another for storing the evaluation metrics for each model. The script loops through each of the models in the list, fits the model to the training data, makes predictions on the test data, and calculates evaluation metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), R-Squared, and cross-validated max error.

The evaluation metrics are then stored in the metrics dataframe, which is cleaned up and displayed. The model with the best cross-validated scores, for example the lowest RMSE and the highest R-squared, is considered the best model.
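A minimal sketch of such a comparison loop, run on synthetic data rather than the project’s actual feature matrix (XGBoost is omitted so the example depends only on scikit-learn):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic data as a stand-in for the prepared used-car features
X, y = make_regression(n_samples=500, n_features=10, noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate regressors to compare
models = [
    RandomForestRegressor(random_state=0),
    LinearRegression(),
    SVR(),
    DecisionTreeRegressor(random_state=0),
    GradientBoostingRegressor(random_state=0),
]

rows = []
for model in models:
    model.fit(X_train, y_train)        # fit on the training split
    y_pred = model.predict(X_test)     # predict on the held-out split
    rows.append({
        "model": type(model).__name__,
        "MAE": mean_absolute_error(y_test, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
        "R2": r2_score(y_test, y_pred),
    })

# Collect the metrics into a dataframe, best RMSE first
metrics = pd.DataFrame(rows).sort_values("RMSE")
print(metrics)
```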

Residual Modeling

In machine learning, residual modeling refers to a technique that involves modeling the difference, or residual, between the predicted value of a model and the actual value. The residual is the difference between the observed value of the dependent variable (y) and the value (ŷ) the model predicts from the independent variables (x). Residual modeling is a way to analyze the performance of a predictive model by identifying patterns in the residuals that can be used to improve the model.

Residual analysis is a technique used to check if the residuals are random or if there is a pattern in the residuals. A pattern in the residuals indicates that the model is not able to explain the variation in the data. A random pattern of residuals indicates that the model is a good fit for the data.

Residual modeling is often used in regression analysis and is an important technique for evaluating the performance of a model. It allows data scientists to identify patterns in the residuals that indicate where the model is not performing well and make adjustments to the model to improve its performance.

Overall, residual modeling is a powerful technique that helps to identify the strengths and weaknesses of a machine learning model, allowing data scientists to improve the model’s performance.
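A minimal sketch of such a residual check, again on synthetic data rather than the project’s actual model:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared used-car data
X, y = make_regression(n_samples=500, n_features=10, noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
residuals = y_test - y_pred  # observed minus predicted

# Residuals vs. predicted values: a shapeless cloud around zero suggests
# a good fit; curvature or a funnel shape suggests the model misses structure
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.title("Residual plot")
plt.show()
```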

contact

maycooperinc@gmail.com

437-219-2106

© 2023 All Rights Reserved.