This article covers how to use Python to train your first machine learning model! Check out EXLskills.com for more training like this one!
Prerequisites:
- Know that Python exists as a programming language
- Basic knowledge of correlations between variables
- Wanting to learn more about machine learning!
Scope of the Article:
- Show how to identify correlations between variables
- Manipulate a dataset so that you can use Linear Regression to predict an outcome
- Provide Python code that you can run in a Google Colab Notebook!
Problem Statement: Predict MPG
To begin, we must first understand the problem that we are trying to solve. In this article, I will show how to take a dataset of a bunch of cars, and use Linear Regression to predict the MPG of a different model. We will cover the basics of the modeling in this article, and then I will provide you with the code to create it in a Google Colab Notebook.
For this tutorial, I will be using widely available data from here. In addition, I will be uploading this to a Google Colab notebook that can be found here. Feel free to copy this notebook into your own Google account to play with the code!
- Link to data: https://www.kaggle.com/uciml/autompg-dataset
- Link to notebook: https://colab.research.google.com/drive/1P9DJIDsaTosUkdSNDHCLb-aHLX4LMk56
We are going to start by loading our data with the pandas library. For a more in-depth intro to pandas, I have written another article that you can check out! We will load the CSV and then plot a scatter matrix to better understand which variables have a high correlation with mpg.
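Here is a minimal sketch of that loading and plotting step. It is not the exact notebook code: the function names are my own, the file name `auto-mpg.csv` in the usage comment is an assumption about how the Kaggle download is named, and I am assuming missing horsepower values are stored as `?` in the file.

```python
import pandas as pd


def load_auto_mpg(csv_path):
    """Read the car data from a CSV file into a DataFrame.

    Assumption: missing values are written as "?" in the file, so they are
    converted to NaN here to keep the affected column numeric.
    """
    return pd.read_csv(csv_path, na_values="?")


def plot_scatter_matrix(df):
    """Draw pairwise scatter plots of every numeric column.

    pandas uses matplotlib for this, so matplotlib must be installed.
    """
    from pandas.plotting import scatter_matrix

    numeric = df.select_dtypes(include="number")
    return scatter_matrix(numeric, figsize=(12, 12), diagonal="hist")


# Example usage (the file name is an assumption):
# df = load_auto_mpg("auto-mpg.csv")
# print(df.columns.tolist())
# plot_scatter_matrix(df)
```

In a Colab notebook you would upload the CSV first, then call the two functions in separate cells so the plot renders under its own cell.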
Understanding the Dataset
The first thing we want to understand is what the features of the dataset are. From some basic Exploratory Data Analysis (EDA) in the Google Colab Notebook, the features are as follows:
['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'car name']
From just looking at the column names, I can see some variables that may be associated with mpg, but just to be sure, I plotted a scatter matrix of the features.
Here we can see that displacement, cylinders, and weight are negatively correlated with mpg, while acceleration and model year seem to be positively correlated with mpg.
If I were to go more in-depth with this analysis, I would check to see how strong those correlations are, and if any of those variables are highly correlated with each other.
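If you do want to run that deeper check, a sketch along these lines would work. The helper names and the 0.8 threshold are my own choices, and the data below is synthetic stand-in data, not the real car values, so the block runs on its own.

```python
import numpy as np
import pandas as pd


def correlations_with_target(df, target="mpg"):
    """Pearson correlation of each numeric feature with the target,
    sorted from most negative to most positive."""
    numeric = df.select_dtypes(include="number")
    return numeric.corr()[target].drop(target).sort_values()


def correlated_feature_pairs(df, target="mpg", threshold=0.8):
    """Feature pairs (target excluded) whose absolute correlation
    is above the threshold."""
    features = df.select_dtypes(include="number").drop(columns=[target])
    corr = features.corr()
    names = list(corr.columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = float(corr.loc[first, second])
            if abs(r) > threshold:
                pairs.append((first, second, r))
    return pairs


# Synthetic stand-in data (NOT the real dataset): heavier cars get a
# larger displacement and a lower mpg; acceleration is unrelated noise.
rng = np.random.default_rng(0)
weight = rng.uniform(1500, 5000, size=200)
toy = pd.DataFrame({
    "weight": weight,
    "displacement": weight / 15 + rng.normal(scale=20, size=200),
    "acceleration": rng.normal(loc=15, scale=3, size=200),
    "mpg": 45 - 0.007 * weight + rng.normal(scale=2, size=200),
    "car name": ["toy"] * 200,
})

print(correlations_with_target(toy))
print(correlated_feature_pairs(toy))
```

On the real data you would pass your loaded DataFrame instead of `toy`. Pairs that show up in the second list carry overlapping information, which makes their individual coefficients harder to interpret.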
Test Train Split
Rather than looking at the correlations in depth, I will explore a concept called the train/test split. To get into this, I first want to explain a very important concept: Overfitting vs Underfitting.
When you create a machine learning model, you generally use one portion of your data to find correlations (training) and another portion to see how well you can make predictions (testing). If your model latches onto patterns in the training data that do not hold up in new data, this is called Overfitting, while on the other hand, a model that misses correlations that really are there is Underfitting.
A great reference to Overfitting and Underfitting can be found here!
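To make the idea concrete, here is a small sketch that is not part of the notebook. It uses synthetic data (a straight line plus noise) and polynomial fits of different degrees; the degrees, the noise level, and the helper name are all my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the car data): a straight line plus noise.
x = np.linspace(0.0, 1.0, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)

# Hold out every third point for testing and train on the rest.
is_test = np.arange(x.size) % 3 == 0
x_train, y_train = x[~is_test], y[~is_test]
x_test, y_test = x[is_test], y[is_test]


def fit_and_score(degree):
    """Fit a polynomial of the given degree on the training points and
    return (train_mse, test_mse)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse


# Degree 0 is a flat line (too simple), degree 1 matches how the data
# was generated, and degree 8 has enough freedom to chase the noise.
for degree in (0, 1, 8):
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The training error can only go down as the model gets more flexible, which is exactly why it cannot be trusted on its own. The held-out test error is the number that tells you whether the extra flexibility found something real.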
Because of this, we want to split our data and make sure that we find the happy medium between the two extremes. This leads us to splitting the data into a training set and a test set. Sklearn is a library that has a built-in splitting mechanism for us. So in the notebook I am going to split our data into X_train, y_train, X_test, and y_test, and I will compare my predictions to the actual values in y_test.
# Make target (y) equal to mpg
y = df.pop('mpg')
# Make X a matrix containing displacement, cylinders, weight, acceleration and model year
X = df[['displacement','cylinders','weight','acceleration','model year']]
# Import the necessary function from sklearn
from sklearn.model_selection import train_test_split
# Split the data: 67% for training, 33% held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Training The Model
Now that I have done this, I am going to use another part of sklearn to fit a linear regression model to the data.
A commonly quoted definition of linear regression is:
Linear regression is a linear model, e.g. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, that y can be calculated from a linear combination of the input variables (x).
It is not always as basic as it is here with the 2D example, but we will skip over that for now. Check out EXLskills.com for more information on these topics.
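For the five features used in this article, the "linear combination" from that definition can be written out like this. The symbols are notation I am introducing here: the beta values are the intercept and coefficients that the model learns, and the hat marks a predicted value.

```latex
\[
\widehat{\text{mpg}} = \beta_0
  + \beta_1 \cdot \text{displacement}
  + \beta_2 \cdot \text{cylinders}
  + \beta_3 \cdot \text{weight}
  + \beta_4 \cdot \text{acceleration}
  + \beta_5 \cdot \text{model year}
\]
```

Training the model means choosing the beta values that make these predictions as close as possible to the real mpg values in the training set.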
You can see this process in the notebook. Once the model is trained, I can look at the coefficient for each of the variables! You can explore the p-values for all of these independently, but for now here are the coefficients:
- ('displacement', -0.00547947729681713)
- ('cylinders', 0.2255864614674237)
- ('weight', -0.006472473220529684)
- ('acceleration', 0.07036516004706336)
- ('model year', 0.7878060025258705)
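A list like the one above can be produced with a few lines of sklearn. This is a sketch rather than the exact notebook code: the helper name is mine, and the data is synthetic stand-in data built from known coefficients (not the real car values), so you can see the fit recover them.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["displacement", "cylinders", "weight", "acceleration", "model year"]


def fit_and_list_coefficients(X_train, y_train):
    """Fit an ordinary least squares model and pair each coefficient
    with the name of the column it belongs to."""
    model = LinearRegression()
    model.fit(X_train, y_train)
    coefficients = [
        (name, float(value)) for name, value in zip(X_train.columns, model.coef_)
    ]
    return model, coefficients


# Synthetic stand-in data (NOT the real dataset): random feature values
# combined with known coefficients plus a little noise.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(200, len(FEATURES))), columns=FEATURES)
true_coefficients = np.array([-1.0, 0.5, -2.0, 0.25, 3.0])
y_demo = 20.0 + X_demo.to_numpy() @ true_coefficients + rng.normal(scale=0.01, size=200)

model, coefficients = fit_and_list_coefficients(X_demo, y_demo)
for name, value in coefficients:
    print(f"{name}: {value:.3f}")
print(f"intercept: {model.intercept_:.3f}")
```

With the real data you would pass `X_train` and `y_train` from the split. Keep in mind that each coefficient is in "mpg per unit of that feature", so a small number on weight (measured in thousands) can matter as much as a large number on model year.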
Interpreting the Results
Let’s now take a look at how well we did at predicting the mpg! To do this, we are going to look at how far off we were from each data point in the test set and then find out how far off we were on average. Below is a graphic that depicts this.
I will use a formula to calculate this called the Root Mean Squared Error (RMSE), which is the square root of the Mean Squared Error.
This may seem very complicated, but really we are just looking at how far away we were from the actual value for each data point. In code, the way I would describe this is:
Sqrt( Mean( ( Predictions - Actual )^2 ) )
By taking the square root, our error value is in mpg units and we can see how the model performed!
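Here is that same calculation as runnable Python, using only numpy. The function name `rmse` is mine, and the four numbers in the example are made up so the arithmetic is easy to follow; they are not results from the model.

```python
import numpy as np


def rmse(predictions, actual):
    """Root Mean Squared Error: square each error, average them,
    then take the square root so the result is in the original units."""
    predictions = np.asarray(predictions, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predictions - actual) ** 2)))


# Made-up example: the errors are 3, -4, 0 and 0 mpg.
# Squared errors: 9, 16, 0, 0 -> mean 6.25 -> square root 2.5
print(rmse([23.0, 26.0, 18.0, 31.0], [20.0, 30.0, 18.0, 31.0]))  # prints 2.5
```

For the model in this article you would call it as `rmse(model.predict(X_test), y_test)`. Because errors are squared before averaging, one big miss counts for more than several small ones.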
It looks like with our model we were only 3 mpg off from the actual values on average! This is pretty good!
Finally, at the bottom of the Google Colab notebook, I built a function that allows you to enter your own values for each of the variables, and it will predict the mpg for you! Check it out!
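If you want to build something similar yourself, a function like that could look roughly like this. It is a sketch, not the notebook's actual function: the name `predict_mpg` and its parameters are my own, and the demo model is trained on synthetic stand-in data where mpg depends only on weight, so the expected answer is easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["displacement", "cylinders", "weight", "acceleration", "model year"]


def predict_mpg(model, displacement, cylinders, weight, acceleration, model_year):
    """Predict mpg for a single car from a fitted model.

    The values go into a one-row DataFrame whose columns are in the same
    order as the columns used for training.
    """
    row = pd.DataFrame(
        [[displacement, cylinders, weight, acceleration, model_year]],
        columns=FEATURES,
    )
    return float(model.predict(row)[0])


# Synthetic stand-in training data (NOT the real dataset). By construction,
# mpg = 50 - 0.005 * weight, and the other columns have no effect.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "displacement": rng.uniform(70, 450, size=50),
    "cylinders": rng.integers(3, 9, size=50),
    "weight": rng.uniform(1600, 5100, size=50),
    "acceleration": rng.uniform(8, 25, size=50),
    "model year": rng.integers(70, 83, size=50),
})[FEATURES]
y_demo = 50.0 - 0.005 * X_demo["weight"]

demo_model = LinearRegression().fit(X_demo, y_demo)

# A 3000 lb car should come out very close to 50 - 0.005 * 3000 = 35 mpg.
print(predict_mpg(demo_model, displacement=200, cylinders=6, weight=3000,
                  acceleration=15, model_year=76))
```

Passing a DataFrame with the training column names, rather than a bare list, keeps the values lined up with the right coefficients even if you later reorder the arguments.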
So as you can see, we did pretty well with this model. In this article, we covered a lot of topics very quickly, and you could spend weeks on many of them individually. Let me know in the comments section what you would like more detail on, or if the Google Colab Notebooks were useful.
As always, check out EXLskills.com for more tutorials, and follow me on Twitter @ElliottSaslow