Today, let’s go ahead and take a look at some Unsupervised Machine Learning! What is Unsupervised Machine learning?
Unsupervised learning is a branch of machine learning that learns from test data that has not been labeled, classified or categorized. Instead of responding to feedback, unsupervised learning identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data. Alternatives include supervised learning and reinforcement learning.
For us, we are going to take a very basic example, and try to understand how this works! We are going to be working with a built-in dataset from
- Sepal length in cm
- Sepal width in cm
- Petal length in cm
- Petal width in cm
With the three categories of flowers being:
- Iris Setosa
- Iris Versicolour
- Iris Virginica
This dataset is describing flowers if you haven’t already figured that out! So today, I am going to visually show you with this dataset how Unsupervised Machine Learning works!
We are going to be using a clustering algorithm:
Clustering is a method of unsupervised learning and is a common technique for statistical data analysis used in many fields. In Data Science, we can use clustering analysis to gain some valuable insights from our data by seeing what groups the data points fall into when we apply a clustering algorithm.
In particular we are going to use K-Means
Kmeans is one of the most popular “clustering” algorithms. K-means stores k centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster’s centroid than any other centroid.
K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
I am going to showcase how we go through this algorithm with a K = 5. Here is the first step of our iteration through the Kmeans algorithm:
Now that we have these clusters, we need to do a couple of things:
- Assign closest each cluster to its closest points
- Update the centers by averaging the location of all the clusters.
Notice how I am plotting 2 of the dimensions against each other, although there are 4, this gives a good visual understanding.
Below is the pseudocode algorithm for updating each center.
Assigning the Initial Clusters to their closest points
Now that we understand how our algorithm finds the closest points and assigns them to clusters, lets take a look at our plot after the first iteration:
After we assign the points want to continue to update each cluster until we converge on a solution. I skipped a couple of iterations, but as you can see, the points converge very quickly!
And below, I provide some pseudo code for updating the centers!
This was a very visual example of the algorithm, and below I provide the full algorithm for everyone
Full Algorithm Definition
The Κ-means clustering algorithm uses iterative refinement to produce a final result. The algorithm inputs are the number of clusters Κ and the data set. The data set is a collection of features for each data point. The algorithms starts with initial estimates for the Κ centroids, which can either be randomly generated or randomly selected from the data set. The algorithm then iterates between two steps:
1. Data assignment step:
Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. More formally, if ci is the collection of centroids in set C, then each data point x is assigned to a cluster based on
where dist( · ) is the standard (L2) Euclidean distance. Let the set of data point assignments for each ith cluster centroid be Si.
2. Centroid update step:
In this step, the centroids are recomputed. This is done by taking the mean of all data points assigned to that centroid’s cluster.
The algorithm iterates between steps one and two until a stopping criteria is met (i.e., no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached).
This algorithm is guaranteed to converge to a result. The result may be a local optimum (i.e. not necessarily the best possible outcome), meaning that assessing more than one run of the algorithm with randomized starting centroids may give a better outcome.
Questions?! Let me know in the comments! As always, follow on twitter for more great content