Clustering Lockdown Data
Clustering using Hierarchical Clustering
Another type of clustering is hierarchical clustering, which builds a hierarchy of clusters either bottom-up (agglomerative) or top-down (divisive). In hierarchical clustering, data points are merged or split into clusters using techniques that generally involve some distance calculation between sets of data points. Hierarchical clustering is also a nice way to judge the right number of clusters: depending on the distance metric, a different number of clusters is created, and the user can judge from a dendrogram whether those clusters are distinct enough to cluster the data with.
When clustering the survey and lockdown data, several distance metrics were used to assess how well the dataset clusters. Almost right off the bat, the cosine similarity metric proved nearly useless, while the Euclidean and Manhattan distance metrics were helpful to some degree. No distance metric is perfect, but it is important to try a few in case one metric has a profound effect on the quality of the clusters.
Distance Calculations
Three different distance metrics were used to cluster the data: Euclidean, Manhattan, and cosine similarity. Each of these metrics can be visualized on its own, and using each of them the data can be clustered hierarchically; the results are visualized in the subsequent sections. Interestingly, the simpler metrics tend to work better here, though that does not necessarily mean that a more sophisticated distance metric not tested here would not outperform the Euclidean and/or Manhattan distance calculations.
Euclidean Distance
The Euclidean distance between two data points, also known as the Pythagorean distance or, in math classes, simply the distance formula, is the square root of the sum of the squared differences between the corresponding coordinates of the two points. In the Cartesian plane, one takes the squared differences between the x coordinates and between the y coordinates, sums them, and takes the square root of the result. In higher dimensions, the same concept is applied to every dimension rather than just x and y.
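As a quick worked example (a standalone sketch, not the code used for this analysis), the calculation for two small vectors can be written in Python as follows, where a and b are made-up stand-ins rather than actual survey records.

import numpy as np

a = np.array([1.0, 2.0, 3.0])   # stand-in vectors, not actual survey records
b = np.array([4.0, 0.0, 3.0])

# Square root of the sum of squared coordinate differences
euclidean = np.sqrt(np.sum((a - b) ** 2))

# NumPy's built-in norm gives the same result
assert np.isclose(euclidean, np.linalg.norm(a - b))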
Once the distance matrix is generated, the data can be clustered and visualized using a dendrogram. Below is the result of those calculations. One important note is the use of the sampled dendrogram in the middle tab: because the full dendrogram with labels is exceptionally hard to read, a randomly sampled version is shown in the middle tab. It displays a similar pattern at a high level, and its labels are actually legible. Also, the distance matrix shows the pairwise distances between points, where values closer to 0 indicate points that are closer together.
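For reference, a minimal sketch of how the distance matrix, the hierarchical clustering, and the dendrogram could be produced with SciPy is shown below; here X is a random placeholder for the cleaned numeric survey/lockdown features, not the project's actual data.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform

# X is a placeholder for the cleaned numeric survey/lockdown features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

dist = pdist(X, metric="euclidean")   # condensed pairwise distance matrix
Z = linkage(dist, method="ward")      # bottom-up (agglomerative) clustering

# Full dendrogram; clustering a random sample of rows the same way
# keeps the labels readable when the full tree is too crowded
dendrogram(Z, no_labels=True)
plt.title("Hierarchical clustering, Euclidean distance")
plt.show()

# Heatmap-style view of the square distance matrix (values near 0 = close points)
plt.imshow(squareform(dist), cmap="viridis")
plt.colorbar(label="Euclidean distance")
plt.show()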
From the visualizations, a couple of k values should be tried, including at least 2, 4, and 8, though other k values in between and beyond are certainly not out of reach. In an ideal world, a value of 2 would be perfect, since that could indicate whether the lockdown affects food security; knowing how many survey respondents report distress about their food situation because they are in lockdown would be important. Also, looking at the generated distance matrix, it is reasonable to see some points that sit apart from the rest of the clusters, as indicated by the purple lines. Further analysis down the page will confirm that point.
Manhattan Distance
The Manhattan distance is simply the sum of the absolute differences between two rows or vectors. It is probably the simplest way to measure distance, yet it can still be an effective way of clustering data. The same types of visualizations as above were generated, except using Manhattan distance rather than Euclidean distance.
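Relative to the Euclidean sketch above, only the distance computation changes; a hedged variant using the same placeholder matrix X:

from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Same placeholder feature matrix X as in the Euclidean sketch
dist_manhattan = pdist(X, metric="cityblock")             # sum of absolute differences
Z_manhattan = linkage(dist_manhattan, method="average")   # Ward assumes Euclidean distances, so use average linkage
dendrogram(Z_manhattan, no_labels=True)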
From the visualizations, a couple of k values should be tried, including at least 2, 4, and 8, though 6 and 7 also seem like reasonable options given the dendrogram. These results are very similar to the ones produced by the Euclidean distance, which makes sense since the Manhattan distance calculation is quite close to the Euclidean distance calculation.
Cosine Similarity Distance
In cosine similarity, the cosine of the angle formed between two vectors is used to measure the similarity between two data vectors. To find the cosine of the angle, the inner (dot) product of the vectors, normalized to unit length, is taken. Cosine similarity usually tends to work better with sparse matrices, where a large number of entries are 0. That happens to be the case for the text data and not for the record data, but that doesn't mean it won't work. The same types of visualizations as above were generated, except using cosine similarity rather than Euclidean distance.
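A hedged variant of the earlier sketch, again using the placeholder matrix X, converts cosine similarity into a distance before clustering:

from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Cosine "distance" is 1 minus cosine similarity, so vectors pointing in the
# same direction have distance 0 regardless of magnitude
dist_cosine = pdist(X, metric="cosine")
Z_cosine = linkage(dist_cosine, method="average")
dendrogram(Z_cosine, no_labels=True)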
After plotting and visually inspecting the cosine similarity dendrogram and distance matrix, it is clear that cosine similarity is not helpful at all. Judging from the distance matrix, the data is probably not going to cluster at all under this metric, which is confirmed by some of the methods described below.
Determining the best value for K
To further confirm the values of k suggested by the dendrograms, several more tests can be run to provide additional visual indicators. The silhouette method and the gap statistic method further confirm which values of k should be used in K-Means clustering. When plotting the two methods, each distance metric is taken into account to validate exactly which k value should be used and which distance metric gives the best results.
The Silhouette Method
The silhouette method is another visual method that can be used to determine the right k value. It uses a metric called the silhouette score to measure how well each point fits within its assigned cluster. When plotting the average silhouette score against each k value, the objective is to find the k at which the average silhouette width peaks before dropping significantly, and choose that k value. For each distance metric (Euclidean, Manhattan, and cosine similarity), a plot of the silhouette method is shown below.
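A rough sketch of how such a plot could be generated with scikit-learn is shown below, again assuming the placeholder feature matrix X; note that scikit-learn's k-means always assigns points using Euclidean distance, so only the silhouette computation changes with the metric argument.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

k_values = range(2, 11)   # silhouette scores need at least 2 clusters
scores = []
for k in k_values:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Average silhouette width; the metric argument controls the distance used
    scores.append(silhouette_score(X, labels, metric="euclidean"))

plt.plot(list(k_values), scores, marker="o")
plt.xlabel("k")
plt.ylabel("average silhouette width")
plt.show()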
Using the Manhattan distance or the Euclidean distance, it seems fairly obvious that 2 clusters should be used. The cosine similarity plot indicates that 1 cluster is best, which means the data can't be clustered under that metric. Since the cosine distance clearly did not work in this case, as seen in the previous visualization, this result makes sense.
The Gap Statistic Method
Another test to determine a good number of clusters is the gap statistic method. The gap statistic measures the jumps in within-cluster distance when another centroid is added. Imagine a k value of 1, where there is only one cluster, but visually it is clear there should be two. When another centroid is added, increasing k to 2, the gap drops rapidly, since a large number of points from the first cluster attach to the new centroid and the within-cluster distances become much smaller. When looking at the gap statistic plot, the idea is to look for the first large drop and choose the k value right before it. A visualization of the gap statistic method for each distance metric is below.
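The gap statistic is not built into scikit-learn, so the sketch below implements a simplified version of the computation (uniform reference samples drawn from the bounding box of the data, log within-cluster dispersion); it is an approximation of the idea rather than the exact procedure used for the plots in this section.

import numpy as np
from sklearn.cluster import KMeans

def within_dispersion(data, labels):
    # Sum of squared distances from each point to its cluster centroid
    total = 0.0
    for lab in np.unique(labels):
        pts = data[labels == lab]
        total += np.sum((pts - pts.mean(axis=0)) ** 2)
    return total

def gap(data, k, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(data)
    log_wk = np.log(within_dispersion(data, labels))

    # Reference datasets drawn uniformly from the bounding box of the data
    lo, hi = data.min(axis=0), data.max(axis=0)
    ref_logs = []
    for _ in range(n_refs):
        ref = rng.uniform(lo, hi, size=data.shape)
        ref_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(ref)
        ref_logs.append(np.log(within_dispersion(ref, ref_labels)))
    return np.mean(ref_logs) - log_wk

gaps = [gap(X, k) for k in range(1, 11)]   # plot gaps against k to spot the drop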
Two things are immediately obvious when looking at these graphs. One, the distance metric doesn't matter, and two, the gap statistic indicates that the data can't really be clustered effectively into more than 1 cluster, unless there is a drop much further along in the graph. Effectively, within the first 10 values of k, the only drop is after k = 1.
Clustering using K-Means
Using the combination of the previous methods, the following k values seem like reasonable choices for K-Means clustering: 2, 4, and 8. Each of these values was used to cluster the data with K-Means and to visualize the data in 2 and 3 dimensions. Included in the analysis is a download of the statistics associated with each cluster of data.
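A minimal sketch of that workflow, assuming the same placeholder matrix X and using a PCA projection purely for the 2-D view, might look like this:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

coords = PCA(n_components=2).fit_transform(X)   # 2-D projection used only for plotting

for k in (2, 4, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=10, cmap="tab10")
    plt.title(f"K-Means, k = {k}")
    plt.show()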
After visualizing the data and viewing the statistics, it is clear that at least 8 clusters should be used. If a vector falls into cluster 1, 2, 3, 5, 7, or 8, it is pretty clear that the vector would not indicate being in lockdown. For clusters 4 and 6, however, the vector is in lockdown with higher probability, though that does not mean the vector is definitely in lockdown.
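The per-cluster statistics behind that claim can be sketched as a simple cross-tabulation, assuming the known labels live in a hypothetical pandas Series named labels_known:

import pandas as pd
from sklearn.cluster import KMeans

cluster_ids = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)

# Share of each known label within every cluster (rows sum to 1);
# labels_known is a hypothetical pandas Series holding the survey labels
summary = pd.crosstab(cluster_ids, labels_known, normalize="index")
print(summary)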
Clustering using Density Clustering
Density-based clustering is a very useful algorithm for clustering data that is densely packed together but perhaps unable to be clustered using traditional techniques. The classic image of density clustering is similar to the one below, where there are two distinct clusters, but traditional clustering techniques are not able to distinguish them. Typically, a 2D visualization of a traditional clustering allows the user to draw separate, non-overlapping shapes around individual clusters, but that can't happen here.
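A classic toy version of that picture can be reproduced with scikit-learn's nested-circles generator, where k-means cuts across both rings but DBSCAN separates them; this is purely illustrative and unrelated to the survey data.

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles

# Two nested rings: one dense cluster inside another
toy, _ = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy)  # cuts straight through both rings
dbscan_labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(toy)                  # recovers each ring as its own cluster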
In the case of clustering the survey data, it wasn't immediately clear that density clustering would produce distinct clusters, let alone clusters with any significant meaning. To cluster the data based on density, it is first important to determine the epsilon value that would produce the best number of clusters. To do that, a plot of the k-nearest-neighbor distances is created and the "knee" in the plot is identified. In this case, the knee of the plot was 8, so using a convex hull plot, the clusters can be seen. Both visualizations are shown below.
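A sketch of that epsilon search with scikit-learn is shown below; the knee value of 8 quoted above comes from the project's own plot, not from this code.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

k = 5   # neighborhood size (MinPts-style parameter); a tuning choice
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)

# Sorted distance to each point's k-th nearest neighbor; the "knee" suggests epsilon
plt.plot(np.sort(distances[:, -1]))
plt.ylabel(f"distance to {k}th nearest neighbor")
plt.show()

# The epsilon of 8 is taken from the knee observed in the project's plot
db_labels = DBSCAN(eps=8, min_samples=k).fit_predict(X)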
According to the density plot, a suitable number of clusters is 3. Visualizing this with interactive methods in 2 and 3 dimensions gives a better sense of the clusters.
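One way to get such an interactive view, sketched here with Plotly and a PCA projection of the placeholder matrix X, is:

import plotly.express as px
from sklearn.decomposition import PCA

coords3d = PCA(n_components=3).fit_transform(X)
fig = px.scatter_3d(
    x=coords3d[:, 0], y=coords3d[:, 1], z=coords3d[:, 2],
    color=db_labels.astype(str),   # DBSCAN labels from the previous sketch; -1 marks noise points
)
fig.show()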
Using the density plot, there are 3 clusters rather than 4, since in density-based clustering the extra "cluster" consists of the noise points that don't belong to any cluster. From the generated statistics, after clustering with density-based clustering, the clusters turn out to be of little use. Even though the data clustered, the clusters don't provide any new meaning or any help with predicting the associated labels of the data.
Conclusion
Using K-Means clustering with either 2 or 8 clusters seemed to give the best results in terms of attempting to label the data and comparing those labels to the known labels. With 2 clusters, the data in two dimensions is distinctly clustered. With 8 clusters, the statistics show that if a data point falls into cluster 3 or 7, it is almost certainly not going to have a label of "InLockdown". In fact, only when a vector falls into cluster 4 or 6 does the data have any real chance of being labeled "InLockdown". This means that when clustering the survey/lockdown data, predicting with some accuracy that new data will not be in lockdown is slightly easier than predicting that it will be in lockdown.