Clustering Search Data
To cluster text data properly, a few important preprocessing steps must be taken to clean the data: removing stop words, stripping non-alphabetic characters and punctuation, and lemmatizing (if needed). This process was completed in the Data Cleaning step, which contains a full walkthrough. Now, assuming the text data is cleaned, it can be visualized. In an ideal world, clustering this text data would produce 4 distinct categories: covid, drought, locusts, and ebola. Each cluster could then be tied back to one category, and that information could be used to gain more insight into the different effects each natural disaster has on food security. Before diving into clustering, here is the current state of the data, visualizations included. To explore each category further, click on the tab.
As can be seen above, each wordcloud contains a reasonably distinct vocabulary, so it is reasonable to see whether that vocabulary can be used to cluster the data. To cluster the data, the text must first be vectorized. This means turning the text into a matrix of values, where each row is a document and each column counts the number of times a word appears. There are two popular methods to vectorize the data: the CountVectorizer and the TF-IDF Vectorizer. The count vectorizer simply counts the occurrences of each word in each document, while the TF-IDF (Term Frequency - Inverse Document Frequency) vectorizer weights a word's frequency by how informative that word is across the corpus. Since TF-IDF accounts for importance as well as frequency, it would make sense that the TF-IDF vectorizer would probably give better results.
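As a rough illustration of the difference, here is a minimal scikit-learn sketch of the two vectorizers. The document list is a small toy stand-in, not the actual corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-in for the cleaned corpus from the Data Cleaning step.
docs = [
    "locust swarm destroys maize crops in east africa",
    "desert locust outbreak threatens harvest and food supply",
    "covid lockdown disrupts food supply chains and markets",
    "covid pandemic pushes food prices higher in urban areas",
    "drought reduces harvest and raises food prices",
    "prolonged drought leaves livestock without pasture",
    "ebola outbreak closes borders and slows food trade",
    "ebola response limits farm labor and market access",
]

# Raw term counts: each row is a document, each column a word count.
count_vec = CountVectorizer(stop_words="english")
count_matrix = count_vec.fit_transform(docs)

# TF-IDF: counts reweighted by how rare each word is across the corpus.
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf_vec.fit_transform(docs)

print(count_matrix.shape, tfidf_matrix.shape)
```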
For each vectorizer, the data was normalized and then clustered using K-Means, hierarchical, and density-based clustering. For each vectorizer, there are corresponding tabs that show the difference between the two. The data was also clustered without normalization, but those results are very poor/inconclusive, so only a couple of visualizations of the non-normalized data are included.
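The normalization step can be sketched as below, assuming each document vector is scaled to unit length (the exact scaling used in the project may differ):

```python
from sklearn.preprocessing import Normalizer

# Scale each row (document) to unit length so distance-based clustering
# reflects word proportions rather than document length.
# Note: TfidfVectorizer already applies L2 normalization by default.
normalizer = Normalizer(norm="l2")
count_norm = normalizer.fit_transform(count_matrix)
tfidf_norm = normalizer.fit_transform(tfidf_matrix)
```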
Clustering using K-Means
When first clustering any dataset, most people reach for K-Means. K-Means attempts to split the data into k clusters using Lloyd's algorithm, which calculates centroids, or centers of the data set, and groups the data points around those centers. The number of centroids is the k value in K-Means. The user/developer defines the value of k and tests whether that value makes sense once the data is clustered. To automate some of this process, the algorithm should be run over a range of k values. In the case of the text data, the ideal k value is 4. Also, as k approaches the number of data points in the set, the clusters become smaller and tighter, but that defeats the purpose of the classification. The range chosen was 2-9, which is somewhat arbitrary, but when clustering the data, the hope is to find a good fit for a small value of k.
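A minimal sketch of fitting K-Means for a single k on the normalized TF-IDF matrix from the earlier sketches:

```python
from sklearn.cluster import KMeans

# Fit K-Means (Lloyd's algorithm) with k = 4, the ideal number of
# categories, on the normalized TF-IDF matrix from the earlier sketch.
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(tfidf_norm)

print(labels)           # cluster assignment for each document
print(kmeans.inertia_)  # within-cluster sum of squared distances
```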
Determining the best value for K
Since the range is 2-9, the algorithm must be run for each one of those values. For each of the k trials, a numeric metric needs to be generated that accurately reflects the quality of the clustering. Once a metric is generated for each of the k trials, the metrics can be graphed against each of the k values for visualization purposes. Two popular methods are detailed below with accompanying visualizations showcasing the results of each method.
The Elbow Method
The elbow method is a simple way to visualize how well a k value fits. It uses one of two metrics, inertia or distortion: inertia is the sum of squared euclidean distances from each point in a cluster to its centroid, and distortion is the average of those squared distances. When looking at the plot of the elbow method, the point at which the greatest bend in the curve occurs is the ideal k value. In the case of the text data, the elbow method did not produce a significant bend in the curve, but it could be argued that a value of k = 5 should be checked since there is a slight bend there. This is confirmed using the silhouette method.
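The elbow computation is sketched below using inertia as the metric; the project swept k = 2-9, while the toy corpus above caps the range a bit lower:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

k_values = range(2, 8)  # the project used 2-9; the toy corpus caps this lower
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(tfidf_norm)
    inertias.append(km.inertia_)  # sum of squared distances to centroids

plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.show()
```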
The Silhouette Method
The Silhouette method is another visual method that can be used to determine the right k value. It uses a metric called the Silhouette score to measure the effectiveness of each clustering. When plotting the silhouette score against each k value, the objective is to find the point on the curve where the score drops significantly for the first time and choose that k value. In this case, the silhouette method produced a k value of 5 using the count vectorizer and 8 using the TF-IDF vectorizer, which are both interesting results considering the actual number of labels. This at least gives a definitive starting place when comparing the clusters that K-Means creates.
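The silhouette curve follows the same pattern as the elbow loop, scoring each k with the mean silhouette coefficient:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

k_values = range(2, 8)  # 2-9 in the project; capped lower by the toy corpus
scores = []
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(tfidf_norm)
    # Mean silhouette coefficient: values near +1 mean well-separated clusters.
    scores.append(silhouette_score(tfidf_norm, labels))

plt.plot(list(k_values), scores, marker="o")
plt.xlabel("k")
plt.ylabel("Silhouette score")
plt.title("Silhouette method")
plt.show()
```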
K-Means Clustering Visualizations
To visualize how K-Means is clustering the data, the dimensionality of the data must be reduced. One technique for this is Principal Component Analysis, or PCA. Using PCA, a graph of the data showing the clusters is given below. Since K-Means produces a label for each data point, the points can be color-coded by label to better visualize the clusters. Also, since the true labels are known, those being the types of natural disasters associated with each document/row of data, hovering over each point lets the user manually verify whether the points in a cluster share the same topic. If each point in a cluster has the same topic, then effectively, the right value of k was used and the data was clustered perfectly. Below is a visualization of several values of k, including the optimized value. Included in each tab is a link that downloads a CSV containing the statistics behind the clusters. This gives a better sense of how well the data were clustered for each chosen value of k.
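The PCA projection behind these plots can be sketched as follows, reducing the TF-IDF matrix to two components and coloring each point by its K-Means label:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Project the high-dimensional document vectors onto 2 principal components.
# PCA needs a dense array, so the sparse matrix is converted first.
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(tfidf_norm.toarray())

labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(tfidf_norm)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("K-Means clusters in PCA space")
plt.show()
```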
From the visualizations and statistics, the best value of k is probably 8 using the TF-IDF vectorizer. As can be seen in the data, a document that falls into a particular cluster can often be classified with some certainty, especially documents describing the effects of Covid-19 or locusts on food security. For the other two types of natural disasters, it is less clear.
A table of the statistics is below. It is pretty clear that a document describing the effects of locusts or covid on food security is going to fall into Cluster 0 or Cluster 4, respectively. In Cluster 7, the documents are evenly split between ebola and drought. This might mean that ebola and drought have similar effects, or are happening in similar timeframes or under similar circumstances.
Clustering using Hierarchical Clustering
Another type of clustering is hierarchical clustering, which builds hierarchies of clusters either bottom-up (agglomerative) or top-down (divisive). In hierarchical clustering, nodes are merged or split using various techniques that generally involve some distance calculation between sets of data points. Hierarchical clustering is another nice way of judging the right number of clusters: depending on the distance metric, a different set of clusters is created, and the user can judge from a dendrogram whether those clusters are distinct enough to cluster the data with.
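A dendrogram can be built with SciPy as in the sketch below; Ward linkage over euclidean distances is assumed here and may not match the exact settings used in the project:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Agglomerative (bottom-up) hierarchy: Ward linkage over euclidean distances.
Z = linkage(tfidf_norm.toarray(), method="ward", metric="euclidean")

plt.figure(figsize=(10, 4))
dendrogram(Z)
plt.title("Hierarchical clustering dendrogram (TF-IDF)")
plt.ylabel("Distance")
plt.show()
```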
When building the dendrograms, only the default euclidean distance was used. However, when clustering the lockdown data, multiple distance metrics were used and compared.
First, the data is clustered and visualized using dendrograms. Below are the two dendrograms, one per vectorizer.
From the dendrograms, the clusters are not completely clear, but there are at least 3 top-level clusters, and perhaps up to 8-10 large clusters. Since it isn't completely obvious what the clusters should be, a sampling of k values was chosen, and the data was clustered and visualized using agglomerative clustering.
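Cutting the hierarchy at a sampling of cluster counts can be sketched as follows; the specific values of k below are illustrative:

```python
from sklearn.cluster import AgglomerativeClustering

# Cut the hierarchy at several candidate cluster counts and keep the labels.
for k in (3, 4, 8):  # illustrative sampling of k values
    agg = AgglomerativeClustering(n_clusters=k, linkage="ward")
    labels = agg.fit_predict(tfidf_norm.toarray())
    print(k, labels)
```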
As can be seen after looking through the visualizations and the statistics, the best k value is probably somewhere around 8 when using hierarchical clustering. According to the statistics, covid is pretty clearly part of cluster 1 and some part of cluster 0, locusts are part of cluster 2, and cluster 0 has a larger presence of ebola effects than the other clusters. If a document falls into cluster 3, there is a high probability that it is talking about drought. From these results, hierarchical clustering seems to separate the data more distinctly than plain K-Means.
Clustering using Density Clustering
Density-based clustering is a very useful algorithm for clustering data that is densely packed together but perhaps unable to be clustered using traditional techniques. The classic image of density clustering is similar to the one below, where there are two distinct clusters that traditional clustering techniques cannot distinguish. Typically, a 2D visualization of a traditional clustering allows the user to draw separate, non-overlapping shapes around individual clusters, but that can't happen here.
Considering the previous visualizations of the points, it isn't immediately clear whether density clustering will have much of an effect on the formation of clusters. The results of attempting to cluster the data using density-based clustering are below. From the results, it is pretty unclear whether density clustering is helpful in any way.
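The density-based clustering can be sketched with DBSCAN (assumed here as the density-based algorithm; the eps and min_samples values below are illustrative, not tuned):

```python
from sklearn.cluster import DBSCAN

# DBSCAN groups points that sit in dense regions; points in sparse regions
# are labeled -1 ("noise") rather than being forced into a cluster.
db = DBSCAN(eps=0.9, min_samples=2, metric="euclidean")
labels = db.fit_predict(tfidf_norm)

print(labels)  # -1 marks documents the algorithm left unclustered
```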
As can be seen in the plots, the data forms at most two clusters with the count vectorizer and only one cluster with the TF-IDF vectorizer. Shown below are the statistics for the count vectorizer, which show the data is fairly evenly spread across the two clusters. There are only two clusters in this data since the "third" cluster is the data the density clustering algorithm did not assign to any cluster.
Conclusion
Considering the different types of clustering and the results obtained with each technique, hierarchical clustering creates the most distinct clusters and could be used for prediction, especially for documents about locusts or covid, as those form pretty distinct clusters. Even though the clusters are not perfectly defined given the initial labels, a document about one of the chosen natural disasters and its effect on food security can be placed into a cluster and, with reasonable probability, classified. When a document can be classified this way, it means there is a distinctive vocabulary associated with it, and since the documents are all related to the effects of certain disasters on food security, it can be reasoned that these clusters are built from the vocabulary describing those effects. Since the clusters are reasonably different, the effects appear to be different as well.