Clustering
Clustering, or cluster analysis, is an unsupervised learning technique that groups a set of objects so that the objects in each group are similar to one another. Similarity can be measured in several ways, including the distance between data points and the density of points in a region. Clustering is exploratory: the aim is to find patterns in the data and, where possible, use them to classify it. Common types of clustering include K-means, hierarchical, density-based (including DBSCAN), and distribution-based clustering, which is built on models of statistical distributions.
In this research into the effects of Covid-19 on food security, several types of clustering were used, including K-means, hierarchical, and density-based clustering. Each technique was used to look for patterns in the data that could help classify it and predict where new data points would fall. For example, if the text data separated into four distinct clusters, each representing a different natural disaster and its effect on food security, that would suggest that each natural disaster affects food security in its own way. Exploring the clusters with a word cloud and some basic statistics then makes it possible to pull that information out of them.
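To make the distance-based idea concrete, here is a minimal K-means sketch in plain Python. It is an illustration only, not the code used in this research: the toy points are made up, and the farthest-first starting centroids are an assumption chosen so the result is reproducible (library implementations usually start from random or k-means++ centroids).

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def farthest_first(points, k):
    """Deterministic start (an assumption for this sketch): take the first
    point, then repeatedly add the point farthest from the chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        )
    return centroids


def k_means(points, k, iterations=100):
    """Return (centroids, labels) after alternating assignment and update steps."""
    centroids = farthest_first(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centroids.append(mean_point(members) if members else centroids[c])
        if new_centroids == centroids:  # nothing moved, so the loop has converged
            break
        centroids = new_centroids
    return centroids, labels


# Toy data (made up): two well-separated groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = k_means(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Hierarchical and density-based methods replace the centroid update with other rules (merging the closest groups, or growing clusters through dense neighborhoods), but they rely on the same kind of distance calculation.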
Search Data Clustering
This data comes from the Google Search API and consists of 40+ websites returned for queries about certain natural disasters and their effect on the food supply and food security. After the text was vectorized and the stop words were removed, a set of word clouds gave a first look at the data. Once the word clouds were created and the processed data was saved, the goal was to cluster the articles so that each cluster held the articles associated with one natural disaster.
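The vectorizing and stop-word steps described above can be sketched roughly as follows. This is a simplified illustration, not the pipeline used for the results: the stop-word list is a tiny made-up sample, the three documents are invented, and plain term counts stand in for whatever weighting the actual vectorizer applied.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "on", "in", "to", "is", "are", "that"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def vectorize(documents):
    """Turn each document into a term-count vector over a shared vocabulary."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Made-up snippets standing in for the scraped articles.
documents = [
    "Locusts swarm and eat the crops in the fields",
    "A locust swarm destroys crops",
    "Drought reduces water supply to farms",
]
vocabulary, vectors = vectorize(documents)

# The two locust snippets share words; the drought snippet shares none with the first.
print(round(cosine_similarity(vectors[0], vectors[1]), 3))  # 0.447
print(cosine_similarity(vectors[0], vectors[2]))  # 0.0
```

The same per-document word counts are what a word cloud visualizes, and the resulting vectors are the input a clustering algorithm would work on.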
To read more about how the data was clustered and understand exactly how the results below were achieved, please click the link below.
More on clustering search data
After clustering the data, it was clear that only certain natural disasters have a specific enough effect on food security to form their own cluster. This applies in particular to locusts, which swarm in the millions to eat crops, so their effect is immediate and well known. In contrast, the effects of droughts, wildfires, and Covid-19 are far more complicated and not immediate; their impact becomes known only over the longer term. Even so, the statistics generated suggest that a given article can be clustered and labeled, and that a guess can be made, with some probability, as to which kind of natural disaster it concerns.
Lockdown Data Clustering
This data comes from two separate sources that were combined before clustering: the household survey data collected on a semi-regular basis by the United States Census Bureau, and the lockdown data collected by an independent third party. The lockdown data records when lockdowns were in effect due to the Covid-19 pandemic on a state-by-state basis. The data was combined, the labels were removed, and the data was clustered. For a more detailed look at how the data was clustered and to understand exactly how the results below were achieved, please click the link below.
More on clustering lockdown data
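As a rough sketch of the combine-and-remove-labels step described above, the two sources could be joined and split like this. Everything specific here is hypothetical: the field names (state, week, food_scarcity, free_meals, in_lockdown), the values, and the choice to join on state and week are assumptions for illustration, not the actual schema of either data set.

```python
# Hypothetical records; field names and values are made up for illustration.
survey_rows = [
    {"state": "CO", "week": 1, "food_scarcity": 0.10, "free_meals": 0.05},
    {"state": "CO", "week": 2, "food_scarcity": 0.14, "free_meals": 0.08},
    {"state": "TX", "week": 1, "food_scarcity": 0.12, "free_meals": 0.06},
]
lockdown_rows = [
    {"state": "CO", "week": 1, "in_lockdown": False},
    {"state": "CO", "week": 2, "in_lockdown": True},
    {"state": "TX", "week": 1, "in_lockdown": False},
]


def combine(survey_rows, lockdown_rows):
    """Inner-join the two sources on (state, week), an assumed join key."""
    lockdown_by_key = {
        (r["state"], r["week"]): r["in_lockdown"] for r in lockdown_rows
    }
    combined = []
    for row in survey_rows:
        key = (row["state"], row["week"])
        if key in lockdown_by_key:  # skip survey rows with no lockdown record
            combined.append({**row, "in_lockdown": lockdown_by_key[key]})
    return combined


def split_features_and_labels(rows, feature_names, label_name):
    """Separate the numeric survey answers from the label that is held back."""
    features = [[row[name] for name in feature_names] for row in rows]
    labels = [row[label_name] for row in rows]
    return features, labels


combined = combine(survey_rows, lockdown_rows)
features, labels = split_features_and_labels(
    combined, ["food_scarcity", "free_meals"], "in_lockdown"
)
print(features)  # only the survey answers reach the clustering step
print(labels)    # kept aside to interpret the clusters afterwards
```

Holding the lockdown label back is what keeps the clustering unsupervised: the algorithm sees only the survey answers, and the label is used afterwards to check what each cluster contains.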
After clustering the data, it was clear that, with some accuracy, the survey answers could be associated with a state being in lockdown or not. Several distinct clusters consisted overwhelmingly of survey data from states not in lockdown, while only a couple of clusters contained a significant number of survey vectors from states in lockdown. The problem was that the clusters containing survey data from states in lockdown also contained a not insignificant amount of survey data from states not in lockdown. It could therefore be concluded that states reacted differently to lockdown and that their food security was affected in different ways. Predicting whether a state was in lockdown from how food-secure its residents felt would thus be difficult without more variables, and adding variables would undermine the attempt to isolate the effect of Covid-19 on food security.
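The comparison described above, checking how many survey vectors from each lockdown status landed in each cluster, can be sketched as a simple cross-tabulation. The cluster assignments and labels below are made up to show the shape of the calculation; they are not the study's results.

```python
from collections import Counter, defaultdict


def cross_tabulate(cluster_ids, labels):
    """Count how many vectors with each label fall in each cluster."""
    table = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        table[cluster][label] += 1
    return table


def majority_share(counter):
    """Return (most common label, its share of the cluster)."""
    label, count = counter.most_common(1)[0]
    return label, count / sum(counter.values())


# Made-up data: the cluster assigned to each survey vector and its held-back label.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1, 1]
labels = [
    "not in lockdown", "not in lockdown", "not in lockdown", "not in lockdown",
    "in lockdown", "in lockdown", "in lockdown", "not in lockdown", "not in lockdown",
]

table = cross_tabulate(cluster_ids, labels)
for cluster in sorted(table):
    label, share = majority_share(table[cluster])
    print(cluster, dict(table[cluster]), label, round(share, 2))
# 0 {'not in lockdown': 4} not in lockdown 1.0
# 1 {'in lockdown': 3, 'not in lockdown': 2} in lockdown 0.6
```

In this toy example cluster 0 is pure, while cluster 1 is mixed: a majority share of 0.6 means a guess of "in lockdown" for that cluster would be wrong for two of every five vectors, which is the kind of overlap that makes prediction unreliable.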