Covid-19 on Food Security

ARM and Networking


Association rule mining is a rule-based learning method for discovering interesting relationships between variables in a data set [Wikipedia]. For example, imagine a data set where each row is the set of items that a shopper purchased at a grocery store. Looking across the rows, one can see relationships between items: every time a basket contains vanilla extract, it probably also contains sugar (for baking). If the store finds that relationship, it might place vanilla extract and sugar together so that shoppers can conveniently buy both. Other relationships in the data may not be as clear cut; association rule mining makes them easier to visualize, so that decisions and conclusions can be drawn from the data.

Association rule mining

In this research, several sets of text data were gathered, both from the Google Search API and manually from a variety of research areas. See the Data Cleaning portion of the project for more information on the Academic Corpus data. Treating each article as a "transaction", or basket of items (words), it would be interesting to see how the data is related and whether the relationships between the disasters (covid, ebola, drought, and locusts) and their effects on food security can be isolated, or at least better visualized. To isolate the effect of each disaster on food security, several filters were added to focus the relationships on certain keywords. These keywords came from the four pillars of food security (availability, access, stability, and utilization), plus the words food and security.

Pillars of food security

When the data was first gathered, there were only a few transactions, since each transaction was an article, and each transaction had a lot of items. After removing duplicate items from the transactions and removing words that didn't make sense, the following transaction data was produced. Below is a preview of the data, but it can also be downloaded in its raw form.

Shown above is the transaction data, where each row is a document containing a column of text plus some basic metadata about the transaction. The downloadable raw data is the same data stripped of all columns except the text; the metadata is shown here because it makes the transactions easier to interpret.
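As a rough illustration of how an article might be turned into a transaction, the Python sketch below lowercases the text, splits it into words, and removes duplicates. It is not the project's own code (the project reads its transactions with the arules library); the function names, the sample sentences, and the idea of filtering transactions by keyword are all assumptions made for illustration.

```python
import re

# The four pillars of food security plus "food" and "security" (from the text).
PILLAR_KEYWORDS = {"availability", "access", "stability", "utilization",
                   "food", "security"}

def to_transaction(text):
    """Lowercase the text, keep alphabetic tokens, and drop duplicate words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def mentions_keyword(transaction, keywords=PILLAR_KEYWORDS):
    """Hypothetical filter: keep a transaction only if it contains a keyword."""
    return bool(transaction & keywords)

# Made-up example sentences, not taken from the project's corpus.
articles = [
    "Drought reduces food availability and threatens food security.",
    "Locust swarms emerge across pasture land.",
]
transactions = [to_transaction(a) for a in articles]
focused = [t for t in transactions if mentions_keyword(t)]
```

Representing each transaction as a set makes the "remove duplicate items" step automatic, since a word can appear in a set only once.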

After reading the transaction data using the arules library, a set of relationships (rules) is generated. Each relationship is judged using three metrics: support, confidence, and lift. Support measures how often a set of items appears across all transactions. In the grocery example, if vanilla extract and sugar were bought together in 200 out of 1000 carts, the support would be 200 / 1000 = 0.2. Confidence measures how often an item appears given that another item is present. In mathematical terms, it is the conditional probability that an item occurs in a transaction given that another item occurs in it. So if vanilla extract was purchased 300 times in total, the confidence of the rule vanilla extract => sugar would be 200 / 300 = 0.6667. Finally, lift measures how much more often the items appear together than would be expected if they were independent. So if sugar was bought 400 times, vanilla extract was bought 300 times, and together they were bought 200 times, the lift would be (200 / 1000) / ([400 / 1000] * [300 / 1000]) = 1.6667. Math aside, these metrics help decide which generated relationships are strong enough to pay attention to. Looking at 500,000 relationships and trying to make a decision is much more difficult than looking at 20.
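The grocery arithmetic above can be checked with a few lines of Python. This is only a sketch of the three formulas using the counts from the example; the function and variable names are chosen here for illustration and are not part of the arules library.

```python
def support(count_both, total):
    """Fraction of all transactions that contain both items."""
    return count_both / total

def confidence(count_both, count_antecedent):
    """Conditional probability of the consequent given the antecedent."""
    return count_both / count_antecedent

def lift(count_both, count_a, count_b, total):
    """Joint support divided by the product of the individual supports."""
    return (count_both / total) / ((count_a / total) * (count_b / total))

# Counts from the grocery example: 1000 carts, vanilla in 300, sugar in 400,
# both together in 200.
total, vanilla, sugar, both = 1000, 300, 400, 200

print(round(support(both, total), 4))               # 0.2
print(round(confidence(both, vanilla), 4))          # 0.6667
print(round(lift(both, vanilla, sugar, total), 4))  # 1.6667
```

A lift of about 1.67 says vanilla extract and sugar land in the same cart roughly 1.67 times more often than they would if the two purchases were unrelated.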

When generating relationships from these transactions, the first step was to prune the list down to the relationships that are strong enough to keep (as previously mentioned). Using the apriori algorithm plus the three metrics above, the list of relationships was pruned down significantly. Even so, the subjects of the relationships needed to be narrowed down to only a couple of words, since large articles include information that doesn't pertain to this analysis. Using a subset of the data, a set of relationships was generated. Only the top 15 rules sorted by support, confidence, and lift are shown below for brevity. Downloads of the top 15 rules are also available.
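To make the pruning step more concrete, here is a minimal pure-Python sketch of the apriori idea: grow itemsets one level at a time, and only keep candidates whose smaller subsets already met the support threshold. The project itself uses the arules implementation; the code below is an illustrative stand-in, the toy transactions are invented, and rules are limited to single-word consequents to keep it short.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        found = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                found[c] = s
        return found

    level = frequent({frozenset([i]) for t in transactions for i in t})
    all_frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        level = frequent(candidates)
        all_frequent.update(level)
        k += 1
    return all_frequent

def rules(frequent_sets, min_confidence):
    """Build (antecedent, consequent, support, confidence, lift) tuples."""
    out = []
    for itemset, supp in frequent_sets.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            consequent = frozenset([item])
            antecedent = itemset - consequent
            conf = supp / frequent_sets[antecedent]
            if conf >= min_confidence:
                out.append((antecedent, consequent, supp, conf,
                            conf / frequent_sets[consequent]))
    return out

# Invented toy transactions, not the project's data.
toy = [
    {"drought", "nutrition", "stability"},
    {"drought", "nutrition"},
    {"locust", "swarm"},
    {"nutrition", "stability"},
]
freq = apriori(toy, min_support=0.5)
strong = rules(freq, min_confidence=0.9)
```

With these toy transactions, locust and swarm fall below the support threshold and never reach the rule stage, which is one way a rare topic can drop out of the results.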

The top rules sorted by support are the relationships whose terms most often show up together in the same transactions. So it is no surprise that the top rules mostly involve words that are synonyms or antonyms, such as nutrition and malnutrition. Unfortunately, the words that describe the disasters do not show up, so support might not be the right metric to use in this case.

The top rules sorted by confidence, a metric measuring how often an item shows up in a transaction given that another item is in the transaction, appear to show a better distribution of words, but this could be deceiving. All of the confidence values are 1, which means that every time one word shows up in the text, the other word shows up with it. So when locust shows up in the text, the word emerges also shows up. This is helpful in some respects when clustering data, as words associated with locusts would probably cluster well, but a confidence of 1 does not by itself imply a clear relationship, since confidence ignores how common the second word is across the whole data set.

The top rules sorted by lift, a metric that measures how much more often two items appear together than would be expected if they were independent, have a very high lift (20.25). A lift value higher than 1 means that the antecedent and consequent of the rule are positively correlated, and a lift value of 20 means they appear together about 20 times more often than independence would predict. However, a lift value of 20 is somewhat suspicious, so another route should be taken.
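One way to see why a lift of 20.25 alongside a confidence of 1 deserves suspicion: lift equals confidence divided by the support of the consequent, so with confidence 1 the lift is simply 1 / support(consequent). The sketch below works backwards from the reported lift. It assumes the value 20.25 is exact rather than rounded, and it says nothing certain about the actual size of the corpus.

```python
from fractions import Fraction

def implied_consequent_support(lift, confidence=1):
    """Since lift = confidence / support(consequent), solve for the support."""
    return Fraction(confidence) / Fraction(lift)

# 20.25 is the lift reported in the text; Fraction("20.25") equals 81/4.
s = implied_consequent_support(Fraction("20.25"))
print(s)         # 4/81
print(float(s))  # roughly 0.049
```

Under that assumption the consequent word appears in only about 5% of the transactions. A rule built from such a rare word can have perfect confidence and a very high lift while resting on a handful of documents.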

One way to visualize the rules given their support, lift, and confidence metrics is using a simple graph that plots the rules in 2 dimensions but includes an extra visual indicator for the remaining metric.

In the rules generated above, there are large clusters of rules with a confidence of 1 and a lift near 20. If lift (or support and lift) is used to select rules, the results in the network plots might not be helpful. And in fact, they are not. After iterating through different values of lift, support, and confidence, the relationships remained extremely difficult to decipher. The problem was that a single network needed to include nodes for all four disasters (covid, ebola, drought, and locusts) as well as for their effect on food security, meaning words like food, security, nutrition, utilization, and stability. If interested, the old graph of networks still exists and is linked here. A new approach was needed.

The new approach involved separating the disasters and independently analyzing the relationship each one has with food security. This approach greatly reduced the number of rules being produced while keeping the relationships intact, which made it possible to analyze the effect of each disaster on food security. For some disasters, like ebola, the thresholds on each metric needed to be tweaked to produce networks with a reasonable number of rules. For ebola, the lift values were unusually high, probably because the academic papers had far more mentions of ebola than random Google articles. For the other disasters, the metrics were largely similar. If interested, the code is linked to every visualization, so feel free to explore yourself.
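The per-disaster separation could look something like the Python sketch below, which keeps only the rules that do not mention any other disaster and applies a separate lift threshold for each one. This is one possible reading of the approach, not the project's linked code: the threshold values, the rule tuples, and the function name are all hypothetical.

```python
DISASTERS = {"covid", "ebola", "drought", "locust"}

# Hypothetical thresholds; the text only says ebola needed different tuning.
MIN_LIFT = {"covid": 1.5, "ebola": 3.0, "drought": 1.5, "locust": 1.5}

def rules_for(disaster, rules):
    """Keep rules that avoid the other disasters and pass this one's threshold."""
    others = DISASTERS - {disaster}
    kept = []
    for lhs, rhs, lift in rules:
        words = set(lhs) | set(rhs)
        if words & others:
            continue  # rule mixes in another disaster
        if lift >= MIN_LIFT[disaster]:
            kept.append((lhs, rhs, lift))
    return kept

# Invented rules in (antecedent, consequent, lift) form.
toy_rules = [
    (("covid",), ("nutrition",), 2.0),
    (("ebola",), ("food",), 2.0),
    (("ebola",), ("security",), 4.0),
    (("covid", "ebola"), ("food",), 5.0),
]
covid_rules = rules_for("covid", toy_rules)
ebola_rules = rules_for("ebola", toy_rules)
```

Keeping the thresholds in a dictionary keyed by disaster makes it easy to tune one network without affecting the others.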

IGraph Network

A basic network of covid

Using the igraph visualization, a very basic version of the network is generated from the association rules. In the covid network as well as the drought network, nutrition appears to have a stronger relationship with the disaster than food and security do, and nutrition is closely connected to stability and utilization, perhaps implying that covid and drought affect the stability and utilization pillars of food security more than availability and access. The ebola network, interestingly, is strongly connected with the words food and security, and those tend to be more connected with words associated with the availability of food. In the locust network, however, the locust node is not connected to any of the nodes concerning nutrition. This makes some sense when examining the nodes it is connected to, which are words strongly associated with the behavior and nature of locusts rather than the effect of locusts on food security. This is the case even though locusts should have a massive effect on agriculture, which should make a difference in the availability of food.

Vis Network

The vis network visualization allows for interactive exploration in one of two ways. Zooming into the network, one can interact with each node and explore the relationships that way. Since there are a large number of relationships, sometimes it is easier to use the dropdown to isolate certain words and explore the relationships that way. From this graph, exploring the relationships between nodes by examining the rules and associated metrics is relatively simple. Here, it helps to focus on the rules that are darker in color (higher lift), since those connect nodes that are more strongly correlated. In the ebola example, it is easier to see that more nodes are connected by rules with high lift values than in the other disasters, which is why the lift threshold mattered more when creating the ebola network. Also, it is much easier to view the relationships between the nodes by interacting with the graph. However, that interactive exploration gets even easier when viewing the network using network d3.

Network D3 Visualization

Rather than using tabs to switch between the networks, it is easier to view them individually and draw conclusions about each one. Below are the final conclusions about each network, followed by a conclusion concerning the networks as a whole.

Covid Network

Using network d3, a clearer picture of the network of relationships between covid and the pillars of food security can be seen and interacted with. This network contains a couple of interesting relationships that also make perfect sense. Covid is related to the word nutrition, which, through its derivatives, leads to instability via the word acute and to utilization via the word poor. Both words carry negative connotations, which implies that covid has a negative effect on two of the pillars of food security. This is an encouraging sign that the rule mining worked. Further exploring the network reveals a number of other important relationships and effects, including stunting, which is a common metric used to measure food security.

Ebola Network

Using network d3, a clearer picture of the network of relationships between ebola and the pillars of food security can be seen and interacted with. This network looks distinctly different from the other networks. Specifically, the network generated by the keyword ebola shows a very strong connection between ebola and the words food and security. On close examination, each of the orange nodes is connected to food, security, and ebola. Some interesting connections include death, magnitude, and negative, implying a negative association among the three. This also makes sense, but the relationship with each individual pillar of food security isn't as clear.

Drought Network

Using network d3, a clearer picture of the network of relationships between drought and the pillars of food security can be seen and interacted with. This network is very similar to the network associated with the keyword covid. Again there is a solid connection among the derivatives of nutrition, and a single connection between drought and that cluster through the keyword clean. Perhaps this implies that an effect of drought is a lack of clean food or water, which is important when it comes to nutrition.

Locust Network

Using network d3, a clearer picture of the network of relationships between locusts and the pillars of food security can be seen and interacted with. This unique network shows a very distinct cluster of words associated with locusts and no defined connection with the cluster concerned with food security. The words associated with locusts make perfect sense, including swarm, aerial, pasture, and pest. But even though these words are negative in connotation and should have a connection to agriculture, which connects to food, the support, confidence, or lift wasn't high enough to make the connection between the two.
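The observation that the locust cluster has no path to the food security cluster can also be checked programmatically rather than by eye. The sketch below treats each rule as an undirected edge and runs a breadth-first search from the disaster keyword. The edge list is invented to mimic the shape described above; it is not the project's rule set, and the function names are illustrative.

```python
from collections import defaultdict, deque

def build_graph(rules):
    """Turn (antecedent, consequent) rules into an undirected adjacency map."""
    graph = defaultdict(set)
    for lhs, rhs in rules:
        for a in lhs:
            for b in rhs:
                graph[a].add(b)
                graph[b].add(a)
    return graph

def reachable(graph, start):
    """Breadth-first search: every node connected to start, including start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

# Invented rules shaped like the locust network: two separate clusters.
toy_rules = [
    ({"locust"}, {"swarm"}),
    ({"swarm"}, {"pest"}),
    ({"nutrition"}, {"food"}),
    ({"food"}, {"security"}),
]
graph = build_graph(toy_rules)
food_terms = {"food", "security", "nutrition"}
linked = reachable(graph, "locust") & food_terms
```

An empty `linked` set means no chain of rules connects the disaster keyword to any food security term at the chosen thresholds, which matches what the locust visualization shows.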

Conclusion

When using association rule mining, the hope when viewing the results is to gain some insight into relationships in the data. In this case, when mining the text data, the hope was to find and confirm the effect of different natural disasters on the pillars of food security. Two outcomes were possible: either confirmation of the previous understanding of the effects of each disaster on food security, or the discovery of new relationships. Also, when exploring the network generated for each disaster, it is important to see the words associated with the core keywords that describe each disaster and each pillar. Those words help narrow the focus of this research.

The results of the association rule mining of text data confirmed the negative effect of covid and drought on two of the pillars of food security, namely stability and utilization. Both covid and drought had an overall adverse effect on each, based on the connotation of the words connecting and surrounding the relationships. Ebola, even though it was a widespread disease, was an epidemic rather than a pandemic, and it produced a different-looking network, connecting mostly to the availability of food and directly to the words food and security. Finally, the network of relationships generated by the word locust is distinct from the other networks, as the cluster surrounding the keyword locust is completely separate from the pillars of food security. Perhaps this means that the metrics used to calculate the relationships filtered out the connections or, more likely, that the transactions containing locusts were very targeted and rarely made the connection between the keywords. Either way, these networks provide some sense of the relationships between the disasters and the pillars of food security and help answer one of the questions for the research. Different disasters affect different indicators, and association rule mining can support that answer with data.