Sam Pastoriza

Covid-19 on Food Security



Search Data Naïve Bayes

To run Naïve Bayes properly on the search data, a couple of cleaning and preprocessing steps must be taken. First, the data needs to be combined. It comes from two sources: articles mined automatically using the Google Search API, and research articles found by manually searching Google for the effects of different disasters on food security and food supply. These articles were very difficult to process automatically, so choosing research articles by hand greatly improved the quality of the data. Once the data is combined, it can be visualized in a variety of ways, the simplest being a wordcloud. Below is a set of wordclouds, one per topic.
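As a rough illustration of the combining step, the sketch below merges two lists of labeled articles and counts words per topic, which is the quantity a wordcloud uses to size each word. The sample records and the names `api_articles`, `manual_articles`, `tokenize`, and `word_frequencies` are hypothetical; the project's actual code and data are not shown here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (topic, text) records standing in for the two data sources.
api_articles = [
    ("covid", "Lockdowns disrupted food supply chains and market access."),
    ("drought", "Drought reduced crop yields and raised food prices."),
]
manual_articles = [
    ("covid", "Household nutrition declined as food access fell."),
    ("locust", "Swarms destroyed crops, threatening regional food supply."),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(records):
    """Return {topic: Counter of word counts} over all records."""
    freqs = defaultdict(Counter)
    for topic, text in records:
        freqs[topic].update(tokenize(text))
    return dict(freqs)


# Combine both sources, then count words per topic.
combined = api_articles + manual_articles
frequencies = word_frequencies(combined)

for topic, counts in frequencies.items():
    print(topic, counts.most_common(3))
```

The per-topic counters could then be handed to whatever plotting tool draws the wordclouds.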


As seen in the wordclouds, the larger the word, the more often it appears in the articles for that topic and, hypothetically, the more important it is to that topic. Unlike the wordclouds generated earlier in this project, it was very important here to remove words that are the same as, or close to, the topic itself. Words like covid, drought, ebola, and locust are removed because they would bias the algorithm. This is not clustering, which ignores the labels. Like decision trees, Naïve Bayes is a supervised machine learning algorithm, so words that give away the label would let the model classify an article without learning anything else about it.
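A minimal sketch of this filtering step follows. The four base words come from the text above; the extra variants in `TOPIC_WORDS` and the function name `remove_topic_words` are assumptions made for illustration, not the project's actual list.

```python
import re

# Words that name the label directly. The plural and "coronavirus" variants
# are assumed here; the write-up only lists covid, drought, ebola, and locust.
TOPIC_WORDS = {
    "covid", "coronavirus",
    "drought", "droughts",
    "ebola",
    "locust", "locusts",
}


def remove_topic_words(text, blocked=TOPIC_WORDS):
    """Tokenize the text and drop any token that names a topic."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in blocked]


cleaned = remove_topic_words(
    "Covid-19 lockdowns cut food access during the drought."
)
print(cleaned)
```

Because the tokenizer keeps only alphabetic runs, "Covid-19" becomes the token "covid", which the filter then drops.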

Feature Importance

When running the Naïve Bayes algorithm on this data, one of the first visualizations created was the feature importance for the model. Before making any predictions, the trained model can show which features matter most for classifying the data. This is valuable because it reveals which words matter when identifying each disaster. It is similar to the feature importance from decision trees, but it offers another perspective, and if the Naïve Bayes model classifies more accurately, its important features are likely more trustworthy than those of the other models. Using the tabs, each of the feature importance graphs can be explored.


A visualization of feature importance for covid

After visualizing the feature importance, the important features make a lot of sense, which is encouraging. For the Covid classification, the important features all relate to Covid-19 and its effects, such as supply, access, and nutrition. These features are among the pillars of food security. For the drought model, relevant words like climate and health are present: climate is a cause of drought, and health is affected by it. This matters because it means the model is surfacing important features that are causes and effects of the disasters being classified.
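One common way to read feature importance out of a multinomial Naïve Bayes model is to compare the smoothed log-probability of each word under one class against the other classes. The sketch below does this with the standard library only; the training documents, the scoring rule in `top_features`, and all function names are assumptions for illustration rather than the method used in this project. Libraries such as scikit-learn expose comparable per-class word log-probabilities (for example `MultinomialNB.feature_log_prob_`).

```python
import math
from collections import Counter

# Hypothetical tokenized training documents, topic words already removed.
train = [
    ("covid", ["supply", "access", "nutrition", "lockdown", "access"]),
    ("covid", ["supply", "market", "access"]),
    ("drought", ["climate", "health", "crop", "climate"]),
    ("drought", ["climate", "water", "health"]),
]


def fit_word_log_probs(docs, alpha=1.0):
    """Estimate log P(word | class) with Laplace smoothing (alpha)."""
    counts = {}
    for label, tokens in docs:
        counts.setdefault(label, Counter()).update(tokens)
    vocab = sorted({word for c in counts.values() for word in c})
    log_probs = {}
    for label, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        log_probs[label] = {
            word: math.log((c[word] + alpha) / total) for word in vocab
        }
    return log_probs


def top_features(log_probs, label, k=2):
    """Rank words by log P(word | label) minus the mean over other classes."""
    others = [other for other in log_probs if other != label]

    def score(word):
        rest = sum(log_probs[o][word] for o in others) / len(others)
        return log_probs[label][word] - rest

    return sorted(log_probs[label], key=score, reverse=True)[:k]


log_probs = fit_word_log_probs(train)
for label in log_probs:
    print(label, top_features(log_probs, label))
```

Scoring by the difference between classes, rather than by raw probability, keeps words that are frequent in every topic (such as "food") from dominating each chart.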

Results

After inspecting the important features and confirming that the model looks reasonable from a training perspective, the quality of the model was tested on held-out test data. Using the test vectors, the model performed reasonably well. See the confusion matrix below.


Based on the confusion matrix above, the model had a bit of trouble with articles associated with Covid-19, which makes sense. Covid-19 is the prevalent topic, so a couple of test vectors were predicted as Covid-19 when in fact they were not. Quite possibly, the articles misclassified as Covid-19 were written very recently and discussed the effect of Covid-19 together with the effect of locusts on food security. That is one plausible way to look at the results. No model is perfect.
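For readers unfamiliar with how a confusion matrix is built, here is a small sketch using only the standard library. The label lists `y_true` and `y_pred` are made-up values chosen to mimic the pattern described above (a locust article predicted as Covid-19); they are not the project's results.

```python
from collections import Counter

LABELS = ["covid", "drought", "ebola", "locust"]

# Hypothetical true and predicted labels for a tiny test set.
y_true = ["covid", "drought", "locust", "ebola", "locust", "covid"]
y_pred = ["covid", "drought", "covid", "ebola", "locust", "covid"]


def confusion_matrix(true, pred, labels):
    """Build a matrix with true labels as rows and predictions as columns."""
    pairs = Counter(zip(true, pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


matrix = confusion_matrix(y_true, y_pred, LABELS)
for label, row in zip(LABELS, matrix):
    print(f"{label:>8}", row)

# Accuracy is the share of test items on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(LABELS))) / len(y_true)
print(f"accuracy = {accuracy:.2f}")
```

Off-diagonal counts in the covid column correspond to articles from other topics that were predicted as Covid-19.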

Conclusion

By using the Naïve Bayes model to classify text data from Google Search and other research papers, it was possible to identify important words associated with each disaster. The feature importance plot for each classifier picks out the most important words and leads to a couple of conclusions. First, in articles linking each disaster to food security, the most important features tend to be associated with the disaster itself, the pillars of food security (access, nutrition, supply), and the effect the disaster is having on those pillars. Having all of this in one chart is very convenient. Second, since these features are built into the model, the model was able to predict the type of disaster given a set of words, which can be useful.