Search Data Decision Trees
To run decision tree analysis properly on the search data, a few cleanup and preprocessing steps must be taken. First, the data needs to be combined: one portion comes from articles mined automatically using the Google Search API, and another comes from manually searching Google for research articles on the effects of different disasters on food security and food supply. Those research articles were very difficult to process automatically, so choosing them by hand greatly improved the quality of the data. Once combined, the data can be visualized using a variety of methods, the simplest of which is a word cloud. Below is a set of word clouds, one per topic.
As seen in the word clouds, the larger a word appears, the more times it shows up in the articles and, hypothetically, the more important it is to them. Unlike the word clouds generated earlier in this project, it was very important here to remove words that were identical or close to the actual topic labels, so words like covid, drought, ebola, and locust were removed, since leaving them in would bias the algorithm. This matters because decision trees, unlike clustering, are a supervised machine learning algorithm: clustering does not care about the labels, but a supervised model would simply latch onto label-revealing words. A sketch of this preprocessing step follows.
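As a rough illustration of the combination and word cloud steps described above, here is a minimal Python sketch. The file names, the column layout (a text column and a topic label), and the exact list of topic words are assumptions for the example, not the project's actual files or settings.

```python
# A minimal sketch, assuming hypothetical CSV files with "text" and "topic" columns.
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Combine the API-mined articles with the manually gathered ones.
api_df = pd.read_csv("api_articles.csv")        # hypothetical file name
manual_df = pd.read_csv("manual_articles.csv")  # hypothetical file name
df = pd.concat([api_df, manual_df], ignore_index=True)

# Remove words identical or close to the topic labels so the word clouds
# (and later the classifier) are not biased by them.
topic_words = {"covid", "coronavirus", "drought", "ebola", "locust", "locusts"}
stopwords = STOPWORDS | topic_words

# One word cloud per topic.
for topic, group in df.groupby("topic"):
    text = " ".join(group["text"])
    cloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
    plt.figure()
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(topic)
plt.show()
```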
Decision Trees
Once the text was cleaned up and vectorized (in this case, using a count vectorizer), the next task is to create decision trees. Decision trees require a few user-defined parameters, known as hyperparameters, which need to be tuned to produce a tree that is accurate on the training data but does not overfit. Overfitting is a problem in machine learning where the model, in this case the decision tree, fits the training data too well: it learns the noise, incidental details, and errors, which hurts performance on unseen data. When given the test data set, an overfit model performs poorly, much like an underfit one. To demonstrate this, a decision tree was created to showcase the problem with overfitting; see the decision tree tabs to find the overfit tree.
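To make the vectorize-and-train step concrete, a minimal sketch might look like the following. It assumes the combined DataFrame df from the preprocessing sketch above; the 70/30 split, the random seed, and the depth of 4 are illustrative choices, not the project's actual settings.

```python
# A minimal sketch, assuming `df` (with "text" and "topic" columns) from above.
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Extend the standard stop words with the label-revealing topic words.
topic_words = {"covid", "coronavirus", "drought", "ebola", "locust", "locusts"}
stop_words = list(ENGLISH_STOP_WORDS | topic_words)

vectorizer = CountVectorizer(stop_words=stop_words)
X = vectorizer.fit_transform(df["text"])
y = df["topic"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Two key hyperparameters: the split criterion ("gini" or "entropy")
# and max_depth, which caps tree growth to guard against overfitting.
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=42)
tree.fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy:", tree.score(X_test, y_test))
```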
Below is a set of decision trees with different hyperparameters, covering both the Gini and entropy metrics and several max depths. In Python, there is also a splitter option, either best or random; in this case, random did not help, so those decision trees were left out. Please feel free to take a look at the code and play around with it (a sketch of the comparison loop appears below). For the overfit tree, a maximum depth of 100 was used, far more than the number of documents, providing perfect conditions for overfitting. Each decision tree is presented as a visualization of the tree combined with a confusion matrix, a matrix of predictions computed on the test data set. Both the test and training data sets are provided for each tree. Also provided for each decision tree is a graph of feature importance, another helpful visualization for understanding how the tree made its decisions.
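One way such a comparison grid might be produced, reusing the names from the sketch above, is shown here; the particular depths and figure sizes are illustrative assumptions, not the project's exact code.

```python
# A sketch of the hyperparameter grid, including the deliberately overfit
# tree (max_depth=100). Reuses X_train/X_test and `vectorizer` from above.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.tree import DecisionTreeClassifier, plot_tree

feature_names = vectorizer.get_feature_names_out()

for criterion in ("gini", "entropy"):
    for depth in (3, 4, 5, 100):  # 100 demonstrates overfitting
        model = DecisionTreeClassifier(
            criterion=criterion, max_depth=depth, random_state=42
        ).fit(X_train, y_train)

        # Tree visualization on the left, test-set confusion matrix on the right.
        fig, axes = plt.subplots(1, 2, figsize=(14, 5))
        plot_tree(model, feature_names=feature_names,
                  class_names=list(model.classes_), filled=True, ax=axes[0])
        ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, ax=axes[1])
        fig.suptitle(f"criterion={criterion}, max_depth={depth}")
plt.show()
```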
After exploring the data, the resulting visualizations, and the confusion matrices associated with each decision tree, the results are interesting. Under both metrics, Gini and entropy, the defining decisions are made by the words desert and swarm, which distinctly split the data into smaller chunks. This makes complete sense, since desert and swarm are words generally associated with these natural disasters, desert with drought and swarm with locusts. Using three different max depths also varied the types of trees produced; it was important to vary the depths to find the point where the type of natural disaster could be predicted without going too far and overfitting. Then, using the confusion matrices, a decent guess can be made as to which model/decision tree best represents the data. In many cases, it was clear that certain topics were harder to classify than others, which is consistent with the clustering technique used in a previous section. Covid has affected every aspect of life, and the research papers and Google searches showed that: unless a paper was written pre-covid, the covid topic, along with keywords such as pandemic, would show up. Still, certain decision trees predicted each topic with reasonable accuracy.
For the Gini metric, it was pretty clear that a max depth of 4 or 5 was the right choice. Judging by the visualizations and the confusion matrices, the trees did not differ much, so choosing the less complicated tree (i.e., a depth of 4) would be appropriate. It was also clear that the overfit decision tree would not work: it generated a massive tree whose confusion matrix was far less accurate at predicting topics than the models with a max depth of 4 or 5. That is exactly why overfitting is a concern.
For the entropy metric, it was pretty clear that a max depth of 3 or 4 was the right choice. A max depth of 3 produced slightly better results than a max depth of 4, indicating that 4 might slightly overfit the model, though it is still not a bad choice. A max depth of 3 is a very good result, since it was the smallest and simplest tree of the bunch and still produced the best results. This tree had little difficulty predicting covid but a slightly harder time predicting drought. Of course, with a larger dataset this model could be improved, but that is a given. Using the entropy metric, the top features were desert, pandemic, guinea (odd...), dry, and disaster. Besides guinea (which might be the region), the features/words make sense: desert and dry refer to drought and the desert locusts, and pandemic refers to covid.
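For reference, the top features quoted above can be read directly off a fitted tree. A small sketch, assuming the entropy tree with max depth 3 and the names from the snippets above:

```python
# Reading the most important features from the fitted entropy tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                               random_state=42).fit(X_train, y_train)
importances = model.feature_importances_
for idx in np.argsort(importances)[::-1][:5]:  # five most important words
    print(f"{feature_names[idx]}: {importances[idx]:.3f}")
```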
Conclusion
Rather than predicting future states of food supply or food security, these decision trees help us understand the relationships between the different disasters. Given the features it selects, a decision tree explains how it would classify a particular article. It is also interesting to see which disasters sit closer together, and whether there is a reason behind it. For example, in the decision trees created with the Gini metric, locusts and drought are closer together, separated only by whether the word outbreak showed up more often. Further up the tree, drought and locusts are classified differently from covid due to the word pandemic, which makes perfect sense. For the ebola label, however, the logic is far less clear. Used in this context, decision trees help us understand which words matter when classifying types of disasters at a very high level.