Survey Data Decision Trees
To create decision trees for the survey data, several cleaning and preprocessing steps were required. First, the census data was read into a dataframe. Since the goal was to assess and predict the impact of Covid-19 on food security, and the impact of food security on different breakdowns of Americans, the data was subsetted to only the surveys collected during Covid-19 lockdowns, on the assumption that surveys conducted during lockdown would show the greatest impact on people. After combining the census data with information on whether each survey was conducted during a lockdown, one more piece of information was added that could potentially produce more conclusive results: whether living in a particular state has an impact on different segments of society. Since the census data already included states, nothing needed to be cleaned up initially. However, after the initial modeling work, an attribute with 51 possible values (the states plus DC) caused the decision tree to fall apart. To keep some information about where the surveys took place in case it was important, the attribute was discretized: each state was grouped into one of 5 regions, reducing the 50 states to Northeast, South, North, Central, and West.
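The preprocessing above can be sketched roughly as follows. This is a pandas equivalent of the workflow, not the original code; the column names `state` and `during_lockdown`, and the (truncated) state-to-region mapping, are placeholders for whatever the actual census schema uses.

```python
import pandas as pd

# Hypothetical state-to-region mapping; only a few states are shown
# here as examples, the real mapping covers all 50 states plus DC.
REGION_MAP = {
    "NY": "Northeast", "MA": "Northeast",
    "TX": "South", "GA": "South",
    "ND": "North", "MN": "North",
    "KS": "Central", "MO": "Central",
    "CA": "West", "WA": "West",
}

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only surveys conducted during a Covid-19 lockdown, and
    discretize the 51 state/DC values into one of 5 regions."""
    df = df[df["during_lockdown"]].copy()
    df["region"] = df["state"].map(REGION_MAP)
    return df.drop(columns=["state"])
```

Discretizing states into regions keeps some geographic signal while shrinking the attribute from 51 levels, which the tree could not handle, down to 5.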
The data is shown below for those interested, and a download of the data is available as well.
Using this data, a model was created using decision trees. First, the data was split into two sets, one for training the model and one for testing it. The results of testing are shown as confusion matrices. Below are three sets of decision trees, where each set tries to predict one of three breakdowns of people: age, gender, and race. Each set contains 3 individual decision trees: one with a complexity parameter of 0.01, one with 0.025, and one deliberately overfitted, showing that overfit decision trees perform worse than tuned ones. Also included are confusion matrices and graphs of feature importance for each decision tree.
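The split-train-evaluate loop described above can be sketched as follows. The original work appears to use a complexity parameter in the style of R's rpart; scikit-learn does not expose that parameter directly, so this sketch substitutes its cost-complexity pruning parameter `ccp_alpha`, which plays the same role (0 lets the tree grow until it overfits). Feature and label contents are assumed, not taken from the actual survey.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

def fit_trees(X, y, alphas=(0.01, 0.025, 0.0), seed=0):
    """Train one tree per pruning strength and return each tree's
    confusion matrix on a held-out test set."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)
    results = {}
    for a in alphas:
        tree = DecisionTreeClassifier(ccp_alpha=a, random_state=seed)
        tree.fit(X_tr, y_tr)
        results[a] = confusion_matrix(y_te, tree.predict(X_te))
    return results
```

Holding out a test set before fitting is what makes the confusion matrices an honest comparison between the pruned and overfitted trees.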
Age
By creating a decision tree to predict an age range from survey results, some knowledge is gained about the effect of Covid-19 on food security and its effect on different age ranges. For example, these decision trees show that the most important feature is whether or not people get enough food, which distinguishes young people (18-24) from everyone else. Three decision trees were created to predict age ranges, differing in their complexity parameter: the default of 0.01, 0.025 for the second tree, and 0 for the third. A complexity parameter of 0 allows the decision tree to overfit completely, which is just as bad as an underfitted tree.
After modeling the data with a decision tree and tweaking its hyperparameters, the best model turned out to be the default complexity parameter of 0.01. It predicted the test set with pretty good accuracy (apart from the 25-39 and 40-54 age ranges). Using this information and by visualizing the tree, we can see the most important feature is whether people have enough food, followed by whether they have a variety of food to eat. It is also worth noting what the overfitted model shows: as seen in its visualization, the decision tree is massive, and its confusion matrix is worse than that of the model fit with a larger complexity parameter. The model fit with a complexity parameter of 0.025, meanwhile, is slightly underfitted. The confusion matrices confirm these results as well.
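The claim that "enough food" dominates the age split can be read directly off a fitted tree's impurity-based feature importances. A small sketch, where the feature names (`enough_food`, `lack_variety`, `region`) are stand-ins for the actual survey columns:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ranked_importances(tree, names):
    """Return (name, importance) pairs, most important first.
    Importances are impurity-based and sum to 1 for a fitted tree."""
    imp = tree.feature_importances_
    order = np.argsort(imp)[::-1]
    return [(names[i], float(imp[i])) for i in order]
```

Ranking importances this way is how the feature-importance graphs shown with each tree are produced.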
Gender
By creating a decision tree to predict gender from survey results, more knowledge is gained about the effect of Covid-19 on food security and its subsequent effect on people based on gender. In a similar fashion to above, modeling the survey results and attempting to predict the gender of the respondents gives a pretty decent result. The same three complexity parameters, 0.01, 0.025, and 0, were used to model the data. Once again, the default of 0.01 was the best complexity parameter and created the best decision tree. In this tree, the most important feature was Lack of Variety, which is interesting since it differs from the most important feature for age ranges. Unlike for age ranges, however, this feature did not immediately produce a pure node, and more splits were needed. From that information and the visualizations, even though the model achieved decent results, it was less clear which features affected gender. Interestingly, the overfit decision tree looks very similar to the default one, suggesting that perhaps the default tree was itself overfitted. The tree using a complexity parameter of 0.025, however, was significantly worse, suggesting that more tuning may be needed. Finally, the labeled region data does not seem to make a difference at all.
Race
By creating a decision tree that attempts to predict race from the food security survey results, some knowledge is again gained about exactly which communities are having a more difficult time with food security during Covid-19. These decision trees are much larger since there is a larger variety of classes to predict. Of immediate note is the fact that region makes a difference! This is much easier to see in the decision tree that uses a complexity parameter of 0.025. That tree is much simpler (due to the higher complexity parameter), but its confusion matrix is less accurate than the default's. Either the default parameter of 0.01 or the 0.025 complexity parameter could reasonably be used, as both give a decent prediction; the difference is that the default gives a better prediction for Hispanic/Latino. One conclusion is that the decision tree classified those who answered the survey as having enough food as white. Another interesting finding is that each important feature at the top of the tree is different, implying that each feature does a pretty good job of sorting out the different breakdowns. All else being equal, a smaller decision tree is preferable, so the complexity parameter of 0.025 is probably the right model here, since the default looks more like the overfitted model.
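The trade-off between the 0.01 and 0.025 trees comes down to per-class behavior, such as the better Hispanic/Latino predictions under the default. One way to make that explicit is to compute per-class recall from each confusion matrix; this is a general sketch, not taken from the original analysis, and assumes matrices whose rows are the true classes:

```python
import numpy as np

def per_class_recall(cm: np.ndarray) -> np.ndarray:
    """Diagonal over row sums: the fraction of each true class that
    the model recovered. Rows of cm are true classes, columns predicted."""
    return np.diag(cm) / cm.sum(axis=1)
```

Comparing these vectors for the two models shows exactly which classes a simpler tree gives up accuracy on.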
Conclusion
In conclusion, the decision trees created for each breakdown reveal some information about the effect of Covid-19 on age, gender, and race. Covid-19 has affected all three in different ways, and decision trees provide a visual way to understand the most important factors. For age, the most important factor is whether people have enough to eat, followed by whether they have enough variety. For gender, the most important factor was whether people had enough variety of food: those with less variety tended to be men, while those with more variety tended to be women. Finally, the most important factor for understanding the effect of food security on race turned out to be whether people had enough food (similar to age), but it is important to note that region also made a difference, unlike for the other breakdowns.