Sam Pastoriza

Covid-19 on Food Security

Random Forest

Random forests are another supervised learning method, used for both classification and regression. A random forest is built from a large number of decision trees, and its prediction combines the predictions of the individual trees: a majority vote for classification, an average for regression. Random forests are generally a good alternative to a single decision tree, because a single tree tends to overfit the training data. Random forests also require very little configuration, which adds to their appeal. Once the data is split into training and testing sets, the random forest is trained on the training set and evaluated by making predictions on the testing set. The test results are summarized with confusion matrices in the sections that follow.
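To make the "many trees, one combined prediction" idea concrete, here is a minimal sketch in plain Python. It is not the code behind this analysis. It assumes a toy two-feature dataset, and it uses one-split trees (stumps) on a single randomly chosen feature as a stand-in for the much deeper trees a real random forest grows. The names `fit_stump`, `fit_forest`, and `predict_forest` are hypothetical helpers written for this illustration.

```python
import random
from collections import Counter

def fit_stump(rows, labels, rng):
    """Fit a one-split tree on one randomly chosen feature (a deliberately weak learner)."""
    feature = rng.randrange(len(rows[0]))
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        if not left or not right:
            continue  # a split must put at least one sample on each side
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        # The feature is constant in this sample: always predict the majority class.
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    _, threshold, left_label, right_label = best
    return (feature, threshold, left_label, right_label)

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on its own bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], rng))
    return forest

def predict_forest(forest, row):
    """Classification: every tree votes and the most common label wins."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): class "a" has small values, class "b" large ones.
rows = [(0, 1), (1, 0), (2, 3), (3, 2), (4, 4), (6, 7), (7, 6), (8, 9), (9, 8), (10, 10)]
labels = ["a"] * 5 + ["b"] * 5

forest = fit_forest(rows, labels)
print(predict_forest(forest, (0, 0)), predict_forest(forest, (10, 10)))
```

Each stump sees a different bootstrap sample and a different feature, so individual stumps disagree at the margins. The vote smooths those disagreements out, which is the mechanism that makes a forest less prone to overfitting than a single tree.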

In this case, a random forest was applied to the survey data to try to get better predictions than the decision trees gave. The same breakdowns of respondents were used: Age, Gender, and Race. The results for each breakdown are discussed below.

The error rate for random forest broken down by age

When exploring the random forests generated for each breakdown of the data, it is useful to look first at the error rate plot for each. The error rate plot shows the out-of-bag (OOB) error rate on the training set as the number of decision trees increases. A sensible number of trees is the point where the error rate stabilizes. In this case, the error rate seems to stabilize at around 400 trees for Age, Gender, and Race, but the error rates for Age and Race are still quite high even after they stabilize. That error rate generally carries over to the testing phase, and those results are shown in the confusion matrices.
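The out-of-bag idea can be sketched in a few lines of plain Python. Every tree is trained on a bootstrap sample, so some rows are left out of that sample, and those rows can be used to score the tree without a separate test set. The sketch below is an illustration under simplifying assumptions, not the code used for the plots: the data is a made-up one-dimensional toy set with two deliberately mislabeled points, the trees are single-threshold stumps, and `fit_stump` and `oob_error_curve` are hypothetical helper names.

```python
import random
from collections import Counter, defaultdict

def fit_stump(xs, ys):
    """Best single-threshold split on 1-D data; returns (threshold, left_label, right_label)."""
    majority = Counter(ys).most_common(1)[0][0]
    # Fallback: no split at all, always predict the majority class.
    best = (sum(y != majority for y in ys), float("inf"), majority, majority)
    for t in sorted(set(xs))[:-1]:  # skip the maximum so the right side is never empty
        left = Counter(y for x, y in zip(xs, ys) if x <= t).most_common(1)[0][0]
        right = Counter(y for x, y in zip(xs, ys) if x > t).most_common(1)[0][0]
        errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
        if errors < best[0]:
            best = (errors, t, left, right)
    return best[1:]

def oob_error_curve(xs, ys, n_trees=50, seed=0):
    """OOB error after 1, 2, ..., n_trees trees have been added to the forest."""
    rng = random.Random(seed)
    n = len(xs)
    votes = defaultdict(Counter)  # sample index -> votes from trees that never saw it
    curve = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        t, left, right = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:  # only out-of-bag rows get a vote from this tree
                votes[i][left if xs[i] <= t else right] += 1
        scored = [i for i in range(n) if votes[i]]
        wrong = sum(votes[i].most_common(1)[0][0] != ys[i] for i in scored)
        curve.append(wrong / len(scored) if scored else None)
    return curve

# Toy data (made up): "low" below 10, "high" otherwise, with two noisy labels.
xs = list(range(20))
ys = ["low" if x < 10 else "high" for x in xs]
ys[3], ys[15] = "high", "low"

curve = oob_error_curve(xs, ys)
for n_trees in (1, 5, 10, 25, 50):
    print(n_trees, curve[n_trees - 1])
```

Plotting `curve` against the tree count gives the same kind of picture as the error rate plots: noisy for the first few trees, then flattening once additional trees stop changing the votes.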

Starting with Age, the random forest has trouble distinguishing between the 25-39 age range and the 40-54 age range. This is confirmed by the confusion matrix generated after the testing phase. One might conclude that those age ranges are affected similarly, while the other age brackets are affected differently. In the feature importance plot, it is not immediately clear who was affected by each decision, but the features that matter most to the random forest are easy to read off. Having enough food was the most important feature, which makes sense considering our analysis of the same topic using decision trees.
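Reading a confusion matrix for the pair of classes a model mixes up most often can be automated. The sketch below is illustrative only: the `actual` and `predicted` lists are made up (they are not the survey results), "other" is a placeholder for the remaining brackets, and `confusion_matrix` and `most_confused_pair` are hypothetical helper names.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def most_confused_pair(matrix):
    """Return the unordered pair of classes with the most mutual misclassifications."""
    pair_errors = Counter()
    for (a, p), count in matrix.items():
        if a != p:  # off-diagonal cells are the mistakes
            pair_errors[frozenset((a, p))] += count
    if not pair_errors:
        return None, 0
    pair, count = pair_errors.most_common(1)[0]
    return tuple(sorted(pair)), count

# Made-up labels for illustration only.
actual    = ["25-39", "25-39", "25-39", "40-54", "40-54", "other", "other"]
predicted = ["25-39", "40-54", "40-54", "25-39", "40-54", "other", "other"]

matrix = confusion_matrix(actual, predicted)
pair, count = most_confused_pair(matrix)
print(pair, count)
```

Errors in both directions are summed, because "25-39 predicted as 40-54" and "40-54 predicted as 25-39" both point to the same conclusion: the model cannot separate the two groups well.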

The error rate for random forest broken down by gender

Next, for the random forest focused on Gender, the model had an easier time predicting the effect on food security for each gender. The first indication of success comes from the error plot. As the number of trees increases, the error rate drops to a reasonable level at around 400 trees. The confusion matrix then gives a fairly good indication that there is a determining set of features that separates the genders. According to the feature importance plot, the lack of variety of food seems to be one of the more important features. To see what the deciding factors were, a plot of one of the 400+ trees should be used to get a better sense of the reasoning.

The error rate for random forest broken down by race

Finally, for the random forest focused on Race, the model again had some trouble distinguishing between certain races, implying that certain races have similar troubles when it comes to food security. Interestingly, the model had very little trouble distinguishing the "white alone" label from the rest of the races. In the error rate plot, which shows the random forest converging on a set of error rates, the rates are high compared to the rates for Gender. In terms of feature importance, the Enough feature provides the best split of the data, as it did for Age. In the case of Race, however, Region comes back as an important indicator. This is similar to the conclusions reached from running the decision tree algorithm.
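One way to see what a feature importance score measures is permutation importance: shuffle one feature's column and record how much the model's accuracy drops. The sketch below illustrates that idea only. The importance plots discussed above may well use a different measure (for example one based on impurity decrease), and the model, data, and helper names here are all hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy setup (made up): feature 0 decides the label, feature 1 is irrelevant.
rows = [(1, 5), (1, 3), (1, 8), (1, 1), (0, 5), (0, 3), (0, 8), (0, 1)]
labels = ["yes"] * 4 + ["no"] * 4

def predict(row):
    # Stand-in for a trained model: it looks only at feature 0.
    return "yes" if row[0] == 1 else "no"

importances = permutation_importance(predict, rows, labels)
print(importances)
```

A feature the model never consults scores exactly zero, while a feature the model depends on loses accuracy as soon as its values are scrambled. That is the sense in which a feature such as Enough can be called "important" to a forest.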

Conclusion

In conclusion, random forests are a very popular supervised machine learning algorithm built on decision trees, which classify data in a way that is comparatively easy for humans to follow. Because a forest combines many trees, it is a good way of reducing overfitting. From a results perspective, the random forest gave slightly better predictions than the decision trees, but it is harder to understand why the algorithm made the predictions it did. For the age breakdown, each range was in general affected slightly differently, and so the random forest was able to distinguish between the five ranges given the results of the survey. For the gender breakdown, the conclusion is similar: with some certainty, each gender was affected slightly differently, and the random forest was able to pick that out. Finally, for the breakdown by race, the random forest indicated that the effects of Covid-19 on food security were far less distinct from one race to the next. Apart from the "white alone" label, not much can be concluded definitively.