Covid-19 on Food Security

Survey Data Naïve Bayes


Before the Naïve Bayes classifier could be applied to this data set, the data had to be processed and cleaned in several steps. First, the census survey data was read into a dataframe. The survey breaks responses down by race, gender, and age, but this model only uses the totals, so the data was subset accordingly. Next, the lockdown status of each state was determined, and employment figures for food manufacturing and for food and hospitality were added. After combining these sources and removing any NA values, the data was visualized using a pair plot, shown below.
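The combining and cleaning steps described above can be sketched roughly as follows. This is a minimal illustration using plain Python dictionaries rather than a dataframe; the field names (`enough`, `lack_variety`, `food_hospitality`, `food_manufacturing`, `lockdown`) and all values are hypothetical placeholders, not the actual data or the code used for this analysis.

```python
# Hypothetical records; field names and values are placeholders, not the real data.
survey_totals = [
    {"state": "Vermont", "enough": 120.0, "lack_variety": 40.0},
    {"state": "Texas", "enough": 95.0, "lack_variety": None},  # missing value
    {"state": "Ohio", "enough": 110.0, "lack_variety": 35.0},  # no employment row
]
lockdown_by_state = {"Vermont": 1, "Texas": 0, "Ohio": 1}
employment_by_state = {
    "Vermont": {"food_hospitality": 30.0, "food_manufacturing": 12.0},
    "Texas": {"food_hospitality": 900.0, "food_manufacturing": 250.0},
}


def combine_sources(survey, lockdown, employment):
    """Join the three sources on state and drop rows with missing values."""
    combined = []
    for record in survey:
        state = record["state"]
        if state not in lockdown or state not in employment:
            continue  # no match in one of the sources
        row = {**record, "lockdown": lockdown[state], **employment[state]}
        if any(value is None for value in row.values()):
            continue  # equivalent of removing NA rows
        combined.append(row)
    return combined


rows = combine_sources(survey_totals, lockdown_by_state, employment_by_state)
```

Only the Vermont record survives here: the Texas record has a missing survey value and the Ohio record has no matching employment figures.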

Naïve Bayes Survey Data Visualization

The pair plot shows the relationship between each pair of features in the model, along with the distribution of each feature as a histogram. To explore the data further, a Naïve Bayes model was fit to it. Once the model is fit, a visual can be used to see how each feature is distributed for each state. These visuals are called variable density plots.
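To make the fitting step concrete, here is a minimal from-scratch sketch of a Gaussian Naïve Bayes classifier. It assumes numeric features modelled with one normal distribution per feature and class, which may differ from the settings of the libraries actually used; the function names and the `var_floor` parameter are illustrative only.

```python
import math
from collections import defaultdict
from statistics import mean, pvariance


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        columns = list(zip(*rows))
        model[label] = {
            "log_prior": math.log(len(rows) / len(y)),
            "means": [mean(col) for col in columns],
            # var_floor avoids division by zero for constant features
            "vars": [pvariance(col) + var_floor for col in columns],
        }
    return model


def log_gaussian(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def predict(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            log_gaussian(x, mu, var)
            for x, mu, var in zip(row, params["means"], params["vars"])
        )
    return max(model, key=score)


# Hypothetical two-feature training data for two made-up classes.
X_train = [[100, 30], [104, 32], [60, 5], [62, 7]]
y_train = ["A", "A", "B", "B"]
nb_model = fit_gaussian_nb(X_train, y_train)
```

The "naïve" part is the sum inside `score`: each feature contributes its own log density independently of the others.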

Variable Density

A variable density plot compares the distribution of a feature across the classes. For example, to see how the number of people who answered "Enough" is distributed within each individual state, a variable density plot can be used.
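As a rough sketch of what sits behind such a plot, the snippet below computes one density curve per class for a single feature. It assumes each class-conditional distribution is Gaussian; the state labels and values are hypothetical, and the plotting itself is left out.

```python
from statistics import NormalDist, mean, stdev


def density_curves(values_by_class, grid):
    """Evaluate a fitted normal density for each class at every grid point."""
    curves = {}
    for label, values in values_by_class.items():
        dist = NormalDist(mean(values), stdev(values))
        curves[label] = [dist.pdf(x) for x in grid]
    return curves


# Hypothetical "Enough" values for two made-up states.
enough_by_state = {
    "tight_state": [100, 101, 99, 100],
    "wide_state": [80, 100, 60, 120],
}
grid = list(range(50, 131))
curves = density_curves(enough_by_state, grid)
```

A class whose values cluster tightly produces a tall, narrow curve, while a class with widely spread values produces a low, flat one. That contrast is what the plots make visible.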

State of food security

Shown below are the variable density plots for each of the food security responses: whether people got enough to eat, lacked variety, sometimes didn't have enough, or never had enough to eat.

In the variable density plots, states with a much tighter distribution and a higher mean in the "Enough" category are doing better. Those states include Minnesota, Vermont, Washington, and Wisconsin, whose respondents generally report having enough food. Texas, Georgia, and Alabama, by contrast, have the lowest mean "Enough" values. At the other end of the spectrum, the expectation is the opposite pattern (or close to it), and the plots confirm this, with Louisiana joining the states that have more trouble getting enough food.

State of employment in industry

Shown below are the variable density plots for the employment numbers for each of the states. There are two sets of employment numbers, one for food and hospitality employment and one for food manufacturing employment. The hypothesis here is that different states have different employment numbers, with some states getting hit harder during the pandemic than others.

These graphs look slightly different from the previous variable density plots. The states that were hit hardest during Covid stand out: they show large spikes around the 0-50 mark, indicating mass unemployment in those industries. Because those spikes are specific to individual states, employment will probably be a very important feature when it comes to classifying states. According to the data and visualization, North Dakota and Vermont took a huge hit in the food and hospitality employment rate, and the same two states took a huge hit in food manufacturing employment. The North Dakota result makes sense, since large plants that deal specifically with food manufacturing closed there. California and Texas, on the other hand, weathered the storm better in terms of employment in those industries.

Feature Importance

Using the variable densities, basic predictions can be made about which features will be important for predicting the states. Our hypothesis is that the employment rates will make a big difference, but that the survey results are also important.

Naïve Bayes Survey Data Feature Importance

Looking at the feature importance plot, the two employment rates make the biggest difference when training the model. However, the survey results still play a part in predicting which state a set of test data belongs to.
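The method behind the feature importance plot is not stated here. One common, model-agnostic way to estimate importance is permutation importance: shuffle one feature at a time and measure how much accuracy drops. The sketch below shows that technique under the assumption that it is an acceptable stand-in; the function names, the toy prediction rule, and the data are all hypothetical.

```python
import random


def score(predict_fn, X, y):
    """Fraction of rows for which predict_fn returns the true label."""
    return sum(predict_fn(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict_fn, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = score(predict_fn, X, y)
    drops = []
    for j in range(len(X[0])):
        total_drop = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            total_drop += baseline - score(predict_fn, shuffled, y)
        drops.append(total_drop / n_repeats)
    return drops


# Toy stand-in for a trained model: it only looks at feature 0.
def toy_predict(row):
    return "A" if row[0] > 50 else "B"


X_test = [[100, 1], [90, 2], [10, 3], [20, 4]]
y_test = ["A", "A", "B", "B"]
drops = permutation_importance(toy_predict, X_test, y_test)
```

A feature the model ignores shows no drop at all, while a feature the model relies on shows a clear drop. This matches the intuition that the employment features matter most if shuffling them hurts accuracy the most.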

Results

Using a set of test data, the quality of the model can be evaluated. For comparison purposes, three different libraries that implement the Naïve Bayes classification algorithm were used. The libraries produce different visualizations, so it is important to check whether the results of the models differ as well. Below are the results of the three models. An important note here: although the models come from different libraries, they all used the same hyperparameters.

After making predictions, all three models performed equally well; in fact, their results are identical. The libraries differ in the visualizations they produce, which are used above, but not in their predictions. According to the confusion matrices, the states are predicted with fairly good accuracy.
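For reference, a confusion matrix and the accuracy derived from it can be computed as in the following minimal sketch. The labels and predictions are hypothetical and are not the results reported here.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Rows are actual labels, columns are predicted labels, both sorted."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


def accuracy(actual, predicted):
    """Share of predictions that match the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical test labels and predictions.
actual = ["Texas", "Texas", "Vermont", "Vermont"]
predicted = ["Texas", "Vermont", "Vermont", "Vermont"]
labels, matrix = confusion_matrix(actual, predicted)
acc = accuracy(actual, predicted)
```

The diagonal of the matrix counts correct predictions, so accuracy is the sum of the diagonal divided by the number of test rows. Comparing the prediction lists of the three models element by element is enough to confirm that they agree.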

Conclusion

In conclusion, the Naïve Bayes model created for this data set helped us assess how food security differs across states and gave us insight into how unemployment in different industries affected states differently. The food security results lead to conclusions similar to those from the decision trees: the "Enough" and "Sometimes" responses are more important than the lack-of-variety response. The employment data, however, was new to this analysis, and its massive effect on different states is clear. Looking at individual states and how each state's unemployment rate climbed, specifically in hospitality and food manufacturing, makes that effect evident. Different states weathered the storm differently, and the model supports that conclusion with data.