Covid-19 on Food Security

Data Cleaning


As part of cleaning the data to prepare it for the ensuing analysis, more data needed to be gathered, and some data that was previously gathered, including the WDI indicators and the global household survey data, will not be used. As a result, the data gathering section will have more subsections with more raw data, and the data that isn't used will remain as evidence of the work.

Leisure and Hospitality Employment Data (FRED)

This record data contains state-by-state employment figures for the United States only.

How was it cleaned?

This data was cleaned using R.

As previously mentioned, this data was already interesting because the existing API only had data for a subset of the 50 states but never indicated which states data was available for. Once the raw data was read, it was cut down to two columns. Then a test for anomalies was run using a package supported by Twitter, and a couple of anomalies were detected. After viewing the processed data, it was clear that the detected anomalies were in fact correct and did not need to be removed: the massive drop in employment numbers during Covid-19 triggers the anomaly detector because no one expected such an impact on these industries.
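For readers curious what that check can look like, here is a minimal sketch assuming the Twitter-supported AnomalyDetection package; the file name and column names are placeholders, not the exact code used.

```r
# Sketch only: file name and column names ("date", "employment") are placeholders.
library(AnomalyDetection)  # Twitter's open-source anomaly detection package

leisure <- read.csv("leisure_hospitality_md.csv")   # hypothetical raw FRED export
leisure <- leisure[, c("date", "employment")]       # cut the data down to two columns

# The series is monthly, so the vector interface with a 12-observation period is used.
res <- AnomalyDetectionVec(leisure$employment, max_anoms = 0.05,
                           period = 12, direction = "both", plot = TRUE)
res$anoms  # the 2020 employment collapse is flagged here and was kept, not removed
```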

Preview

Here is a preview of the leisure and hospitality data for Maryland. Visualizations for the rest of the states with available data are provided in the exploratory data analysis tab.


Food Manufacturing Employment Data (FRED)

How was it cleaned?

This data was cleaned using R.

This data was very similar in format to the employment data for leisure and hospitality, so it was cleaned in exactly the same fashion. After some minor preprocessing to remove columns that were not helpful, another test for anomalies was run using the same package. As with the previous data set, some anomalies were detected, though fewer. This can be seen in the visualization below. For more visualizations of this data, please take a look at the exploratory data analysis tab.

Preview


Global Covid-19 Lockdown Tracker

How was it cleaned?

This record data was cleaned using Python.

This data was much more complicated to clean due to missing values, unconfirmed statuses, and useless columns. The data was collected over a period of time by users of another company's tracker and was downloaded using Python. From there, two columns needed to be renamed because their names contained spaces. Those columns also held dates that were not standardized, so that needed to be fixed. The data was a mix of lockdown information on countries outside the United States and lockdown information on the United States, so it was subset to United States records only. There was also a column indicating whether the "lockdown" was confirmed, presumably by a user; any lockdowns that were not confirmed were removed as well. Then the data was grouped by "place", which ended up being the states. Finally, for each state, a new dataframe was created whose rows span the entire Covid-19 pandemic (late 2019 to present), with a column indicating whether the state was in lockdown. This will be useful since other data gathered is month by month, so the join between the datasets will be slightly easier.
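A rough pandas sketch of those steps is below; the column names ("Start date", "End date", "Place", "Country", "Confirmed") and the file name are assumptions about the tracker's layout, not its actual headers.

```python
import pandas as pd

raw = pd.read_csv("lockdown_tracker.csv")  # hypothetical file name

# Rename the two columns whose names contain spaces and standardize their dates.
raw = raw.rename(columns={"Start date": "start_date", "End date": "end_date"})
raw["start_date"] = pd.to_datetime(raw["start_date"], errors="coerce")
raw["end_date"] = pd.to_datetime(raw["end_date"], errors="coerce")

# Keep only confirmed lockdowns within the United States.
us = raw[(raw["Country"] == "United States") & (raw["Confirmed"] == "Yes")]

# For each state ("Place"), build a daily series spanning the pandemic
# with a column indicating whether the state was in lockdown that day.
span = pd.date_range("2019-11-01", pd.Timestamp.today().normalize(), freq="D")
state_frames = {}
for state, rows in us.groupby("Place"):
    in_lockdown = pd.Series(False, index=span)
    for _, r in rows.dropna(subset=["start_date", "end_date"]).iterrows():
        in_lockdown.loc[r["start_date"]:r["end_date"]] = True
    state_frames[state] = pd.DataFrame({"date": span, "in_lockdown": in_lockdown.values})
```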

Preview


Consumer Price Indices Data

The consumer price index is "the measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services" (US Bureau of Labor Statistics).

How was it cleaned?

This data was cleaned using R.

In order of data cleaning complications, this was again significantly harder than its predecessor above. The structure of the data was odd from the outset: years as labels in the columns, months in the rows, random columns indicating the formatting of other columns, and enumerations of data with no relevance. The first step was determining the output format of the data, which was three consumer price indices grouped by country, where each row holds the price index, country, date, and the value of that index on that date. After processing the data and visualizing it for the first time, it was clear that a couple more things needed to be cleaned up. Certain countries did not match standard country names, so those needed to be transformed; for example, "Viet Nam" needed to become "Vietnam" so the data could be joined with other data sets before analysis.
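A rough tidyverse sketch of that reshape is below; the FAO column names (Area, Item, Months, Y2000, Y2001, ...) and the one-entry recode list are assumptions rather than the exact code.

```r
library(tidyverse)

cpi_long <- cpi_raw %>%
  select(Area, Item, Months, matches("^Y\\d{4}$")) %>%        # drop flag/format columns
  pivot_longer(matches("^Y\\d{4}$"), names_to = "year",
               names_prefix = "Y", values_to = "value") %>%
  mutate(date = as.Date(paste(year, match(Months, month.name), "01", sep = "-")),
         Area = recode(Area, "Viet Nam" = "Vietnam")) %>%     # standardize country names
  select(index = Item, country = Area, date, value)
```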

Preview


Food Security Indicator Data

How was it cleaned?

This data was cleaned using R.

This data was also complicated to clean, but somewhat similar in approach to the consumer price indices data. The structure of the data was similar since the source (FAO) was the same, and some of the formatting of the columns was the same, but that is where the similarities ended. One particularly interesting piece of cleaning this data was the formatting of the dates in the columns. In half of the cases, the food security indicators were measured as a three-year average, which makes them not terribly useful for analysis when the rest of the data is measured in months or weeks. Nevertheless, the indicators measured as a three-year average needed to be split into single years with the same value. In several cases, values were set to "<0.1", indicating the value was less than 0.1. Those values were set to 0, though there is a case for setting them to some very small number, since 0 could also be indicative of a separate special case. Below is a visualization of the food supply variability indicator for Germany. For more visualizations of indicators by country, take a look at the exploring data tab.
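A hedged sketch of those two fixes is below, assuming the year columns have already been gathered into a single year_label column (as with the CPI data), that the measurement column is named value, and that three-year averages appear as labels like "2018-2020"; it is an illustration, not the exact code.

```r
library(tidyverse)

indicators_clean <- indicators_raw %>%
  # "<0.1" means less than 0.1; set to 0 here (a tiny positive number is an alternative)
  mutate(value = as.numeric(if_else(value == "<0.1", "0", value))) %>%
  # split three-year-average labels such as "2018-2020" into one row per year, same value
  separate(year_label, into = c("start", "end"), sep = "-", fill = "right") %>%
  mutate(end  = coalesce(end, start),
         year = map2(as.integer(start), as.integer(end), seq)) %>%
  unnest(year) %>%
  select(-start, -end)
```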

Preview


Global Covid Data

How was it cleaned?

This data was cleaned using R.

This data was unique considering the amount of data that existed and the evolution of the data over time. The data started with a very basic structure that included the location and number of cases. Over time, the format changed without notice and columns were added, including number of deaths, active cases, more location-specific data, and special enumerated columns for unknown purposes. The goal was to take all of that data and consolidate it into time series data for each state with at least two columns: one for the total number of cases recorded and one for the number of cases recorded that day (net). After processing the data, it was easy to see the missing data, since there were a couple of rows where the total number of cases did not grow day over day. A path forward has yet to be determined, since the value is very significant: it affects every row after it. Depending on how the data is used, a further cleaning session will have to be completed. A visualization of the cleaned and processed data is below. It is interesting to see the massive spikes in the data being flagged as anomalies; most likely this is a lag in reporting that needs to be divided and spread across multiple days. To explore more of the data, including the trend of Covid cases across all countries and all states, please explore the "Explore Data" tab.
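A small R sketch of that consolidation and the day-over-day check is below, assuming the daily files have already been stacked into a single frame with state, date, and case columns (names are placeholders).

```r
library(tidyverse)

covid_ts <- covid_raw %>%
  group_by(state, date) %>%
  summarise(total_cases = sum(cases), .groups = "drop") %>%        # one row per state and day
  arrange(state, date) %>%
  group_by(state) %>%
  mutate(new_cases = total_cases - lag(total_cases, default = 0),  # net cases that day
         suspect   = new_cases < 0) %>%  # cumulative total fell: likely missing/corrected data
  ungroup()
```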

Preview


Google Search API

How was it cleaned?

This text data was cleaned using Python.

This data was obtained from the Google Search API, transformed into JSON, and stored. The raw JSON contained information on the search results, including the title of the article, the article text (at least what the API thought was the article), and some other metadata. From the raw JSON file, the important data was the text within the extracted article. That data was retrieved and processed using a variety of libraries, ultimately into the CountVectorizer data structure provided by sklearn. Before the text was inserted into the CountVectorizer, it was preprocessed. This included removing stop words, lowercasing all words, removing non-alphabetic characters, lemmatizing the words, and finally putting them through the nltk word filter, which got rid of a lot of nonsense. From there, the words were put through the CountVectorizer and extracted into a dataframe for storage. The processed words are visualized in a wordcloud. To explore the results of both queries, "covid19 on food security" and "h1n1 on food security", take a look at the "Explore Data" tab. The data will continue to be cleaned to remove the random words that are still making it into the wordcloud, especially the two-letter words.
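A condensed sketch of that preprocessing pipeline is below; the JSON structure (a list of results with a "text" field) and the file name are assumptions about how the stored search results are laid out.

```python
import json
import re

import pandas as pd
from nltk.corpus import stopwords, words  # nltk corpora (stopwords, words, wordnet) must be downloaded first
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

with open("covid_food_security_results.json") as f:   # hypothetical file name
    results = json.load(f)

lemmatizer = WordNetLemmatizer()
stops = set(stopwords.words("english"))
english_words = set(words.words())                    # the nltk word filter

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())      # lowercase, keep alphabetic tokens only
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stops]
    return " ".join(t for t in tokens if t in english_words)

docs = [preprocess(result["text"]) for result in results]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
word_counts = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())
```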

Preview

A visualization of covid search results

Household Survey Data

How was it cleaned?

This data was cleaned using Python.

This data is updated on a weekly basis and stored in sets of xlsx files, so to even begin to process and clean the data, major transformations were needed. The data needed to be read in and processed into a csv format. The raw data was a large set of Excel files containing a variety of different data, but the goal was to extract only the data on the results of survey questions concerning food security. The code uses regex to decide whether an Excel file contains the information sought after, and if it does, it goes through every worksheet, state by state, to retrieve the right information. Each worksheet, one per state, has breakdowns of answers to the questions by age, gender, and race, all of which are useful pieces of information. The output of the processed data is a set of csv files, one per state, with all of the information provided, including week number, topic, characteristic, and the number of people who answered each question affirmatively. The output also includes a separate set of the same csv files in which the values are normalized.
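A minimal sketch of that extraction loop is below; the directory names, the regex used to pick out the food-security workbooks, the one-worksheet-per-state layout, and the column-wise normalization are simplifying assumptions, and the real survey tables need more reshaping than shown.

```python
import re
from pathlib import Path

import pandas as pd

out_dir = Path("cleaned")
out_dir.mkdir(exist_ok=True)

for workbook in Path("household_survey_xlsx").glob("*.xlsx"):
    # Regex check: only keep workbooks whose names mention food security.
    if not re.search(r"food", workbook.stem, re.IGNORECASE):
        continue
    sheets = pd.read_excel(workbook, sheet_name=None)    # dict of worksheets, one per state
    for state, table in sheets.items():
        table.to_csv(out_dir / f"{state}.csv", index=False)
        # A second, normalized copy: each numeric column divided by its total.
        normalized = table.copy()
        numeric = normalized.select_dtypes("number")
        normalized[numeric.columns] = numeric / numeric.sum()
        normalized.to_csv(out_dir / f"{state}_normalized.csv", index=False)
```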

Preview


Academic Corpus Data

How was it cleaned?

This text data was cleaned using R.

The data was obtained by manually pulling academic articles concerning the impact of Covid and H1N1 on food and food security. All of the articles were PDFs, so an R library was used to extract the text. For each of the two corpora, an initial VCorpus data structure was used to better manage the large amount of text data. A wordcloud was also used to visualize the raw data. From there, the corpus was pushed into a DocumentTermMatrix and further processed. Processing steps included removing punctuation, numbers, stop words, and whitespace, and lemmatizing the words. From there, another visualization using a wordcloud was created and all data was saved.
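A short sketch of that pipeline is below, assuming pdftools for extraction, tm for the corpus structures, textstem for lemmatization, and wordcloud for the visualization; the paths and parameters are placeholders rather than the exact code.

```r
library(pdftools)
library(tm)
library(textstem)
library(wordcloud)

pdf_files <- list.files("corpus/covid", pattern = "\\.pdf$", full.names = TRUE)
texts <- vapply(pdf_files, function(f) paste(pdf_text(f), collapse = " "), character(1))

corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(lemmatize_strings))  # textstem lemmatizer

dtm  <- DocumentTermMatrix(corpus)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 100)   # wordcloud of the processed corpus
```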

Preview