Covid-19 on Food Security

Data Gathering


Answering these kinds of questions required data from a wide variety of sources. Python and R programs were used to gather data from the United States Census Bureau, the World Bank, the FAO, and Johns Hopkins University, among others.

Johns Hopkins

Johns Hopkins University maintains one of the most widely used trackers of the global Covid-19 pandemic, and their dashboard is highly recommended. The data behind the dashboard is extremely useful for comparing time-series survey data against time-series Covid-19 case counts on a per-country basis. Using the R code snippet below, the data can be sourced and processed into RDS/CSV files. If interested, a link to the full code is provided here.

processRow <- function (row) {
  country <- row[1]
  fileName = paste(processedDataPath, country, '.csv', sep = '')
  # if there is a csv file that exists with the country.region name, open it into a dataframe, else create a new one
  if (file.exists(fileName)) {
    df <- rbind(read.csv(fileName), t(row))
    write.csv(df, fileName, row.names = FALSE)
  } else {
    df <- data.frame(t(row), row.names = NULL)
    write.csv(df, file = fileName, row.names = FALSE)
  }
}


processCovidData <- function(df, formattedDate) {
  # Preprocess and save again into another R datasource, this time by country
  # for each country in the df
  print('Processing file with date')
  print(formattedDate)

  colnames(df)[which(names(df) == 'Country/Region')] <- 'Country.Region'
  colnames(df)[which(names(df) == 'Country_Region')] <- 'Country.Region'

  groupedByCountry <- df %>%
    filter(!is.na(Confirmed)) %>%
    group_by(Country = Country.Region) %>%
    summarize(Confirmed = sum(Confirmed, na.rm = TRUE), Date = formattedDate) %>%
    mutate(Confirmed = as.numeric(Confirmed),
           Country = as.character(Country),
           Date = as.Date(Date,format="%m-%d-%y")
    )

  apply(groupedByCountry, 1, processRow)
}
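
The preview above processes a data frame that has already been read; as context, a short Python sketch of how a single daily report could be sourced and aggregated by country is below. The URL pattern and the Country_Region/Confirmed column names are assumed from the public CSSEGISandData/COVID-19 repository; the project's actual sourcing is done in the R pipeline linked above.

import pandas as pd

# Illustrative sketch: read one Johns Hopkins CSSE daily report straight from GitHub
# and total the confirmed cases per country.
DAILY_REPORT_URL = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
                    'csse_covid_19_data/csse_covid_19_daily_reports/{date}.csv')

def confirmed_by_country(report_date):
    """Aggregate confirmed cases by country for one daily report (date as MM-DD-YYYY)."""
    daily = pd.read_csv(DAILY_REPORT_URL.format(date=report_date))
    grouped = daily.groupby('Country_Region', as_index=False)['Confirmed'].sum()
    grouped['Date'] = pd.to_datetime(report_date, format='%m-%d-%Y')
    return grouped

print(confirmed_by_country('01-01-2021').head())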

To see the raw data gathered by this script, please click this link. Please be aware it will download a zip file to your local machine.

To visualize the data retrieved by this script, the progression of the pandemic in Australia is shown below as a simple line graph animated using D3.js.
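
The published chart is animated with D3.js, but a minimal static sketch of the same idea, reading the per-country CSV written by processRow above, could look like the following. The 'processed_data/Australia.csv' path is a stand-in for wherever processedDataPath points.

import pandas as pd
import matplotlib.pyplot as plt

# Illustrative static version of the animated chart: cumulative confirmed cases
# in Australia, read from the per-country CSV produced by the R script above.
australia = pd.read_csv('processed_data/Australia.csv', parse_dates=['Date'])
australia = australia.sort_values('Date')

plt.plot(australia['Date'], australia['Confirmed'])
plt.xlabel('Date')
plt.ylabel('Confirmed cases')
plt.title('Cumulative confirmed Covid-19 cases in Australia')
plt.show()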

World Bank

The World Bank also provides a massive collection of World Development Indicators (WDI), compiled from a variety of sources. At a glance, the WDI database covers:

- Time-series indicators: 1,400
- Economies: 217
- Country groups: 40
- Years of data: 50+

To gather the World Bank indicators, the World Bank's bulk download endpoint can be used to retrieve the full WDI dataset as a CSV file. Below is a preview of the code written here.

from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile

import pandas as pd


class WDIIndicators:
    """Retrieve WDI Indicators from the World Bank"""

    def __init__(self, file_storage):
        self._file_storage = file_storage
        self._base_url = 'https://databank.worldbank.org/data/download/WDI_csv.zip'
        self._wdi_data = 'WDIData.csv'

    def retrieve_wdi_indicator_data(self):
        print('Getting zip file from url', self._base_url)
        with urlopen(self._base_url) as zip_response:
            with ZipFile(BytesIO(zip_response.read()), 'r') as zip_file:
                try:
                    df = pd.read_csv(zip_file.open(self._wdi_data))
                    output_file_name = 'raw_data/wdi_data/wdi_data.csv'
                    self._file_storage.create_directory_if_not_exists(output_file_name)
                    df.to_csv(output_file_name)
                except Exception as error:
                    print('An error occurred reading', self._wdi_data, error)
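
As a usage sketch, the class might be driven as shown below. FileStorage here is a hypothetical stand-in for the project's file storage helper, reduced to the single create_directory_if_not_exists method the class relies on.

import os

class FileStorage:
    """Hypothetical stand-in for the project's file storage helper."""

    def create_directory_if_not_exists(self, file_name):
        # Create the parent directory of the output file if it does not already exist
        os.makedirs(os.path.dirname(file_name), exist_ok=True)

if __name__ == '__main__':
    # Note: this downloads the full WDI zip archive, which is a large file.
    WDIIndicators(FileStorage()).retrieve_wdi_indicator_data()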

To see the raw data gathered by this script, please click this link. Please be aware it will download a zip file to your local machine.

A simple visualization of world population over time (1960-2020) is displayed below. This chart is also generated in the code and can be viewed by following the link.

World Population over time
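
The chart itself is generated by the project's visualization code, but a rough matplotlib sketch of how it could be reproduced from the downloaded CSV is below. SP.POP.TOTL is the standard WDI code for total population and 'World' is the WDI aggregate; the plotting here is illustrative, not the project's code.

import pandas as pd
import matplotlib.pyplot as plt

# Illustrative sketch: plot total world population from the downloaded WDI csv.
wdi = pd.read_csv('raw_data/wdi_data/wdi_data.csv')
world_pop = wdi[(wdi['Indicator Code'] == 'SP.POP.TOTL') &
                (wdi['Country Name'] == 'World')]

# Year columns in WDIData.csv are simply named '1960', '1961', and so on.
year_cols = [c for c in world_pop.columns if c.isdigit()]
series = world_pop[year_cols].iloc[0].astype(float)

plt.plot([int(y) for y in year_cols], series / 1e9)
plt.xlabel('Year')
plt.ylabel('Population (billions)')
plt.title('World Population over time')
plt.show()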

Food and Agriculture Organization of the United Nations

The FAO Statistics Division, in conjunction with the Gallup World Poll, began adding the FIES module to the set of poll questions in 2014. Important information from these surveys includes the results of the FIES questions plus key demographic indicators that help break down the data. The FAO provides this data under a proprietary license. To obtain data such as key food security indicators, food balance sheets, and historical food prices, I used the FAOSTAT API to retrieve the data in bulk. Below is a sample of the code used to obtain this data. The full code can be found here.

downloadAndStoreDataFromFAO <- function(bulkDataFile, rdsFileName) {
  # The bulk url for the FAO api
  urlBulkUrl <- "http://fenixservices.fao.org/faostat/static/bulkdownloads"
  bulkData <- file.path(urlBulkUrl, paste(bulkDataFile, '.zip', sep = '' ))
  # Download the FAO Stat data in bulk and store it
  download_faostat_bulk(url_bulk = bulkData, data_folder = dataFolder)

  # Read the faostat data in bulk
  bulkData <- read_faostat_bulk(paste(dataFolder, paste(bulkDataFile, '.zip', sep = ''), sep = '/'))
  print(head(bulkData))

  # Save to RDS for faster access later
  saveRDS(bulkData, paste(dataFolder, paste(rdsFileName, '.rds', sep = ''), sep = '/'))

  # Save to CSV for viewing purposes
  write.csv(bulkData, paste(dataFolder, paste(rdsFileName, '.csv', sep = ''), sep = '/'))
}

A simple visualization of the raw data downloaded from the FAO is shown below. The data has not been cleaned yet, hence the large number of missing values, but it illustrates the concept: the chart shows global consumer price index data for a particular index measured around the world.

Global consumer price indices (raw FAO data)

United States Census

The United States Census Bureau is the leading source of data about the American people. Shortly after the start of the pandemic, the Census Bureau began collecting information on the effects of Covid-19 on American households. One of these household surveys is a food security survey, which samples how Americans are doing with regard to food security. Also important is the demographic breakdown within each survey, which will be useful later on. Below is a snapshot of the code used to retrieve that information; most of it crawls the Census household survey directories to find the food security survey data. If interested, the full code is linked here.

import requests
from bs4 import BeautifulSoup


class HouseholdSurveysApi:
    """Scrapes Covid-19 household food security survey data from the Census website"""

    def __init__(self, file_storage):
        self._household_surveys_endpoint = 'https://www2.census.gov/programs-surveys/demo/tables/hhp/'
        self._file_storage = file_storage

    def retrieve_survey_data(self):
        response = requests.get(self._household_surveys_endpoint)
        soup = BeautifulSoup(response.text, 'lxml')
        survey_years = self.retrieve_survey_years(soup)
        survey_data = dict()
        for survey_year in survey_years:
            survey_data[survey_year] = self.retrieve_food_surveys(survey_year)

        self.save_survey_data(survey_data)
        print('Saved all surveys')
        return survey_data

    def retrieve_food_surveys(self, survey_year):
        survey_week_data = dict()
        survey_weeks_endpoint = f'{self._household_surveys_endpoint}{survey_year}'
        response = requests.get(survey_weeks_endpoint)
        soup = BeautifulSoup(response.text, 'lxml')
        survey_weeks = self.retrieve_survey_weeks(soup)

        for survey_week in survey_weeks:
            survey_week_data[survey_week] = self.retrieve_surveys(f'{survey_weeks_endpoint}/{survey_week}')

        return survey_week_data
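
The snapshot above leaves out the helper methods that do the actual link scraping (retrieve_survey_years, retrieve_survey_weeks, retrieve_surveys, save_survey_data). As a rough illustration, retrieve_survey_years might look something like the sketch below, assuming the Census directory listing exposes each survey year as a link such as '2020/'; the real implementation is in the full code linked above.

import re

def retrieve_survey_years(self, soup):
    """Hypothetical sketch: collect year directories (e.g. '2020/') from the listing page."""
    years = []
    for anchor in soup.find_all('a'):
        href = anchor.get('href', '')
        match = re.fullmatch(r'(20\d{2})/?', href)
        if match:
            years.append(match.group(1))
    return years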

To see the raw data gathered by this script, please click this link. Please be aware it will download a zip file to your local machine.

To visualize the results of these household surveys, each survey week includes a state-by-state breakdown of answers to certain food security questions. Below is an example of what the visualization code can produce.

Household survey data visualized for CA, Week 17

Global Covid Lockdown Data

Another important data source concerns the timelines of when lockdowns happened. That information can be found in a variety of places, so the best option was not immediately obvious. The crowdsourced dataset from Aura Vision AI seemed to give the best data, since each reported lockdown has to be confirmed before it is included. It also breaks lockdowns down country by country and, for the United States, state by state, which is critical information. The code to retrieve this data is simple; a preview is below and the full code is linked here.

import os

import pandas as pd

# S3Api (the project's S3 helper class) is imported in the full code linked above.


class GlobalCovidLockdownData:
    """Retrieves lockdown data from a lockdown tracking api"""

    def __init__(self, file_storage, s3_api):
        self._file_storage = file_storage
        self._lockdown_api = 'https://covid19-lockdown-tracker.netlify.app/lockdown_dates.csv'
        self._s3_api = s3_api

    def retrieve_lockdown_data(self):
        """Retrieves the raw lockdown data from the api and stores it for future use"""
        lockdown_data = pd.read_csv(self._lockdown_api)
        print(lockdown_data.head())
        print('Saving to file')
        lockdown_data[['StartDate', 'EndDate']] = lockdown_data[['Start date', 'End date']]
        lockdown_data = lockdown_data.drop(['Start date', 'End date'], axis=1)
        self._file_storage.store_df_as_file('lockdown_data/lockdown_data.csv', lockdown_data)

    def store_raw_data(self):
        """Stores the raw data in S3"""
        file = f'{self._file_storage.get_raw_base_path()}/lockdown_data/lockdown_data.csv'
        print('Processing and storing in s3', file)
        filename = os.path.basename(file).strip()
        lockdown_data = pd.read_csv(file, index_col=False)
        print('Attempting to upload raw lockdown data to s3')
        self._s3_api.upload_df(lockdown_data, f'lockdown_data/{filename}', S3Api.S3Location.RAW_DATA)
        print('Successfully uploaded')

To visualize the lockdown data, a timeline plot seemed most appropriate. Below is an example of that. To see the data that powers that plot, click the download button.
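
The timeline itself is rendered by the visualization code linked below, but a minimal matplotlib sketch of the idea, drawing one horizontal bar per lockdown from the stored CSV, might look like this. The file path and the Place column are assumptions based on the retrieval code and the source dataset; StartDate/EndDate follow the renaming done in retrieve_lockdown_data above.

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Illustrative sketch: one horizontal bar per lockdown, spanning StartDate to EndDate.
lockdowns = pd.read_csv('raw_data/lockdown_data/lockdown_data.csv',
                        parse_dates=['StartDate', 'EndDate'])
lockdowns = lockdowns.dropna(subset=['StartDate', 'EndDate']).head(30)

fig, ax = plt.subplots(figsize=(8, 10))
for i, row in enumerate(lockdowns.itertuples()):
    ax.hlines(y=i, xmin=row.StartDate, xmax=row.EndDate)

ax.set_yticks(range(len(lockdowns)))
ax.set_yticklabels(lockdowns['Place'])  # 'Place' is assumed to name the locked-down region
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %Y'))
plt.tight_layout()
plt.show()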

See the code | Download the Data

Federal Reserve Economic Data

Another important source of data is the Federal Reserve Economic Data (FRED) API, which is maintained by the Economic Research Division of the Federal Reserve Bank of St. Louis. The FRED API offers an enormous amount of economic data; importantly, it provides up-to-date employment figures that can be compared against Covid-19 data. The focus of this data gathering step is to grab leisure and hospitality employment data and food manufacturing employment data for all states (or as many states as have data). The retrieval code iterates over the API to find the relevant series and saves whatever the API returns. Below is a preview of the code; the full version is linked here.

retrieveLeisureAndHospitalityEmploymentDataFor50States <- function() {
  seriesIds <- c(1:56)
  for (i in seriesIds) {
    # Create the FRED series id associated with leisure and hospitality employment data
    seriesId <- paste('SMU', str_pad(i, 2, pad = '0'), '000007072200001SA', sep = '')
    print('Retrieving hospitality and leisure series with id')
    print(seriesId)
    # Attempt to retrieve data for each state. In some cases, the state may not have data,
    # in which case, the FRED api will fail. In the case it fails, print a message
    # and move on.
    tryCatch({
      # Using a library, make the request against the FRED api
      seriesInformation <- fredr_request(series_id = seriesId, endpoint = "series")
      extractedState <- str_extract(seriesInformation$title, paste(state.name, collapse='|'))
      observations <- fredr(series_id = seriesId)
      df <- data.frame(Date = observations$date, Value = observations$value)
      # Add some extra information to the raw data for context
      df$Units <- seriesInformation$units
      df$State = extractedState
      df$Frequency = seriesInformation$frequency
      df$Notes = seriesInformation$notes
      df$LastUpdated = seriesInformation$last_updated

      print('Saving raw data to file')
      print(foodAndHospitalityCsv(extractedState))
      # Save to RDS for faster access later
      saveRDS(df, paste(dataFolder, foodAndHospitalityRds(extractedState), sep = '/'))

      # Save to CSV for viewing purposes
      write.csv(df, paste(dataFolder, foodAndHospitalityCsv(extractedState), sep = '/'), row.names = FALSE)
    }, error = function(cond) {
      print('********************************************')
      print('No series information exists for this id')
      print('********************************************')
    })
  }
}

If interested, you can view some of the raw data here: a CSV of the food and hospitality employment data for California. Visualizations of this raw data will come later.

Corpus Data Gathered from Academic Sources

Another important source of data, one that will help answer the question about the effects of previous disasters (specifically the 2009 H1N1 pandemic), is a collection of articles on the effects of each pandemic on food security. Several articles were gathered manually and processed into a DocumentTermMatrix (DTM) data structure provided by the "tm" package. The code below shows how the raw data is gathered and then processed using a DTM. A link to the full code is here if interested.

extractCorpusDataAndProcess <- function(corpus_path, type) {
  allFiles <- list.files(corpus_path, full.names = TRUE, pattern = '*.pdf')
  print(allFiles)
  corpus <- vector()
  for (file in allFiles) {
    pdfText <- pdftools::pdf_text(file)
    pdfText <- paste(unlist(pdfText), collapse = ' ')
    corpus <- append(corpus, pdfText)
  }
  vcorpus <- VCorpus(VectorSource(corpus))

  dtm.raw <- DocumentTermMatrix(vcorpus)
  cleaned_matrix.raw <- as.matrix(removeSparseTerms(dtm.raw, 0.99))
  consolidatedMatrix.raw <- sort(colSums(cleaned_matrix.raw), decreasing = TRUE)
  df.raw <- data.frame(words = names(consolidatedMatrix.raw), freq = consolidatedMatrix.raw)

  print('Writing raw data to file')
  write.csv(data.frame(cleaned_matrix.raw), paste('raw_data/', type, '_dtm.csv', sep = ''))

  w.raw <- wordcloud2(df.raw, size = 2)
  print('Attempting to save wordcloud as a raw data visualization')
  saveWidget(w.raw, paste('raw_data_visualizations/', type, '_wordcloud.html', sep = ''), selfcontained = F)

  # Clean the data as much as possible
  vcorpus <- tm_map(vcorpus, content_transformer(tolower))
  vcorpus <- tm_map(vcorpus, removePunctuation)
  vcorpus <- tm_map(vcorpus, removeNumbers)
  vcorpus <- tm_map(vcorpus, removeWords, stopwords("english"))
  vcorpus <- tm_map(vcorpus, stripWhitespace)
  vcorpus <- tm_map(vcorpus, content_transformer(lemmatize_strings))

  dtm <- DocumentTermMatrix(vcorpus)
  cleaned_matrix <- as.matrix(removeSparseTerms(dtm, 0.99))
  consolidatedMatrix <- sort(colSums(cleaned_matrix), decreasing = TRUE)
  df <- data.frame(words = names(consolidatedMatrix), freq = consolidatedMatrix)

  print('Writing processed data to file')
  write.csv(data.frame(cleaned_matrix), paste('processed_data/', type, '_dtm.csv', sep = ''))

  w <- wordcloud2(df, size = 2)
  print('Attempting to save wordcloud as a processed data visualization')
  saveWidget(w, paste('processed_data_visualizations/', type, '_wordcloud.html', sep = ''), selfcontained = F)
}