Sam Pastoriza

Predicting Levels of Earthquake Damage: A Comparison of Classification Models


Understanding the Data


Data and Variables

The dataset was made publicly available by Nepal's government following a survey conducted between January 2016 and May 2016. It consists of 347,469 rows and 39 numeric and categorical columns that describe each building, including its structure and dimensions, its structural materials, and its usage. Table 1 lists the features made available in this competition, including the label. The label, as mentioned above, identifies the level of damage sustained by each building: (1) low, (2) medium, or (3) high.

Earthquake Variable Data

Variable	Description
damage_grade	The amount of damage sustained (Response Variable)
building_id	The unique id of the building
geo_level_1_id	The town level location id
geo_level_2_id	The district level location id
geo_level_3_id	The street level location id
count_floors_pre_eq	Number of floors before earthquake
age	Age of the building
area_percentage	Area of the building
height_percentage	Height of the building
land_surface_condition	Type of land the building is on

Table 1. List of Available Features in Data Set, including the Response Variable.

Preprocessing

There were no missing values in any variable. Dummy variables were constructed for the categorical columns, and the data came as a 75-25 train-test split: the training data consisted of 260,601 rows and 70 variables (including the label), while the testing set contained 86,868 rows. This split was established and provided by the DrivenData competition such that the testing set did not contain labels and would be used for final scoring of model submissions. Therefore, all training and testing of models prior to submission had to be done with the given training set in order to assess model accuracy.
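As an illustration of the dummy-variable step, the following is a minimal, dependency-free sketch and not the code used in this project. The helper `one_hot_encode` and the category codes in the toy rows are hypothetical; in practice a library routine such as `pandas.get_dummies` would likely do the same job.

```python
def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column with 0/1 indicator columns."""
    # Collect the sorted set of levels observed in each categorical column.
    levels = {
        col: sorted({row[col] for row in rows})
        for col in categorical_columns
    }
    encoded = []
    for row in rows:
        # Keep the non-categorical columns unchanged.
        new_row = {k: v for k, v in row.items() if k not in levels}
        # Add one indicator column per (column, level) pair.
        for col, values in levels.items():
            for value in values:
                new_row[f"{col}_{value}"] = int(row[col] == value)
        encoded.append(new_row)
    return encoded


# Toy rows: the values and category codes are made up for illustration.
buildings = [
    {"building_id": 1, "age": 30, "land_surface_condition": "t"},
    {"building_id": 2, "age": 10, "land_surface_condition": "n"},
    {"building_id": 3, "age": 55, "land_surface_condition": "o"},
]
encoded = one_hot_encode(buildings, ["land_surface_condition"])
```

Each categorical column expands into one indicator column per level, which is how a data set with 39 original columns can grow to 70 variables after encoding.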

Because tuning the different classification algorithms demands intensive computational time and power, the training set was first subset by 25% for the initial training and testing of several models. This subset was then split into three data sets: an 80% training set, a 10% testing set, and a 10% validation set, which were used to create and tune the classification algorithms. Once the best models were established, they were re-trained on the entire training data set (not the subset), and the final model from this training/tuning process was submitted for competition scoring. The competition would then run the model on the original testing set and score it accordingly. Figure 1 shows a diagram of this data splitting and subsetting process.

Figure 1. Data Source and Subsetting Diagram
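The subsetting and splitting described above can be sketched as follows. This is only an illustration under stated assumptions: it reads "subset by 25%" as drawing a 25% random sample, it uses a plain (non-stratified) random split, and the function name `subset_and_split` and the seed are hypothetical. The source does not say which of these choices the project actually made.

```python
import random


def subset_and_split(n_rows, subset_fraction=0.25, seed=0):
    """Return train/test/validation row indices from a random subset."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)

    # Assumption: "subset by 25%" means keeping a 25% random sample.
    n_subset = int(n_rows * subset_fraction)
    subset = indices[:n_subset]

    # 80% / 10% / 10% split; integer arithmetic avoids rounding surprises.
    n_train = n_subset * 8 // 10
    n_test = n_subset // 10
    train = subset[:n_train]
    test = subset[n_train:n_train + n_test]
    validation = subset[n_train + n_test:]
    return train, test, validation


# 260,601 is the size of the competition training set.
train_idx, test_idx, val_idx = subset_and_split(260_601)
```

Because the three index lists are slices of one shuffled list, no row can appear in more than one of them. Given the class imbalance in the labels, a stratified split would be a reasonable alternative.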

Exploratory Data Analysis

Before modeling, a series of exploratory plots was built to get an initial grasp of the variables and their distributions. Figure 2 shows that the damage levels were non-uniformly distributed, with more training data labeled “Medium” than “Low” or “High”.


Figure 2: The distribution of damage level classifications for the training data. This plot shows that the damage levels were non-uniformly distributed and there were more training data labeled as “Medium” as compared to “Low” and “High”.
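The class balance behind a plot like Figure 2 can be checked numerically. The sketch below is illustrative only: `class_proportions` is a hypothetical helper and the labels are made-up toy values, not the competition data.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of observations in each damage grade."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {grade: counts[grade] / total for grade in sorted(counts)}


# Toy labels (1 = low, 2 = medium, 3 = high); values are made up.
toy_labels = [1, 2, 2, 3, 2, 3, 2, 1, 2, 3]
proportions = class_proportions(toy_labels)
majority_grade = max(proportions, key=proportions.get)
```

A majority-class share well above one third signals the kind of imbalance noted above, and it also sets the accuracy that a trivial "always predict the majority class" baseline would reach.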

Next, in Figures 3 and 4, the distributions of numerical and categorical variables show the relatively large imbalance in data across the variables.


Figure 3: The distribution of numerical variables for the training data set. Some of the variables are skewed towards lower values with some outliers (see age), while the height variable looks somewhat normally distributed.

In Figures 5 and 6, the distributions of the secondary use variables are shown and compared to the damage grade levels. These visualizations suggest that secondary usage has little to no effect on the damage sustained by a building.


Figure 5: The distribution of secondary use variables. This plot shows the sparseness of the variables, which could mean they are less significant when it comes to modeling the data.

However, in Figures 7 and 8, the distributions of the superstructure variables lead to a different conclusion. These visualizations show that buildings engineered with reinforced concrete were far more likely to sustain low damage, while buildings built with flagstone sustained high levels of damage.



Figure 7: The distribution of superstructure type variables. This plot shows that most superstructures are built using mud, mortar, and stone.
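The comparison between a binary superstructure flag and the damage grade amounts to a cross-tabulation. The following is a minimal sketch, not the project's plotting code. The helper `damage_share_by_flag`, the column name `has_rc`, and the toy rows are all hypothetical.

```python
from collections import Counter, defaultdict


def damage_share_by_flag(rows, flag, label="damage_grade"):
    """For each value of a binary flag, return the share of each damage grade."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[flag]][row[label]] += 1

    shares = {}
    for flag_value, grade_counts in counts.items():
        total = sum(grade_counts.values())
        shares[flag_value] = {
            grade: count / total
            for grade, count in sorted(grade_counts.items())
        }
    return shares


# Toy rows: "has_rc" stands in for a reinforced-concrete indicator (made up).
toy_rows = [
    {"has_rc": 1, "damage_grade": 1},
    {"has_rc": 1, "damage_grade": 1},
    {"has_rc": 1, "damage_grade": 2},
    {"has_rc": 0, "damage_grade": 2},
    {"has_rc": 0, "damage_grade": 3},
    {"has_rc": 0, "damage_grade": 3},
    {"has_rc": 0, "damage_grade": 2},
]
shares = damage_share_by_flag(toy_rows, "has_rc")
```

Comparing the grade shares across the two flag values shows whether a material is associated with lower or higher damage, independent of how common that material is.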

As the plots in Figures 2-8 show, exploratory data analysis provides a better understanding of the variables used in modeling and thus improves model interpretability. From the exploration, it is important to note the imbalance of the class variable, which may make it difficult to differentiate low and high damage from medium damage. Additionally, the analysis indicates that individual one-hot-encoded variables can be important predictors depending on their value (see the superstructure variables). In sum, the results of the exploratory data analysis provided a basis of understanding prior to modeling.