Predicting Levels of Earthquake Damage: A Comparison of Classification Models

Feature Engineering


To improve the accuracy of the two best models and to reduce algorithmic complexity, feature engineering was explored in three main ways: feature selection, feature transformation, and feature extraction.

The feature selection process started by grouping the variables into four categories that were hypothesized to have a relationship with damage level: geography, building structure, building materials, and building usage. Of these, geography was considered for transformation for two main reasons. First, a building's location is expected to be an important factor in the level of damage it experiences, as buildings closer to an earthquake's origin are more likely to experience greater damage. Second, the three geographic variables were the three most important variables in all three of the top classification models produced without feature engineering: Random Forest, AdaBoost, and Bagging. This is shown and discussed in Figure 6 in the Results section below. Given these variables' importance in the pre-feature-engineering models, it seemed likely that combining the geographic variables into a single variable could help the models interpret and use the geographic information.

There were three main geographical variables that represented building location: "geo_level_1_id", "geo_level_2_id", and "geo_level_3_id", which could loosely be interpreted as representing the building's town (level 1), district (level 2), and street (level 3). These variables are simply encoded identifiers for the building's town, district, and street. Because the values are not coordinates such as latitude and longitude, they do not provide information on the actual positions of buildings on a map. As such, the encodings are less interpretable for humans and likely harder for a model to make use of.

To begin the investigation into these variables, a probability table of the geographical regions was created to better connect location data to the level of damage. This process can be thought of as encoding, or grouping, areas at a similar distance from the origin of the earthquake. Table 2 below shows the table of conditional probabilities created from geographical variable 1. The "counts" column shows the total number of observations of each geographical identifier. The "damage_1", "damage_2", and "damage_3" columns count the intersections of each damage level and the geographical identifier. Finally, the conditional probabilities are calculated by dividing each intersection count by the total count for that identifier, and the results are shown in the last three columns of the table.

Table 2. Conditional Probabilities for Geographical Variable 1

As discussed above, the conditional probabilities shown in Table 2 are calculated using Equation 1, where the frequencies of the intersections are divided by the number of each geographical identifier in the data set.




$$ P(\text{damage} \mid \text{location}) = \frac{\text{Number of buildings damaged at a certain level in a certain location}}{\text{Number of buildings in a certain location}} \tag{1} $$




Based on Equation 1 and the surrounding discussion, the conditional probabilities can be interpreted as the region-specific risk of damage due to the earthquake. Thus, the geographical variables in the test data set can be readily encoded using Table 2.
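As an illustration of Equation 1, the following is a minimal sketch of how such a conditional probability table could be built and then used to encode new observations. It is not the original implementation: it assumes pandas, uses a small made-up training frame, and the target column name "damage_grade" and the "prob_" column prefix are hypothetical names chosen for the example.

```python
import pandas as pd

# Hypothetical toy training data; "damage_grade" is an assumed column name.
train = pd.DataFrame({
    "geo_level_1_id": [0, 0, 0, 1, 1, 2, 2, 2, 2],
    "damage_grade":   [1, 2, 2, 3, 3, 1, 2, 3, 3],
})

# Intersection counts: rows are geographic identifiers, columns are damage levels.
counts = pd.crosstab(train["geo_level_1_id"], train["damage_grade"])
counts.columns = [f"damage_{level}" for level in counts.columns]

# Equation 1: divide each intersection count by the total count for that identifier.
totals = counts.sum(axis=1)
probs = counts.div(totals, axis=0).add_prefix("prob_")

# Assemble a table with the same layout as described for Table 2.
table = pd.concat([totals.rename("counts"), counts, probs], axis=1)

# Encode new data by looking up each identifier's conditional probabilities.
test = pd.DataFrame({"geo_level_1_id": [2, 0, 5]})
encoded = test.merge(probs, how="left", left_on="geo_level_1_id", right_index=True)
```

In this sketch, an identifier that never appears in the training data (5 above) receives missing values after the lookup, so a fallback such as the overall damage distribution would have to be chosen separately.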

Next, the variables related to building materials were investigated for feature extraction, as the materials used in a building could be related to its stability. In the data, building superstructures are categorized as being made of mud, stone, or brick, bonded by different mortars. Principal Component Analysis (PCA) was used to extract components from these variables. The idea behind applying PCA was similar to the feature engineering technique applied to the location variables: to encode the features by similar stability.

The goal of PCA is to find new dimensions that capture the variability of the data. It transforms the data into a new coordinate system such that the component with the greatest variance lies on the first coordinate. In addition, PCA is a common technique for dimensionality reduction. Table 3 shows the five principal components selected from the eleven building material variables in the original data set.

Table 3. Partial Table of the Principal Components Extracted from the Building Material Variables
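The PCA step described above can be sketched as follows. This is an illustrative example under stated assumptions, not the procedure used to produce Table 3: the input is a randomly generated stand-in for eleven binary material indicators, and PCA is computed directly with NumPy's singular value decomposition rather than with any particular library routine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the eleven binary building-material indicators.
X = rng.integers(0, 2, size=(200, 11)).astype(float)

# Center each column; principal directions are defined on mean-centered data.
X_centered = X - X.mean(axis=0)

# SVD of the centered data: the rows of Vt are the principal directions,
# ordered by decreasing singular value and therefore decreasing variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Keep the first five components, matching the number reported in the text.
n_components = 5
components = Vt[:n_components]
scores = X_centered @ components.T  # new features, shape (200, 5)

# Variance captured by each component and its share of the total variance.
explained_variance = S**2 / (X.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()
```

The columns of `scores` would replace the original material variables, and the cumulative sum of `explained_ratio` indicates how much of the total variance the retained components preserve.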

With both feature engineering methods applied, the resulting reduced data set was used to train the top two tuned models: Random Forest and AdaBoost. The outcome is discussed in the following Results section.

Feature engineering decreased the complexity of the model. Since the variables in the original data set were mostly categorical, a numerical transformation of these categorical variables significantly increased the dimensionality of the data. Therefore, PCA's ability to decrease the number of variables in the data set allowed for a decrease in complexity. However, further research would be needed to assess the efficiency of applying PCA to binary variables.