Predicting Levels of Earthquake Damage: A Comparison of Classification Models

Results


Variable Importance

All three of the best-performing models were ensemble methods (Random Forest, Bagging, AdaBoost). To identify the most important variables, the mean decrease in impurity associated with each variable was computed for each of these models. The top six variables were the same across all three models; their importances are plotted in Figure 18.
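The comparison can be sketched as follows. This is a minimal illustration on synthetic data, not the competition dataset; the helper for Bagging is needed because `BaggingClassifier` does not expose `feature_importances_` directly, so impurity importances are averaged over its base trees.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)

# Synthetic stand-in for the earthquake data (illustrative only).
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
features = [f"var_{i}" for i in range(X.shape[1])]

models = {
    "Random Forest": RandomForestClassifier(random_state=0).fit(X, y),
    "Bagging": BaggingClassifier(random_state=0).fit(X, y),
    "AdaBoost": AdaBoostClassifier(random_state=0).fit(X, y),
}

def impurity_importance(model):
    # Random Forest and AdaBoost expose mean-decrease-in-impurity directly;
    # for Bagging, average the importances of the individual trees.
    if hasattr(model, "feature_importances_"):
        return model.feature_importances_
    return np.mean([t.feature_importances_ for t in model.estimators_], axis=0)

imp = pd.DataFrame({name: impurity_importance(m) for name, m in models.items()},
                   index=features)
top6 = imp.mean(axis=1).nlargest(6)  # top six variables averaged across models
print(top6)
```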


Figure 18: Top 6 Most Important Variables per Three Classification Models

Figure 18 shows that the six most important variables per the three best classification models are the three geographical indicators, the building's age, and the building's dimensions (area and height). These importances guided the feature engineering discussed above. The least important variables can be equally informative for modeling and feature engineering; the twenty least important variables for these three models are plotted in Figure 19 for reference.


Figure 19: The 20 Least Important Variables based on Three Classification Models

Model Performance

The final, tuned Random Forest and AdaBoost models were substantially more accurate than the nine models trained at the beginning of the training process. Trained on the full training set, tuned, and used to predict the full testing set, the Random Forest and AdaBoost models achieved accuracies of 72.68% and 69.75%, respectively. Rerunning these models with the feature-engineered variables raised the Random Forest's accuracy to 74.08%. This gain came from making the variables more meaningful to the model, which also improved the interpretability of the features.
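The tune-then-evaluate loop described above can be sketched as below. The parameter grid and data are hypothetical placeholders; the actual grids and the competition data are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class stand-in for the damage-grade data.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# Hypothetical parameter grid; tune on the training set via cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid={"n_estimators": [100, 300], "max_features": ["sqrt", None]},
    cv=3, scoring="accuracy",
)
grid.fit(X_tr, y_tr)

# Evaluate the tuned model on the held-out test set.
acc = accuracy_score(y_te, grid.best_estimator_.predict(X_te))
print(f"tuned RF test accuracy: {acc:.4f}")
```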

Competition Performance

Performance in the competition was measured by a micro-averaged F1 score, since there are three possible damage grades rather than the binary labels a traditional F1 score assumes. The DrivenData competition provided a baseline Random Forest model with an accuracy of 58.15%, 15.93 percentage points below the tuned, feature-engineered Random Forest model produced in this analysis. This model also ranked in the top 10% among more than 5,300 competitors (top score: 75.58%).
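A small worked example of the metric, with illustrative labels: micro-averaging pools true and false positives across all three damage grades, and for single-label multiclass predictions it reduces to plain accuracy.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 2, 3, 3, 2, 1, 2, 3]  # damage grades (illustrative)
y_pred = [1, 2, 3, 2, 2, 1, 3, 3]

micro_f1 = f1_score(y_true, y_pred, average="micro")

# For one label per observation, micro-F1 equals accuracy: 6 of 8 correct.
print(micro_f1)  # 0.75
```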

Final Thoughts on Results and Further Exploration

When designing a model for this dataset, it was important to consider the tradeoff between flexibility and interpretability. Tree-based methods offered enough flexibility without sacrificing interpretability (unlike neural networks), and they made it possible to create feature importance plots, which revealed patterns that exploratory methods alone would not have exposed. The feature engineering approaches used here could also be translated to future post-disaster datasets containing an ordinal variable describing a building's level of damage. A promising future direction is target encoding, in which a categorical variable is replaced with the average target value of its group, which may raise the F1 score. The operations performed and results found in this analysis could thus provide a framework or reference point for further exploration.
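Target encoding as proposed above can be sketched in a few lines. Column names here are hypothetical; the group means must be computed on training data only, with a global-mean fallback for categories unseen at training time.

```python
import pandas as pd

# Illustrative training and test frames (column names are assumptions).
train = pd.DataFrame({
    "geo_level_1_id": ["A", "A", "B", "B", "B", "C"],
    "damage_grade":   [1,   2,   3,   3,   2,   1],
})
test = pd.DataFrame({"geo_level_1_id": ["A", "B", "C", "D"]})

# Mean damage grade per category, learned from the training set only.
means = train.groupby("geo_level_1_id")["damage_grade"].mean()
global_mean = train["damage_grade"].mean()

train["geo_encoded"] = train["geo_level_1_id"].map(means)
# Category "D" never appears in training, so it falls back to the global mean.
test["geo_encoded"] = test["geo_level_1_id"].map(means).fillna(global_mean)
print(test)
```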