Exploring Flood Data
To go further, it is necessary to explore the components of the flood time series. Understanding how each component affects the overall series allows later models to take these components into account. Time series data has four components: level, trend, seasonality, and noise. Level, trend, and seasonality are systematic components, while noise (randomness) is nonsystematic. Decomposing a time series into these components gives a better understanding of it.
In this exploratory data analysis, the plots show a time series of the number of floods for California only. Although the collected data covers all 50 states (and that information will be used in future analysis), the analysis here is scoped to California for simplicity. California tends toward the extremes in the wildfire/drought/flood/weather category (per the Data Visualization), so the different components of the time series are easier to see in California data.
Time Series Plot
Firstly, it is important to visualize the time series at its most basic level, which is a simple plot of data over time. Below is that plot of flood data over time for California.
An initial look at the time series plot of the flood data for California yields a few important observations. First, there appears to be a slight trend in the data. This can be confirmed using the plot of the decomposed time series. With regard to seasonality, the data does not appear to show the classic characteristics of a seasonal series. It could be argued that floods in California are more likely in the winter months, since the rainy season tends to cause more floods then. This is not an incorrect argument, but in the basic time series plot a seasonal pattern does not seem prevalent. More notable than seasonality are the large spikes in 2014, 2017, and 2019. These indicate a large number of floods and, without prior knowledge of the data, look almost like outliers. Finally, this time series appears multiplicative rather than additive, as the amplitude of the fluctuations appears to grow over time.
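For reference, the additive and multiplicative forms mentioned above are conventionally written as follows. The symbols are notation chosen here for illustration, not taken from the original analysis: $y_t$ is the observed flood count at time $t$, and $T_t$, $S_t$, and $R_t$ are the trend (with the level absorbed into it), seasonal, and remainder components.

```latex
\begin{align}
y_t &= T_t + S_t + R_t && \text{(additive)} \\
y_t &= T_t \times S_t \times R_t && \text{(multiplicative)} \\
\log y_t &= \log T_t + \log S_t + \log R_t && \text{(multiplicative, after a log transform)}
\end{align}
```

The last line shows why a log transform is a common way to handle a multiplicative series with additive tools. It only applies when $y_t > 0$, so periods with zero floods would need separate handling.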
Lag Plots
A lag plot is a type of scatter plot in which a time series is plotted against itself shifted some number of time units behind or ahead. Lag plots often reveal more about the data: whether it is seasonal, whether it is random, and whether it is autocorrelated. Below is a lag plot for the flood data.
The lag plots confirm the lack of strong seasonality, as there is no apparent correlation between the series and its lags.
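To make the idea of a lag plot concrete, here is a minimal sketch of how the plotted points are formed and how the strength of the relationship at a given lag could be summarized. The function names and the `floods` values are hypothetical and are not the data or code behind the figure.

```python
import math

def lag_pairs(series, lag):
    """Return the (y[t - lag], y[t]) pairs that a lag plot draws as points."""
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return list(zip(series[:-lag], series[lag:]))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

def lag_correlation(series, lag):
    """Correlation between the series and itself shifted by `lag` steps."""
    pairs = lag_pairs(series, lag)
    return pearson([p[0] for p in pairs], [p[1] for p in pairs])

# Hypothetical monthly flood counts (made up for illustration only).
floods = [3, 0, 1, 0, 0, 2, 0, 0, 1, 0, 4, 7,
          5, 1, 0, 0, 1, 0, 0, 0, 2, 1, 3, 9]
for lag in (1, 6, 12):
    print(lag, round(lag_correlation(floods, lag), 3))
```

A point cloud with no visible shape at a given lag corresponds to a correlation near zero, while a tight diagonal band at lag 12 in monthly data would point to yearly seasonality.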
Further Exploration into Seasonality
It is worth taking a closer look to confirm or deny the presence of seasonality in the data. There are several ways of doing this, and two of them are presented below in an interactive manner. The first is a heatmap that visualizes the seasonality of the data. The second is a breakdown of seasonal plots.
Here again, these plots confirm the lack of seasonality in the data. Both graphs show that although the worst of the floods tend to occur in the winter, there is no regular pattern that indicates seasonality.
Understanding the Trend Better
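A seasonality heatmap of this kind is essentially a year-by-month grid of counts. The sketch below shows one way such a grid and its per-month averages could be built, assuming the records arrive as `(year, month, count)` tuples; that input shape, the function names, and the sample values are all hypothetical.

```python
from collections import defaultdict

def month_year_grid(records):
    """Build {year: [count for Jan..Dec]} from (year, month, count) tuples."""
    grid = defaultdict(lambda: [0] * 12)
    for year, month, count in records:
        if not 1 <= month <= 12:
            raise ValueError("month must be in 1..12")
        grid[year][month - 1] += count
    return dict(grid)

def monthly_means(grid):
    """Average count per calendar month across all years in the grid."""
    years = list(grid)
    return [sum(grid[y][m] for y in years) / len(years) for m in range(12)]

# Hypothetical records (made up for illustration only).
records = [(2014, 1, 4), (2014, 12, 2), (2015, 1, 2), (2015, 7, 1)]
grid = month_year_grid(records)
for year in sorted(grid):
    print(year, grid[year])
print([round(m, 2) for m in monthly_means(grid)])
```

If the same months stood out in every row of the grid, that would indicate seasonality; isolated hot cells in a few years point instead to one-off events.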
It is important to understand the trend of the data. To do that, a moving average can be computed over windows of different lengths and plotted to bring out trends in the data. Below are a couple of different window lengths and the resulting plots.
From the figures above, it is relatively clear that the trend is steady and positive even though the variation is large. The 5-year moving average shows the number of floods continuing to increase.
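As an illustration of the smoothing step, the following is a minimal sketch of a trailing moving average over a fixed window. It is not the code used for the figures; the function name and the yearly counts are hypothetical.

```python
def moving_average(series, window):
    """Trailing moving average; returns len(series) - window + 1 values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    running = sum(series[:window])
    averages = [running / window]
    for i in range(window, len(series)):
        # Slide the window: add the newest value, drop the oldest one.
        running += series[i] - series[i - window]
        averages.append(running / window)
    return averages

# Hypothetical yearly flood counts (made up for illustration only).
yearly = [12, 9, 15, 11, 18, 30, 14, 22, 35, 19, 41]
for window in (3, 5):
    print(window, [round(v, 1) for v in moving_average(yearly, window)])
```

A longer window smooths out more of the year-to-year variation, which makes a slow upward drift easier to see at the cost of hiding individual spikes.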
Decomposed Time Series
To break the time series data down into its core components, a decomposition must be run. Decomposing the time series extracts its four components: level, trend, seasonality, and noise (randomness). Below is a visual of the decomposed time series data for floods in California. The plot shows the seasonality, the trend, and the noise, which is labeled as the remainder.
When looking at the decomposed time series plot, the presence of a slightly positive trend is clear, though the trend is weak at best.
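The decomposition method behind the plot is not specified here, so the following is only a sketch of one common approach, classical additive decomposition: estimate the trend with a centered moving average, average the detrended values by position in the cycle to get the seasonal component, and treat what is left as the remainder. The function names and the `period` argument are assumptions for illustration.

```python
def centered_moving_average(series, period):
    """Centered moving average; None where the window does not fit."""
    n = len(series)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 0:
            # Even period: half-weight the two end points (a 2 x period MA).
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period
    return trend

def decompose_additive(series, period):
    """Return (trend, seasonal, remainder) with y = trend + seasonal + remainder."""
    if period < 2 or len(series) < 2 * period:
        raise ValueError("need period >= 2 and at least two full cycles")
    trend = centered_moving_average(series, period)
    # Collect detrended values by their position within the cycle.
    buckets = [[] for _ in range(period)]
    for t, (y, tr) in enumerate(zip(series, trend)):
        if tr is not None:
            buckets[t % period].append(y - tr)
    raw = [sum(b) / len(b) for b in buckets]
    offset = sum(raw) / period  # center the seasonal pattern on zero
    pattern = [r - offset for r in raw]
    seasonal = [pattern[t % period] for t in range(len(series))]
    remainder = [None if tr is None else y - tr - s
                 for y, tr, s in zip(series, trend, seasonal)]
    return trend, seasonal, remainder
```

For monthly counts, `decompose_additive(monthly_counts, 12)` would return three lists aligned with the input, with `None` in the trend and remainder at the first and last six positions. A series that looks multiplicative could be log-transformed first, provided it contains no zeros.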
Autocorrelation in Time Series
An essential step in analyzing time series data is determining whether the series is stationary. One way to make that determination is by viewing the ACF and PACF plots. The ACF (autocorrelation function) plot is a visualization of the correlations between a time series and its lags. In contrast, the PACF (partial autocorrelation function) plot visualizes the partial correlations between the series and its lags. Basically, instead of finding correlations of present values with lags like the ACF, it finds correlations of the residuals after removing the effects explained by earlier lags. Below are visualizations of each of the plots described above.
The ACF plot above suggests the series is stationary, with no notable autocorrelation beyond the first lag. From the PACF plot, we can see a significant correlation at the first lag followed by correlations that are not significant. This seems to indicate an autoregressive term in the data.
One of the more important uses of the ACF plot is to help determine whether the time series is stationary. A stationary series is a time series with a constant mean and variance and no seasonality (periodic fluctuations). Many models applied to time series data assume that each point is independent of the others, which is often not the case. However, if a time series is stationary, or can be transformed to become stationary, the independence assumptions those models make become workable.
Given the results of the ACF plot, the time series appears to be stationary, since the bars in the chart fall inside the blue dashed lines. Furthermore, an Augmented Dickey-Fuller test (a test for stationarity) produces a p-value less than 0.01, implying a rejection of the null hypothesis in favor of the alternative hypothesis that the series is stationary. Visual inspection seems to confirm these results.
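To show what the two plots compute, here is a minimal sketch of the sample autocorrelation function and of partial autocorrelations derived from it with the Durbin-Levinson recursion. The function names are hypothetical, and this is not necessarily how the plots above were produced.

```python
def acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than len(series)")
    mean = sum(series) / n
    dev = [y - mean for y in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]

def pacf_from_acf(r):
    """Partial autocorrelations for lags 1..K from r = [r_0, r_1, ..., r_K]."""
    result = []
    prev = []  # AR coefficients fitted at the previous order
    for k in range(1, len(r)):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        result.append(phi_kk)
    return result
```

With these, `pacf_from_acf(acf(series, 12))` gives the partial autocorrelations for lags 1 through 12. A PACF that is large at lag 1 and near zero afterward is the pattern usually associated with a first-order autoregressive term.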
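The dashed lines on an ACF plot are conventionally drawn at approximately ±1.96/√n, the 95% bounds expected for white noise with n observations. Assuming that convention applies to the plot above, the check "are the bars inside the lines" can be sketched as follows; the function names and the sample values are hypothetical.

```python
import math

def white_noise_bound(n_obs, z=1.96):
    """Approximate 95% bound for sample autocorrelations of white noise."""
    if n_obs < 1:
        raise ValueError("n_obs must be positive")
    return z / math.sqrt(n_obs)

def significant_lags(acf_values, n_obs, z=1.96):
    """Lags (excluding lag 0) whose autocorrelation falls outside the bound."""
    bound = white_noise_bound(n_obs, z)
    return [k for k, r in enumerate(acf_values) if k > 0 and abs(r) > bound]

# Hypothetical autocorrelations for lags 0..3 from a series of 100 points.
print(significant_lags([1.0, 0.4, -0.1, 0.25], 100))
```

An empty or nearly empty result is consistent with a stationary series but is not proof of it, which is why a formal test such as the Augmented Dickey-Fuller test is a useful complement to the visual check.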
Considering this time series is stationary, a final plot of the ACF is provided below with no additional modifications required.
As shown in the figure above, the series is stationary and ready to be utilized in future analysis.