Exploring Drought Data
Before going further, it is necessary to explore the components of the drought time series. Understanding how each component affects the overall series allows later models to take these components into account. A time series has four components: level, trend, seasonality, and noise. Level, trend, and seasonality are systematic components; noise (randomness) is a nonsystematic component. Decomposing the series into these components gives a better understanding of its behavior.
In this exploratory data analysis, the plots show the drought index time series for California only. Although the collected data covers all 50 states (and that information will be used in future analysis), scoping the analysis to California keeps things simple. California tends toward the extreme in the wildfire/drought/flood/weather categories (per the Data Visualization), which makes the different components of the time series easier to see.
Time Series Plot
First, it is important to visualize the time series at its most basic level: a simple plot of the data over time. Below is that plot of drought data over time for California.
An initial look at the time series plot of the drought data for California suggests a few important observations. First, there does not appear to be a trend in the data, and there does not appear to be seasonality either. Both impressions can be checked against the decomposition of the time series. Second, the series shows some signs of heteroskedasticity, meaning its variance is unequal across the range of values. Further investigation of stationarity will show whether this initial observation is correct. Finally, the series appears additive rather than multiplicative, because there does not appear to be an exponential increase in amplitude over time.
Lag Plots
A lag plot is a type of scatter plot in which a time series is plotted against a copy of itself shifted some number of time units behind or ahead. Lag plots can reveal seasonality, randomness, or autocorrelation in the data. Below is a lag plot for the drought data.
The lag plot indicates some correlation at the first and second lags, but beyond the fifth lag the plots indicate no correlation. This is consistent with the initial impression that the time series lacks seasonality.
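The pattern a lag plot shows can also be summarized numerically as the correlation between the series and a lagged copy of itself. The sketch below is a minimal illustration using only the Python standard library. The helper `lag_correlation` and the synthetic series are assumptions made for the example, not the drought data or the tooling used for the plots above.

```python
import random

def lag_correlation(series, lag):
    """Pearson correlation between series[t] and series[t - lag]."""
    x = series[lag:]       # values at time t
    y = series[:-lag]      # values at time t - lag
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Synthetic stand-in for the drought index (assumption, not the real data):
# each value carries over 0.8 of the previous value plus random noise.
random.seed(0)
index = [0.0]
for _ in range(499):
    index.append(0.8 * index[-1] + random.gauss(0, 1))

for lag in (1, 2, 5, 12):
    print(f"lag {lag:2d}: r = {lag_correlation(index, lag):+.2f}")
```

A seasonal series would show the correlation rising again near the seasonal lag (for example, lag 12 in monthly data). A correlation that simply fades with increasing lag points to persistence without seasonality.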
Further Exploration into Seasonality
It is worth taking a closer look at whether seasonality is present in the data. There are a number of ways to do this, and a couple of fun ones are presented below in an interactive manner. The first is a heatmap that visualizes the seasonality of the data. The second is a breakdown of seasonal plots.
Here again, these plots confirm the lack of seasonality in the data. When exploring the heatmap, it is interesting to note the spells of drought (indicated by the lighter color). So rather than seasonality, perhaps there are cyclical patterns within the data. A specific example is the lengthy drought in 2007-2009, followed by no drought in 2010-2012.
Understanding the Trend Better
It is important to understand the trend of the data. To do that, a moving average can be applied and plotted at different intervals to bring out trends in the data. Below are a couple of different intervals and the resulting plots.
Because a negative value of the drought index implies a higher level of drought, the general negative trend in the moving average plots implies more drought in California than usual. This is clearest in the 20-year moving average. Interestingly, the other moving average plots show quite large variation, implying extreme drought followed by heavy precipitation, which confirms what many believe.
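To make the smoothing concrete, here is a minimal sketch of a trailing moving average in plain Python. The function name `moving_average` and the synthetic yearly values are assumptions for illustration only. The plots above may have been produced with a different (for example, centered) window.

```python
def moving_average(series, window):
    """Trailing moving average; the first window - 1 points have no value."""
    out = [None] * (window - 1)
    running = sum(series[:window])
    out.append(running / window)
    for t in range(window, len(series)):
        # Slide the window: add the newest value, drop the oldest.
        running += series[t] - series[t - window]
        out.append(running / window)
    return out

# Synthetic yearly index (assumption): a slow downward drift with
# large wet/dry swings layered on top.
yearly = [(-0.02 * y) + (1.5 if y % 2 else -1.5) for y in range(60)]

smooth5 = moving_average(yearly, 5)
smooth20 = moving_average(yearly, 20)
print("last 5-year average: ", round(smooth5[-1], 2))
print("last 20-year average:", round(smooth20[-1], 2))
```

A short window follows the swings closely, while a long window averages them out and leaves mostly the slow drift. This is why a long-run tendency is easiest to see at the longest interval.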
Decomposed Time Series
To break the time series down into its core components, a decomposition must be run. Decomposition extracts the four components: level, seasonality, trend, and noise/randomness. Below is a visual of the decomposed time series data for drought in California. The plot shows the seasonality, trend, and noise. In this plot, the noise is labeled as the remainder.
When looking at the plot of the decomposed time series, it is pretty clear there is no trend.
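For readers who want to see what a decomposition computes, the sketch below implements a classical additive decomposition (series = trend + seasonal + remainder) in plain Python. It is a simplified stand-in and not necessarily the method behind the plot above. The function name, the period of 12, and the synthetic series are all assumptions made for illustration.

```python
import random

def decompose_additive(series, period):
    """Classical additive decomposition: series = trend + seasonal + remainder.

    Assumes an even period (e.g. 12 for monthly data) and at least two full
    periods of data. Trend and remainder are None near the edges, where the
    centered window does not fit.
    """
    n, half = len(series), period // 2
    trend = [None] * n
    for t in range(half, n - half):
        inner = series[t - half + 1 : t + half]             # full-weight points
        ends = 0.5 * (series[t - half] + series[t + half])  # half-weight endpoints
        trend[t] = (sum(inner) + ends) / period

    # Average the detrended values for each position in the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    raw = [sum(b) / len(b) for b in buckets]
    offset = sum(raw) / period            # make seasonal effects sum to zero
    seasonal = [raw[t % period] - offset for t in range(n)]

    remainder = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, remainder

# Synthetic monthly series (assumption): slow drift plus noise, with no
# seasonal pattern built in.
random.seed(1)
series = [-0.005 * t + random.gauss(0, 1) for t in range(240)]
trend, seasonal, remainder = decompose_additive(series, period=12)
print("seasonal range:", round(max(seasonal) - min(seasonal), 2))
```

When a series has no real seasonality, the estimated seasonal component stays small relative to the remainder. Comparing the scales of the panels in a decomposition plot is a quick way to check this.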
Autocorrelation in Time Series
An essential step in analyzing time series data is determining whether the series is stationary. One way to make that determination is by viewing the ACF and PACF plots. The ACF (autocorrelation function) plot visualizes the correlations between a time series and its lags. In contrast, the PACF (partial autocorrelation function) plot visualizes the partial correlation coefficients between the series and its lags. Instead of correlating present values with lagged values directly, as the ACF does, it finds the correlations that remain after removing the effects already explained by shorter lags. Below are visualizations of each of these plots.
The ACF plot is very telling. Its slow decay shows that future values are heavily correlated with past values. The PACF plot shows a significant correlation at the first lag, followed by correlations that are not significant. This seems to indicate an autoregressive term in the data.
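The quantities behind these two plots can be sketched in a few lines of plain Python. The example below computes sample autocorrelations directly and derives partial autocorrelations from them with the Durbin-Levinson recursion. The function names and the synthetic autoregressive series are assumptions for illustration, and statistical packages may use slightly different estimators.

```python
import random

def acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]

def pacf(series, max_lag):
    """Partial autocorrelations for lags 1..max_lag (Durbin-Levinson)."""
    rho = acf(series, max_lag)
    phi_prev, out = [], []
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        out.append(phi_kk)
    return out

# Synthetic stand-in (assumption): a first-order autoregressive series.
random.seed(0)
series = [0.0]
for _ in range(499):
    series.append(0.8 * series[-1] + random.gauss(0, 1))

band = 1.96 / len(series) ** 0.5   # approximate 95% significance band
for k, (a, p) in enumerate(zip(acf(series, 10)[1:], pacf(series, 10)), start=1):
    flag = "*" if abs(p) > band else " "
    print(f"lag {k:2d}: acf = {a:+.2f}  pacf = {p:+.2f} {flag}")
```

For a series like this one, the autocorrelations fade gradually while the partial autocorrelation is large only at the first lag. That is the same signature described above for the drought data.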
One of the more important uses of the ACF plot is determining whether the time series is stationary. A stationary series is a time series with a constant mean and variance and no seasonality (periodic fluctuations). Many time series models assume that each point is independent of the others, which is often not the case. If a time series can be transformed to become stationary, however, the independence assumptions these models make become reasonable for the transformed series.
Given the results of the ACF plot, the time series does not look stationary, since the bars in the chart often fall outside the blue dashed lines. Interestingly, an Augmented Dickey-Fuller test (a test for stationarity) produces a p-value less than 0.01, which implies rejecting the null hypothesis of a unit root in favor of the alternative hypothesis that the series is stationary. Visual inspection seems to contradict this result.
Finally, to make the time series stationary, the autocorrelation should be removed, which is typically done by differencing. Additionally, a log of the time series should be taken to reduce heteroskedasticity. A comparison of the methods is shown below. Once the transformations are applied, the series is stationary, as the plot below also shows.
As can be seen in the figure above, the series is now stationary and ready to be utilized in future analysis.
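As a rough illustration of how removing autocorrelation works, the sketch below applies first differencing, a common technique for this purpose, to a synthetic series. The source does not state exactly which transformation was applied, so the function names and data here are assumptions. The log step is left out because an index that takes negative values would first need to be shifted to be positive, and the source does not say how that was handled.

```python
import random

def difference(series, lag=1):
    """Replace each value with its change from `lag` steps earlier."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def lag1_autocorrelation(series):
    """Sample autocorrelation at lag 1, used as a quick persistence check."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    return sum(dev[t] * dev[t - 1] for t in range(1, n)) / sum(d * d for d in dev)

# Synthetic stand-in (assumption, not the drought data): a random walk,
# a textbook non-stationary series whose level wanders without settling.
random.seed(2)
walk = [0.0]
for _ in range(499):
    walk.append(walk[-1] + random.gauss(0, 1))

diffed = difference(walk)
print(f"lag-1 autocorrelation before: {lag1_autocorrelation(walk):+.2f}")
print(f"lag-1 autocorrelation after:  {lag1_autocorrelation(diffed):+.2f}")
```

The differenced series is one observation shorter than the original. Any model fitted to it predicts changes in the index rather than the index level itself.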