Exploring Wildfire Data
To go further, the components of the wildfire time series need to be explored. Understanding how each component shapes the overall series allows later models to take those components into account. A time series has four components: level, trend, seasonality, and noise. Level, trend, and seasonality are systematic components, while noise (randomness) is a nonsystematic component. Decomposing the time series into these components gives a better understanding of its behavior.
The plots in this exploratory data analysis show the number of wildfires over time for California only. Although the collected data covers all 50 states (and that information will be used in future analysis), the research is scoped to California for simplicity. California tends toward the extremes in the wildfire/drought/flood/weather categories (per the Data Visualization), so the different components of the time series are easier to see in California data.
Time Series Plot
First, it is important to visualize the time series at its most basic level: a simple plot of the data over time. Below is that plot for California wildfire data.
An initial look at the time series plot for California supports a few important observations. First, there does not appear to be a trend in the data, which can be checked later against the decomposition plot. With regard to seasonality, the series shows the familiar characteristics of seasonal data, though this is explored further in other plots. The heights of the peaks and troughs fluctuate, indicating some variation from year to year. In 2007, 2012, and 2014, the troughs are higher than average, which seems to indicate a more active than usual winter wildfire season. The wildfire season in California appears to fall around the summer months, with "bad" wildfire seasons clearly visible as higher-than-average peaks. Finally, this time series appears additive rather than multiplicative, because the amplitudes do not appear to increase exponentially over time.
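For reference, the additive and multiplicative forms contrasted above are usually written as shown here. The symbols are notation chosen for this sketch rather than taken from the analysis: y_t is the observed count, T_t the trend (with the level absorbed into it), S_t the seasonal term, and R_t the remainder.

```latex
\begin{align}
\text{additive:}       \quad y_t &= T_t + S_t + R_t \\
\text{multiplicative:} \quad y_t &= T_t \times S_t \times R_t
\end{align}
```

In the additive form the seasonal swing has roughly the same size regardless of the level of the series, whereas in the multiplicative form the swing scales with the level.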
Lag Plots
A lag plot is a scatter plot in which each value of a time series is plotted against the value of the same series some number of time units earlier or later. Lag plots often reveal more about the seasonality of the data, whether the data are random, and whether autocorrelation is present. Below is a lag plot for the wildfire data.
The lag plot confirms the presence of seasonality through the strong positive correlation at lag 12. Since the data are monthly totals of wildfires in California and wildfires tend to be seasonal, this result makes sense. At lag 6, the relationship is negative because the peaks of the wildfire season are being plotted against the troughs.
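The pattern a lag plot shows can also be summarized numerically. The following is a minimal sketch, assuming a monthly series held in a plain Python list. The data are synthetic (a seasonal curve plus noise), and the helper names `pearson` and `lag_correlation` are illustrative; none of this is the California data or the code used for the plots.

```python
import math
import random

random.seed(0)
# SYNTHETIC stand-in: 10 years of monthly counts with a mid-year peak.
# These are NOT the California wildfire figures.
series = [
    100 + 80 * math.cos(2 * math.pi * (m % 12 - 6) / 12) + random.gauss(0, 10)
    for m in range(120)
]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lag_correlation(values, lag):
    """Correlation of y[t] with y[t - lag]: the point cloud one lag-plot panel shows."""
    return pearson(values[lag:], values[:-lag])


for lag in (1, 6, 12):
    print(f"lag {lag:2d}: r = {lag_correlation(series, lag):+.2f}")
```

On a series with a 12-month cycle, the lag 12 pairs line up peak with peak, while the lag 6 pairs line up peak with trough, which is why the two correlations have opposite signs.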
Further Exploration into Seasonality
A natural next step is to look more closely at the seasonality in the data and confirm it further. There are several ways to do this; two are presented below in interactive form. The first is a heatmap that visualizes the seasonality of the data. The second is a set of seasonal plots.
These plots confirm the clear presence of seasonality in the data. Both graphs also make it clear that wildfires in California tend to peak from June to August.
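The numbers behind a seasonality heatmap and a seasonal plot are a year-by-month grid and its per-month averages. The sketch below builds both from a synthetic monthly series; the data and the variable names (`grid`, `monthly_mean`, `peak_month`) are assumptions for illustration only.

```python
import math
import random

random.seed(0)
# SYNTHETIC stand-in: 10 years of monthly counts with a mid-year peak.
# These are NOT the California wildfire figures.
series = [
    100 + 80 * math.cos(2 * math.pi * (m % 12 - 6) / 12) + random.gauss(0, 10)
    for m in range(120)
]

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# One row per year, one column per month: the grid a seasonality heatmap colors.
grid = [series[start:start + 12] for start in range(0, len(series), 12)]

# Average each column to get the typical level for every calendar month.
monthly_mean = [sum(row[m] for row in grid) / len(grid) for m in range(12)]
peak_month = MONTHS[monthly_mean.index(max(monthly_mean))]

for name, value in zip(MONTHS, monthly_mean):
    print(f"{name}: {value:6.1f}")
print("peak month:", peak_month)
```

Each row of `grid` corresponds to one line in a seasonal plot, and each cell corresponds to one tile in the heatmap.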
Understanding the Trend Better
It is important to understand the trend of the data. To do that, a moving average can be applied and plotted at different intervals to bring out trends in the data. Below are a couple of different intervals and the resulting plots.
From the moving average plots, the trend is not clear at all. In fact, there does not seem to be a trend, which can be further confirmed by decomposing the time series into its fundamental components. The 1-year and 4-year moving averages show years in which the wildfire season was clearly worse than in others. In the 5-year moving average, however, the line is relatively constant over time.
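A moving average of this kind can be sketched in a few lines. The version below is a trailing (not centered) average applied to synthetic monthly data, with window lengths of 12, 48, and 60 months standing in for the 1-, 4-, and 5-year averages; whether the original plots used trailing or centered windows is not stated, so that choice is an assumption.

```python
import math
import random

random.seed(0)
# SYNTHETIC stand-in: 10 years of monthly counts with a mid-year peak.
# These are NOT the California wildfire figures.
series = [
    100 + 80 * math.cos(2 * math.pi * (m % 12 - 6) / 12) + random.gauss(0, 10)
    for m in range(120)
]


def moving_average(values, window):
    """Trailing mean over `window` consecutive observations."""
    running = sum(values[:window])
    out = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest one.
        running += values[i] - values[i - window]
        out.append(running / window)
    return out


raw_spread = max(series) - min(series)
for window in (12, 48, 60):
    smooth = moving_average(series, window)
    spread = max(smooth) - min(smooth)
    print(f"window {window:2d}: spread {spread:6.1f} (raw spread {raw_spread:.1f})")
```

Because each window here spans whole years, the seasonal cycle averages out and only slower movement remains, which is what makes a trend (or its absence) easier to see.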
Decomposed Time Series
To break the time series down into its core components, a decomposition must be run. Decomposing the time series extracts its components: level, seasonality, trend, and noise/randomness. Below is a visual of the decomposed time series for wildfires in California, showing the seasonality, trend, and noise. The noise is labeled as the remainder in this plot.
In the decomposed time series plot, seasonality is clear. No clear trend is visible, however, which supports the initial observation from the original time series plot that there was no trend.
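To make the decomposition step concrete, the sketch below implements a classical additive decomposition: a centered 12-month moving average for the trend, per-month averages of the detrended values for the seasonal term, and whatever is left over as the remainder. This is a swapped-in textbook technique on synthetic data, not necessarily the method that produced the plot above, and the function names are illustrative.

```python
import math
import random

random.seed(0)
# SYNTHETIC stand-in: 10 years of monthly counts with a mid-year peak.
# These are NOT the California wildfire figures.
series = [
    100 + 80 * math.cos(2 * math.pi * (m % 12 - 6) / 12) + random.gauss(0, 10)
    for m in range(120)
]

PERIOD = 12


def centered_ma_12(values):
    """2x12 centered moving average; None where the window does not fit."""
    trend = [None] * len(values)
    for t in range(6, len(values) - 6):
        window = values[t - 6:t + 7]  # 13 points, end points weighted by half
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / 12
    return trend


def decompose_additive(values):
    """Split values into trend, seasonal, and remainder so that they sum back."""
    trend = centered_ma_12(values)
    detrended = [v - t if t is not None else None for v, t in zip(values, trend)]
    means = []
    for m in range(PERIOD):
        vals = [d for i, d in enumerate(detrended)
                if d is not None and i % PERIOD == m]
        means.append(sum(vals) / len(vals))
    offset = sum(means) / PERIOD  # center the seasonal term on zero
    means = [s - offset for s in means]
    seasonal = [means[i % PERIOD] for i in range(len(values))]
    remainder = [v - t - s if t is not None else None
                 for v, t, s in zip(values, trend, seasonal)]
    return trend, seasonal, remainder


trend, seasonal, remainder = decompose_additive(series)
print("seasonal term by month:", [round(s, 1) for s in seasonal[:PERIOD]])
```

The trend and remainder are undefined for the first and last six months because the centered window does not fit there; library implementations handle the edges in different ways.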
Autocorrelation in Time Series
An essential step in analyzing time series data is determining whether the series is stationary. One way to make that determination is by viewing the ACF and PACF plots. The ACF (autocorrelation function) plot is a visualization of the correlations between a time series and its lags. In contrast, the PACF (partial autocorrelation function) plot visualizes the partial correlation coefficients between the series and its lags. Instead of finding the correlations of present values with lags as the ACF does, the PACF finds the correlations of the residuals that remain after removing the effects explained by earlier lags. Below are visualizations of each of the plots described above.
The figure above shows significant correlation at certain lags, specifically a strong negative correlation at lag 6 and a strong positive correlation at lag 12. This is a clear indication of seasonality. In the PACF plot, the first couple of lags show significant correlations, followed by correlations that are not significant. This seems to indicate an autoregressive term in the data.
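The bars in an ACF plot are sample autocorrelations, which can be computed directly. The sketch below does so for a synthetic monthly series and compares lags 6 and 12 against approximate 95% bounds of ±1.96/√n, a common convention for the dashed lines on such plots; whether the figure above uses exactly these bounds is an assumption.

```python
import math
import random

random.seed(0)
# SYNTHETIC stand-in: 10 years of monthly counts with a mid-year peak.
# These are NOT the California wildfire figures.
series = [
    100 + 80 * math.cos(2 * math.pi * (m % 12 - 6) / 12) + random.gauss(0, 10)
    for m in range(120)
]


def acf(values, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


r = acf(series, 24)
bound = 1.96 / math.sqrt(len(series))  # approximate 95% significance bound
for lag in (6, 12):
    flag = "significant" if abs(r[lag]) > bound else "not significant"
    print(f"lag {lag:2d}: r = {r[lag]:+.2f} ({flag}, bound = {bound:.2f})")
```

A seasonal series produces the alternating pattern described above: large negative values near half the period and large positive values at the full period.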
One of the more important uses of the ACF plot is to determine whether the time series is stationary. A stationary series is a time series with a constant mean and variance and no seasonality (periodic fluctuations). Many models applied to time series data assume that each point is independent of the others, and often that is not the case. If a time series can be transformed to be stationary, however, the independence assumptions those models make are better satisfied.
Given the results of the ACF plot, it is fairly clear that this time series is not stationary, since the bars in the chart often fall outside the blue dashed lines. Interestingly, an Augmented Dickey-Fuller test (a test for stationarity) produces a p-value less than 0.01, implying a rejection of the null hypothesis in favor of the alternative hypothesis that the series is stationary. Visual inspection seems to contradict this result. One possible explanation is that the Augmented Dickey-Fuller test checks for a unit root rather than for seasonality, so a strongly seasonal series without a unit root can still reject the null hypothesis.
Finally, the seasonal component should be removed to make the time series stationary. After the seasonal component was removed, the ACF was plotted again. Below is the original ACF alongside the ACF of the stationary version of the time series.
As shown in the figure above, the series is now stationary and ready to be utilized in future analysis.
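The effect of removing the seasonal component can be illustrated as follows. The text does not say how the component was removed, so this sketch assumes the simplest approach, subtracting each calendar month's average, and applies it to synthetic data; seasonal differencing at lag 12 is another common option. The helper names are illustrative.

```python
import math
import random

random.seed(0)
# SYNTHETIC stand-in: 10 years of monthly counts with a mid-year peak.
# These are NOT the California wildfire figures.
series = [
    100 + 80 * math.cos(2 * math.pi * (m % 12 - 6) / 12) + random.gauss(0, 10)
    for m in range(120)
]


def acf_at(values, lag):
    """Sample autocorrelation at a single lag."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    return sum(dev[t] * dev[t - lag] for t in range(lag, n)) / sum(d * d for d in dev)


def remove_seasonal(values, period=12):
    """Subtract each calendar month's average from that month's observations."""
    means = [sum(values[m::period]) / len(values[m::period]) for m in range(period)]
    return [v - means[i % period] for i, v in enumerate(values)]


adjusted = remove_seasonal(series)
for lag in (6, 12):
    print(f"lag {lag:2d}: before {acf_at(series, lag):+.2f}, "
          f"after {acf_at(adjusted, lag):+.2f}")
```

If the seasonal pattern is the main source of autocorrelation, the large values at lags 6 and 12 should shrink toward zero after the adjustment, which is the change the side-by-side ACF figure is meant to show.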