Exploring Weather Data
To go further, the components of the average temperature time series need to be explored. Understanding how each component affects the overall series allows later models to take these components into account. Time series data has four components, namely level, trend, seasonality, and noise. Level, trend, and seasonality are systematic components, while noise (randomness) is nonsystematic. Decomposing a time series into these components gives a clearer picture of its behavior.
In this exploratory data analysis, the plots show the average temperature time series for California only. Although the collected data covers all 50 states (and that information will be used in future analysis), scoping the analysis to California keeps it simple. California tends toward the extremes in the wildfire, drought, flood, and weather categories (per the Data Visualization), so the different components of the time series are easier to see in California data.
Time Series Plot
First, it is important to visualize the time series in its most basic form: a simple plot of the data over time. Below is that plot of average temperature over time for California.
An initial look at the time series plot of the California temperature data leads to a few important observations. First, there appears to be a slightly positive trend in average temperature, consistent with global warming. This can be confirmed using the decomposition plot of the time series. Additionally, there is seasonality in the time series, as temperature varies over the seasons. This can also be confirmed using the decomposition plot. Finally, this time series appears additive rather than multiplicative, because the amplitude of the seasonal swings does not appear to grow over time.
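The additive-versus-multiplicative judgment can also be checked numerically. In an additive series the components are summed (level + trend + seasonality + noise), so the yearly swing stays roughly constant; in a multiplicative series the components are multiplied, so the swing grows with the level. The sketch below is only an illustration on synthetic data: `synthetic_monthly_temps`, its parameter values, and `yearly_amplitudes` are hypothetical names and numbers, not the California data or the code behind the plot.

```python
import math
import random


def synthetic_monthly_temps(n_years=30, seed=0):
    """Hypothetical stand-in data: level + linear trend + annual cycle + noise."""
    rng = random.Random(seed)
    temps = []
    for t in range(n_years * 12):
        seasonal = -10.0 * math.cos(2 * math.pi * (t % 12) / 12)  # coldest in month 0
        temps.append(58.0 + 0.01 * t + seasonal + rng.gauss(0, 1.0))
    return temps


def yearly_amplitudes(series):
    """Max minus min within each complete 12-month block."""
    return [
        max(series[i:i + 12]) - min(series[i:i + 12])
        for i in range(0, len(series) - 11, 12)
    ]


temps = synthetic_monthly_temps()
amps = yearly_amplitudes(temps)

# Compare the average yearly swing early and late in the record.
first_decade = sum(amps[:10]) / 10
last_decade = sum(amps[-10:]) / 10
amplitude_ratio = last_decade / first_decade

print(f"mean yearly swing, first decade: {first_decade:.2f}")
print(f"mean yearly swing, last decade:  {last_decade:.2f}")
print(f"ratio (near 1 suggests an additive series): {amplitude_ratio:.2f}")
```

A ratio close to 1 is what an additive series would be expected to produce; a ratio well above 1 would point toward a multiplicative structure.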
Lag Plots
A lag plot is a type of scatter plot in which a time series is plotted against itself shifted some number of time units behind or ahead. Lag plots often reveal more about the seasonality of the data, whether the data is random, and whether autocorrelation is present. Below is a lag plot for the temperature data.
This lag plot is a classic example of seasonality in temperature over the period of a year (or many years). Here seasonality is clearly established. At lag 6, the points are negatively correlated, which makes sense because winter temperatures sit at the opposite end of the annual cycle from summer temperatures six months later. At lag 12, the strong positive correlation reappears, indicating seasonality with a 12-month period. Interestingly, the lag plot also shows correlation at lag 1, meaning that adjacent observations are correlated, so that will have to be taken into account.
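The pattern described above can be reproduced with a small sketch that builds the point pairs a lag plot would scatter and summarizes each lag with a Pearson correlation. It runs on synthetic monthly data rather than the California series; `synthetic_monthly_temps`, `lag_pairs`, and `pearson` are hypothetical helper names introduced for illustration.

```python
import math
import random


def synthetic_monthly_temps(n_years=30, seed=0):
    """Hypothetical stand-in data: level + linear trend + annual cycle + noise."""
    rng = random.Random(seed)
    temps = []
    for t in range(n_years * 12):
        seasonal = -10.0 * math.cos(2 * math.pi * (t % 12) / 12)  # coldest in month 0
        temps.append(58.0 + 0.01 * t + seasonal + rng.gauss(0, 1.0))
    return temps


def lag_pairs(series, lag):
    """Return the (y[t - lag], y[t]) points that a lag plot would scatter."""
    return list(zip(series[:-lag], series[lag:]))


def pearson(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


temps = synthetic_monthly_temps()
lag_corr = {lag: pearson(lag_pairs(temps, lag)) for lag in (1, 6, 12)}
for lag, corr in lag_corr.items():
    print(f"lag {lag:2d}: correlation = {corr:+.2f}")
```

On data with a 12-month cycle, the correlation is expected to be positive at lag 1, strongly negative at lag 6, and strongly positive at lag 12.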
Further Exploration into Seasonality
It is worth taking a further look to confirm the presence of seasonality in the data. There are a number of ways to do this, and two of them are presented below in an interactive manner. The first is a heatmap that visualizes the seasonality of the data. The second is a breakdown of seasonal plots.
Here again, these plots confirm the presence of seasonality in the data. The temperature is higher on average in the summer and lower on average in the winter.
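The data behind a seasonal heatmap is a year-by-month grid, and the seasonal plot compares the same month across years. The sketch below shows one way that grid and the per-month averages might be built. It uses synthetic data, and `synthetic_monthly_temps`, `to_year_month_grid`, `monthly_means`, and the start year of 1990 are all hypothetical choices for illustration.

```python
import math
import random


def synthetic_monthly_temps(n_years=30, seed=0):
    """Hypothetical stand-in data: level + linear trend + annual cycle + noise."""
    rng = random.Random(seed)
    temps = []
    for t in range(n_years * 12):
        seasonal = -10.0 * math.cos(2 * math.pi * (t % 12) / 12)  # coldest in month 0
        temps.append(58.0 + 0.01 * t + seasonal + rng.gauss(0, 1.0))
    return temps


def to_year_month_grid(series, start_year):
    """Map each year to its 12 monthly values (the cells a heatmap would color)."""
    return {
        start_year + i // 12: series[i:i + 12]
        for i in range(0, len(series) - 11, 12)
    }


def monthly_means(grid):
    """Average each calendar month (column) across all years (rows)."""
    rows = list(grid.values())
    return [sum(row[m] for row in rows) / len(rows) for m in range(12)]


temps = synthetic_monthly_temps()
grid = to_year_month_grid(temps, start_year=1990)  # start year is an assumption
means = monthly_means(grid)

summer = sum(means[5:8]) / 3                     # Jun, Jul, Aug
winter = (means[11] + means[0] + means[1]) / 3   # Dec, Jan, Feb
print(f"summer mean: {summer:.1f}, winter mean: {winter:.1f}")
```

A consistently higher summer average than winter average across the grid is the numerical counterpart of the banding seen in a seasonal heatmap.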
Understanding the Trend Better
It is important to understand the trend of the data. To do that, a moving average can be applied with different window lengths and plotted to bring out the trend. Below are a couple of different window lengths and the resulting plots.
All of the moving average plots show a clear trend: temperatures have been rising steadily since the 1970s and continue to rise through the present.
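A moving average of the kind described above can be sketched in a few lines. The example assumes a trailing simple moving average and window lengths of 12, 60, and 120 months; the windows actually used for the plots are not stated, and `synthetic_monthly_temps` and `moving_average` are hypothetical names operating on synthetic data.

```python
import math
import random


def synthetic_monthly_temps(n_years=30, seed=0):
    """Hypothetical stand-in data: level + linear trend + annual cycle + noise."""
    rng = random.Random(seed)
    temps = []
    for t in range(n_years * 12):
        seasonal = -10.0 * math.cos(2 * math.pi * (t % 12) / 12)  # coldest in month 0
        temps.append(58.0 + 0.01 * t + seasonal + rng.gauss(0, 1.0))
    return temps


def moving_average(series, window):
    """Trailing simple moving average; returns len(series) - window + 1 values."""
    running = sum(series[:window])
    out = [running / window]
    for i in range(window, len(series)):
        # Slide the window: add the newest value, drop the oldest.
        running += series[i] - series[i - window]
        out.append(running / window)
    return out


temps = synthetic_monthly_temps()
smoothed = {w: moving_average(temps, w) for w in (12, 60, 120)}  # assumed windows
for window, values in smoothed.items():
    print(f"window {window:3d}: first = {values[0]:.2f}, last = {values[-1]:.2f}")
```

Windows that are multiples of 12 months average over whole annual cycles, so the seasonal swing largely cancels and the slower trend becomes visible.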
Decomposed Time Series
In order to truly break down the time series into its core components, a decomposition must be run. Decomposing the time series separates it into its four components: level, trend, seasonality, and noise (randomness). Below is a visual of the decomposed time series of average temperature in California. The plot shows the seasonality, trend, and noise; the noise is labeled as the remainder.
The plot of the decomposed time series shows a clear positive trend in temperature and a clear presence of seasonality in the data.
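To make the decomposition idea concrete, the sketch below implements a classical additive decomposition: a centered 2x12 moving average for the trend, per-month averages of the detrended values for the seasonal component, and whatever is left as the remainder. This is an assumption made for illustration and may differ from the method used to produce the plot; the data is synthetic, and all function names are hypothetical. In this sketch the level is absorbed into the trend component.

```python
import math
import random


def synthetic_monthly_temps(n_years=30, seed=0):
    """Hypothetical stand-in data: level + linear trend + annual cycle + noise."""
    rng = random.Random(seed)
    temps = []
    for t in range(n_years * 12):
        seasonal = -10.0 * math.cos(2 * math.pi * (t % 12) / 12)  # coldest in month 0
        temps.append(58.0 + 0.01 * t + seasonal + rng.gauss(0, 1.0))
    return temps


def centered_ma_12(series):
    """Centered 2x12 moving average; None where the 13-point window is incomplete."""
    n = len(series)
    trend = [None] * n
    for t in range(6, n - 6):
        window = series[t - 6:t + 7]  # 13 points, end points get half weight
        trend[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / 12
    return trend


def decompose_additive(series):
    """Classical additive decomposition into trend, seasonal, and remainder."""
    trend = centered_ma_12(series)
    # Average the detrended values for each calendar month.
    by_month = [[] for _ in range(12)]
    for t, (y, tr) in enumerate(zip(series, trend)):
        if tr is not None:
            by_month[t % 12].append(y - tr)
    month_means = [sum(vals) / len(vals) for vals in by_month]
    center = sum(month_means) / 12  # force the seasonal effects to sum to zero
    profile = [m - center for m in month_means]
    seasonal = [profile[t % 12] for t in range(len(series))]
    remainder = [
        None if tr is None else y - tr - s
        for y, tr, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, remainder


temps = synthetic_monthly_temps()
trend, seasonal, remainder = decompose_additive(temps)
print(f"trend at start: {trend[6]:.2f}, trend at end: {trend[-7]:.2f}")
print(f"seasonal effect, month 0: {seasonal[0]:+.2f}, month 6: {seasonal[6]:+.2f}")
```

Wherever the trend is defined, the three components add back up to the original value, which is the defining property of an additive decomposition.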
Autocorrelation in Time Series
An essential step in analyzing time series data is determining whether the series is stationary. One way to make that determination is to view the ACF and PACF plots. The ACF (autocorrelation function) plot visualizes the correlations between a time series and its lags. In contrast, the PACF (partial autocorrelation function) plot visualizes the partial correlation coefficients between the series and its lags. Basically, instead of finding correlations of present values with lags as the ACF does, it finds correlations of the residuals that remain after removing the effects explained by earlier lags. Below are visualizations of each of the plots described above.
The figure above indicates significant correlation at certain lags: strongly negative at lag 6 and strongly positive at lag 12. This is a clear indication of seasonality. The PACF plot shows a large spike at lag 1 followed by a damped wave of alternating positive and negative coefficients. This pattern generally suggests a higher-order moving average term in the data.
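The quantities behind these two plots can be computed directly. The sketch below estimates the sample ACF, derives the PACF from it with the Durbin-Levinson recursion, and counts how many lags fall outside approximate 95% bounds of ±1.96/√n, a common convention for the dashed lines on ACF plots. It runs on synthetic data, and `synthetic_monthly_temps`, `acf`, and `pacf_from_acf` are hypothetical names, not the functions used to draw the figure.

```python
import math
import random


def synthetic_monthly_temps(n_years=30, seed=0):
    """Hypothetical stand-in data: level + linear trend + annual cycle + noise."""
    rng = random.Random(seed)
    temps = []
    for t in range(n_years * 12):
        seasonal = -10.0 * math.cos(2 * math.pi * (t % 12) / 12)  # coldest in month 0
        temps.append(58.0 + 0.01 * t + seasonal + rng.gauss(0, 1.0))
    return temps


def acf(series, max_lag):
    """Sample autocorrelations r[0..max_lag]; r[0] is always 1."""
    n = len(series)
    mean = sum(series) / n
    dev = [y - mean for y in series]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]


def pacf_from_acf(r):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    out = [1.0]
    prev = []  # prev[j] holds phi_{k-1, j+1}
    for k in range(1, len(r)):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        out.append(phi_kk)
    return out


temps = synthetic_monthly_temps()
r = acf(temps, max_lag=24)
p = pacf_from_acf(r)
bound = 1.96 / math.sqrt(len(temps))  # approximate 95% bounds
outside = [k for k in range(1, len(r)) if abs(r[k]) > bound]

print(f"ACF  lag 6: {r[6]:+.2f}, lag 12: {r[12]:+.2f}")
print(f"PACF lag 1: {p[1]:+.2f}")
print(f"{len(outside)} of 24 ACF lags fall outside +/-{bound:.3f}")
```

By construction the PACF at lag 1 equals the ACF at lag 1, since there are no earlier lags whose effect needs to be removed.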
One of the more important outcomes of graphing the ACF plot is determining whether the time series is stationary. A stationary series is a time series with a constant mean and variance and no seasonality (periodic fluctuations). Many models applied to time series data assume that each point is independent of the others, which is often not the case. If a time series can be transformed to become stationary, however, those independence assumptions become reasonable for the transformed series.
Given the results of the ACF plot, it is pretty clear that this time series is not stationary, since the bars in the chart often fall outside the blue dashed lines. Interestingly, an Augmented Dickey-Fuller test (a test for stationarity) produces a p-value less than 0.01, implying rejection of the null hypothesis in favor of the alternative hypothesis that the series is stationary. Visual inspection seems to contradict this result. One likely explanation is that the null hypothesis of the Augmented Dickey-Fuller test is the presence of a unit root, so the test can reject it for a series that has no unit root yet still shows strong seasonality.
Finally, the seasonality and autocorrelation should be removed to make the time series stationary. A comparison of the methods is shown below. Once the transformations are applied, the series is stationary, which can also be seen in the plot below.
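The specific transformations compared in the figure are not named here, so the sketch below simply assumes two common ones: seasonal differencing at lag 12 to remove the annual cycle, followed by first differencing to remove any remaining drift. It uses synthetic data, and `synthetic_monthly_temps`, `difference`, and `variance` are hypothetical helper names.

```python
import math
import random


def synthetic_monthly_temps(n_years=30, seed=0):
    """Hypothetical stand-in data: level + linear trend + annual cycle + noise."""
    rng = random.Random(seed)
    temps = []
    for t in range(n_years * 12):
        seasonal = -10.0 * math.cos(2 * math.pi * (t % 12) / 12)  # coldest in month 0
        temps.append(58.0 + 0.01 * t + seasonal + rng.gauss(0, 1.0))
    return temps


def difference(series, lag=1):
    """Return y[t] - y[t - lag]; the result is `lag` values shorter."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def variance(values):
    """Population variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


temps = synthetic_monthly_temps()

# Assumed transformations: seasonal difference (lag 12), then first difference.
seasonal_diff = difference(temps, lag=12)
fully_diff = difference(seasonal_diff, lag=1)

print(f"variance of original series:          {variance(temps):.2f}")
print(f"variance after seasonal differencing: {variance(seasonal_diff):.2f}")
print(f"mean after both differences:          {sum(fully_diff) / len(fully_diff):+.3f}")
```

A sharp drop in variance after the lag-12 difference indicates that most of the variation came from the annual cycle, and a mean near zero after both differences is consistent with a series that no longer drifts.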
As shown in the figure above, the series is now stationary and ready to be utilized in future analysis.