I had the chance to work on a time series project for a prestigious company. The actual data cannot be shared for confidentiality reasons, but I would like to share my main general learnings below for anyone interested in time series modeling. Please note that some chart measurements and labels were purposefully left out to protect sensitive data.
I started by exploring the seasonality pattern in the data. I used a technique called seasonal decomposition, which breaks the original data down into trend, seasonality, and noise (residual) components (see below). I noted that the target variable grows year over year (YoY) with an annual seasonality pattern.
On top of that, I wanted to build a model that predicts based not only on the time series component but also on other features, also called exogenous features/variables. For example, a prediction of a company’s revenue should be based not only on previous revenue but also on other factors such as GDP growth, market share growth, etc. In addition, I used the Dickey-Fuller test to assess the stationarity of the data, which is one of the important assumptions for time series modeling. A stationary time series is one whose statistical properties such as mean, variance, autocorrelation, etc. are all constant over time. I noted that the original data is not stationary and only becomes stationary after differencing.
I ended up selecting SARIMAX for the modeling, as it satisfies all of my requirements. Compared to ARIMA, the S adds a seasonality component and the X adds exogenous variables. Like ARIMA, it can also model the differenced series.
During the actual modeling, I used auto_arima for hyper-parameter tuning, with AIC and BIC as performance metrics. Their values depend on the data; in this project they ranged anywhere from hundreds to thousands, and lower is better. However, no matter how I tuned the model, the AIC and BIC stayed stuck in the 6000s.
I went back and re-examined the data, and noticed a possible delayed impact from the exogenous variables. For example, when an economic event occurs, it may not have an immediate impact on the business outcome; the impact may be delayed by a few weeks, or even a few months. Recognizing that, I shifted the exogenous variables by different numbers of periods and assessed which shift worked best. Boom, all of a sudden the AIC and BIC dropped from the 6000s to the 600s.
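The shifting idea can be sketched with pandas. Here I score each candidate lag by its correlation with the target, which is a simplifying assumption for illustration; another option is to refit the model for each lag and compare the results. The data is synthetic, with a true delay of 3 periods built in.

```python
import numpy as np
import pandas as pd

# Synthetic data (illustrative only): the target reacts to the feature 3 periods later.
rng = np.random.default_rng(3)
n = 200
feature = pd.Series(rng.normal(0, 1, n))
target = 2.0 * feature.shift(3) + pd.Series(rng.normal(0, 0.3, n))

# Try each candidate lag: shift the feature forward in time and measure
# how strongly it lines up with the target.
correlations = {}
for lag in range(0, 9):
    shifted = feature.shift(lag)
    correlations[lag] = target.corr(shifted)  # corr skips rows containing NaN

best_lag = max(correlations, key=lambda k: abs(correlations[k]))

# Build the lagged feature and drop the rows that shifting left empty.
data = pd.DataFrame(
    {"target": target, "feature_lagged": feature.shift(best_lag)}
).dropna()
print(best_lag, len(data))
```

The lagged column can then be passed to the model as the exogenous input in place of the unshifted feature.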
Due to time constraints, I ended my analysis there. My prediction results are shown below. The orange curve represents historical actuals and the blue curve represents the model's predictions. The overlap represents the testing period. As you can see from the testing period, the predictions picked up the peaks and valleys of the actual data, which indicates decent performance.
Last but not least, let me summarize my key learnings below:
- Separating trend from seasonality: seasonal decomposition
- Model selection: SARIMAX is helpful for modeling both seasonality and exogenous variables
- Hyper-parameter tuning: auto_arima automates hyper-parameter tuning and makes it more efficient
- Delayed impact: exogenous features can have a delayed impact on the target variable. One approach is to identify the appropriate delay period and shift the exogenous feature accordingly.