Created date
Jun 16, 2022 01:21 PM
Data Science
Applied forecasting
This section builds up:
  • How to make a regression forecast
  • Metrics for choosing a good regression model
  • View residual plots
  • Correlation, causation and forecasting behind Regression

Multiple regression and forecasting

$$y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \cdots + \beta_k x_{k,t} + \varepsilon_t$$
  • y_t is the variable we want to predict: the “response” variable.
  • Each x_{j,t} is numerical and is called a “predictor”. The predictors are usually assumed to be known for all past and future times.
  • The coefficients β_1, …, β_k measure the effect of each predictor after taking account of the effect of all other predictors in the model. That is, the coefficients measure the marginal effects.
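As a minimal sketch of fitting such a model and forecasting from it, the base R example below uses simulated data. The variable names, the true coefficients, and the future predictor values in `new_x` are all assumptions made for illustration; they do not come from the course data.

```r
# Simulated data: two predictors with known marginal effects (assumed values)
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)

# Fit the multiple regression y = b0 + b1 * x1 + b2 * x2 + error
fit <- lm(y ~ x1 + x2)
coef(fit)  # each estimate is the marginal effect of one predictor

# Forecasting needs values of the predictors for the forecast periods
new_x <- data.frame(x1 = c(0.5, 1), x2 = c(0, -1))
fc <- predict(fit, newdata = new_x, interval = "prediction", level = 0.95)
fc  # columns: point forecast (fit), lower (lwr) and upper (upr) limits
```

The forecast step makes the point in the notes concrete: the model can only forecast y if values of every x are supplied for the future periods.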

Some useful predictors for linear models

Here we introduce four kinds of predictors: dummy variables, Fourier terms, intervention variables, and distributed lags.

Dummy variables

Uses of dummy variables

Fourier series

We know how to add dummy variables to the regression model, but the problem is that there will be too many terms (m − 1 dummy variables for a seasonal period m). For example,
  • For quarterly data, m = 4, so we get 3 dummy variables.
  • For monthly data, m = 12, so we get 11 dummy variables.
  • For hourly data, m = 24, so we get 23 dummy variables.
  • For weekly data, m = 52, so we get 51 dummy variables.
  • For daily data with annual seasonality, m = 365, so we get 364 dummy variables.
That is far too many.
Solution: Fourier series - (i.e., harmonic regression)
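$$y_t = a + bt + \sum_{k=1}^{K}\left[\alpha_k \sin\left(\frac{2\pi k t}{m}\right) + \beta_k \cos\left(\frac{2\pi k t}{m}\right)\right] + \varepsilon_t$$
We need fewer predictors than with dummy variables, especially when m is large.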
  • Particularly useful for weekly data, for example, where m≈52. For short seasonal periods (e.g., quarterly data), there is little advantage in using Fourier terms over seasonal dummy variables.
What is K?
K is the parameter we set to specify how many pairs of sin and cos terms to include.
  • K ≤ m/2, where m is the seasonal period.
e.g., if we set model(TSLM(Beer ~ trend() + fourier(K = 2))), the model has 2 pairs of sin and cos terms.
What if only the first two Fourier terms are used (the sine and cosine with k = 1)?
The seasonal pattern will follow a simple sine wave.
A regression model containing Fourier terms is often called a harmonic regression because the successive Fourier terms represent harmonics of the first two Fourier terms.
Example 1: Harmonic regression: beer production
recent_production %>%
  model(
    f1 = TSLM(Beer ~ trend() + fourier(K = 1)),
    f2 = TSLM(Beer ~ trend() + fourier(K = 2)),
    season = TSLM(Beer ~ trend() + season())
  )
R interpretation
notion image

Example 2: Harmonic regression: eating-out expenditure
The cafe data show clear seasonality.
R Code for Fourier term
fit <- aus_cafe %>%
  model(
    K1 = TSLM(log(Turnover) ~ trend() + fourier(K = 1)),
    K2 = TSLM(log(Turnover) ~ trend() + fourier(K = 2)),
    K3 = TSLM(log(Turnover) ~ trend() + fourier(K = 3)),
    K4 = TSLM(log(Turnover) ~ trend() + fourier(K = 4)),
    K5 = TSLM(log(Turnover) ~ trend() + fourier(K = 5)),
    K6 = TSLM(log(Turnover) ~ trend() + fourier(K = 6))
  )
This is monthly data, so m = 12. We can set K up to 6, giving up to 6 pairs of sin and cos terms.
notion image
notion image
Interpretation of the plot
  • When K = 6, there are only 11 seasonal coefficients rather than 12. This is because the last sine term is redundant: sin(πt) = 0 for every integer t.
  • When K = 1, the seasonality is just a sine wave.
  • When K = 2, the AICc is smaller, and the pattern starts to get a bit more complicated because we have added one more harmonic.
  • There is only a tiny difference between K = 5 and K = 6. The AICc of K = 6 is not as good as that of K = 5, so the best model is K = 5.
Rule of thumb:
  • As we add extra terms, the shape of the seasonality gets more complicated as it tries to match what is going on in the data. That is the genius of Fourier terms!
  • We can model any type of periodic pattern by including enough terms.
What is the benefit of Fourier terms?
  • For quarterly data, m = 4, so we get 3 dummy variables.
  • For monthly data, m = 12, so we get 11 dummy variables.
  • For weekly data, m = 52, so we get 51 dummy variables.
We would not bother using Fourier terms for quarterly data, as m is small. But they pay off for longer seasonal periods:
  • monthly data: we can sometimes save one or two degrees of freedom.
  • daily data with an annual pattern: m = 365, which is large. So, instead of having 364 coefficients, we might use 10, 12 or some other small number of Fourier terms.
  • hourly data, where we are modelling a time-of-day pattern: m = 24, or m = 48 for half-hourly data. So, instead of having 23 or 47 coefficients, we might use 3 or 8 Fourier terms.
  • weekly data: m ≈ 52.18, because of the extra days after the end of the 52nd week. So, m can be non-integer.
Can m be non-integer?
Yes. For weekly data, m ≈ 52.18, because of the extra days after the end of the 52nd week.
All we are doing is computing terms like sin(2πkt/m) and cos(2πkt/m), which works just as well when m is not an integer, so non-integer seasonality is no problem.
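To make the construction concrete, here is a small base R sketch that builds the sin and cos columns by hand. The helper `fourier_terms()` is hypothetical, written only for illustration; it is not the `fourier()` special from the fable code above, although it computes the same kind of terms.

```r
# Hypothetical helper: K pairs of Fourier terms for seasonal period m
fourier_terms <- function(t, m, K) {
  cols <- list()
  for (k in seq_len(K)) {
    cols[[paste0("S", k)]] <- sin(2 * pi * k * t / m)
    cols[[paste0("C", k)]] <- cos(2 * pi * k * t / m)
  }
  as.data.frame(cols)
}

# Monthly data (m = 12) with the maximum K = 6: 12 columns are created,
# but the last sine column is sin(pi * t), which is zero for integer t
monthly <- fourier_terms(1:48, m = 12, K = 6)
max(abs(monthly$S6))  # zero up to floating-point error, so S6 is redundant

# Weekly data with a non-integer period: nothing special is needed
weekly <- fourier_terms(1:104, m = 365.25 / 7, K = 3)
dim(weekly)
```

With K = 3, the weekly seasonal pattern is described by 6 columns instead of 51 dummy variables.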

Intervention variables

Example: effect of ad expenditure on sales
We regress sales on our usual variable, ad expenditure, and create a dummy variable to capture whether the sales take place during the intervention or not.
The intervention is the period framed in red, during which we increased our ad expenditure, for example by starting to give out coupons.
Our dummy variable takes the value 1 whenever the sales occur in a period with the intervention, and 0 otherwise.
notion image
The coefficient on the dummy variable shows the effect of the intervention on sales, compared with the omitted category (no intervention).

As this example goes, we can have 3 cases:
notion image
Case 1: Spikes
We have the output variable against time.
  • Sales follow a steady slope until we intervene mid-period and ad expenditure goes up. Sales jump at once, stay higher for as long as the extra ad expenditure lasts, and then return to their normal level.
  • When we say normal, here we assume
  • The slope changes during the intervention, when the dummy variable = 1.
notion image
Case 2: Steps
Slicing our data into two parts, we denote the time period before the intervention by 0 and the period with the “permanent” intervention by 1.
notion image
Case 3: Change of slope
Before the intervention, sales follow one slope.
After we increase our ad expenditure, the slope becomes much steeper.
notion image
Cases 1 and 2 are linear models, whereas case 3 gives a nonlinear (piecewise linear) trend.
We can also create dummy variables for 1) outliers, 2) holidays, 3) trading days, and so on.
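The three cases can be sketched as three predictor columns in base R. The intervention time `tau`, the simulated sales series, and the "knee" construction `pmax(0, time_idx - tau)` for the change of slope are assumptions chosen for illustration; the notes themselves only show the cases as figures.

```r
n <- 40
time_idx <- 1:n
tau <- 20  # assumed intervention time

spike_dummy <- as.numeric(time_idx == tau)  # case 1: 1 only in the intervention period
step_dummy  <- as.numeric(time_idx >= tau)  # case 2: 0 before, 1 from tau onwards
knee        <- pmax(0, time_idx - tau)      # case 3: 0 before tau, then 1, 2, 3, ...

# Simulated sales with a permanent step of size 5 at tau (assumed values)
set.seed(2)
sales <- 10 + 0.2 * time_idx + 5 * step_dummy + rnorm(n, sd = 0.2)

# The coefficient on the dummy estimates the effect of the intervention
fit <- lm(sales ~ time_idx + step_dummy)
coef(fit)[["step_dummy"]]
```

Swapping `step_dummy` for `spike_dummy` or `knee` in the formula gives the other two cases; with `knee`, its coefficient is the change in slope after the intervention.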

Distributed lags

Say our predictor is advertising expenditure and our response is sales.
When advertising goes up, we don't necessarily expect sales to go up straight away. It might take a little while before people actually buy, especially for expensive goods. So sometimes we want to include lagged variables, such as how much we spent on advertising last month, and see how that affects sales this month.
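A minimal base R sketch of a distributed lag regression follows. The helper `lag_k()`, the simulated advertising series, and the true lag effects are assumptions for illustration only.

```r
set.seed(3)
n <- 60
advert <- runif(n, 5, 15)  # simulated monthly advertising spend

# Hypothetical helper: shift a series back by k periods, padding with NA
lag_k <- function(x, k) c(rep(NA, k), head(x, -k))

advert_lag1 <- lag_k(advert, 1)  # last month's advertising
advert_lag2 <- lag_k(advert, 2)  # advertising two months ago

# Simulated sales respond to current and past advertising (assumed effects)
sales <- 20 + 0.5 * advert + 1.0 * advert_lag1 + 0.3 * advert_lag2 +
  rnorm(n, sd = 0.5)

# lm() drops the first two rows, where the lagged values are missing
fit <- lm(sales ~ advert + advert_lag1 + advert_lag2)
coef(fit)
```

Each coefficient measures how much of the advertising effect arrives in the same month, one month later, and two months later.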
Nonlinear trend

Residual diagnostics

For forecasting purposes, we require the following assumptions:
We use 2 types of residual plots to
  1. spot outliers.
  2. decide whether the linear model was appropriate.
1. Scatterplot of residuals ε_t against each predictor x_{j,t}
notion image
If a pattern is observed, either the relationship is nonlinear, or there is “heteroscedasticity” in the errors, meaning that the variance of the residuals is not constant.
2. Scatterplot of residuals against the fitted values ŷ_t
notion image
If a plot of the residuals vs fitted values shows a pattern, then there is heteroscedasticity in the errors. If this problem occurs, a transformation of the forecast variable, such as a logarithm or square root, may be required.
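The effect of a transformation can be checked numerically as well as by eye. The sketch below simulates data with multiplicative errors (an assumption made for illustration) and uses the correlation between the absolute residuals and the fitted values as a rough, informal summary of the fan shape that a residual plot would show.

```r
set.seed(4)
n <- 200
x <- runif(n, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(n, sd = 0.3))  # errors grow with the level of y

fit_raw <- lm(y ~ x)       # linear model on the original scale
fit_log <- lm(log(y) ~ x)  # same model after a log transformation

# Rough summary of heteroscedasticity: do larger fitted values come
# with larger absolute residuals?
spread_raw <- cor(abs(residuals(fit_raw)), fitted(fit_raw))
spread_log <- cor(abs(residuals(fit_log)), fitted(fit_log))
c(raw = spread_raw, log = spread_log)

# The corresponding plots would be:
# plot(fitted(fit_raw), residuals(fit_raw))
# plot(fitted(fit_log), residuals(fit_log))
```

On the original scale the spread of the residuals grows with the fitted values, while after the log transformation that relationship largely disappears.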

Selecting predictors and forecast evaluation

Comparing regression models
We have R^2, adjusted R^2, AIC, AICc, and BIC to compare models.
$$R^2 = \frac{\sum (\hat{y}_t - \bar{y})^2}{\sum (y_t - \bar{y})^2}$$
R^2 is not so useful because
  1. It does not allow for “degrees of freedom”.
  2. Adding any variable tends to increase the value of R^2, even if that variable is irrelevant.


The solution is to use adjusted R^2:
$$\bar{R}^2 = 1 - (1 - R^2)\frac{T - 1}{T - k - 1}$$
where T is the number of observations and k is the number of predictors.

Akaike’s Information Criterion (AIC)

$$\text{AIC} = -2\log(L) + 2(k + 2)$$
where L is the likelihood and k is the number of predictors in the model.
  • AIC penalizes terms more heavily than adjusted R^2.
  • Minimizing the AIC is asymptotically equivalent to minimizing MSE via leave-one-out cross-validation (for any linear regression).
One caveat: when T (the number of observations) is small, the AIC tends to select too many predictors.

Bias-Corrected AIC (AICc)

$$\text{AIC}_c = \text{AIC} + \frac{2(k+2)(k+3)}{T - k - 3}$$

Bayesian Information Criterion (SBIC / BIC / SC)

$$\text{BIC} = -2\log(L) + (k+2)\log(T)$$
where L is the likelihood and k is the number of predictors in the model.
BIC penalizes terms more heavily than AIC.
Minimizing BIC is asymptotically equivalent to leave-v-out cross-validation when v = T[1 − 1/(log(T) − 1)].
  • We don’t use BIC because it does not optimize any predictive property of the model.
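As a sketch, the three criteria can be computed for an `lm()` fit in base R. The function `info_criteria()` and the simulated data are hypothetical, written for illustration; the penalties use k + 2 parameters (k predictors, the intercept, and the error variance), which is also what R's built-in `AIC()` and `BIC()` count for a linear model.

```r
# Hypothetical helper: AIC, AICc and BIC with k = number of predictors
info_criteria <- function(fit) {
  T_obs <- nobs(fit)
  k <- length(coef(fit)) - 1                  # predictors, excluding the intercept
  neg2_loglik <- -2 * as.numeric(logLik(fit))
  aic  <- neg2_loglik + 2 * (k + 2)
  aicc <- aic + 2 * (k + 2) * (k + 3) / (T_obs - k - 3)
  bic  <- neg2_loglik + (k + 2) * log(T_obs)
  c(AIC = aic, AICc = aicc, BIC = bic)
}

# Simulated data: only x1 matters; x2 and x3 are irrelevant (assumed setup)
set.seed(5)
T_obs <- 50
x1 <- rnorm(T_obs)
x2 <- rnorm(T_obs)
x3 <- rnorm(T_obs)
y <- 1 + 2 * x1 + rnorm(T_obs)

small <- lm(y ~ x1)
big <- lm(y ~ x1 + x2 + x3)
rbind(small = info_criteria(small), big = info_criteria(big))
```

The gap between AIC and AICc, and between AIC and BIC, is wider for the bigger model, which shows how each criterion charges more for extra predictors.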

To calculate MSE, we use CV too!


notion image
Traditionally, we have one training set and one test set. Time series cross-validation instead fits the model to many training sets and predicts one step ahead from each.

Leave-one-out cross-validation

notion image
We use all of the data except for one point, and we predict that one point.
Then, we use all of the data except for a different point to predict that point, and so on.
So the test set is always one observation in the data, and we have T possible test sets, where T is the length of the data set.
What are the steps of LOOCV?
  1. Remove observation t from the data set, and fit the model using the remaining data. Then compute the error (e*_t = y_t − ŷ_t) for the omitted observation. (This is not the same as the residual because the t-th observation was not used in estimating the value of ŷ_t.)
  2. Repeat step 1 for t = 1, …, T.
  3. Compute the MSE from e*_1, …, e*_T. We shall call this the CV.
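The steps above can be sketched directly in base R on simulated data (the data and variable names are assumptions). The last lines use a known shortcut for linear regression, where each leave-one-out error equals the ordinary residual divided by one minus its leverage, so the model only has to be fitted once.

```r
set.seed(6)
n <- 40
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

# Steps 1 and 2: leave out each observation in turn and predict it
loo_errors <- numeric(n)
for (t in seq_len(n)) {
  fit_t <- lm(y ~ x, data = dat[-t, ])
  loo_errors[t] <- dat$y[t] - predict(fit_t, newdata = dat[t, ])
}

# Step 3: the CV statistic is the mean of the squared errors
cv_loop <- mean(loo_errors^2)

# Shortcut for linear regression: a single fit plus the leverages
fit <- lm(y ~ x, data = dat)
cv_fast <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)

c(loop = cv_loop, shortcut = cv_fast)
```

The two numbers agree up to floating-point error, which is why CV is cheap to compute for regression models.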

Choosing regression variables
Subset selection

Best Subset Selection

Fit all possible regression models using one or more of the predictors.
The overall best model is then chosen from these candidate models.
The best model is chosen through cross-validation or some other method that selects the model with the lowest estimate of test error. In ETC3550, we use CV, AIC, or AICc.
In this final step, a model cannot be chosen based on R-squared, because R-squared always increases when more predictors are added. The model with the lowest K-fold CV test error is the best model.
Best subset selection is computationally expensive: as the number of predictors increases, the number of possible models grows exponentially.
  • For example, 44 predictors lead to about 18 trillion (2^44) possible models!
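For a handful of predictors, the exhaustive search is easy to write out. The sketch below uses simulated data in which only x1 and x3 matter, and a hypothetical `aicc()` helper that counts k + 2 parameters in the penalty; both are assumptions made for illustration.

```r
set.seed(7)
n <- 80
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x3 + rnorm(n)

# Hypothetical helper: bias-corrected AIC for an lm fit
aicc <- function(fit) {
  T_obs <- nobs(fit)
  k <- length(coef(fit)) - 1
  AIC(fit) + 2 * (k + 2) * (k + 3) / (T_obs - k - 3)
}

predictors <- c("x1", "x2", "x3", "x4")

# Every non-empty subset of the predictors: 2^4 - 1 = 15 candidate models
subsets <- unlist(
  lapply(seq_along(predictors),
         function(size) combn(predictors, size, simplify = FALSE)),
  recursive = FALSE
)

# Fit each candidate and keep the one with the lowest AICc
scores <- sapply(subsets, function(vars) {
  aicc(lm(reformulate(vars, response = "y"), data = dat))
})
best <- subsets[[which.min(scores)]]
best

# The same search over 44 predictors would need this many fits
2^44
```

Four predictors need 15 fits; each extra predictor doubles the work, which is why the search becomes impractical so quickly.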

Backward selection (or backward elimination)

Start with a model containing all variables.
Try subtracting one variable at a time.
Keep the model if it has lower CV or AICc. Iterate until no further improvement.
In other words, start with all predictors in the model (the full model), iteratively remove the least contributive predictor, and stop when all remaining predictors are statistically significant.
Improvement is determined by metrics like RSS, CV or adjusted R-squared; ETC3550 uses CV or AICc.
notion image
  1. The computational cost is very similar to forward selection.
  2. Stepwise regression is not guaranteed to lead to the best possible model.
  3. Inference on the coefficients of the final model will be wrong.
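A minimal sketch of backward selection driven by AICc is shown below. The functions `aicc()` and `backward_aicc()` and the simulated data are hypothetical, written only to illustrate the loop described above; this is not the course's own implementation.

```r
set.seed(8)
n <- 80
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x3 + rnorm(n)

# Hypothetical helper: bias-corrected AIC for an lm fit
aicc <- function(fit) {
  T_obs <- nobs(fit)
  k <- length(coef(fit)) - 1
  AIC(fit) + 2 * (k + 2) * (k + 3) / (T_obs - k - 3)
}

# Hypothetical helper: drop one predictor at a time while AICc improves
backward_aicc <- function(dat, response, predictors) {
  current <- predictors
  best_score <- aicc(lm(reformulate(current, response), data = dat))
  repeat {
    if (length(current) <= 1) break
    # Candidate models: the current model minus one predictor each
    candidates <- lapply(current, function(drop) setdiff(current, drop))
    scores <- sapply(candidates, function(vars) {
      aicc(lm(reformulate(vars, response), data = dat))
    })
    if (min(scores) >= best_score) break  # no removal helps, so stop
    best_score <- min(scores)
    current <- candidates[[which.min(scores)]]
  }
  list(predictors = current, aicc = best_score)
}

result <- backward_aicc(dat, "y", c("x1", "x2", "x3", "x4"))
result$predictors
```

Each pass fits at most as many models as there are remaining predictors, so the search is far cheaper than best subset selection, at the price of possibly missing the best model.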

Forecasting with regression

When using regression models for time series data, different types of forecasts can be produced, depending on what is assumed to be known when the forecasts are computed.
A comparative evaluation of ex-ante forecasts and ex-post forecasts can help to separate out the sources of forecast uncertainty. This will show whether forecast errors have arisen due to poor forecasts of the predictor or due to a poor forecasting model.

Ex-ante forecasts

Those that are made using only the information that is available in advance.
  • For example, ex-ante forecasts for the percentage change in US consumption for quarters following the end of the sample, should only use information that was available up to and including 2019 Q2.
  • These are genuine forecasts, made in advance using whatever information is available at the time. Therefore in order to generate ex-ante forecasts, the model requires forecasts of the predictors.
  • To obtain these we can use one of the simple methods introduced in Section 5.2 or more sophisticated pure time series approaches that follow in Chapters 8 and 9. Alternatively, forecasts from some other source, such as a government agency, may be available and can be used.

Ex-post forecasts

Those that are made using later information on the predictors.
  • For example, ex-post forecasts of consumption may use the actual observations of the predictors, once these have been observed.
  • These are NOT genuine forecasts, but are useful for studying the behaviour of forecasting models.
  • The model from which ex-post forecasts are produced should not be estimated using data from the forecast period.
  • That is, ex-post forecasts can assume knowledge of the predictor variables (the x variables), but should not assume knowledge of the data that are to be forecast (the y variable).

Scenario based forecasting

  • When the future values of the predictors are not known in advance, we assume possible scenarios for the predictor variables.
    • For example, a US policy maker may be interested in comparing the predicted change in consumption when there is a constant growth of 1% and 0.5% respectively for income and savings with no change in the employment rate, versus a respective decline of 1% and 0.5%, for each of the four quarters following the end of the sample.
  • Prediction intervals for scenario based forecasts do NOT include the uncertainty associated with the future values of the predictor variables.
    • The resulting forecasts are calculated below and shown in Figure 7.18.
      • notion image
R code for Scenario based forecasting
#1. Make future scenarios
future_scenarios <- scenarios(
  # Increase: income up by 1%, savings up by 0.5%, unemployment and production unchanged
  Increase = new_data(us_change, 4) %>%  # 4 periods ahead
    mutate(Income = 1, Savings = 0.5, Unemployment = 0, Production = 0),
  Decrease = new_data(us_change, 4) %>%
    mutate(Income = -1, Savings = -0.5, Unemployment = 0, Production = 0),
  names_to = "Scenario"
)
#2. Make predictions
fc <- forecast(fit_consBest, new_data = future_scenarios)

Building a predictive regression model

notion image

Correlation, causation and forecasting

Correlation is not causation

  • When x is useful for predicting y, it is not necessarily causing y. e.g., predict number of drownings y using number of ice-creams sold x.
  • Correlations are useful for forecasting, even when there is no causality.
  • Better models usually involve causal relationships (e.g., temperature x and people z to predict drownings y)


Multicollinearity occurs when:
  • Two predictors are highly correlated (i.e., the correlation between them is close to ±1).
  • A linear combination of some of the predictors is highly correlated with another predictor.
  • A linear combination of one subset of predictors is highly correlated with a linear combination of another subset of predictors
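A quick way to see these conditions in data is to look at pairwise correlations and at how well each predictor is explained by the others. The sketch below uses simulated predictors and a hand-written variance inflation factor, `vif()`; both the data and this diagnostic are assumptions added for illustration, since the notes do not name a specific check.

```r
set.seed(9)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # almost a copy of x1
x3 <- rnorm(n)                  # unrelated to the other two

# Condition 1: two predictors are highly correlated
cor(x1, x2)

# Hypothetical helper: variance inflation factor, 1 / (1 - R^2), where R^2
# comes from regressing one predictor on all of the others
vif <- function(target, others) {
  1 / (1 - summary(lm(target ~ others))$r.squared)
}

vif_x1 <- vif(x1, cbind(x2, x3))  # very large: x1 is nearly determined by x2
vif_x3 <- vif(x3, cbind(x1, x2))  # close to 1: x3 carries separate information
c(x1 = vif_x1, x3 = vif_x3)
```

Because the helper regresses a predictor on all of the others at once, it also picks up the second condition, where a linear combination of predictors is highly correlated with another predictor.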


