Evaluating Combined Forecasts for Realized Volatility Using Asymmetric Loss Functions

In this work we provide the findings of a forecast combination analysis carried out on the realized volatility series of three market indexes (DAX, CAC, and AEX). Two volatility types (5-minute, kernel) have been considered. Different loss functions suggest that forecasts computed by combining models are generally more accurate than those provided by single models. However, the choice of the models to combine can significantly affect the quality of the results.

JEL classification: C22, C53, C58


Introduction
Volatility is a central parameter in many financial decisions, including the pricing and hedging of derivative products as well as the development of efficient risk management methods. Most of the volatility models presented in the literature are based on the empirical finding that volatility is time-varying and that periods of high volatility tend to cluster (Ané, 2006). Forecasting such an important measure therefore represents a major issue.
In the literature there exists a wide variety of models to forecast volatility, but these are, almost by definition, simplified and incomplete (Raviv, 2016). An improvement in forecast accuracy can be achieved by combining forecasts originating from different types of models. Forecast combinations have been used successfully in empirical work in areas such as forecasting gross national product, currency market volatility, inflation, money supply, stock prices, meteorological data, city populations, outcomes of football games, wilderness area use, check volume, and political risks (Timmermann, 2006).
The aim of this paper is to use both single and combined models to forecast the daily realized volatility one step ahead over a one-year period. Thereafter we compare predicted values with actual data using a number of loss functions. To carry out our analysis, we use data on the realized volatility of three market indexes (DAX, CAC, and AEX) for the period 2008 to 2016.
The remainder of the paper is organized as follows. Section 1 describes the data, the models adopted, and the loss functions used for evaluating the different forecasts. Section 2 presents the results of the analysis, while Section 3 concludes the paper.

Data and Methodology
This study focuses on the realized volatility of three European market indexes:
• DAX 30 (Deutscher Aktienindex 30) is a blue-chip stock market index consisting of the 30 major German companies trading on the Frankfurt Stock Exchange;
• CAC 40 (Cotation Assistée en Continu 40) is a benchmark index composed of the 40 most significant stocks listed on Euronext Paris;
• AEX (Amsterdam Exchange Index) is a stock market index composed of Dutch companies that trade on Euronext Amsterdam; it includes the 25 most frequently traded securities on the exchange.
The time series of the indexes are provided by the Oxford-Man Institute of Quantitative Finance through its website (http://realized.oxford-man.ox.ac.uk/data). For each asset, the dataset contains the realized volatility computed from 5-minute returns, the realized kernel volatility (in both cases denoted by rv_t), and the daily returns (denoted by r_t), covering the period from January 1, 2008 to December 31, 2016.
Three different models have been chosen to create the single forecasts:

1. Asymmetric Multiplicative Error Model (AMEM) (Engle, 2002; Engle and Gallo, 2006), which for a basic (1,1) order has the following structure:

   rv_t = μ_t ε_t,   μ_t = ω + α_1 rv_{t−1} + γ D_{t−1} rv_{t−1} + β_1 μ_{t−1},

   with ω > 0, α_1 ≥ 0, β_1 ≥ 0, α_1 + β_1 < 1, and D_t a dummy variable that takes the value 1 if r_t < 0 and 0 otherwise;

2. Asymmetric Power Multiplicative Error Model (APMEM), which for the usual (1,1) order raises the conditional mean equation to a power δ > 0:

   μ_t^δ = ω + α_1 rv_{t−1}^δ + γ D_{t−1} rv_{t−1}^δ + β_1 μ_{t−1}^δ.

   This model is a generalization of the basic MEM and is strictly related to the Asymmetric Power ACD model (cf. Fernandes and Grammig, 2006);

3. Asymmetric Heterogeneous AutoRegressive Model (AHAR), that is, the HAR model with a leverage effect term (Corsi, 2009):

   rv_t = β_0 + β_d rv_{t−1} + β_w rv_{t−1}^{(w)} + β_m rv_{t−1}^{(m)} + γ D_{t−1} rv_{t−1} + ε_t,

   where rv_{t−1} is the daily realized volatility, rv_{t−1}^{(w)} is the weekly realized volatility, which at time t is given by the average

   rv_{t−1}^{(w)} = (1/5) Σ_{i=1}^{5} rv_{t−i},

   and rv_{t−1}^{(m)} is the monthly realized volatility, which at time t is given by the average

   rv_{t−1}^{(m)} = (1/22) Σ_{i=1}^{22} rv_{t−i}.

As a preliminary analysis, in Figure 1 we compare the forecasts obtained using the three models above for the year 2016 (colored lines) with the actual values of the volatility (dashed black line) for the DAX 5-minute series. The chart shows that all models react satisfactorily to positive peaks of volatility, whereas they fail to achieve a suitable degree of accuracy when volatility reaches a local minimum. This issue, which is common also to the other observed time series, can be mitigated by combining the forecasts of two models, as we will see later.
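The AHAR specification above can be sketched as a plain OLS regression on daily, weekly, and monthly volatility averages plus a leverage term. The helper name `har_design` and the synthetic data below are our own illustration, not the paper's code:

```python
import numpy as np

def har_design(rv, r):
    """Build AHAR regressors: intercept, daily rv, weekly (5-day) and
    monthly (22-day) averages, plus a leverage term D_{t-1} * rv_{t-1}
    (dummy equals 1 when the previous return is negative)."""
    rows, y = [], []
    for t in range(22, len(rv)):
        daily = rv[t - 1]
        weekly = np.mean(rv[t - 5:t])      # average of the last 5 days
        monthly = np.mean(rv[t - 22:t])    # average of the last 22 days
        lever = daily if r[t - 1] < 0 else 0.0
        rows.append([1.0, daily, weekly, monthly, lever])
        y.append(rv[t])
    return np.array(rows), np.array(y)

# synthetic stand-in for a realized volatility series and daily returns
rng = np.random.default_rng(0)
rv = np.abs(rng.normal(1.0, 0.2, 300))
r = rng.normal(0.0, 1.0, 300)

X, y = har_design(rv, r)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS estimate of AHAR coefficients
```

With 300 observations and 22 lags consumed by the monthly average, the design matrix has 278 rows and 5 columns (one per coefficient).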
The combined methods are based on the following two combination models:
• the comb1 model, based on a simple unconstrained ordinary least squares estimation of the weights. The one-step-ahead forecast is given by

  f^c_T(1) = β_0 + β_1 f^(1)_T(1) + β_2 f^(2)_T(1),

  with f^(1)_T(1) and f^(2)_T(1) denoting, respectively, the first and second model forecasts;
• the comb2 model, with the combination given by

  f^c_T(1) = β_0 + β_1 f^(1)_T(1) + β_2 f^(2)_T(1) + β_3 D_T,     (1)

  which includes a dummy variable, D_t, that takes the value 1 if rv_t is lower than rv_{t−1} and 0 otherwise. The rationale for this choice is that, as we have mentioned before, the forecast of volatility is often far from the actual realized volatility when the latter is decreasing.
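The OLS weight estimation behind comb1 and comb2 can be sketched as follows. The helper name `combine_forecasts` and the synthetic series are illustrative assumptions, and the comb2 dummy enters additively, following our reading of the text:

```python
import numpy as np

def combine_forecasts(f1, f2, actual, dummy=None):
    """Estimate combination weights by unconstrained OLS (comb1);
    if `dummy` is supplied, it is added as an extra regressor (comb2)."""
    cols = [np.ones_like(f1), f1, f2]
    if dummy is not None:
        cols.append(dummy.astype(float))
    X = np.column_stack(cols)
    w, *_ = np.linalg.lstsq(X, actual, rcond=None)
    return w

# synthetic stand-ins: an "actual" volatility series and two noisy forecasts
rng = np.random.default_rng(1)
actual = np.abs(rng.normal(1.0, 0.3, 250))
f1 = actual + rng.normal(0.0, 0.1, 250)
f2 = actual + rng.normal(0.0, 0.2, 250)

# dummy = 1 when realized volatility decreased relative to the previous day
dummy = (np.r_[0.0, np.diff(actual)] < 0).astype(float)

w1 = combine_forecasts(f1, f2, actual)          # comb1: 3 weights
w2 = combine_forecasts(f1, f2, actual, dummy)   # comb2: 4 weights
```

In practice the weights would be estimated on the training period only and then applied to produce the one-step-ahead combined forecast.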

Loss Functions
To compare the results of the combined schemes with those obtained by relying exclusively on a single model, we have computed three loss functions:

1. Mean Square Error (MSE), given by

   MSE = (1/n) Σ_{i=1}^{n} (rv_{T+i} − rv_{T+i−1}(1))²,

   with rv_{T+i} being the observed value of the realized volatility and rv_{T+i−1}(1) the one-step-ahead forecast for time T+i, i = 1, …, n;

2. Quasi-Likelihood (QLIKE), defined as

   QLIKE = (1/n) Σ_{i=1}^{n} [log rv_{T+i−1}(1) + rv_{T+i} / rv_{T+i−1}(1)];

3. Asymmetric Mean Square Error (AMSE), defined as

   AMSE = (1/n) Σ_{i=1}^{n} ε²_{T+i} (1 + I(ε_{T+i} > 0) rv^m_{T+i}),

   where ε_{T+i} = rv_{T+i} − rv_{T+i−1}(1) and I(·) is the indicator function. This measure is an extension of the MSE: each term of the sum reduces to ε²_{T+i} when the indicator function is 0 (overestimation of the volatility) and is given by (1 + rv^m_{T+i}) ε²_{T+i} when the indicator function is 1 (underestimation of the volatility).
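The three loss functions can be implemented in a few lines. This is a minimal sketch: the QLIKE form is the standard Patton (2011)-style version, and the exact AMSE formula follows our reading of the text, so both should be treated as assumptions:

```python
import math

def mse(actual, fc):
    """Mean Square Error over paired observations and forecasts."""
    return sum((a - f) ** 2 for a, f in zip(actual, fc)) / len(actual)

def qlike(actual, fc):
    """QLIKE loss; assumes strictly positive volatilities and forecasts."""
    return sum(math.log(f) + a / f for a, f in zip(actual, fc)) / len(actual)

def amse(actual, fc, m=1):
    """Asymmetric MSE sketch: an extra penalty rv^m applies only when the
    forecast underestimates the realized volatility (error > 0)."""
    total = 0.0
    for a, f in zip(actual, fc):
        e = a - f
        total += e * e * (1 + a ** m if e > 0 else 1)
    return total / len(actual)

actual = [1.0, 1.2, 0.9]
fc = [1.1, 1.0, 0.9]
```

By construction AMSE is never smaller than MSE on the same series, since the underestimation terms are inflated and the remaining terms coincide.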
We decided to build up a new loss function for the evaluation of forecasts for two reasons. First, as we have already said, AMSE is more suitable than MSE when it comes to assessing forecasts that underestimate the volatility, as it penalizes underestimation to a greater extent. Second, it can be shown that AMSE is able to perform more reasonably than QLIKE, one of the most widely used loss functions in the volatility forecasting literature (Patton, 2011).
Figure 2 displays a graphic comparison of QLIKE with MSE and two versions of its asymmetric modification, the first with power term m = 1 and the second with m = 2. On the x-axis we have depicted the relative deviations of the forecasts from the true value (which amounts to 2 in this case), whereas on the y-axis we have represented the relative difference of the loss functions between the cases of underestimation and overestimation of the same size. As expected, MSE appears as a flat line because it is a symmetric loss function. In contrast, QLIKE and AMSE start to rise almost immediately, particularly QLIKE which, as evidenced by the sharp slope of the red line, is able to reach very high values. However, the AMSE loss function appears distinctly smoother than QLIKE (especially when m = 1), indicating that AMSE is well-balanced and also more regular than the QLIKE loss function.
For computing the forecast combinations, we start by splitting the data into an estimation-and-training set and a test set. The former is again split into two parts: the first is used to estimate the parameters of the models, the second (the training period) to estimate the weights to be attributed to the single forecasts. The test set is used to evaluate the different models. We have chosen to consider two different training periods in our analysis: a four-year training period and a three-year training period. For instance, with a four-year training period, we estimate the parameters of the models using observations from January 2, 2008 to December 31, 2011, then compute one-step-ahead forecasts from January 2, 2012 to December 31, 2015; these forecasts are used to estimate the weights of the combinations; finally, the one-step-ahead forecast for January 2, 2016 is produced. Then, we estimate the parameters of the models using observations from January 3, 2008 to January 2, 2012, compute one-step-ahead forecasts from January 3, 2012 to January 2, 2016 to estimate the weights of the combinations, and produce the one-step-ahead forecast for January 3, 2016.
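The rolling estimation/training/test scheme described above amounts to index bookkeeping over the daily observations. The function below is a toy illustration of this bookkeeping (the name `rolling_splits` and the window sizes are our own, not the paper's code):

```python
def rolling_splits(n_total, n_est, n_train, n_test):
    """Yield (estimation, training, test) index ranges for a rolling
    one-step-ahead scheme: model parameters are estimated on the first
    window, combination weights on the second, and the single test
    point follows; all three windows roll forward by one observation."""
    assert n_est + n_train + n_test <= n_total
    for i in range(n_test):
        est = range(i, i + n_est)                       # model parameters
        train = range(i + n_est, i + n_est + n_train)   # combination weights
        test = i + n_est + n_train                      # one-step-ahead target
        yield est, train, test

# toy sizes: 9 observations, 4 for estimation, 4 for training, 1 test point
splits = list(rolling_splits(n_total=9, n_est=4, n_train=4, n_test=1))
```

In the paper's four-year setup the estimation and training windows each span roughly four years of daily data and the loop runs over every trading day of 2016.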

Comparisons among forecasting models
In this section we show the results of our analysis. For each model we display the values of the three loss functions mentioned above, computed by comparing the forecasts with the observed values.
Because two of the three single models we have used (AMEM and APMEM) are very similar to each other, we present first a comparison between AMEM and AHAR, then between APMEM and AHAR, along with the combined schemes we described in Section 1.

AMEM vs. AHAR
The order of the two single models is defined using the Ljung-Box test on the residuals of the in-sample analysis of the two models. We have selected an AMEM (1,1) for the DAX dataset, an AMEM (1,2) for CAC and AEX, and an AHAR with a second lag term (rv_{t−2}) for all datasets.
Table 1 shows the results for the first comparison, i.e., the AMEM (1,1) and AHAR models, along with the combined forecasts, using DAX data. We can see that the comb2 model performs very well for almost all indicators; only QLIKE prefers the AMEM (1,1) model.
The findings provided in Table 2 for the CAC dataset are very similar to those for the DAX dataset. Indeed, there are only two differences: QLIKE prefers comb2 for rv kernel with a training period of four years instead of AMEM (1,2); and AMSE with m = 1 prefers the AMEM model for rv 5 minutes with a training period of three years, instead of comb2.
The results for the AEX dataset (Table 3) are not so different from the others. The comb2 model predominates, but the AMEM (1,2) also performs well, particularly according to QLIKE (in three cases out of four).

APMEM vs. AHAR
In this subsection we assess whether a generalization of the basic AMEM model is able to improve the accuracy of the combined forecasts. According to the Ljung-Box test, we use an APMEM (1,1) for DAX and an APMEM (1,2) for CAC and AEX.
As shown in Table 4, we observe an actual improvement in the combined forecasts. Compared to the findings shown in Table 1, in 14 cases out of 16 the loss functions assign the smallest value to a combination. Even according to QLIKE, comb2 is preferred over APMEM half the time.
As we expected, the improvement that occurred for the DAX dataset moving from AMEM to APMEM holds for CAC as well, even if it is less significant. Indeed, the results shown in Table 5 are almost the same as those in Table 2 in terms of loss function choices. However, this time AMSE (m = 1) selects the comb2 model for all volatility measures and training periods.
Observing Table 6, we see that, compared to Table 3, the transition from AMEM to APMEM has increased the consistency of the loss functions. Almost all loss functions suggest choosing comb2; the single model APMEM (1,2) has been chosen only by the QLIKE statistic. Overall, these findings mirror those seen in Table 5.

Accuracy of Forecasts
So far we have evaluated the available forecasts by means of the numerical values provided by the loss functions. Before drawing conclusions from these values, however, we need to assess whether the forecast series differ from a statistical point of view. To this end, we have used the conditional predictive ability (CPA) test of Giacomini and White (2006) to make pairwise comparisons among all forecasting models (α = 0.05). The null hypothesis is that the two models under comparison have the same predictive accuracy. Because comb2 has proved to be the best combination scheme in most cases, the alternative hypothesis is that comb2 is more accurate than the competing model.
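The full Giacomini-White CPA test conditions the loss differential on a set of instruments; as a simplified sketch, its unconditional special case reduces to a t-statistic on the mean loss differential. The function below is our own illustration of that special case, not the exact test used in the paper:

```python
import math

def loss_diff_tstat(loss_a, loss_b):
    """t-statistic on the mean loss differential between two forecast
    series: a value far from 0 suggests unequal predictive accuracy.
    This is the unconditional special case of the Giacomini-White test;
    the full test regresses the differential on instruments."""
    d = [a - b for a, b in zip(loss_a, loss_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

With per-period losses from two models as input, the statistic is compared with standard normal critical values (e.g., ±1.96 at α = 0.05).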
Tables 7-9 provide the findings of the analysis for the three datasets. At first glance, we observe some similarities among the comparisons. In more detail, we find that the two combination schemes, comb1 and comb2, show equal conditional predictive ability across all market indexes, types of realized volatility, and training periods, except for kernel estimates with a training period of four years using the CAC dataset. Regarding the other comparisons, the alternative hypothesis was rejected for comb2 versus AMEM twice in DAX, once in CAC, and three times out of four in AEX; the same applies to comb2 versus APMEM. Finally, the AEX data show equal forecast accuracy also between comb2 and AHAR when rv 5 minutes with a three-year training period is involved. In all other cases, the CPA test provides evidence that comb2 has better predictive ability than the other models. In particular, Tables 7-9 make it clear that comb2 outperforms AHAR in almost all situations, except for the AEX case mentioned above.

Conclusions
In this paper, we demonstrate that an improvement in the accuracy of forecasts of a measure of volatility, namely realized volatility, can be achieved by combining predictions originating from several models. We forecast the daily realized volatility one step ahead for a one-year period with three single models (AMEM, APMEM, AHAR) and two combinations (comb1, comb2). Subsequently, we compare predicted values and actual data using a number of loss functions. We find that combining the AHAR model with APMEM instead of AMEM enhances the quality of the forecasts computed using combination schemes, especially the comb2 model, which proves to be the best model in most situations. This finding holds for the DAX, AEX, and (to a lesser extent) CAC datasets, and for all training periods considered.