# Mixture EMOS model for calibrating ensemble forecasts of wind speed

###### Abstract

Ensemble model output statistics (EMOS) is a statistical tool for post-processing forecast ensembles of weather variables obtained from multiple runs of numerical weather prediction models in order to produce calibrated predictive probability density functions (PDFs). The EMOS predictive PDF is given by a parametric distribution with parameters depending on the ensemble forecasts. We propose an EMOS model for calibrating wind speed forecasts based on weighted mixtures of truncated normal (TN) and log-normal (LN) distributions where model parameters and component weights are estimated by optimizing the values of proper scoring rules over a rolling training period. The new model is tested on wind speed forecasts of the 50 member European Centre for Medium-Range Weather Forecasts ensemble, the 11 member Aire Limitée Adaptation dynamique Développement International-Hungary Ensemble Prediction System ensemble of the Hungarian Meteorological Service and the eight-member University of Washington mesoscale ensemble, and its predictive performance is compared to that of various benchmark EMOS models based on single parametric families and combinations thereof. The results indicate improved calibration of probabilistic and accuracy of point forecasts in comparison with the raw ensemble and climatological forecasts. The mixture EMOS model significantly outperforms the TN and LN EMOS methods, moreover, it provides better calibrated forecasts than the TN-LN combination model and offers an increased flexibility while avoiding covariate selection problems.

Key words: Continuous ranked probability score, ensemble calibration, ensemble model output statistics, truncated normal distribution, log-normal distribution.

## 1 Introduction

In our industrialized world several important applications require reliable and accurate wind speed forecasts. These include, but are not limited to agriculture, aviation or wind energy production. In particular, high wind speeds can cause severe damages to infrastructure and their predictions are important parts of weather warnings. Wind speed forecasts are standard outputs of numerical weather prediction (NWP) models. NWP has traditionally been viewed as a deterministic problem, but over the last decades, a change towards probabilistic forecasts, i.e., forecasts in the form of a full predictive distribution, can be observed. Probabilistic forecasts are important in the context of weather forecasting as they allow for a quantification of the associated uncertainty of the prediction, and further allow for optimal point forecasting by using certain functionals of the predictive distribution (see, e.g., Gneiting, 2011).

Nowadays, weather services typically produce ensemble forecasts which consist of multiple runs of NWP models that differ in initial conditions and/or the numerical representation of the atmosphere (Gneiting and Raftery, 2005). While the transition to ensemble forecasts is an important step towards probabilistic forecasting, ensembles are finite and do not provide full predictive densities. Further, ensemble forecasts are typically underdispersive and subject to systematic bias, they thus require some form of statistical post-processing (Gneiting and Raftery, 2005; Gneiting et al., 2007).

State of the art techniques for statistical post-processing of ensemble forecasts include Bayesian model averaging (BMA) developed by Raftery et al. (2005), and ensemble model output statistics (EMOS) or non-homogeneous regression by Gneiting et al. (2005). The BMA approach uses weighted mixtures of parametric probability density functions (PDFs) which depend on the ensemble forecasts, with the mixture weights being determined based on the performance of the ensemble members in the training period. Possible component choices for wind speed are given by PDFs of Gamma distributions (Sloughter et al., 2010) or truncated normal (TN) distributions (Baran, 2014). By contrast, the predictive distribution of the EMOS approach is given by a single parametric distribution with parameters depending on the ensemble forecasts. Thorarinsdottir and Gneiting (2010) propose the use of truncated normal distributions for EMOS models of wind speed. Alternative choices are given by generalized extreme value distributions (Lerch and Thorarinsdottir, 2013), log-normal (LN) distributions (Baran and Lerch, 2015), and combinations thereof.

The article at hand builds on the EMOS framework and proposes models based on weighted mixtures of TN and LN distributions. This new approach allows for combining the advantages of lighter and heavier-tailed distributions, but avoids problems encountered in using previously proposed combination models. Apart from the flexibility, the mixture models further exhibit desirable properties from a theoretical perspective and provide well calibrated and skillful probabilistic forecasts.

The novel EMOS mixture approach is applied to forecasts of maximal wind speed of the 50 member ensemble of the European Centre for Medium-Range Weather Forecasts (ECMWF; ECMWF Directorate, 2012) and the eight-member University of Washington mesoscale ensemble (UWME; Eckel and Mass, 2005), and to instantaneous wind-speed forecasts of the 11 member Limited Area Model Ensemble Prediction System of the Hungarian Meteorological Service (HMS) called Aire Limitée Adaptation dynamique Développement International-Hungary Ensemble Prediction System (ALADIN-HUNEPS; Hágel, 2010; Horányi et al., 2006). The three ensemble prediction system differ in the generation of their members, which is accounted for in the model formulation. The TN model of Thorarinsdottir and Gneiting (2010), and the LN and TN-LN combination models of Baran and Lerch (2015) serve as benchmark models for the three case studies.

The remainder of this article is organized as follows. Section 2 contains a description of the ensembles and observational data. In Section 3 the EMOS technique is reviewed and the novel TN-LN mixture models are introduced. Section 4 summarizes the results of the three case studies. The article concludes with a discussion in Section 5.

## 2 Data

We consider three distinct data sets of ensemble forecasts and corresponding observations which differ both in the stochastic properties of the ensemble as well as the observed wind quantities. The data sets coincide with those used in Baran and Lerch (2015). We thus limit our discussion here to a succinct summary of the data and refer to Baran and Lerch (2015) for a more detailed description.

### 2.1 ECMWF ensemble

The ECMWF ensemble consists of 50 exchangeable ensemble members of one day ahead forecasts of 10 m daily maximum wind speed (given in m s) along with corresponding validating observations from 228 synoptic observation stations over Germany. The observations are daily maxima of hourly observations of 10-minute average wind speed measured over the 10 minutes before the hour, where the maxima are taken over the 24 hours corresponding to the time frame of the ensemble forecast. The results presented in Section 4.1 are based on a verification period from 1 May 2010 to 30 April 2011, consisting of 83 220 individual forecast cases.

Figure 1a shows the verification rank histogram of the raw ensemble, that is the histogram of ranks of validating observations with respect to the corresponding ensemble forecasts computed from the ranks at all locations and dates considered (see, e.g., Wilks, 2011, Section 7.7.2). The strongly U-shaped histogram indicates a highly underdispersive character of the ECMWF ensemble. The range of the ECMWF ensemble contains the validating observation only in of all cases (the nominal value of this coverage is , that is ), verifying the need of statistical post-processing.

### 2.2 ALADIN-HUNEPS ensemble

The ALADIN-HUNEPS system of the HMS (Horányi et al., 2006; Descamps et al., 2014) consists of 11 members, 10 exchangeable forecasts initialized from perturbed initial conditions and one control member from the unperturbed analysis. The data base contains ensembles of 42-h forecasts for 10 m wind speed (given in m s) for 10 major cities in Hungary, together with the corresponding validating observations for the one-year period between 1 April 2012 and 31 March 2013.

The validating wind speed measurements are considered as instantaneous values (valid at a given time), however, they are in fact mean values over the preceding 10 minutes. The model wind speed values are also considered as instantaneous, but they are representatives for a given model time step, which is 5 min in our case.

Similar to the ECMWF ensemble, the verification rank histogram of the raw ALADIN-HUNEPS ensemble is far from the desired uniform distribution (see Figure 1b), however, it shows a much less underdispersive character. The better fit of the ensemble can also be observed on its coverage value of which should be compared with the nominal coverage of ().

### 2.3 University of Washington Mesoscale Ensemble

The UWME covering the Pacific Northwest region of western North America has eight members that are obtained from different runs of the fifth generation Pennsylvania State University–National Center for Atmospheric Research mesoscale model (PSU-NCAR MM5) (Grell et al., 1995). Our data base contains ensembles of 48 h forecasts and corresponding validating observations of 10 m maximal wind speed (maximum of the hourly instantaneous wind speeds over the previous twelve hours, given in m s, see e.g. Sloughter et al. (2010)) for 152 stations in the Automated Surface Observing Network (National Weather Service, 1998). The ensemble members are not exchangeable as they are generated with initial conditions from different sources.

In the present study we investigate forecasts for calendar year 2008 with additional data from the last month of 2007 used for parameter estimation. After removing days and locations with missing data 101 stations remain where the number of days for which forecasts and validating observations are available varies between 160 and 291.

Figure 1c shows the verification rank histogram of the raw ensemble where, similar to the previous cases, one can again observe a strongly underdispersive character. The ensemble coverage equals , whereas the nominal coverage for eight ensemble members equals , that is .

## 3 Ensemble Model Output Statistics

As mentioned in the Introduction, the EMOS predictive distribution of a future weather quantity is a single parametric distribution, where the parameters depend on the ensemble. For example, a normal distribution provides a fairly good fit for temperature and pressure (Gneiting et al., 2005), whereas wind speed requires a distribution with non-negative support.

In what follows, we consider three different types of EMOS models: standard EMOS models based on a single parametric family, combination models that select one of multiple parametric distributions based on the values of suitable covariates, and new mixture models. Models based on single parametric families for wind speed which employ truncated normal, log-normal or generalized extreme value distributions, as well as combination models selecting one of these distribution based on covariates have been explored in previous works (see e.g. Thorarinsdottir and Gneiting, 2010; Lerch and Thorarinsdottir, 2013; Baran and Lerch, 2015) and are reviewed in Section 3.1. As an alternative choice, we propose new mixture models based on a weighted mixture of truncated normal and log-normal distributions in Section 3.2. The basic EMOS models from previous studies are used as benchmark models in order to assess the predictive performance of the novel mixture models in Section 4.

### 3.1 Basic EMOS models

#### Models based on single parametric distributions

The EMOS model introduced by Thorarinsdottir and Gneiting (2010) is based on a truncated normal (TN) distribution, i.e., the predictive distribution is

(3.1) |

where denote the ensemble of distinguishable forecasts of wind speed for a given location and time, stands for the ensemble mean, and denotes the TN distribution with location , scale , and cut-off at zero having probability density function (PDF)

where and are the PDF and the cumulative distribution function (CDF) of the standard normal distribution, respectively.

As an alternative to the TN distribution Baran and Lerch (2015) propose the use of a log-normal (LN) distribution where the mean and variance are affine functions of the ensemble members and ensemble variance, respectively, i.e.,

(3.2) |

Usually the PDF of the LN distribution is expressed using location and shape parameters and has the form

however, it can be easily expressed in terms of mean and variance with the help of transformations

(3.3) |

Lerch and Thorarinsdottir (2013) propose an EMOS approach based on a generalized extreme value (GEV) distribution, however, this model has the disadvantage of assigning positive probability to negative wind speed values (see, e.g., Baran and Lerch, 2015).

Location and scale/shape parameters of models (3.1) and (3.2) can be estimated from the training data consisting of ensemble members and verifying observations from the preceding days, by optimizing an appropriate verification score (see Section 3.3).

EMOS models (3.1) and (3.2) are valid only in the cases when the sources of the ensemble members are clearly distinguishable, which is the case for the UWME described in Section 2.3 or for the Consortium for Small-scale Modeling Germany (COSMO-DE) ensemble of the German Meteorological Service (Gebhardt et al., 2011). However, in most of the currently used EPSs some members are obtained with the help of perturbations of the initial conditions. These members are statistically indistinguishable and can be considered as exchangeable. This the case for the ECMWF and ALADIN-HUNEPS ensembles described in Sections 2.1 and 2.2, respectively.

Suppose we have ensemble members divided into exchangeable groups, where the th group contains ensemble members such that . In such situations the ensemble members within an exchangeable group should share the same parameters (Gneiting, 2014) resulting in a TN model

(3.4) |

and a LN model with mean and variance

(3.5) |

where denotes the th member of the th group.

#### Combination models

LN and GEV distributions have heavier upper tails than the TN distribution and are therefore more appropriate to model high wind speed values. To combine this advantage with the good performance of the TN model for low and medium wind speeds Lerch and Thorarinsdottir (2013) and Baran and Lerch (2015) also examined a regime-switching combination method, where either a TN or a heavy-tail distribution is used depending on the median value of the ensemble forecast. If the ensemble median is below a given threshold, wind speed is modeled by a TN distribution, otherwise a GEV or LN distribution is employed. The optimal threshold value for a given EPS is determined during a preliminary study and it is then fixed over the whole data set. The problem with this approach is that the threshold parameter is static (rarely updated) and cannot adapt to the changes in the ensemble. Baran and Lerch (2015) also consider a more adaptive method where the threshold parameter is re-estimated as a fixed quantile of the ensemble medians in the corresponding training period for each forecast date. However, this approach is computationally more demanding without yielding a significant improvement in predictive performance.

EMOS models based on combining two parametric families by exclusively selecting one of them at each forecast instance also suffer from the drawback that a suitable covariate has to be chosen as a selection criterion. This necessary step limits the flexibility of the combination models in practice as the adequacy of covariates might depend on the data set at hand. While the ensemble median works reasonably well in the data sets considered in this article, this observation might change for different EPSs.

### 3.2 Mixture models

In order to combine the advantages of lighter and heavier-tailed distributions and to avoid the aforementioned problems in the process, we introduce new EMOS models based on weighted mixtures of two parametric distributions.

In particular, we propose to model wind speed with a weighted mixture of models (3.1) and (3.2) (or (3.4) and (3.5) for exchangeable ensemble members) resulting in the predictive PDF

(3.6) |

where the dependence of parameters and on the ensemble are given by (3.1) (or (3.4)) and (3.2) (or (3.5)) and (3.3), respectively. In case of model (3.6) location and scale/shape parameters of the TN and LN models together with the weight are estimated simultaneously by optimizing some verification score over the training data.

Note that instead of a LN distribution, in (3.6) one can incorporate other non-negative laws with heavy right tails. A natural choice would be the generalized Pareto distribution (GPD) used in extreme value theory (see, e.g., Bentzien and Friederichs, 2012), however, tests for the ensemble forecasts considered here indicate a worse predictive performance of the TN-GPD model compared with the TN-LN mixture and the benchmark models.

In comparison with the basic EMOS models proposed in previous work, the new mixture models exhibit desirable properties from a theoretical perspective as they do not require the exclusive choice of one of multiple parametric families and are more flexible than models based on single parametric distributions. Their advantages from a practical perspective such as a significantly improved calibration will be demonstrated in Section 4.

### 3.3 Verification scores

The main aim of probabilistic forecasting is to access the maximal sharpness of the predictive distribution subject to calibration (Gneiting et al., 2007). Calibration means a statistical consistency between the predictive distributions and the validating observations whereas sharpness refers to the concentration of the predictive distribution. A straightforward way to check the calibration of a probabilistic forecast is the use of probability integral transform (PIT) histograms. The PIT is defined as the value of the predictive CDF evaluated at the verifying observations (Raftery et al., 2005) and the closer the histogram to the uniform distribution, the better the calibration. PIT histograms are continuous analogues of verification rank histograms, a comparison of these histograms can thus be used as a measure of the possible improvements due to statistical post-processing.

Apart from the visual inspection of PIT histograms, formal statistical test of uniformity can be used to assess calibration. As the PIT values of multi-step ahead probabilistic forecast exhibit serial correlation (see, e.g., Diebold et al., 1998) and the probabilistic forecasts cannot be assumed to be independent in space and time, we employ a moment-based test of uniformity proposed by Knüppel (2015) which accounts for dependence in the PIT values. In particular, we use the test of Knüppel (2015) that has been demonstrated to have superior size and power properties compared to alternative choices. Due to the large sample size in case of the ECMWF and UWME data, the null hypothesis of uniformity is rejected for all post-processing models. However, as our focus lies on the comparative assessment of calibration, we report bootstrap estimates of the rejection rates of the test based on 10 000 random samples of size 2 500 each. If a model exhibits superior calibration, the null hypothesis of uniformity should be rejected in fewer cases compared to a model with inferior calibration.

Another approach to assess calibration is the investigation of the coverage of the , central prediction interval, defined as the proportion of validating observations located between the lower and upper quantiles of the predictive distribution, where is chosen to match the nominal coverage of the raw ensemble (ECMWF: ; ALADIN-HUNEPS: ; UWME: ). The coverage of a calibrated predictive PDF should be around and the proposed choices of allow direct comparisons with the raw ensembles. Further, the average widths of these central prediction intervals provide information about the sharpness of the predictive distributions.

Calibration and sharpness can also be addressed simultaneously with the help of scoring rules which measure the predictive performance by numerical values assigned to pairs of probabilistic forecasts and observations (Gneiting and Raftery, 2007). The most popular scoring rules are the logarithmic score (LogS), that is the negative logarithm of the predictive PDF evaluated at the verifying observation,

and the continuous ranked probability score (CRPS; Gneiting and Raftery, 2007; Wilks, 2011). The CRPS of a CDF and an observation is defined as

(3.7) |

where denotes the indicator of a set . While both the CRPS and the logarithmic score are proper scoring rules (Gneiting and Raftery, 2007), the CRPS can be expressed in the same unit as the observation.

Further, point forecasts such as EMOS and ensemble medians and means are evaluated with the help of mean absolute errors (MAEs) and root mean squared errors (RMESs). We remark that the former is optimal for the median, whereas the latter is for the mean (Gneiting, 2011).

Finally, to evaluate the goodness of fit of probabilistic forecasts to high wind speed values a useful tool to be considered is the threshold-weighted continuous ranked probability score (twCRPS)

introduced by Gneiting and Ranjan (2011), where is a weight function. Obviously, yields the traditional CRPS defined by (3.7), while one may set to address wind speeds above a given threshold . Similar to Lerch and Thorarinsdottir (2013) and Baran and Lerch (2015), where the upper tail behaviors of regime-switching EMOS models are investigated, we consider threshold values approximately corresponding to the 90th, 95th and 99th percentiles of the wind speed observations. One can also quantify the improvement in twCRPS with respect to some reference predictive CDF with the help of the threshold-weighted continuous ranked probability skill score (twCRPSS; see, e.g., Lerch and Thorarinsdottir, 2013) defined as

This score is obviously positively oriented, and in this study the predictive CDF corresponding to the classical TN model is used as a reference.

In order to assess the statistical significance of observed score differences between the models we use formal statistical tests of equal predictive performance. Diebold-Mariano (DM; Diebold and Mariano, 1995) tests allow to account for dependence in the forecast errors and are widely used in the econometric literature. Denote the mean values of a proper scoring rule for two competing probabilistic forecasts and by and , respectively, where denotes the forecast cases in a test set of size . The test statistic of the DM test is given by

(3.8) |

where is a suitable estimator of the asymptotic standard deviation of the sequence of score differences . Under some weak regularity assumptions, asymptotically follows a standard normal distribution under the null hypothesis of equal predictive performance. Negative values of indicate a better predictive performance of , whereas is preferred in case of positive values of . The statistical significance of the observed values of the test statistic can be assessed by computing the corresponding p-values under the null hypothesis. Following suggestions of Diebold and Mariano (1995) and Gneiting and Ranjan (2011), we estimate the autocovariance of the sequence of score differences in (3.8) by the sample autocovariance up to lag in case of step ahead forecasts. The results of DM tests based on the LogS, the CRPS and the twCRPS are discussed in Section 4 for the individual ensembles.

## 4 Results

As mentioned in Section 1, the predictive skills of the mixture model (3.6) are tested on the 50 member ECMWF ensemble, the ALADIN-HUNEPS ensemble of the HMS and the eight-member UWME. The three EPSs differ both in generation of the ensemble members and in the predicted wind speed quantity. Model performances are evaluated with the help of the verification scores given in Section 3.3. The basic TN, LN and TN-LN regime-switching combination EMOS models proposed in previous studies are used as benchmark models to assess the predictive performance of the new mixture models. For a detailed evaluation and comparison of the different basic EMOS models, we refer to Baran and Lerch (2015). We further compare the forecasts based on post-processing with the raw ensemble and with climatological forecasts where the observations of the training period are considered as an ensemble.

Following the ideas of Gneiting et al. (2005) and Thorarinsdottir and Gneiting (2010) the parameters of the TN, LN and TN-LN mixture models are estimated by minimizing the mean CRPS of the predictive CDFs and corresponding validating observations over the training period. However, in case of model (3.6) the CRPS can be evaluated only numerically, resulting in very long optimization procedures. Numerical minimization of the right-most part of equation (3.7), i.e., leads to slightly lower computation times and better verification scores compared to minimizing alternative representations of the CRPS integral.

Due to the large computational costs of minimum CRPS estimation, we also investigate maximum likelihood (ML) estimation of the parameters. ML estimation corresponds to minimizing the mean logarithmic score which has a simple and closed form. From a theoretical statistics perspective, both estimation approaches fit into a general framework of optimum score estimation and share asymptotic properties such as consistency, see Gneiting and Raftery (2007) for details. In applications to post-processing ensemble forecasts, the CRPS is often seen as the more appropriate scoring rule for parameter estimation due to the lower sensitivity to outliers and extreme events compared to the LogS (see, e.g., Gneiting et al., 2005; Thorarinsdottir and Gneiting, 2010). In figures and tables, the corresponding mixture models are denoted by TN-LN mix. (CRPS) and TN-LN mix. (ML).

Finally, in order to ensure the comparability with the benchmark models the same training period lengths and for the TN-LN regime-switching models the same thresholds and parameter estimation techniques as in Baran and Lerch (2015) are employed.

### 4.1 ECMWF ensemble

As the fifty members of the ECMWF ensemble are fully exchangeable, the dependencies of the parameters of the TN and LN distributions on the ensemble members are specified by (3.4) and (3.5), respectively, with and .

The preliminary study by Baran and Lerch (2015) suggested that the optimal training period length for this particular data set is 20 days, whereas the optimal value of the threshold parameter of the TN-LN regime-switching combination model equals 8 m s, resulting in the use of an LN distribution in about of the forecast cases. As mentioned in Section 2.1, model verification is performed on 83 220 forecast cases from the one year period between 1 May 2010 and 30 April 2011.

Figure 2 showing the verification rank histogram of the raw ensemble and the PIT histograms of the investigated EMOS models clearly indicates that statistical post-processing significantly improves the calibration of the raw ensemble. However, the histograms of TN and TN-LN regime-switching models are still biased, to a smaller extent the same applies for the LN model, whereas the PIT values of mixture model (3.6) with both parameter estimation methods suggest a better fit to the desired uniform distribution.

Ensemble | TN | LN | TN-LN r.s. | TN-LN mix. (CRPS) | TN-LN mix. (ML) |
---|---|---|---|---|---|

ECMWF | 1 | 1 | 1 | 0.68 | 0.25 |

ALHU | 1 | 1 | 1 | 0 | 0.01 |

UWME | 1 | 1 | 0.68 | 0.47 | 0.31 |

In order to quantify the observed differences in calibration, Table 1 shows rejection rates of the test of uniformity based on random sub-samples. It can be observed that for the ECMWF data, the null hypothesis of uniformity is rejected in all of the cases for the TN, LN and TN-LN combination model, whereas the novel mixture models show much lower rejection rates. This observation is clearly in line with the visual inspection of the PIT histograms in Figure 2.

The positive effect of post-processing can also be observed in Table 2 summarizing the verification scores for different probabilistic forecasts together with the average width and coverage of the central prediction intervals. The improvement with respect to the raw ensemble and climatology is quantified in lower CRPS, twCRPS, MAE and RMSE values and the EMOS predictive PDFs result in calibrated central prediction intervals with coverages very close to the nominal value. The much wider central prediction intervals of the EMOS models compared to the ensemble are a natural consequence of the underdispersive character of the latter.

Forecast | CRPS | twCRPS m s | MAE | RMSE | Cover. | Av.w. | ||
---|---|---|---|---|---|---|---|---|

m s | m s | m s | m s | |||||

TN-LN mix. (CRPS) | 1.030 | 0.194 | 0.106 | 0.041 | 1.384 | 2.135 | 94.34 | 7.71 |

TN-LN mix. (ML) | 1.034 | 0.196 | 0.108 | 0.041 | 1.391 | 2.138 | 95.81 | 8.72 |

TN | 1.045 | 0.200 | 0.110 | 0.042 | 1.388 | 2.148 | 92.19 | 6.39 |

LN | 1.037 | 0.198 | 0.109 | 0.042 | 1.386 | 2.138 | 93.16 | 6.91 |

TN-LN r.s. () | 1.033 | 0.191 | 0.103 | 0.039 | 1.379 | 2.135 | 92.49 | 6.36 |

Ensemble | 1.263 | 0.211 | 0.113 | 0.043 | 1.441 | 2.232 | 45.00 | 1.80 |

Climatology | 1.550 | 0.251 | 0.128 | 0.045 | 2.144 | 2.986 | 95.84 | 11.91 |

Among the competing post-processing methods the TN-LN mixture and regime-switching models clearly outperform the TN and LN EMOS approaches in almost all scores investigated. The lowest CRPS value belongs to the mixture model with parameters estimated by optimizing the mean CRPS, whereas the regime-switching approach produces the best MAE, RMSE and twCRPS scores. The two parameter estimation methods make only a very slight difference in model performance (ML estimation leads to slightly worse scores) and the TN-LN mixture EMOS models are fully able to keep up with the regime-switching approach. This ranking of models can also be observed in Figure 3 displaying the twCRPSS values of the LN, TN-LN regime-switching and TN-LN mixture (with CRPS and logarithmic score optimization) EMOS methods with respect to the reference TN EMOS model as functions of the threshold . While the regime-switching approach outperforms the other models for all threshold values and has the best overall performance, it is clearly less flexible than the mixture models which show the second best performance.

TN | LN | TN-LN r.s. | |

DM test based on LogS | |||

TN-LN mix. (CRPS) | -36.67 | -24.25 | -32.79 |

TN-LN mix. (ML) | -36.43 | -28.91 | -32.86 |

DM test based on CRPS | |||

TN-LN mix. (CRPS) | -45.10 | -27.10 | -4.71 |

TN-LN mix. (ML) | -30.29 | -11.42 | 0.32 |

DM test based on twCRPS with threshold | |||

TN-LN mix. (CRPS) | -21.55 | -17.32 | 6.32 |

TN-LN mix. (ML) | -10.49 | -2.83 | 9.92 |

Table 3 summarizes the values of the test statistics of DM tests in order to assess the statistical significance of the observed score differences between the novel mixture models and the benchmark models. It can be observed that in terms of all employed proper scoring rules, the mixture models perform significantly better than the TN and LN models. The results of the comparison with the TN-LN combination model depend on the scoring rule. While the mixture models perform significantly better in terms of the LogS, the combination model is preferred in terms of the twCRPS.

Figure 4 shows the weights of the mixture model (3.6) estimated using optimizations with respect to the mean CRPS and the mean logarithmic score over the training data. Despite the similar predictive skills (see Table 2), the two parameter estimating methods result in completely different sets of weights having only a minor non-significant correlation of . However, having a closer look at the predictive PDFs one can observe that the corresponding locations and scales/shapes of the TN and LN components produced by the two different estimation methods are strongly correlated, their correlations vary between and , except for the scales of the TN component with a correlation of .

### 4.2 ALADIN-HUNEPS ensemble

The ALADIN-HUNEPS ensemble consists of a control member and 10 exchangeable ensemble members (see Section 2.2) inducing a natural splitting of the ensemble into two groups. The first group contains just the control, whereas the second group consists of the 10 exchangeable ensemble members. This results in TN and LN models (3.4) and (3.5), respectively, where , with and .

For this particular ensemble Baran et al. (2014) and Baran and Lerch (2015) showed that a training period of length 43 days is optimal both for the TN and the LN EMOS models, whereas the optimal threshold for the TN-LN regime-switching combination model is m s. For this threshold value, a LN distribution is used in of the forecast cases.

Forecast | CRPS | twCRPS m s | MAE | RMSE | Cover. | Av.w. | ||
---|---|---|---|---|---|---|---|---|

m s | m s | m s | m s | |||||

TN-LN mix. (CRPS) | 0.736 | 0.100 | 0.053 | 0.011 | 1.037 | 1.358 | 83.02 | 3.62 |

TN-LN mix. (ML) | 0.737 | 0.100 | 0.053 | 0.012 | 1.040 | 1.360 | 83.14 | 3.58 |

TN | 0.738 | 0.102 | 0.054 | 0.012 | 1.037 | 1.357 | 83.59 | 3.53 |

LN | 0.741 | 0.102 | 0.054 | 0.011 | 1.038 | 1.362 | 80.44 | 3.57 |

TN-LN r.s. () | 0.737 | 0.101 | 0.054 | 0.011 | 1.035 | 1.356 | 83.59 | 3.54 |

Ensemble | 0.803 | 0.112 | 0.059 | 0.013 | 1.069 | 1.373 | 68.22 | 2.88 |

Climatology | 1.046 | 0.127 | 0.064 | 0.012 | 1.481 | 1.922 | 82.54 | 3.43 |

Using a 43 days training period one has ensemble forecasts and verifying observations for 315 calendar days (i.e., 3 150 forecast cases) between 15 May 2012 and 31 March 2013. Figure 5 shows the PIT histograms of the various post-processing methods together with the verification rank histogram of the ALADIN-HUNEPS ensemble. Again, compared with the raw ensemble one can clearly see the improvement in calibration of post-processed forecasts, whereas from the competing EMOS methods the two variants of the mixture model (3.6) provide the most uniform PIT histograms.

The significantly better calibration of the mixture models can also be observed from the rejection rates of the test reported in Table 1. The null hypothesis of uniformity of the PIT values is not rejected for almost all of the random samples, whereas it is rejected on almost all occasions for the competing models based on single parametric distributions and the TN-LN regime-switching combination model.

In Table 4 the verification scores of probabilistic and point forecasts and coverage and average width of central prediction intervals are given for the various EMOS models, the ALADIN-HUNEPS ensemble and climatological forecasts. The raw ensemble outperforms climatology and produces sharp forecasts, however, at the cost of being uncalibrated. Post-processing substantially improves the calibration and predictive skill of the raw ensemble, which is in line with the shapes of histograms displayed in Figure 5. The TN-LN mixture and regime-switching combination models show some small improvements over the TN and LN models in terms of all scoring rules and display almost the same predictive performance. For small threshold values the two versions of the mixture model slightly outperform the three benchmark EMOS approaches, however, above the 99th percentile of the validating observations (9 m s), their performances decay quickly, see Figure 6.

TN | LN | TN-LN r.s. | |

DM test based on LogS | |||

TN-LN mix. (CRPS) | 0.71 | -4.78 | 0.63 |

TN-LN mix. (ML) | -2.43 | -5.36 | -2.09 |

DM test based on CRPS | |||

TN-LN mix. (CRPS) | -2.03 | -3.73 | -0.71 |

TN-LN mix. (ML) | -0.58 | -2.56 | 0.31 |

DM test based on twCRPS with threshold | |||

TN-LN mix. (CRPS) | -3.19 | -1.90 | -1.14 |

TN-LN mix. (ML) | -2.24 | -1.86 | -1.11 |

In order to assess the statistical significance of the score differences results of the DM tests of equal predictive performance are reported in Table 5. In terms of LogS and CRPS, the mixture models significantly outperform the LN model, and the same observation holds in terms of the twCRPS for the comparison with the TN model. In comparison with the TN-LN combination model, the preferred model depends on the employed scoring rule. The only significant score difference in this comparison is in terms of the LogS and favors the mixture model based on ML estimation.

Finally, similar to the previous case study, the weights belonging to the two parameter estimation methods for the TN-LN mixture model (see Figure 7) are uncorrelated, whereas the correlations of the corresponding location and scale/shape parameters of the TN ( and ) and LN components ( and ) are and , respectively.

### 4.3 University of Washington Mesocale Ensemble

The members of the UWME are clearly distinguishable, as they are generated using initial conditions from eight different sources. Hence, location and scale/shape parameters of the TN and LN models are linked to the ensemble via (3.1) and (3.2), respectively, with . According to Baran and Lerch (2015) the optimal training period for this data set is of length 30 days and the optimal threshold value of the TN-LN combination model is equal to 5.7 m s. Ensemble forecasts for calendar year 2008 are calibrated using these parameters, and in case of the regimes-switching approach a LN model is used in around one third of the 27 481 individual forecast cases.

Similar to the previous two sections consider first the PIT histograms of the EMOS predictive distributions displayed in Figure 8. Compared with the verification rank histogram of the raw ensemble, all post-processing methods result in significant improvements in the goodness of fit to the uniform distribution, while, from the various calibration methods, the TN-LN mixture and regime-switching models have the best performance.

The rejection rates of the tests reported in Table 1 indicate that the TN-LN mixture model based on ML estimation exhibits the best calibration followed by the mixture model based on minimum CPRS estimation and the TN-LN combination model.

Forecast | CRPS | twCRPS m s | MAE | RMSE | Cover. | Av. w. | ||
---|---|---|---|---|---|---|---|---|

m s | m s | m s | m s | |||||

TN-LN mix. (CRPS) | 1.105 | 0.147 | 0.073 | 0.010 | 1.550 | 2.045 | 79.02 | 4.77 |

TN-LN mix. (ML) | 1.108 | 0.147 | 0.073 | 0.010 | 1.560 | 2.062 | 78.12 | 4.78 |

TN | 1.114 | 0.150 | 0.074 | 0.010 | 1.550 | 2.048 | 78.65 | 4.67 |

LN | 1.114 | 0.147 | 0.073 | 0.010 | 1.554 | 2.052 | 77.29 | 4.69 |

TN-LN r.s. () | 1.105 | 0.149 | 0.073 | 0.010 | 1.550 | 2.050 | 77.73 | 4.64 |

Ensemble | 1.353 | 0.175 | 0.085 | 0.011 | 1.655 | 2.169 | 45.24 | 2.53 |

Climatology | 1.412 | 0.173 | 0.081 | 0.010 | 1.987 | 2.629 | 81.10 | 5.90 |

Verification scores for probabilistic and point forecasts and the coverage and average width of central prediction intervals are reported in Table 6. Compared with the raw ensemble and climatology post-processed forecast exhibit the same behavior as before: improved predictive skills and better calibration. In general, models based on combinations of both investigated distributions outperform the TN and LN methods, the smallest CRPS and MAE values and the best coverage, combined with a rather narrow central prediction interval, belong to the regime-switching approach, while the mixture model with parameters optimizing the mean CRPS provides the lowest twCRPS and RMSE scores.

TN | LN | TN-LN r.s. | |

DM test based on LogS | |||

TN-LN mix. (CRPS) | -7.63 | -13.62 | -4.68 |

TN-LN mix. (ML) | -18.44 | -16.45 | -12.24 |

DM test based on CRPS | |||

TN-LN mix. (CRPS) | -15.73 | -16.70 | -0.95 |

TN-LN mix. (ML) | -5.62 | -6.52 | 2.24 |

DM test based on twCRPS with threshold | |||

TN-LN mix. (CRPS) | -7.68 | -4.63 | 1.03 |

TN-LN mix. (ML) | -7.04 | -5.11 | 1.06 |

The results of DM tests of equal predictive performance are summarized in Table 7. In terms of all employed scoring rules, the mixture models exhibit significantly better predictive performance compared to the TN and LN models. As before, the results for the comparisons with the TN-LN combination model are mixed and depend on the scoring rule. While the mixture models show significantly better results in terms of the LogS, the combination model is preferred in terms of the CPRS and its threshold-weighted version.

This desirable behavior of the mixture models for high wind speeds can also be observed in Figure 9 where the twCRPSS values of the LN, TN-LN regime-switching and mixture models with respect to the TN EMOS reference model are plotted as functions of the threshold. Up to threshold m s the regime-switching method slightly outperforms the mixture model, whereas above it this advantage disappears.

In contrast to the ECMWF and ALADIN-HUNEPS ensembles, the weights of the TN component of the two versions of model (3.6) plotted in Figure 10 show a positive correlation of . Finally, for the UWME the parameter estimates of and exhibit stronger correlations than the estimated location and scale parameters and of the TN component, the corresponding values are and , respectively.

## 5 Conclusions

A new EMOS model for post-processing ensemble forecasts of wind speed is introduced, where the predictive PDF is a weighted mixture of a truncated normal and a log-normal distribution with location and scale/shape parameters depending on the ensemble. Model parameters and mixture weight are estimated simultaneously by optimizing either the mean continuous ranked probabilistic score or the mean logarithmic score (ML estimation) of the predictive distribution over the training data.

The mixture models are tested on three data sets of wind speed forecasts which differ in the generation of the ensemble members and the predicted wind quantities. The predictive skills of the new model are compared with those of the TN based EMOS method (Thorarinsdottir and Gneiting, 2010), the LN and the TN-LN regime-switching EMOS models (Baran and Lerch, 2015), the raw ensemble and the climatological forecasts with the help of graphical tools, appropriate verification scores and formal statistical tests of calibration and equal predictive performance. The presented case studies clearly show that compared with the raw ensemble and climatology, statistical post-processing results in a significant improvement in calibration of probabilistic and accuracy of point forecasts.

In a comparative view of the different EMOS models it can be observed that the TN-LN regime-switching combination model and the new mixture models significantly outperform the simple EMOS models based on single TN and LN distributions and provide much better calibrated probabilistic forecasts as demonstrated by formal statistical tests. The novel mixture models are able to keep up with the TN-LN regime-switching combination method in terms of the various verification scores. Further, the results of formal tests of uniformity indicate a superior calibration of the forecasts produced by the mixture models compared with the combination model. No substantial difference can be observed between the results corresponding to the two parameter estimation methods of the mixture model, ML estimation results in slightly worse verification scores but provides better calibrated forecasts.

Compared with the TN-LN regime-switching combination model, the proposed mixture models exhibit desirable properties from both a theoretical as well as an applied perspective. They are more flexible in that they do not require the exclusive choice of one of the parametric families as forecast distribution. Further, it is not necessary to determine suitable covariates for the model selection, or to estimate the model selection threshold over a training period.

Acknowledgments

Essential part of this work was made during a visit of Sándor Baran at the Heidelberg Institute for Theoretical Studies. The research stay in Heidelberg was funded by the DAAD program “Research Stays for University Academics and Scientists, 2015”. Sándor Baran was also supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences. Sebastian Lerch gratefully acknowledges support by the Volkswagen Foundation within the program “Mesoscale Weather Extremes – Theory, Spatial Modelling and Prediction (WEX-MOP)”, and by the Klaus Tschira Foundation. The authors thank Tilmann Gneiting, Alexander Jordan and Fabian Krüger for helpful discussions, and Fabian Krüger for providing R code for the test. The authors further thank the University of Washington MURI group for providing the UWME data and Mihály Szűcs from the HMS for providing the ALADIN-HUNEPS data.

## References

- Baran (2014) Baran S. 2014. Probabilistic wind speed forecasting using Bayesian model averaging with truncated normal components. Computational Statistics and Data Analysis 75:227–238, DOI: 10.1016/j.csda.2014.02.013.
- Baran and Lerch (2015) Baran S, Lerch S. 2015. Log-normal distribution based EMOS models for probabilistic wind speed forecasting. Quarterly Journal of the Royal Meteorological Society, DOI:10.1002/qj.2521.
- Baran et al. (2014) Baran S, Horányi A, Nemoda D. 2014. Comparison of the BMA and EMOS statistical methods in calibrating temperature and wind speed forecast ensembles. Időjárás 118:217–241.
- Bentzien and Friederichs (2012) Bentzien S, Friederichs P. 2012. Generating and calibrating probabilistic quantitative precipitation forecasts from the high-resolution NWP model COSMO-DE. Weather and Forecasting 27:988–1002, DOI: 10.1175/waf-d-11-00101.1.
- Descamps et al. (2014) Descamps L, Labadie C, Joly A, Bazile E, Arbogast P, Cébron P. 2014. PEARP, the Météo-France short-range ensemble prediction system. Quarterly Journal of the Royal Meteorological Society, DOI: 10.1002/qj.2469.
- Diebold et al. (1998) Diebold FX, Gunther T and Tay A. 1998. Evaluating density forecasts, with applications to financial risk management. International Economic Review 39:863–883, DOI: 10.2307/2527342.
- Diebold and Mariano (1995) Diebold FX, Mariano, RS. 1995. Comparing predictive accuracy. Journal of Business & Economic Statistics 13:253–263, DOI: 10.1198/073500102753410444.
- Eckel and Mass (2005) Eckel FA, Mass CF. 2005. Effective mesoscale, short-range ensemble forecasting. Weather and Forecasting 20:328–350, DOI: 10.1175/waf843.1.
- ECMWF Directorate (2012) ECMWF Directorate 2012. Describing ECMWF’s forecasts and forecasting system. ECMWF Newsletter 133:11–13.
- Gebhardt et al. (2011) Gebhardt C, Theis SE, Paulat M, Bouallègue ZB. 2011. Uncertainties in COSMO-DE precipitation forecasts introduced by model perturbations and variation of lateral boundaries. Atmospheric Research 100:168–177, DOI: 10.1016/j.atmosres.2010.12.008.
- Gneiting (2011) Gneiting T. 2011. Making and evaluating point forecasts. Journal of the American Statistical Association 106:746–762, DOI: 10.1198/jasa.2011.r10138.
- Gneiting (2014) Gneiting T. 2014. Calibration of medium-range weather forecasts. ECMWF Technical Memorandum No. 719. (Available from: http://old.ecmwf.int/publications/library/ecpublications/_pdf/tm/701-800/tm719.pdf.) [Accessed on 27 July 2015]
- Gneiting and Raftery (2005) Gneiting T, Raftery AE. 2005. Weather forecasting with ensemble methods. Science 310:248–249, DOI: 10.1126/science.1115255.
- Gneiting and Raftery (2007) Gneiting T, Raftery AE. 2007. Strictly proper scoring rules, prediction and estimation. Journal of the American Statistical Association 102:359–378, DOI: 10.1198/016214506000001437.
- Gneiting and Ranjan (2011) Gneiting T, Ranjan R. 2011. Comparing density forecasts using threshold- and quantile-weighted scoring rules. Journal of Business & Economic Statistics 29:411–422, DOI: 10.1198/jbes.2010.08110.
- Gneiting et al. (2007) Gneiting T, Balabdaoui F, Raftery AE. 2007. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B 69:243–268, DOI: 10.1111/j.1467-9868.2007.00587.x.
- Gneiting et al. (2005) Gneiting T, Raftery AE, Westveld AH, Goldman T. 2005. Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly Weather Review 133:1098–1118, DOI: 10.1175/mwr2904.1.
- Grell et al. (1995) Grell GA, Dudhia J, Stauffer DR. 1995. A description of the fifth-generation Penn state/NCAR mesoscale model (MM5). Technical Note NCAR/TN-398+STR. National Center for Atmospheric Research, Boulder. (Available from: http://nldr.library.ucar.edu/repository/assets/technotes/TECH-NOTE-000-000-000-214.pdf) [Accessed on 27 July 2015]
- Hágel (2010) Hágel E. 2010. The quasi-operational LAMEPS system of the Hungarian Meteorological Service. Időjárás 114:121–133.
- Horányi et al. (2006) Horányi A, Kertész S, Kullmann L, Radnóti G. 2006. The ARPEGE/ALADIN mesoscale numerical modeling system and its application at the Hungarian Meteorological Service. Időjárás 110:203–227.
- Knüppel (2015) Knüppel M. 2015. Evaluating the calibration of multi-step-ahead density forecasts using raw moments. Journal of Business & Economic Statistics 33:270–281, DOI: 10.1080/07350015.2014.948175.
- Lerch and Thorarinsdottir (2013) Lerch S, Thorarinsdottir TL. 2013. Comparison of non-homogeneous regression models for probabilistic wind speed forecasting. Tellus A 65:21206, DOI: 10.3402/tellusa.v65i0.21206.
- Leutbecher and Palmer (2008) Leutbecher M, Palmer TN. 2008. Ensemble forecasting. Journal of Computational Physics 227:3515–3539, DOI: 10.1016/j.jcp.2007.02.014.
- Molteni et al. (1996) Molteni F, Buizza R, Palmer TN, Petroliagis, T. 1996. The ECMWF Ensemble Prediction System: Methodology and validation. Quarterly Journal of the Royal Meteorological Society 122:73–119, DOI: 10.1002/qj.49712252905.
- National Weather Service (1998) National Weather Service. 1998. Automated Surface Observing System (ASOS) User’s Guide. (Available from: http://www.weather.gov/asos/aum-toc.pdf) [Accessed on 27 July 2015]
- Raftery et al. (2005) Raftery AE, Gneiting T, Balabdaoui F, Polakowski M. 2005. Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review 133:1155–1174, DOI: 10.1175/mwr2906.1.
- Sloughter et al. (2010) Sloughter JM, Gneiting T, Raftery AE. 2010. Probabilistic wind speed forecasting using ensembles and Bayesian model averaging. Journal of the American Statistical Association 105:25–37, DOI: 10.1198/jasa.2009.ap08615.
- Thorarinsdottir and Gneiting (2010) Thorarinsdottir TL, Gneiting T. 2010. Probabilistic forecasts of wind speed: ensemble model output statistics by using heteroscedastic censored regression. Journal of the Royal Statistical Society: Series A 173:371–388, DOI: 10.1111/j.1467-985X.2009.00616.x.
- Wilks (2011) Wilks DS. 2011. Statistical Methods in the Atmospheric Sciences (3rd ed.). Elsevier: Amsterdam.