Forecast Error Measures: Scaled, Relative, and other Errors | by Manu Joseph | Oct, 2020


Following on from my earlier blog post about the standard Absolute, Squared and Percent Errors, let’s take a look at the alternatives: Scaled, Relative and other error measures for time series forecasting.

Manu Joseph

Both Scaled Error and Relative Error are extrinsic error measures. They depend on another reference forecast to evaluate themselves, and more often than not, in practice, the reference forecast is a Naïve Forecast or a Seasonal Naïve Forecast. In addition to these errors, we will also look at measures like Percent Better, cumulative Forecast Error, Tracking Signal, etc.

When we say Relative Error, there are two main ways of calculating it, and Shcherbakov et al. call them Relative Errors and Relative Measures.

Relative Error is when we use the forecast from a reference model as a base to compare the errors, and Relative Measures is when we use some forecast measure from a reference base model to calculate the errors.

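Relative Error is calculated as below:

Relative Error_t = e_t / e*_t = (a_t - f_t) / (a_t - f*_t)

where a_t is the ground truth, f_t is our forecast, and f*_t is the reference forecast at timestep t.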
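As a minimal sketch of this idea, the snippet below takes the ratio of our absolute error to the reference forecast's absolute error at each timestep and then aggregates it with a mean (MRAE), a median (MdRAE) and a geometric mean (GMRAE). The function names and the toy numbers are illustrative assumptions, not code from the experiments.

```python
import math
from statistics import mean, median


def relative_absolute_errors(actuals, forecasts, reference_forecasts):
    """Per-timestep |a - f| / |a - f*|; undefined when the reference is exact."""
    ratios = []
    for a, f, f_ref in zip(actuals, forecasts, reference_forecasts):
        ref_error = abs(a - f_ref)
        if ref_error == 0:
            raise ZeroDivisionError("reference forecast equals the actual")
        ratios.append(abs(a - f) / ref_error)
    return ratios


def geometric_mean(values):
    """Geometric mean via logs; assumes every value is strictly positive."""
    return math.exp(mean(math.log(v) for v in values))


# Toy data (hypothetical). The naive forecast is the previous actual;
# the 9.0 stands in for an assumed actual just before the window.
actuals = [10.0, 12.0, 15.0, 14.0]
forecasts = [11.0, 11.0, 13.0, 15.0]
naive = [9.0, 10.0, 12.0, 15.0]

rae = relative_absolute_errors(actuals, forecasts, naive)
mrae = mean(rae)             # Mean Relative Absolute Error
mdrae = median(rae)          # Median Relative Absolute Error
gmrae = geometric_mean(rae)  # Geometric Mean Relative Absolute Error
print(mrae, mdrae, gmrae)
```

A value below 1 means our forecast beat the reference on that aggregate; the explicit exception makes the "undefined when the reference is exact" case visible instead of silently producing infinities.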

Similarly, Relative Measures are calculated as below:

Relative MAE = MAE / MAE*

where MAE is the Mean Absolute Error of the forecast and MAE* is the MAE of the reference forecast. The measure can be anything really, and not just MAE.
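A small sketch of a relative measure, assuming we simply divide the measure computed on our forecast by the same measure computed on the reference forecast. The `relative_measure` helper and its `measure` parameter are hypothetical names used only for illustration; the same toy data gives RMAE and RRMSE by swapping the measure.

```python
import math
from statistics import mean


def mae(actuals, forecasts):
    """Mean Absolute Error."""
    return mean(abs(a - f) for a, f in zip(actuals, forecasts))


def rmse(actuals, forecasts):
    """Root Mean Squared Error."""
    return math.sqrt(mean((a - f) ** 2 for a, f in zip(actuals, forecasts)))


def relative_measure(actuals, forecasts, reference_forecasts, measure=mae):
    """measure(forecast) / measure(reference); undefined if the reference is perfect."""
    reference_value = measure(actuals, reference_forecasts)
    if reference_value == 0:
        raise ZeroDivisionError("reference forecast has zero error")
    return measure(actuals, forecasts) / reference_value


# Toy data (hypothetical numbers)
actuals = [10.0, 12.0, 15.0, 14.0]
forecasts = [11.0, 11.0, 13.0, 15.0]
naive = [9.0, 10.0, 12.0, 15.0]

rmae = relative_measure(actuals, forecasts, naive, measure=mae)
rrmse = relative_measure(actuals, forecasts, naive, measure=rmse)
print(rmae, rrmse)
```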

Relative Error is based on a reference forecast. Most commonly we use the Naïve Forecast, but it does not have to be that every time. For instance, we can use the Relative Measures if we have an existing forecast we are trying to better, or we can use the baseline forecast we define during the development cycle, etc.

One disadvantage we can see right away is that it will be undefined when the reference forecast is equal to the ground truth. And this can be the case for either very stable time series or intermittent ones where we can have the same ground truth repeated, which makes the naïve forecast equal to the ground truth.

Scaled Error was proposed by Hyndman and Koehler in 2006. They proposed to scale the errors based on the in-sample MAE from the naïve forecasting method. So instead of using the ground truth from the previous timestep as the scaling factor, we use the average absolute error across the entire series as the scaling factor.

Scaled Error_t = e_t / [ (1 / (n - l)) × Σ_{i=l+1..n} |a_i - a_{i-l}| ]

where e_t is the error at timestep t, n is the length of the timeseries, a_i is the ground truth at timestep i, and l is the offset. l is 1 for naïve forecasting. Another popular alternative is l = seasonal period, e.g. l = 12 for a seasonality of 12 months.
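To make the scaling concrete, here is a minimal sketch of the Mean Absolute Scaled Error, assuming the scale is computed on a separate in-sample (training) series and `offset` plays the role of l. The function names and the numbers are illustrative only.

```python
from statistics import mean


def in_sample_naive_mae(train_actuals, offset=1):
    """MAE of the (seasonal) naive forecast on the in-sample series."""
    if len(train_actuals) <= offset:
        raise ValueError("need more in-sample points than the offset")
    return mean(
        abs(train_actuals[i] - train_actuals[i - offset])
        for i in range(offset, len(train_actuals))
    )


def mase(train_actuals, actuals, forecasts, offset=1):
    """Mean Absolute Scaled Error: forecast MAE divided by the in-sample naive MAE."""
    scale = in_sample_naive_mae(train_actuals, offset)
    if scale == 0:
        raise ZeroDivisionError("in-sample series is constant at this offset")
    return mean(abs(a - f) for a, f in zip(actuals, forecasts)) / scale


# Toy data (hypothetical numbers)
train = [10.0, 12.0, 11.0, 13.0, 15.0]
actuals = [14.0, 16.0]
forecasts = [15.0, 14.0]

mase_naive = mase(train, actuals, forecasts, offset=1)
mase_lag2 = mase(train, actuals, forecasts, offset=2)
print(mase_naive, mase_lag2)
```

Because the denominator averages over the whole in-sample series, a single timestep where consecutive actuals coincide no longer makes the measure undefined; only a series that is constant at the chosen offset does.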

Here the in-sample MAE is chosen because it is always available and is more reliable for estimating the scale than the out-of-sample one.

In our previous blog, we checked Scale Dependency, Symmetricity, Loss Curves, Over and Under Forecasting, and Impact of Outliers. But this time, we are dealing with relative errors. Plotting loss curves is therefore not easy anymore because there are three inputs (ground truth, forecast, and reference forecast) and the value of the measure can vary with each of these. Over and Under Forecasting and Impact of Outliers we can still check.

The loss curves are plotted as a contour map to accommodate the three dimensions: Error, Reference Forecast and the measure value.

We can see that the errors are symmetric around the Error axis. If we keep the Reference Forecast constant and vary the error, the measures are symmetric on both sides of the errors. This is not surprising, since all these errors have their base in absolute error, which we saw was symmetric.

But the interesting thing here is the dependency on the reference forecast. The same error leads to different Relative Absolute Error values depending on the Reference Forecast.

We can see the same asymmetry in the 3D plot of the curve as well. But Scaled Error is different here because it is not directly dependent on the Reference Forecast, but rather on the mean absolute error of the reference forecast. It therefore has the nice symmetry of absolute error and very little dependency on the reference forecast.

For the Over and Under Forecasting experiment, we repeated the same setup from last time*, but for these four error measures: Mean Relative Absolute Error (MRAE), Mean Absolute Scaled Error (MASE), Relative Mean Absolute Error (RMAE), and Relative Root Mean Squared Error (RRMSE).

* With one small change: we also add a random noise less than 1 to make sure consecutive actuals are not the same. In such cases the Relative Measures are undefined.

We can see that these scaled and relative errors do not have the problem of favoring over or under forecasting. The error bars of the low forecast and the high forecast are equally bad. Even in cases where the base error was favoring one of these (e.g. MAPE), the relative error measure (RMAPE) reduces that “favor” and makes the error measure more robust.

One other thing we notice is that the Mean Relative Absolute Error has a huge spread (I have actually zoomed in to make the plot legible). For example, the median baseline_rmae is 2.79 and the maximum baseline_mrae is 42k. This huge spread shows us that the Mean Relative Absolute Error has low reliability: depending on the sample, the errors vary wildly. This may be partly because of the way we use the Reference Forecast. If the Ground Truth is too close to the Reference Forecast (in this case the Naïve Forecast), the errors are going to be much higher. This disadvantage is partly resolved by using the Median Relative Absolute Error (MdRAE).
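A tiny illustration of why the mean is so fragile here, using made-up numbers: at one timestep the actual is almost identical to the naïve forecast, so a single ratio explodes and drags the mean with it, while the median barely notices.

```python
from statistics import mean, median

# Hypothetical series: the second actual is within 0.001 of the previous one,
# so the naive (previous-actual) forecast is almost exact at that timestep.
actuals = [100.0, 100.001, 103.0, 101.0, 104.0]
naive = [99.0, 100.0, 100.001, 103.0, 101.0]  # 99.0 is an assumed earlier actual
forecasts = [101.0, 101.0, 102.0, 102.0, 103.0]

# Relative absolute error per timestep: |a - f| / |a - f*|
rae = [abs(a - f) / abs(a - r) for a, f, r in zip(actuals, forecasts, naive)]

mrae = mean(rae)     # dominated by the one near-zero denominator
mdrae = median(rae)  # stays at a typical value
print(mrae, mdrae)
```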

To check the outlier impact, we also repeated the same experiment from the previous blog post for MRAE, MASE, RMAE, and RRMSE.

Apart from these standard error measures, there are a few more tailored to tackle aspects of the forecast which are not properly covered by the measures we have seen so far.

Out of all the measures we have seen so far, only MAPE is what I would call interpretable for non-technical folks. But as we saw, MAPE does not have the best of properties. None of the other measures intuitively conveys how good or bad the forecast is. Percent Better is another attempt at getting that kind of interpretability.

Percent Better (PB) also relies on a reference forecast and measures our forecast by counting the number of instances where our forecast error measure was better than the reference forecast error.

For example:

PB(MAE) = (1/N) × Σ I

where I = 0 when MAE > MAE* and 1 when MAE < MAE*, and N is the number of instances.
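A minimal sketch of Percent Better, assuming we already have one error value per instance (for example one MAE per series) for both our forecast and the reference. Treating ties as "not better" and reporting the result on a 0 to 100 scale are choices made here for illustration.

```python
def percent_better(errors, reference_errors):
    """Percentage of instances where our error is strictly below the reference error."""
    if not errors or len(errors) != len(reference_errors):
        raise ValueError("need two non-empty sequences of equal length")
    wins = sum(1 for e, e_ref in zip(errors, reference_errors) if e < e_ref)
    return 100.0 * wins / len(errors)


# Hypothetical per-series MAE values
our_mae = [1.2, 0.8, 3.0, 2.5]
reference_mae = [1.5, 0.9, 2.0, 2.6]
pb = percent_better(our_mae, reference_mae)

# The score only counts wins, so a 50% improvement and a 1% improvement
# on every series produce exactly the same Percent Better.
pb_big_wins = percent_better([0.5, 0.5], [1.0, 1.0])
pb_tiny_wins = percent_better([0.99, 0.99], [1.0, 1.0])
print(pb, pb_big_wins, pb_tiny_wins)
```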

Similarly, we can extend this to any other error measure. This gives us an intuitive understanding of how much better we are doing compared to the reference forecast. It is also quite resistant to outliers because it only counts the instances instead of measuring or quantifying the error.

That is also a key disadvantage. We are only measuring the count of the times we are better. It does not measure how much better or how much worse we are doing. Whether our error is 50% less than the reference error or 1% less, the impact on the Percent Better score is the same.

Normalized RMSE was proposed to neutralize the scale dependency of RMSE. The general idea is to divide RMSE by a scalar, like the maximum value in all the timeseries, or the difference between the maximum and the minimum, or the mean value of all the ground truths, etc.

Since dividing by the maximum, or by the difference between the maximum and the minimum, is prone to influence from outliers, the popular use of nRMSE is normalizing with the mean.

nRMSE = RMSE / mean(y)
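As a quick sketch of the mean-normalized variant (function names and data are illustrative), the snippet below also checks the point of the normalization: rescaling both actuals and forecasts by the same factor leaves nRMSE unchanged.

```python
import math
from statistics import mean


def rmse(actuals, forecasts):
    """Root Mean Squared Error."""
    return math.sqrt(mean((a - f) ** 2 for a, f in zip(actuals, forecasts)))


def nrmse(actuals, forecasts):
    """RMSE divided by the mean of the actuals; undefined when that mean is zero."""
    denominator = mean(actuals)
    if denominator == 0:
        raise ZeroDivisionError("mean of actuals is zero")
    return rmse(actuals, forecasts) / denominator


# Toy data (hypothetical numbers)
actuals = [10.0, 12.0, 15.0, 14.0]
forecasts = [11.0, 11.0, 13.0, 15.0]

value = nrmse(actuals, forecasts)
value_rescaled = nrmse([a * 1000 for a in actuals], [f * 1000 for f in forecasts])
print(value, value_rescaled)
```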

All the errors we have seen so far focus on penalizing errors, whether positive or negative. We use an absolute or squared term to make sure the errors do not cancel each other out and paint a rosier picture than the reality.

But by doing this, we also become blind to structural problems with the forecast. If we are consistently over forecasting or under forecasting, that is something we should be aware of and take corrective action on. But none of the measures we have seen so far looks at this perspective.

This is where Forecast Bias comes in.

Forecast Bias = (Σ a_t - Σ f_t) / Σ a_t × 100%

Although it looks like the Percent Error formula, the key here is the absence of the absolute term. Without the absolute term, we are cumulating the actuals and the forecast and measuring the difference between them as a percentage. This gives an intuitive explanation. If we see a bias of 5%, we can infer that overall, we are under-forecasting by 5%. Depending on whether we use Actuals - Forecast or Forecast - Actuals, the sign of the interpretation flips, but the spirit is the same.

If we are calculating across timeseries, we likewise cumulate the actuals and the forecast at whatever cut of the data we are measuring and then calculate the Forecast Bias.
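A short sketch of Forecast Bias using the Actuals minus Forecast convention, so a positive value means under-forecasting. The helper names and the numbers are illustrative assumptions. The pooled version cumulates actuals and forecasts over every series in the cut before taking the ratio.

```python
def forecast_bias(actuals, forecasts):
    """(sum of actuals - sum of forecasts) / sum of actuals, in percent."""
    total_actuals = sum(actuals)
    if total_actuals == 0:
        raise ZeroDivisionError("sum of actuals is zero")
    return 100.0 * (total_actuals - sum(forecasts)) / total_actuals


def pooled_forecast_bias(series):
    """Bias over several series: cumulate everything first, then take the ratio.

    `series` is a list of (actuals, forecasts) pairs.
    """
    all_actuals = [a for actuals, _ in series for a in actuals]
    all_forecasts = [f for _, forecasts in series for f in forecasts]
    return forecast_bias(all_actuals, all_forecasts)


# Single series: forecasts add up to 380 against actuals of 400
bias = forecast_bias([100.0, 120.0, 80.0, 100.0], [95.0, 110.0, 80.0, 95.0])

# Two hypothetical series with opposite biases that cancel at the pooled level
series_a = ([100.0, 100.0], [90.0, 90.0])  # under-forecast
series_b = ([50.0, 50.0], [60.0, 60.0])    # over-forecast
bias_a = forecast_bias(*series_a)
bias_b = forecast_bias(*series_b)
bias_pooled = pooled_forecast_bias([series_a, series_b])
print(bias, bias_a, bias_b, bias_pooled)
```

The pooled example is a reminder that the cut of the data matters: a bias of zero at the top level can hide offsetting biases in the individual series.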

Let’s add the error measures we saw just now to the summary table we made last time.

Again we see that there is no one ring to rule them all. There may be different choices depending on the situation, and we need to pick and choose for specific purposes.

We have already seen that it is not easy to just pick one forecast metric and use it everywhere. Each of them has its own advantages and disadvantages, and our choice should be cognizant of all of those.

That being said, there are thumb rules you can apply to help you along the process:

  1. If every timeseries is on the same scale, use MAE, RMSE, etc.
  2. If there are large changes in the timeseries (i.e. in the horizon we are measuring, there is a large shift in timeseries levels), then something like Percent Better or Relative Absolute Error can be used.
  3. When summarizing across timeseries, for metrics like Percent Better or APE, we can use Arithmetic Means (e.g. MAPE). For relative errors, it has been empirically shown that Geometric Means have better properties. But at the same time, they are also vulnerable to outliers. A few ways we can control for outliers are:
     a. Trimming the outliers, i.e. discarding them from the aggregate calculation.
     b. Using the Median for aggregation (e.g. MdAPE), which is a more extreme way of controlling for outliers.
     c. Winsorizing (replacing the outliers with the cutoff value), which is another way to deal with such huge individual errors.
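The three ways of controlling for outliers can be sketched as follows. The trimming proportion, the winsorizing cutoffs and the error values are all arbitrary choices for illustration, not recommendations from the paper.

```python
from statistics import mean, median


def trimmed_mean(values, proportion=0.1):
    """Drop the lowest and highest `proportion` of values, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return mean(kept)


def winsorized_mean(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest cutoff, then average."""
    return mean(min(max(v, lower), upper) for v in values)


# Hypothetical absolute percent errors for ten series, one of them extreme
ape = [5.0, 8.0, 10.0, 12.0, 7.0, 9.0, 11.0, 6.0, 10.0, 400.0]

plain = mean(ape)                                        # pulled up by the outlier
trimmed = trimmed_mean(ape, proportion=0.1)              # drops 5.0 and 400.0
middle = median(ape)                                     # MdAPE-style aggregation
winsorized = winsorized_mean(ape, lower=0.0, upper=50.0) # 400.0 becomes 50.0
print(plain, trimmed, middle, winsorized)
```

Trimming and the median discard the extreme case entirely, while winsorizing keeps it in the aggregate with a capped influence.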

Armstrong et al. 1992 carried out an extensive study on these forecast metrics, using the M competition to draw 5 subsamples totaling a set of 90 annual and 101 quarterly series, along with their forecasts. They then calculated the error measures on this sample and carried out a study to examine them.

The key dimensions along which they examined the different measures were reliability, construct validity, sensitivity to small changes, protection against outliers, and relationship to decision making.

Reliability

Reliability is about whether repeated application of the measure produces similar results. To measure this, they first calculated the error measures for different forecasting methods on all 5 subsamples (at the aggregate level), and ranked the methods in order of performance. They did this for 1 step ahead and 6 steps ahead, for the Annual and Quarterly series.

They then calculated Spearman’s rank-order correlation coefficients pairwise between the subsamples and averaged them. For example, they took the rankings from subsample 1 and compared them with subsample 2, then subsample 1 with subsample 3, etc., until all the pairs were covered, and then averaged the coefficients.
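A rough sketch of that procedure, assuming each subsample gives us a ranking of the same forecasting methods without ties, so the classic rank-difference form of Spearman's coefficient applies. The rankings below are invented for illustration and are not the paper's data.

```python
from itertools import combinations
from statistics import mean


def spearman_from_ranks(ranks_a, ranks_b):
    """Spearman's rho for two tie-free rankings: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("rankings must have the same length (at least 2)")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def average_pairwise_spearman(rankings):
    """Average Spearman's rho over every pair of rankings."""
    return mean(spearman_from_ranks(a, b) for a, b in combinations(rankings, 2))


# Hypothetical ranks of five forecasting methods under one error measure,
# computed separately on three subsamples
subsample_rankings = [
    [1, 2, 3, 4, 5],
    [2, 1, 3, 4, 5],
    [1, 3, 2, 5, 4],
]

reliability = average_pairwise_spearman(subsample_rankings)
print(reliability)
```

A value near 1 means the measure ranks the methods almost identically on every subsample, which is what reliability means here.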

Supply: Armstrong et al.

The rankings based on RMSE were the least reliable, with very low correlation coefficients. They state that RMSE can overcome this reliability issue only when there is a high number of time series in the mix, which might neutralize the effect.

They also found that Relative Measures like Percent Better and MdRAE have much higher reliability than their peers. They also tried to calculate the number of samples required to achieve the same statistical significance as Percent Better: 18 series for GMRAE, 19 using MdRAE, 49 using MAPE, 55 using MdAPE, and 170 using RMSE.

Construct Validity

While reliability measures consistency, construct validity asks whether a measure does, in fact, measure what it intends to measure. This shows us the extent to which the various measures assess the “accuracy” of forecasting methods. To test this, they examined the rankings of the forecast methods as before, but this time they compared the rankings between pairs of error measures. For example, how much agreement is there between a ranking based on RMSE and a ranking based on MAPE?

These correlations are influenced by both Construct Validity and Reliability. To account for the change in Reliability, the authors derived the same table using a larger number of samples and found that, as expected, the average correlations increased from 0.34 to 0.68, showing that these measures are, in fact, measuring what they are supposed to.

Supply: Armstrong et al.

As a final test of validity, they constructed a consensus ranking by averaging the rankings from each of the error measures for the full sample of 90 annual series and 101 quarterly series, and then examined the correlation of each individual error measure's ranking with the consensus ranking.
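One way to picture the consensus test, as a sketch with invented rankings: average each method's rank across the error measures, re-rank those averages to get the consensus, and correlate every measure's ranking with it. Ties in the averaged ranks are assumed away here, and the measure names are only labels for hypothetical data.

```python
from statistics import mean


def to_ranks(scores):
    """Rank scores from 1 (lowest) upward; assumes there are no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman_from_ranks(ranks_a, ranks_b):
    """Spearman's rho for two tie-free rankings."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def consensus_ranking(rankings_by_measure):
    """Average each method's rank across measures, then rank the averages."""
    rankings = list(rankings_by_measure.values())
    n_methods = len(rankings[0])
    average_ranks = [mean(r[i] for r in rankings) for i in range(n_methods)]
    return to_ranks(average_ranks)


# Hypothetical ranks of four forecasting methods under four error measures
rankings_by_measure = {
    "RMSE": [4, 1, 3, 2],
    "MAPE": [1, 2, 3, 4],
    "MdRAE": [1, 3, 2, 4],
    "GMRAE": [2, 1, 3, 4],
}

consensus = consensus_ranking(rankings_by_measure)
agreement = {
    name: spearman_from_ranks(ranks, consensus)
    for name, ranks in rankings_by_measure.items()
}
print(consensus, agreement)
```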

Supply: Armstrong et al.

RMSE had the lowest correlation with the consensus. This is probably because of its low reliability. It may also be because of RMSE’s emphasis on larger errors.

Percent Better also shows low correlation (even though it had high reliability). This is probably because Percent Better is the only measure which does not measure the magnitude of the error.


Sensitivity

It is desirable to have error measures which are sensitive to the effects of changes, especially for parameter calibration or tuning. The measure should indicate the effect on “accuracy” when a change is made in the parameters of the model.

Median error measures are not sensitive, and neither is Percent Better. Median aggregation hides the change by focusing on the middle value and will only change slowly. Percent Better is not sensitive because once a series is performing better than the reference, further improvement makes no more change in the metric. It also does not register when we improve an extremely bad forecast to a point where it is almost as accurate as a naïve forecast.

Relationship to Decision Making

The paper makes it very clear that none of the measures they evaluated is ideal for decision making. They propose RMSE as a good enough measure and frown upon percent-based errors, arguing that actual business impact occurs in dollars and not in percent errors. But I disagree with this point, because when we are objectively evaluating a forecast to convey how good or bad it is doing, RMSE just does not make the cut. If I walk up to top management and say that the financial forecast had an RMSE of 22343, that would fall flat. But if I instead say that the accuracy was 90%, everybody is happy.

Both the paper and I agree on one thing: the relative error measures are not that relevant to decision making.

Guidelines for choosing Error Measures

To help with the selection of error measures, the paper also rates the different measures on the dimensions they identified.

Supply: Armstrong et al.

For Parameter Tuning

For calibration or parameter tuning, the paper suggests using one of the measures which are rated high in sensitivity: RMSE, MAPE, and GMRAE. Because of the low reliability of RMSE and MAPE's tendency to favor low forecasts, they suggest using GMRAE (Geometric Mean Relative Absolute Error). MASE was proposed well after the release of this paper and hence does not factor in this analysis. But if you think about it, MASE is also sensitive and immune to the problems that we see for RMSE or MAPE, and can be a good candidate for calibration.

For Forecast Method Selection

To select between forecast methods, the primary criteria are reliability, construct validity, protection against outliers, and relationship to decision making. Sensitivity is not that important in this context.

The paper dismisses RMSE right away because of its low reliability and lack of protection against outliers. When the number of series is low, they suggest MdRAE, which is as reliable as GMRAE but offers additional protection from outliers. Given a moderate number of series, reliability becomes less of an issue, and in such cases MdAPE would be an appropriate choice because of its closer relationship to decision making.

Over the two blog posts, we have seen a lot of forecast measures and understood the advantages and disadvantages of each of them. And finally we arrived at a few thumb rules to go by when choosing forecast measures. Although not conclusive, I hope it gives you a direction when going about these decisions.

But all this discussion was made under the assumption that the time series we are forecasting are stable and smooth. In real-world business cases, there are also a lot of series which are intermittent or sporadic. We see long periods of zero demand before a non-zero demand. Under such circumstances, almost all of the error measures (with the possible exception of MASE) fail. In the next blog post, let’s take a look at a few different measures which are suited to intermittent demand.

Github Link for the Experiments:

  1. Shcherbakov et al. 2013, A Survey of Forecast Error Measures
  2. Armstrong et al. 1992, Error Measures for Generalizing About Forecasting Methods: Empirical Comparisons

