We would first like to thank HEPEX for giving us the opportunity to provide some background on how machine learning can be used in probabilistic hydrological forecasting. In this blog post, we start with a schematic summary of the discussion that follows. We hope you enjoy reading it!
Learning practical problems with data: A machine learning algorithm can be explicitly trained for probabilistic hydrological forecasting
Let’s consider one of our most familiar problems, the typical regression problem. To solve this problem, the algorithm needs to ‘learn’ how the mean of the response variable changes as the predictor variables change. The algorithm is hence ‘told’ to focus on the mean simply by using the least-squares error objective function or some other similarly conceived error function.
However, what should we do if, for instance, we are interested in that future value of streamflow (or, more generally, that unknown value of streamflow) at time t that will be exceeded with probability 10%? In this case, the algorithm needs to ‘learn’ how the streamflow quantile of level 0.90 changes as the predictor variables change. A loss function named the quantile score (also called “pinball loss”) can put the focus on that specific streamflow quantile, thereby allowing us to explicitly train a (machine learning) algorithm for probabilistic hydrological forecasting (Papacharalampous et al., 2019).
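To make the quantile score concrete, here is a minimal sketch in plain Python. For an observation y, a forecast q and a quantile level tau, the score is tau * (y - q) when y is at or above q, and (1 - tau) * (q - y) otherwise. The flow values, the grid of candidate forecasts and the helper names (`pinball_loss`, `best_constant_forecast`) are our own assumptions for illustration, not taken from the cited work.

```python
def pinball_loss(observed, forecasts, tau):
    """Mean quantile score (pinball loss) at quantile level tau."""
    total = 0.0
    for y, q in zip(observed, forecasts):
        diff = y - q
        # Observations above the forecast are weighted by tau,
        # observations below it by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(observed)


def best_constant_forecast(observed, tau, grid):
    """Constant forecast on the grid that minimises the pinball loss."""
    return min(grid, key=lambda c: pinball_loss(observed, [c] * len(observed), tau))


# Hypothetical streamflow sample (m^3/s), for illustration only.
flows = [2.1, 3.4, 1.8, 12.5, 4.0, 2.9, 7.3, 3.1, 5.6]
grid = [c / 10.0 for c in range(0, 151)]  # candidate forecasts 0.0, 0.1, ..., 15.0

# tau = 0.5 steers the minimiser to the sample median (3.4 here);
# tau = 0.9 steers it towards the upper tail (12.5 for this small sample).
median_like = best_constant_forecast(flows, 0.5, grid)
upper_like = best_constant_forecast(flows, 0.9, grid)
print(median_like, upper_like)
```

Changing only the value of tau moves the "target" of the same algorithm from the centre of the distribution to its upper tail, which is the sense in which the loss function "tells" the algorithm what to learn.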
Consequently, instead of using linear regression (e.g., for post-processing the outcomes of hydrological models), one has to switch to linear-in-parameters quantile regression (Waldmann, 2018). And, instead of using neural networks, one has to switch to quantile regression neural networks, while boosting with quantile loss functions is also a reasonable choice. There is indeed an entire family of quantile regression algorithms, and their application is as easy and fast as the application of typical regression algorithms.
What is the “no free lunch” theorem and is there anything we can do about it?
An excellent choice for solving millions of quantile regression problems is quantile regression forests. However, there is actually “no free lunch” in using them, as there is “no free lunch” in using any machine learning algorithm or, more generally, any model (Wolpert, 1996). This means that, among the entire pool of reasonable algorithmic choices for dealing with a specific problem type, there is no way to know in advance which one will perform best for one particular problem case. Fortunately, there are ways to deal with the “no free lunch” theorem in a meaningful sense, and these ways are called “large-scale benchmarking” and “ensemble learning”.
What is “large-scale benchmarking” and can it benefit machine learning solutions?
Let’s suppose that we have a pool of machine learning candidates to select from for performing probabilistic hydrological post-processing and forecasting. Then, for each candidate, we wish to know the probabilistic hydrological forecasting “situations” in which it is more likely to work better than the remaining candidates. Note that the various “situations” of interest could be defined by the predictive quantiles, the prediction intervals or the streamflow magnitudes that we care about, or even by all these factors and several others.
Since there is no theoretical solution to the above problem, we can only provide an empirical solution. We therefore compare the performance of all candidates over a large number and a wide range of problem cases (which should collectively represent the various types of probabilistic hydrological forecasting “situations” of interest to us). This is called “large-scale benchmarking”.
If we show empirically through large-scale benchmarking that a candidate performs better on average than the remaining ones for a sufficiently large number of cases representing a specific type of probabilistic hydrological forecasting “situations”, then we have found that it is safer to use this candidate than any of the remaining ones for this same type of “situations” in the future.
Repeating this for all types of probabilistic hydrological forecasting “situations” of interest, one can make the most of all the available candidates and benefit machine learning solutions, possibly reaching high levels of predictive performance.
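The benchmarking procedure described above can be sketched as follows. This is only a toy illustration under our own assumptions: the two candidates (`climatology` and `recent_window`) are deliberately simple stand-ins for real machine learning algorithms, the problem cases are synthetic series rather than real catchments, and the "situations" are defined only by the quantile level.

```python
import random


def pinball_loss(y, q, tau):
    """Quantile score of a single forecast q for observation y at level tau."""
    diff = y - q
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def empirical_quantile(values, tau):
    """Simple order-statistic quantile (no interpolation)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(tau * len(ordered)))
    return ordered[index]


# Hypothetical candidates: each maps (history, tau) to a quantile forecast.
def climatology(history, tau):
    return empirical_quantile(history, tau)


def recent_window(history, tau):
    return empirical_quantile(history[-10:], tau)


CANDIDATES = {"climatology": climatology, "recent_window": recent_window}


def make_case(rng, length=60):
    """Synthetic positive series standing in for one real problem case."""
    level = rng.uniform(1.0, 20.0)
    return [level * rng.lognormvariate(0.0, 0.5) for _ in range(length)]


def benchmark(cases, taus, n_test=20):
    """Mean pinball loss of each candidate at each quantile level, over all cases."""
    totals = {tau: {name: 0.0 for name in CANDIDATES} for tau in taus}
    count = 0
    for series in cases:
        # One-step-ahead forecasts for the last n_test values of each series.
        for t in range(len(series) - n_test, len(series)):
            history, observed = series[:t], series[t]
            for tau in taus:
                for name, forecast in CANDIDATES.items():
                    totals[tau][name] += pinball_loss(observed, forecast(history, tau), tau)
            count += 1
    return {tau: {name: s / count for name, s in row.items()} for tau, row in totals.items()}


rng = random.Random(42)
cases = [make_case(rng) for _ in range(200)]
scores = benchmark(cases, taus=[0.1, 0.5, 0.9])
# Candidate with the lowest average loss for each "situation" (quantile level).
best_per_level = {tau: min(row, key=row.get) for tau, row in scores.items()}
print(scores)
print(best_per_level)
```

In a real study, the cases would be many catchments and forecast settings, and the table of average scores would be broken down by every factor that defines a "situation" of interest.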
What is “ensemble learning” and can it benefit probabilistic hydrological forecasting?
In forecasting through ensemble learning, instead of one individual algorithm, an ensemble of algorithms is used (Bates and Granger, 1969). These algorithms, known as “base learners”, are trained and then used in forecast mode independently of each other. Their independent forecasts are finally combined with another learner, known as the “combiner”, which is “stacked” on top of the base learners. The final output is a single forecast.
The concept of ensemble learning has been implemented both for point forecasting (Tyralis et al., 2020), and for probabilistic forecasting (Tyralis et al., 2019). In the latter case, the term “ensemble learning” should not be confused with the term “ensemble simulation”, in which the entire ensemble of simulations constitutes the probabilistic forecast.
The simplest form of ensemble learning and “stacking” of algorithms is simple averaging, in which the combiner does not have to be trained, as it simply computes the average of the ensemble of forecasts. For instance, the forecasts of quantile regression, quantile regression forests and quantile regression neural networks for the streamflow quantile of level 0.90 (three forecasts) can be averaged to produce a new forecast (one forecast).
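As a minimal sketch of simple averaging, the snippet below combines three sets of forecasts of the 0.90 streamflow quantile into one. All numbers are hypothetical and the dictionary keys are only labels. The snippet also checks a property that follows from the convexity of the pinball loss: the loss of the averaged forecast can never exceed the average loss of the individual forecasts.

```python
def pinball_loss(y, q, tau):
    """Quantile score of a single forecast q for observation y at level tau."""
    diff = y - q
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


TAU = 0.9

# Hypothetical observations and base-learner forecasts of the 0.90 quantile (m^3/s).
observed = [30.0, 52.0, 41.0, 38.0, 60.0]
base_forecasts = {
    "quantile_regression": [41.0, 48.0, 45.0, 40.0, 55.0],
    "quantile_regression_forests": [46.5, 58.0, 43.0, 47.0, 66.0],
    "quantile_regression_neural_networks": [44.0, 50.0, 50.0, 42.0, 58.0],
}

# Simple averaging: the combiner is not trained, it just averages the
# three forecasts available at each time step.
combined = [sum(step) / len(step) for step in zip(*base_forecasts.values())]


def mean_loss(forecasts):
    """Mean pinball loss of a forecast series against the observations."""
    return sum(pinball_loss(y, q, TAU) for y, q in zip(observed, forecasts)) / len(observed)


individual_losses = {name: mean_loss(f) for name, f in base_forecasts.items()}
combined_loss = mean_loss(combined)
average_individual_loss = sum(individual_losses.values()) / len(individual_losses)

print(individual_losses)
print(combined_loss, average_individual_loss)
```

The averaged forecast is not guaranteed to beat the best base learner, but it is guaranteed to do at least as well as the base learners do on average.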
A bunch of very interesting properties and concepts are related to simple averaging. Among them are the “wisdom of the crowd” and the “forecast combination puzzle”. “Wisdom of the crowd” can be harnessed through simple averaging to increase robustness in probabilistic hydrological forecasting (Papacharalampous et al., 2020). By increasing robustness, one reduces the risk of delivering poor quality forecasts at every single forecast attempt. This can also result in increased forecast skill in the long run.
Simple averaging is hard to beat in practice for many types of predictive modelling “situations” and that leads us to the very challenging puzzle of beating this form of stacking with more sophisticated stacking (and meta-learning) methods. Alternative possibilities for combining probabilistic forecasts include Bayesian model averaging, but stacking has been theoretically proved to have some optimal properties compared to this alternative when the focus is on predictive performance (Yao et al., 2018).
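To illustrate what a trained combiner might look like, here is a deliberately simple sketch, not the method of any of the cited papers: two sets of base-learner forecasts are combined with a single weight w, chosen on a training period by minimising the pinball loss over a grid. All data and the names `combine` and `fit_weight` are hypothetical.

```python
def pinball_loss(y, q, tau):
    """Quantile score of a single forecast q for observation y at level tau."""
    diff = y - q
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def mean_loss(observed, forecasts, tau):
    """Mean pinball loss of a forecast series against the observations."""
    return sum(pinball_loss(y, q, tau) for y, q in zip(observed, forecasts)) / len(observed)


def combine(f1, f2, w):
    """Weighted combination w * f1 + (1 - w) * f2, time step by time step."""
    return [w * a + (1.0 - w) * b for a, b in zip(f1, f2)]


def fit_weight(observed, f1, f2, tau, steps=20):
    """Grid search for the weight that minimises the training pinball loss."""
    grid = [i / steps for i in range(steps + 1)]  # includes w = 0.5
    return min(grid, key=lambda w: mean_loss(observed, combine(f1, f2, w), tau))


TAU = 0.9
# Hypothetical training-period observations and 0.90-quantile forecasts (m^3/s).
train_obs = [30.0, 52.0, 41.0, 38.0, 60.0, 35.0, 47.0, 55.0]
learner_a = [41.0, 48.0, 45.0, 40.0, 55.0, 39.0, 50.0, 52.0]
learner_b = [46.5, 58.0, 43.0, 47.0, 66.0, 44.0, 53.0, 63.0]

w = fit_weight(train_obs, learner_a, learner_b, TAU)
trained_loss = mean_loss(train_obs, combine(learner_a, learner_b, w), TAU)
equal_loss = mean_loss(train_obs, combine(learner_a, learner_b, 0.5), TAU)
print(w, trained_loss, equal_loss)
```

On the training period the fitted weight cannot do worse than equal weights, because w = 0.5 is part of the grid. The puzzle discussed above is that this advantage often shrinks or disappears on new data, where the estimated weight carries its own estimation error.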