Probabilistic Nonlinear Forecast Reconciliation Methods: Projection, Importance Sampling, and Kalman Filtering

Anubhab Biswas

ISAAC, SUPSI

Lorenzo Nespoli

ISAAC, SUPSI

Prof. Giorgio Corani

IDSIA, SUPSI

Lorenzo Zambon

IDSIA, SUPSI

2025-07-08

Why Nonlinear Reconciliation?

  • Nonlinear forecast reconciliation becomes necessary when the relationships among variables are governed by non-additive constraints — for instance, ratios, maxima, or physical laws (power flow equations) in energy systems.

  • These nonlinear structures frequently arise in energy networks, transport systems, and engineering domains, where ensuring physically consistent forecasts improves operational reliability and supports better planning and control.

Fig 1: Ratio example

Forecast Reconciliation

Fig 2: Reconciliation

Simulation Study

  • We present empirical results comparing four nonlinear reconciliation methods in the following simulation study.

  • Let \(B_1\) and \(B_2\) be two independent (bottom) time series related to the single upper series \(U\) in the following way: \[ U=B_1^2+B_2^2 \]

  • Data Generation: The synthetic data is generated by sampling \(B_1\) and \(B_2\) as stationary \(AR(1)\) processes, i.e., with \(0<\phi_1,\phi_2<1\),
    \[ b_{1,t} = \phi_1b_{1,t-1}+\epsilon_{1,t}\\ b_{2,t} = \phi_2b_{2,t-1}+\epsilon_{2,t} \]

  • \(\epsilon_{1,t}\) and \(\epsilon_{2,t}\) are independent random noise terms.
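The data-generating process above can be sketched as follows; the parameter values (\(\phi_1=0.7\), \(\phi_2=0.5\), unit noise variance, series length) are illustrative assumptions, not the study's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar1(phi, n, sigma=1.0):
    """Simulate b_t = phi * b_{t-1} + eps_t with Gaussian noise."""
    b = np.zeros(n)
    for t in range(1, n):
        b[t] = phi * b[t - 1] + rng.normal(scale=sigma)
    return b

n = 1000
b1 = simulate_ar1(phi=0.7, n=n)   # bottom series B_1
b2 = simulate_ar1(phi=0.5, n=n)   # bottom series B_2
u = b1**2 + b2**2                 # upper series U = B_1^2 + B_2^2
```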

Base forecasts

  • After an \(80\)-\(20\) train-test split, we independently generate base forecasts for each series using a Random Forest regressor with an \(L_1\) loss criterion.

  • These provide deterministic forecasts, so we obtain the base forecast distribution using a residual bootstrap technique.

Fig 3: Base forecasts

The baseline: Non-Linear Bottom-up (NL-BU)

  • The non-linear bottom-up approach is the simplest and most straightforward way to reconcile a base forecast distribution, and thus serves as a natural baseline against which to compare the other methods.

  • The function \(f_u(\cdot)\) is applied directly to all the independent (bottom) level base forecast samples.

    • \(\boldsymbol{\tilde{u}}^{(i)}_{t+h}=f_u(\boldsymbol{\hat{b}}^{(i)}_{t+h})\)

    • \(\boldsymbol{\tilde{b}}^{(i)}_{t+h}=\boldsymbol{\hat{b}}^{(i)}_{t+h}\)

  • Thus, \(\boldsymbol{\tilde{y}}^{(i)}_{t+h} = [\boldsymbol{\tilde{u}}^{(i)T}_{t+h},\boldsymbol{\tilde{b}}^{(i)T}_{t+h}]^T\) are samples from the reconciled distribution \(\tilde{\pi}\).
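In code, NL-BU amounts to a single push-forward of the bottom samples (the sample count and the Gaussian stand-in for the base distribution are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
f_u = lambda b: (b**2).sum(axis=-1)           # U = B_1^2 + B_2^2

b_hat = rng.normal(size=(1000, 2))            # base samples for (B_1, B_2)
u_tilde = f_u(b_hat)                          # reconciled upper samples
b_tilde = b_hat                               # bottom samples are kept as-is
y_tilde = np.column_stack([u_tilde, b_tilde]) # coherent samples from pi-tilde
```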

Non-Linear Orthogonal Projection (NL-OP)

  • The projection method extends (Hyndman et al. 2011) to nonlinear surfaces and, following (Panagiotelis et al. 2023), reconciles each sample point in the probabilistic case.

  • The optimization problem here becomes,
    \[ \boldsymbol{\tilde{y}}_{t+h} = \arg\min_\boldsymbol{y}(\boldsymbol{y}-\boldsymbol{\hat{y}}_{t+h})^T(\boldsymbol{y}-\boldsymbol{\hat{y}}_{t+h}) \]

    subject to the constraint: \(g(\boldsymbol{y})=0\).

  • For linear reconciliation the constraint is \(C\boldsymbol{y}=0\) and a closed-form solution exists as \(\boldsymbol{\tilde{y}}_{t+h} = S (S^T S)^{-1}S^T{\boldsymbol{\hat{y}}_{t+h}}\), \(S\) being the summing matrix.

  • For nonlinear reconciliation such a closed-form solution is not available, so we adopt a Newton-Raphson approach to find the minimum as a stationary point of the Lagrangian: \({\small \mathcal{L}= (\boldsymbol{y}-\boldsymbol{\hat{y}}_{t+h})^T(\boldsymbol{y}-\boldsymbol{\hat{y}}_{t+h}) + \lambda^T g(\boldsymbol{y})}\)
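For intuition, the projection of a single sample can be sketched with a generic constrained optimizer (here SciPy's SLSQP, standing in for the Newton-Raphson iterations described above) on our example surface \(u = b_1^2 + b_2^2\); the base point is illustrative:

```python
import numpy as np
from scipy.optimize import minimize

y_hat = np.array([3.0, 1.5, -0.5])            # incoherent base point (u, b1, b2)
g = lambda y: y[0] - y[1]**2 - y[2]**2        # coherence constraint g(y) = 0

# Minimize the squared Euclidean distance to y_hat subject to g(y) = 0.
res = minimize(lambda y: np.sum((y - y_hat)**2), y_hat,
               constraints={"type": "eq", "fun": g})
y_tilde = res.x                               # reconciled (coherent) point
```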

Error reduction theorem?

  • For linear reconciliation through projections, a reduction in the score against the base forecast is guaranteed by the Pythagorean theorem.

  • Does such a result exist in the nonlinear case? Let’s have a look at our example.

Fig 4: paraboloid example

Fig 5: rmse vs side

Probabilistic projection and Energy score

We observe similar results in the probabilistic setting as well.

Fig 6: Probabilistic projection

Non-Linear Importance Sampling (NL-BUIS)

  • This importance sampling algorithm is a natural extension of the BUIS algorithm developed in (Zambon, Azzimonti, and Corani 2024) for linear reconciliation.

  • Starting with the base forecast distribution of the independent (bottom) level, \(\hat\nu_b\), as the proposal distribution and \(\tilde\nu\) as the target distribution, this method performs reconciliation via conditioning.

  • The algorithm keeps the assumption of conditional independence between the independent (bottom) and dependent (upper) levels, i.e., \(\hat{\boldsymbol{B}}_{t+h}\) and \(\hat{\boldsymbol{U}}_{t+h}\) are conditionally independent.

  • After reconciling via conditioning, samples from the reconciled distribution are obtained using the importance sampling approach.

The Algorithm

  • Sampling: Draw samples \(\{b^{(i)}\}_{i=1}^N\) from \(\hat\nu_b\) .

  • Compute weights: For each sample \(b^{(i)}\), compute the importance weight:
    \[ w^{(i)} = \frac{\hat{\pi}(f_u(b^{(i)}), b^{(i)})}{\hat{\pi}_b(b^{(i)})} = \hat{\pi}_u(f_u(b^{(i)})) \]

  • Normalization: Normalize the weights:
    \[ \tilde{w}^{(i)} = \frac{w^{(i)}}{\sum_{j=1}^N w^{(j)}} \]

  • Approximation: The reconciled distribution \(\tilde\nu\) is approximated by the weighted samples \(\{ (b^{(i)}, \tilde{w}^{(i)}) \}_{i=1}^N\).
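The four steps above can be sketched for our example; the Gaussian proposal and the upper-level base density \(\hat\pi_u\) below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Step 1 -- sampling: draw from the proposal nu_hat_b (illustrative Gaussian).
b = rng.normal(loc=[1.0, 0.5], scale=0.3, size=(5000, 2))

f_u = lambda b: (b**2).sum(axis=-1)            # upper aggregation function
pi_u = norm(loc=1.4, scale=0.4).pdf            # upper base density (assumed)

# Steps 2-3 -- importance weights and their normalisation.
w = pi_u(f_u(b))
w_tilde = w / w.sum()

# Step 4 -- the weighted samples (b, w_tilde) approximate nu-tilde;
# e.g. the reconciled bottom-level mean:
b_tilde_mean = w_tilde @ b
```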

Unscented Kalman Filter (UKF)

  • In (Corani et al. 2020), a correspondence was shown between Bayesian forecast reconciliation and the linear-Gaussian Kalman filter.

  • We propose a similar idea for the nonlinear case, where we use an unscented Kalman filter (UKF) to obtain the posterior distribution (reconciled) from the prior (base) belief.

  • Basic assumption: independent (bottom) time series are jointly Gaussian \[ p(\boldsymbol{B}_{t+h}|[\boldsymbol{b}_1,\boldsymbol{b}_2,...,\boldsymbol{b}_t]) = \mathcal{N}(\boldsymbol{\hat{b}}_{t+h},\boldsymbol{\hat{\Sigma}}_{B,h}) \] \[ \boldsymbol{\hat{U}}_{t+h}=f_u(\boldsymbol{B}_{t+h})+\boldsymbol{\varepsilon}^u_{t+h}, \] \[ \boldsymbol{\varepsilon}^u_{t+h} \sim \mathcal{N}(\boldsymbol{0},\boldsymbol{\hat{\Sigma}}_{U,h}), \]

  • To obtain: posterior reconciled distribution, \(p(\boldsymbol{B}_{t+h}|\boldsymbol{\hat{U}}_{t+h}=\boldsymbol{\hat{u}}_{t+h})\).

Unscented Transform

Posterior (reconciled) distribution

  • The posterior (reconciled) distribution of the independent (bottom) series is approximately multivariate Gaussian, with

    • Mean:\[ \boldsymbol{\tilde{b}}_{t+h} = \boldsymbol{\hat{b}}_{t+h} + K(\boldsymbol{\hat{u}}_{t+h}-\boldsymbol{\hat{u}}^-_{t+h}) \]

    • Covariance matrix: \[ \boldsymbol{\tilde{\Sigma}}_{B,h} = \boldsymbol{\hat{\Sigma}}_{B,h} - KS_uK^T \]

  • A sample-based propagation strategy is applied to the reconciled independent (bottom) samples to derive an empirical distribution for the dependent (upper) levels, since no closed-form solution is available and results can vary widely with the choice of \(f_u\).

Comparison of Reconciliation methods

  • We compare the accuracy of the reconciliation methods against the base forecast distribution as well as against the baseline reconciliation method, i.e., NL-BU.

  • The energy score, being a proper score, is used for the comparison against the base forecasts; the log score, which is proper on the coherent surface, is also used to compare the reconciliation methods against the baseline.

  • We compute the log score only for the independent (bottom) level.

  • For this simulation study, the UKF and the NL-OP showed a significant increase in accuracy against both the base and the baseline forecasts.
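For reference, the energy score \(\mathrm{ES}(F, y) = \mathbb{E}\|X-y\| - \tfrac{1}{2}\mathbb{E}\|X-X'\|\), with \(X, X' \sim F\), can be estimated directly from forecast samples; a minimal sketch (pairing disjoint halves of the draws for the second expectation is one simple estimator choice, and the two toy forecasts are illustrative):

```python
import numpy as np

def energy_score(samples, y):
    """Sample estimate of ES(F, y) from draws `samples` (n, d) of F."""
    term1 = np.linalg.norm(samples - y, axis=1).mean()
    half = len(samples) // 2
    # Pair disjoint halves of the draws as (X, X') for E||X - X'||.
    term2 = np.linalg.norm(samples[:half] - samples[half:2 * half],
                           axis=1).mean()
    return term1 - 0.5 * term2

rng = np.random.default_rng(0)
obs = np.zeros(2)
close = rng.normal(0.0, 0.1, size=(2000, 2))   # forecast centred on obs
far = rng.normal(2.0, 0.1, size=(2000, 2))     # badly located forecast
```

Lower is better: the well-located forecast receives the smaller score.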

Results

| Reconciliation method | ES | ES improvement against base (%) | % of improved points | LS |
|---|---|---|---|---|
| base (unreconciled) | 0.089598 | NA | NA | NA |
| NL-BU (baseline) | 0.090328 | -0.815 | 45.23 | -1.702547 |
| NL-BUIS | 0.089954 | -0.397 | 50.75 | -1.596634 |
| NL-OP | 0.080033 | 10.68 | 67.34 | -2.208334 |
| UKF | 0.078319 | 12.59 | 62.31 | -2.379933 |

Extensions

  • Validate the methods on real-world use cases.

  • Currently we are working with the Swiss mortality data, which serves as an excellent avenue for reconciliation with the ratio being the nonlinear function.

  • A mathematical proof of the error reduction theorem, which was found to hold for surfaces with constant curvature.

Power-flow example

  • We have applied the projection method to the Swiss-Grid data, where we tried to satisfy the power-flow equations.

  • In this case, an increase in accuracy was observed after reconciliation for small grids.

Fig 7: Results on a small grid

Geodesic distance-based scoring rule

We are studying a new scoring rule based on geodesic distance, since it appears to be a more natural choice for measuring distances between coherent points.

Fig 8: Geodesic on a paraboloid

Thank You

Newton-Raphson Approach

  • We obtain the minimum by defining the Lagrangian: \({\small \mathcal{L}= (\boldsymbol{y}-\boldsymbol{\hat{y}}_{t+h})^TW_h(\boldsymbol{y}-\boldsymbol{\hat{y}}_{t+h}) + \lambda^T g(\boldsymbol{y})}\)
    and then finding a stationary point, defined by \(\partial_{\boldsymbol{y}} \mathcal{L} = 0\) and \(\partial_{\lambda} \mathcal{L} = 0\), using Newton-Raphson iterations.

  • In vector form this can be written as: \[ G(\boldsymbol{y}, \lambda)=\begin{bmatrix} \partial_{\boldsymbol{y}} \mathcal{L}\\ \partial_{\lambda} \mathcal{L} \end{bmatrix} = \begin{bmatrix} 2W(\boldsymbol{y}-\boldsymbol{\hat{y}}_{t+h}) + J^T \lambda\\ g(\boldsymbol{y}) \end{bmatrix} = 0, \] where \(J\) is the Jacobian of \(g\).

  • With the NR update \({\small G(\boldsymbol{y}_k, \lambda_k) = -J_G \left[\begin{smallmatrix}d \boldsymbol{y}\\ d \lambda\end{smallmatrix}\right]}\), where \(J_G\) is the Jacobian of \(G\), writing out the expression for \(G\) shows that the displacement vector \([d \boldsymbol{y}^T, d \lambda^T]^T\) is found by solving \[ \begin{bmatrix} 2W(\boldsymbol{y}_k-\boldsymbol{\hat{y}}_{t+h}) + J^T \lambda_k\\ g(\boldsymbol{y}_k) \end{bmatrix} = -\begin{bmatrix} 2W & J^T \\ J & 0 \end{bmatrix}\begin{bmatrix}d \boldsymbol{y}\\ d \lambda\end{bmatrix} \]
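A direct implementation of these iterations for our example, taking \(W=I\) and the single constraint \(g(\boldsymbol{y}) = u - b_1^2 - b_2^2\) (starting point, iteration cap, and tolerance are illustrative):

```python
import numpy as np

y_hat = np.array([3.0, 1.5, -0.5])             # base forecast (u, b1, b2)
W = np.eye(3)
g = lambda y: np.array([y[0] - y[1]**2 - y[2]**2])
jac = lambda y: np.array([[1.0, -2 * y[1], -2 * y[2]]])   # J = dg/dy

y, lam = y_hat.copy(), np.zeros(1)
for _ in range(50):
    J = jac(y)
    G = np.concatenate([2 * W @ (y - y_hat) + J.T @ lam, g(y)])
    if np.abs(G).max() < 1e-10:                # stationary point reached
        break
    # Solve [[2W, J^T], [J, 0]] [dy, dlam] = -G for the NR step.
    KKT = np.block([[2 * W, J.T], [J, np.zeros((1, 1))]])
    step = np.linalg.solve(KKT, -G)
    y, lam = y + step[:3], lam + step[3:]

y_tilde = y                                    # satisfies g(y_tilde) ~ 0
```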

A closer look into the orthogonal-projection

Orthogonal projection

Sigma-point computation

  • The first step of the UKF is to construct a set of deterministically chosen sample points, called Sigma points, that exactly capture the mean and covariance of the prior Gaussian distribution of the independent (bottom) series.

  • Let \(\lambda = \alpha^2(n_b+\kappa)-n_b\), and \(\gamma = \sqrt{n_b+\lambda}\), where \(\alpha\) controls the dispersion of the Sigma points.

  • The Sigma points \(\chi_i\) are then computed in the following way:
    \[ \chi_0 = \boldsymbol{\hat{b}}_{t+h}, \qquad \chi_i = \boldsymbol{\hat{b}}_{t+h} + \gamma[\sqrt{\boldsymbol{\hat{\Sigma}}_{B,h}}]_i, \qquad \chi_{n_b+i} = \boldsymbol{\hat{b}}_{t+h} - \gamma[\sqrt{\boldsymbol{\hat{\Sigma}}_{B,h}}]_i, \qquad i=1,\dots,n_b, \]

    where \(\sqrt M\) is the Cholesky factor of the matrix \(M\) and \([\cdot]_i\) denotes its \(i\)-th column.
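A sketch of the Sigma-point construction (the prior moments and the choices \(\alpha=1\), \(\kappa=0\) are illustrative):

```python
import numpy as np

b_hat = np.array([1.0, 0.5])                    # prior mean of (B_1, B_2)
Sigma_b = np.diag([0.09, 0.04])                 # prior covariance

n_b = len(b_hat)
alpha, kappa = 1.0, 0.0
lam = alpha**2 * (n_b + kappa) - n_b
gamma = np.sqrt(n_b + lam)

L = np.linalg.cholesky(Sigma_b)                 # sqrt(Sigma) via Cholesky
# Rows of L.T are the columns [sqrt(Sigma)]_i of the Cholesky factor.
chi = np.vstack([b_hat,
                 b_hat + gamma * L.T,
                 b_hat - gamma * L.T])          # (2 n_b + 1) Sigma points
```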

Weight computation

  • The weights to update the mean and covariance of the measurement model are computed as
    \[ W^0_m = \frac{\lambda}{n_b + \lambda}, \qquad W^0_c = \frac{\lambda}{n_b+\lambda}+(1-\alpha^2+\beta), \qquad W^i_m = W^i_c = \frac{1}{2(n_b+\lambda)}, \qquad i = 1,\dots,2n_b \]

Obtaining Kalman Gain

  • We move on to update the measurement mean, covariance and cross-covariance. Denote \(\boldsymbol{z}_i = f_u(\chi_i)\),

    \[ \boldsymbol{\hat{u}}^-_{t+h} = \sum^{2n_b}_{i=0}W^i_m \boldsymbol{z}_i\\ S_u = \boldsymbol{\hat{\Sigma}}_{U,h} + \sum^{2n_b}_{i=0} W^i_c(\boldsymbol{z}_i-\boldsymbol{\hat{u}}^-_{t+h})(\boldsymbol{z}_i-\boldsymbol{\hat{u}}^-_{t+h})^T \\ P_{b,u} = \sum^{2n_b}_{i=0}W^i_c(\chi_i-\boldsymbol{\hat{b}}_{t+h})(\boldsymbol{z}_i-\boldsymbol{\hat{u}}^-_{t+h})^T \]

  • Thus, the Kalman gain is, \(K = P_{b,u}S^{-1}_u\) .
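Putting the pieces together for our example \(f_u(\boldsymbol{b})=b_1^2+b_2^2\), from Sigma points through the Kalman gain to the reconciled moments (all numerical values are illustrative):

```python
import numpy as np

b_hat = np.array([1.0, 0.5])                   # prior bottom mean
Sigma_b = np.diag([0.09, 0.04])                # prior bottom covariance
Sigma_u = np.array([[0.05]])                   # upper measurement noise
u_obs = np.array([1.4])                        # upper base forecast u_hat

n_b, alpha, beta, kappa = 2, 1.0, 2.0, 0.0
lam = alpha**2 * (n_b + kappa) - n_b
gamma = np.sqrt(n_b + lam)

# Sigma points and unscented-transform weights.
L = np.linalg.cholesky(Sigma_b)
chi = np.vstack([b_hat, b_hat + gamma * L.T, b_hat - gamma * L.T])
W_m = np.full(2 * n_b + 1, 0.5 / (n_b + lam))
W_c = W_m.copy()
W_m[0] = lam / (n_b + lam)
W_c[0] = lam / (n_b + lam) + (1 - alpha**2 + beta)

# Measurement update: propagate Sigma points through f_u.
z = (chi**2).sum(axis=1, keepdims=True)        # z_i = f_u(chi_i)
u_pred = W_m @ z                               # predicted upper mean
dz, db = z - u_pred, chi - b_hat
S_u = Sigma_u + (W_c[:, None] * dz).T @ dz     # innovation covariance
P_bu = (W_c[:, None] * db).T @ dz              # cross-covariance
K = P_bu @ np.linalg.inv(S_u)                  # Kalman gain

b_tilde = b_hat + K @ (u_obs - u_pred)         # reconciled bottom mean
Sigma_b_tilde = Sigma_b - K @ S_u @ K.T        # reconciled bottom covariance
```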


References

Corani, Giorgio, Dario Azzimonti, João P. S. C. Augusto, and Marco Zaffalon. 2020. “Probabilistic Reconciliation of Hierarchical Forecast via Bayes’ Rule.” In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 12459:211–26. Lecture Notes in Computer Science. Springer. https://doi.org/10.1007/978-3-030-67664-3_13.
Hyndman, Rob J., Roman A. Ahmed, George Athanasopoulos, and Han Lin Shang. 2011. “Optimal Combination Forecasts for Hierarchical Time Series.” Computational Statistics & Data Analysis 55 (7): 2579–89. https://doi.org/10.1016/j.csda.2010.03.006.
Panagiotelis, Anastasios N., Puwasala Gamakumara, George Athanasopoulos, and Rob J. Hyndman. 2023. “Probabilistic Forecast Reconciliation: Properties, Evaluation and Score Optimisation.” European Journal of Operational Research 308 (1): 180–96. https://doi.org/10.1016/j.ejor.2022.11.015.
Zambon, Lorenzo, Dario Azzimonti, and Giorgio Corani. 2024. “Efficient Probabilistic Reconciliation of Forecasts for Real-Valued and Count Time Series.” Statistics and Computing 34 (1): 21. https://doi.org/10.1007/s11222-023-10343-y.