MSGNet: Learning Multi-Scale Inter-Series Correlations for Multivariate Time Series Forecasting

Wanlin Cai, Yuxuan Liang, Xianggen Liu, Jianshuai Feng, Yuankai Wu

Anubhab Biswas

Università della Svizzera italiana (USI)

2025-07-10

Existing Research Gap

  • Intra-series correlation modelling: In recent literature, deep learning models such as RNNs, TCNs, and Transformers have proven remarkably effective at improving forecast accuracy for individual time series by capturing dependencies in the temporal domain.

  • Inter-series correlation modelling: On the other hand, GNNs provide a promising way to learn interdependencies among multiple time series.

  • However, most of these models fail to accurately model inter-series correlations that vary across different time scales.

Existing Research Gap (contd.)

  • Many deep learning models struggle to capture complex, time-varying inter-series dependencies, limiting their forecasting accuracy.

  • Approaches using fixed graph structures (e.g., static GNNs) assume uniform relationships, which fails in dynamic real-world settings.

  • Even models with dynamic graphs often ignore that inter-series correlations may be tied to stable time scales, such as economic or environmental cycles.

Main Contribution

  • Introduces a novel deep learning framework designed to enhance multivariate time series forecasting by capturing inter-series correlations across multiple time scales.

  • Multi-Scale Correlation Modeling: Captures inter-series correlations at multiple time scales via frequency-domain decomposition (FFT).

  • Explainability and Generalization: By learning scale-specific inter-series correlations, MSGNet provides interpretable insights into how different time series influence each other over varying time scales. Additionally, it generalizes well to unseen data.

  • Empirical Performance: Extensive experiments on real-world datasets, including ETT, Exchange, and Electricity, showcase MSGNet’s superior forecasting accuracy compared to existing models.

Problem Formulation

  • Consider the input data \(\boldsymbol{X}_{t-L:t} \in \mathbb{R}^{N\times L}\) where:

    • \(L\): the considered lag, i.e. the look-back window of past observations

    • \(N\): the number of time series variables

  • The objective is to predict the value of the \(N\) variables at the next \(T\) future time points.

  • The predicted values are represented by \(\hat{\boldsymbol{X}}_{t:t+T}\in \mathbb{R}^{N\times T}\).
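As a minimal shape sketch (illustrative values, not from the paper), the task maps an observed window of \(N\) series over \(L\) past steps to forecasts for the next \(T\) steps:

```python
import torch

# Illustrative shapes for the forecasting task: N series, look-back L, horizon T.
N, L, T = 7, 96, 96
x_past = torch.randn(N, L)          # observed window X_{t-L:t}
x_pred = torch.randn(N, T)          # a forecast \hat{X}_{t:t+T} has this shape
print(x_past.shape, x_pred.shape)   # torch.Size([7, 96]) torch.Size([7, 96])
```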

Problem Formulation (contd.)

  • Correlations among \(N\) time series can vary across time scales and are modeled using graph structures, where each node represents a series and edges represent learned dependencies.

  • For each time scale \(s_i\) (with \(s_i < L\)), a graph \(\mathcal{G}_i=(\mathcal{V}_i,\mathcal{E}_i)\) is constructed from a data segment \(\boldsymbol{X}_{p-s_i:p}\), capturing the relationships over that interval.

  • Considering \(k\) such time scales \(\{s_1,...,s_k\}\), we obtain a set of scale-specific adjacency matrices \(\boldsymbol{A}^1,...,\boldsymbol{A}^k\), where \(\boldsymbol{A}^i\in \mathbb{R}^{N\times N}\) for \(i=1,\dots,k\), encoding inter-series correlations at different time resolutions.

ScaleGraph Block: Key Steps

  • Scale Identification: Extracts meaningful temporal scales from the input time series to guide multi-scale analysis.

  • Inter-Series Correlation Learning: Applies adaptive graph convolution to uncover how different series influence each other at each scale.

  • Intra-Series Pattern Extraction: Utilizes multi-head self-attention to capture temporal dependencies within each individual time series.

  • Cross-Scale Representation Fusion: Aggregates the learned features from all scales using a SoftMax attention mechanism, emphasizing the most informative scales.

Model Architecture

Input Embedding

  • Embed the \(N\) variables at each time step into a vector of size \(d_{model}\): \(\boldsymbol{X}_{t-L:t}\rightarrow\boldsymbol{X}_{emb}\), where \(\boldsymbol{X}_{emb}\in\mathbb{R}^{d_{model}\times L}\), using a combination of normalization, convolution, and embeddings.

  • The input is first normalized to \(\hat{\mathbf{X}}_{t-L:t}\) to improve stationarity, based on recent empirical findings.

  • A 1-D convolutional layer projects \(\hat{\mathbf{X}}_{t-L:t}\) into a higher-dimensional embedding \(\mathbf{X}_{\text{emb}} \in \mathbb{R}^{d_{\text{model}} \times L}\), scaled by a learnable factor \(\alpha\).

  • Positional encodings (PE) and learnable global time encodings (SE) are added to incorporate temporal context: \[ \boldsymbol{X}_{emb}=\alpha\, Conv1D(\hat{\boldsymbol{X}}_{t-L:t})+PE+\sum_{p=1}^{P}SE_p \]
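A minimal PyTorch sketch of this embedding step, assuming a kernel size of 3 and standard sinusoidal positional encodings (these details, and all names, are illustrative rather than the authors' implementation):

```python
import math
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Sketch of the embedding step: a 1-D convolution lifts N series to d_model
    channels, scaled by a learnable alpha, plus sinusoidal positional encodings (PE)
    and P learnable global time encodings (SE)."""
    def __init__(self, n_series: int, d_model: int, seq_len: int, P: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(n_series, d_model, kernel_size=3, padding=1)
        self.alpha = nn.Parameter(torch.ones(1))
        self.se = nn.Parameter(torch.zeros(P, d_model, seq_len))   # learnable encodings SE_p
        pe = torch.zeros(d_model, seq_len)                         # fixed sinusoidal PE
        pos = torch.arange(seq_len, dtype=torch.float)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[0::2, :] = torch.sin(div[:, None] * pos[None, :])
        pe[1::2, :] = torch.cos(div[:, None] * pos[None, :])
        self.register_buffer("pe", pe)

    def forward(self, x_norm):                  # x_norm: (batch, N, L), already normalized
        # X_emb = alpha * Conv1D(X_hat) + PE + sum_p SE_p
        return self.alpha * self.conv(x_norm) + self.pe + self.se.sum(dim=0)

# usage: emb = InputEmbedding(n_series=7, d_model=64, seq_len=96)
#        x_emb = emb(torch.randn(32, 7, 96))   # -> (32, 64, 96)
```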

Residual connection

  • This embedding serves as the input to the MSGNet, which adopts a residual architecture: each layer refines the input via a ScaleGraphBlock and a skip connection.

  • The input to layer \(l\) is represented as:\[ \boldsymbol{X}^l=ScaleGraphBlock(\boldsymbol{X}^{l-1})+\boldsymbol{X}^{l-1} \]

  • The ScaleGraphBlock encapsulates the core computation of MSGNet at each layer, combining graph structure with residual learning.
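A short sketch of the residual stacking, assuming a `ScaleGraphBlock` module (defined elsewhere) that maps a \((batch, d_{model}, L)\) tensor to a tensor of the same shape:

```python
import torch.nn as nn

class MSGNetBackbone(nn.Module):
    """Residual stacking sketch: each layer refines its input through a ScaleGraphBlock
    and adds a skip connection (X^l = ScaleGraphBlock(X^{l-1}) + X^{l-1})."""
    def __init__(self, scale_graph_blocks):
        super().__init__()
        self.blocks = nn.ModuleList(scale_graph_blocks)

    def forward(self, x_emb):           # x_emb: (batch, d_model, L)
        x = x_emb
        for block in self.blocks:
            x = block(x) + x            # ScaleGraphBlock output plus skip connection
        return x
```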

Scale Identification

  • The choice of scale is crucial for this approach, which aims to improve forecast accuracy by exploiting inter-series correlations at different time scales; periodicity is chosen as the source of scale.

  • To detect prominent periodicity as the time scale, a Fast Fourier Transform (FFT) is employed: \[ \mathbf{F} = \text{Avg} \left( \text{Amp} \left( \text{FFT}(\mathbf{X}_{\text{emb}}) \right) \right),\\ f_1, \cdots, f_k = \underset{f_* \in \left\{1, \cdots, \frac{L}{2} \right\}}{\arg\text{Topk}}(\mathbf{F}),\quad s_i = \frac{L}{f_i} \]

  • \(FFT(\cdot)\) and \(Amp(\cdot)\) denote the FFT and the calculation of amplitude values, respectively.
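A small sketch of this scale-identification step; zeroing the DC component before the top-k selection is an assumption of this sketch (a common practice), not something stated above:

```python
import torch

def identify_scales(x_emb: torch.Tensor, k: int):
    """x_emb: (batch, d_model, L). Returns the top-k frequencies f_i, the implied
    scales s_i = L // f_i, and the averaged amplitudes used later for aggregation."""
    L = x_emb.shape[-1]
    amp = torch.abs(torch.fft.rfft(x_emb, dim=-1))   # amplitude spectrum
    strength = amp.mean(dim=(0, 1))                  # average over batch and channels
    strength[0] = 0.0                                # ignore the DC component (assumption)
    top_amps, top_freqs = torch.topk(strength, k)    # dominant frequencies f_1..f_k
    scales = L // top_freqs                          # periods s_i = L / f_i
    return top_freqs, scales, top_amps
```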

Scale Identification (contd.)

  • The model detects evolving temporal scales by identifying periodicities in input sequences and assumes these correlations are stable over time.

  • For each selected scale \(s_i\), the input is zero-padded and reshaped into a 3D tensor \(\mathcal{X}^i \in \mathbb{R}^{d_{\text{model}} \times s_i \times f_i}\), enabling multi-scale representations:\[ \mathcal{X}^i = Reshape_{s_i,f_i}(Padding(\boldsymbol{X}_{in})), \quad i \in \{1, \dots, k\} \]

  • These reshaped tensors serve as input to the ScaleGraph block, allowing the model to capture both inter-series and intra-series dependencies at multiple temporal resolutions.
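A sketch of the padding-and-reshape step for a single detected scale; here the number of periods after padding plays the role of \(f_i\), and the exact axis order is an assumption of this sketch:

```python
import torch
import torch.nn.functional as F

def reshape_to_scale(x_in: torch.Tensor, s_i: int):
    """x_in: (batch, d_model, L). Zero-pads the time axis to a multiple of the period
    s_i, then folds it into (batch, d_model, n_periods, s_i)."""
    B, d_model, L = x_in.shape
    n_periods = (L + s_i - 1) // s_i          # ceil(L / s_i), plays the role of f_i
    pad = n_periods * s_i - L
    x_pad = F.pad(x_in, (0, pad))             # zero-pad on the right of the time axis
    return x_pad.reshape(B, d_model, n_periods, s_i)
```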

Multi-scale Adaptive Graph Convolution

  • The model introduces a multi-scale adaptive graph convolution that starts by projecting the reshaped multi-scale inputs \(\mathcal{X}^i\) back into the original series space using a learnable matrix \(\mathbf{W}^i\).

  • This linear transformation yields \(\mathcal{H}^i = \mathbf{W}^i \mathcal{X}^i\), where \(\mathcal{H}^i \in \mathbb{R}^{N \times s_i \times f_i}\), restoring the variable dimension \(N\).

  • Despite concerns about losing inter-series correlations due to linear mapping, experiments show that the proposed method successfully preserves these dependencies through the graph convolution process.
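A tiny sketch of the projection back to the series space, using a linear map over the channel axis to play the role of \(\mathbf{W}^i\) (dimensions are illustrative):

```python
import torch
import torch.nn as nn

d_model, N = 64, 7
proj = nn.Linear(d_model, N, bias=False)        # plays the role of W^i

x_scale = torch.randn(32, d_model, 4, 24)       # reshaped tensor: (batch, d_model, f_i, s_i)
h_scale = proj(x_scale.permute(0, 2, 3, 1))     # apply W^i over the channel axis
h_scale = h_scale.permute(0, 3, 1, 2)           # H^i: (batch, N, f_i, s_i)
print(h_scale.shape)                            # torch.Size([32, 7, 4, 24])
```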

Multi-scale Adaptive Graph Convolution (contd.)

  • At each time scale \(i\), an adaptive adjacency matrix is computed using two trainable embeddings \(\mathbf{E}_1^i, \mathbf{E}_2^i \in \mathbb{R}^{N \times h}\):\[ \boldsymbol{A}^i=SoftMax(ReLU(\boldsymbol{E}_1^i(\boldsymbol{E}_2^i)^{\top})) \]

  • The SoftMax ensures normalized inter-series weights, enabling meaningful graph structures that capture inter-variable dependencies.

  • A MixHop graph convolution is applied using powers of the adjacency matrix \(\mathbf{A}^i\) to extract multi-hop interactions:\[ \mathcal{H}_{out}^i=\sigma\big(\Vert_{j\in \mathcal{P}}(\boldsymbol{A}^i)^j\mathcal{H}^i\big) \] where \(\mathcal{P}\) is the set of hop powers, \(\Vert\) denotes concatenation, and \(\sigma\) is an activation function.

  • The output \(\mathcal{H}^i_{\text{out}}\) is then passed through a multi-layer perceptron (MLP) to yield the scale-specific transformed tensor \(\hat{\mathcal{X}}^i \in \mathbb{R}^{d_{\text{model}} \times s_i \times f_i}\).
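A sketch of the adaptive adjacency plus MixHop-style propagation at one scale; the hop set, embedding size \(h\), and the final MLP and activation are assumptions for illustration, not the authors' exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMixHopConv(nn.Module):
    """Learned adjacency A^i = SoftMax(ReLU(E1 E2^T)) and concatenation of several
    powers of A^i applied to H^i, followed by an MLP back to d_model channels."""
    def __init__(self, n_nodes: int, h: int = 10, hops=(0, 1, 2), d_model: int = 64):
        super().__init__()
        self.e1 = nn.Parameter(torch.randn(n_nodes, h))      # E_1^i
        self.e2 = nn.Parameter(torch.randn(n_nodes, h))      # E_2^i
        self.hops = hops                                     # the hop-power set P
        self.mlp = nn.Linear(n_nodes * len(hops), d_model)   # maps back to d_model

    def adjacency(self):
        return F.softmax(F.relu(self.e1 @ self.e2.t()), dim=-1)   # (N, N)

    def forward(self, h_in):                     # h_in: (batch, N, f_i, s_i)
        A = self.adjacency()
        hops_out = []
        for j in self.hops:
            Aj = torch.matrix_power(A, j)        # j-hop propagation (A^0 = identity)
            hops_out.append(torch.einsum("mn,bnfs->bmfs", Aj, h_in))
        h_cat = torch.cat(hops_out, dim=1)       # concatenate along the node axis
        out = self.mlp(h_cat.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return torch.tanh(out)                   # (batch, d_model, f_i, s_i)
```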

Multi-head Attention

  • To capture intra-series dependencies, Multi-head Attention (MHA) is applied on the scale-transformed tensor \(\hat{\mathcal{X}}^i\) along the time-scale dimension:\[ \hat{\mathcal{X}}^i_{\text{out}} = \text{MHA}_s(\hat{\mathcal{X}}^i) \]

  • The input tensor \(\hat{\mathcal{X}}^i \in \mathbb{R}^{d_{\text{model}} \times s_i \times f_i}\) is reshaped into a batch of size \(B f_i\) to enable standard MHA processing along the scale axis.

  • Potential MHA limitations in modeling long sequences are mitigated by the scale transformation, which compresses long spans into periodic patterns, preserving performance even for large input lengths.
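A sketch of the intra-series attention step, folding the period axis into the batch so a standard multi-head attention layer runs along the within-period (scale) axis (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

def intra_series_attention(x_hat: torch.Tensor, mha: nn.MultiheadAttention):
    """x_hat: (batch, d_model, f_i, s_i). Folds into B*f_i sequences of length s_i,
    applies self-attention, then restores the original layout."""
    B, d_model, f_i, s_i = x_hat.shape
    seq = x_hat.permute(0, 2, 3, 1).reshape(B * f_i, s_i, d_model)
    out, _ = mha(seq, seq, seq)                                   # attention over the scale axis
    return out.reshape(B, f_i, s_i, d_model).permute(0, 3, 1, 2)  # back to (B, d_model, f_i, s_i)

# usage sketch
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
y = intra_series_attention(torch.randn(32, 64, 4, 24), mha)       # -> (32, 64, 4, 24)
```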

Scale Aggregation

  • The model aggregates \(k\) different scale outputs \(\hat{\mathcal{X}}^1_{\text{out}}, \dots, \hat{\mathcal{X}}^k_{\text{out}}\) by first reshaping them into matrices \(\hat{\mathbf{X}}^i_{\text{out}} \in \mathbb{R}^{d_{\text{model}} \times L}\) and then weighting them by their corresponding amplitudes.
  • The amplitudes \(\mathbf{F}_{f_1}, \dots, \mathbf{F}_{f_k}\) are computed via FFT and passed through a SoftMax function to obtain weights \(\hat{a}_1, \dots, \hat{a}_k\):\[ \hat{a}_1, \dots, \hat{a}_k=SoftMax(\mathbf{F}_{f_1}, \dots, \mathbf{F}_{f_k}) \]
  • The final output is a weighted sum of all scale outputs, following a Mixture of Experts (MoE) strategy: \[ \hat{\mathbf{X}}_{\text{out}}=\sum_{i=1}^k\hat{a}_i \hat{\mathbf{X}}^i_{\text{out}} \]
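A short sketch of this SoftMax-weighted aggregation, assuming the per-scale outputs have already been reshaped back to \((batch, d_{model}, L)\):

```python
import torch

def aggregate_scales(scale_outputs, amplitudes):
    """scale_outputs: list of k tensors of shape (batch, d_model, L);
    amplitudes: tensor of the k FFT amplitudes F_{f_1}, ..., F_{f_k}."""
    weights = torch.softmax(amplitudes, dim=0)               # \hat{a}_1, ..., \hat{a}_k
    stacked = torch.stack(scale_outputs, dim=0)              # (k, batch, d_model, L)
    return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # weighted sum over scales
```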

Output Layer

  • The final output \(\hat{\mathbf{X}}_{\text{out}} \in \mathbb{R}^{d_{\text{model}} \times L}\) is projected into the forecast space \(\hat{\mathbf{X}}_{t:t+T} \in \mathbb{R}^{N \times T}\) using linear mappings along both variable and time dimensions.

  • This is done using:\[ \hat{\boldsymbol{X}}_{t:t+T} = \boldsymbol{W}_s\hat{\boldsymbol{X}}_{out}\boldsymbol{W}_t + \mathbf{b} \] where \(\mathbf{W}_s \in \mathbb{R}^{N \times d_{\text{model}}}\), \(\mathbf{W}_t \in \mathbb{R}^{L \times T}\), and \(\mathbf{b} \in \mathbb{R}^{T}\) are learnable.

  • \(\boldsymbol{W}_s\) maps to output variables and \(\mathbf{W}_t\) handles temporal projection, yielding the final forecast across all \(N\) series over a horizon of length \(T\).
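A minimal sketch of the output projection \(\hat{\boldsymbol{X}}_{t:t+T} = \boldsymbol{W}_s\hat{\boldsymbol{X}}_{out}\boldsymbol{W}_t + \mathbf{b}\) (the initialization scheme is an arbitrary choice for illustration):

```python
import torch
import torch.nn as nn

class OutputProjection(nn.Module):
    """Maps (batch, d_model, L) to forecasts (batch, N, T) with one linear map over
    the channel axis (W_s) and one over the time axis (W_t), plus a bias b."""
    def __init__(self, d_model: int, n_series: int, seq_len: int, horizon: int):
        super().__init__()
        self.W_s = nn.Parameter(torch.randn(n_series, d_model) * 0.02)
        self.W_t = nn.Parameter(torch.randn(seq_len, horizon) * 0.02)
        self.b = nn.Parameter(torch.zeros(horizon))

    def forward(self, x_out):                    # x_out: (batch, d_model, L)
        y = torch.einsum("nd,bdl,lt->bnt", self.W_s, x_out, self.W_t)
        return y + self.b                        # bias broadcast over the horizon
```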

Experiments

  • Eight datasets were used to evaluate MSGNet: Flight, Weather, ETT (h1, h2, m1, m2), Exchange-Rate, and Electricity.

  • Six baseline methods were chosen for comparison: Informer and Autoformer (Transformer-based), MTGNN (graph convolution), DLinear and NLinear (linear models), and TimesNet (period-decomposition based; the current state of the art).

  • The training loss was mean squared error (MSE), and the look-back window was fixed at \(L=96\) for all models.

  • Prediction horizons were \(T\in\{96,192,336,720\}\), the initial learning rate was \(0.0001\), the batch size was \(32\), and the number of epochs was \(10\).

  • The train/validation/test split was \((0.7,0.1,0.2)\) or \((0.6,0.2,0.2)\), depending on the dataset.
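The training setup above, collected into a single configuration sketch (key names are illustrative; values are as stated in the slides):

```python
# Experiment configuration as described above; key names are illustrative.
config = {
    "lookback_L": 96,
    "horizons_T": [96, 192, 336, 720],
    "loss": "MSE",
    "learning_rate": 1e-4,
    "batch_size": 32,
    "epochs": 10,
    "splits": [(0.7, 0.1, 0.2), (0.6, 0.2, 0.2)],   # depending on the dataset
}
```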

Results

Analysis

  • MSGNet achieved the best MSE performance on 5 out of 8 datasets (Flight and ETT h1, h2, m1, m2) and ranked second on two others (Weather and Electricity), demonstrating robust accuracy across prediction lengths.
  • On the challenging Flight dataset, which is heavily affected by out-of-distribution (OOD) samples due to COVID, MSGNet reduced MSE and MAE by 21.5% and 13.7%, respectively, outperforming the current SOTA, TimesNet.
  • Although TimesNet also uses multi-scale inputs, its purely vision-based approach is less effective for temporal modeling than MSGNet's hybrid strategy.
  • MSGNet also surpassed models such as MTGNN and Autoformer in average rank, highlighting its strong generalization across diverse datasets.

Accuracy on Flight data

  • Figure 3 illustrates that MSGNet closely tracks the ground truth, while other models exhibit significant errors during key temporal shifts and fluctuations in flight data.

  • These gaps in competing models are attributed to architectural limitations, which hinder their ability to capture multi-scale patterns, abrupt changes, and complex inter/intra-series dependencies.

Learned Inter-series Correlation

  • Figure 4 shows that MSGNet learns distinct adaptive adjacency matrices across time scales (24h, 6h, 4h), effectively modeling inter-airport interactions in Flight data.

  • Long-range effects (e.g., Airport 6 influencing distant airports like 0, 1, and 3) are captured at longer scales (24h), while short-range dependencies (e.g., among Airports 0, 3, and 5) emerge more strongly at shorter scales.

  • These learned patterns reflect real-world behavior, suggesting that spatial and temporal dependencies vary by time scale and physical proximity.

Figure 4: learned adaptive adjacency matrices at the 24h, 6h, and 4h time scales on the Flight data.

Ablation Analysis

Generalization

  • An ablation study using a 4:4:2 partition (pre-epidemic training, post-epidemic validation/testing) showed that MSGNet maintained top performance and minimal degradation, demonstrating strong robustness to external disruptions.

  • This resilience is likely due to MSGNet's ability to capture multi-scale inter-series correlations, some of which remain effective under out-of-distribution (OOD) scenarios, similar to TimesNet, which also leverages multi-scale cues but ranks second.

Conclusion

  • This paper presents MSGNet, a novel framework that addresses key limitations of existing deep learning models in time series forecasting.

  • MSGNet exploits periodicity as a time scale source to effectively capture inter-series correlations across multiple temporal resolutions.

  • Extensive experiments on real-world datasets show that MSGNet consistently outperforms state-of-the-art models in forecasting accuracy.

  • The results highlight the critical role of multi-scale inter-series correlation modeling for accurate and robust time series analysis.

Thank You

Reference

Cai, W., Liang, Y., Liu, X., Feng, J., & Wu, Y. (2024). MSGNet: Learning Multi-Scale Inter-series Correlations for Multivariate Time Series Forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, 38(10), 11141-11149.