Stage-Wise and Prior-Aware Neural Speech Phase Prediction

Abstract

This paper proposes a Stage-wise and Prior-aware Neural Speech Phase Prediction (SP-NSPP) model, which predicts the phase spectrum from an input amplitude spectrum using a two-stage neural network. In the initial prior-construction stage, the model predicts a rough prior phase spectrum from the amplitude spectrum. The subsequent refinement stage transforms the amplitude spectrum into a refined, high-quality phase spectrum conditioned on the prior phase. The networks in both stages use ConvNeXt v2 blocks as the backbone and are trained adversarially with a newly introduced phase spectrum discriminator (PSD). To further improve the continuity of the refined phase, a time-frequency integrated difference (TFID) loss is also incorporated in the refinement stage. Experimental results confirm that, compared with neural network-based phase prediction methods that use no prior, SP-NSPP achieves higher phase prediction accuracy, owing to the coarse phase prior and the diverse training criteria. Compared with iterative phase estimation algorithms, SP-NSPP does not require multiple rounds of iteration, resulting in higher generation efficiency.
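To make the role of the phase spectrum concrete, the following minimal sketch splits a single frame into amplitude and phase, then resynthesizes it twice: once with the true phase and once with an all-zero phase. It is an illustration of the general amplitude/phase decomposition only, not the SP-NSPP model or its training losses. The function names (`analyze`, `synthesize`, `max_abs_error`), the 32-sample test frame, and the use of a plain DFT instead of a windowed short-time transform are all assumptions made for this example.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT; returns the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def analyze(frame):
    """Split a frame into its amplitude spectrum and (wrapped) phase spectrum."""
    spectrum = dft(frame)
    amplitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]  # values lie in [-pi, pi]
    return amplitude, phase


def synthesize(amplitude, phase):
    """Recombine amplitude and phase into a complex spectrum and invert it."""
    spectrum = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    return idft(spectrum)


def max_abs_error(reference, estimate):
    """Largest absolute sample-wise difference between two frames."""
    return max(abs(r - e) for r, e in zip(reference, estimate))


# Hypothetical test frame: two sinusoids, one with a non-zero phase offset.
frame = [
    math.sin(2 * math.pi * 3 * t / 32 + 0.7) + 0.5 * math.cos(2 * math.pi * 5 * t / 32)
    for t in range(32)
]

amplitude, phase = analyze(frame)

# With the true phase, the frame is recovered up to rounding error.
err_true_phase = max_abs_error(frame, synthesize(amplitude, phase))

# With the phase discarded (all zeros), the same amplitudes give a different waveform.
err_zero_phase = max_abs_error(frame, synthesize(amplitude, [0.0] * len(amplitude)))

print(f"error with true phase: {err_true_phase:.2e}")
print(f"error with zero phase: {err_zero_phase:.2e}")
```

The gap between the two errors is the reason a phase estimate is needed whenever only the amplitude spectrum is available, which is the setting the methods compared on this page address.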


I. Analysis-synthesis task: comparison of different methods at a sampling rate of 16 kHz


Sample 1


Natural GLA RAAR vMDNN NSPP SP-NSPP

Sample 2


Natural GLA RAAR vMDNN NSPP SP-NSPP

Sample 3


Natural GLA RAAR vMDNN NSPP SP-NSPP

Sample 4


Natural GLA RAAR vMDNN NSPP SP-NSPP

Sample 5


Natural GLA RAAR vMDNN NSPP SP-NSPP

Sample 6


Natural GLA RAAR vMDNN NSPP SP-NSPP

Sample 7


Natural GLA RAAR vMDNN NSPP SP-NSPP

II. Prediction-synthesis task: comparison of different methods at a sampling rate of 16 kHz


Sample 1


Natural GLA RAAR vMDNN NSPP SP-NSPP

Sample 2


Natural GLA RAAR vMDNN NSPP SP-NSPP

Sample 3


Natural GLA RAAR vMDNN NSPP SP-NSPP

Sample 4


Natural GLA RAAR vMDNN NSPP SP-NSPP

Sample 5


Natural GLA RAAR vMDNN NSPP SP-NSPP

Sample 6


Natural GLA RAAR vMDNN NSPP SP-NSPP

Sample 7


Natural GLA RAAR vMDNN NSPP SP-NSPP

III. Analysis-synthesis task: SP-NSPP and its ablated variants at a sampling rate of 16 kHz


Sample 1


SP-NSPP SP-NSPP w/o RS SP-NSPP w/o PSD SP-NSPP w/o TFID

Sample 2


SP-NSPP SP-NSPP w/o RS SP-NSPP w/o PSD SP-NSPP w/o TFID

Sample 3


SP-NSPP SP-NSPP w/o RS SP-NSPP w/o PSD SP-NSPP w/o TFID

Sample 4


SP-NSPP SP-NSPP w/o RS SP-NSPP w/o PSD SP-NSPP w/o TFID

Sample 5


SP-NSPP SP-NSPP w/o RS SP-NSPP w/o PSD SP-NSPP w/o TFID

Sample 6


SP-NSPP SP-NSPP w/o RS SP-NSPP w/o PSD SP-NSPP w/o TFID

Sample 7


SP-NSPP SP-NSPP w/o RS SP-NSPP w/o PSD SP-NSPP w/o TFID

IV. Analysis-synthesis task: comparison of different methods at a sampling rate of 24 kHz


Sample 1


Natural RAAR NSPP SP-NSPP

Sample 2


Natural RAAR NSPP SP-NSPP

Sample 3


Natural RAAR NSPP SP-NSPP

Sample 4


Natural RAAR NSPP SP-NSPP

Sample 5


Natural RAAR NSPP SP-NSPP

Sample 6


Natural RAAR NSPP SP-NSPP

Sample 7


Natural RAAR NSPP SP-NSPP

V. Analysis-synthesis task: comparison of different methods at a sampling rate of 48 kHz


Sample 1


Natural RAAR NSPP SP-NSPP

Sample 2


Natural RAAR NSPP SP-NSPP

Sample 3


Natural RAAR NSPP SP-NSPP

Sample 4


Natural RAAR NSPP SP-NSPP

Sample 5


Natural RAAR NSPP SP-NSPP

Sample 6


Natural RAAR NSPP SP-NSPP

Sample 7


Natural RAAR NSPP SP-NSPP

VI. Analysis-synthesis task: comparison of different methods on the FSD50K dataset at a sampling rate of 44.1 kHz


Sample 1


Natural RAAR NSPP SP-NSPP

Sample 2


Natural RAAR NSPP SP-NSPP

Sample 3


Natural RAAR NSPP SP-NSPP

Sample 4


Natural RAAR NSPP SP-NSPP

Sample 5


Natural RAAR NSPP SP-NSPP

Sample 6


Natural RAAR NSPP SP-NSPP

Sample 7


Natural RAAR NSPP SP-NSPP