Fusing Multiple Bandwidth Spectrograms for Improving Speech Enhancement
Published in APSIPA ASC, 2022
The spectrogram is a common input feature for frequency-domain speech enhancement (SE). Spectrograms can be classified as wideband or narrowband according to their resolution, which is controlled by the frame length. Although narrowband and wideband spectrograms each have their own spectral characteristics, SE systems conventionally use a single narrowband spectrogram. In this paper, we propose an SE system that simultaneously uses spectral information at multiple bandwidths; specifically, it adds wider-bandwidth (16 ms and 8 ms) spectrograms as auxiliary information. Multiple-bandwidth information fusion is implemented in the encoder in two ways: fusion only in the last layer (MI-F) and layer-by-layer fusion (MI-L). Experiments on the VB dataset show that spectrograms of different bandwidths provide supplementary information, yielding a PESQ improvement of more than 0.1. The embedding dimension affects the choice of fusion position: MI-F requires a smaller embedding dimension, while MI-L requires a larger dimension and more varied bandwidths. Moreover, a spectrogram that differs more from the main enhancement spectrogram provides better auxiliary information.
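To make the link between frame length and spectrogram resolution concrete, the sketch below computes the window size, number of one-sided frequency bins, and bin spacing for several frame lengths. The 16 ms and 8 ms values come from the abstract; the 16 kHz sample rate and the 32 ms main frame are assumptions made only for illustration, and `StftConfig` and `stft_config` are hypothetical names, not part of the paper's code.

```python
from dataclasses import dataclass

# Assumption: the abstract does not state the sample rate; 16 kHz is used for illustration.
SAMPLE_RATE_HZ = 16_000


@dataclass(frozen=True)
class StftConfig:
    frame_ms: float        # analysis window length in milliseconds
    frame_samples: int     # analysis window length in samples
    freq_bins: int         # one-sided frequency bins: frame_samples // 2 + 1
    bin_spacing_hz: float  # spacing between adjacent frequency bins


def stft_config(frame_ms: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> StftConfig:
    """Derive the spectrogram resolution implied by a given frame length."""
    frame_samples = round(sample_rate_hz * frame_ms / 1000)
    return StftConfig(
        frame_ms=frame_ms,
        frame_samples=frame_samples,
        freq_bins=frame_samples // 2 + 1,
        bin_spacing_hz=sample_rate_hz / frame_samples,
    )


# 16 ms and 8 ms are the auxiliary frame lengths named in the abstract;
# 32 ms for the main spectrogram is an assumption.
configs = [stft_config(ms) for ms in (32, 16, 8)]
for cfg in configs:
    print(cfg)
```

Under these assumptions, halving the frame length doubles the bin spacing (31.25 Hz, 62.5 Hz, 125 Hz), so shorter frames trade frequency resolution for time resolution. It also means the branches produce spectrograms with different numbers of frequency bins, which any fusion scheme would have to reconcile, for example by projecting each branch to a common embedding dimension.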