Spectrograms Fusion-based End-to-end Robust Automatic Speech Recognition

Published in APSIPA ASC, 2021

To improve the robustness of automatic speech recognition (ASR), speech enhancement (SE) is often used as a front-end noise-removal process. Although mapping-based and masking-based SE systems are complementary, conventionally only one of them has been used as the ASR front-end. We propose a spectrogram fusion (SF)-based end-to-end (E2E) robust ASR system in which mapping-based and masking-based SE are used as the front-end simultaneously. We adopt SF to combine the advantages of the two SE systems. The SF and ASR modules are connected in an E2E manner, and joint training is conducted to fine-tune both the front-end and the back-end. We compare the performance of different front-ends after joint training. In experiments using the Aishell and PNL 100 Nonspeech Sounds datasets, we found that fusing the two SE systems is beneficial for ASR, especially at low signal-to-noise ratios, where a relative improvement of more than 7% is achieved.
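
As a rough illustration of the front-end idea, the sketch below combines a masking-based estimate (a mask multiplied element-wise with the noisy spectrogram) and a mapping-based estimate (a directly predicted spectrogram) into one fused spectrogram. The abstract does not specify how the SF module combines the two, so the fixed-weight average used here is an assumption made purely for illustration. The names `apply_mask`, `fuse`, and `weight` are hypothetical, and the hard-coded `mask` and `mapped` values stand in for the outputs of trained SE networks.

```python
from typing import List

# Magnitude spectrogram laid out as [frames][frequency bins].
Spectrogram = List[List[float]]


def apply_mask(noisy: Spectrogram, mask: Spectrogram) -> Spectrogram:
    """Masking-based SE: scale each time-frequency bin of the noisy input."""
    return [[m * x for m, x in zip(mask_row, noisy_row)]
            for mask_row, noisy_row in zip(mask, noisy)]


def fuse(mapped: Spectrogram, masked: Spectrogram,
         weight: float = 0.5) -> Spectrogram:
    """Hypothetical fusion rule: convex combination of the two estimates.

    This is an assumption for illustration, not the paper's SF module.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return [[weight * a + (1.0 - weight) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mapped, masked)]


# Toy input: 2 frames x 3 frequency bins.
noisy = [[1.0, 2.0, 4.0], [2.0, 2.0, 2.0]]
mask = [[0.5, 1.0, 0.25], [0.0, 0.5, 1.0]]    # stand-in for a mask network's output
mapped = [[0.5, 1.0, 2.0], [1.0, 1.0, 1.0]]   # stand-in for a mapping network's output

masked = apply_mask(noisy, mask)
fused = fuse(mapped, masked, weight=0.5)
print(fused)  # [[0.5, 1.5, 1.5], [0.5, 1.0, 1.5]]
```

In a jointly trained system, the fused spectrogram would be passed to the ASR back-end, and the recognition loss would be back-propagated through the fusion step into both SE branches.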