VM-ASR is a lightweight, dual-stream U-Net model for audio super-resolution (ASR), also known as bandwidth extension. It takes low-resolution audio (e.g., 8 kHz) and reconstructs the missing high-frequency components to produce high-fidelity output (e.g., 48 kHz). The model uses Visual State Space (VSS) blocks derived from VMamba to capture both global and local acoustic context, and its two U-Net streams process the magnitude and phase spectra separately to improve harmonic accuracy and phase reconstruction.
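To make the magnitude/phase split concrete, here is a minimal sketch of decomposing one audio frame into a magnitude spectrum and a phase spectrum, then recombining them. It is not VM-ASR code: it uses a plain DFT from the Python standard library, and every function name (`dft`, `split_magnitude_phase`, `merge_magnitude_phase`, `idft`) is hypothetical. In a dual-stream design, each of the two spectra would be passed to its own network before the merge step.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of a real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT; returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def split_magnitude_phase(spectrum):
    """Separate a complex spectrum into the two inputs of a dual-stream model."""
    magnitudes = [abs(c) for c in spectrum]
    phases = [cmath.phase(c) for c in spectrum]
    return magnitudes, phases


def merge_magnitude_phase(magnitudes, phases):
    """Recombine (possibly enhanced) magnitude and phase into a complex spectrum."""
    return [cmath.rect(m, p) for m, p in zip(magnitudes, phases)]


# Illustrative frame: a sinusoid with exactly 3 cycles over 32 samples.
frame = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
magnitudes, phases = split_magnitude_phase(dft(frame))

# With no enhancement applied, merging and inverting recovers the input frame.
reconstructed = idft(merge_magnitude_phase(magnitudes, phases))
max_error = max(abs(a - b) for a, b in zip(frame, reconstructed))
```

The round trip is lossless up to floating-point error, so any change in the output of such a pipeline comes from what the two streams do to the magnitude and phase in between.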
Evaluation on the VCTK dataset shows that VM-ASR surpasses state-of-the-art approaches in spectral reconstruction quality across multiple upscaling configurations. The model is also compact, at 3.01 M parameters and 2.98 GFLOPS, and runs at roughly 27× real time for 16 kHz → 48 kHz upsampling on a Tesla V100 GPU. This combination of accuracy and efficiency makes VM-ASR a candidate for real-time use in telecommunications, speech synthesis, and audio restoration, including resource-constrained environments.
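The figures above can be unpacked with a little arithmetic. The sketch below assumes the common reading of "N× real time" as audio duration divided by processing time, and uses the Nyquist limit (half the sample rate) to show which frequency band has to be reconstructed. The function names and the 10-second clip are hypothetical and chosen only for illustration.

```python
def upsampling_ratio(src_hz, dst_hz):
    """Ratio of output to input sample rate."""
    return dst_hz / src_hz


def missing_band_hz(src_hz, dst_hz):
    """Band between the input and output Nyquist limits that must be reconstructed."""
    return (src_hz // 2, dst_hz // 2)


def processing_seconds(audio_seconds, real_time_factor):
    """Processing time, assuming real-time factor = audio duration / processing time."""
    return audio_seconds / real_time_factor


ratio = upsampling_ratio(16_000, 48_000)     # 3.0
band = missing_band_hz(16_000, 48_000)       # (8000, 24000)
seconds = processing_seconds(10.0, 27.0)     # about 0.37 s for a 10 s clip
```

Under these assumptions, 16 kHz → 48 kHz upsampling means filling in the 8 kHz to 24 kHz band, and a 10-second clip would take roughly 0.37 seconds to process at 27× real time.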
GitHub • Paper • Demo