In independence-based blind audio source separation, such as independent component analysis (ICA), we must solve the permutation problem, i.e., the alignment of the estimated components along the frequency axis. To solve this problem, independent vector analysis (IVA) and independent low-rank matrix analysis (ILRMA) were proposed: IVA assumes co-occurrence of the frequency components of each source, and ILRMA assumes both co-occurrence of the time-frequency components and a low-rank structure of each source.
With these methods, blind source separation (BSS) can be achieved without encountering the permutation problem, provided that the assumed source model fits the target sources.
However, the time-frequency structure strongly depends on the type of source. For example, vocals have complex (not low-rank) spectrogram structures because of continuously and dynamically fluctuating pitches, so ILRMA is not suitable for vocal separation. In contrast, guitars and drums repeat the same timbral patterns many times; thus, their spectrograms have a low-rank time-frequency structure, and ILRMA can separate these sources with high accuracy.
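As a rough illustration of what "low-rank" means here, the following minimal sketch (not part of ILRMA itself) measures how much of a magnitude spectrogram's energy is captured by a rank-r truncated SVD; a guitar or drum spectrogram is typically well reproduced with far fewer components than a vocal one. The file name "guitar.wav" is a hypothetical example input.

```python
# Sketch: quantify the "low-rankness" of a magnitude spectrogram via SVD.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

fs, x = wavfile.read("guitar.wav")           # hypothetical input file
x = x.astype(np.float64)
if x.ndim > 1:                               # fold down to mono if stereo
    x = x.mean(axis=1)

_, _, X = stft(x, fs=fs, nperseg=2048)       # complex STFT, shape (freq, time)
S = np.abs(X)                                # magnitude spectrogram

U, sv, Vt = np.linalg.svd(S, full_matrices=False)
for r in (5, 20, 80):
    captured = (sv[:r] ** 2).sum() / (sv ** 2).sum()
    print(f"rank {r:3d}: {100 * captured:5.1f}% of spectrogram energy")
```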
It is almost impossible to find a source model that is valid for all sources. Moreover, since the criterion of "suitability of the source model" cannot be defined well, users must choose an appropriate source model based on their own knowledge or experience.
If a suitable source model could be obtained automatically from a dataset of various sources, the separation performance would further improve. In practice, we first prepare 30 hours of solo-recorded vocal and guitar sounds in advance. Then, a "vocal-enhance model" and a "guitar-enhance model" are trained on these recordings and utilized as the source models.
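As a minimal sketch of what training such an enhance model can look like, the following toy example maps mixture magnitude-spectrum frames to the corresponding source (e.g., vocal) frames. The architecture, frame size, and data pipeline are illustrative assumptions, not the network used in the actual experiments.

```python
# Sketch: train a toy "enhance model" on paired mixture / solo spectra.
import torch
import torch.nn as nn

n_freq = 1025                                   # e.g., 2048-point STFT bins

model = nn.Sequential(
    nn.Linear(n_freq, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, n_freq), nn.ReLU(),         # nonnegative magnitude output
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-ins for the real training data (paired mixture / solo-recorded
# magnitude spectra); here they are just random tensors.
mix_frames = torch.rand(256, n_freq)
vocal_frames = torch.rand(256, n_freq)

for epoch in range(10):                          # toy training loop
    pred = model(mix_frames)
    loss = loss_fn(pred, vocal_frames)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```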
These models can be trained using deep neural networks (DNNs); that is, the time-frequency structures of guitars and vocals are captured by the DNNs. Since the DNN is used only for estimating linear time-invariant separation filters (the same as those in IVA and ILRMA), the separated signal includes less artificial distortion than that of single-channel DNN-based separation methods. Also, similar to ILRMA, an efficient optimization algorithm for the demixing matrix called iterative projection (IP) can be utilized to achieve computationally cheap BSS. We call this method "independent deeply learned matrix analysis (IDLMA)."
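For reference, here is a minimal numpy sketch of one pass of the IP update for the demixing matrices, assuming the DNN source models have already provided source power estimates R. The variable names and array layout are mine, not taken from the original papers.

```python
import numpy as np

def ip_update(X, W, R):
    """
    One pass of iterative-projection (IP) updates for the demixing matrices.

    X : complex observations, shape (F, T, M); X[i, j] is the M-channel
        observation at frequency bin i and time frame j
    W : demixing matrices, shape (F, M, M); estimated sources are
        Y[i, j] = W[i] @ X[i, j]
    R : nonnegative source power estimates, shape (F, T, M)
        (in IDLMA these come from the trained DNN source models)
    """
    F, T, M = X.shape
    I_M = np.eye(M)
    for i in range(F):
        for n in range(M):
            # Covariance of the observations weighted by 1 / (power of source n)
            V = (X[i].T * (1.0 / R[i, :, n])) @ X[i].conj() / T   # (M, M)
            # w = (W V)^{-1} e_n, then normalize so that w^H V w = 1
            w = np.linalg.solve(W[i] @ V, I_M[:, n])
            w /= np.sqrt(np.real(w.conj() @ V @ w))
            W[i, n, :] = w.conj()
    return W
```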
In this demonstration, we compare the performance of IDLMA and its extension, t-IDLMA, where the DNN source models are trained for vocals, bass, and drums using 50 songs. The following four methods are compared: ILRMA, t-ILRMA (an extension of ILRMA), Duong+DNN, and DNN+WF, where ILRMA and t-ILRMA are blind (unsupervised) methods. Duong+DNN is a multichannel audio source separation technique that estimates spatial covariance matrices using the DNN source model; however, it requires a high computational cost (about 10 times slower than ILRMA). DNN+WF is a single-channel audio source separation technique based on a Wiener filter constructed from the DNN outputs. All the music data used in this demonstration were obtained from SiSEC and are used for academic research purposes only.
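For completeness, a minimal sketch of the single-channel multisource Wiener filtering that the DNN+WF baseline is built on, assuming the DNNs have already produced a power-spectrogram estimate for each source (function and variable names are illustrative).

```python
import numpy as np

def multisource_wiener_filter(X_mix, source_powers, eps=1e-12):
    """
    Single-channel multisource Wiener filtering (DNN+WF-style baseline).

    X_mix         : complex mixture STFT, shape (F, T)
    source_powers : list of nonnegative power spectrograms, one per source,
                    each of shape (F, T); in DNN+WF these are DNN outputs
    Returns a list of complex STFTs, one per separated source.
    """
    total = sum(source_powers) + eps             # avoid division by zero
    return [(p / total) * X_mix for p in source_powers]
```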