Demonstration of Research

Audio source separation based on supervised nonnegative matrix factorization with basis deformation

Nonnegative matrix factorization (NMF) is an algorithm that decomposes a nonnegative matrix (matrix X in fig.) into the product of two nonnegative matrices (matrices T and V in fig.). NMF can be interpreted as a low-rank approximation and can be used for extracting meaningful patterns and for unsupervised learning.
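To make the decomposition concrete, the following is a minimal sketch of NMF using the multiplicative update rules of Lee and Seung for the squared Euclidean distance. The function name `nmf`, the iteration count, the small constant `eps`, and the toy matrix are assumptions made for illustration; they are not taken from our implementation.

```python
import numpy as np

def nmf(X, rank, n_iter=500, eps=1e-12, seed=0):
    """Factor a nonnegative X (F x N) into T (F x rank) and V (rank x N)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    T = rng.random((n_rows, rank)) + eps
    V = rng.random((rank, n_cols)) + eps
    for _ in range(n_iter):
        # Multiplicative updates: each factor is scaled by a nonnegative
        # ratio, so T and V stay nonnegative throughout.
        T *= (X @ V.T) / (T @ V @ V.T + eps)
        V *= (T.T @ X) / (T.T @ T @ V + eps)
    return T, V

# Toy input (assumption): a nonnegative matrix that is exactly rank 4.
rng = np.random.default_rng(1)
X = rng.random((20, 4)) @ rng.random((4, 30))

T, V = nmf(X, rank=4)
rel_err = np.linalg.norm(X - T @ V) / np.linalg.norm(X)
print(f"relative approximation error: {rel_err:.4f}")
```

The rank controls how many patterns are extracted; a rank smaller than both matrix dimensions is what makes this a low-rank approximation.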

When we apply NMF to audio signals, we decompose a time-frequency matrix (a spectrogram, matrix X in fig.). As a result, several bases (timbre parts, matrix T in fig.) and their activations (the time-varying volume of each part, matrix V in fig.) are obtained. Audio editing or audio source separation can be achieved by editing or modifying these parts and activations. Automatic music transcription (producing music scores from sound) may also be realized by NMF with post-processing.
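As an illustration of where the time-frequency matrix comes from, the sketch below builds a magnitude spectrogram with a short-time Fourier transform. The frame length, hop size, Hann window, sampling rate, and test tones are assumptions chosen for the example; this page does not state the analysis settings used in the demonstration.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=1024, hop=256):
    """Return a nonnegative (frequency x time) matrix of STFT magnitudes."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Each column is one windowed frame of the signal.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    # rfft keeps only the nonnegative-frequency bins of a real signal.
    return np.abs(np.fft.rfft(frames, axis=0))

# Toy signal (assumption): 440 Hz for the first half second, then 880 Hz.
sr = 16000
t = np.arange(sr) / sr
signal = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 880 * t))

X = magnitude_spectrogram(signal)
first_peak_bin = int(np.argmax(X[:, 0]))
last_peak_bin = int(np.argmax(X[:, -1]))
print(X.shape, first_peak_bin, last_peak_bin)
```

Every entry of `X` is a magnitude and therefore nonnegative, which is what allows NMF to be applied to it. Here the dominant frequency bin moves upward between the first and last frames, reflecting the change of pitch.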

Figure of NMF decomposition for audio signals

Audio source separation is a technique for separating the individual audio sources in a mixture. In particular, a supervised approach is often used, in which sample sounds of the target source are prepared and used for training. In the context of NMF, "supervised NMF" is the most popular algorithm and provides better separation performance than unsupervised approaches.

In supervised NMF, a sample sound signal is decomposed by NMF in a training stage, and the timbre parts (matrix T in fig.) are obtained in advance. Then, in a separation stage, the other matrices (matrices G, H, and U in fig.) are estimated while the pre-trained timbre parts are kept fixed. As a result, the target source components (TG in fig.) and the other source components (HU in fig.) are separated, and their audio signals can be obtained by transforming them back into time-domain waveforms.
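The two stages can be sketched as follows. This is a minimal illustration that assumes the squared Euclidean distance and multiplicative updates; the function names `train_bases` and `separate`, the numbers of bases, and the random toy matrices standing in for spectrograms are all assumptions, not the settings of the demonstration.

```python
import numpy as np

EPS = 1e-12

def train_bases(S, n_bases, n_iter=200, seed=0):
    """Training stage: plain NMF of the sample spectrogram S; returns T."""
    rng = np.random.default_rng(seed)
    T = rng.random((S.shape[0], n_bases)) + EPS
    V = rng.random((n_bases, S.shape[1])) + EPS
    for _ in range(n_iter):
        T *= (S @ V.T) / (T @ V @ V.T + EPS)
        V *= (T.T @ S) / (T.T @ T @ V + EPS)
    return T

def separate(X, T, n_other, n_iter=300, seed=1):
    """Separation stage: fit X ~ T G + H U while T is never updated."""
    rng = np.random.default_rng(seed)
    G = rng.random((T.shape[1], X.shape[1])) + EPS
    H = rng.random((X.shape[0], n_other)) + EPS
    U = rng.random((n_other, X.shape[1])) + EPS
    errors = []
    for _ in range(n_iter):
        # Only G, H, and U are updated; the model T G + H U is
        # recomputed after each update.
        G *= (T.T @ X) / (T.T @ (T @ G + H @ U) + EPS)
        H *= (X @ U.T) / ((T @ G + H @ U) @ U.T + EPS)
        U *= (H.T @ X) / (H.T @ (T @ G + H @ U) + EPS)
        errors.append(np.linalg.norm(X - T @ G - H @ U))
    return T @ G, H @ U, errors

# Toy data (assumption): random nonnegative matrices instead of real audio.
rng = np.random.default_rng(42)
T_true = rng.random((40, 5))
sample = T_true @ rng.random((5, 50))
mixture = T_true @ rng.random((5, 60)) + rng.random((40, 5)) @ rng.random((5, 60))

T = train_bases(sample, n_bases=5)
T_before = T.copy()
target_est, other_est, errors = separate(mixture, T, n_other=5)
print(f"fit error: first iteration {errors[0]:.3f}, last iteration {errors[-1]:.3f}")
```

`target_est` and `other_est` play the roles of TG and HU. Converting them to audio would additionally require phase information (for example, the phase of the mixture) and an inverse transform, which this sketch omits.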

In supervised NMF, we must prepare a sample sound file of the target source as supervision. For example, if we want to separate a piano sound from music, a sample sound of exactly the same piano (the same player, model, and playing style) is required in the training stage. This is a serious problem because we usually have only the already-mixed music, and the individual sound sources are not obtainable. Typical sample sound files that users can prepare are, e.g., sounds played by themselves or artificial sounds synthesized by MIDI/DAW software.

To solve the problem mentioned above, we proposed a new algorithm called "supervised NMF with basis deformation." This algorithm can extract source components whose timbres are not identical but similar to those of the pre-trained sample sound. In the proposed method, we introduce a new term (matrix D in fig.) that deforms the pre-trained timbre parts; this term can take both positive and negative values. Since excessive deformation degrades the source separation performance, the range of values in the deformation term is constrained.
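The idea of a constrained deformation term can be sketched as below. Please note that this is not the update rule of the published algorithm: the model X ~ (T + D)G + HU with a squared Euclidean cost, the box constraint |D| <= mu * T, the projected gradient step for D, and the names `separate_with_deformation` and `mu` are all assumptions made to illustrate the concept of a signed, range-limited deformation.

```python
import numpy as np

EPS = 1e-12

def separate_with_deformation(X, T, n_other, mu=0.2, n_iter=300, seed=0):
    """Fit X ~ (T + D) G + H U with T fixed and |D| <= mu * T (0 <= mu <= 1)."""
    rng = np.random.default_rng(seed)
    D = np.zeros_like(T)  # deformation term, may become positive or negative
    G = rng.random((T.shape[1], X.shape[1])) + EPS
    H = rng.random((X.shape[0], n_other)) + EPS
    U = rng.random((n_other, X.shape[1])) + EPS
    for _ in range(n_iter):
        B = T + D  # deformed bases; nonnegative because |D| <= mu * T <= T
        G *= (B.T @ X) / (B.T @ (B @ G + H @ U) + EPS)
        H *= (X @ U.T) / ((B @ G + H @ U) @ U.T + EPS)
        U *= (H.T @ X) / (H.T @ (B @ G + H @ U) + EPS)
        # Assumed update for D: one gradient step on the squared error,
        # then projection (clipping) onto the allowed range.
        residual = X - B @ G - H @ U
        step = 1.0 / (np.linalg.norm(G @ G.T, 2) + EPS)
        D = np.clip(D + step * (residual @ G.T), -mu * T, mu * T)
    return (T + D) @ G, H @ U, D

# Toy data (assumption): the target in the mixture uses bases that differ
# from the pre-trained T by up to 15 percent per entry.
rng = np.random.default_rng(7)
T = rng.random((40, 5)) + 0.1
T_mix = T * (1.0 + 0.15 * rng.uniform(-1.0, 1.0, size=T.shape))
mixture = T_mix @ rng.random((5, 60)) + rng.random((40, 5)) @ rng.random((5, 60))

mu = 0.2
target_est, other_est, D = separate_with_deformation(mixture, T, n_other=5, mu=mu)
print(f"largest relative deformation: {np.max(np.abs(D) / T):.3f}")
```

In this sketch, `mu` limits how far each entry of the bases may move from its pre-trained value. Setting `mu` to 0 keeps D at zero, so the bases stay exactly as trained, while larger values allow more deformation at the risk of the deformed bases also fitting non-target sources.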

Figure of supervised NMF with basis deformation

In this demonstration, we produced the music signals and the sample sounds using different MIDI synthesizers, where each sample sound consists of a two-octave scale (24 notes). The proposed algorithm can extract the target source, whose timbre is similar to that of the sample sound.

Mixture signals with five sources

Input music
(Sax, A.Guitar, E.Guitar, Bass, Drums)

Separated Sax sound using supervised NMF with basis deformation

Sample sound
(Sax produced by different MIDI synth.)

Separated source
(Sax)

Other sources
(A.Guitar, E.Guitar, Bass, Drums)

Separated Bass sound using supervised NMF with basis deformation

Sample sound
(Bass produced by different MIDI synth.)

Separated source
(Bass)

Other sources
(Sax, A.Guitar, E.Guitar, Drums)

References

NMF

D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, pp. 788–791, 1999.

Derivation of update rules in NMF

D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proc. Adv. Neural Inform. Process. Syst., 2000, vol. 13, pp. 556–562.

Supervised NMF

P. Smaragdis, B. Raj, and M. Shashanka, "Supervised and semi-supervised separation of sounds from single-channel mixtures," in Proc. Int. Conf. Independent Compon. Anal. Signal Separation, 2007, pp. 414–421.

Supervised NMF with orthogonal/maximum-divergence penalty

D. Kitamura, H. Saruwatari, K. Yagi, K. Shikano, Y. Takahashi, and K. Kondo, "Music signal separation based on supervised nonnegative matrix factorization with orthogonality and maximum-divergence penalties," IEICE Trans. Fundamentals Electron. Commun. Comput. Sci., vol. E97-A, no. 5, pp. 1113–1118, 2014.

Supervised NMF with basis deformation

D. Kitamura, H. Saruwatari, K. Shikano, K. Kondo, and Y. Takahashi, "Music signal separation by supervised nonnegative matrix factorization with basis deformation," in Proc. DSP, 2013.