This is a demonstration of the paper entitled "Determined BSS based on time-frequency masking and its application to harmonic vector analysis."
Blind audio source separation (BSS) is a technique for estimating the individual audio sources in an observed mixture signal when the mixing system (determined by the spatial locations of the microphones and sources) is unknown. When the number of sources equals the number of microphones, the problem is called "determined BSS." BSS can serve as front-end processing for almost any audio application, including automatic speech recognition, hearing-aid systems, and automatic music transcription.
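To make the "determined" condition concrete, the following is a minimal sketch that assumes a simplified instantaneous (non-reverberant) mixture. Real room recordings are convolutive and are usually handled per frequency bin, so this only illustrates why an equal number of sources and microphones gives a square, invertible mixing matrix. The matrix values and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two sources observed by two microphones (determined case).
n_sources, n_samples = 2, 1000
S = rng.laplace(size=(n_sources, n_samples))  # toy source signals

# Hypothetical mixing matrix; it is square because #mics == #sources.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S  # observed mixture, one row per microphone

# Oracle demixing matrix. A BSS algorithm has to estimate W from X alone,
# because A is unknown in practice.
W = np.linalg.inv(A)
S_hat = W @ X
```

Here the demixing matrix is computed from the known mixing matrix purely for illustration. Algorithms such as IVA, ILRMA, and HVA instead estimate it from the observation by relying on a source model.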
The most successful algorithms in the history of determined BSS are independent vector analysis (IVA) and independent low-rank matrix analysis (ILRMA), proposed in 2006 and 2016, respectively (see here for IVA and ILRMA). These algorithms rely on a source model, that is, an assumption about the time-frequency structure of each source. IVA assumes a frequency-vector model that represents the time-varying activation of each source. ILRMA assumes a low-rank time-frequency model that represents the repetition of similar spectral patterns (timbres).
An accurate source model improves the performance of BSS. To seek more effective source models, we proposed time-frequency-masking-based determined BSS (TFMBSS), in which any kind of time-frequency mask can be used as a source model in a plug-and-play manner. For example, TFMBSS based on harmonic/percussive sound separation (HPSS) was proposed by incorporating a single-channel HPSS algorithm.
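The plug-and-play idea can be sketched as an interface: a mask generator is any function that maps an estimated spectrogram to nonnegative weights, and the weights are applied element-wise. The example below is only an illustration under assumptions. The function names, the parameter `mu`, and the choice of mask are hypothetical. The mask shown is group soft-thresholding over frequency, which is the proximity operator of the l2,1 norm and shares one gain across all frequencies of a time frame. It is not the full TFMBSS optimization algorithm.

```python
import numpy as np

def group_threshold_mask(Z, mu):
    """Return a mask in [0, 1] with the same shape as Z (freq x time).

    Every frequency bin of a time frame gets the same gain, which
    depends only on the energy of that frame.
    """
    frame_norm = np.linalg.norm(Z, axis=0, keepdims=True)  # shape (1, T)
    gain = np.maximum(1.0 - mu / np.maximum(frame_norm, 1e-12), 0.0)
    return np.broadcast_to(gain, Z.shape).copy()

def apply_mask(Z, mask_fn, **kwargs):
    """Plug-and-play step: any mask generator can be passed as mask_fn."""
    return mask_fn(Z, **kwargs) * Z

# Toy spectrogram: the first frame is loud, the second is nearly silent.
Z = np.array([[3.0, 0.1],
              [4.0, 0.1]], dtype=complex)
mask = group_threshold_mask(Z, mu=1.0)
Z_masked = apply_mask(Z, group_threshold_mask, mu=1.0)
```

Swapping `group_threshold_mask` for another function with the same signature, such as a mask derived from an HPSS algorithm or from harmonic enhancement, changes the source model without changing the surrounding code.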
Almost all audio sources, including speech and music, have a harmonic structure: a fundamental frequency and its overtones. On the basis of this principle, we proposed harmonic vector analysis (HVA). In HVA, the time-frequency mask in TFMBSS is calculated by enhancing the harmonic structure of each source. The figure shows a typical example of a voiced speech spectrum and its enhancement by cepstrum thresholding. The log-amplitude spectrum of voiced speech (top-left) is converted to the cepstrum (top-right) by the Fourier transform. By thresholding the cepstrum coefficients (bottom-left) and taking the inverse Fourier transform, an enhanced version of the log-amplitude spectrum is obtained (bottom-right).
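The cepstrum thresholding described above can be sketched for a single frame as follows. This is a minimal illustration, not the exact procedure of the paper: the function name, the use of hard thresholding, and the threshold value are assumptions. The sketch uses NumPy's `irfft`/`rfft` pair, which treats the one-sided log-amplitude spectrum as an even sequence. For such real, even data the forward and inverse transforms differ only by a scale factor.

```python
import numpy as np

def enhance_harmonics(log_amp, threshold):
    """Enhance periodic (harmonic) structure in a log-amplitude spectrum.

    log_amp:   one-sided log-amplitude spectrum of one frame, length F
    threshold: cepstrum coefficients with smaller magnitude are zeroed
               (hard thresholding, assumed here for simplicity)
    """
    # Cepstrum of the frame: real and even, length 2 * (F - 1).
    cep = np.fft.irfft(log_amp)
    # Keep only the dominant coefficients, e.g. peaks caused by the
    # regular spacing of the overtones.
    cep_thr = np.where(np.abs(cep) >= threshold, cep, 0.0)
    # Back to the log-amplitude domain, length F again.
    return np.fft.rfft(cep_thr).real
```

A harmonic spectrum is periodic along the frequency axis, so its energy concentrates in a few cepstrum coefficients, while noise-like components spread thinly over many coefficients. Thresholding therefore retains the harmonic ripple and suppresses the rest.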
In the parameter optimization of HVA, the harmonic structure of each estimated source is iteratively enhanced by the operation described above. As a result, components that share a harmonic structure are automatically grouped into the same source, and the sources are gradually separated over the course of the iterative optimization. This process is illustrated in the figure.
The figure shows (whitened) spectrograms and masks for the first five iterations of HVA in a two-source determined BSS case. We can confirm that the time-frequency mask for each source changes drastically and acquires a harmonic structure (striped patterns in the time-frequency domain). As a result, the estimated sources approach their original time-frequency signals.
The following demonstration separates speech source signals by IVA, ILRMA, and HVA. A two-channel mixture signal, recorded in a room with a reverberation time of 130 ms, was used as the observed signal.
The signals were obtained from SiSEC and are used only for academic research purposes.