Voice and Accompaniment Separation in Music
This is a review of the paper "Voice and accompaniment separation in music using self-attention convolutional neural network".
The reader is assumed to be familiar with basic constructs such as CNNs, LSTMs, and attention.
Introduction
Automatic Karaoke / Remixing
Let’s say one of your favourite musicians/artists has released a new song.
- You want to sing your heart out to it, but you only need the music from that song; you want to sing the lyrics yourself. What do you do?
- Let’s say you are an up-and-coming DJ. You want to keep the singer’s original vocals as they are, but you want to add/mix your own music to show off to your audience.
Problem
How do you do it? How do you separate the music from the vocals, or vice versa?
In general, how do you separate one sound source from another?
Solution
- Base network: Dense-UNet
- Improvement: self-attention subnets added on top of the base network (see the sketch after this list)
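To make the second bullet concrete, here is a minimal, hypothetical sketch of what a self-attention subnet on top of a convolutional feature map can look like. It uses PyTorch, a non-local-block style formulation, and illustrative layer sizes; it is not the paper's exact architecture.

```python
# A minimal sketch (not the paper's exact design) of adding a self-attention
# subnet on top of a convolutional feature map, assuming PyTorch and features
# shaped (batch, channels, time, freq). Layer sizes are illustrative.
import torch
import torch.nn as nn


class SelfAttention2d(nn.Module):
    """Non-local-style self-attention over a 2-D (time x freq) feature map."""

    def __init__(self, channels: int, inner: int = 32):
        super().__init__()
        self.query = nn.Conv2d(channels, inner, kernel_size=1)
        self.key = nn.Conv2d(channels, inner, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, f = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, t*f, inner)
        k = self.key(x).flatten(2)                     # (b, inner, t*f)
        v = self.value(x).flatten(2).transpose(1, 2)   # (b, t*f, c)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (b, t*f, t*f)
        out = (attn @ v).transpose(1, 2).reshape(b, c, t, f)
        return x + out                                 # residual connection


# Usage: wrap some convolutional block's output with attention.
feats = torch.randn(1, 32, 64, 32)                     # (batch, ch, time, freq)
print(SelfAttention2d(32)(feats).shape)                # torch.Size([1, 32, 64, 32])
```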
Dense-UNet voice and accompaniment separation
In computer vision (CV), UNet is used for semantic segmentation; here the same encoder-decoder-with-skip-connections idea is applied to spectrograms.
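As a reminder of that idea, here is a tiny, hypothetical PyTorch sketch of a UNet operating on a (batch, channel, time, frequency) spectrogram. It is an illustration of the encoder-decoder-with-skip-connections pattern only, not the paper's Dense-UNet.

```python
# A minimal UNet sketch (encoder-decoder with skip connections), assuming
# PyTorch. Channel counts and depths are illustrative, not from the paper.
import torch
import torch.nn as nn


def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )


class TinyUNet(nn.Module):
    def __init__(self, in_ch: int = 1, out_ch: int = 1):
        super().__init__()
        self.enc1 = conv_block(in_ch, 16)
        self.enc2 = conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(32, 64)
        self.up2 = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec2 = conv_block(64, 32)      # 64 = 32 (upsampled) + 32 (skip)
        self.up1 = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec1 = conv_block(32, 16)      # 32 = 16 (upsampled) + 16 (skip)
        self.head = nn.Conv2d(16, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e1 = self.enc1(x)                   # full resolution
        e2 = self.enc2(self.pool(e1))       # 1/2 resolution
        b = self.bottleneck(self.pool(e2))  # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)


# Usage on a (batch, 1, time, freq) magnitude spectrogram whose sides are multiples of 4.
spec = torch.randn(1, 1, 128, 64)
print(TinyUNet()(spec).shape)               # torch.Size([1, 1, 128, 64])
```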
Let $X_1(t, f)$ denote the short-time Fourier transform (STFT) of the human voice, where $t$ and $f$ are time and frequency indices. Let $X_2(t, f)$ denote the STFT of the accompaniment. Then the music mixture can be described as:
\begin{equation} Y(t, f) = X_1(t, f) + X_2(t, f)\tag{1} \end{equation}
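Equation (1) holds because the STFT is linear: the spectrogram of the mixture is the sum of the spectrograms of the sources. A quick numerical check of this, using scipy and random signals as stand-ins for vocals and accompaniment (the STFT parameters here are arbitrary assumptions, not from the paper):

```python
# Check Eq. (1): STFT linearity means Y(t, f) = X1(t, f) + X2(t, f).
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(0)
sr = 16_000
vocals = rng.standard_normal(sr)            # stand-in for the voice signal
accompaniment = rng.standard_normal(sr)     # stand-in for the accompaniment

# Complex STFTs X1(t, f), X2(t, f) and Y(t, f) of the mixture.
_, _, X1 = stft(vocals, fs=sr, nperseg=1024)
_, _, X2 = stft(accompaniment, fs=sr, nperseg=1024)
_, _, Y = stft(vocals + accompaniment, fs=sr, nperseg=1024)

print(np.allclose(Y, X1 + X2))              # True
```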
References
Neural network architectures mentioned in the paper:
- UNet-CNN
- Skip-UNet-CNN
- Dense-UNet-CNN
- MMDenseLSTM