Voice and Accompaniment Separation in Music
This is a review of the paper "Voice and accompaniment separation in music using self-attention convolutional neural network".
The reader is assumed to be familiar with basic constructs such as CNNs, LSTMs, and attention.
Introduction
Automatic Karaoke / Remixing
Let’s say one of your favourite musicians/artists has released a new song.
- You want to sing your heart out to it, but you only need the music from that song; you want to sing the lyrics yourself. What do you do?
- Let’s say you are an up-and-coming DJ. You want to keep the singer’s original vocals as they are, but you want to add/mix your own music to show off to your audience.
Problem
How do you do it? How do you separate the music from the vocals, or vice versa?
In general, how do you separate one sound source from another?
Solution
- Base network: Dense-UNet
- Improvement: self-attention subnets added on top of the base network (see the sketch after this list)
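To make the second bullet concrete, here is a minimal, hypothetical sketch of what a self-attention subnet on top of a convolutional feature map can look like. It uses PyTorch, a non-local-block style formulation, and illustrative layer sizes; it is not the paper's exact architecture.

```python
# A minimal sketch (not the paper's exact design) of adding a self-attention
# subnet on top of a convolutional feature map, assuming PyTorch and features
# shaped (batch, channels, time, freq). Layer sizes are illustrative.
import torch
import torch.nn as nn


class SelfAttention2d(nn.Module):
    """Non-local-style self-attention over a 2-D (time x freq) feature map."""

    def __init__(self, channels: int, inner: int = 32):
        super().__init__()
        self.query = nn.Conv2d(channels, inner, kernel_size=1)
        self.key = nn.Conv2d(channels, inner, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, f = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, t*f, inner)
        k = self.key(x).flatten(2)                     # (b, inner, t*f)
        v = self.value(x).flatten(2).transpose(1, 2)   # (b, t*f, c)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (b, t*f, t*f)
        out = (attn @ v).transpose(1, 2).reshape(b, c, t, f)
        return x + out                                 # residual connection


# Usage: wrap some convolutional block's output with attention.
feats = torch.randn(1, 32, 64, 32)                     # (batch, ch, time, freq)
print(SelfAttention2d(32)(feats).shape)                # torch.Size([1, 32, 64, 32])
```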
Dense-UNet voice and accompaniment separation
In computer vision (CV), UNet is used for semantic segmentation; here the same encoder-decoder-with-skip-connections idea is applied to spectrograms.
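As a reminder of that idea, here is a tiny, hypothetical PyTorch sketch of a UNet operating on a (batch, channel, time, frequency) spectrogram. It is an illustration of the encoder-decoder-with-skip-connections pattern only, not the paper's Dense-UNet.

```python
# A minimal UNet sketch (encoder-decoder with skip connections), assuming
# PyTorch. Channel counts and depths are illustrative, not from the paper.
import torch
import torch.nn as nn


def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )


class TinyUNet(nn.Module):
    def __init__(self, in_ch: int = 1, out_ch: int = 1):
        super().__init__()
        self.enc1 = conv_block(in_ch, 16)
        self.enc2 = conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(32, 64)
        self.up2 = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec2 = conv_block(64, 32)      # 64 = 32 (upsampled) + 32 (skip)
        self.up1 = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec1 = conv_block(32, 16)      # 32 = 16 (upsampled) + 16 (skip)
        self.head = nn.Conv2d(16, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e1 = self.enc1(x)                   # full resolution
        e2 = self.enc2(self.pool(e1))       # 1/2 resolution
        b = self.bottleneck(self.pool(e2))  # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)


# Usage on a (batch, 1, time, freq) magnitude spectrogram whose sides are multiples of 4.
spec = torch.randn(1, 1, 128, 64)
print(TinyUNet()(spec).shape)               # torch.Size([1, 1, 128, 64])
```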
Let $X_1(t, f)$ denote the short-time Fourier transform (STFT) of the human voice, where $t$ and $f$ are time and frequency indices. Let $X_2(t, f)$ denote the STFT of the accompaniment. Then the music mixture can be described as:
\begin{equation} Y(t, f) = X_1(t, f) + X_2(t, f)\tag{1} \end{equation}
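Equation (1) holds because the STFT is linear: the spectrogram of the mixture is the sum of the spectrograms of the sources. A quick numerical check of this, using scipy and random signals as stand-ins for vocals and accompaniment (the STFT parameters here are arbitrary assumptions, not from the paper):

```python
# Check Eq. (1): STFT linearity means Y(t, f) = X1(t, f) + X2(t, f).
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(0)
sr = 16_000
vocals = rng.standard_normal(sr)            # stand-in for the voice signal
accompaniment = rng.standard_normal(sr)     # stand-in for the accompaniment

# Complex STFTs X1(t, f), X2(t, f) and Y(t, f) of the mixture.
_, _, X1 = stft(vocals, fs=sr, nperseg=1024)
_, _, X2 = stft(accompaniment, fs=sr, nperseg=1024)
_, _, Y = stft(vocals + accompaniment, fs=sr, nperseg=1024)

print(np.allclose(Y, X1 + X2))              # True
```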
References
Neural network architectures mentioned in the paper:
- UNet-CNN
- Skip-UNet-CNN
- Dense-UNet-CNN
- MMDenseLSTM