Jukebox by OpenAI outputs a new music sample from scratch


OpenAI has released Jukebox, a neural network that generates music with simple singing in a variety of genres and artists’ styles. The company stated in its post, “Provided with genre, artist, and lyrics as input, Jukebox outputs a new music sample produced from scratch.”

Generating CD-quality music is a challenging problem because a typical song has over 10 million timesteps, so the model must handle many long-range dependencies to re-create the sound. The OpenAI team used an autoencoder that compresses raw audio into a lower-dimensional space by discarding irrelevant bits of information.
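To get a feel for the "over 10 million timesteps" figure, here is a minimal back-of-the-envelope sketch. It assumes the standard CD sampling rate of 44.1 kHz and a four-minute song; the song length and the `timesteps` helper are illustrative assumptions, not values taken from this article.

```python
# Rough count of audio timesteps (samples per channel) in one song at CD quality.
CD_SAMPLE_RATE_HZ = 44_100  # standard CD sampling rate
SONG_MINUTES = 4            # assumed "typical" song length (illustrative only)


def timesteps(minutes: float, sample_rate_hz: int = CD_SAMPLE_RATE_HZ) -> int:
    """Number of samples per channel for a clip of the given length."""
    return int(minutes * 60 * sample_rate_hz)


total = timesteps(SONG_MINUTES)
print(f"{total:,} timesteps")  # 10,584,000 timesteps
```

Every one of those samples is a value the model has to get right, which is why the sequence length alone makes raw audio so much harder than symbolic music.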

This compression is what makes training feasible: the researchers generate audio in the compressed space and later upsample it back to raw audio.
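The compress-then-upsample idea can be illustrated with a deliberately simple toy. This is not Jukebox's autoencoder: block averaging stands in for a learned encoder, repetition stands in for a learned decoder, and the compression factor of 8, the 8 kHz test tone, and the function names are all assumptions made for the sketch.

```python
import math


def compress(samples, factor):
    """Toy encoder: replace each block of `factor` samples with its mean."""
    blocks = (samples[i:i + factor] for i in range(0, len(samples), factor))
    return [sum(block) / len(block) for block in blocks]


def upsample(codes, factor, length):
    """Toy decoder: repeat each compressed value `factor` times, then trim."""
    out = [c for c in codes for _ in range(factor)]
    return out[:length]


# One second of a 440 Hz tone at a low 8 kHz rate (hypothetical input).
rate = 8_000
audio = [math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]

codes = compress(audio, factor=8)                         # shorter sequence
restored = upsample(codes, factor=8, length=len(audio))   # original length again

print(len(audio), len(codes), len(restored))  # 8000 1000 8000
```

The shorter `codes` sequence is what a generative model would work on in this analogy. A learned autoencoder chooses what to discard far more carefully than averaging does, but the trade-off is the same: fewer timesteps to model, at the cost of detail that the upsampling step has to reconstruct.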

Jukebox builds on the company’s previous work on MuseNet. The company said, “Now in raw audio, our models must learn to tackle high diversity as well as very long-range structure, and the raw audio domain is particularly unforgiving of errors, in short, medium, or long-term timing.”

To train the model, the team gathered a dataset of 1.2 million songs, 600,000 of which are English songs with lyrics and metadata. The metadata includes the genre, artist, and year of each song.

The company said, “A significant challenge is the lack of a well-aligned dataset: we only have lyrics at a song level without alignment to the music, and thus for a given chunk of audio we don’t know precisely which portion of the lyrics (if any) appear. We also may have song versions that don’t match the lyric versions, as might occur if a given song is performed by several different artists in slightly different ways. Additionally, singers frequently repeat phrases, or otherwise vary the lyrics, in ways that are not always captured in the written lyrics.”

OpenAI has standardized its deep learning frameworks on PyTorch, and this project continues that pattern.

Inference runs on an NVIDIA V100 GPU, and it takes almost three hours to fully sample 20 seconds of music on a single GPU.
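To put those sampling figures in perspective, the short sketch below converts them into a slowdown relative to real-time playback. It rounds "almost three hours" up to exactly three and assumes cost scales linearly with clip length; both are simplifying assumptions, and `slowdown` is a hypothetical helper.

```python
SAMPLING_HOURS = 3   # "almost three hours", rounded up
AUDIO_SECONDS = 20   # seconds of music produced in that time


def slowdown(sampling_hours: float, audio_seconds: float) -> float:
    """Seconds of compute needed per second of generated audio."""
    return sampling_hours * 3600 / audio_seconds


ratio = slowdown(SAMPLING_HOURS, AUDIO_SECONDS)
print(ratio)              # 540.0
print(ratio * 60 / 3600)  # 9.0 hours per minute of audio (linear extrapolation)
```

In other words, under these assumptions generation is several hundred times slower than playback on one GPU.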
