so voice! is the spiritual successor to my Deep Learning Capstone Project. The goal is to develop a choir synthesizer that leverages deep learning to produce audio indistinguishable from real recordings.
At a high level, so voice! consists of two main parts: the parsing engine and the synthesizer. The parsing engine reads in sheet music, currently from well-formed (read: curated) MusicXML files, and converts it to a serial representation that can be fed into a machine learning pipeline. The synthesizer then pipes that representation through a series of deep learning models, which ultimately output an audio file: the synthesized performance of the input. Eventually I'd also like to add a machine learning pipeline for reading sheet music from images (e.g. PDFs).
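As a rough illustration of the kind of serial representation the parsing engine produces (the tuple layout and simplifications below are my own, not the project's actual format), here's a minimal sketch that flattens the first part of a well-formed MusicXML file into a list of (MIDI pitch, duration, lyric) events using only the standard library:

```python
import xml.etree.ElementTree as ET

def parse_musicxml(path):
    """Flatten the first part of a well-formed MusicXML file into a serial
    list of note events: (midi_pitch, duration, lyric).
    Rests are encoded with midi_pitch = None. Chords, ties, and multiple
    voices are ignored for simplicity."""
    step_to_semitone = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
    events = []
    root = ET.parse(path).getroot()
    part = root.find("part")  # first part only, for simplicity
    for measure in part.findall("measure"):
        for note in measure.findall("note"):
            duration = int(note.findtext("duration", default="0"))
            if note.find("rest") is not None:
                events.append((None, duration, None))
                continue
            pitch = note.find("pitch")
            step = pitch.findtext("step")
            octave = int(pitch.findtext("octave"))
            alter = int(pitch.findtext("alter", default="0"))
            midi = 12 * (octave + 1) + step_to_semitone[step] + alter
            lyric = note.findtext("lyric/text")  # None if no lyric on this note
            events.append((midi, duration, lyric))
    return events
```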
In terms of the actual machine learning pipeline, I have been developing custom architectures ever since I started working on the project in early 2019. Initially I centered my design around WaveNet, which Google uses to great effect for its Text-to-Speech service; however, several compounding factors make WaveNets unsuitable for choral voice synthesis, such as difficulty parallelizing generation, low output sample rates, and a limited pitch range. More recently, an article/paper from Project Magenta has piqued my interest, and I have since developed several prototype differentiable synthesizer networks that leverage those techniques. I'm also very interested in experimenting with transformers, which I think are well suited to the pipelines I've been building.
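I can't share those prototypes here, but to give a sense of the core idea behind differentiable synthesis (as popularized by Magenta's DDSP work), here is a minimal PyTorch sketch of a differentiable harmonic synthesizer. The function name, shapes, and defaults are my own assumptions; the point is just that the audio comes out of plain tensor ops, so gradients can flow from the waveform back into whatever network predicts the controls:

```python
import torch

def harmonic_synth(f0, harmonic_amps, sample_rate=16000, hop=64):
    """Differentiable additive synthesizer sketch.

    f0:            (batch, frames)             fundamental frequency in Hz
    harmonic_amps: (batch, frames, harmonics)  per-harmonic amplitudes
    Returns audio of shape (batch, frames * hop).
    """
    batch, frames, n_harm = harmonic_amps.shape
    # Upsample frame-rate controls to sample rate (nearest-neighbor for brevity).
    f0 = torch.repeat_interleave(f0, hop, dim=1)               # (B, T)
    amps = torch.repeat_interleave(harmonic_amps, hop, dim=1)  # (B, T, H)
    # Instantaneous phase of each harmonic: integrate frequency over time.
    harm_numbers = torch.arange(1, n_harm + 1, device=f0.device)
    freqs = f0.unsqueeze(-1) * harm_numbers                    # (B, T, H)
    phases = 2 * torch.pi * torch.cumsum(freqs / sample_rate, dim=1)
    # Silence harmonics above Nyquist, then sum the sinusoids.
    amps = amps * (freqs < sample_rate / 2)
    return (amps * torch.sin(phases)).sum(dim=-1)
```

In a full pipeline, the fundamental frequency and harmonic amplitudes would be predicted by a network conditioned on the parsed score, and a loss on the output audio could train that network end to end.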
This demonstrates the final step in the pipeline, which generates audio given the output of all the previous steps. To test it in isolation, I have it attempt to recreate real audio clips, which gives an idea of how well it will perform in the full pipeline.
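The actual evaluation setup isn't shown here, but a common way to quantify how closely a resynthesized clip matches the original recording is a multi-scale spectrogram distance, which also doubles as a training loss for neural synthesizers. A sketch in PyTorch, with FFT sizes chosen arbitrarily:

```python
import torch

def multiscale_spectral_distance(reference, resynth,
                                 fft_sizes=(2048, 1024, 512, 256)):
    """Sum L1 distances between magnitude spectrograms of the original
    recording and the resynthesized clip at several time/frequency
    resolutions. Lower is a closer match."""
    total = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=reference.device)
        ref_mag = torch.stft(reference, n_fft, hop_length=n_fft // 4,
                             window=window, return_complex=True).abs()
        syn_mag = torch.stft(resynth, n_fft, hop_length=n_fft // 4,
                             window=window, return_complex=True).abs()
        total = total + (ref_mag - syn_mag).abs().mean()
    return total
```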
Additionally, using the music parsing engine, I put together a silly spinoff project leveraging Google Arts & Culture's Blob Opera as a stand-in for choir synthesis.
These demonstrate a few previous approaches I tried for audio generation. For the Auto-Encoder GAN experiment, I attempted to build a mostly vanilla autoencoder and train it with an adversarial loss, the kind commonly used in Generative Adversarial Networks.
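The experiment's exact architecture isn't reproduced here, but the training idea can be sketched as follows: a reconstruction loss keeps the autoencoder faithful to the input audio, while a discriminator pushes the reconstructions toward sounding "real". The module names, loss choices, and adversarial weight below are illustrative placeholders, not the original experiment's values:

```python
import torch
import torch.nn.functional as F

def autoencoder_gan_step(autoencoder, discriminator, ae_opt, d_opt,
                         real_audio, adv_weight=0.1):
    """One training step for an autoencoder with an added adversarial loss.
    `autoencoder` and `discriminator` are ordinary nn.Modules; the weighting
    and losses are placeholders for illustration."""
    # --- Discriminator: distinguish real clips from reconstructions -------
    recon = autoencoder(real_audio)
    d_real = discriminator(real_audio)
    d_fake = discriminator(recon.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # --- Autoencoder: reconstruct well, but also fool the discriminator ---
    d_fake = discriminator(recon)
    ae_loss = (F.l1_loss(recon, real_audio)
               + adv_weight * F.binary_cross_entropy_with_logits(
                   d_fake, torch.ones_like(d_fake)))
    ae_opt.zero_grad()
    ae_loss.backward()
    ae_opt.step()
    return ae_loss.item(), d_loss.item()
```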
For the LPC synthesis experiment, I recorded samples of each commonly sung phoneme, sung by a professional singer, and then used the Linear Predictive Coding technique to splice them together at the correct pitches, generating a whole song. I had a lot of trouble generating noise-based phonemes (e.g. 's', 'f', 't', 'ch'), which are the source of the popping and clicking sounds in the full-lyrics example. The second example locks the phoneme to 'a' for the whole song, to give a version without the clicks and pops.
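To make the LPC idea concrete, here is a minimal sketch of resynthesizing one recorded vowel at a new pitch: fit an all-pole LPC filter to the recording (which captures the vocal-tract shape), then drive that filter with an impulse train at the target fundamental. The use of librosa/scipy and the file name are assumptions for illustration, not the project's actual implementation:

```python
import numpy as np
import librosa
import scipy.signal

def lpc_repitch(vowel, sr, target_hz, order=24):
    """Resynthesize a recorded vowel at a new pitch via LPC: the LPC
    coefficients model the vocal-tract filter, and an impulse train at
    `target_hz` supplies a new glottal-like excitation."""
    a = librosa.lpc(vowel, order=order)          # all-pole filter: 1 / A(z)
    # Impulse train at the target fundamental frequency.
    excitation = np.zeros(len(vowel))
    period = int(round(sr / target_hz))
    excitation[::period] = 1.0
    # Drive the LPC filter with the new excitation and normalize.
    out = scipy.signal.lfilter([1.0], a, excitation)
    return out / (np.max(np.abs(out)) + 1e-9)

# Hypothetical usage: load a recorded 'a' vowel and shift it to A4 (440 Hz).
# vowel, sr = librosa.load("phoneme_a.wav", sr=None, mono=True)
# repitched = lpc_repitch(vowel, sr, target_hz=440.0)
```

An impulse-train excitation only covers voiced sounds; noise-based phonemes like 's' or 'f' would need a noise excitation instead, which is consistent with them being the hard case described above.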
so voice! is the next project in my queue to focus on, so I'm hoping to make some good progress over the coming year. I've definitely gained a lot of new machine learning experience since I last worked on it, so I'm excited to take a fresh look. With any luck, I'll have many new and better examples to add to this page!