The ongoing pandemic has forced a shift towards online and virtual content delivery, be it your Zoom classes, lectures, business meetings, or even technical conferences! There is demand for creating such presentations, and a large portion of the population, including the least tech-savvy person, is now familiar with Google Slides and PowerPoint, as they use them multiple times a week.
We propose a way to save the vast amounts of time and energy that individuals spend making and delivering such presentations: an AI alternative that creates a baseline generic presentation from a research paper, and also delivers it in YOUR voice, or any person's voice. Just give us a voice sample!
Overview
Educational institutions, workplaces, research centers, and the like are all trying to bridge the communication gap during these socially distanced times through online content delivery. The trend now is to create presentations and then deliver them over various virtual meeting platforms. We try to reduce the time spent creating and delivering these presentations with an AI module that uses ML algorithms and Natural Language Processing (NLP) to automate the creation of a slide-based presentation from a document, and then uses state-of-the-art voice cloning models to deliver the content in the desired author's voice.
Let's get into the details!
We consider a structured document, such as a research paper, to be the content that has to be presented. The research paper is first summarized using BERT summarization techniques and condensed into bullet points that go into the slides. The idea at this stage is to modularize the document into segments like Abstract, Introduction, Methodology, etc. Because research papers are usually ordered neatly, this modularization is a simple and useful way to separate the different parts of the document. One thing to note here is that if there are knowledge gaps in the research paper, they will not be covered in the presentation. This could be a drawback, since a human presenter giving an ideal presentation would have bridged those gaps. This is why we suggest treating the output as a generic template, to which further edits can be made.
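To make the segmentation step concrete, here is a minimal sketch, not our actual pipeline. It assumes the paper is available as plain text with each section heading on its own line, and the section names and function names are made up for illustration. The `to_bullets` function is only a placeholder that keeps the first few sentences; it stands in for the BERT summarizer, which is not shown here.

```python
import re

# Assumed section headings; a real paper may use different names.
SECTION_NAMES = ("Abstract", "Introduction", "Methodology", "Results", "Conclusion")


def split_into_sections(paper_text):
    """Group the lines of a paper under the most recent section heading."""
    sections = {}
    current = None
    for line in paper_text.splitlines():
        stripped = line.strip()
        if stripped in SECTION_NAMES:
            current = stripped
            sections[current] = []
        elif current is not None and stripped:
            sections[current].append(stripped)
    return {name: " ".join(lines) for name, lines in sections.items()}


def to_bullets(section_text, max_bullets=3):
    """Placeholder summarizer: keep the first few sentences as bullets.

    This is NOT BERT summarization; it only marks where a summarizer would go.
    """
    sentences = re.split(r"(?<=[.!?])\s+", section_text.strip())
    return [s for s in sentences if s][:max_bullets]


def build_slides(paper_text, max_bullets=3):
    """Produce one slide (title plus bullets) per detected section."""
    sections = split_into_sections(paper_text)
    return [
        {"title": name, "bullets": to_bullets(body, max_bullets)}
        for name, body in sections.items()
    ]


if __name__ == "__main__":
    sample = "Abstract\nWe clone voices. It works well.\nIntroduction\nSlides take time.\n"
    for slide in build_slides(sample):
        print(slide["title"], slide["bullets"])
```

Each slide comes out as a title with a short list of bullets, which is also a natural unit for generating one audio clip per slide.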
Tacotron
Next, we use a Tacotron-inspired architecture for the voice cloning and Text-to-Speech modules. Do give these papers and blogs a read to get a better idea of how the voice cloning happens. Essentially, given a voice sample, an embedding of it is constructed. The embedding is a different representation of the voice, and can have vastly different dimensions and a different feature space compared to the input. The key point is that it maps a voice into a feature space that can be used for further AI synthesis. Properties of the voice such as frequency, timbre, quality, and depth are captured in this embedding. Derived properties such as accent, individual differences in the pronunciation of words, or the individual's "style" of speaking are not captured, unless there is a large amount of data from which the model can learn to build embedding representations around these factors. Once we have a model that can map voice samples to their embeddings, a new user can create an embedding for their own voice. These embeddings can then be used to perform ML-based voice cloning, using the architecture described below.
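A quick way to build intuition for voice embeddings is to compare them. The sketch below is only an illustration: the three-dimensional vectors and speaker names are invented, whereas real embeddings have many more dimensions and come from a trained encoder. Cosine similarity is one common way to measure how close two embeddings are, and clips from the same speaker should score higher than clips from different speakers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings, invented for illustration only.
alice_clip_1 = [0.9, 0.1, 0.3]
alice_clip_2 = [0.8, 0.2, 0.35]
bob_clip = [0.1, 0.9, 0.2]

same_speaker = cosine_similarity(alice_clip_1, alice_clip_2)
different_speaker = cosine_similarity(alice_clip_1, bob_clip)

if __name__ == "__main__":
    print("same speaker:", round(same_speaker, 3))
    print("different speakers:", round(different_speaker, 3))
```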
Voice Cloning
The Voice Cloning module has 3 segments: 1) Encoder, 2) Synthesizer, and 3) Vocoder. If you gave the links mentioned above a read, it will be clear how these 3 segments fit together. But don't worry if you didn't, here is an overview. We mixed and matched the work suggested in these blogs and papers to derive the best encoder, synthesizer, and vocoder for our particular setup.
Firstly, the Encoder is a network trained on a speaker verification task, using an independent dataset of noisy speech without transcripts from thousands of speakers. It generates a fixed-dimensional embedding vector from only seconds of reference speech from a target speaker.
The Synthesizer is the next component: a sequence-to-sequence synthesis network based on Tacotron 2 that generates a mel-spectrogram from text, conditioned on the speaker embedding. Again, it's okay not to go into the details of the mel-spectrogram, so let's look at it like this: you must have seen those viral filters and applications that morph pictures into an 80s theme, map your face onto a celebrity's, or make a funny video of you ridiculously jamming to a popular tune from a static picture of yourself?
Well, in most cases, GANs are used at some step or the other. GANs essentially figure out the styles in these images and map them to another style (which could be another theme, a celebrity face, etc.). The point is, most of the work on GANs has been done with images, mostly because we interpret images better than sounds. The idea of the mel-spectrogram is thus to turn sound into a picture that carries the properties of a voice: time runs along one axis, frequency (on the mel scale) along the other, and the intensity at each point shows how strong that frequency is at that moment. For the cloning aspect, GANs are applied to these sound images and modifications are made, much like those popular filters. The only difference is that it is a sound image!
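To see how clips of different lengths can end up as a vector of the same size, here is a minimal sketch. It is not the trained encoder: it assumes some network has already produced one feature vector per audio frame (the numbers below are invented), and it simply averages those frames and scales the result to unit length. Mean pooling followed by L2 normalization is one common way to get a fixed-dimensional embedding; the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def utterance_embedding(frame_vectors):
    """Average per-frame vectors, then normalize, to get one fixed-size vector."""
    dim = len(frame_vectors[0])
    count = len(frame_vectors)
    mean = [sum(frame[i] for frame in frame_vectors) / count for i in range(dim)]
    return l2_normalize(mean)


# Invented per-frame features for a short clip and a longer clip.
short_clip = [[0.2, 0.7, 0.1], [0.3, 0.6, 0.2]]
long_clip = [[0.2, 0.7, 0.1], [0.3, 0.6, 0.2], [0.25, 0.65, 0.1],
             [0.2, 0.8, 0.15], [0.3, 0.7, 0.1]]

if __name__ == "__main__":
    print(len(utterance_embedding(short_clip)), len(utterance_embedding(long_clip)))
```

Both clips produce a vector with the same number of dimensions, regardless of how many frames went in.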
Thus the synthesizer does the following: it takes in the speaker embedding (which was generated by the encoder) and the text embedding (phoneme sequence), which is the sentence the user wants communicated in the given speaker's voice, and concatenates the two. The network then decodes this combined representation into a mel-spectrogram.
Technically, the voice cloning is complete at this stage, as the mel-spectrogram combines the speaker and text embeddings, which basically means the given sentence is being communicated in the desired person's voice. However, the mel-spectrogram is a time-frequency representation, not a playable signal, and to transform it into a time-domain waveform we need the Vocoder. The output of the vocoder is a time-domain waveform: the audio of the text being conveyed in another individual's voice. A GAN-based Vocoder is used to convey the contents of the slides in the author’s voice (or any customized voice).
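The concatenation step can be sketched in a few lines. This is an illustration under assumptions, not our synthesizer code: it assumes the text has already been encoded into one vector per phoneme, and that the same speaker embedding is appended to every one of those vectors. The sizes, numbers, and the function name `condition_on_speaker` are invented.

```python
def condition_on_speaker(text_states, speaker_embedding):
    """Append the same speaker embedding to every text time step."""
    return [list(state) + list(speaker_embedding) for state in text_states]


# Invented example: 4 phoneme vectors of size 2, speaker embedding of size 3.
text_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
speaker_embedding = [0.9, 0.0, 0.1]

conditioned = condition_on_speaker(text_states, speaker_embedding)

if __name__ == "__main__":
    # Still 4 time steps, but each vector now has 2 + 3 = 5 values.
    print(len(conditioned), len(conditioned[0]))
```

The sequence length (what is being said) is unchanged, while every step now also carries who is saying it. The decoder that turns this into a mel-spectrogram is a trained network and is not shown.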
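The vocoder itself is a trained neural network, so the sketch below does not implement it. It only illustrates the two views of sound involved: a time-domain waveform (a list of samples you could play) and its frequency content (what a spectrogram column shows). It builds a pure tone, finds its dominant frequency with a naive discrete Fourier transform, and converts frequencies to the mel scale with a widely used formula. The sample rate, tone frequency, and function names are arbitrary choices for illustration.

```python
import cmath
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (a commonly used formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def sine_wave(freq_hz, sample_rate, num_samples):
    """Time-domain waveform: one amplitude value per sample."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(num_samples)]


def dft_magnitudes(frame):
    """Naive DFT: strength of each frequency bin up to half the sample rate."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def dominant_frequency(frame, sample_rate):
    """Frequency (in Hz) of the strongest bin in the frame."""
    mags = dft_magnitudes(frame)
    peak_bin = max(range(len(mags)), key=mags.__getitem__)
    return peak_bin * sample_rate / len(frame)


if __name__ == "__main__":
    frame = sine_wave(500, 8000, 256)
    peak_hz = dominant_frequency(frame, 8000)
    print("dominant frequency (Hz):", peak_hz)
    print("same frequency in mels:", hz_to_mel(peak_hz))
```

A mel-spectrogram stacks many such frequency snapshots over time, with the frequency axis warped onto the mel scale; the vocoder's job is to go the other way, from that picture back to samples.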
How does it look finally?
Here's how the web-based product looks in its early stages. We can't share the website link yet as we are still making the final edits, but here's a peek!
This is where you can choose the research paper (file), and also choose the voice you want to deliver the content in. You can either choose one of the existing voices, or record your own for around 15 seconds! The audio files are displayed on the right, one per slide, with the slides segmented according to the paper summaries!
The presentation itself is viewable on a different page, as seen below (yes, we still need to work on the UI a bit :D). We chose a sample paper that we authored in the past, which is also linked in this blog post, and observed how the AI assistant created the presentation. Of course, this is just the most basic template, and further edits can be made to beautify it.
As for the cloned voices, well, here is how they sound! Here are audio clips of me and my teammate who worked on this project (Muvazima Mansoor). For those who know us, you can hear for yourself how well the cloning works. Mind you, we do not have this accent. The cloning mimics only the voice properties, and since the dataset we used contained predominantly voices with an American accent, it has rubbed off on our voice clones. For those of you who do not know us, well, this is how we sound. Here is also a cloned voice clip of Amitabh Bachchan (you HAVE to know him at least), to give you a further idea of how the cloning works!
So, as you can see, we are able to mimic the voice properties quite well. Properties such as depth, quality, and timbre have been replicated to a good extent. Specific properties such as accent and style, however, are not captured. We are still working on a version that can be deployed as a web application (between college and internships, we're trying to find some time to keep this project going).
Huge thanks to Muvazima Mansoor, and to our guide Dr. Ramamoorthy Srinath, for making this project possible. Also, feel free to reach out to collaborate on this project, and let me know if you want me to try out other voice samples!