Creating a truly synthetic voice from scratch, one that closely resembles a human voice, remains just out of reach: most attempts have produced robotic-sounding voices reminiscent of the Daleks or Cybermen from the Doctor Who franchise. This is because human voices exhibit a wide range of natural variations and fluctuations that are very difficult for a machine to reproduce. Every commercially available voice solution on the market is therefore built from samples of real voices.
The voices of Talkens started from a preexisting audio corpus of approximately 50 vocal identities, evenly split between male and female, with each voice contributing about 20 minutes of recorded audio to provide sufficient data. To generate new voices from this dataset, the team behind Talkens developed an AI algorithm that interpolates existing voices, combining characteristics such as pitch, tonality, intensity and frequency into a new voice identity. In the initial stages of the project, pairs of vocal identities were combined to generate new ones; as the results proved fruitful, multiple vocal identities were combined to rapidly grow the total number of voices to 10,000. Voice synthesis has been around for quite some time, but Talkens is the first project to explore it at this scale in the NFT domain: it pushed the technique to its limits by generating 10,000 voices from an initial sample of 50 voice identities, without recording any additional speakers.
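The interpolation idea can be illustrated with a minimal sketch. Everything below is hypothetical: each voice is reduced to a toy feature vector (pitch, tonality, intensity, frequency), whereas a real system would interpolate high-dimensional learned speaker representations. The sketch also shows why 50 identities are enough raw material: pairs alone yield 1,225 combinations, and multi-voice blends with varied weights push the space well past 10,000.

```python
import itertools

def interpolate(voices, weights):
    """Blend several voice feature vectors with the given weights.

    `voices` is a list of equal-length tuples; `weights` must sum to 1.
    """
    assert abs(sum(weights) - 1.0) < 1e-9
    dims = len(voices[0])
    return tuple(sum(w * v[d] for v, w in zip(voices, weights))
                 for d in range(dims))

# Hypothetical toy identities: (pitch Hz, tonality, intensity, frequency Hz).
voice_a = (180.0, 0.6, 0.7, 210.0)
voice_b = (110.0, 0.4, 0.8, 130.0)

# Pairwise blend (the project's initial stage): midpoint of two identities.
new_voice = interpolate([voice_a, voice_b], [0.5, 0.5])

# Pairs alone already give C(50, 2) = 1225 candidate voices; combining
# three or more identities with varied weights grows the space far faster.
pairs = sum(1 for _ in itertools.combinations(range(50), 2))
```

Varying the weights (e.g. 0.7/0.3 instead of 0.5/0.5) yields yet more distinct identities from the same pair, which is how a modest corpus can fan out into thousands of voices.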
A major innovation on the voice synthesis side is that Talkens achieves pure voice synthesis. Popular voice assistants such as Alexa or Siri have historically relied on concatenative speech synthesis, in which each line spoken by the assistant is assembled from prerecorded fragments; such systems cannot generate speech that has not been prerecorded. Pure voice synthesis solutions like Talkens instead leverage deep neural networks to generate speech without requiring any prerecording.
The biggest advantage of modern machine learning and deep learning techniques is that, given a sufficiently expressive model and enough quality data, a neural network can, in theory at least, learn anything. In deep learning there is the concept of fine-tuning, which allows a broadly trained neural network to be specialised for a specific domain. The easiest way to explain the concept is with an image example. Say you want to train a neural network to detect pictures of cats. To do this, you feed a vast quantity of data to the network. That data does not necessarily have to be annotated, i.e. labelled data saying "this is a cat" and so on. Neural networks that learn without labelled data perform self-supervised learning, meaning they discover patterns and draw conclusions on their own. Once the network has learned to detect cats, you can train it on a smaller labelled dataset to distinguish particular cat breeds, such as Angora or British Shorthair. This is fine-tuning in a nutshell, and its main advantage is that it requires a far smaller dataset than training from scratch.
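The mechanics of fine-tuning can be sketched in a few lines. This is an illustrative toy, not the Talkens implementation: a frozen "base" feature extractor stands in for the large pretrained network, and only a small task-specific head is updated on a tiny labelled dataset via gradient descent.

```python
def base_features(x):
    # Frozen pretrained part: its behaviour is fixed and never updated,
    # standing in for the broadly trained base network.
    return [x, x * x]

def head(features, w):
    # Small task-specific head: only these weights are fine-tuned.
    return sum(wi * fi for wi, fi in zip(w, features))

def fine_tune(data, w, lr=0.05, steps=2000):
    """Gradient descent on the head weights only (squared-error loss)."""
    for _ in range(steps):
        for x, y in data:
            f = base_features(x)
            err = head(f, w) - y
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
    return w

# Tiny labelled set: targets follow y = 2*x + 1*x**2, so the head
# should converge to weights close to [2, 1].
data = [(0.5, 1.25), (1.0, 3.0), (1.5, 5.25)]
w = fine_tune(data, [0.0, 0.0])
```

Because the base stays frozen and only two head weights are learned, three labelled examples suffice here; the same economy is what makes fine-tuning attractive when labelled data is scarce.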
A similar process was applied to the Talkens neural network. Instead of creating a separate neural network for each of the 10,000 voices, a single neural network is trained once and then fine-tuned for each individual vocal identity.
Natural Language Processing (NLP) techniques also play an important part in the Talkens project. NLP converts the text input given by a user into phonemes, which the neural network then converts into speech.