How a group of scientists created artificially intelligent voices

This is the story of Talkens, the first AI experiment that explores how innovative technologies like machine learning, deep learning and artificial intelligence techniques can resuscitate non-fungible tokens (NFTs), bringing them to life by giving them a unique synthetic voice that expresses your ideas, your story in the digital space. The up-and-coming web3 revolution will forever change how we interact with each other in the digital space. Talkens.ai makes things easier by giving NFTs a voice using Humans’ AI technology.

The idea behind the voices

An interesting fact about humans is that, in general, we tend to take things for granted, quickly becoming unaware of our potential and the invaluable contribution that we can bring to the world. With a total population just shy of 8 billion people, you would be inclined to think that uniqueness is only a numbers game, a matter of statistics, but in fact, there are defining characteristics that differentiate us and make us unique. Our voice identifies us as uniquely as our looks or fingerprints. Even if two people sound alike, no two voices are exactly the same, because anatomical differences and where we grew up both shape how our voices sound.

Recent advances like the advent of the web3 era and the Metaverse are signaling a shift in the digital paradigm. People will become more connected with each other, eventually leading a double life, one grounded in the physical world, the other one down the digital rabbit hole where reality is defined by your imagination and how you wish to shape it. The team of scientists behind the artificially intelligent voices asked themselves how people will communicate in this new digital and interconnected medium. With no clear answer, the team started to explore several ways for people to express their creativity.

As speech is one of the main means for us to communicate, it’s only natural that the primary focus of the Talkens.ai experiment was to empower people with a unique synthetic voice that can act as an extension of our real voice in the digital space.

NFTs and blockchain, the technology behind this type of digital asset, proved to be the ideal medium to contain these voices as well as a suitable testing ground for the technology employed by Talkens. As the idea started to germinate and take root in the collective consciousness of the team, the challenge became clear: how to create a reasonable number of synthetic voices, with 10 000 chosen as an arbitrary target, and how to enable people to use them.

NFTs have become one of the most popular digital collectables in the world, having one of the largest tight-knit communities. Talkens.ai decided to experiment with the concept of NFTs and AI to create a new asset class that has a unique voice attached to it which can be used by its owner to create new content and to communicate in the digital sphere.

The scientists behind the technology

Before delving into how things were done and how the technology was used to create the voices of Talkens, let’s focus for a brief moment on the team that put in the elbow grease and made everything happen.

At its core, Talkens is an AI experiment launched on the Humans Scale launchpad. The project uses technology that requires a multifaceted team of experts, each member specializing in one particular branch such as voice synthesis, natural language processing (NLP), computer vision or machine learning. This is one of the factors that make Talkens special: it incorporates multiple innovative fields of artificial intelligence to produce a unique set of synthetic voices.

The team is multinational at heart, with members from countries like Italy, Switzerland, China and Romania. From this point of view, the dynamic of the Humans AI (the technology provider) research team wasn’t any different from that of a proper research lab. Having multinational members, each an expert in their own field, streamlined the development process because members complemented each other, but it also posed a challenge at times, as it sometimes proved difficult to find a common language to communicate.

Of course, the global pandemic didn’t make things easy for the research team as the office space was traded for the living room. Luckily, collaboration tools and online meetings helped the team synchronize and maintain a coherent workflow.

From a workflow perspective, the Talkens team was divided into two branches: researchers, who outlined the artificial intelligence modules, and software developers, who received those modules and were responsible for integrating the AI models into software applications directed toward the end-user.

Each week the team coordinated to deliver small tasks, meet KPIs and incrementally improve the quality of the synthetic voices, along with other factors like the naturalness of the generated voices and their classification as female or male voices, voices with a thicker or thinner tonality and so on. Towards the end of the research lifecycle, the models that resulted from the research work were delivered as containers, packaged together with the executable code, to the software development team that got them ready for the final customers.

The innovation behind the project

One of the first items that needed to be addressed by the Humans’ AI research team, the technology provider for the Talkens.ai NFT project, was the identification of a machine learning model suitable for the project, more precisely a toolkit for speech synthesis capable of generating speech from multiple speakers according to an embedding, a feature vector specific to a particular speaker.

Generate speech from multiple English speakers

A thorough scan of the market revealed that few models were adequate for the task. After a model was chosen and adjusted to fit the necessary specifications, the team of scientists started to train the system with a dataset of native English speakers. The dataset that yielded the best results was publicly available for research purposes as well as commercial activities. Using this first set of tools, the team obtained a first solution for generating speech in English. This marked an important milestone: the first iteration of the Talkens solution could generate speech from multiple real speakers, from a few dozen to a few hundred, with a naturalness as close as possible to a real human voice.

Another very important aspect, and a problem that occurred frequently during the development stages, was preserving the identity of the new speakers, which the team finally managed to do. Whenever you generate a voice with the same feature vector, you want the identity of the voice to stay the same, so that the generated voice always sounds the same.

New vocal identities

Moving forward, the team tried to figure out how to mix different voices and how to combine specific feature vectors of real voices to generate new feature vectors that would act as the foundation for new vocal identities. In short, new synthetic voices.

This was an important turning point in the development of Talkens. The team tried various ways of combining vectors, such as arithmetic averages, linear combinations and clustering algorithms based on the positioning of these vectors in the latent space of speech characteristics. It finally reached an optimal solution that combines several feature vectors to get a new vector for a new voice. The process of combining multiple feature vectors is also known as interpolation.
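To make the idea of combining feature vectors concrete, here is a minimal sketch in Python. It is not the Talkens implementation: the function name combine_voices, the three-dimensional vectors and the weights are all made up for illustration, and real speaker embeddings have far more dimensions. With no weights the function computes an arithmetic average; with weights it computes a linear combination normalized so that the weights sum to 1.

```python
from typing import List, Optional, Sequence


def combine_voices(
    vectors: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Combine speaker feature vectors into one new vector.

    Hypothetical helper: without weights this is the arithmetic average,
    with weights it is a linear combination normalized to sum to 1.
    """
    if not vectors:
        raise ValueError("need at least one feature vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all feature vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(vectors)
    if len(weights) != len(vectors):
        raise ValueError("need exactly one weight per vector")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return [
        sum(w * v[i] for w, v in zip(weights, vectors)) / total
        for i in range(dim)
    ]


# Toy 3-dimensional "voices" (illustrative values only).
voice_a = [0.0, 2.0, 4.0]
voice_b = [2.0, 0.0, 0.0]
voice_c = [4.0, 4.0, 2.0]

average = combine_voices([voice_a, voice_b, voice_c])
weighted = combine_voices([voice_a, voice_b], weights=[3.0, 1.0])
print(average)   # [2.0, 2.0, 2.0]
print(weighted)  # [0.5, 1.5, 3.0]
```

In this toy setup, the weighted result sits closer to voice_a because voice_a carries three times the weight of voice_b.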

Gender-neutral, feminine, and masculine voices

Clustering helped the team make the interpolation process more intelligent. For example, if you want to interpolate a high-pitched female voice with another high-pitched female voice, you will obtain another vocal identity that is also a high-pitched female voice. Following the same logic, you can also interpolate a lower-pitched male voice with a higher-pitched female voice, to get a voice that sounds neither feminine nor masculine.

In that case, the resulting voice will sound gender-neutral, being neither low nor high pitched. This clustering based on the tone of the voice enabled the Humans.ai researchers to determine which voices are very low pitched, which voices are very high pitched, which voices are masculine, and which voices are feminine so that when they performed various combinations, they knew what feature vectors to choose to obtain the desired result.
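As a rough illustration of how grouping by tone can guide the choice of voices to combine, the sketch below runs a tiny two-group clustering on a single number per voice, its average pitch in hertz. This is a simplification made for the example: the article does not say which clustering algorithm or which features the researchers used, and the voice names and pitch values are invented.

```python
from typing import Dict


def cluster_by_pitch(pitches_hz: Dict[str, float], iterations: int = 20) -> Dict[str, str]:
    """Label each voice 'low' or 'high' using 1-D two-means clustering.

    Hypothetical helper: centers start at the lowest and highest pitch and
    are moved to the mean of their group on every iteration.
    """
    if not pitches_hz:
        raise ValueError("need at least one voice")
    values = list(pitches_hz.values())
    low_center, high_center = min(values), max(values)
    for _ in range(iterations):
        low = [p for p in values if abs(p - low_center) <= abs(p - high_center)]
        high = [p for p in values if abs(p - low_center) > abs(p - high_center)]
        low_center = sum(low) / len(low)  # never empty: the minimum is always here
        if high:
            high_center = sum(high) / len(high)
    return {
        name: "low" if abs(p - low_center) <= abs(p - high_center) else "high"
        for name, p in pitches_hz.items()
    }


# Invented average pitches, in Hz, for four source voices.
labels = cluster_by_pitch(
    {"voice_1": 110.0, "voice_2": 120.0, "voice_3": 210.0, "voice_4": 220.0}
)
print(labels)
# {'voice_1': 'low', 'voice_2': 'low', 'voice_3': 'high', 'voice_4': 'high'}
```

With labels like these, combining two voices from the same group would be expected to stay in that pitch range, while combining one voice from each group aims for the middle.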

All these research processes were segmented into several stages. Initially, the research team delivered the first 1000 new voices to the development and production team, who, in turn, initiated a feedback loop regarding the voice quality and necessary improvements. After that, the next 1000 voices followed, and the process repeated itself until a total of 10 000 unique synthetic voices had been created.

The technologies assembled

In general, the Talkens.ai NFT project relied on existing tools and technologies to achieve results that are not usually described in existing scientific research papers. There are few documented attempts to develop new synthetic voices at scale, each with a unique set of defining characteristics like pitch, tonality, frequency, pace and gender. This is an innovation in itself.

Primarily, researchers in the speech synthesis field are trying to uncover how they can create speech synthesis systems that can generate speech from multiple speakers, and of course, how to preserve the identity of the generated voices. An interesting and innovative aspect of the Talkens.ai project was the idea of generating synthetic voices for which the team of researchers didn’t have audio recordings to train a machine learning model, because the respective vocal identities did not exist.

This can be considered a big step forward, as the AI researchers were able to demonstrate that by interpolating a small number of existing voices (the dataset was composed of 50 vocal identities), it is possible to generate a large number of new voice identities, each with its own set of defining properties. It’s one thing to set out to generate a dozen new voices, and a whole other undertaking to generate 10 000 unique synthetic voices.
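One way to see why 50 source voices can be enough is to count how many distinct groups of voices could be combined. The short calculation below is only a back-of-the-envelope sketch: it counts subsets, says nothing about which combinations actually sound good or distinct, and ignores the fact that each group can be mixed with many different weights.

```python
from math import comb

base_voices = 50  # size of the source dataset mentioned in the article

pairs = comb(base_voices, 2)       # groups of two voices
triples = comb(base_voices, 3)     # groups of three voices
quadruples = comb(base_voices, 4)  # groups of four voices

print(pairs)       # 1225
print(triples)     # 19600
print(quadruples)  # 230300
```

Pairs alone fall short of 10 000, but groups of three already exceed it, before any variation in weights, pitch or pace is taken into account.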

Another innovative aspect of the Talkens project is the fact that it does not rely on concatenative synthesis, a technology that was widely embraced before the deep learning era. Concatenative synthesis, which was all the rage 5–10 years ago, used much larger sets of speech data, as much as tens of hours from a single speaker, from which segments of speech were extracted in which the person utters various vowels, consonants and clusters of two, three or four letters. To synthesize a new message, the idea was to concatenate, to link together, small pre-recorded segments of real speech. This type of technique is no longer practical if you want to create new voices due to an acute lack of data, as most data sets are composed of recordings of 15 to 20 minutes.

By leveraging complex AI neural networks and Natural Language Processing (NLP) techniques, Talkens AI can generate speech without requiring tens of hours of pre-recording from a single speaker.

For the interpolation process, the team started with a relatively standard toolkit. In the initial stages of the project, interpolation was performed between two voices, a process in which the team modified the interpolation coefficient to make the resulting voice resemble one of the two starter voices. So, in the beginning, there wasn’t an equal distribution of characteristics, a mean between the two voices. As such, if voices A and B were used to generate a new voice, the resulting voice would resemble more closely either voice A or B. This changed once 3 or 4 voices were combined at the same time to generate a new voice.
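The effect of the interpolation coefficient on two voices can be sketched in a few lines of Python. The names interpolate, distance and alpha are chosen for this example and the two-dimensional vectors are toy values; the article does not specify the exact formula used, so the standard linear interpolation below is an assumption.

```python
import math
from typing import List, Sequence


def interpolate(voice_a: Sequence[float], voice_b: Sequence[float], alpha: float) -> List[float]:
    """Linear interpolation: alpha = 0 gives voice_a, alpha = 1 gives voice_b."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if len(voice_a) != len(voice_b):
        raise ValueError("both vectors must have the same length")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(voice_a, voice_b)]


def distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


voice_a = [0.0, 0.0]
voice_b = [4.0, 8.0]

leaning_a = interpolate(voice_a, voice_b, 0.25)
midpoint = interpolate(voice_a, voice_b, 0.5)
print(leaning_a)  # [1.0, 2.0]
print(midpoint)   # [2.0, 4.0]

# A coefficient below 0.5 keeps the result closer to voice A than to voice B.
print(distance(leaning_a, voice_a) < distance(leaning_a, voice_b))  # True
```

A coefficient of 0.5 would give the even mix of the two voices, while values nearer 0 or 1 produce a voice that resembles one starter voice more than the other.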

Another direction the team explored was generating new voices from the voices created through interpolation by modifying their pitch. Pitch corresponds to the fundamental frequency of the voice or, in everyday words, how thick or how thin the voice is. By manipulating the pitch, the researchers could control how thin or how thick a voice sounded. The speed of pronunciation, the so-called pace, was also altered to generate voices that speak faster or slower. Overall, these were the axes of variation used to produce new voices.
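The arithmetic behind these two axes of variation can be sketched as follows. The article does not say in which units pitch and pace were adjusted, so measuring the pitch change in semitones and the pace as a speed factor are assumptions made for this example, as are the function names.

```python
def shift_pitch(f0_hz: float, semitones: float) -> float:
    """Return the fundamental frequency after a shift in semitones.

    Each semitone multiplies the frequency by 2 ** (1 / 12), so twelve
    semitones (one octave) double it.
    """
    return f0_hz * 2.0 ** (semitones / 12.0)


def change_pace(duration_s: float, speed: float) -> float:
    """Return the new duration when speech is played at `speed` times the pace."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return duration_s / speed


print(shift_pitch(220.0, 12))   # 440.0, one octave up: a thinner voice
print(shift_pitch(220.0, -12))  # 110.0, one octave down: a thicker voice
print(change_pace(3.0, 1.5))    # 2.0, the same sentence spoken 1.5x faster
```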

AI, deep learning and machine learning also played an important role in every facet of the development process. The Humans’ AI solution works by transforming a text input into an audio message, a process which involves multiple deep learning models. A first deep learning model takes a text input and generates Mel spectrograms that indicate the frequencies of the voice at various time frames, a visual representation of speech. An x-ray of the sound, so to speak.
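For readers wondering what "Mel" refers to: the frequency axis of a Mel spectrogram follows the mel scale, which spaces frequencies roughly the way human hearing perceives them. The sketch below uses one common conversion formula between hertz and mels. Several variants of this formula exist, and the article does not say which one the Talkens models rely on, so treat it purely as an illustration.

```python
import math


def hz_to_mel(frequency_hz: float) -> float:
    """Convert a frequency in Hz to mels (one widely used formula)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal steps in Hz become smaller steps in mels as the frequency rises.
for hz in (500.0, 1000.0, 2000.0, 4000.0):
    print(hz, round(hz_to_mel(hz)))
```

The scale is close to linear at low frequencies and compresses high frequencies, which is why a Mel spectrogram devotes more of its vertical space to the range where most speech detail lives.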

A second deep learning model, called the Vocoder, takes the Mel spectrogram and converts it into actual audio, a WAV file, which users can listen to. As previously mentioned, Mel spectrograms are a visual representation of sounds. A researcher specializing in speech synthesis can look at a Mel spectrogram and deduce where the vowels are, where the consonants are, where the pauses in speech are, and so on. But to listen to the audio message, the Mel spectrogram needs to pass through the Vocoder model.

Both AI models have been trained and adapted to fit the requirements of the Talkens project. The first module, the one that transforms the text into Mel spectrograms, also receives the feature vector of voice-specific parameters. This means that the model is trained with voice parameters, so the spectrograms it outputs correspond to both the text input and the chosen voice.
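The two-stage design described above can be summarized as a small pipeline skeleton. Everything in this sketch is hypothetical: the type names, the text_to_speech function and especially the two dummy models, which are placeholders that produce silence so the example can run. They only show how data flows from text and feature vector to Mel spectrogram to audio samples.

```python
from typing import Callable, List

Embedding = List[float]             # feature vector of one voice
MelSpectrogram = List[List[float]]  # time frames x frequency bins
Waveform = List[float]              # audio samples

AcousticModel = Callable[[str, Embedding], MelSpectrogram]
Vocoder = Callable[[MelSpectrogram], Waveform]


def text_to_speech(
    text: str,
    speaker: Embedding,
    acoustic_model: AcousticModel,
    vocoder: Vocoder,
) -> Waveform:
    """Stage 1 turns text plus voice vector into a Mel spectrogram;
    stage 2 turns the Mel spectrogram into audio samples."""
    mel = acoustic_model(text, speaker)
    return vocoder(mel)


def dummy_acoustic_model(text: str, speaker: Embedding) -> MelSpectrogram:
    # Placeholder: one frame per character, each frame a copy of the voice vector.
    return [list(speaker) for _ in text]


def dummy_vocoder(mel: MelSpectrogram) -> Waveform:
    # Placeholder: two silent samples per frame instead of real audio.
    return [0.0 for _frame in mel for _ in range(2)]


audio = text_to_speech("Hello", [0.1, 0.2], dummy_acoustic_model, dummy_vocoder)
print(len(audio))  # 10
```

Passing the models in as arguments keeps the two stages independent, which mirrors the description of two separately trained models.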

Breakthrough through challenges

Every project comes with its inherent set of challenges that pop up along the way. In the case of Talkens, one of the first challenges that needed to be overcome by the AI researchers was the discovery of a suitable dataset of voices that could act as a good starting point. This was mostly a trial-and-error endeavour until the team stumbled upon a dataset that was suitable for the interpolation process.

Another point in the early development process that required more fine-tuning was how to set the interpolation coefficient so that the combination generates a new voice identity that isn’t too similar to the initial voices.

The biggest challenge by far was to make sure that the generated voices would retain their identity regardless of the text input. For example, a voice identity used to generate a very short piece of audio like “Hello World” would sound different when the generated audio was longer. This was certainly an issue that needed to be solved.

In fact, this issue required a bit of out-of-the-box thinking and clever use of the technology. Returning to the “Hello World” example, the team repeated short texts four or five times so that the input text would be longer. This way, the system generated a longer audio sequence, “Hello World” repeated five times, with the same voice identity. After ensuring that the voice identity was preserved, the extra audio was cut and discarded.
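The repetition trick can be sketched as a thin wrapper around a synthesis function. The names synthesize_short_text and fake_synthesize are invented for this example, and fake_synthesize is a stand-in that returns silence rather than speech. The article does not explain how the cut point was found, so the sketch assumes, for simplicity, that every repetition takes the same amount of time and keeps the first share of the samples.

```python
from typing import Callable, List


def synthesize_short_text(
    text: str,
    synthesize: Callable[[str], List[float]],
    repeats: int = 5,
) -> List[float]:
    """Synthesize a short text by repeating it, then keep only one repetition."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    padded_text = " ".join([text] * repeats)  # e.g. "Hello World" five times
    audio = synthesize(padded_text)
    # Assumption: each repetition has equal length, so keep the first 1/repeats.
    keep = len(audio) // repeats
    return audio[:keep]


def fake_synthesize(text: str) -> List[float]:
    # Placeholder model: 100 silent samples per word.
    return [0.0] * (len(text.split()) * 100)


clip = synthesize_short_text("Hello World", fake_synthesize)
print(len(clip))  # 200
```

In a real system the repetitions would not have exactly equal durations, so the cut point would more likely be chosen from pauses or alignment information than from a fixed fraction.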