VALL-E can imitate a person's tone and manner of speech after listening to just three seconds of their voice. Although the output still sounds slightly electronic, the result is impressive: the speech synthesis model preserves the speaker's emotional tone and even the acoustic environment of the recording.
Microsoft calls its system a “neural codec language model”. VALL-E is built on EnCodec, a machine-learning audio codec developed by Meta in 2022.
Unlike most text-to-speech methods, which typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. In effect, it analyzes how a person sounds, breaks that information into discrete components called “tokens” using EnCodec, and then uses its training data to predict what it “knows” about how that voice would sound speaking phrases beyond the three-second sample.
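To make the idea of audio “tokens” concrete, here is a toy sketch of residual vector quantization (RVQ), the general mechanism codecs like EnCodec use to turn audio frames into discrete indices. The codebooks, frame values, and function names below are invented for illustration and do not reflect EnCodec's actual API or dimensions; real codecs quantize learned vector embeddings, not single scalars.

```python
# Toy illustration of residual vector quantization (RVQ), the idea
# behind discrete audio tokens. Codebooks and frame values are made up.

def nearest(codebook, value):
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))

def rvq_encode(frame, codebooks):
    """Quantize one scalar 'frame' with a stack of codebooks.

    Each stage encodes the residual error left over by the previous
    stage, so the frame becomes a short sequence of token indices.
    """
    tokens, residual = [], frame
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual -= codebook[idx]  # next stage refines this error
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruct the frame by summing the selected entries."""
    return sum(cb[i] for cb, i in zip(codebooks, tokens))

# Two tiny codebooks: coarse values first, finer corrections second.
codebooks = [[-1.0, 0.0, 1.0], [-0.25, 0.0, 0.25]]

frame = 0.8
tokens = rvq_encode(frame, codebooks)   # -> [2, 0]
approx = rvq_decode(tokens, codebooks)  # -> 0.75
print(tokens, approx)
```

Once audio is expressed as sequences of such indices, VALL-E's language model can predict those token sequences from text and the three-second prompt, rather than predicting raw waveform samples.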
The paper describing the technology includes comparative audio samples, arranged in four columns:
- Speaker Prompt: the original voice recording, limited to just three seconds.
- Ground Truth: the same speaker saying the full target phrase.
- Baseline: the phrase produced by a conventional speech synthesizer.
- VALL-E: the phrase as generated by the VALL-E neural network.
VALL-E was trained on LibriLight, a corpus containing 60,000 hours of English speech from more than 7,000 speakers. The developers suggest the technology could be used for high-quality text-to-speech applications, for speech editing where a person's recorded words can be changed, for creating audio content, and more.