Largest text-to-speech AI model to date shows 'emergent capabilities'

Researchers at Amazon have trained the largest text-to-speech model yet, which they claim exhibits "emergent" qualities that improve its ability to speak even complex sentences naturally. The technology could be the breakthrough text-to-speech needs to escape the uncanny valley.

These models were always going to grow and improve, but the researchers specifically hoped to see the kind of leap in capability that appeared once language models grew beyond a certain size. For reasons we don't fully understand, once LLMs pass a certain point they become markedly more robust and versatile, able to perform tasks they were not trained for.

This doesn't mean they're gaining sentience or anything – just that past a certain point, performance on certain conversational AI tasks improves like a hockey stick. The team at Amazon AGI – it's no secret what they're aiming for – thought the same might happen as text-to-speech models grow, and their research suggests that is indeed the case.

The new model is called Big Adaptive Streamable TTS with Emergent abilities, which they shortened to the abbreviation BASE TTS. The largest version of the model uses 100,000 hours of public domain speech, 90% of which is in English, with the remainder in German, Dutch and Spanish.

At 980 million parameters, BASE-large appears to be the largest model in this category. They also trained 400M- and 150M-parameter models on 10,000 and 1,000 hours of audio, respectively, for comparison – the idea being that if one of these models shows emergent behaviors and a smaller one doesn't, you have a range in which those behaviors begin to emerge.

As it turns out, the medium-sized model showed the jump in capability the team was looking for – not necessarily in ordinary speech quality (it is reviewed better, but only by a couple of points) but in the set of emergent abilities they observed and measured. Here are examples of tricky text mentioned in the paper:

  • Compound nouns: The Beckhams decide to rent a charming stone-built quaint rural holiday cottage.
  • Emotions: “Oh, my God! Are we really going to the Maldives? This is incredible!” Jenny shouted, bouncing on her toes with uncontained glee.
  • Foreign words: “Mr. Henry, famous for his mise en place, arranged a seven-course meal, each dish being a pièce de résistance.”
  • Paralinguistics (i.e. readable non-words): “Shh, Lucy, shh, we shouldn’t wake your little brother,” Tom whispered as they walked past the nursery.
  • Punctuation: She received a strange message from her brother: ‘Emergency @ home; call ASAP! Mom and Dad are worried…#familymatters.’
  • Questions: But the Brexit question still remains: after all the trials and tribulations, will ministers be able to find answers in time?
  • Syntactic complexities: De Moya’s 2022 film, recently awarded a Lifetime Achievement Award, was a box-office hit despite mixed reviews.

“These sentences are designed to involve challenging tasks – parsing garden-path sentences, putting phrasal stress on long-winded compound nouns, producing emotional or whispered speech, or producing the correct sounds for foreign words like ‘qi’ or punctuation marks such as ‘@’ – none of which BASE TTS has been explicitly trained to perform,” the authors write.

Such features typically trip up text-to-speech engines, which will mispronounce words, skip them, use strange intonation or make some other mistake. BASE TTS still had its troubles, but it performed far better than its contemporaries – models like Tortoise and VALL-E.

There are plenty of examples of these tricky texts being spoken quite naturally by the new model on the site the researchers created for it. Of course these were chosen by the researchers, so they're necessarily cherry-picked, but it's impressive nonetheless.


Because the three BASE TTS models share an architecture, it seems clear that the model's size and the extent of its training data are what account for its ability to handle some of the complexities above. Keep in mind that this is still an experimental model and process – not a commercial model or anything headed for production. Subsequent research will need to identify the inflection point for emergent abilities and how to train and deploy the resulting models efficiently.

Notably, the model is "streamable," as the name suggests – meaning it doesn't need to generate whole sentences at once but can go moment to moment at a relatively low bitrate. The team has also tried to package speech metadata such as emotionality, prosody and so on into a separate, low-bandwidth stream that could accompany the vanilla audio.
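As a rough illustration of that streaming setup, here is a minimal sketch (in Python) of what consuming such an interface could look like. It is not the BASE TTS API; every name in it – stream_tts, synthesize_chunk, MetadataFrame – is a hypothetical stand-in, meant only to show audio chunks arriving incrementally alongside a low-bandwidth metadata side channel.

    # Hypothetical sketch only: none of these names come from the BASE TTS
    # paper or any real Amazon API. It models two parallel outputs: small
    # audio chunks plus a low-bandwidth metadata frame for each chunk.
    from dataclasses import dataclass
    from typing import Iterator, Tuple


    @dataclass
    class MetadataFrame:
        """Side-channel information accompanying one audio chunk."""
        emotion: str       # e.g. "whispered", "excited"
        prosody_hint: str  # e.g. "rising question intonation"


    def synthesize_chunk(text_piece: str) -> bytes:
        """Stand-in for a model call returning a short slice of encoded audio."""
        return text_piece.encode("utf-8")  # placeholder bytes, not real audio


    def stream_tts(text: str) -> Iterator[Tuple[bytes, MetadataFrame]]:
        """Yield (audio_chunk, metadata) pairs as they are produced, so playback
        can begin before the whole sentence has been synthesized."""
        for piece in text.split():  # crude stand-in for the model's real chunking
            audio = synthesize_chunk(piece)
            meta = MetadataFrame(emotion="neutral", prosody_hint="flat")
            yield audio, meta


    if __name__ == "__main__":
        # Consume the stream incrementally instead of waiting for a full waveform.
        for chunk, meta in stream_tts("Shh, Lucy, shh, we shouldn't wake your little brother"):
            print(len(chunk), "bytes of audio;", meta.emotion, "/", meta.prosody_hint)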

It looks like text-to-speech models may have a breakout moment in 2024 – just in time for the elections! But the usefulness of this technology, particularly for accessibility, can't be denied. The team notes that it declined to publish the model's source and other data due to the risk of bad actors taking advantage of it, though the cat will be out of the bag eventually.
