AI voice startup launches marketplace of cloned voices

AI speech startup ElevenLabs reaches unicorn status on multilingual tech

ElevenLabs, an AI speech company founded by former Google and Palantir employees, has achieved unicorn status (a valuation of $1 billion or more) just two years after its founding. With the announcement of an $80 million raise, the company’s valuation rose to $1.1 billion, a ten-fold increase.

The round was co-led by existing investors Andreessen Horowitz (a16z), former GitHub CEO Nat Friedman, and former Apple AI leader Daniel Gross, with Sequoia Capital and SV Angel also taking part.

According to this article, ElevenLabs, a company that specializes in machine learning for multilingual voice synthesis and cloning, stated that it will use the funds to expand its product line and further its research. It also revealed several new features, including a tool for dubbing full-length movies and a new online marketplace where users can sell their voice clones.

Universally accessible content

In a world where languages and dialects vary by region, it is impossible to localize content for everyone. Traditionally, the strategy has been to concentrate on English or another mainstream language and hire dubbing artists for selected markets with growth potential. The artists record the material in the target language, which makes distribution possible. The problem is that these manual dubs don’t come close to the source material. Even then, scaling content for widespread distribution is impractical, especially with a small production crew.

Piotr Dabkowski, a former Google machine learning engineer, and Mati Staniszewski, an ex-Palantir deployment strategist, are both from Poland. They initially noticed this issue when watching movies with bad dubbing. They were motivated by this challenge to start ElevenLabs, a company whose goal is to use artificial intelligence to make all content globally accessible in any language and voice.

Since its launch in 2022, ElevenLabs has expanded gradually. It first gained attention with a text-to-speech technology that produced natural-sounding English voices. Later, the technology was updated to support synthesis in more languages, including Hindi, Polish, German, Spanish, French, Italian, and Portuguese.

In addition, the company created a Voice Lab where customers could use the synthesis tool to clone their own voices or create completely new synthetic voices by randomly sampling vocal parameters. This let them turn any text, such as a podcast script, into audio files in the voice and language of their choice.

“ElevenLabs’ technology combines context awareness and high compression to deliver ultra-realistic speech. Rather than generate sentences one by one, the company’s proprietary model is built to understand word relationships and adjust delivery based on the wider context. It also has no hardcoded features, meaning it can dynamically predict thousands of voice characteristics while generating speech,” Staniszewski said.

AI Dubbing

After putting its products through beta testing, ElevenLabs attracted over a million users in a short period of time. The company then built on its AI voice research by introducing AI Dubbing, a speech-to-speech translation tool that lets users translate audio and video into 29 other languages while keeping the original speaker’s voice and emotions. It now counts 41% of the Fortune 500 among its customers, along with notable content publishers such as Storytel, The Washington Post, and TheSoul Publishing.

“We are constantly entering into new B2B partnerships, with over 100 established to date. AI voices have wide applicability, from enabling creators to enhance audience experiences to broadening access to education and providing innovative solutions in publishing, entertainment, and accessibility,” Staniszewski noted.

As its user base grows, ElevenLabs is now focusing on the product side to give users the best possible set of features to work with. This is where the new Dubbing Studio workflow comes in.

The workflow builds on the AI Dubbing product and gives professional users specialized tools to create and edit transcripts, translations, and timecodes, in addition to dubbing full movies in their preferred language. This gives them more direct control over the production process. Like AI Dubbing, it supports 29 languages, but it lacks lip-syncing, a crucial component of content localization.

This means that if a movie is localized with the tool, only the audio is dubbed into the desired language, while the lip movements in the video stay the same. Staniszewski plans to offer this functionality in the future, but he acknowledged that the company is currently laser-focused on providing the best audio experience.

However, lip-syncing technology has already been developed by HeyGen, which offers good audio translation that keeps the original speaker’s voice, along with mouth replacement that syncs the lips with the translated audio.

Marketplace to sell AI voices

Alongside the Dubbing Studio, ElevenLabs is unveiling an accessibility tool that turns text or URLs into audio, as well as a Voice Library, a marketplace where users can monetize their AI-cloned voices. The company lets users set the payment terms and availability of their AI-generated voice but warns that sharing it requires several steps and multiple levels of verification. Users gain access to a wider variety of voice models, and the creators of those models get a chance to make money.

“Before sharing a voice, users must pass a voice captcha verification by reading a text prompt within a specific timeframe to confirm their voice matches the training samples. This, along with our team’s moderation and manual approval, ensures authentic, user-verified voices can be shared and monetized,” the founder and CEO said.

With the broad release of these functionalities, ElevenLabs wants to attract more customers from different sectors. With this funding, the company has raised $101 million in total. It intends to use the money to expand its research on AI voice, build out its infrastructure, and create new vertically-specific products. At the same time, it will be putting robust safety controls in place, such as a classifier that can recognize AI audio.

“Over the next years, we aim to build our position as the global leader in voice AI research and product deployment. We also plan to develop increasingly advanced tools tailored to professional users and use cases,” Staniszewski said.

MURF.AI, Play.ht, and WellSaid Labs are other companies working on AI voice and speech generation. According to Market US, the global market for these products was valued at $1.2 billion in 2022 and is projected to grow at a compound annual growth rate (CAGR) of about 15.4% to reach nearly $5 billion in 2032.
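
As a rough check on that projection, the compounding arithmetic can be sketched in a few lines of Python. The function names are illustrative, and the starting value, growth rate, and ten-year horizon are simply the figures quoted above; this is a back-of-the-envelope calculation, not Market US's methodology.

```python
def project_value(start_value: float, annual_rate: float, years: int) -> float:
    """Compound start_value at a fixed annual growth rate for `years` years."""
    return start_value * (1.0 + annual_rate) ** years


def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Return the constant annual growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted in the article, in billions of US dollars.
market_2022 = 1.2
quoted_cagr = 0.154
horizon_years = 10  # 2022 to 2032

projected_2032 = project_value(market_2022, quoted_cagr, horizon_years)
rate_for_5bn = implied_cagr(market_2022, 5.0, horizon_years)

print(f"Projected 2032 market: ${projected_2032:.2f} billion")
print(f"Growth rate needed to reach $5 billion: {rate_for_5bn:.1%}")
```

Compounding $1.2 billion at 15.4% a year for ten years lands at roughly $5 billion, so the quoted figures are consistent with each other.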

ElevenLabs offers a great tool for generating natural voices, but some features are still missing before it can be considered a complete and versatile text-to-speech tool. Other similar tools let users adjust the output, but ElevenLabs doesn’t. Although the tool is well-trained to produce perfect results without intervention, it would sometimes be useful to change the emphasis or express different emotions in the speech, as other tools allow.

Even once a lip-sync feature like HeyGen’s is implemented, dubbing will pose other problems, since it is a more complex process that involves dialogue adaptation. A translated line can be longer or shorter than the original, so a simple translation can throw off the sync between audio and video. In addition, some expressions can’t be translated literally and need to be changed slightly or substantially to be effective. Then there is the tone in which a line is delivered, which differs from language to language.

However, the risk is that most people, and especially companies, will opt for this tool because it is cheaper than hiring a dubbing artist. Audiences, too, may prefer it because it can be perceived as an improvement over subtitles. People don’t look for quality but for convenience. That’s why it’s so easy to replace things and jobs with technology: even if people can do something better, many will settle for a lower-quality but more convenient alternative.

Music will also face new problems. Voice cloning, along with new tools that let a voice be used like a virtual instrument, will make things much easier for producers, who will no longer need a singer. It will also make it harder for artists to prevent their voices from being stolen for unauthorized songs.

Tools like HeyGen’s that can alter video and speech will make it harder for everybody to tell what is real and what is not. We are officially in the era of deception.

Dan Brokenhouse