Artificial intelligence technology is advancing by leaps and bounds, especially since the release of various large pre-trained models, and industries of all kinds have been affected to varying degrees. In developed cities, people may already be accustomed to intelligent robots serving as guides, smart home interactive devices, intelligent driving systems, and many other AI-powered products.
However, does the development of artificial intelligence truly benefit everyone? Can AI products, like the now-ubiquitous smartphone, bring benefits and convenience to every individual?
The answer is not a simple yes. Let’s take one of the most widely used artificial intelligence technologies, speech synthesis, as an example and examine its uneven development and the technological barriers to its widespread adoption.
Speech synthesis, also known as text-to-speech (TTS), is a core technology of Artificial Intelligence Generated Content (AIGC). AIGC refers to the use of artificial intelligence techniques, such as large pre-trained models and Generative Adversarial Networks (GANs), to learn patterns from existing data and generate new, relevant content by generalizing from those patterns.
ChatGPT, now widely used across industries, falls into the text-generation category of AIGC. Speech synthesis is an essential end-user product of AIGC, directly serving the general public, and as such it plays a crucial role in determining user experience and market performance.
Text-to-Speech
TTS has undergone three generations of evolution: concatenative synthesis, parametric synthesis, and end-to-end synthesis. It has reached large-scale deployment, and current technological upgrades focus on achieving a fully human-like effect by improving the rhythm and emotional expression of speech, as well as on realizing real-time synthesis. Virtually any product built on AIGC incorporates speech synthesis technology, including virtual assistants, intelligent customer service, smart home interactive devices, and intelligent robots.
From “Text to Speech Synthesis: A Systematic Review, Deep Learning Based Architecture and Future Research Direction”
The following figure illustrates five types of TTS systems.
Type 1 employs the traditional statistical parametric speech synthesis (SPSS) structure, consisting of text analysis, an acoustic model, and a vocoder.
Types 2 and 4 integrate text analysis and the acoustic model, predicting general acoustic features or a mel-spectrogram directly from input characters; examples include Tacotron 2, Deep Voice 3, and FastSpeech (a schematic sketch of these pipelines follows this list).
Type 3 combines the acoustic model and the final vocoder, transforming linguistic features directly into a waveform (e.g., WaveNet).
Finally, fully end-to-end (E2E) models merge all three blocks into one, converting characters directly into a waveform (e.g., Wave-Tacotron, Char2Wav, ClariNet).
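To make this typology concrete, here is a minimal schematic sketch in Python. Every name in it (`text_analysis`, `acoustic_model`, `vocoder`, and so on) is a hypothetical placeholder standing in for a trained network or DSP module, not an API from any of the systems named above; the point is only to show where each type draws its module boundaries.

```python
import numpy as np

HOP = 256    # waveform samples generated per mel frame (assumed)
N_MELS = 80  # mel-spectrogram channels (a common choice)

# Hypothetical placeholder stages; in a real system each would be a
# trained neural network (e.g. Tacotron 2, WaveNet) or a DSP module.

def text_analysis(text: str) -> np.ndarray:
    """Front end: characters -> linguistic features (one vector per char)."""
    return np.array([[ord(c) / 128.0] for c in text])

def acoustic_model(feats: np.ndarray) -> np.ndarray:
    """Linguistic features -> mel-spectrogram (placeholder values)."""
    return np.random.rand(len(feats) * 5, N_MELS)

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Mel-spectrogram -> waveform (placeholder noise)."""
    return np.random.uniform(-1.0, 1.0, size=len(mel) * HOP)

def char_to_mel(text: str) -> np.ndarray:
    """Types 2/4: one network maps characters straight to mels,
    absorbing text analysis (the Tacotron 2 / FastSpeech pattern)."""
    return acoustic_model(text_analysis(text))

def char_to_wave(text: str) -> np.ndarray:
    """Fully end-to-end: one network, characters -> waveform."""
    return vocoder(char_to_mel(text))

# Type 1 (SPSS): three explicit stages.
wave = vocoder(acoustic_model(text_analysis("hello world")))
# Types 2/4 plus a neural vocoder: two stages.
wave = vocoder(char_to_mel("hello world"))
# Fully end to end: one stage from the caller's point of view.
wave = char_to_wave("hello world")
```

The practical difference is coupling: in a Type 1 system each stage can be swapped independently, while a fully end-to-end model trades that modularity for joint optimization of the whole pipeline.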
From “An Overview of Text-to-Speech Systems and Media Applications”
Technical Difficulties of Speech Synthesis
Many text-to-speech products have been released, but none has become as ubiquitous as the smartphone; most are used by only a small portion of the population, which highlights the imbalance in the adoption of AI products. Popularizing these products faces numerous challenges, including user habits, the completeness of product functionality, and whether the product is usable by everyone. Particularly challenging is achieving natural and realistic speech across regional and language differences: TTS systems built with traditional and statistical algorithms can produce voices that sound robotic or mechanical, which users may not readily accept.
Additionally, building interactive on-device applications that produce flexible, adaptive speech is challenging, because a TTS system’s output depends on many factors, such as the dataset, the model, and the module types used. This complexity can make it difficult for developers to generate refined and expressive speech.
Finally, it is crucial to build efficient and scalable TTS systems, since on-device products in real-world scenarios must generate large amounts of speech in real time without sacrificing quality.
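One standard way to quantify this real-time requirement is the real-time factor (RTF): synthesis time divided by the duration of the audio produced, with RTF below 1.0 meaning the system generates speech faster than it is played back. The sketch below shows how it is measured; the `synthesize` function is a hypothetical stand-in for any TTS system’s inference call.

```python
import time
import numpy as np

SAMPLE_RATE = 22050  # a common output rate for neural TTS (assumed)

def synthesize(text: str) -> np.ndarray:
    """Hypothetical stand-in for a TTS system's inference call."""
    time.sleep(0.05)                 # simulate model inference latency
    return np.zeros(SAMPLE_RATE * 2) # pretend: 2 seconds of audio

def real_time_factor(text: str) -> float:
    """RTF = synthesis time / duration of the audio produced."""
    start = time.perf_counter()
    wave = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(wave) / SAMPLE_RATE
    return elapsed / audio_seconds

rtf = real_time_factor("An example sentence for benchmarking.")
print(f"RTF = {rtf:.3f} ({'real-time capable' if rtf < 1 else 'too slow'})")
```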
Barriers to the Popularization of Speech Synthesis Products
Speech synthesis technology has significant potential in untapped markets, especially for low-resource minority languages and dialects. These markets are often overlooked in favor of larger language markets, but there are several reasons to advocate for their development.
First, minority languages and dialects are backed by strong communities and rich cultural traditions worldwide. For their speakers, access to high-quality speech synthesis is a valuable resource that can help preserve and pass on their language and culture.
Second, as globalization accelerates, these markets are becoming increasingly important. Demand for such languages exists in business, tourism, and international cooperation, so speech synthesis support for them can facilitate cross-cultural communication and collaboration.
However, successfully developing speech synthesis technology in these markets requires overcoming real challenges. Training speech models for minority languages and dialects is harder because data is scarce, and synthesis systems need many more speech samples to capture and reproduce the pronunciation differences and intonation patterns of these languages. Nevertheless, as the technology advances, techniques such as transfer learning and data augmentation can gradually overcome these obstacles, as sketched below.
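As a rough illustration of the transfer-learning idea, the PyTorch sketch below freezes a (notionally pretrained, language-general) encoder and fine-tunes only the decoder on scarce target-language data. The model architecture, checkpoint name, and hyperparameters here are illustrative assumptions, not a specific published recipe.

```python
import torch
import torch.nn as nn

# Hypothetical acoustic model: an encoder shared across languages and
# a decoder that emits mel-spectrogram frames (illustration only).
class TinyAcousticModel(nn.Module):
    def __init__(self, vocab=100, dim=64, n_mels=80):
        super().__init__()
        self.encoder = nn.Sequential(nn.Embedding(vocab, dim),
                                     nn.Linear(dim, dim), nn.ReLU())
        self.decoder = nn.Linear(dim, n_mels)

    def forward(self, tokens):
        return self.decoder(self.encoder(tokens))

model = TinyAcousticModel()
# In practice, load pretrained multilingual weights here, e.g.:
# model.load_state_dict(torch.load("multilingual_tts.pt"))  # hypothetical file

# Transfer learning: freeze the language-general encoder and fine-tune
# only the decoder on the scarce target-language data.
for p in model.encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
loss_fn = nn.L1Loss()  # L1 on mel frames is a common TTS training loss

# Toy stand-ins for a small low-resource dataset (random tensors).
tokens = torch.randint(0, 100, (8, 32))  # batch of token IDs
target_mels = torch.rand(8, 32, 80)      # aligned mel-spectrogram targets

for step in range(3):  # a few fine-tuning steps for illustration
    optimizer.zero_grad()
    loss = loss_fn(model(tokens), target_mels)
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")
```

Data augmentation complements this: transforms such as speed or pitch perturbation can stretch a small recorded corpus further before fine-tuning begins.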
Multilingual and Multispeaker Emotional Speech Synthesis Dataset
DataOceanAI offers massive speech synthesis datasets aimed at addressing data scarcity for low-resource languages and dialects. We provide high-quality data covering a range of emotions and domains to meet diverse application needs. Notable features of our datasets include:
Large Scale: For each language/dialect, the speakers have a relatively balanced gender ratio. Our dataset not only covers diverse speech characteristics but also has sufficient volume to support training deep learning models and improving their performance.
Comprehensive Annotation: Our dataset includes pronunciation annotations as well as prosodic annotations, providing TTS systems with additional information to generate more natural and fluent speech.
Multi-Domain Coverage: Our dataset spans multiple domains, including news, technology, dialogue, entertainment, novels, and comprehensive content. Whether you are developing TTS systems for news reporting, education, or entertainment, our dataset can meet your needs.
Multi-Emotion Coverage: Our dataset covers not only regular speech but also various emotional states, including happiness, sadness, surprise, and anger. This helps in developing more emotionally expressive TTS systems.
Multilingual Coverage: Our dataset covers a wide range of languages and dialects, including Tamil, Dutch, Swahili, Bulgarian, Urdu, Estonian, Maltese, Northern Uzbek, Luganda, and Tibetan. This provides great convenience for cross-cultural and multilingual applications.
DataOceanAI is committed to addressing data scarcity and providing rich resources to developers and researchers, advancing TTS technology across languages and application domains. We look forward to seeing more innovative applications and research that leverage our datasets to give users an outstanding speech synthesis experience.