Recently, OpenAI delivered another breakthrough in speech technology: given a text input and a 15-second audio sample, its system can generate speech that sounds both natural and remarkably similar to the original voice. What is particularly impressive is that even with a small model, a 15-second sample is enough to produce speech that is emotionally rich and highly realistic. OpenAI has named this new speech engine “Voice Engine,” and a preview version has recently made its debut.
What Kind of Data is Needed for Large Speech Foundation Models?
1. Speech Data
Speech data is the most critical component, requiring vast amounts of recordings. This data should include a wide variety of dialects, accents, intonations, speaking speeds, and environmental noise conditions. This diversity ensures that the speech model can perform effectively across different real-world scenarios.
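To make the notion of diversity concrete, here is a minimal Python sketch of how a corpus's metadata might be audited for coverage before training. All field names and label values here are hypothetical stand-ins for whatever schema a real corpus defines.

```python
# A minimal sketch of auditing diversity coverage in a speech corpus.
# The metadata fields (dialect, noise_condition, speaking_rate) are
# hypothetical; real corpora define their own schemas.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Utterance:
    path: str             # path to the audio file
    dialect: str          # e.g. "Cantonese", "Northern Mandarin"
    noise_condition: str  # e.g. "quiet", "street", "in-car"
    speaking_rate: float  # estimated syllables per second

def coverage_report(utterances: list[Utterance]) -> None:
    """Count recordings per dialect and noise condition so that
    gaps in coverage are visible before training begins."""
    print("By dialect:", Counter(u.dialect for u in utterances))
    print("By noise:  ", Counter(u.noise_condition for u in utterances))

# Example usage with toy metadata:
corpus = [
    Utterance("a.wav", "Cantonese", "quiet", 4.2),
    Utterance("b.wav", "Northern Mandarin", "street", 5.1),
    Utterance("c.wav", "Cantonese", "in-car", 3.8),
]
coverage_report(corpus)
```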
2. Speech-to-Text Transcription Data
Training a speech recognition system's deep learning algorithms requires corresponding transcript data, and this transcription data must accurately match the spoken content of the audio recordings. High-quality transcriptions are essential for the model to learn and recognize speech patterns effectively.
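As an illustration, here is a minimal sketch of validating audio-transcript pairs stored in a JSON-lines manifest, a convention used by several ASR toolkits. The field names ("audio_filepath", "text") are assumptions for this example, not a fixed standard.

```python
# A minimal sketch of checking that every audio file in a JSON-lines
# manifest exists and has a non-empty transcript, so mismatches
# surface before training rather than during it.
import json
from pathlib import Path

def load_manifest(manifest_path: str) -> list[dict]:
    """Read one JSON object per line and validate each audio-text pair."""
    pairs = []
    with open(manifest_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            entry = json.loads(line)
            if not Path(entry["audio_filepath"]).is_file():
                raise FileNotFoundError(
                    f"line {line_no}: missing {entry['audio_filepath']}"
                )
            if not entry["text"].strip():
                raise ValueError(f"line {line_no}: empty transcript")
            pairs.append(entry)
    return pairs
```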
3. Pronunciation Lexicon
A pronunciation lexicon maps words to their phonetic representations, which is crucial for both speech recognition and synthesis. This dictionary helps the model understand how words should be pronounced, aiding in accurate speech-to-text conversion and natural-sounding text-to-speech synthesis.
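For concreteness, the sketch below shows a toy lexicon of this kind in Python. The pinyin-with-tone phone renderings are illustrative and not drawn from any specific published lexicon.

```python
# A minimal sketch of a pronunciation lexicon: words mapped to phone
# sequences. The entries are illustrative pinyin-with-tone renderings.
lexicon: dict[str, list[str]] = {
    "你好": ["n", "i3", "h", "ao3"],
    "天气": ["t", "ian1", "q", "i4"],
}

def phones_for(word: str) -> list[str]:
    """Look a word up in the lexicon; in practice, unknown words are
    usually handled by a grapheme-to-phoneme (G2P) model instead."""
    try:
        return lexicon[word]
    except KeyError:
        raise KeyError(f"'{word}' is out of vocabulary; fall back to G2P")

print(phones_for("你好"))  # ['n', 'i3', 'h', 'ao3']
```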
High-Quality Multilingual and Multidialectal Labeled Data
China’s linguistic landscape presents a unique and complex diversity, which not only reflects the richness of the language itself but also embodies the deep cultural and historical heritage of the region. While Mandarin serves as the national official language, various dialects and local accents remain deeply rooted in people’s daily lives. These dialects carry the characteristics and historical imprints of their regions, posing significant challenges for the development of speech recognition technology.
To build models capable of effectively recognizing these diverse linguistic variants, it is necessary to gather extensive and in-depth data. This means not only including major dialects such as Northern Mandarin, Southern Wu, Cantonese, and Minnan, but also covering more regionally distinct and less commonly spoken dialects. Additionally, people of different ages, genders, and educational backgrounds exhibit distinct speech characteristics, further complicating the data collection process.
After data collection, it is essential to label the speech data. This labeling process goes beyond simple text transcription; it includes precise descriptions of speaking speed, intonation, pauses, and accents. Only with such detailed labeling can the trained models exhibit high sensitivity to various speech variants and robust recognition capabilities in real-world applications. This approach ensures that large speech foundation models can better adapt to China’s complex linguistic environment, enabling technology to better serve the diversity of society and culture.
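As a concrete illustration of labeling that goes beyond plain transcription, here is a minimal Python sketch of an annotation record carrying speaking rate, intonation, pauses, and accent alongside the transcript. All field names and label values are hypothetical; real labeling guidelines define their own inventories.

```python
# A minimal sketch of a rich annotation record, going beyond plain
# transcription. All field names and label values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Segment:
    start: float  # seconds from utterance start
    end: float
    text: str     # transcript of this segment

@dataclass
class Annotation:
    audio_path: str
    segments: list[Segment]
    dialect: str           # e.g. "Minnan"
    accent_strength: str   # e.g. "light", "moderate", "heavy"
    speaking_rate: float   # syllables per second
    intonation: str        # e.g. "rising", "falling", "neutral"
    pauses: list[float] = field(default_factory=list)  # pause onsets, seconds

# Example record for one utterance:
ann = Annotation(
    audio_path="spk001_utt042.wav",
    segments=[Segment(0.0, 1.4, "今天天气怎么样")],
    dialect="Minnan",
    accent_strength="moderate",
    speaking_rate=4.6,
    intonation="rising",
    pauses=[1.4],
)
```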
High-Quality Data for Large Speech Foundation Models
Recently, Dataocean AI released a meticulously labeled speech dataset specifically designed for large speech foundation models. The dataset covers 29,954 dialect speakers from 26 provinces in China, ranging in age from 12 to 75, with a total duration of 34,073 hours and an average recording length of nearly 60 minutes. The gender ratio is balanced.
The topics covered are highly diverse, including news, text messages, vehicle control, music, general conversations, maps, daily conversations, family, health, tourism, work, social interactions, celebrities, weather, and other common life topics. The dataset also includes both read speech and spontaneous conversations, aiming to enhance the ability of large speech models to handle Chinese dialects.
King-ASR-963: A Ten-Thousand-Speaker Dialect Speech Corpus with High-Quality Labeling