Essential Data for Training Large Speech Foundation Models

Blog
May 2, 2024
Recently, OpenAI delivered another breakthrough in speech technology. Given a text input along with a 15-second audio sample, their system can generate speech that sounds both natural and remarkably similar to the original voice. What is particularly impressive is that even with a small model, a 15-second sample is enough to produce speech that is emotionally rich and highly realistic. OpenAI has named this new speech engine “Voice Engine,” and a preview version has recently made its debut.

What Kind of Data is Needed for Large Speech Foundation Models?

1. Speech Data
Speech data is the most critical component, requiring vast amounts of recordings. This data should include a wide variety of dialects, accents, intonations, speaking speeds, and environmental noise conditions. This diversity ensures that the speech model can perform effectively across different real-world scenarios.
2. Speech-to-Text Transcription Data
A speech recognition system requires corresponding transcript data for training its deep learning recognition algorithms. This transcription data must accurately match the spoken content of the audio recordings. High-quality transcriptions are essential for the model to learn and recognize speech patterns effectively. 
3. Pronunciation Lexicon
A pronunciation lexicon maps words to their phonetic representations, which is crucial for both speech recognition and synthesis. This dictionary helps the model understand how words should be pronounced, aiding in accurate speech-to-text conversion and natural-sounding text-to-speech synthesis.
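To make the lexicon idea concrete, here is a minimal sketch of a word-to-phoneme mapping and lookup, using ARPAbet-style symbols as found in resources such as CMUdict. The entries and the helper function are illustrative assumptions, not taken from any specific corpus.

```python
# A minimal pronunciation lexicon: words mapped to phoneme sequences
# (ARPAbet-style symbols, as used in resources like CMUdict).
# The entries below are illustrative, not from a real lexicon file.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
    "speech": ["S", "P", "IY", "CH"],
}

def phonemes_for(word: str) -> list[str]:
    """Look up a word's phonetic representation.

    In a real system, out-of-vocabulary words would fall back to a
    grapheme-to-phoneme (G2P) model instead of an empty result.
    """
    return LEXICON.get(word.lower(), [])

print(phonemes_for("Speech"))  # ['S', 'P', 'IY', 'CH']
```

In practice such lexicons contain hundreds of thousands of entries, often with multiple pronunciation variants per word to cover regional accents.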
High-Quality Multilingual and Multidialectal Labeled Data

China’s linguistic landscape presents a unique and complex diversity, which not only reflects the richness of the language itself but also embodies the deep cultural and historical heritage of the region. While Mandarin serves as the national official language, various dialects and local accents remain deeply rooted in people’s daily lives. These dialects carry the characteristics and historical imprints of their regions, posing significant challenges for the development of speech recognition technology.
To build models capable of effectively recognizing these diverse linguistic variants, it is necessary to gather extensive and in-depth data. This means not only including major dialects such as Northern Mandarin, Southern Wu, Cantonese, and Minnan, but also covering more regionally distinct and less commonly spoken dialects. Additionally, people of different ages, genders, and educational backgrounds exhibit distinct speech characteristics, further complicating the data collection process.
After data collection, it is essential to label the speech data. This labeling process goes beyond simple text transcription; it includes precise descriptions of speaking speed, intonation, pauses, and accents. Only with such detailed labeling can the trained models exhibit high sensitivity to various speech variants and robust recognition capabilities in real-world applications. This approach ensures that large speech foundation models can better adapt to China’s complex linguistic environment, enabling technology to better serve the diversity of society and culture.
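To illustrate what such detailed labeling might look like, here is a hypothetical annotation record that carries labels beyond plain transcription: dialect, speaking rate, intonation, pauses, and accent strength. The field names and values are assumptions for illustration, not Dataocean AI's actual schema.

```python
import json

# A hypothetical annotation record for one utterance. Field names are
# illustrative assumptions, not a real labeling schema.
record = {
    "audio": "speaker_0001_utt_042.wav",
    "transcript": "今天天气很好",            # "The weather is nice today"
    "dialect": "Cantonese",
    "speaking_rate_sps": 4.2,               # syllables per second
    "intonation": "rising",
    "pauses": [{"start_s": 1.10, "end_s": 1.35}],
    "accent_strength": "moderate",
}

# Annotation records are typically serialized as JSON lines so that
# training pipelines can stream them alongside the audio files.
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Structured records like this let a training pipeline filter or balance data by dialect, speaking rate, or accent strength, rather than treating every utterance identically.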

High-Quality Data for Large Speech Foundation Models

Recently, Dataocean AI released a meticulously labeled speech dataset designed specifically for large speech foundation models. The dataset covers 29,954 dialect speakers from 26 provinces in China, ranging in age from 12 to 75, with a total duration of 34,073 hours and an average recording length of nearly 60 minutes. The gender ratio is balanced.
The topics covered are very diverse, including news, text messages, vehicle control, music, general conversations, maps, daily conversations, family, health, tourism, work, social interactions, celebrities, weather, and other common life topics. Additionally, the dataset includes both read texts and spontaneous conversations, aimed at enhancing the capability of large-scale voice models to handle Chinese dialects.
King-ASR-963: Ten-Thousand-Speaker Dialect Speech Corpus with High-Quality Labeling