Dataocean AI New Datasets – September

September 27, 2024
We are excited to announce our new arrivals: the Accented English Speech Recognition Corpus from 60+ Countries, the Accented Spanish Speech Recognition Corpus from 8 Countries, the India Multilingual Speech Recognition Corpus, the Spontaneous Dialogue Speech Synthesis Corpus, and the Multi-Style, Multi-Tone Speech Synthesis Corpus. These high-quality AI training datasets help improve model performance so that AI products can meet the diverse needs of global users.

 

Accented English Speech Recognition Corpus from 60+ Countries:

English is a global language spoken with a wide range of accents across countries and regions. Training on a high-quality dataset with diverse English accents helps improve the robustness and accuracy of speech recognition systems.
  • Key Features: This corpus includes English accents from over 60 countries and regions, with 58,000 speakers and a total of 30,000 hours. The word accuracy rate is over 97%. The corpus ensures gender balance and covers an age range from 18 to 65, fully capturing linguistic features across different age groups. The accents covered include American English, Singaporean English, British English, Australian English, Canadian English, Indian English, Japanese English, French English, German English, Hong Kong English, and Taiwanese English.
  • Content Topics: The corpus spans over 20 fields, including call centers, business meetings, finance, insurance, marketing, education, healthcare, and tourism.

 

Accented Spanish Speech Recognition Corpus from 8 Countries:

Spanish is one of the six official languages of the United Nations and ranks as the second-largest language in terms of native speakers worldwide.
  • Key Features: This corpus includes over 7,800 speakers with a total duration of over 5,500 hours and a word accuracy rate of over 97%. It features gender balance and covers an age range from 18 to 65. It includes Spanish accents from 8 countries and regions, such as the United States, Mexico, Colombia, Venezuela, Chile, and Argentina.
  • Content Topics: The corpus includes read and dialogue data from various fields such as emotional dialogue, business meetings, daily conversations, news, and finance.
 

India Multilingual Speech Recognition Corpus:

In addition to Hindi, its first official language, India has 122 major spoken languages, 30 of which have more than 1 million speakers, as well as 6 officially recognized classical languages.
  • Key Features: This corpus includes 16,000 speakers, with a total duration of over 12,000 hours and a word accuracy rate of over 97%. It covers many of India’s major languages, such as Maithili, Bengali, Marathi, Telugu, Malayalam, Tamil, Odia, Urdu, Kashmiri, Punjabi, and Assamese.
  • Content Topics: The corpus includes read data from a variety of fields, including daily life, news, entertainment, and chatting.

 

Spontaneous Dialogue Speech Synthesis Corpus:

We are releasing two new Chinese spontaneous dialogue speech synthesis corpora. Both are characterized by high naturalness and provide strong support for speech synthesis technology.
Chinese Multi-speaker – Spontaneous Dialogue:
  • Key Features: This corpus contains 18 hours of recordings from 3 male and 2 female speakers, covering various scenarios such as spontaneous dialogue, reading, and mixed Chinese-English reading. All data is precisely labeled, including pronunciation, intonation, and paralinguistic features, ensuring high quality and practical value.
  • Content Topics: The corpus includes spontaneous dialogue, words, jokes, riddles, proverbs, tongue twisters, poetry, idioms, and interjections.
King-TTS-298  Chinese Multi-speaker – Spontaneous Dialogue
 
Chinese Multi-speaker – Amateur Spontaneous Dialogue:
  • Key Features: This corpus contains 27 hours of recordings from 6 male and 21 female speakers. Pronunciation and intonation are precisely labeled to ensure high quality and usability. The voices are recorded by non-professional speakers, which makes the tones more natural, though some accents or hoarseness may be present.
  • Content Topics: The corpus includes expanded conversational topics such as daily life, hobbies, and special skills.
King-TTS-135  Chinese Multi-speaker – Amateur Spontaneous Dialogue

 

Multi-Style, Multi-Tone Speech Synthesis Corpus:

Chinese Multi-speaker – Unique Voices:
  • Key Features: This corpus contains 11 hours of recordings, featuring various unique voice tones such as mature female, deep male, bass, high-pitched, and elderly-like tones. The data is labeled with pronunciation, intonation, and paralinguistic features (e.g., stress, drag).
  • Content Topics: The corpus includes casual conversation topics such as name origins, hobbies, and childhood experiences.
King-TTS-267  Chinese Multi-speaker – Unique Voices

 

Chinese Multi-speaker – Amateur Multi-Emotion, Multi-Style:
  • Key Features: This corpus contains 11 hours of recordings, featuring various unique voice tones such as mature female, deep male, bass, high-pitched, and elderly-like tones. The data is labeled with pronunciation, intonation, and paralinguistic features (e.g., stress, drag).
  • Content Topics: The corpus includes casual conversation topics such as name origins, hobbies, and childhood experiences.
King-TTS-306  Chinese Multi-speaker – Amateur Multi-Emotion, Multi-Style