Multimodal

Search our off-the-shelf datasets.

DMS with Multi-skin color Drivers Corpus
【Subject Information】 Ethnicity: Divided into two categories, Black and White. Black includes black, brown, and olive skin tones; White includes fair, medium, and very fair skin tones. Nationality: More than 39 countries and regions (Switzerland, Colombia, Peru, Paris, Ghana, Brazil, Latin America, Latvia, Samoa, South Africa, etc.). Age: Subjects range from 18 to over 60, with young and middle-aged adults in the majority. 【Video Information】 Each video segment is at least 20 seconds long, with a resolution of no less than 720P. 【Data Collection Information】 Daytime: front lighting, backlighting, side lighting, dappled sunlight, overcast, rainy, and snowy conditions. Nighttime: interior vehicle lighting, street lamp lighting, and oncoming vehicle high and low beams. Facial Expressions and Actions: eyes open, mouth open and closed, exaggerated mouth opening and closing, exaggerated expressions, twisted mouth, making faces, etc. Other Actions: smoking, drinking water, using a mobile phone, hand occlusions, etc. Accessories: All subjects wear accessories (glasses, hats, etc.).
General Knowledge Text-Image Pair Corpus
Product Features: This corpus includes data from 23 categories such as cuisine, landscapes, architecture, cities, countryside, health, sports, medical, automobiles, backgrounds, finance, education, oil paintings, illustrations, watercolors, travel, fashion, romance, animals, plants, space, and technology.
High-Definition Dance Video Corpus
Product Features: This dataset contains 100,000 dance videos, each averaging 30 seconds in length, at 4K resolution. The performers are adults and teenagers with a foundation in dance, with a balanced gender ratio. It includes both solo and group dances, with rich coverage of angles such as front, side, back, and turning. Dance types include folk dance, jazz, street dance, and more. Application Fields: This dataset can be applied to virtual humans, VR, dance education, video production, and other fields, supporting the application and development of multimodal technology in those areas.
Lip-movement Video Corpus
The corpus contains lip-movement speech video captured with high-definition cameras from approximately 208 individuals. Recording took place in a quiet indoor environment under simulated lighting conditions, including normal light, strong light, backlight, and weak light. The shooting distances are 0.5 m and 1 m, with 0.5 m accounting for about 90% of the recordings. The shooting angle is frontal, and the framing mainly covers the upper body. In addition to single-person recordings, the collection simulates queue scenarios: about 30% of each person's video data was recorded in multi-person scenes, most of which contain two people. The speakers primarily speak Mandarin (speakers with northern pronunciation were prioritized, along with some speakers with good southern Mandarin pronunciation), and some may have a slight local accent. They speak at a normal pace, and each person records 10 sentences averaging 10 to 15 characters. Ages range from 7 to over 60, mainly children and young and middle-aged adults, with a balanced gender ratio. During video recording, a front-facing interface microphone records the speaker in sync with the video; a second audio file is extracted from the recorded video.
Lip-reading Speech Video Corpus
The corpus uses six cameras and two microphone arrays to simultaneously capture lip speech video data of speakers. The recording setup simulates the interior of a cockpit, with diverse shooting angles and lighting. Data is collected from 250 individuals, all adults, primarily young and middle-aged. The target effective recording time is approximately 0.5 hours per person, averaging about 600 short sentences per person. For each ID, the corpus also extracts audio from one of the six captured video channels and saves it as a separate audio file. The six camera streams are aligned with an error of less than 30 milliseconds, and the two microphone array recordings are synchronized with the camera streams.
Multimodal 3D Sign Language Corpus
A total of 8,264 groups of data were collected for national general sign language and sports. Of these, 8,189 groups had their motion data repaired and the remaining 75 groups did not. The other categories (5,366 groups) were not repaired.
Professional Scenario Text-Image Pair Corpus
Product Features: This corpus contains images from various scenarios, multiple time periods, and different shooting angles, covering architecture, displays, urban streetscapes, home environments, competition scenes, shopping malls, schools, exhibitions, and natural environments. Corresponding text descriptions are provided.
Telephoto Landscape Corpus
【Product Features】 High-quality images of architecture and plants, with no blurring anywhere in the full-size image, so that both the foreground and background show clear textures even when enlarged; no more than 5 images of the same subject from different angles, to ensure diversity in the content captured. 【Image Specifications】 Resolution above 4K (shot in the camera's highest quality mode); focal length between 185 mm and 235 mm.
