Multimodal

Search our off-the-shelf datasets.

High-Definition Dance Video Corpus
Product Features: This dataset contains 100,000 dance videos in 4K resolution, each averaging 30 seconds in length. The performers are adults and teenagers with dance training, with a balanced gender ratio. The videos include both solo and group dances and cover a wide range of angles, such as front, side, back, and turning views. Dance types include folk dance, jazz, street dance, and more. Application Fields: This dataset can be applied to virtual humans, VR, dance education, video production, and other fields, supporting the application and development of multimodal technology in those areas.
Telephoto Landscape Corpus
Product Features: High-quality images of architecture and plants, with no blurring anywhere in the full-size image, so that both foreground and background show clear textures even when enlarged. No more than 5 images are taken of the same subject from different angles, to ensure diversity in the captured content. Image Specifications: Resolution above 4K (shot in the camera's highest quality mode); focal length in the range of 185 mm to 235 mm.
DMS Diverse Drivers Corpus
This product library is an in-cabin DMS (Driver Monitoring System) data collection of foreign adults, consisting solely of IR (infrared) videos and images. It covers 700 foreign adults, of whom 20% are Black and 80% are White. Each participant was recorded individually, with 25 fixed cameras arranged inside the cabin for synchronized recording and one additional camera for supporting shots. Props include hats (20%), regular glasses (25%), sunglasses (25%), and masks (20%), combined at random. The vehicles are 5-seater passenger cars (three vehicles in total: a smart, a BYD Dolphin, and a BYD Song), and they were stationary during shooting. Lighting conditions include frontal lighting, back lighting, side lighting, interior car lighting, streetlights, shade under trees, oncoming headlights, cloudy days, and rainy days. The captured content covers both action and gaze: videos for action capture and images for gaze capture, along with calibration data. The video action capture is divided into 18 basic scenes and 8 additional scenes, 26 categories in total. All data include the 18 basic scenes, and some data also include the 8 additional scenes. During action capture, participants also perform head movements, which are divided into two actions; each participant randomly performs one of the two, so that each action accounts for 50% overall. The library contains approximately 730,000 video segments of around 30 seconds each (700 * 18 * 25 * [lighting conditions]), approximately 204,554,000 gaze and action images (700 * (11 * 25 * 19 + 19 * 26 * 19) * [lighting conditions]), 700,000 original calibration images (700 * 40 * 25), and another 700,000 calibration output images (700 * 40 * 25). These are stored in the images, detection, and reprojection folders respectively.
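The count formulas quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are hypothetical, the reading of each factor (for example, 25 as the number of fixed cameras) is an assumption, and the lighting multiplier is not stated in the description, so for the images it is derived here from the quoted total rather than taken from the source.

```python
# Factors copied from the formulas in the description.
# Their interpretation (subjects, cameras, shots) is an assumption.
SUBJECTS = 700
CAMERAS = 25
CALIBRATION_FACTOR = 40  # from "700 * 40 * 25"

# Original calibration images (the output images use the same formula).
calibration_images = SUBJECTS * CALIBRATION_FACTOR * CAMERAS

# Gaze and action images before the unstated lighting multiplier.
images_per_lighting = SUBJECTS * (11 * 25 * 19 + 19 * 26 * 19)

# The description quotes the image total but not the multiplier,
# so the multiplier is inferred by division.
QUOTED_IMAGE_TOTAL = 204_554_000
implied_lighting_factor = QUOTED_IMAGE_TOTAL / images_per_lighting

# Video segments before the unstated lighting multiplier.
videos_per_lighting = SUBJECTS * 18 * CAMERAS

print(calibration_images)       # 700000
print(images_per_lighting)      # 10227700
print(implied_lighting_factor)  # 20.0
print(videos_per_lighting)      # 315000
```

The calibration figure matches the quoted 700,000 exactly, and the quoted image total corresponds to a lighting multiplier of 20. The multiplier used for the video count is not given in the description, so only the pre-multiplier product is computed for videos.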
Lip-movement Video Corpus
Product Features: This dataset contains lip-movement and speech videos of 208 people, recorded with high-definition cameras. It was collected in a quiet indoor environment under simulated lighting conditions including normal light, strong light, backlight, and dim light. Shooting distances are 0.5 m and 1 m, with 0.5 m accounting for about 90% of the recordings. It includes both solo and group recordings. The subjects are mainly Mandarin speakers aged from 7 to over 60, mostly children and young to middle-aged individuals, with a balanced gender ratio. Audio was recorded simultaneously with the video. Application Fields: This dataset can be applied to virtual humans, VR, and other fields, supporting the application and development of lip-reading technology in those areas.
Lip-reading Speech Video Corpus
This dataset covers 250 individuals, each recording at least 600 short sentences, with about half an hour of effective video per individual. It can be used for tasks such as face recognition and object detection.
Multimodal 3D Sign Language Corpus
A total of 8,264 groups of data were collected, covering national general sign language and sports. Of these, 8,189 groups underwent motion repair and 75 groups did not. The remaining categories (5,366 groups) were not repaired.
