Multimodal

AIGC-Portrait image dataset

This dataset contains portrait images generated from original photographs of people in four styles: 3D cartoon, comic, watercolor painting, and sketch. Four skin colors are covered (dark, fair, brown, and light), and every style includes all four.

Image High Resolution cv

Android Front-Facing Multi-Skin-Tone Face Collection Corpus

Product Features: Captured using the front-facing camera of Oppo series smartphones released after 2023. Models maintain eye contact with the camera, with a shooting distance of 20–50 cm (selfie unlock distance). Includes six lighting conditions (normal light, side light, backlight, front light, low light, warm light). For each participant: at least 15 photos for Black, White, and Asian (Chinese) individuals; at least 25 photos for Brown individuals. The dataset includes 25 pairs/groups of twins or look-alike individuals.
Ethnicities Collected: Black, White, Asian (Chinese), Brown.
Age Range: All adults.
Image Specifications: 1080P resolution or higher.
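The per-participant minimums above can be checked mechanically before use. The following is a minimal sketch, not part of the product: it assumes a hypothetical manifest with one record per photo, and the field names `participant`, `group`, and `lighting`, as well as the short lighting labels, are illustrative assumptions rather than the vendor's schema.

```python
from collections import defaultdict

# Minimum photos per participant, by group, as stated in the product features.
MIN_PHOTOS = {"Black": 15, "White": 15, "Asian (Chinese)": 15, "Brown": 25}

# Short labels (assumed) for the six lighting conditions in the product features.
LIGHTING = {"normal", "side", "backlight", "front", "low", "warm"}


def check_participants(records):
    """Return {participant_id: [problems]} for participants that miss the spec.

    `records` is a hypothetical manifest: an iterable of dicts with the keys
    "participant", "group", and "lighting" (one dict per photo).
    """
    counts = defaultdict(int)
    groups = {}
    problems = defaultdict(list)

    for rec in records:
        pid = rec["participant"]
        counts[pid] += 1
        groups[pid] = rec["group"]
        # Flag lighting labels outside the six listed conditions.
        if rec["lighting"] not in LIGHTING:
            problems[pid].append(f"unknown lighting: {rec['lighting']}")

    for pid, n in counts.items():
        required = MIN_PHOTOS.get(groups[pid])
        if required is None:
            problems[pid].append(f"unknown group: {groups[pid]}")
        elif n < required:
            problems[pid].append(f"{n} photos, need at least {required}")

    return dict(problems)
```

An empty result means every participant in the manifest meets the stated minimum for their group. The check does not require each participant to appear under all six lighting conditions, since the description does not state that.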

Multimodal cv Android

Conference video action collection dataset

This dataset contains simulated videos of meeting actions. It covers dark, brown, fair, and light skin colors, recorded in a brightly lit meeting room. Each video shows a single person from the front, and 4 different collection devices were used. For each person, 4 video clips and 2 pictures were collected.

Multimodal Conference Video Action

General Knowledge Text-Image Pair Corpus

Product Features: This corpus includes data from 23 categories such as cuisine, landscapes, architecture, cities, countryside, health, sports, medical, automobiles, backgrounds, finance, education, oil paintings, illustrations, watercolors, travel, fashion, romance, animals, plants, space, and technology.

Sports Food Multimodal

Handheld Object Portrait Corpus

Data collection covers both indoor and outdoor environments, including offices, meeting rooms, parking lots, gardens, and other common work and daily-life scenarios. Lighting conditions include normal lighting, low-light, and backlit scenarios commonly encountered in real-world settings. Each participant contributes twenty video clips; each clip presents a single object and is accompanied by three close-up images of the object from different angles. The subject appears in upper-body or full-body views, standing or seated, holding one object with one or both hands. The object is moved according to predefined actions. Each video clip includes a brief verbal description of the object provided by the model. The subject is clearly visible, and the face is not occluded for extended periods during recording.

High-Definition Dance Video Corpus

Product Features: This dataset contains 100,000 dance videos, each averaging 30 seconds in length, at 4K resolution. Performers are adults and teenagers with a foundation in dance, with a balanced gender ratio. It includes both solo and group dances, filmed from a rich variety of angles such as front, side, back, and turning. Dance types include folk dance, jazz, street dance, and more. Application Fields: This dataset can be applied to virtual humans, VR, dance education, video production, and other fields, supporting the application and development of multimodal technology in those areas.

Dance Video Virtual human Dance Education

Lip speech video was collected for 250 people

This dataset uses six cameras and two microphones to record synchronized video and audio of each speaker's lip movements and speech. Recording takes place in a simulated cockpit environment, with diverse shooting angles and lighting conditions.

Facial recognition Object Detection

Lip-movement dataset

This dataset uses cameras to collect audio and video data of lip movements. Recording takes place in a quiet indoor environment under several simulated lighting conditions: normal light, strong light, backlight, and weak light. The shooting distance ranges from 0.5 m to 1 m, with 0.5 m accounting for approximately 90% of the data. The shooting angle is frontal, and the frame mainly covers the upper body. In addition to single-person recordings, the collection simulates queuing scenarios: approximately 30% of each person's video data includes multiple people, and in these multi-person scenes there are mostly 2 people on screen. Participants mainly speak Mandarin at a normal speed, and some may have slight local accents. Each person records 10 sentences, averaging 10 to 15 words per sentence. Participants span multiple age groups, with children and middle-aged people in the majority, and the gender ratio is balanced. A front-facing interface microphone records synchronously during video capture, and the audio file is taken from the recorded video.

Virtual human VR Video Collection

Multi-pose facial video dataset

This dataset contains video of head poses and facial expressions. Recording took place in various indoor living and working environments such as offices, meeting rooms, homes, dormitories, and corridors. One video was recorded per person, with the person framed at approximately headshot size. Each video covers raising and lowering the head, moving left and right, shaking the head, opening and closing the mouth, and various combinations of these poses and movements, and is approximately 1 minute long. Lighting conditions include normal light, low light, and backlighting, and the person is clearly visible.

Multimodal Head Pose
