The corpus contains lip-reading video data recorded with high-definition cameras from about 208 speakers. Recording took place in a quiet indoor environment under several simulated lighting conditions: normal light, strong light, backlight, and weak light. The shooting distance is either 0.5 m or 1 m, and about 90% of recordings were made at 0.5 m. All footage is shot from the front and framed mainly on the upper body. In addition to solo recordings, the corpus simulates queue scenarios: about 30% of each speaker's videos were recorded in multi-person scenes, most of which contain two people. Speakers primarily speak Mandarin at a normal pace. Speakers with northern pronunciation were prioritized, some southern speakers with good Mandarin pronunciation are included, and some speakers may have a slight local accent. Each speaker recorded 10 sentences, averaging 10 to 15 characters per sentence. Speakers range in age from 7 to over 60, mainly children, young adults, and middle-aged adults, with a balanced gender ratio. Each video comes with two audio files: one recorded in sync with the speaker by a front-facing interface microphone, and one extracted from the recorded video.
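As a rough illustration of the scale these figures imply, the sketch below multiplies out the stated numbers. It assumes one recording per sentence, which the description does not state explicitly, and all constants are the approximate values quoted above rather than exact counts.

```python
# Approximate figures quoted in the corpus description.
SPEAKERS = 208                 # "approximately 208 individuals"
SENTENCES_PER_SPEAKER = 10
CHARS_PER_SENTENCE = (10, 15)  # stated average range per sentence
SHARE_AT_0_5_M = 0.90          # share of recordings shot at 0.5 m
SHARE_MULTI_PERSON = 0.30      # share recorded in multi-person scenes


def estimate_corpus_size():
    """Return rough size estimates, assuming one recording per sentence."""
    sentences = SPEAKERS * SENTENCES_PER_SPEAKER
    return {
        "sentences": sentences,
        "characters_min": sentences * CHARS_PER_SENTENCE[0],
        "characters_max": sentences * CHARS_PER_SENTENCE[1],
        "recordings_at_0_5m": round(sentences * SHARE_AT_0_5_M),
        "recordings_multi_person": round(sentences * SHARE_MULTI_PERSON),
    }


if __name__ == "__main__":
    for name, value in estimate_corpus_size().items():
        print(f"{name}: {value}")
```

Under these assumptions the corpus holds roughly 2,080 sentences, of which about 1,870 would be shot at 0.5 m and about 620 in multi-person scenes.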
Product Features: Captured in a conference setting. Participants keep a neutral facial expression throughout and walk slowly around the room, without glancing sideways or looking up or down, and with their faces unobstructed. Each participant records 1–2 sets of videos (standing/sitting), captured simultaneously by four cameras (Logitech Rally, Aver CAM 550, Yealink UVC86, Poly E60). Two sets of photos are also collected per participant: one taken on the day of recording with a computer, and one personal photo taken with a phone within the past two years.
Ethnicities Collected: Black, White, Asian (non-Chinese), Brown.
Age Range: All age groups, with a balanced gender ratio.
Video/Image Specifications: Video resolution 1080P or higher, photo resolution 720P; each video is approximately 1 minute long.
Android Front-Facing Multi-Skin-Tone Face Collection Corpus
Product Features: Captured with the front-facing camera of Oppo series smartphones released after 2023. Participants look directly at the camera from a shooting distance of 20–50 cm (selfie unlock distance). Six lighting conditions are covered: normal light, side light, backlight, front light, low light, and warm light. Each Black, White, and Asian (Chinese) participant contributes at least 15 photos, and each Brown participant at least 25 photos. The dataset includes 25 pairs/groups of twins or look-alike individuals.
Ethnicities Collected: Black, White, Asian (Chinese), Brown.
Age Range: All adults.
Image Specifications: 1080P resolution or higher.
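The per-participant photo minimums and lighting labels described for this corpus could be checked with a small validation routine. The following is a minimal sketch only: the function name `check_participant` and the `lighting` key on each photo record are hypothetical, and no particular delivery or metadata format is implied by the description.

```python
# Minimum photo counts per participant, as stated in the product features.
MIN_PHOTOS = {
    "Black": 15,
    "White": 15,
    "Asian (Chinese)": 15,
    "Brown": 25,
}

# The six lighting conditions listed in the product features.
LIGHTING_CONDITIONS = {
    "normal light", "side light", "backlight",
    "front light", "low light", "warm light",
}


def check_participant(ethnicity, photos):
    """Return a list of problems for one participant (empty if none).

    `photos` is a list of dicts with a hypothetical "lighting" key.
    """
    problems = []
    required = MIN_PHOTOS[ethnicity]
    if len(photos) < required:
        problems.append(f"only {len(photos)} photos, need at least {required}")
    unknown = {p["lighting"] for p in photos} - LIGHTING_CONDITIONS
    if unknown:
        problems.append(f"unknown lighting labels: {sorted(unknown)}")
    return problems


if __name__ == "__main__":
    sample = [{"lighting": "backlight"}] * 20
    print(check_participant("Brown", sample))
```

The check only flags lighting labels outside the six listed conditions. It does not require every participant to cover all six, since the description does not state that.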
Product Features: AI-generated (AIGC) portrait data in four styles: 3D cartoon, comic, watercolor, and sketch. Each style covers four ethnicities. The same original portrait images are used for every style, with about 10 images generated per person per style.
Ethnicities Collected: Black, White, Asian (non-Chinese), and Brown.
Age Range: All adults, with a balanced gender ratio.
Image Specifications: Resolution of 1080P or higher.
Product Features: This corpus covers 23 categories, including cuisine, landscapes, architecture, cities, countryside, health, sports, medical, automobiles, backgrounds, finance, education, oil paintings, illustrations, watercolors, travel, fashion, romance, animals, plants, space, and technology.