The corpus uses six cameras and two microphone arrays to capture lip-speech video from each speaker simultaneously. The recording setup simulates a cockpit interior, with diverse shooting angles and lighting conditions. Data were collected from 250 adults, primarily young and middle-aged. The target effective recording time is approximately 0.5 hours per person, averaging about 600 short sentences. For each ID, an audio track extracted from one of the six video streams is also saved as a separate audio file. The six camera streams are aligned to within 30 milliseconds of each other, and the two microphone-array streams are synchronized to the camera streams.
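To make the alignment requirement concrete, the sketch below checks whether the start timestamps of all eight streams fall within the stated 30 ms bound. This is a minimal illustration under assumed conventions, not tooling shipped with the corpus: the per-stream `timestamps.json` file, its `timestamps` field, and the `cam1`..`cam6`/`mic1`/`mic2` directory names are all hypothetical.

```python
# Minimal sketch of a cross-stream synchronization check. Assumes each of the
# six camera streams and two microphone-array streams exposes per-frame (or
# per-chunk) capture timestamps in seconds. File layout and field names are
# hypothetical, not part of the corpus specification.
from pathlib import Path
import json

SYNC_TOLERANCE_S = 0.030  # stated alignment error bound: 30 ms


def load_first_timestamp(stream_dir: Path) -> float:
    """Read the capture timestamp of the first frame/chunk of a stream."""
    meta = json.loads((stream_dir / "timestamps.json").read_text())
    return float(meta["timestamps"][0])


def check_recording(recording_dir: Path) -> bool:
    """Return True if all eight streams start within the 30 ms bound."""
    stream_names = [f"cam{i}" for i in range(1, 7)] + ["mic1", "mic2"]
    starts = [load_first_timestamp(recording_dir / s) for s in stream_names]
    spread = max(starts) - min(starts)
    return spread <= SYNC_TOLERANCE_S
```

The same comparison could be applied per sentence rather than per recording if finer-grained alignment metadata is available.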
Images are captured by professional photographers.
Composition types include rule-of-thirds, horizontal, diagonal, triangular, and central composition.
All images are evaluated and annotated by annotators with strong aesthetic judgment.
Each image matches at least one and at most three of these composition types.
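As one way to enforce the labeling rules above, the sketch below validates that each annotation record carries between one and three labels drawn from the five composition types. The record format (`image_id`, `compositions`) is a hypothetical schema for illustration; only the label vocabulary and the 1-to-3 constraint come from the description.

```python
# Minimal sketch of an annotation-validation step for composition labels.
# Only the vocabulary and the 1-to-3 label count are from the dataset
# description; the record layout is assumed.
ALLOWED_COMPOSITIONS = {
    "rule_of_thirds", "horizontal", "diagonal", "triangular", "central",
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems with one annotation record (empty if valid)."""
    problems = []
    labels = record.get("compositions", [])
    if not 1 <= len(labels) <= 3:
        problems.append(
            f"{record.get('image_id')}: expected 1-3 labels, got {len(labels)}"
        )
    unknown = set(labels) - ALLOWED_COMPOSITIONS
    if unknown:
        problems.append(
            f"{record.get('image_id')}: unknown labels {sorted(unknown)}"
        )
    return problems


# Example: a record with two valid labels passes with no problems reported.
assert validate_record(
    {"image_id": "img_001", "compositions": ["diagonal", "central"]}
) == []
```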
Data collection covers both indoor and outdoor environments, including offices, meeting rooms, parking lots, gardens, and other common work and daily-life scenarios.
Lighting conditions include normal lighting, low-light, and backlit scenarios commonly encountered in real-world settings.
Twenty video clips are recorded per participant; each clip shows the presentation of a single object and is accompanied by three close-up images of that object from different angles.
The subject appears in upper-body or full-body views, holding one object with one or both hands, and is recorded in a standing or seated posture. The object is moved through a set of predefined actions.
Each video clip includes a brief verbal description of the object, spoken by the subject.
The subject is clearly visible, and the face is not occluded for extended periods during recording.
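Given the stated per-participant counts (20 clips, with 3 close-up images per object), a delivery could be sanity-checked with a script like the sketch below. The directory layout and file names (`clip.mp4`, `closeup_*.jpg`) are assumptions made for illustration, not the dataset's actual packaging.

```python
# Minimal sketch of a per-participant completeness check, assuming a layout of
# <participant>/<clip_id>/clip.mp4 plus three close-up images per clip. The
# structure and names are hypothetical; the counts (20 clips per participant,
# 3 close-ups per object) come from the description above.
from pathlib import Path

EXPECTED_CLIPS = 20
EXPECTED_CLOSEUPS = 3


def check_participant(participant_dir: Path) -> list[str]:
    """List missing items for one participant (empty list means complete)."""
    problems = []
    clip_dirs = sorted(d for d in participant_dir.iterdir() if d.is_dir())
    if len(clip_dirs) != EXPECTED_CLIPS:
        problems.append(
            f"expected {EXPECTED_CLIPS} clips, found {len(clip_dirs)}"
        )
    for clip in clip_dirs:
        if not (clip / "clip.mp4").exists():
            problems.append(f"{clip.name}: missing clip.mp4")
        closeups = list(clip.glob("closeup_*.jpg"))
        if len(closeups) != EXPECTED_CLOSEUPS:
            problems.append(
                f"{clip.name}: expected {EXPECTED_CLOSEUPS} close-ups, "
                f"found {len(closeups)}"
            )
    return problems
```

Running such a check per participant before annotation or model training helps catch truncated deliveries early.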