About This OTS Dataset
Introduction
Welcome to the Japanese Language Visual Speech Dataset! This dataset is a collection of diverse, unscripted videos of individual speakers, supporting research in visual speech recognition, emotion detection, and multimodal communication.
Dataset Content
This visual speech dataset contains 1,000 Japanese-language videos, each paired with a corresponding high-fidelity audio track. In every video, a participant answers a specific question in an unscripted, spontaneous manner.
• Participant Diversity:
  • Speakers: The dataset includes visual speech data from more than 200 participants from different prefectures of Japan.
  • Regions: Ensures a balanced representation of accents, dialects, and demographics.
  • Participant Profile: Participants range from 18 to 70 years old, with males and females represented in a 60:40 ratio.
Video Data
Extensive guidelines were followed while recording each video to maintain quality and diversity; a short sketch for checking the delivered files against these specifications follows this section.
• Recording Details:
  • File Duration: Videos range from 30 seconds to 3 minutes in duration.
  • Formats: Videos are available in MP4 or MOV format.
  • Resolution: Videos are recorded in ultra-high-definition resolution at 30 fps or above.
  • Device: Videos were recorded on a range of recent Android and iOS devices.
• Recording Conditions: Videos were recorded under various conditions to ensure diversity and reduce bias:
  • Indoor and Outdoor Settings: Includes both indoor and outdoor recordings.
  • Lighting Variations: Captures videos in daytime, nighttime, and varying lighting conditions.
  • Camera Positions: Includes handheld and fixed camera positions, as well as portrait and landscape orientations.
  • Face Orientation: Contains both straight and tilted face angles.
  • Participant Positions: Records participants in both standing and seated positions.
  • Motion Variations: Features both stationary and moving recordings, in which participants pass through different lighting conditions.
  • Occlusions: Includes videos where the participant's face is partially occluded by hand movements, microphones, hair, glasses, or facial hair.
  • Focus: The participant's face remains in focus and within the frame for the full duration of each video.
• Video Content: In each video, the participant answers a specific question in an unscripted manner. These questions are designed to capture a range of emotions, and the dataset contains videos expressing the following emotions:
  • Happy
  • Sad
  • Excited
  • Angry
  • Annoyed
  • Normal
• Question Diversity: For each emotion, participants answered a specific question designed to elicit that particular emotion.
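For teams ingesting the corpus, the file specifications above (MP4/MOV containers, UHD resolution, 30 fps or above, 30-second to 3-minute durations) can be verified programmatically. Below is a minimal Python sketch using OpenCV; the directory layout is an assumption for illustration, not the actual delivery structure.

```python
# Minimal sketch: check each video against the documented specifications.
# DATASET_DIR is hypothetical; point it at your local copy of the dataset.
from pathlib import Path

import cv2  # pip install opencv-python

DATASET_DIR = Path("japanese_visual_speech/videos")  # assumed layout

for video_path in sorted(DATASET_DIR.iterdir()):
    if video_path.suffix.lower() not in {".mp4", ".mov"}:
        continue  # the dataset ships MP4/MOV only
    cap = cv2.VideoCapture(str(video_path))
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    duration_s = frame_count / fps if fps else 0.0
    # Flag anything outside the documented ranges (30+ fps, 30 s to 3 min).
    if fps < 30 or not 30 <= duration_s <= 180:
        print(f"check {video_path.name}: {width}x{height}, "
              f"{fps:.1f} fps, {duration_s:.0f} s")
```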
Metadata
The dataset provides comprehensive metadata for each video recording and participant:
• Participant Metadata: Unique identifier, age, gender, region.
• File Metadata: File name, format, resolution, fps, and duration of each video.
• Recording Environment: Indoor or outdoor, recording time.
• Recording Style: Handheld or fixed camera, straight or tilted face, portrait or landscape, standing or seated, stationary or moving, occluded or not.
• Emotion: One of happy, sad, excited, angry, annoyed, normal.
This metadata is a powerful tool for understanding and characterising the data, enabling informed decision-making in the development of Japanese language visual speech models.
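As one illustration of such metadata-driven decisions, the sketch below selects a subset of recordings by emotion, environment, and speaker age. The metadata file name and field names are assumptions for illustration; map them to the fields in the actual delivery.

```python
# Minimal sketch: filter videos using the per-recording metadata.
# "metadata.json" and the field names (emotion, environment, age) are
# hypothetical; the real delivery format may differ.
import json

with open("metadata.json", encoding="utf-8") as f:
    records = json.load(f)  # assumed: a list of per-video dicts

# Example: indoor clips labelled happy or sad from speakers under 40.
subset = [
    r for r in records
    if r["emotion"] in {"happy", "sad"}
    and r["environment"] == "indoor"
    and r["age"] < 40
]
print(f"selected {len(subset)} of {len(records)} videos")
```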
Usage and Applications
The Japanese Language Visual Speech Dataset serves various applications across different domains:
• Visual Speech Recognition: Enhancing audio-based speech recognition systems with visual information.
• Emotion Recognition: Training and evaluating models for automatic emotion recognition from visual cues (see the sketch after this list).
• Lip-Reading Systems: Improving the accuracy of automatic lip-reading systems, which can be particularly useful in noisy environments where traditional audio-based speech recognition struggles.
• Virtual Reality and Augmented Reality: Enhancing VR and AR experiences by incorporating realistic speech and emotion recognition.
• Generative AI: Training generative AI models for applications such as text-to-video and synthetic data generation.
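To make the emotion-recognition use case concrete, here is a minimal PyTorch Dataset skeleton that pairs one sampled frame per video with its emotion label. The metadata file, field names, and single-frame sampling strategy are illustrative assumptions, not part of the dataset deliverables.

```python
# Minimal sketch: a PyTorch Dataset for frame-level emotion classification.
# Paths and metadata field names are hypothetical.
import json

import cv2
import torch
from torch.utils.data import Dataset

EMOTIONS = ["happy", "sad", "excited", "angry", "annoyed", "normal"]
LABEL_TO_ID = {name: i for i, name in enumerate(EMOTIONS)}

class EmotionClips(Dataset):
    def __init__(self, metadata_path="metadata.json", video_dir="videos"):
        with open(metadata_path, encoding="utf-8") as f:
            self.records = json.load(f)  # assumed: list of per-video dicts
        self.video_dir = video_dir

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        cap = cv2.VideoCapture(f"{self.video_dir}/{rec['file_name']}")
        # Take the middle frame as a cheap single-frame baseline.
        middle = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) // 2
        cap.set(cv2.CAP_PROP_POS_FRAMES, middle)
        ok, frame = cap.read()
        cap.release()
        if not ok:
            raise RuntimeError(f"could not read {rec['file_name']}")
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        image = torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0
        return image, LABEL_TO_ID[rec["emotion"]]
```

A real training setup would likely sample multiple frames or short clips per video and resize them to a fixed resolution before batching.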
Secure and Ethical Collection
• The data collection process adhered to strict ethical guidelines, ensuring participant privacy and obtaining written, signed consent from all participants.
• The dataset does not include any personally identifiable information about any participant, making it safe to use.
• The videos do not contain any copyrighted material.
Updates and Customization
We understand the importance of evolving datasets to meet diverse research needs. Therefore, the dataset is regularly updated with new videos recorded under various real-world conditions.
•Customization & Custom Collection Options:
•
Transcription:
Audio transcription of each file can be made available upon request.
•
Custom Language:
Similar dataset can be prepared in any specific language.
•
Device Specific Collection:
Similar dataset can be collected through specific mobile operating systems or mobile brands. Apart from that it can be collected through other devices like laptop webcam as well.
Licence
This Japanese Language Visual Speech Dataset, created by FutureBeeAI, is available for commercial use.
Use Cases
Visual Speech Model
AR/VR Apps
Lip Reading Models
Emotion Recognition
Generative AI
Dataset Sample(s)
Samples will be available soon!
Contact us to get the samples immediately for this dataset.