British English Visual Speech Dataset

This multimodal dataset offers a collection of unscripted, high-definition videos featuring native British English speakers responding to open-ended questions. The videos capture a wide range of emotions and come with detailed metadata for each recording and participant.

  • Category: Visual Speech Dataset
  • Total Volume: 1,000+ Videos
  • Last updated: Aug 2024
  • Number of participants: 200+

About This OTS Dataset

Introduction

Welcome to the UK English Language Visual Speech Dataset! This dataset is a diverse collection of single-person, unscripted spoken videos that supports research in visual speech recognition, emotion detection, and multimodal communication.

Dataset Content

This visual speech dataset contains 1,000 videos in UK English, each paired with a corresponding high-fidelity audio track. In each video, a participant answers a specific question spontaneously, without a script.

  • Participant Diversity:
  • Speakers: The dataset includes visual speech data from more than 200 participants from different regions of the United Kingdom.
  • Regions: Ensures a balanced representation of accents, dialects, and demographics.
  • Participant Profile: Participants range from 18 to 70 years old, with a male-to-female ratio of 60:40.
    Video Data

    Each video was recorded under extensive guidelines to maintain quality and diversity.

  • Recording Details:
  • File Duration: Each video runs between 30 seconds and 3 minutes.
  • Formats: Videos are available in MP4 or MOV format.
  • Resolution: Videos are recorded in ultra-high-definition resolution at 30 fps or above.
  • Device: Recent Android and iOS devices were used for this collection.
  • Recording Conditions: Videos were recorded under various conditions to ensure diversity and reduce bias:
  • Indoor and Outdoor Settings: Includes both indoor and outdoor recordings.
  • Lighting Variations: Captures videos in daytime, nighttime, and varying lighting conditions.
  • Camera Positions: Includes handheld and fixed camera positions, as well as portrait and landscape orientations.
  • Face Orientation: Contains straight face and tilted face angles.
  • Participant Positions: Records participants in both standing and seated positions.
  • Motion Variations: Features both stationary and moving recordings; in the moving recordings, participants pass through different lighting conditions.
  • Occlusions: Includes videos where the participant's face is partially occluded by hand movements, microphones, hair, glasses, and facial hair.
  • Focus: In each video, the participant's face remains in focus and within the frame for the full duration.
  • Video Content: In each video, the participant answers a specific question in an unscripted manner. The questions are designed to elicit a range of emotions. The dataset contains videos expressing the following human emotions:
  • Happy
  • Sad
  • Excited
  • Angry
  • Annoyed
  • Normal
  • Question Diversity: For each emotion, participants answered a specific question designed to elicit that emotion.
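
    The recording details listed above (MP4 or MOV format, 30 seconds to 3 minutes, 30 fps or above) can be turned into a simple acceptance check when ingesting files. The following is a minimal sketch, not part of the dataset tooling: the `ClipInfo` class, its field names, and the `spec_violations` helper are hypothetical, and the duration and frame rate are assumed to have been read from the file beforehand with a tool of your choice.

```python
from dataclasses import dataclass
from typing import List

# Limits taken from the recording details in this page.
ALLOWED_FORMATS = {"mp4", "mov"}
MIN_DURATION_S = 30.0    # 30 seconds
MAX_DURATION_S = 180.0   # 3 minutes
MIN_FPS = 30.0


@dataclass
class ClipInfo:
    """Hypothetical container for properties already read from a video file."""
    filename: str
    duration_s: float
    fps: float


def spec_violations(clip: ClipInfo) -> List[str]:
    """Return a list of human-readable problems; an empty list means the clip fits the spec."""
    problems = []

    # Take the text after the last dot as the extension, if there is one.
    ext = clip.filename.rsplit(".", 1)[-1].lower() if "." in clip.filename else ""
    if ext not in ALLOWED_FORMATS:
        problems.append(f"unexpected format: {ext or 'none'}")

    if not (MIN_DURATION_S <= clip.duration_s <= MAX_DURATION_S):
        problems.append(f"duration out of range: {clip.duration_s:.1f}s")

    if clip.fps < MIN_FPS:
        problems.append(f"frame rate too low: {clip.fps:.1f} fps")

    return problems
```

    A call such as `spec_violations(ClipInfo("clip.mp4", 45.0, 30.0))` returns an empty list, while a clip that fails several checks gets one message per failed check.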
    Metadata

    The dataset provides comprehensive metadata for each video recording and participant:

  • Participant Metadata: Unique identifier, age, gender, region.
  • File Metadata: File name, format, resolution, fps, duration.
  • Recording Environment: Indoor or outdoor, recording time.
  • Recording Style: Handheld or fixed, straight or tilted face, portrait or landscape, standing or seated, stationary or moving, occluded or not.
  • Emotion: One of happy, sad, excited, angry, annoyed, normal.
    This metadata helps users understand and characterise the data, supporting informed decisions when developing UK English visual speech models.
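
    As an illustration of how the metadata fields above might be used, the sketch below models one video as a record and filters a collection by emotion and recording conditions. The delivery format of the metadata is not specified here, so the `VideoRecord` class, its field names, and the sample values (participant IDs, regions, file names) are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Emotion labels as listed in the metadata description.
EMOTIONS = ("happy", "sad", "excited", "angry", "annoyed", "normal")


@dataclass(frozen=True)
class VideoRecord:
    """Hypothetical per-video record; field names are illustrative only."""
    participant_id: str
    age: int
    gender: str
    region: str
    filename: str
    environment: str   # "indoor" or "outdoor"
    camera: str        # "handheld" or "fixed"
    occluded: bool
    emotion: str       # one of EMOTIONS


def select(
    records: Iterable[VideoRecord],
    emotion: Optional[str] = None,
    environment: Optional[str] = None,
    occluded: Optional[bool] = None,
) -> List[VideoRecord]:
    """Keep records matching every filter that is not None."""
    if emotion is not None and emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion}")
    return [
        r for r in records
        if (emotion is None or r.emotion == emotion)
        and (environment is None or r.environment == environment)
        and (occluded is None or r.occluded == occluded)
    ]


def emotion_counts(records: Iterable[VideoRecord]) -> Counter:
    """Count videos per emotion label, e.g. to check class balance."""
    return Counter(r.emotion for r in records)


# Made-up sample rows, not taken from the dataset.
SAMPLE = [
    VideoRecord("p001", 34, "male", "region_a", "p001_happy.mp4", "indoor", "fixed", False, "happy"),
    VideoRecord("p002", 52, "female", "region_b", "p002_sad.mov", "outdoor", "handheld", True, "sad"),
    VideoRecord("p003", 27, "male", "region_a", "p003_normal.mp4", "outdoor", "fixed", False, "normal"),
]
```

    With records in this shape, `select(SAMPLE, environment="outdoor", occluded=False)` picks out unoccluded outdoor clips, and `emotion_counts(SAMPLE)` gives a quick view of label balance before training.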

    Usage and Applications

    The UK English Language Visual Speech Dataset serves various applications across different domains:

  • Visual Speech Recognition: Enhancing audio-based speech recognition systems with visual information.
  • Emotion Recognition: Training and evaluating models for automatic emotion recognition from visual cues.
  • Lip-Reading Systems: Improving the accuracy of automatic lip-reading systems, which can be particularly useful in noisy environments where traditional audio-based speech recognition struggles.
  • Virtual Reality and Augmented Reality: Enhancing VR and AR experiences by incorporating realistic speech and emotion recognition.
  • Generative AI: Training generative AI models for applications such as text-to-video and synthetic data generation.
    Secure and Ethical Collection

  • The data collection process adhered to strict ethical guidelines: participants' privacy was protected, and every participant gave signed written consent.
  • The dataset does not include any personally identifiable information about any participant, making it safe to use.
  • The videos do not contain any copyrighted material.
    Updates and Customization

    We understand the importance of evolving datasets to meet diverse research needs. Therefore, our dataset is regularly updated with new videos in various real-world conditions.

  • Customization & Custom Collection Options:
  • Transcription: Audio transcription of each file can be made available upon request.
  • Custom Language: A similar dataset can be prepared in any specific language.
  • Device-Specific Collection: A similar dataset can be collected on specific mobile operating systems or mobile brands, or on other devices such as laptop webcams.
    Licence

    This UK English Language Visual Speech Dataset, created by FutureBeeAI, is available for commercial use.

    Use Cases

  • Visual Speech Model
  • AR/VR Apps
  • Lip Reading Models
  • Emotion Recognition
  • Generative AI

    Dataset Sample(s)

    Samples will be available soon!

    Contact us to get the samples immediately for this dataset.


    Dataset Details

  • Dataset Type: Audio Visual Speech Dataset
  • Language: English
  • Language code: en-gb
  • Country: UK
  • Accents: English - East and C,...more
  • Gender Distribution: M:60, F:40

    Video File Details

  • Video Format: MP4, MOV
  • Environment: Silent
  • Video Duration: 30 Sec to 3 Min
  • FPS: 30+ FPS
  • Device: Android & iOS
  • Recording Condition: Diverse

    Need datasets for a specific AI/ML use case? Don’t worry, we’ve got you covered! 👍

    Contact Us
