Tamil (India) In-car Speech Dataset

The audio dataset comprises recordings of wake words and commands specific to in-car activities, featuring native Tamil speakers from India. It includes speech data, detailed metadata, and accurate transcriptions.

About this Off-the-shelf Speech Dataset

Introduction

Welcome to the Tamil Language In-car Speech Dataset, a collection of audio recordings designed to support the development of speech recognition models for in-car environments. This dataset aims to support research and innovation in automotive speech technology, enabling robust voice interactions within vehicles for drivers and co-passengers.

Speech Data

This dataset comprises over 5,000 high-quality audio recordings collected from various in-car environments. These recordings include scripted wake words and command-type prompts.

•Participant Diversity:

•Speakers: 50+ native Tamil speakers from the FutureBeeAI Community.

•Regions: Selected to ensure a balanced representation of Tamil accents, dialects, and demographics.

•Participant Profile: Participants range from 18 to 70 years old, with a male-to-female ratio of 60:40.

•Recording Nature: Scripted wake-word and command-type audio recordings.

•Duration: Recordings typically run 5 to 20 seconds each.

•Formats: WAV format, mono channel, 16-bit depth. The dataset includes recordings at 16kHz and 48kHz sample rates.
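
The format stated above (mono, 16-bit WAV at 16kHz or 48kHz) can be verified per file before training. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav_format` is a hypothetical helper for illustration, not part of the dataset or any delivered tooling.

```python
import wave

# Expected values taken from the format description above.
EXPECTED_CHANNELS = 1            # mono
EXPECTED_SAMPLE_WIDTH = 2        # 16 bits = 2 bytes per sample
EXPECTED_RATES = {16000, 48000}  # 16kHz and 48kHz

def check_wav_format(path):
    """Return a list of mismatches between a WAV file and the stated format.

    Hypothetical helper; an empty list means the file matches.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channels={wav.getnchannels()}")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"sample_width_bytes={wav.getsampwidth()}")
        if wav.getframerate() not in EXPECTED_RATES:
            problems.append(f"sample_rate={wav.getframerate()}")
    return problems
```

Running this over every file on delivery and logging any non-empty result is one simple way to catch files that were converted or resampled along the way.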

Dataset Diversity

Apart from participant diversity, the dataset is diverse in terms of different wake words, voice commands, and recording environments.

•Different Automobile Related Wake Words: Hey Mercedes, Hey BMW, Hey Porsche, Hey Volvo, Hey Audi, Hi Genesis, Hey Mini, Hey Toyota, Ok Ford, Hey Hyundai, Ok Honda, Hello Kia, Hey Dodge.

•Different Cars: Data collection was carried out in different types and models of cars.

•Different Types of Voice Commands:

•Navigational Voice Commands

•Mobile Control Voice Commands

•Car Control Voice Commands

•Multimedia & Entertainment Commands

•General, Question Answer, Search Commands

•Recording Time: Participants recorded the given prompts at various times to make the dataset more diverse.

•Morning

•Afternoon

•Evening

•Recording Environment: Recordings were captured in a range of environments to obtain more realistic data and to cover various types of noise. Some of the environment variables are as follows:

•Noise Level: Silent, Low Noise, Moderate Noise, High Noise

•Parking Location: Indoor, Outdoor

•Car Windows: Open, Closed

•Car AC: On, Off

•Car Engine: On, Off

•Car Movement: Stationary, Moving
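
When planning evaluation splits, the environment variables listed above can be treated as independent factors. The sketch below enumerates every combination of the listed values; the dictionary keys and the function name `condition_grid` are hypothetical, and the source does not state whether the dataset covers every combination (some, such as a moving car with the engine off, are unlikely to occur).

```python
from itertools import product

# Environment variables and values as listed above.
# The snake_case key names are illustrative, not the dataset's own field names.
ENVIRONMENT = {
    "noise_level": ["Silent", "Low Noise", "Moderate Noise", "High Noise"],
    "parking_location": ["Indoor", "Outdoor"],
    "car_windows": ["Open", "Closed"],
    "car_ac": ["On", "Off"],
    "car_engine": ["On", "Off"],
    "car_movement": ["Stationary", "Moving"],
}

def condition_grid(env=ENVIRONMENT):
    """Return one dict per combination of environment values."""
    keys = list(env)
    return [dict(zip(keys, values)) for values in product(*(env[k] for k in keys))]

grid = condition_grid()
print(len(grid))  # 4 * 2 * 2 * 2 * 2 * 2 = 128
```

Comparing this grid against the conditions actually present in the metadata shows which environments are well represented and which may need custom collection.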

Metadata

The dataset provides comprehensive metadata for each audio recording and participant:

•Participant Metadata: Unique identifier, age, gender, country, state, district, accent, and dialect.

•Other Metadata: Recording transcript, recording environment, device details, sample rate, bit depth, file format, recording time.

This metadata helps characterize the data, for example by filtering or stratifying recordings by speaker or environment attributes, and supports informed decisions when developing Tamil voice assistant speech recognition models.
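
As an illustration of how such metadata might be used, the sketch below filters recordings by metadata fields. The record layout, key names, file names, and the `select` helper are all hypothetical; the dataset's actual metadata schema and file format are not specified here.

```python
# Hypothetical records: the keys and values below are illustrative only
# and do not reflect the dataset's actual metadata schema.
records = [
    {"file": "rec_0001.wav", "gender": "female", "noise_level": "Silent", "sample_rate": 16000},
    {"file": "rec_0002.wav", "gender": "male", "noise_level": "High Noise", "sample_rate": 48000},
    {"file": "rec_0003.wav", "gender": "male", "noise_level": "Silent", "sample_rate": 16000},
]

def select(records, **criteria):
    """Keep records whose fields equal every given criterion."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

# Example: pick quiet 16kHz recordings for a clean-speech baseline.
quiet_16k = select(records, noise_level="Silent", sample_rate=16000)
print([r["file"] for r in quiet_16k])
```

The same pattern extends to any field listed above, such as age, dialect, or device details, once the real field names are known.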

Usage and Applications

This In-car Speech Dataset is a valuable resource for various applications in the field of in-car voice recognition and AI-driven voice technology. This dataset can be leveraged to enhance the performance and functionality of voice-activated systems across different domains.

•Speech Recognition Model Training: Provides high-quality audio data for training models to accurately recognize and respond to in-car voice commands.

•Safety and Emergency Response: Supports the development of systems that recognize and respond to emergency commands and safety alerts.

•Driver Assistance: Facilitates the creation of advanced driver-assistance systems (ADAS) that leverage voice commands for hands-free operation.

Secure and Ethical Collection

•Our proprietary data collection platform, “Yugo,” was used throughout the creation of this dataset.

•Throughout the data collection process, the data remained within our secure platform and did not leave our environment, ensuring data security and confidentiality.

•The data collection process adhered to strict ethical guidelines, ensuring the privacy and consent of all participants.

•The dataset does not include any personally identifiable information about participants, which makes it safe to use.

Updates and Customization

Because diverse environments are important for robust voice assistant models, we regularly update our in-car voice dataset with new audio data captured in various real-world conditions.

•Customization & Custom Collection Options:

•Environmental Conditions: Custom collection in specific environmental conditions upon request.

•Sample Rates: Customizable from 8kHz to 48kHz.

•Diverse Pace: Custom collection can be done at a diverse pace upon request.

•Device Specific: Recordings can be made with a specific mobile brand or operating system upon request.