Hindi Visual Question-Answer Dataset

The multimodal dataset features a diverse array of images, each accompanied by various types of question-answer pairs specific to the image. It includes image data, text-based question-answer pairs, and comprehensive metadata.

Total Volume: 35,000+ question-answer pairs
Last updated: Aug 2024
Number of participants: 100+

About This OTS Dataset

Introduction

Welcome to the Hindi Language Visual Question Answer Dataset. The dataset includes 5,000 diverse images and more than 35,000 associated question-answer pairs. It is curated to support AI models for multimodal data understanding and the development of Hindi-language visual question-answering (VQA) models.

Image Data

This image question-answer training dataset comprises over 5,000 high-resolution images across diverse categories and scenes. Each image is carefully selected to represent a wide array of contexts, objects, and environments, ensuring comprehensive coverage for training robust VQA models.

• Image Data Information: The images in this dataset were collected through a rigorous and ethical process.
  • Clarity: Each image was checked for visual clarity and appropriateness.
  • Relevance: Images were selected for their relevance to regions where Hindi is predominantly spoken and to likely VQA scenarios, so that they depict a wide range of real-world contexts and objects.
  • Copyright-Free: The images in the dataset are free from any copyright issues.
  • Format: Images in the dataset are available in various formats, such as JPEG, PNG, and HEIC.
  • Type: The dataset contains images with both graphical and textual content.

• Categories and Topics: The dataset spans a wide range of categories and topics to ensure thorough training, fine-tuning, and testing of VQA models. Topics include:
  • Daily Life: Images about household objects, activities, and daily routines.
  • Nature and Environment: Images related to natural scenes, plants, animals, and weather.
  • Technology and Gadgets: Images about electronic devices, tools, and machinery.
  • Human Activities: Images about people, their actions, professions, and interactions.
  • Geography and Landmarks: Images related to specific locations, landmarks, and geographic features.
  • Food and Dining: Images about different foods, meals, and dining settings.
  • Education: Images related to educational settings, materials, and activities.
  • Sports and Recreation: Images about various sports, games, and recreational activities.
  • Transportation: Images about vehicles, travel methods, and transportation infrastructure.
  • Cultural and Historical: Images about cultural artifacts, historical events, and traditions.

Question and Answer Pairs

The dataset includes more than 35,000 Hindi-language question-answer pairs, which works out to roughly 7 to 10 pairs per image. The pairs are crafted to cover various levels of complexity and types of questions, and are designed to test and improve a model's ability to understand and respond to visual inputs in natural language.

• Types of Questions: The dataset includes a diverse set of question types to ensure comprehensive model training:
  • Descriptive Questions: These questions seek detailed descriptions of objects, people, or scenes within the image.
  • Counting Questions: These questions involve counting the number of specific objects or elements present in the image.
  • Yes/No Questions: These questions require a binary yes or no answer based on the visual content.
  • Location-Based Questions: These questions focus on identifying the location of objects or elements within the image.
  • Object Recognition Questions: These questions ask for the identification or naming of objects in the image.
  • Action-Based Questions: These questions pertain to actions or activities occurring in the image.
  • Comparison Questions: These questions involve comparing attributes of different objects or elements within the image.
  • Reasoning Questions: These questions require inference or deduction from the visual information.
  • Sentiment Questions: These questions focus on the emotions or sentiments of people depicted in the image.
  • Hypothetical Questions: These questions ask the model to imagine a scenario or predict outcomes based on the current image.
  • Detail-Specific Questions: These questions focus on very specific details within the image, testing attention to fine details.
  • Functionality Questions: These questions ask about the purpose or function of objects within the image.
  • Contextual Questions: These questions require understanding the broader context or background information related to the image.

• Types of Answers: The dataset includes a diverse set of answer types to ensure unbiased model training:
  • Single-Word Answers: These are concise answers consisting of one word, often used for object names, locations, or yes/no responses.
  • Short Phrase Answers: These answers provide a brief explanation or description, typically two to three words long.
  • Full Sentence Answers: These are single-sentence answers that provide detailed information.
  • Descriptive Answers: These answers provide detailed descriptions of objects, people, or scenes within the image.
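The answer types above are distinguished mainly by length, so a rough length-based check can help when auditing or rebalancing a split. The sketch below is a heuristic under stated assumptions, not the labelling rule used for this dataset: the function name `guess_answer_type`, the label strings, and the cut-off between full-sentence and descriptive answers (one sentence versus several) are all hypothetical.

```python
import re


def guess_answer_type(answer: str) -> str:
    """Heuristically label an answer by its length (thresholds are assumptions)."""
    text = answer.strip()
    words = text.split()
    if not words:
        raise ValueError("answer is empty")
    # Split on the Hindi danda and on common Latin sentence punctuation.
    sentences = [s for s in re.split(r"[।.!?]+", text) if s.strip()]
    if len(words) == 1:
        return "single_word"
    if len(words) <= 3:
        return "short_phrase"
    if len(sentences) == 1:
        return "full_sentence"
    # Assumption: anything longer than one sentence counts as descriptive.
    return "descriptive"


# Illustrative answers only; not taken from the dataset.
print(guess_answer_type("हाँ"))
print(guess_answer_type("ऑटो रिक्शा"))
print(guess_answer_type("थाली में तीन रोटियाँ रखी हैं।"))
```

Where the dataset's own answer-type metadata is available, it should take precedence over a heuristic like this one.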

Metadata

Each image-question-answer pair is accompanied by comprehensive metadata to facilitate informed decision-making in model development:

• Image Metadata: File name and category.
• Question Metadata: Question text and question type.
• Answer Metadata: Correct answer and answer type.
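To make the metadata fields concrete, here is a minimal sketch of how one image-question-answer record might be represented and how pairs could be counted per image. The class name `VQARecord`, the field names, the file names, and the sample values are hypothetical; the delivered schema and file layout may differ.

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical field names; the dataset's actual schema may differ.
@dataclass(frozen=True)
class VQARecord:
    file_name: str      # image metadata: file name
    category: str       # image metadata: category
    question: str       # question text in Hindi
    question_type: str  # question metadata: question type
    answer: str         # correct answer in Hindi
    answer_type: str    # answer metadata: answer type


def pairs_per_image(records):
    """Count how many question-answer pairs each image file has."""
    return Counter(r.file_name for r in records)


# Illustrative records only; not taken from the dataset.
records = [
    VQARecord("img_0001.jpg", "Food and Dining",
              "थाली में कितनी रोटियाँ हैं?", "Counting", "तीन", "Single-Word"),
    VQARecord("img_0001.jpg", "Food and Dining",
              "क्या मेज़ पर पानी का गिलास है?", "Yes/No", "हाँ", "Single-Word"),
    VQARecord("img_0002.png", "Transportation",
              "यह कौन सा वाहन है?", "Object Recognition", "ऑटो रिक्शा", "Short Phrase"),
]

counts = pairs_per_image(records)
print(counts["img_0001.jpg"])  # 2
```

A per-image count like this is one way to check the stated figure of roughly 7 to 10 pairs per image against a delivered copy of the data.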

Usage and Applications

The Hindi Language Visual Question Answer Dataset is designed to support a wide range of applications, including but not limited to:

• Training VQA Models: Providing high-quality data for training models to understand and answer questions based on visual inputs.
• Accessibility Tools: Creating tools to assist visually impaired individuals by translating visual information into textual descriptions or answering questions about images, making digital content more accessible.
• Augmented Reality (AR) and Virtual Reality (VR): Enhancing AR and VR applications by enabling users to interact with virtual environments through visual queries.
• Content Moderation: Improving automated content moderation systems by training them to accurately interpret and respond to visual content.

Secure and Ethical Collection

• Our proprietary platform, “Yugo”, was used throughout the creation of this dataset.

• Throughout the dataset creation process, the data remained within our secure platform and did not leave our environment, ensuring data security and confidentiality.

• The dataset does not include any personally identifiable information, which makes it safe to use.

• The content included in the dataset does not infringe upon any copyrights or intellectual property rights.

Updates and Customization

We understand the importance of evolving datasets to meet diverse research needs. Therefore, our dataset is regularly updated with new images and question-answer pairs captured in various real-world conditions.

• Customization & Custom Collection Options:
  • Image Categories: New images can be added in any specific category, and question-answer pairs can be generated for them as required.
  • Custom Language: A similar dataset can be prepared in any specific language.
  • Annotation: Various types of image annotation, such as Object Classification, Bounding Box Annotation, Key Point Annotation, and Semantic Annotation, can be made available upon request, along with text annotations on question-answer pairs, including Part-of-Speech tagging, Named Entity Recognition (NER), or other application-specific annotations.