Newspaper, Magazine, and Books Image OCR Dataset with Bahasa Text

This OCR dataset contains diverse images of Bahasa-language text from newspapers, magazines, and books. Detailed metadata is included with the images.

Volume

5K+ images

Last Updated

Aug 2023

Types

Diverse types

About This OTS Dataset

What’s Included

The Bahasa Newspaper, Books, and Magazine Image Dataset is a diverse, curated collection of images for developing text recognition and optical character recognition (OCR) models for the Bahasa language.

Dataset Content & Diversity:

Containing a total of 5,000 images, this Bahasa OCR dataset is distributed equally across newspapers, books, and magazines. The content includes articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.

To ensure the diversity of the dataset and to support robust text recognition models, we allow only a limited number (fewer than five) of unique images from any single source. Stringent measures have been taken to exclude personally identifiable information (PII), and in each image at least 80% of the space contains visible Bahasa text.

Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.

All images were captured by native Bahasa speakers to ensure text quality and to avoid toxic content and PII. The images were taken with recent iOS and Android mobile devices with cameras above 5 MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.
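Because the images come in two formats, it can help to inventory files by format before training, since some image libraries may not decode HEIC without an extra plugin. The following is a minimal sketch, not part of the dataset tooling: the function name `split_by_format` and the example file names are hypothetical, and only the file extension is inspected.

```python
from pathlib import PurePath

# Extensions treated as each format (case-insensitive).
JPEG_EXTS = {".jpg", ".jpeg"}
HEIC_EXTS = {".heic"}


def split_by_format(file_names):
    """Group file names into 'jpeg', 'heic', and 'other' by extension."""
    groups = {"jpeg": [], "heic": [], "other": []}
    for name in file_names:
        ext = PurePath(name).suffix.lower()
        if ext in JPEG_EXTS:
            groups["jpeg"].append(name)
        elif ext in HEIC_EXTS:
            groups["heic"].append(name)
        else:
            groups["other"].append(name)
    return groups


# Hypothetical file names, for illustration only.
groups = split_by_format(["page_001.JPG", "page_002.heic", "notes.txt"])
```

Files in the `heic` group can then be routed through whatever conversion step your pipeline uses.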

Metadata:

Along with the image data, you will also receive detailed, structured metadata in CSV format. For each image, it includes fields such as device information, source type (newspaper, magazine, or book), and image type (portrait or landscape). Each image file is renamed to correspond to its metadata entry.

The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Bahasa text recognition models.
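As an illustration of how the CSV metadata might be used, the sketch below counts images per source type and selects a subset. The column names (`file_name`, `device`, `source_type`, `orientation`) and the sample rows are assumptions made for this example; the actual header names in the delivered CSV may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; real column names and values may differ.
SAMPLE_CSV = """file_name,device,source_type,orientation
img_0001.jpg,iPhone 13,newspaper,portrait
img_0002.heic,Pixel 6,magazine,landscape
img_0003.jpg,Galaxy S21,book,portrait
img_0004.jpg,iPhone 12,newspaper,landscape
"""


def load_metadata(handle):
    """Read CSV rows into a list of dicts keyed by the header names."""
    return list(csv.DictReader(handle))


def count_by(rows, column):
    """Count how many rows share each value of the given column."""
    return Counter(row[column] for row in rows)


def select(rows, **criteria):
    """Keep rows whose columns equal every given criterion."""
    return [
        row for row in rows
        if all(row[key] == value for key, value in criteria.items())
    ]


rows = load_metadata(io.StringIO(SAMPLE_CSV))
source_counts = count_by(rows, "source_type")
portrait_newspapers = select(rows, source_type="newspaper", orientation="portrait")
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle, for example `open(path, newline="", encoding="utf-8")`.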

Update & Custom Collection:

We're committed to expanding this dataset by continuously adding more images with the assistance of our native Bahasa crowd community.

If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

Furthermore, our crowd community can annotate the images with bounding boxes or transcribe the text in the images to match your specific requirements.

License:

This image dataset, created by FutureBeeAI, is available for commercial use.

Conclusion:

Use this image dataset to improve the training and performance of text recognition, text detection, and optical character recognition models for the Bahasa language.

Use Cases

Data extraction

OCR

Text Recognition

Document processing

Dataset Sample(s)

Samples will be available soon!

Dataset Details

Dataset type

Printed newspaper, magazine & book

Volume

5K+ images

Media type

Image

Language

Bahasa

Type

Diverse types

Image File Details

Environment

Indoor & Outdoor

Diversity

Different lighting...

Format

JPEG, HEIC

Device

Android & iOS

Annotation

Type

Printed

Similar to Newspaper, Magazine & Books Image Datasets

Korean Newspaper, Magazine & Books Image Dataset

Korean OCR dataset with newspaper, magazine and book images.

5K+ images

Diverse types

Data extraction

OCR

Gujarati Newspaper, Magazine & Books Image Dataset

Gujarati OCR dataset with newspaper, magazine and book images.

5K+ images

Diverse types

Data extraction

OCR

Tamil Newspaper, Magazine & Books Image Dataset

Tamil OCR dataset with newspaper, magazine and book images.

5K+ images

Diverse types

Data extraction

OCR

Malayalam Newspaper, Magazine & Books Image Dataset

Malayalam OCR dataset with newspaper, magazine and book images.

5K+ images

Diverse types

Data extraction

OCR

Need datasets for a specific AI/ML use case? Don’t worry, we’ve got you covered! 👍

More in Bahasa

Punjabi Newspaper, Magazine & Books Image Dataset

English Product Image OCR Dataset

German Newspaper, Magazine & Books Image Dataset

Korean Newspaper, Magazine & Books Image Dataset
