Creating Synthetic Chat Dataset for Content Moderation Models

01 November 2024


Client's Challenge & Our Solution

A leading technology company sought to develop a content moderation model for detecting and addressing cyberbullying and online sexual harassment involving children. The challenge was to obtain realistic training data on a highly sensitive topic while ensuring ethical compliance and protecting privacy. During the consultation phase, the client and FutureBeeAI agreed that synthetic data generation was the best way to address these constraints.

FutureBeeAI collaborated with the client to define data diversity requirements based on extensive initial research. We then generated 2,000 synthetic English-language chats, each 50 to 150 turns long, simulating real-world scenarios of cyberbullying and harassment. These conversations were carefully crafted to capture the nuances of such interactions, so that they would be relevant and useful for training content moderation algorithms.
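
As an illustration only, the dataset figures stated above (2,000 chats, 50 to 150 turns each, English) could be checked at delivery time with a small validation script. The sketch below is an assumption about how that might look, not a description of FutureBeeAI's actual pipeline: the `Chat` record layout, the `"en"` language code, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Figures taken from the case study text.
EXPECTED_CHATS = 2000
MIN_TURNS, MAX_TURNS = 50, 150


@dataclass
class Chat:
    """Hypothetical record layout; the real delivery schema is not described."""
    chat_id: str
    language: str      # assumed to be an ISO 639-1 code such as "en"
    turns: List[str]   # one entry per conversational turn


def check_chat(chat: Chat) -> List[str]:
    """Return a list of human-readable problems found in one chat."""
    problems = []
    if chat.language != "en":
        problems.append(f"{chat.chat_id}: language is {chat.language!r}, expected 'en'")
    if not MIN_TURNS <= len(chat.turns) <= MAX_TURNS:
        problems.append(
            f"{chat.chat_id}: {len(chat.turns)} turns, expected {MIN_TURNS}-{MAX_TURNS}"
        )
    return problems


def check_dataset(chats: List[Chat]) -> List[str]:
    """Check the overall chat count, then every chat individually."""
    problems = []
    if len(chats) != EXPECTED_CHATS:
        problems.append(f"dataset has {len(chats)} chats, expected {EXPECTED_CHATS}")
    for chat in chats:
        problems.extend(check_chat(chat))
    return problems
```

An empty list from `check_dataset` would mean the delivery matches the stated size, length, and language requirements; it says nothing about the quality or diversity of the conversations themselves, which would need separate review.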