In the competition among AI large language models, datasets play a crucial role. These models require large-scale, high-quality data, and effective data handling is a key factor in their success. However, as datasets continue to grow, the complexity of data management also increases, leading to a range of issues such as a shortage of high-quality data, data security risks, and data compliance challenges.
The Harm of ‘Dirty’ Data
Pre-training a large language model means learning extensive knowledge from massive text data and storing it in the model’s parameters. The data used for pre-training can be divided into two categories.
The first is web data, the most readily available kind: data-centric companies such as Baidu and Google crawl large volumes of web content every day. It is massive in quantity and diverse in content, but often contains various forms of ‘dirty’ data. The second is proprietary data, specific to a particular domain, language, or industry, such as dialogues, books, code, technical reports, research papers, and exams. This type of data is relatively scarce but cleaner and of higher professional quality.
Most of the data used by large language models is sourced from web crawling, so it contains a significant amount of ‘dirty’ data. Dirty data holds back model performance: an equivalent amount of carefully cleaned and annotated data can achieve better results, whereas ‘dirty’ data cannot break through this performance bottleneck.
Additionally, large generative language models such as ChatGPT draw their training data primarily from open-access web repositories, so the content they generate may reproduce sensitive information found on the internet. This leads to security vulnerabilities and potential misinformation in their output.
A YouTuber has demonstrated how to get ChatGPT to produce valid Windows 95 CD keys, highlighting the significant security concerns associated with such models. Consequently, data cleaning is an essential step in preparing the training data for large language models.
What is Data Cleaning
Data cleaning, also known as data preprocessing, is a critical step in data analysis and machine learning. It involves inspecting, transforming, and repairing raw data to ensure its quality, accuracy, and consistency.
The primary goal of data cleaning is to eliminate or correct errors, noise, missing values, duplicates, inconsistencies, and other imperfections in the data, making it suitable for further analysis, modeling, and mining. The most crucial part of a dataset is data that is highly relevant to the model’s task, diverse, and of high quality.
Since collected data can have issues such as missing values, noise, and duplicates, massive datasets cannot be used directly for large language models. They must first go through processes such as cleaning and labeling to produce suitable datasets, which only become truly useful when combined with algorithms, computing power, and more. Taking GPT-3 as an example, its raw data amounted to 45 TB, while the high-quality data left after cleaning was only 570 GB. To put it in perspective, only around 1% of the raw data becomes part of the corpus after cleaning.
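That retention figure is easy to sanity-check. A back-of-the-envelope calculation, using only the 45 TB and 570 GB figures cited above:

```python
# Rough retention ratio for GPT-3's pre-training corpus,
# using the figures cited above (45 TB raw, 570 GB after cleaning).
raw_gb = 45 * 1024   # 45 TB expressed in GB
clean_gb = 570       # high-quality data remaining after cleaning

retention = clean_gb / raw_gb
print(f"Retention: {retention:.2%}")  # ~1.24%, i.e. "around 1%"
```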
How to Clean Data
Step 1: Using traditional code scripts with regular-expression filtering to select useful data and clean out ‘dirty’ data
When training large language models, data cleaning is of utmost importance, because the quality and performance of the model depend on high-quality, clean data. Data cleaning methods include selecting high-quality data sources, normalizing text, removing HTML markup and tags, filtering out stop words and noise words, correcting spelling errors, scrubbing sensitive information, detecting and handling outliers, evaluating data quality, and documenting the cleaning process. These steps improve the accuracy, stability, and security of the model, ensuring that its output is more reliable and of higher quality.
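The sketch below shows what such rule-based filtering might look like in practice. It is a minimal illustration: the regular expressions, thresholds, and the `clean_text` helper are assumptions for demonstration, not a production pipeline.

```python
import re
import unicodedata

HTML_TAG = re.compile(r"<[^>]+>")                # leftover HTML markup
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # simple sensitive-info pattern
WHITESPACE = re.compile(r"\s+")

def clean_text(text: str) -> str | None:
    """Normalize one raw web document; return None if it should be dropped."""
    text = unicodedata.normalize("NFKC", text)   # Unicode normalization
    text = HTML_TAG.sub(" ", text)               # remove HTML tags
    text = EMAIL.sub("[EMAIL]", text)            # mask sensitive information
    text = WHITESPACE.sub(" ", text).strip()     # collapse whitespace

    # Heuristic quality filters (thresholds are illustrative assumptions).
    if len(text) < 50:                           # too short to be useful
        return None
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    if alpha_ratio < 0.5:                        # mostly symbols or boilerplate
        return None
    return text

raw_docs = ["<p>Hello   world!</p> Contact: someone@example.com", "### $$$ ###"]
cleaned = [c for c in (clean_text(d) for d in raw_docs) if c is not None]
```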
Step 2: Using AI models to filter out harmful data
Currently, one of the mainstream data cleaning methods is to leverage AI models for data cleaning. This involves using machine learning techniques to train intelligent agents to automatically identify and clean data, optimizing the allocation of work between humans and machines in data cleaning. This way, some human tasks such as categorization, filtering, and labeling can be performed by machines, often with high accuracy.
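A minimal sketch of this idea, assuming a small human-labeled seed set and scikit-learn (the texts, labels, and pipeline below are illustrative, not any particular vendor’s setup):

```python
# Train a simple text classifier on a human-labeled seed set,
# then let it pre-filter the rest of the corpus automatically.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Human-labeled seed set: 1 = keep, 0 = discard (spam, boilerplate, harmful text).
seed_texts = ["A well-written article about climate science.",
              "BUY NOW!!! cheap pills $$$ click here",
              "Interview transcript on renewable energy policy.",
              "FREE $$$ WIN WIN WIN click click click"]
seed_labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(seed_texts, seed_labels)

# The trained model now pre-screens unlabeled documents at scale.
candidates = ["Lecture notes on statistics.", "cheap pills free $$$"]
keep = [doc for doc, label in zip(candidates, clf.predict(candidates)) if label == 1]
```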
Additionally, the Bayesian classification algorithm is also employed for data cleaning. It is an algorithm that uses probability and statistical knowledge for classification and is known for its high accuracy and speed, making it suitable for data organization and statistical analysis. Bayesian-related algorithms and techniques are effective in distinguishing between clean and unclean data and have become an important part of data cleaning.
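As an illustration, a naive Bayes text classifier can be trained to separate ‘clean’ from ‘dirty’ text; the training examples below are illustrative assumptions:

```python
# Multinomial naive Bayes as a clean/dirty text filter.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["Peer-reviewed study on protein folding.",
         "win win win free prize casino bonus",
         "Technical report describing the experimental setup.",
         "!!! hot deals no risk guaranteed $$$ !!!"]
labels = ["clean", "dirty", "clean", "dirty"]

nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(texts, labels)

print(nb.predict(["limited offer, claim your free bonus now"]))  # likely 'dirty'
```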
Moreover, text-recognition and classification techniques are being explored for data cleaning. For example, decision tree and random forest algorithms can flag poor-quality data based on document-level features. These algorithmic approaches strengthen data analysis in specific domains and allow faster deployment.
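A sketch of feature-based quality scoring with a random forest; the hand-crafted features and labels are illustrative assumptions:

```python
# Score document quality from simple hand-crafted features with a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def features(text: str) -> list[float]:
    words = text.split()
    return [len(words),                                          # document length
            len(set(words)) / max(len(words), 1),                # lexical diversity
            sum(c.isupper() for c in text) / max(len(text), 1)]  # uppercase ratio

docs = ["A detailed tutorial on training language models with clean data.",
        "CLICK CLICK CLICK BUY BUY BUY",
        "Research notes on tokenization and vocabulary construction.",
        "FREE FREE FREE WIN WIN WIN"]
quality = [1, 0, 1, 0]  # 1 = good quality, 0 = poor quality

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(np.array([features(d) for d in docs]), quality)

print(rf.predict([features("WIN WIN WIN FREE PRIZE NOW")]))  # likely [0] (poor)
```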
Notably, OpenAI has used ChatGPT training data to build a dedicated filtering model for detecting harmful content. The goal is to monitor both the model’s inputs and outputs and to ensure compliance with its usage policies.
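For illustration, the same idea of model-based policy screening can be sketched with OpenAI’s public moderation endpoint (this is not the internal filtering model described above; it assumes the `openai` Python package v1+ and an `OPENAI_API_KEY` environment variable):

```python
# Screen corpus text for policy-violating content with a moderation model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_harmful(text: str) -> bool:
    resp = client.moderations.create(input=text)
    return resp.results[0].flagged  # True if any policy category is triggered

corpus = ["A recipe for vegetable soup.", "Some piece of scraped web text."]
kept = [doc for doc in corpus if not is_harmful(doc)]
```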
Step 3: Human Review
After completing the above steps, further data cleaning can be performed through human review.
OpenAI found that cleaning such harmful content is not a simple task: even a large team would need decades to manually review datasets of this scale. OpenAI is therefore working to establish AI-driven safety mechanisms to tackle this challenge, hiring staff specifically to build detection systems for harmful language and employing workers to clean up such content.
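In practice, human review is usually applied to a sample of the data rather than the whole corpus. A minimal sketch of routing a random sample of model-flagged documents to human reviewers (the sampling rate and the `review_queue` helper are illustrative assumptions):

```python
# Send a random sample of flagged documents to human annotators.
import random

def review_queue(flagged_docs: list[str], sample_rate: float = 0.05,
                 seed: int = 0) -> list[str]:
    """Sample a fraction of flagged documents for manual review."""
    rng = random.Random(seed)
    k = max(1, int(len(flagged_docs) * sample_rate))
    return rng.sample(flagged_docs, k)

flagged = [f"doc_{i}" for i in range(1000)]
for doc in review_queue(flagged):  # ~50 documents go to human reviewers
    ...  # hand off to the labeling / review tool
```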
Deep learning is an engineering discipline powered by data, and data itself represents productivity. The demand for data concerns not only quantity but also quality. High-quality data is not only the key to model effectiveness but also a competitive barrier for enterprise AI teams. Pre-training large language models on large volumes of cleaned web data and then fine-tuning them on annotated data to adapt them to various domains is the future trend in the application and development of large language models.