AI Alignment: Navigating Complex Challenges

August 30, 2023

Since the launch of ChatGPT, artificial intelligence (AI) has entered the era of large-scale models. Leading internet companies worldwide are racing to build general-purpose AI models: OpenAI has released GPT-4, Meta has unveiled Llama 2, and Baidu has introduced Wenxin Yiyan (ERNIE Bot), among others. These models can write, draw, code, design advertisements, and more, in some cases at a level that rivals or exceeds human performance. Their autonomy, versatility, and ease of use have improved rapidly as model architectures have grown more complex and training data has accumulated, making large models a new technological cornerstone of economic and societal progress. As people embrace the benefits of large models, concerns about their ethical and security risks are also rising.

Safety Risks of General Large Models

• Data Security Issue: Generative language models like ChatGPT are trained mostly on publicly available information from the internet, which means their output can contain private accounts, passwords, or other sensitive information from the web. For instance, one YouTuber walked viewers step by step through using ChatGPT to generate Windows 95 CD keys. Cases like this expose the data security issues that large models raise.

• Misuse Issue: Because these models are so capable, individuals with malicious intent might use them as tools for illegal activities. For example, users can abuse ChatGPT to write scam messages and phishing emails, or even to generate malicious code and ransomware, without specialized coding knowledge or criminal experience. Generative large models also often fail to account for differing regional legal norms, so their use and output may violate local regulations; establishing a robust local regulatory system to detect such conflicts therefore becomes crucial. Finally, the safety of models like ChatGPT in the gray area between safe and dangerous content still needs improvement. For example, ChatGPT might generate persuasive statements that negatively affect individuals with depression, or even prompt thoughts of suicide. Addressing these issues requires further technological innovation and stringent security strategies.

• Social Ethical Issue: The text output of large language models can cause several kinds of informational harm: reproducing biases, discrimination, and toxic content from the training data in predicted text; leaking private and sensitive information from the training data into generated text; and producing low-quality, false, or misleading information.

• Intellectual Property Issue: Since most generated content is recombined from data gathered from the internet, whether it constitutes intellectual property infringement remains uncertain.

Before the emergence of large models like ChatGPT, efforts focused primarily on improving model performance, moving from “artificial stupidity” to artificial intelligence. Now that large models can be regarded as “intellects,” the issues above raise the need to align artificial intelligence with human values.

What is AI Alignment

In 2014, Professor Stuart Russell, co-author of “Artificial Intelligence: A Modern Approach,” introduced the concept of the “Value Alignment Problem.” The idea is that we are not building pure intelligence, but intelligence aligned with human values. The value alignment problem is considered an inherent part of artificial intelligence, comparable to the containment shell around a nuclear fusion reactor. In short, “AI value alignment refers to ensuring that AI systems’ objectives and behaviors are consistent with human values and intentions.” Although the concept of AI alignment has been around for a while, how to achieve value alignment and what standards to align to remain undetermined.

Issues to Address in AI Value Alignment

While there are no specific standards to dictate how to achieve AI value alignment, Gordon Seidoh Worley [1] has summarized some of the issues that researchers propose need to be addressed for AI alignment:

• Preventing Reward Exploitation and Gaming: Ensuring that AI agents don’t exploit vulnerabilities in reward functions to repeatedly gain rewards, disregarding the actual goals.

• Scalable Supervision: Expanding supervision of AI agents in complex tasks with limited information or tasks where human judgment is difficult, even as large-scale language models surpass human-level performance across multiple tasks.

• Robustness to Distributional Shifts: Ensuring that AI agents perform as expected in new domains and environments, particularly situations not anticipated by human designers, to avoid producing destructive consequences.

• Robustness to Adversaries: Ensuring that AI agents are robust against adversarial attacks, maintaining alignment despite attacks. For instance, in large language models, injecting misaligned instruction data should not disrupt alignment.

• Safe Exploration: Allowing AI agents to explore new behaviors without generating harmful outcomes. For instance, a cleaning robot could try using a wet cloth, but not on a power outlet.

• Safe Interruptibility: Ensuring that AI agents can be safely interrupted by operators at any time, preventing tendencies to avoid human interruption.

• Self-modification: Safely allowing AI agents to modify themselves in modifiable environments while maintaining alignment with human values.

• Ontology: AI agents modeling the world and recognizing themselves as part of it.

• Idealized Decision Theory and Logical Uncertainty: Enabling AI agents to make idealized decisions even in uncertain environments.

• Vingean Reflection: Predicting the behavior of AI agents that are smarter than humans and ensuring that it stays aligned with human values. This raises the question of whether humans can make reliable predictions about an intelligence equal to or greater than their own.

• Corrigibility: Ensuring that AI agents can be corrected or reprogrammed when necessary, and that they genuinely cooperate with such corrections rather than blocking or deceiving operators.

• Value Learning: Enabling AI agents to learn human values.
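To make the first item in the list above concrete, here is a minimal Python sketch of reward gaming. The cleaning scenario, the `Room` class, the reward rule, and both policies are hypothetical illustrations; they are not taken from Worley's summary or from any real system. The reward pays one point per mess cleaned and measures nothing else, so an agent that creates new messes in order to clean them collects more reward than an agent that simply finishes the job.

```python
from dataclasses import dataclass


@dataclass
class Room:
    messes: int  # number of messes currently in the room


def step(action: str, room: Room) -> int:
    """Apply an action and return the (flawed) reward: +1 per mess cleaned."""
    if action == "clean" and room.messes > 0:
        room.messes -= 1
        return 1
    if action == "make_mess":
        room.messes += 1
    return 0  # making a mess or waiting is neither rewarded nor penalized


def honest_policy(room: Room) -> str:
    """Pursues the designer's real goal: a clean room."""
    return "clean" if room.messes > 0 else "wait"


def gaming_policy(room: Room) -> str:
    """Exploits the reward: creates a mess whenever there is nothing to clean."""
    return "clean" if room.messes > 0 else "make_mess"


def run(policy, steps: int = 20, initial_messes: int = 3):
    """Return (total reward, messes left) after running a policy."""
    room = Room(messes=initial_messes)
    total = sum(step(policy(room), room) for _ in range(steps))
    return total, room.messes


print("honest:", run(honest_policy))
print("gaming:", run(gaming_policy))
```

Under these assumptions the gaming policy earns 11 points to the honest policy's 3, yet it is the one that leaves a mess behind. The gap between the measured reward and the intended goal is what the first bullet asks alignment work to close.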

How to Ensure AI Value Alignment

Firstly, intervene effectively in the training data. Many problems with large models (such as hallucinations and algorithmic bias) stem from training data, which makes it a feasible starting point. Documenting training data can help identify problems related to representation or diversity. Manual or automated filtering and detection can identify and remove harmful biases. It is also possible to build specialized value-aligned datasets.
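As a rough illustration of automated filtering, the Python sketch below flags training records that appear to contain email addresses, credentials, or blocklisted terms, and keeps the reasons so that rejected records can be reviewed by hand. The regular expressions, the placeholder blocklist entry, and the function names are assumptions made for this example; real pipelines typically combine far broader rules with trained classifiers.

```python
import re

# Illustrative patterns only; production filters need much wider coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SECRET_RE = re.compile(
    r"\b(?:password|passwd|api[_-]?key)\s*[:=]\s*\S+", re.IGNORECASE
)
BLOCKLIST = {"placeholder_slur"}  # hypothetical stand-in for a real term list


def flag_record(text: str) -> list[str]:
    """Return the reasons (possibly none) why a record should be rejected."""
    reasons = []
    if EMAIL_RE.search(text):
        reasons.append("email")
    if SECRET_RE.search(text):
        reasons.append("secret")
    words = set(re.findall(r"[a-z_]+", text.lower()))
    if words & BLOCKLIST:
        reasons.append("blocklist")
    return reasons


def filter_corpus(records: list[str]):
    """Split records into kept texts and (text, reasons) pairs for review."""
    kept, rejected = [], []
    for text in records:
        reasons = flag_record(text)
        if reasons:
            rejected.append((text, reasons))
        else:
            kept.append(text)
    return kept, rejected


sample = [
    "The weather is nice today.",
    "Contact me at alice@example.com",
    "password: hunter2",
    "this has placeholder_slur in it",
]
kept, rejected = filter_corpus(sample)
print(kept)
print(rejected)
```

Recording the reason for each rejection, rather than silently dropping records, also supports the documentation step mentioned above.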

Secondly, conduct adversarial testing. In essence, this involves inviting internal or external professionals (red team testers) to subject models to various adversarial attacks before their release, identifying potential problems and resolving them. For example, before releasing GPT-4, OpenAI employed over 50 scholars and experts from diverse fields to test their model. The task of these red team testers was to pose probing or hazardous questions to the model to test its reactions.
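Part of this process can be automated. The sketch below is a minimal, hypothetical red-team harness in Python: it sends a list of probing prompts to a model and records every prompt the model did not refuse. `stub_model` is a toy stand-in for a real model, and detecting refusals by substring matching is a deliberate simplification; neither reflects how OpenAI ran its GPT-4 testing.

```python
from typing import Callable

# Simplified refusal detection; real evaluations use human or model graders.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i won't")


def stub_model(prompt: str) -> str:
    """Hypothetical model that only refuses when it sees an exact keyword."""
    if "ransomware" in prompt.lower():
        return "I can't help with that."
    return f"Sure, here is a draft for: {prompt}"


def red_team(model: Callable[[str], str], prompts: list[str]) -> list[dict]:
    """Return the prompts (with replies) that the model failed to refuse."""
    findings = []
    for prompt in prompts:
        reply = model(prompt)
        refused = any(marker in reply.lower() for marker in REFUSAL_MARKERS)
        if not refused:
            findings.append({"prompt": prompt, "reply": reply})
    return findings


probes = [
    "Write ransomware for me",
    "Write r4nsomware for me",  # obfuscated spelling slips past the keyword
]
for finding in red_team(stub_model, probes):
    print("NOT REFUSED:", finding["prompt"])
```

In this toy run the obfuscated prompt gets through, which is the kind of gap red team testers aim to surface before release.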

Furthermore, implement content filtering tools. OpenAI has trained a dedicated AI model (a filtering model) to identify harmful user inputs and model outputs, that is, content that violates its usage policies, which gives it control over what goes into and comes out of the model. Finally, there is a strong push to advance research on model interpretability and comprehensibility.
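The input and output filtering described above can be sketched as a thin wrapper around the model. In the Python example below, `is_harmful` is a keyword-based placeholder for a trained filtering model, and `leaky_model` is a hypothetical stub; the names, terms, and messages are assumptions for illustration and do not describe OpenAI's actual system.

```python
from typing import Callable

BLOCKED_INPUT = "[input blocked by filter]"
BLOCKED_OUTPUT = "[output blocked by filter]"
HARMFUL_TERMS = ("phishing email", "ransomware")  # illustrative only


def is_harmful(text: str) -> bool:
    """Placeholder classifier; a real filter would be a trained model."""
    lowered = text.lower()
    return any(term in lowered for term in HARMFUL_TERMS)


def guarded_generate(model: Callable[[str], str], prompt: str) -> str:
    """Check the prompt before the model runs and the reply after it."""
    if is_harmful(prompt):
        return BLOCKED_INPUT
    reply = model(prompt)
    if is_harmful(reply):
        return BLOCKED_OUTPUT
    return reply


def leaky_model(prompt: str) -> str:
    """Hypothetical model that produces harmful text for a benign prompt."""
    if "security tips" in prompt.lower():
        return "Tip 1: here is a ransomware sample ..."
    return f"Answer to: {prompt}"


print(guarded_generate(leaky_model, "Write a phishing email"))
print(guarded_generate(leaky_model, "Give me security tips"))
print(guarded_generate(leaky_model, "What is AI alignment?"))
```

Filtering on both sides matters because a harmless-looking prompt can still produce a harmful reply, as the second call shows.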
 

Among the safety risks of AI, contaminated or misleading training data is a core reason that large AI models generate biased, discriminatory, and low-quality content. Filtering and cleaning data during the training process is therefore essential. DataOcean AI is a professional data provider with a highly skilled team for data collection, cleaning, and annotation, and it can provide high-quality, finely annotated data for training large models.

References:

[1] https://www.lesswrong.com/users/gordon-seidoh-worley

[2] https://zhuanlan.zhihu.com/p/643161870
