Start your day with intelligence. Get The OODA Daily Pulse.

OpenAI, Hugging Face and DEFCON 31 (August 2023) on Red Teaming ChatGPT, Large Language Models, and Neural Language Models

The OODA LLC team and members of the OODA Network have participated in or led hundreds of red teams across many divergent disciplines, ranging from strategic and tactical cyber to physical security threats – like infectious diseases or nuclear power plant targeting – to more abstract items like Joint Operating Concepts.

We recently looked for patterns in the exponential adoption of ChatGPT, large language models (LLMs), and neural language models (NLMs)  – and how it has intersected with the red teaming discipline.

Here is what we found and the results are encouraging, including plans for a grand challenge at the AI Village at the upcoming DEFCON31 in August in Las Vegas, which will include access to models made available from  Anthropic, Google, Hugging Face, Microsoft, NVIDIA, OpenAI, and Stability AI.

Red-Teaming Large Language Models

ChatGPT and DEFCON31 need no introduction to our readership, but Hugging Face requires a bit more context.

Hugging Face has been included in our previous analysis of large language models (LLMs) and neural language models (NLMs), including:

  • The use of the company’s Cerebras’s open-source GPT models trained at wafer scale in some recent on-chip benchmarking:   “The trained models Cerebras has released, available under the permissive Apache 2.0 license, have been downloaded more than 200,000 times from HuggingFace at the time of writing (about two weeks after release). They are trained on the public PILE dataset from Eleuther.  Training seven models of different sizes allowed Cerebras to derive a scaling law linking the performance of the model (prediction accuracy) to the amount of compute required for training. This will allow the forecasting of model performance based on training budgets. While other companies have published scaling laws, this is the first using a public dataset, the company said.”; and
  •  The BigScience initiative:  A consortium led by AI company Hugging Face, made up of roughly 500 researchers (some of whom are from big tech firms). The group of volunteer researchers will be building and studying an open-source language model, which was discussed in 2021 in the context of The Current AI Innovation Hype Cycle: Large Language Models, OpenAI’s GPT-3 and DeepMind’s RETRO.

Although not mentioned directly in the leaked Google document breaking Down the Exponential Future of Open Source LLMs, Hugging Face’s open-source models will most definitely be in the mix.

Alpha Signal had a recent headline about Hugging Face:  “…well-known player in the open-source AI development arena, has unveiled open-source 30B chatbot alternative to ChatGPT, named HuggingChat. The new platform boasts a user interface that lets users interact with an open-source chat assistant called Open Assistant, created by the nonprofit organization LAION. HuggingChat, which has a web interface for testing and an API for integration with third-party apps and services, can write code, emails, and even rap lyrics, just like ChatGPT. The emergence of HuggingChat, which is still in its infancy, is a significant milestone for Hugging Face, as it reaffirms their commitment to open-source development and accessibility. However, HuggingChat’s economic viability remains to be seen, as licensing issues still need to be resolved.

From an AI industry perspective, HuggingChat’s arrival carries significant implications. Hugging Face‘s CEO, Clem Delangue, has stated that “we need open-source alternatives to ChatGPT for more transparency, inclusivity, accountability, and distribution of power.” HuggingChat’s open-source platform provides transparency, accessibility, and inclusivity that closed-source models lack, paving the way for a more democratic AI landscape. In addition, HuggingChat’s potential for third-party app integration has sparked talk of it becoming the Android App Store equivalent. This would mean that HuggingChat could level the playing field for smaller AI developers and reduce the monopoly of larger companies in the industry.” (1)

In a recent post from the Huggin Face team (which is so good we include it at the end of this post in its entirety, the working definition of red-teaming in an LLM context is a great preface for  us here:

Red teaming is a form of evaluation that elicits model vulnerabilities that might lead to undesirable behaviors. Jailbreaking is another term for red-teaming wherein the LLM is manipulated to break away from its guardrails. Microsoft’s Chatbot Tay launched in 2016 and the more recent Bing’s Chatbot Sydney are real-world examples of how disastrous the lack of thorough evaluation of the underlying ML model using red-teaming can be. The origins of the idea of a red team traces back to adversary simulations and wargames performed by militaries.

The goal of red-teaming language models is to craft a prompt that would trigger the model to generate text that is likely to cause harm. Red-teaming shares some similarities and differences with the more well-known form of evaluation in ML called adversarial attacks. The similarity is that both red-teaming and adversarial attacks share the same goal of “attacking” or “fooling” the model to generate content that would be undesirable in a real-world use case. However, adversarial attacks can be unintelligible to humans, for example, by prefixing the string “aaabbbcc” to each prompt because it deteriorates model performance. Many examples of such attacks on various NLP classification and generation tasks is discussed in Wallace et al., ‘19. Red-teaming prompts, on the other hand, look like regular, natural language prompts.

Red-teaming can reveal model limitations that can cause upsetting user experiences or enable harm by aiding violence or other unlawful activity for a user with malicious intentions. The outputs from red-teaming (just like adversarial attacks) are generally used to train the model to be less likely to cause harm or steer it away from undesirable outputs.

Open AI’s ‘red team’

“We don’t really know what it is they’re doing or who they’re communicating with. And that’s what we need to worry about. That’s already happening.”

This was a recent discussion on the Financial Times podcast FT News Briefing between FT’s Sonja Hutson and Madhumita Murgia (FT’s artificial intelligence editor).

Sonja Hutson: The newest version of the artificial intelligence chatbot ChatGPT is so powerful it’s creating new fears about how it can be used. So the company behind it has tried to counter those concerns by creating a team of experts to test out what the chatbot can do. The FT’s artificial intelligence editor Madhumita Murgia has been reporting on OpenAI’s “red team”. She joins me now. Hi, Madhu.

Madhumita Murgia: Hi.

Hutson: Can you tell me a little bit about this team that OpenAI put together to look into dangerous uses for ChatGPT-4?

Murgia:  So I spent quite a long time reporting this piece that we have out on the “red team”, which is essentially a term for people who test out something before it is released. But what they’re really trying to do is break it. They’re trying to push it to its limits and see all the bad things that it could say by asking it lots of different types of questions in their areas of expertise. And they were a really eclectic bunch. So there was a writing teacher at a community college, there was a chemist, you know, who worked on designing drugs, all these different types of people who look at safety in different aspects. And, you know, they all were really worried by when they tested GPT-4 out, the kind of outputs they were seeing.

Hutson:  But now that ChatGPT-4 is out in the real world, how do they feel about it? Do they feel like the company addressed the concerns that they had?

Murgia:  So I think now that it’s out in the real world, they continue to be concerned because while some of the some of what they fed back were taken on board and improved in the final version of GPT-4, they feel that there are lots of ways in which you can still elicit harmful responses, whether that’s biased responses, discriminatory towards certain, you know, marginalized groups or genders. There are ways in which you can elicit how-tos for cybercrimes. The chemist that I spoke to, Andrew White, was really concerned because he managed to find a way to get GPT to not only suggest an entirely new type of chemical that could be harmful to human health, but he got it to find where he could order it from on the internet as well.

Hutson:  So what makes this newest version of ChatGPT more concerning than previous versions?

Murgia:  So I don’t think it’s necessarily more nefarious. It’s just better, right? It’s more powerful. And all of the people I interviewed for my piece had tested both versions and they just found that, you know, GPT-4 was capable of more nuance. So the real difference is that it’s a larger model, it’s a more powerful model. And along with that, that means it has more sorts of misuses that come with it as well.

Hutson:  Madhu, what’s your takeaway here? How worried should we be about what ChatGPT-4 can do?

Murgia:  So having spoken to quite a lot of people who’ve been thinking about the downsides, I have come away concerned about how this tool could be exploited. There’s a whole community of people out there, you know, the open-source community of coders and developers who are kind of breaking open models like GPT-4 and adapting it and changing it and, you know, connecting it to the internet, for example, and allowing it to perform autonomous actions, kind of breaking it open from the box that it currently sits in. And that really reduces our human oversight of these types of technologies. And then we don’t really know what it is they’re doing or who they’re communicating with. And that’s what we need to worry about. That’s already happening.

Murgia’s entire piece on OpenAI’s red team efforts can be Murgia found at    OpenAI’s red team: the experts hired to ‘break’ ChatGPT | Financial Times (subscription required).

For additional great coverage, see:

Members of the GPT-4 ‘red team’ interviewed by the Financial Times

Microsoft-backed company asked an eclectic mix of people to ‘adversarially test’ GPT-4, its powerful new language model

Paul Röttger Oxford Internet Institute, UK
PhD student focusing on the use of AI to detect online hate speech

Anna Mills English instructor, College of Marin, US
Writing teacher at a community college, testing for learning loss

Maarten Sap Carnegie Mellon University, US
Assistant professor, specialises in toxicity of large language model outputs

Sara Kingsley Carnegie Mellon University, US
PhD researcher who specialises in online labour markets and impact of tech on work

Boru Gollo TripleOKlaw LLP, Kenya
Lawyer who has studied opportunities for AI in Kenya

Andrew White University of Rochester, US
Associate professor, computational chemist, interested in AI and drug design

José Hernández-Orallo Professor, Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Spain
AI researcher working on evaluation and accuracy of AI software

Lauren Kahn Council on Foreign Relations, US
Research fellow, focusing on how AI use in military systems alters risk dynamics on battlefields, raises the risk of unintended conflict and inadvertent escalation

Aviv Ovadya Berkman Klein Center for Internet & Society, Harvard University, US
Focus on impacts of AI on society and democracy

Nathan Labenz Co-founder of Waymark, US
Founder of Waymark, an AI-based video editing start-up

Lexin Zhou VRAIN, Universitat Politècnica de València, Spain
Junior researcher working on making AI more socially beneficial

Dan Hendrycks Director of the Center for AI Safety at University of California, Berkeley, US
Specialist in AI safety and reducing societal-scale risks from AI

Roya Pakzad Founder, Taraaz, US/Iran
Founder and director of Taraaz, a non-profit working on tech and human rights

Heather Frase Senior Fellow, Georgetown’s Center for Security and Emerging Technology, US
Expertise in the use of AI for intelligence purposes and operational tests of major defence systems

What Next?

The Hugging Face team is working on these Future directions:

  1. There is no open-source red-teaming dataset for code generation that attempts to jailbreak a model via code, for example, generating a program that implements a DDOS or backdoor attack.
  2. Designing and implementing strategies for red-teaming LLMs for critical threat scenarios.
  3. Red-teaming can be resource intensive, both compute and human resources and so would benefit from sharing strategies, open-sourcing datasets, and possibly collaborating for a higher chance of success.
  4. Evaluating the tradeoffs between evasiveness and helpfulness.
  5. Enumerate the choices based on the above tradeoff and explore the pareto front for red-teaming (similar to Anthropic’s Constitutional AI work).

Among the future risks posed by these models are:

  • Using them to create and spread disinformation;
  • To write malware;
  • To create phishing emails;
  • To provide harmful knowledge not widely available to the public, such as instructions on how to create toxins;
  • Biases that are difficult to test for;
  • The emergence of unexpected model properties and what industry researchers refer to as “hallucinations” — when an AI model gives a confident response to a query that isn’t grounded in reality. (3)

Coming to DEF CON 31: Hacking AI models

A group of prominent AI companies committed to opening their models to attack at this year’s DEF CON hacking conference in Las Vegas.

At DEF CON this year, Anthropic, Google, Hugging Face, Microsoft, NVIDIA, OpenAI, and Stability AI will all open up their models for attack:  The DEF CON event will rely on an evaluation platform developed by Scale AI, a California company that produces training for AI applications. Participants will be given laptops to use to attack the models. Any bugs discovered will be disclosed using industry-standard responsible disclosure practices.

As reported by CYBERSCOOP:

A group of leading artificial intelligence companies in the U.S. committed…to open their models to red-teaming at this year’s DEF CON hacking conference as part of a White House initiative to address the security risks posed by the rapidly advancing technology.

Attendees at the premier hacking conference held annually in Las Vegas in August will be able to attack models from Anthropic, Google, Hugging Face, Microsoft, NVIDIA, OpenAI, and Stability AI in an attempt to find vulnerabilities. The event hosted at the AI Village is expected to draw thousands of security researchers.

A senior administration official speaking to reporters on condition of anonymity ahead of the announcement said the red-teaming event is the first public assessment of large language models. “Red-teaming has been really helpful and very successful in cybersecurity for identifying vulnerabilities,” the official said. “That’s what we’re now working to adapt for large language models.”

The announcement…came ahead of a meeting at the White House later in the day between Vice President Kamala Harris, senior administration officials, and the CEOs of Anthropic, Google, Microsoft, and OpenAI.

This won’t be the first time Washington has looked to the ethical hacking community at DEF CON to help find weaknesses in critical and emerging technologies. The U.S. Air Force has held capture-the-flag contests there for hackers to test the security of satellite systems and the Pentagon’s Defense Advanced Program Research Agency brought a new technology to the conference that could be used for more secure voting.

Rapid advances in machine learning in recent years have resulted in a slew of product launches featuring generative AI tools. But in the rush to launch these models, many AI experts are concerned that companies are moving too quickly to ship new products to market without properly addressing the safety and security concerns.

Advances in machine learning have historically occurred in academic communities and open research teams, but AI companies are increasingly closing off their models to the public, making it more difficult for independent researchers to examine potential shortcomings.

“Traditionally, companies have solved this problem with specialized red teams. However this work has largely happened in private,” AI Village founder Sven Cattell said in a statement. “The diverse issues with these models will not be resolved until more people know how to red team and assess them.”

As mentioned above, the entire case study provided by the Hugging Face Team can be found below, spliced between further OODA Loop resources on red teaming and AI security.

https://oodaloop.com/ooda-original/2015/10/22/10-red-teaming-lessons-learned-over-20-years/

https://oodaloop.com/resource/2014/03/13/red-team-handbook/

https://oodaloop.com/archive/2020/09/03/oodacast-jim-miller-on-managing-policy-in-an-age-of-constant-disruption-and-dynamic-threats/

https://oodaloop.com/ooda-original/2017/02/21/2017-red-teamers-bookshelf/

https://oodaloop.com/archive/2020/05/08/oodacast-a-conversation-with-lou-manousos-ceo-of-riskiq/

Red-Teaming Large Language Models (huggingface.co)

As mentioned, the following is the entire case study provided by the Hugging Face Team.

Warning: This article is about red-teaming and as such contains examples of model generation that may be offensive or upsetting.

Large language models (LLMs) trained on an enormous amount of text data are very good at generating realistic text. However, these models often exhibit undesirable behaviors like revealing personal information (such as social security numbers) and generating misinformation, bias, hatefulness, or toxic content. For example, earlier versions of GPT3 were known to exhibit sexist behaviors (see below) and biases against Muslims,

Once we uncover such undesirable outcomes when using an LLM, we can develop strategies to steer it away from them, as in Generative Discriminator Guided Sequence Generation (GeDi) or Plug and Play Language Models (PPLM) for guiding generation in GPT3. Below is an example of using the same prompt but with GeDi for controlling GPT3 generation.

Even recent versions of GPT3 produce similarly offensive text when attacked with prompt injection that can become a security concern for downstream applications as discussed in this blog.

Red-teaming is a form of evaluation that elicits model vulnerabilities that might lead to undesirable behaviors. Jailbreaking is another term for red-teaming wherein the LLM is manipulated to break away from its guardrails. Microsoft’s Chatbot Tay launched in 2016 and the more recent Bing’s Chatbot Sydney are real-world examples of how disastrous the lack of thorough evaluation of the underlying ML model using red-teaming can be. The origins of the idea of a red-team traces back to adversary simulations and wargames performed by militaries.

The goal of red-teaming language models is to craft a prompt that would trigger the model to generate text that is likely to cause harm. Red-teaming shares some similarities and differences with the more well-known form of evaluation in ML called adversarial attacks. The similarity is that both red-teaming and adversarial attacks share the same goal of “attacking” or “fooling” the model to generate content that would be undesirable in a real-world use case. However, adversarial attacks can be unintelligible to humans, for example, by prefixing the string “aaabbbcc” to each prompt because it deteriorates model performance. Many examples of such attacks on various NLP classification and generation tasks is discussed in Wallace et al., ‘19. Red-teaming prompts, on the other hand, look like regular, natural language prompts.

Red-teaming can reveal model limitations that can cause upsetting user experiences or enable harm by aiding violence or other unlawful activity for a user with malicious intentions. The outputs from red-teaming (just like adversarial attacks) are generally used to train the model to be less likely to cause harm or steer it away from undesirable outputs.

Since red-teaming requires creative thinking of possible model failures, it is a problem with a large search space making it resource intensive. A workaround would be to augment the LLM with a classifier trained to predict whether a given prompt contains topics or phrases that can possibly lead to offensive generations and if the classifier predicts the prompt would lead to a potentially offensive text, generate a canned response. Such a strategy would err on the side of caution. But that would be very restrictive and cause the model to be frequently evasive. So, there is tension between the model being helpful (by following instructions) and being harmless (or at least less likely to enable harm).

The red team can be a human-in-the-loop or an LM that is testing another LM for harmful outputs. Coming up with red-teaming prompts for models that are fine-tuned for safety and alignment (such as via RLHF or SFT) requires creative thinking in the form of roleplay attacks wherein the LLM is instructed to behave as a malicious character as in Ganguli et al., ‘22. Instructing the model to respond in code instead of natural language can also reveal the model’s learned biases such as examples below.

See this tweet thread for more examples.

Here is a list of ideas for jailbreaking a LLM according to ChatGPT itself.

Red-teaming LLMs is still a nascent research area and the aforementioned strategies could still work in jailbreaking these models, or they have aided the deployment of at-scale machine learning products. As these models get even more powerful with emerging capabilities, developing red-teaming methods that can continually adapt would become critical. Some needed best-practices for red-teaming include simulating scenarios of power-seeking behavior (eg: resources), persuading people (eg: to harm themselves or others), having agency with physical outcomes (eg: ordering chemicals online via an API). We refer to these kind of possibilities with physical consequences as critical threat scenarios.

The caveat in evaluating LLMs for such malicious behaviors is that we don’t know what they are capable of because they are not explicitly trained to exhibit such behaviors (hence the term emerging capabilities). Therefore, the only way to actually know what LLMs are capable of as they get more powerful is to simulate all possible scenarios that could lead to malovalent outcomes and evaluate the model’s behavior in each of those scenarios. This means that our model’s safety behavior is tied to the strength of our red-teaming methods.

Given this persistent challenge of red-teaming, there are incentives for multi-organization collaboration on datasets and best-practices (potentially including academic, industrial, and government entities). A structured process for sharing information can enable smaller entities releasing models to still red-team their models before release, leading to a safer user experience across the board.

Open source datasets for Red-teaming:

  1. Meta’s Bot Adversarial Dialog dataset
  2. Anthropic’s red-teaming attempts
  3. AI2’s RealToxicityPrompts

Findings from past work on red-teaming LLMs (from Anthropic’s Ganguli et al. 2022 and Perez et al. 2022)

  1. Few-shot-prompted LMs with helpful, honest, and harmless behavior are not harder to red-team than plain LMs.
  2. There are no clear trends with scaling model size for attack success rate except RLHF models that are more difficult to red-team as they scale.
  3. Models may learn to be harmless by being evasive, there is a tradeoff between helpfulness and harmlessness.
  4. There is overall low agreement among humans on what constitutes a successful attack.
  5. The distribution of the success rate varies across categories of harm with non-violent ones having a higher success rate.
  6. Crowdsourcing red-teaming leads to template-y prompts (eg: “give a mean word that begins with X”) making them redundant.

Future directions:

  1. There is no open-source red-teaming dataset for code generation that attempts to jailbreak a model via code, for example, generating a program that implements a DDOS or backdoor attack.
  2. Designing and implementing strategies for red-teaming LLMs for critical threat scenarios.
  3. Red-teaming can be resource intensive, both compute and human resource and so would benefit from sharing strategies, open-sourcing datasets, and possibly collaborating for a higher chance of success.
  4. Evaluating the tradeoffs between evasiveness and helpfulness.
  5. Enumerate the choices based on the above tradeoff and explore the pareto front for red-teaming (similar to Anthropic’s Constitutional AI work)

These limitations and future directions make it clear that red-teaming is an under-explored and crucial component of the modern LLM workflow. This post is a call-to-action to LLM researchers and HuggingFace’s community of developers to collaborate on these efforts for a safe and friendly world 🙂

Reach out to us (@nazneenrajani @natolambert @lewtun @TristanThrush @yjernite @thomwolf) if you’re interested in joining such a collaboration.

Acknowledgement: We’d like to thank Yacine Jernite for his helpful suggestions on correct usage of terms in this blogpost.

https://oodaloop.com/archive/2023/04/21/the-current-on-chip-innovation-and-physical-layer-market-dynamics-of-gpt-model-training-crypto-mining-and-the-metaverse/

https://oodaloop.com/archive/2023/05/05/we-have-no-moat-and-neither-does-openai-leaked-google-document-breaks-down-the-exponential-future-of-open-source-llms/

https://oodaloop.com/archive/2022/11/29/gpt-3-neural-language-models-and-the-risks-of-radicalization/

https://oodaloop.com/archive/2022/03/16/open-source-natural-language-processing-eleutherais-gpt-j/

https://oodaloop.com/archive/2023/03/15/openai-releases-gpt-4/

https://oodaloop.com/archive/2022/02/07/the-current-ai-innovation-hype-cycle-large-language-models-openais-gpt-3-and-deepminds-retro/

https://oodaloop.com/archive/2022/06/15/opportunities-for-advantage-natural-language-processing-meta-ai-builds-a-huge-gpt-3-model-and-makes-it-available-for-free/

 

Daniel Pereira

About the Author

Daniel Pereira

Daniel Pereira is research director at OODA. He is a foresight strategist, creative technologist, and an information communication technology (ICT) and digital media researcher with 20+ years of experience directing public/private partnerships and strategic innovation initiatives.