
Two Emergent and Sophisticated Approaches to LLM Implementation in Cybersecurity

Google Security Engineering and the Carnegie Mellon University Software Engineering Institute (the latter in collaboration with OpenAI) have sorted through the hype – and done some serious thinking and formal research on “better approaches for evaluating LLM cybersecurity capabilities” and on AI-powered patching as the future of automated vulnerability fixes. This is valuable formative framing of the challenges ahead as we collectively sort out the implications of the convergence of generative AI and future cyber capabilities (offensive and defensive).

OpenAI Collaboration Yields 14 Recommendations for Evaluating LLMs for Cybersecurity

This research was conducted by Jeff Gennari, Shing-hon Lau, and Samuel J. Perl of the SEI, with contributions from Joel Parish and Girish Sastry of OpenAI.

From the SEI post summarizing the research: 

Large language models (LLMs) have shown a remarkable ability to ingest, synthesize, and summarize knowledge while simultaneously demonstrating significant limitations in completing real-world tasks. One notable domain that presents both opportunities and risks for leveraging LLMs is cybersecurity. LLMs could empower cybersecurity experts to be more efficient or effective at preventing and stopping attacks. However, adversaries could also use generative artificial intelligence (AI) technologies in kind. We have already seen evidence of actors using LLMs to aid in cyber intrusion activities (e.g., WormGPT, FraudGPT, etc.). Such misuse raises many important cybersecurity-capability-related questions, including:

  • Can an LLM like GPT-4 write novel malware?
  • Will LLMs become critical components of large-scale cyber-attacks?
  • Can we trust LLMs to provide cybersecurity experts with reliable information?

Recently, a team of researchers in the SEI CERT Division worked with OpenAI to develop better approaches for evaluating LLM cybersecurity capabilities. This SEI Blog post, excerpted from a recently published paper that we coauthored with OpenAI researchers Joel Parish and Girish Sastry, summarizes 14 recommendations to help assessors accurately evaluate LLM cybersecurity capabilities.

The Challenge of Using LLMs for Cybersecurity Tasks

Without a clear understanding of how an LLM performs on applied and realistic cybersecurity tasks, decision makers lack the information they need to assess opportunities and risks. We contend that practical, applied, and comprehensive evaluations are required to assess cybersecurity capabilities. Realistic evaluations reflect the complex nature of cybersecurity and provide a more complete picture of cybersecurity capabilities.

Recommendations for Cybersecurity Evaluations

To properly judge the risks and appropriateness of using LLMs for cybersecurity tasks, evaluators need to carefully consider the design, implementation, and interpretation of their assessments. Tests based on practical and applied cybersecurity knowledge are preferable to general fact-based assessments. However, creating these types of assessments can be a formidable task that encompasses infrastructure, task/question design, and data collection. The following list of recommendations is meant to help assessors craft meaningful and actionable evaluations that accurately capture LLM cybersecurity capabilities. The expanded list of recommendations is outlined in our paper.

Define the real-world task that you would like your evaluation to capture.

Starting with a clear definition of the task helps clarify decisions about complexity and assessment. The following recommendations are meant to help define real-world tasks:

  1. Consider how humans do it: Starting from first principles, think about how the task you would like to evaluate is accomplished by humans, and write down the steps involved. This process will help clarify the task.
  2. Use caution with existing datasets: Current evaluations within the cybersecurity domain have largely leveraged existing datasets, which can influence the type and quality of tasks evaluated.
  3. Define tasks based on intended use: Carefully consider whether you are interested in autonomy or human-machine teaming when planning evaluations. This distinction will have significant implications for the type of assessment that you conduct.
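
To make these recommendations concrete, here is a minimal, hypothetical sketch of how an assessor might record a task definition before building an evaluation. The class and field names are illustrative and do not come from the SEI paper.

```python
# Hypothetical sketch: a structured task definition an assessor might fill in
# before building an evaluation. Names are illustrative, not from the SEI paper.
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class IntendedUse(Enum):
    AUTONOMOUS = "autonomous"          # model completes the task end to end
    HUMAN_MACHINE_TEAMING = "teaming"  # model assists a human analyst


@dataclass
class TaskDefinition:
    name: str
    real_world_goal: str                                   # the applied task the evaluation should capture
    human_steps: list[str] = field(default_factory=list)   # how a human accomplishes it today
    intended_use: IntendedUse = IntendedUse.HUMAN_MACHINE_TEAMING
    existing_dataset: str | None = None                    # record provenance so dataset bias stays visible


triage_task = TaskDefinition(
    name="vulnerability-triage",
    real_world_goal="Decide whether a reported crash is an exploitable memory-safety bug",
    human_steps=[
        "Reproduce the crash from the report",
        "Inspect the stack trace and surrounding source",
        "Classify exploitability and recommend a fix priority",
    ],
)
```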

Represent tasks appropriately.

Most tasks worth evaluating in cybersecurity are too nuanced or complex to be represented with simple queries, such as multiple-choice questions. Rather, queries need to reflect the nature of the task without being unintentionally or artificially limiting. The following guidelines ensure evaluations incorporate the complexity of the task:

  1. Define an appropriate scope: While subtasks of complex tasks are usually easier to represent and measure, their performance does not always correlate with the larger task. Ensure that you do not represent the real-world task with a narrow subtask.
  2. Develop an infrastructure to support the evaluation: Practical and applied tests will generally require significant infrastructure support, particularly in supporting interactivity between the LLM and the test environment (a minimal sketch of such an interaction loop follows this list).
  3. Incorporate affordances to humans where appropriate: Ensure your assessment mirrors real-world affordances and accommodations given to humans.
  4. Avoid affordances to humans where inappropriate: Evaluations of humans in higher education and professional-certification settings may ignore real-world complexity.
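
As a rough illustration of the interactivity mentioned in recommendation 2, the following Python sketch shows one possible interaction loop in which the model proposes an action, an isolated environment executes it, and the output is fed back for the next turn. The Sandbox class and query_model function are placeholders, not part of any real evaluation harness.

```python
# Minimal sketch of the interactivity an applied evaluation needs: the model
# proposes an action, the test environment executes it, and the result is fed
# back. Sandbox and query_model are placeholders, not a real API.
import subprocess


class Sandbox:
    """Stand-in for an isolated, disposable test environment."""

    def run(self, command: str) -> str:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=30
        )
        return result.stdout + result.stderr


def query_model(transcript: list) -> str:
    raise NotImplementedError("Call the LLM under evaluation here.")


def run_episode(task_prompt: str, max_turns: int = 10) -> list:
    sandbox = Sandbox()
    transcript = [{"role": "user", "content": task_prompt}]
    for _ in range(max_turns):
        action = query_model(transcript)        # e.g., a shell command, or "DONE"
        transcript.append({"role": "assistant", "content": action})
        if action.strip() == "DONE":
            break
        observation = sandbox.run(action)       # execute in the sandbox, capture output
        transcript.append({"role": "user", "content": observation})
    return transcript                           # graded afterwards against the task definition
```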

Make your evaluation robust.

Use care when designing evaluations to avoid spurious results. Assessors should consider the following guidelines when creating assessments:

  1. Use preregistration: Consider how you will grade the task ahead of time.
  2. Apply realistic perturbations to inputs: Changing the wording, ordering, or names in a question would have minimal effect on a human but can result in dramatic shifts in LLM performance. These changes must be accounted for in assessment design (see the sketch after this list).
  3. Beware of training data contamination: LLMs are frequently trained on large corpora, including news and vulnerability feeds, Common Vulnerabilities and Exposures (CVE) websites, and code and online discussions of security. These data may make some tasks artificially easy for the LLM.
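
The sketch below illustrates the perturbation idea from recommendation 2: generate surface-level variants of a question (reordered options, renamed entities) and check how stable the model’s score is across them. The specific perturbations, entity names, and the ask_model callable are hypothetical.

```python
# Hypothetical sketch of surface-level perturbations for robustness checks.
# The perturbations, entity names, and ask_model callable are illustrative only.
import random


def perturb_question(question: str, options: list, seed: int):
    rng = random.Random(seed)
    shuffled = options[:]
    rng.shuffle(shuffled)                                                  # reorder answer options
    reworded = question.replace("server_a", f"host_{rng.randint(1, 99)}")  # rename entities
    return reworded, shuffled


def score_across_perturbations(question, options, answer, ask_model, n=10):
    """ask_model(question, options) -> chosen option; returns accuracy over n variants."""
    correct = 0
    for seed in range(n):
        q, opts = perturb_question(question, options, seed)
        if ask_model(q, opts) == answer:
            correct += 1
    return correct / n    # large swings across variants signal a brittle result
```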

Frame results appropriately.

Evaluations with a sound methodology can still misleadingly frame results. Consider the following guidelines when interpreting results:

  1. Avoid overgeneralized claims: Avoid making sweeping claims about capabilities from the task or subtask evaluated. For example, strong model performance in an evaluation measuring vulnerability identification in a single function does not mean that a model is good at discovering vulnerabilities in a real-world web application, where resources such as access to source code may be restricted.
  2. Estimate best-case and worst-case performance: LLMs may show wide variations in evaluation performance due to different prompting strategies or because they use additional test-time compute techniques (e.g., chain-of-thought prompting). Best- and worst-case scenarios help constrain the range of outcomes (a sketch of reporting such a range follows this list).
  3. Be careful with model selection bias: Any conclusions drawn from evaluations should be put into the proper context. If possible, run tests on a variety of contemporary models, or qualify claims appropriately.
  4. Clarify whether you are evaluating risk or evaluating capabilities: A judgment about the risk of models requires a threat model. In general, however, the capability profile of the model is only one source of uncertainty about the risk. Task-based evaluations can help understand the capability of the model.
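
One hedged way to report the best-case/worst-case range from recommendation 2 is to run the same task set under several prompting strategies and publish the spread rather than a single point estimate. In the sketch below, run_eval is a placeholder for whatever harness produced the headline score.

```python
# Illustrative sketch: run the same task set under several prompting strategies
# and report the spread, so results are framed as a range rather than a point
# estimate. run_eval is a placeholder for the assessor's own harness.
def run_eval(tasks, strategy: str) -> float:
    raise NotImplementedError("Score the model on `tasks` using this prompting strategy.")


def performance_range(tasks, strategies=("zero-shot", "few-shot", "chain-of-thought")):
    scores = {s: run_eval(tasks, s) for s in strategies}
    return {
        "per_strategy": scores,
        "worst_case": min(scores.values()),
        "best_case": max(scores.values()),
    }
```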

For further insights and recommendations from the SEI/OpenAI collaborators, find the full research paper at: Considerations for Evaluating Large Language Models for Cybersecurity Tasks by Jeffrey Gennari, Shing-hon Lau, Samuel Perl, Joel Parish (OpenAI), and Girish Sastry (OpenAI).

AI-powered patching: the future of automated vulnerability fixes

Jan Keller and Jan Nowakowski from Google Security Engineering have released a technical report on automating vulnerability fixes with generative AI – its possibilities and its pitfalls.
 
As AI continues to advance at rapid speed, so has its ability to unearth hidden security vulnerabilities in all types of software. Every bug uncovered is an opportunity to patch and strengthen code—but as detection continues to improve, we need to be prepared with new automated solutions that bolster our ability to fix those bugs. That’s why our Secure AI Framework (SAIF) includes a fundamental pillar addressing the need to “automate defenses to keep pace with new and existing threats.” This paper shares lessons from our experience leveraging AI to scale our ability to fix bugs, specifically those found by sanitizers in C/C++, Java, and Go code.
 
By automating a pipeline to prompt Large Language Models (LLMs) to generate code fixes for human review, we have harnessed our Gemini model to successfully fix 15% of sanitizer bugs discovered during unit tests, resulting in hundreds of bugs patched. Given the large number of sanitizer bugs found each year, this seemingly modest success rate will with time save significant engineering effort. We expect this success rate to continually improve and anticipate that LLMs can be used to fix bugs in various languages across the software development lifecycle.
 
From the paper: 
 

An LLM-powered pipeline

An end-to-end solution needs a pipeline to:
 
1. Find vulnerabilities
2. Isolate and reproduce them
3. Use LLMs to create fixes
4. Test the fixes
5. Surface the best fix for human review and submission
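
The report does not publish code, but the pipeline can be pictured roughly as in the Python sketch below. Every function and the model.generate_patch call are placeholders standing in for Google’s internal tooling, not its actual implementation.

```python
# Schematic sketch of the five-step pipeline described above. Every function and
# the model.generate_patch call are placeholders; Google's implementation is not public.
def find_bugs(codebase):                    # 1. e.g., sanitizer findings surfaced by unit tests
    ...

def isolate_and_reproduce(bug):             # 2. build a minimal failing test case
    ...

def propose_fixes(bug, repro, model, n_candidates=5):   # 3. prompt the LLM for candidate patches
    return [model.generate_patch(bug, repro) for _ in range(n_candidates)]

def passes_tests(patch, repro):             # 4. re-run the sanitizer/unit tests with the patch applied
    ...

def run_pipeline(codebase, model):
    for bug in find_bugs(codebase):
        repro = isolate_and_reproduce(bug)
        candidates = propose_fixes(bug, repro, model)
        passing = [p for p in candidates if passes_tests(p, repro)]
        if passing:
            yield bug, passing[0]           # 5. surface the best fix for human review and submission
```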
 

Results

At the time of writing, we’ve accepted several hundred of these LLM-generated commits into Google’s codebase, with another several hundred in the process of being validated and submitted.
Instead of a software engineer spending an average of two hours to create each of these commits, the necessary patches are now automatically created in seconds. Perhaps unsurprisingly, we’ve seen the best success rate in fixing errors stemming from the use of an uninitialized value, a relatively simple fix. But the LLM-generated fixes didn’t target only simple errors. They also, for example, effectively initialized matrices and images using the appropriate library methods. In order of the highest fix success rate, the most commonly fixed sanitizer errors fell into four types:
 
1. Using uninitialized values
2. Data races
3. Buffer overflows
4. Temporal memory errors (e.g., use-after-scope)
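
Purely as an illustration of how such findings might be bucketed to track per-category fix rates, here is a hypothetical Python sketch. The string patterns are examples of typical sanitizer report phrasing, not Google’s actual classification logic.

```python
# Hypothetical bucketing of sanitizer findings into the four categories above,
# e.g., to track fix success rates per category. The string patterns are examples
# of typical sanitizer report phrasing, not Google's actual classification logic.
from collections import Counter

CATEGORY_PATTERNS = {
    "uninitialized-value": ("use-of-uninitialized-value",),
    "data-race": ("data race",),
    "buffer-overflow": ("heap-buffer-overflow", "stack-buffer-overflow"),
    "temporal-memory": ("use-after-free", "use-after-scope"),
}


def categorize(report_text: str) -> str:
    text = report_text.lower()
    for category, patterns in CATEGORY_PATTERNS.items():
        if any(p in text for p in patterns):
            return category
    return "other"


def fix_rate_by_category(reports, was_fixed):
    """reports: sanitizer report strings; was_fixed: parallel booleans for pipeline outcomes."""
    totals, fixed = Counter(), Counter()
    for report, ok in zip(reports, was_fixed):
        bucket = categorize(report)
        totals[bucket] += 1
        fixed[bucket] += ok
    return {bucket: fixed[bucket] / totals[bucket] for bucket in totals}
```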
 
Though a 15% success rate might sound low, many thousands of new bugs are found each year, and automating the fixes for even a small fraction of them saves months of engineering effort—meaning that potential security vulnerabilities are closed even faster. We expect improvements to continue pushing that number higher.
 

Looking ahead

While these initial results are promising, this is just a first step toward a future of AI-powered automated bug patching. We’re currently working on expanding capabilities to include multi-file fixes and to integrate multiple bug sources into the pipeline.
 
For more “Looking ahead” insights from the Google Security Engineering team, see the full paper: AI-powered patching: the future of automated vulnerability fixes
 

What Next?

Applicable Insights from the OODA Almanac 2024 – Reorientation

 
The SEI research mirrors these sub-themes from this year’s OODA Almanac:
 

Reversion to First Principles is the Foundation of the Future

In thinking about the adoption of disruptive technologies, the best mental model is not one that layers these technologies on our existing stacks, but rather rethinks the whole of the system from first principles and seeks to displace and replace with new approaches.

The ability to adapt and also revert to first principles will be a necessity of governance as well.  First principles, the fundamental concepts or assumptions at the heart of any system, serve as the bedrock upon which the future is built. In government, this approach necessitates a return to the core values and constitutional tenets that define a nation’s identity and purpose. It’s about stripping down complex policy issues to their most basic elements and rebuilding them in a way that is both innovative and cognizant of the historical context. 

When it comes to economics and money, a first principles mindset could lead to a reevaluation of foundational economic theories, potentially fostering new forms of currency or novel financial instruments that could reshape markets. This is evident in the emergence of digital currencies and the underlying blockchain technology, which challenge traditional banking paradigms and redefine value exchange. 

In the realm of engineering, applying first principles thinking often results in breakthrough innovations. By focusing on the fundamental physics of materials and processes, engineers can invent solutions that leapfrog over incremental improvements, much like how the aerospace industry has evolved with the advent of composite materials and computer-aided design. These disciplines, when underpinned by first principles, are not just adapting to change; they are the architects of the future, sculpting the landscape of what is to come.

Cyber a Safe Haven for Attackers

Attacks in cyberspace seem to carry no escalatory or deterrence consequences, especially in the realm of cybercrime, as ransomware attacks doubled over the past year with increasing impacts on the global economy. In an era dependent on technology for advantage, the importance of developing novel approaches to cybersecurity issues cannot be overstated.

The escalation of cyber threats, particularly ransomware, underscores a stark reality: our collective security posture must evolve with an urgency that matches the ingenuity of our adversaries. The doubling of ransomware attacks is not merely a statistic; it is a clarion call for a paradigm shift in how we conceptualize and implement cybersecurity measures. New concepts for how we establish jurisdiction over attacks and disrupt the economic incentives of the attackers are required. We must also embrace a more proactive stance, integrating advanced technologies like artificial intelligence and machine learning to predict and preempt attacks before they occur. Furthermore, the convergence of cybercrime with nation-state tactics necessitates a more nuanced understanding of the threat landscape, where strategic defense and risk management become as critical as tactical responses.

The future of cybersecurity lies in our ability to outpace the adaptability of threat actors, ensuring that the defenses we construct are not only resilient but also intelligent, capable of learning from each attack to bolster our protective measures. This requires a commitment to continuous innovation and the development of cybersecurity strategies that are as dynamic as the threats they aim to thwart. As we’ve seen, attackers often exploit the weakest link, which may not be within our own organizations but within our supply chains, turning trusted partners into potential vulnerabilities.

Additional OODA Loop Resources

Cyber Risks

Corporate Board Accountability for Cyber Risks: With a combination of market forces, regulatory changes, and strategic shifts, corporate boards and their directors are now accountable for cyber risks in their firms. See: Corporate Directors and Risk

Geopolitical-Cyber Risk Nexus: The interconnectivity brought by the Internet has made regional issues affect global cyberspace. Now, every significant event has cyber implications, making it imperative for leaders to recognize and act upon the symbiosis between geopolitical and cyber risks. See: The Cyber Threat

Ransomware’s Rapid Evolution: Ransomware technology and its associated criminal business models have seen significant advancements. This has culminated in a heightened threat level, resembling a pandemic in its reach and impact. Yet, there are strategies available for threat mitigation. See: Ransomware, and update.

Challenges in Cyber “Net Assessment”: While leaders have long tried to gauge both cyber risk and security, actionable metrics remain elusive. Current metrics mainly determine if a system can be compromised, without guaranteeing its invulnerability. It’s imperative not just to develop action plans against risks but to contextualize the state of cybersecurity concerning cyber threats. Despite its importance, achieving a reliable net assessment is increasingly challenging due to the pervasive nature of modern technology. See: Cyber Threat

Recommendations for Action

Decision Intelligence for Optimal Choices: The simultaneous occurrence of numerous disruptions complicates situational awareness and can inhibit effective decision-making. Every enterprise should evaluate their methods of data collection, assessment, and decision-making processes. For more insights: Decision Intelligence.

Proactive Mitigation of Cyber Threats: The relentless nature of cyber adversaries, whether they are criminals or nation-states, necessitates proactive measures. It’s crucial to remember that cybersecurity isn’t solely the responsibility of the IT department or the CISO – it’s a collective effort that involves the entire leadership. Relying solely on governmental actions isn’t advised given its inconsistent approach towards aiding industries in risk reduction. See: Cyber Defenses

The Necessity of Continuous Vigilance in Cybersecurity: The consistent warnings from the FBI and CISA concerning cybersecurity signal potential large-scale threats. Cybersecurity demands 24/7 attention, even on holidays. Ensuring team endurance and preventing burnout by allocating rest periods are imperative. See: Continuous Vigilance

Embracing Corporate Intelligence and Scenario Planning in an Uncertain Age: Apart from traditional competitive challenges, businesses also confront external threats, many of which are unpredictable. This environment amplifies the significance of Scenario Planning. It enables leaders to envision varied futures, thereby identifying potential risks and opportunities. All organizations, regardless of their size, should allocate time to refine their understanding of the current risk landscape and adapt their strategies. See: Scenario Planning

 
Daniel Pereira

About the Author

Daniel Pereira

Daniel Pereira is research director at OODA. He is a foresight strategist, creative technologist, and an information communication technology (ICT) and digital media researcher with 20+ years of experience directing public/private partnerships and strategic innovation initiatives.