AI Safety Institute releases LLM safety results

Testing methods were designed to assess whether LLMs can be used to undermine national security.

The UK Government's AI Safety Institute (AISI) has published its first AI testing results. Five leading models were screened to assess their cyber, chemical and biological capabilities, their performance as agents, and the effectiveness of their safeguards.

The Institute has released only partial results so far. The models tested were given colour pseudonyms (red, purple, green, blue and yellow) and are described as the products of "major labs", but their names, and whether AISI had access to their latest versions, are unknown.

AISI’s approach and findings are as follows:

“The Institute assessed AI models against four key risk areas, including how effective the safeguards that developers have installed actually are in practice. As part of the findings, the Institute’s tests have found that:

  • Several LLMs demonstrated expert-level knowledge of chemistry and biology. Models answered over 600 private expert-written chemistry and biology questions at similar levels to humans with PhD-level training.
  • Several LLMs completed simple cyber security challenges aimed at high-school students but struggled with challenges aimed at university students.
  • Two LLMs completed short-horizon agent tasks (such as simple software engineering problems) but were unable to plan and execute sequences of actions for more complex tasks.
  • All tested LLMs remain highly vulnerable to basic jailbreaks, and some will provide harmful outputs even without dedicated attempts to circumvent their safeguards."
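AISI has not published the evaluation harness behind these headline figures. Purely as an illustration of the first finding, a multiple-choice benchmark of expert-written questions could be scored along the following lines; the question file, field names and grade function below are hypothetical assumptions for the sketch, not AISI's methodology.

    import json

    # Hypothetical sketch: score a model on a set of expert-written
    # multiple-choice questions. The JSON schema ("question", "choices",
    # "correct") and the get_model_answer callable are assumptions.
    def grade(questions_path, get_model_answer):
        with open(questions_path) as f:
            questions = json.load(f)

        correct = 0
        for item in questions:
            # Present the question with its labelled answer choices.
            prompt = item["question"] + "\n" + "\n".join(
                f"{label}. {text}" for label, text in item["choices"].items()
            )
            # Count a hit when the model's chosen label matches the key.
            if get_model_answer(prompt).strip().upper() == item["correct"]:
                correct += 1
        return correct / len(questions)

A comparison with PhD-level humans of the kind AISI reports would then amount to running the same question set past expert annotators and comparing the two accuracy figures.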

The testing focuses specifically on how models could be used to undermine national security; the results published so far do not address shorter-term risks such as bias or misinformation.

Saqib Bhatti MP, an Under-Secretary of State at the Department for Science, Innovation and Technology, told me last week that legislation will happen "eventually" and will be informed by testing. He said that the UK's approach to regulation is "pro-innovation, pro-regulatory" and that it will differ from the EU's current framework.

Speculation has already begun about which versions of the models were tested; BBC Technology editor Zoe Kleinman tweeted: "I don't think they safety tested either GPT-4o or Google's Project Astra."

The results are likely to be discussed at the Seoul Summit this week, which is being co-hosted by the UK and the Republic of Korea.

The Institute has also announced that it will establish a new base in San Francisco, the home of Silicon Valley, this summer, and that it will collaborate with its Canadian counterpart to “deepen existing links between the two nations and inspire collaborative work on systemic safety research.”

DSIT commented on the expansion:

“By expanding its foothold in the US, the Institute will establish a close collaboration with the US, furthering the country’s strategic partnership and approach to AI safety, while also sharing research and conducting joint evaluations of AI models that can inform AI safety policy across the globe.”
