Google’s Gemini, a large language model (LLM), has been found to have vulnerabilities that could lead it to reveal system instructions, create harmful content, and allow indirect attacks.
These issues were identified by HiddenLayer. They affect users of Gemini Advanced with Google Workspace and companies using the LLM API.
The first problem involves finding a way to get past security measures to reveal the system’s instructions (or a message), which are meant to guide the LLM in generating better responses. This can be done by asking the model to show its “basic instructions” in a markdown block.
According to Microsoft’s documentation on LLM prompt engineering, “A message can help the LLM understand the context, such as the type of conversation it’s having or the task it’s supposed to do. It helps the LLM generate more suitable responses.”
This is possible because models can be tricked using a synonym attack to avoid security and content restrictions.
A second set of vulnerabilities involves using “smart jailbreaking” methods to make Gemini models produce false information about topics like elections or even output potentially illegal and dangerous information (e.g., how to start a car without keys) by asking it to pretend to be in a made-up scenario.
HiddenLayer also identified a third flaw where the LLM could reveal information in the system instructions by repeatedly using uncommon tokens as input.
“Most LLMs are trained to respond to queries with a clear separation between the user’s input and the system instructions,” explained security researcher Kenneth Yeung in a report on Tuesday.
“By creating a series of random tokens, we can trick the LLM into thinking it’s time to respond, causing it to produce a confirmation message that usually includes the information in the instructions.”
Another test involves using Gemini Advanced and a specially designed Google document, which is connected to the LLM through the Google Workspace extension.
The instructions in the document could be set up to override the model’s instructions and perform a series of harmful actions, giving an attacker full control over a person’s interactions with the model.
These findings come as a group of academics from various universities revealed a new model-stealing attack that allows for the extraction of “specific, important information from black-box production language models like OpenAI’s ChatGPT or Google’s PaLM-2.”
However, it’s important to note that these vulnerabilities are not unique and exist in other LLMs in the industry. These findings highlight the need to test models for attacks on instructions, extraction of training data, manipulation of models, adversarial examples, data contamination, and data theft.
“To protect our users from vulnerabilities, we regularly conduct tests and train our models to defend against malicious behaviors like instruction attacks, jailbreaking, and other advanced attacks,” said a Google spokesperson to The Hacker News. “We’ve also implemented measures to prevent harmful or misleading responses, which we are continuously improving.”
The company also stated that it is restricting responses to queries related to elections as a precaution. This policy is expected to apply to questions about candidates, political parties, election results, voting information, and well-known public figures.