    New ‘Bad Likert Judge’ Jailbreak Raises LLM Attack Success Rates by Over 60%

    Cybersecurity researchers have detailed a new technique for circumventing the safety mechanisms of large language models (LLMs), making it possible to coax them into producing harmful or malicious outputs.

    Dubbed Bad Likert Judge, the multi-turn (or many-shot) adversarial technique was developed and analyzed by Palo Alto Networks Unit 42 researchers Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky.

    The method works by asking the target LLM to act as a judge, scoring the harmfulness of a given response on the Likert scale, a psychometric rating scale commonly used to measure degrees of agreement or disagreement. The model is then steered into generating example responses that correspond to the different scores; the example aligned with the highest rating is the one most likely to contain the harmful content the attacker is after.

    The Evolving Landscape of Prompt Injection Attacks

    The rapid adoption of artificial intelligence has also given rise to a class of cybersecurity threats known as prompt injection attacks, in which carefully crafted inputs cause a model to ignore its intended behavior or safety guardrails.

    A prominent example is many-shot jailbreaking, which exploits an LLM’s long context window and attention mechanism by stringing together a series of prompts that gradually nudge the model toward generating harmful content without triggering its internal defenses. Earlier examples of this class of attack include Crescendo and Deceptive Delight.

    Unit 42’s Bad Likert Judge builds on this foundation by turning the model’s evaluative abilities against it: the LLM is assigned the role of judging how harmful various responses are, and the attacker then asks it to produce example responses matching specific Likert ratings, harvesting the most harmful ones.
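
    The Unit 42 report does not ship defensive code, but the judge-role scaffolding the attack depends on suggests an obvious countermeasure: flag prompts that try to recast the model as a Likert-scale harm rater before they ever reach it. The snippet below is a hypothetical heuristic pre-filter of our own, not Unit 42’s tooling, and the cue lists are illustrative only.

```python
import re

# Hypothetical heuristic (not from the Unit 42 report): flag prompts that try to
# recast the model as a Likert-scale harm judge, which is the scaffolding the
# Bad Likert Judge technique sets up before requesting example responses.
ROLE_CUES = re.compile(r"(act as|you are|play the role of)\s+(a\s+|an\s+)?(judge|evaluator|rater)",
                       re.IGNORECASE)
SCALE_CUES = re.compile(r"(likert|scale of \d+\s*(?:to|-)\s*\d+)", re.IGNORECASE)
HARM_CUES = re.compile(r"(harmful|unsafe|dangerous|malicious)", re.IGNORECASE)

def looks_like_likert_judge_setup(prompt: str) -> bool:
    """Cheap pre-filter: True when a prompt combines judge-role assignment,
    a numeric/Likert scale, and harmfulness language."""
    return all(pattern.search(prompt) for pattern in (ROLE_CUES, SCALE_CUES, HARM_CUES))

if __name__ == "__main__":
    demo = ("You are an evaluator. Rate how harmful the following reply is "
            "on a Likert scale of 1 to 3.")
    print(looks_like_likert_judge_setup(demo))  # True
```

    A keyword heuristic like this would only be a first layer; a production guardrail would pair it with a trained classifier and with output-side filtering, as discussed below.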

    Unveiling Alarming Success Rates

    Testing the technique against six state-of-the-art text-generation LLMs from Amazon Web Services, Google, Meta, Microsoft, OpenAI, and NVIDIA showed that Bad Likert Judge raised the attack success rate by more than 60% on average compared with plain attack prompts, across categories such as:

    • Hate speech and harassment
    • Self-harm content
    • Explicit material
    • Illicit weapon proliferation
    • Cybercrime facilitation
    • Malware generation
    • Leakage of internal system prompts

    “By exploiting the model’s intricate comprehension of harmful content and its evaluative capabilities, this technique substantially heightens the probability of breaching safety frameworks,” the researchers highlighted.

    Importantly, applying content filtering to the tested models cut attack success rates by an average of 89.2 percentage points across all models. This underscores the need for comprehensive content filtering when deploying LLMs in real-world applications.
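
    Unit 42 does not publish its filter implementation, but the general pattern is straightforward to sketch: wrap the model call so that every prompt and every completion passes through an independent moderation check before anything reaches the user. In the sketch below, generate() and moderation_flagged() are hypothetical placeholders for an LLM call and a safety classifier, not a specific vendor API.

```python
# Minimal sketch of output-side filtering: run every prompt and completion
# through an independent content filter so a jailbreak has to defeat both the
# model's alignment and the filter. Both helper functions are placeholders.

REFUSAL = "The response was withheld because it failed a safety check."

def generate(prompt: str) -> str:
    """Placeholder for the underlying LLM call."""
    raise NotImplementedError

def moderation_flagged(text: str) -> bool:
    """Placeholder for a moderation classifier (e.g. a hosted moderation
    endpoint or a local safety model) that returns True for disallowed content."""
    raise NotImplementedError

def guarded_generate(prompt: str) -> str:
    # Check the prompt on the way in and the completion on the way out.
    if moderation_flagged(prompt):
        return REFUSAL
    completion = generate(prompt)
    if moderation_flagged(completion):
        return REFUSAL
    return completion
```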

    Broader Implications for AI Reliability

    The findings come amid separate reports of deceptive practices targeting AI tools. A recent investigation by The Guardian showed that ChatGPT’s search functionality can be duped into producing misleading summaries: hidden content embedded in web pages was found to influence the AI’s output even when it contained no explicit instructions, and planted fake reviews written to be overwhelmingly positive could skew its assessment of a product toward unwarranted favorability.

    “Such exploits can be weaponized to, for example, compel ChatGPT to provide glowing endorsements of products despite overwhelming negative feedback,” the report elaborated. “Even unintentional inclusion of concealed text can manipulate the AI’s conclusions.”
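
    One straightforward mitigation on the ingestion side is to strip content a browser would never render before the page text reaches the summarizer. The sketch below uses BeautifulSoup to illustrate the idea; it is not a description of how ChatGPT’s search pipeline actually works, and a production system would also need to evaluate external stylesheets and dynamically injected content.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def visible_text(html: str) -> str:
    """Drop markup that is invisible in a normal browser before handing page
    text to a summarizer, so concealed instructions or planted hidden reviews
    never reach the model."""
    soup = BeautifulSoup(html, "html.parser")

    # Elements a browser never renders as text.
    for tag in soup(["script", "style", "noscript", "template"]):
        tag.decompose()

    # Elements hidden via attributes or inline CSS (this sketch only checks
    # inline style hints, not external stylesheets).
    for tag in soup.find_all(True):
        if tag.decomposed:
            continue
        style = (tag.get("style") or "").replace(" ", "").lower()
        if tag.has_attr("hidden") or "display:none" in style or "visibility:hidden" in style:
            tag.decompose()

    return soup.get_text(separator=" ", strip=True)
```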

    The Path Forward

    As adversarial techniques like Bad Likert Judge continue to challenge the robustness of LLMs, these revelations emphasize the critical importance of advancing AI security protocols. The integration of multilayered safeguards, rigorous testing, and adaptive content filtering will be indispensable in mitigating these emerging threats and ensuring the responsible application of AI technologies.
