GPT-5 has been jailbroken into providing harmful instructions.

Because GPT-5 converses like a human, attackers can “trick” it with the same social-engineering techniques that fool people, leading it to unintentionally hand over bomb-making instructions.

Two AI security firms, NeuralTrust and SPLX (formerly SplxAI), tested the newly released model just one day after OpenAI unveiled GPT-5 and promptly found significant flaws.
OpenAI trained the model to refuse requests like these, but shortly after release, the NeuralTrust team combined a jailbreak method called EchoChamber with storytelling techniques to get GPT-5 to produce detailed instructions for making a Molotov cocktail.

EchoChamber is a third-party conversation-looping technique that leads AI models to unintentionally “narrate” harmful instructions. Image: Mojologic

The team said that instead of asking direct questions that would trigger a refusal, they deftly seeded hidden cues into the conversation over several rounds, leading the model to follow the storyline and eventually volunteer content that violated its own principles, without ever triggering its refusal mechanism.

The researchers concluded that GPT-5’s key flaw is that it prioritizes preserving conversational context, even when that context has been subtly manipulated toward malicious ends.
SPLX, meanwhile, mounted a different kind of attack, focusing on the StringJoin Obfuscation Attack, a prompt-obfuscation technique. By inserting hyphens between every character of the request and wrapping the whole thing in a “decryption” task, they were eventually able to fool the content filtering system.

This widely used obfuscation technique makes the request look “innocent” to the model by hiding its real target.
In one instance, the question “how to build a bomb” was fed to the model in a disguised, “encrypted” form after a long chain of setup instructions. GPT-5 answered the harmful question in a helpful, even cheerful tone, completely bypassing the refusal mechanism it was designed to trigger.

Both techniques show how context-based, multi-turn attacks can defeat GPT-5’s current safety filters, which focus mainly on single prompts. Once the model has been drawn into a narrative or scenario, it becomes biased toward staying consistent with that context and keeps producing content that fits it, even when that content is harmful or forbidden.

GPT-5 can still be coaxed into producing harmful content. Photo: Tue Minh

Based on these findings, SPLX argues that an out-of-the-box, uncustomized GPT-5 would be nearly impossible to use safely in an enterprise setting; even with additional security prompts layered on top, it still shows many gaps. GPT-4o, by contrast, proved more resistant to these attacks, especially when a hardened defense configuration was in place.
Experts caution that deploying GPT-5 immediately is risky, particularly in environments that demand strong security. Protection techniques such as prompt hardening only partially address the problem and cannot replace multi-layered, real-time monitoring and defense systems.
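To make the “multi-layered” point concrete, the sketch below shows one such layer on the defensive side: screening inputs and outputs outside the model itself. It is a minimal illustration, not NeuralTrust’s or SPLX’s actual tooling; the separator list, the blocklist patterns, and the call_model() stand-in are assumptions chosen only to catch the simple character-separation obfuscation described above.

```python
# Minimal sketch of an external screening layer (illustrative assumption, not a
# vendor's product). It normalizes character-separated obfuscation, applies
# placeholder policy rules, and checks the model's output as well as its input.
import re

SEPARATORS = r"[-_.*|]+"                          # separators used to split words apart
BLOCKED_PATTERNS = [r"\bbuild a bomb\b",          # placeholder policy rules only;
                    r"\bmolotov\b"]               # a real deployment needs a policy engine

def normalize(text: str) -> str:
    """Rejoin runs of single characters, e.g. 'h-o-w t-o' -> 'how to'."""
    rebuilt = []
    for token in text.split():
        parts = re.split(SEPARATORS, token)
        if len(parts) > 1 and all(len(p) <= 1 for p in parts):
            rebuilt.append("".join(parts))        # collapse an obfuscated word
        else:
            rebuilt.append(token)                 # leave ordinary words alone
    return " ".join(rebuilt).lower()

def violates_policy(text: str) -> bool:
    """Check the de-obfuscated text against the placeholder rules."""
    cleaned = normalize(text)
    return any(re.search(p, cleaned) for p in BLOCKED_PATTERNS)

def guarded_call(prompt: str, call_model) -> str:
    """Screen the request before, and the answer after, the model call.

    call_model is a stand-in for whatever client actually sends the request.
    """
    if violates_policy(prompt):
        return "Request blocked by input screen."
    answer = call_model(prompt)
    if violates_policy(answer):                   # output check catches leaks the
        return "Response withheld by output screen."  # input screen missed
    return answer
```

Even a layer like this only raises the bar: it does nothing about the multi-turn context manipulation described earlier, which is why the experts’ call for real-time monitoring across whole conversations still stands.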

In short, despite GPT-5’s strong language-processing capabilities, the growing sophistication of context-based attacks and content obfuscation means it does not yet meet the level of security required for broad deployment without additional protection mechanisms.
When “asked the right way,” GPT-5 still readily provides harmful instructions and generates hacking tools.
