All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks

Takemoto, Kazuhiro; 竹本, 和広

DOI: https://doi.org/10.3390/app14093558

Handle: http://hdl.handle.net/10228/0002000612

File: 10430409.pdf (299 KB)

Item type: Journal Article

Release date: 2024-05-10

Resource type: journal article
Resource type identifier: http://purl.org/coar/resource_type/c_6501

Title: All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
Language: eng

Author: Takemoto, Kazuhiro (竹本, 和広)
WEKO: 24877
e-Rad: 40512356
Scopus author ID: 35270356700
ORCiD: 0000-0002-6355-1366
Kyushu Institute of Technology researcher information: 100000509
Name (en): Takemoto, Kazuhiro
Name (ja): 竹本, 和広

Abstract

Large Language Models (LLMs), such as ChatGPT, encounter ‘jailbreak’ challenges, wherein safeguards are circumvented to generate ethically harmful prompts. This study introduces a straightforward black-box method for efficiently crafting jailbreak prompts that bypass LLM defenses. Our technique iteratively transforms harmful prompts into benign expressions directly utilizing the target LLM, predicated on the hypothesis that LLMs can autonomously generate expressions that evade safeguards. Through experiments conducted with ChatGPT (GPT-3.5 and GPT-4) and Gemini-Pro, our method consistently achieved an attack success rate exceeding 80% within an average of five iterations for forbidden questions and proved to be robust against model updates. The jailbreak prompts generated were not only naturally worded and succinct, but also challenging to defend against. These findings suggest that the creation of effective jailbreak prompts is less complex than previously believed, underscoring the heightened risk posed by black-box jailbreak attacks.

Bibliographic information: Applied Sciences, Vol. 14, No. 9, p. 3558, issued 2024-04-23

Publisher: MDPI

DOI (related identifier): https://doi.org/10.3390/app14093558

ISSN (EISSN): 2076-3417

Rights resource: https://creativecommons.org/licenses/by/4.0/
Rights: Copyright (c) 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Keywords (subject scheme: Other): large language models; jailbreak attacks; security and privacy

Publication type: VoR (Version of Record)
Publication type resource: http://purl.org/coar/version/c_970fb48d4fbd8a85

Peer reviewed: yes

Researcher information: https://hyokadb02.jimu.kyutech.ac.jp/html/100000509_ja.html

Paper ID (linked): 10430409

Linkage ID: 12223

Versions: Ver.1 (2024-05-10 02:47:32.547340)
