A trapdoor prompt is an input designed to trigger a specific output from a language model without using any of the words in that output. It’s not a guess and not a coincidence. It’s a byproduct of how models memorize fragments of their training data and how those fragments can be resurfaced with the […]
