Official code for the ACL 2026 paper *Peering Behind the Shield: Guardrail Identification in Large Language Models*.
AP-Test implements suffix optimization and evaluation for guardrail identification: training (`ap_test.py`, `adv_optimization.py`), filter-side optimization helpers (`adv_optimization_filter.py`), and multi-backend evaluation (`run_test.py`). Query set CSVs are under `data/`. Full checkpoints, large logs, and auxiliary experiments are not shipped in this repository.
- Clone this repository and enter the repo root (all Python scripts use `ROOT_PATH = "./"` relative to that directory).
- Environment — Python 3 with a GPU recommended. Install dependencies (exact versions may depend on your models; align with ProFLingo if you hit compatibility issues, e.g. with `transformers`):

  ```bash
  pip install torch transformers accelerate peft pandas numpy fschat
  pip install openai anthropic google-api-python-client
  ```

  (`fschat` is the PyPI package name for FastChat's `fastchat` imports.) Set `HF_TOKEN` or `HUGGING_FACE_HUB_TOKEN` if you use gated Hugging Face models.
- Train (writes under `saves/main/`):

  ```bash
  python -u ap_test.py --model_name meta-llama/Llama-Guard-3-8B
  ```

  Defaults match `ap_test.py` (e.g. `--dataset_name questions`, i.e. `data/questions.csv`). Override flags as needed.
- Evaluate a saved checkpoint on a filter model — replace `<run_dir>` and `<epoch>` with your run folder name and checkpoint index:

  ```bash
  python -u run_test.py \
    --model_name meta-llama/Llama-Guard-3-8B \
    --advsamples_path saves/main/<run_dir>/epoch_<epoch>.csv \
    --dataset_name questions
  ```

  Add API keys only for branches that need them (see Environment variables).
| Path | Role |
|---|---|
| `ap_test.py` | Training: load data, optimize suffixes, write CSV checkpoints under `saves/main/`. |
| `adv_optimization.py` | Loss and suffix optimization used by `ap_test.py`. |
| `adv_optimization_filter.py` | Filter-side optimization (`generate_output`, `generate_suffix`, …) and eval helpers (`complete_conversation_filter`, …). |
| `run_test.py` | Evaluate adversarial CSVs against a chosen filter or API-backed evaluator. |
| `bin/run_ap_test.sh` | SLURM training wrapper (set `ROOT_PATH` and container image inside). |
| `bin/eval.sh` | SLURM evaluation wrapper. |
| `example.sh` | Example `sbatch` lines (edit before use). |
CSV format: header `question,answer,keyword`. The stem of the filename is `--dataset_name` (no `.csv`).
| File | Use |
|---|---|
| `questions.csv` | Default `--dataset_name questions` (compatible with the ProFLingo-style setup). |
| `questions_alpaca.csv` | `--dataset_name questions_alpaca` |
| `questions_gpt5-2.csv` | `--dataset_name questions_gpt5-2` |
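If you add your own query set, it is worth validating the header before launching a run. The helper below is an illustrative sketch, not part of the repository; the name `check_query_csv` is hypothetical:

```python
import csv
from pathlib import Path

EXPECTED_HEADER = ["question", "answer", "keyword"]

def check_query_csv(path):
    """Return the dataset_name (filename stem, i.e. the value to pass as
    --dataset_name) if the CSV header matches `question,answer,keyword`;
    raise ValueError otherwise."""
    path = Path(path)
    with path.open(newline="") as f:
        header = next(csv.reader(f))
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header {header!r} in {path}")
    # --dataset_name is the filename without the .csv suffix
    return path.stem
```

For example, `check_query_csv("data/questions.csv")` would return `"questions"`.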
- Python (`ap_test.py`, `run_test.py`, `adv_optimization_filter.py`): `ROOT_PATH = "./"`. Run commands from the repository root so `data/` and `saves/` resolve correctly, or change `ROOT_PATH` in code if your layout differs.
- Shell (`bin/run_ap_test.sh`, `bin/eval.sh`): they still use a placeholder `ROOT_PATH='/path/to/AP-Test/'`. Replace that with your absolute checkout path, set `YOUR_CONTAINER_IMAGE` for Apptainer/Singularity if used, and adjust `#SBATCH` partitions and resources for your cluster.
Training reads `data/{dataset_name}.csv` and writes under `saves/main/<run-specific>/`.
`bin/run_ap_test.sh` positional arguments: (1) short label for logs, (2) Hugging Face model id or local path, (3) `data_offset`, (4) `batch_size`, (5) `alpha`, (6) `beta`, (7) `epochs`, (8) `dataset_name` (CSV stem). The script fixes `lr=0.001`, `save_per_epochs=20`, and `init_mode=0`. The default auxiliary hate model is `facebook/roberta-hate-speech-dynabench-r4-target` (`--hate_model_name`).
Logs with the prefix `adv_optimization_` go under `logs/` when using the provided shell layout.
`bin/eval.sh` positional arguments: (1) log label, (2) filter model path or HF id, (3) subdirectory of `saves/main/` containing `epoch_<n>.csv`, (4) epoch number `n`, (5) `dataset_name`.
Direct invocation uses `--model_name`, `--advsamples_path`, and `--dataset_name` (see the argparse at the end of `run_test.py`).
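The mapping from `bin/eval.sh`'s positional arguments to a direct `run_test.py` invocation can be sketched as follows. Only the three flags match the repository's argparse; the helper itself (`eval_command`) and the example run directory name are hypothetical:

```python
import shlex

def eval_command(model_name, run_dir, epoch, dataset_name, root="./"):
    """Build the direct run_test.py command line corresponding to
    bin/eval.sh's positional arguments (2)-(5): the checkpoint lives at
    saves/main/<run_dir>/epoch_<epoch>.csv under the repo root."""
    advsamples_path = f"{root}saves/main/{run_dir}/epoch_{epoch}.csv"
    argv = [
        "python", "-u", "run_test.py",
        "--model_name", model_name,
        "--advsamples_path", advsamples_path,
        "--dataset_name", dataset_name,
    ]
    return shlex.join(argv)  # shell-safe single string
```

For example, `eval_command("meta-llama/Llama-Guard-3-8B", "my_run", 40, "questions")` yields a command pointing at `saves/main/my_run/epoch_40.csv`.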
| Variable | Purpose |
|---|---|
| `HF_TOKEN` / `HUGGING_FACE_HUB_TOKEN` | Hugging Face Hub access for gated or private models. |
| `OPENAI_API_KEY` | OpenAI / GPT-style paths in `run_test.py`. |
| `ANTHROPIC_API_KEY` | Claude when the selected backend uses it. |
| `GOOGLE_API_KEY` | Perspective API when `--model_name` selects that path. |
Example:

```bash
export HF_TOKEN="hf_..."
export OPENAI_API_KEY="sk-..."
```

Set `TRANSFORMERS_CACHE` (or other HF cache env vars) if you need a non-default model cache.
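Since each key is only needed by the backend that reads it, a small pre-flight check can save a failed run. The backend labels and the `missing_keys` helper below are illustrative (they are not `run_test.py`'s actual `--model_name` values); only the variable names come from the table above:

```python
import os

# Accepted credential variables per evaluator backend, per the table above.
REQUIRED_KEYS = {
    "huggingface": ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"),
    "openai": ("OPENAI_API_KEY",),
    "anthropic": ("ANTHROPIC_API_KEY",),
    "perspective": ("GOOGLE_API_KEY",),
}

def missing_keys(backend, env=os.environ):
    """Return the variable names to set if no accepted credential for
    `backend` is present in `env`; an empty list means you are good."""
    accepted = REQUIRED_KEYS[backend]
    if any(name in env for name in accepted):
        return []
    return list(accepted)
```

For instance, `missing_keys("openai", {})` returns `["OPENAI_API_KEY"]`, while either Hugging Face variable satisfies the `"huggingface"` check.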
If you find this useful in your research, please consider citing:
```bibtex
@inproceedings{YWWBZ26,
  author    = {Ziqing Yang and Yixin Wu and Rui Wen and Michael Backes and Yang Zhang},
  title     = {{Peering Behind the Shield: Guardrail Identification in Large Language Models}},
  booktitle = {{Annual Meeting of the Association for Computational Linguistics (ACL)}},
  publisher = {ACL},
  year      = {2026}
}
```
This repository is released under the MIT License; see LICENSE.
This project builds on ProFLingo (MIT), which introduced the fingerprinting / adversarial-suffix pipeline we extend for guardrail-focused experiments. See ProFLingo's repository for the original paper, citation, and related assets. When redistributing, keep this project's license notice, and respect upstream and third-party model or dataset terms.