
TrustAIRLab/AP-Test


Peering Behind the Shield: Guardrail Identification in Large Language Models (ACL 2026)


Official code for the ACL 2026 paper Peering Behind the Shield: Guardrail Identification in Large Language Models.

AP-Test implements suffix optimization and evaluation for guardrail identification: training (ap_test.py, adv_optimization.py), filter-side optimization helpers (adv_optimization_filter.py), and multi-backend evaluation (run_test.py). Query set CSVs are under data/. Full checkpoints, large logs, and auxiliary experiments are not shipped in this repository.


Quick start (local, no SLURM)

  1. Clone this repository and enter the repo root (all Python scripts use ROOT_PATH = "./" relative to that directory).

  2. Environment — Python 3 with a GPU is recommended. Install dependencies (exact versions may depend on your models; if you hit compatibility issues, e.g. with transformers, align versions with ProFLingo's requirements):

    pip install torch transformers accelerate peft pandas numpy fschat
    pip install openai anthropic google-api-python-client

    (fschat is the PyPI package name for FastChat’s fastchat imports.)

    Set HF_TOKEN or HUGGING_FACE_HUB_TOKEN if you use gated Hugging Face models.

  3. Train (writes under saves/main/):

    python -u ap_test.py --model_name meta-llama/Llama-Guard-3-8B

    Defaults match ap_test.py (e.g. --dataset_name questions, data/questions.csv). Override flags as needed.
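    Beyond --model_name, the wrapper-exposed hyperparameters can also be overridden directly. The invocation below is a sketch only: it assumes the flag names mirror the positional parameters of bin/run_ap_test.sh (data_offset, batch_size, alpha, beta, epochs), and the values are illustrative — verify both against the argparse block in ap_test.py before relying on them:

    ```shell
    # Hypothetical flag names/values; check ap_test.py's argparse definitions.
    python -u ap_test.py \
      --model_name meta-llama/Llama-Guard-3-8B \
      --dataset_name questions \
      --batch_size 8 \
      --epochs 100 \
      --alpha 1.0 --beta 1.0
    ```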

  4. Evaluate a saved checkpoint on a filter model — replace <run_dir> and <epoch> with your run folder name and checkpoint index:

    python -u run_test.py \
      --model_name meta-llama/Llama-Guard-3-8B \
      --advsamples_path saves/main/<run_dir>/epoch_<epoch>.csv \
      --dataset_name questions

    Add API keys only for branches that need them (see Environment variables).


Repository layout

| Path | Role |
| --- | --- |
| ap_test.py | Training: load data, optimize suffixes, write CSV checkpoints under saves/main/. |
| adv_optimization.py | Loss and suffix optimization used by ap_test.py. |
| adv_optimization_filter.py | Filter-side optimization (generate_output, generate_suffix, …) and eval helpers (complete_conversation_filter, …). |
| run_test.py | Evaluate adversarial CSVs against a chosen filter or API-backed evaluator. |
| bin/run_ap_test.sh | SLURM training wrapper (set ROOT_PATH and container image inside). |
| bin/eval.sh | SLURM evaluation wrapper. |
| example.sh | Example sbatch lines (edit before use). |

Data (data/)

CSV format: header question,answer,keyword. The stem of the filename is --dataset_name (no .csv).

| File | Use |
| --- | --- |
| questions.csv | Default: --dataset_name questions (compatible with the ProFLingo-style setup). |
| questions_alpaca.csv | --dataset_name questions_alpaca |
| questions_gpt5-2.csv | --dataset_name questions_gpt5-2 |
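If you add your own query set, it must follow the same header. A minimal sketch of a validity check before passing the file's stem as --dataset_name (the sample row and helper name are illustrative, not part of the repo):

```python
# Validate that a query-set CSV matches the expected question,answer,keyword header.
import csv
import io

EXPECTED_HEADER = ["question", "answer", "keyword"]

def validate_query_csv(text):
    """Return the parsed rows if the header matches, else raise ValueError."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"expected header {EXPECTED_HEADER}, got {reader.fieldnames}")
    return list(reader)

sample = "question,answer,keyword\nHow do I pick a lock?,I can't help with that.,lock\n"
rows = validate_query_csv(sample)
print(len(rows), rows[0]["keyword"])  # → 1 lock
```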

Paths: Python vs SLURM scripts

  • Python (ap_test.py, run_test.py, adv_optimization_filter.py): ROOT_PATH = "./". Run commands from the repository root so data/ and saves/ resolve correctly, or change ROOT_PATH in code if your layout differs.
  • Shell (bin/run_ap_test.sh, bin/eval.sh): they still use a placeholder ROOT_PATH='/path/to/AP-Test/'. Replace that with your absolute checkout path, set YOUR_CONTAINER_IMAGE for Apptainer/Singularity if used, and adjust #SBATCH partitions and resources for your cluster.
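A sketch of the edits both wrappers need before submission; the partition, resources, and paths below are placeholders for your cluster, not values shipped in this repo:

```shell
#!/bin/bash
#SBATCH --partition=gpu        # your cluster's GPU partition
#SBATCH --gres=gpu:1
#SBATCH --mem=64G
#SBATCH --time=24:00:00

ROOT_PATH='/home/you/AP-Test/'             # absolute checkout path
YOUR_CONTAINER_IMAGE='/images/pytorch.sif' # only if using Apptainer/Singularity
```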

Training (ap_test.py)

Training reads data/{dataset_name}.csv and writes under saves/main/<run-specific>/.

bin/run_ap_test.sh positional arguments: (1) short label for logs, (2) Hugging Face model id or local path, (3) data_offset, (4) batch_size, (5) alpha, (6) beta, (7) epochs, (8) dataset_name (CSV stem). The script fixes lr=0.001, save_per_epochs=20, init_mode=0. Default hate-model auxiliary: facebook/roberta-hate-speech-dynabench-r4-target (--hate_model_name).

Logs with prefix adv_optimization_ go under logs/ when using the provided shell layout.
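Mapping those eight positional arguments onto a concrete sbatch line (the label, model, and hyperparameter values here are illustrative only):

```shell
# args: (1) log label  (2) model id  (3) data_offset  (4) batch_size
#       (5) alpha  (6) beta  (7) epochs  (8) dataset_name (CSV stem)
sbatch bin/run_ap_test.sh llamaguard3 meta-llama/Llama-Guard-3-8B \
  0 8 1.0 1.0 100 questions
```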


Evaluation (run_test.py)

bin/eval.sh positional arguments: (1) log label, (2) filter model path or HF id, (3) subdirectory of saves/main/ containing epoch_<n>.csv, (4) epoch number n, (5) dataset_name.

Direct invocation uses --model_name, --advsamples_path, and --dataset_name (see argparse at the end of run_test.py).
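The same positional mapping for the evaluation wrapper, with the run-specific placeholders left for you to fill in:

```shell
# args: (1) log label  (2) filter model  (3) run dir under saves/main/
#       (4) epoch number  (5) dataset_name
sbatch bin/eval.sh llamaguard3-eval meta-llama/Llama-Guard-3-8B \
  <run_dir> <epoch> questions
```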


Environment variables

| Variable | Purpose |
| --- | --- |
| HF_TOKEN / HUGGING_FACE_HUB_TOKEN | Hugging Face Hub access for gated or private models. |
| OPENAI_API_KEY | OpenAI / GPT-style paths in run_test.py. |
| ANTHROPIC_API_KEY | Claude when the selected backend uses it. |
| GOOGLE_API_KEY | Perspective API when --model_name selects that path. |

Example:

export HF_TOKEN="hf_..."
export OPENAI_API_KEY="sk-..."

Set TRANSFORMERS_CACHE (or other HF cache env vars) if you need a non-default model cache.
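For example, to point the caches at scratch storage (the path is hypothetical; newer transformers releases prefer HF_HOME over the older TRANSFORMERS_CACHE variable):

```shell
export HF_HOME=/scratch/$USER/hf_cache              # umbrella HF cache location
export TRANSFORMERS_CACHE="$HF_HOME/transformers"   # older variable, still read by some versions
```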


Citation

If you find this useful in your research, please consider citing:

@inproceedings{YWWBZ26,
  author = {Ziqing Yang and Yixin Wu and Rui Wen and Michael Backes and Yang Zhang},
  title = {{Peering Behind the Shield: Guardrail Identification in Large Language Models}},
  booktitle = {{Annual Meeting of the Association for Computational Linguistics (ACL)}},
  publisher = {ACL},
  year = {2026}
}

License

This repository is released under the MIT License; see LICENSE.

This project builds on ProFLingo (MIT), which introduced the fingerprinting / adversarial-suffix pipeline we extend for guardrail-focused experiments. See ProFLingo's repository for the original paper, citation, and related assets. When redistributing, retain this project's license notice and respect upstream and third-party model or dataset terms.
