
TrustAIRLab/AP-Test


Peering Behind the Shield: Guardrail Identification in Large Language Models (ACL 2026)


Official code for the ACL 2026 paper Peering Behind the Shield: Guardrail Identification in Large Language Models.

AP-Test implements suffix optimization and evaluation for guardrail identification: training (ap_test.py, adv_optimization.py), filter-side optimization helpers (adv_optimization_filter.py), and multi-backend evaluation (run_test.py). Query set CSVs are under data/. Full checkpoints, large logs, and auxiliary experiments are not shipped in this repository.


Quick start (local, no SLURM)

  1. Clone this repository and enter the repo root (all Python scripts use ROOT_PATH = "./" relative to that directory).

  2. Environment — Python 3 with a GPU is recommended. Install dependencies (exact versions may depend on your models; if you hit compatibility issues, e.g. with transformers, align versions with ProFLingo's requirements):

    pip install torch transformers accelerate peft pandas numpy fschat
    pip install openai anthropic google-api-python-client

    (fschat is the PyPI package name for FastChat’s fastchat imports.)

    Set HF_TOKEN or HUGGING_FACE_HUB_TOKEN if you use gated Hugging Face models.

  3. Train (writes under saves/main/):

    python -u ap_test.py --model_name meta-llama/Llama-Guard-3-8B

    Defaults match ap_test.py (e.g. --dataset_name questions, data/questions.csv). Override flags as needed.
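    Beyond --model_name, the wrapper-exposed hyperparameters can also be overridden directly. The invocation below is a sketch only: it assumes the flag names mirror the positional parameters of bin/run_ap_test.sh (data_offset, batch_size, alpha, beta, epochs), and the values are illustrative — verify both against the argparse block in ap_test.py before relying on them:

    ```shell
    # Hypothetical flag names/values; check ap_test.py's argparse definitions.
    python -u ap_test.py \
      --model_name meta-llama/Llama-Guard-3-8B \
      --dataset_name questions \
      --batch_size 8 \
      --epochs 100 \
      --alpha 1.0 --beta 1.0
    ```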

  4. Evaluate a saved checkpoint on a filter model — replace <run_dir> and <epoch> with your run folder name and checkpoint index:

    python -u run_test.py \
      --model_name meta-llama/Llama-Guard-3-8B \
      --advsamples_path saves/main/<run_dir>/epoch_<epoch>.csv \
      --dataset_name questions

    Add API keys only for branches that need them (see Environment variables).


Repository layout

| Path | Role |
| --- | --- |
| ap_test.py | Training: load data, optimize suffixes, write CSV checkpoints under saves/main/. |
| adv_optimization.py | Loss and suffix optimization used by ap_test.py. |
| adv_optimization_filter.py | Filter-side optimization (generate_output, generate_suffix, …) and eval helpers (complete_conversation_filter, …). |
| run_test.py | Evaluate adversarial CSVs against a chosen filter or API-backed evaluator. |
| bin/run_ap_test.sh | SLURM training wrapper (set ROOT_PATH and container image inside). |
| bin/eval.sh | SLURM evaluation wrapper. |
| example.sh | Example sbatch lines (edit before use). |

Data (data/)

CSV format: header question,answer,keyword. The stem of the filename is --dataset_name (no .csv).

| File | Use |
| --- | --- |
| questions.csv | Default: --dataset_name questions (compatible with the ProFLingo-style setup). |
| questions_alpaca.csv | --dataset_name questions_alpaca |
| questions_gpt5-2.csv | --dataset_name questions_gpt5-2 |
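If you add your own query set, it must follow the same header. A minimal sketch of a validity check before passing the file's stem as --dataset_name (the sample row and helper name are illustrative, not part of the repo):

```python
# Validate that a query-set CSV matches the expected question,answer,keyword header.
import csv
import io

EXPECTED_HEADER = ["question", "answer", "keyword"]

def validate_query_csv(text):
    """Return the parsed rows if the header matches, else raise ValueError."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"expected header {EXPECTED_HEADER}, got {reader.fieldnames}")
    return list(reader)

sample = "question,answer,keyword\nHow do I pick a lock?,I can't help with that.,lock\n"
rows = validate_query_csv(sample)
print(len(rows), rows[0]["keyword"])  # → 1 lock
```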

Paths: Python vs SLURM scripts

  • Python (ap_test.py, run_test.py, adv_optimization_filter.py): ROOT_PATH = "./". Run commands from the repository root so data/ and saves/ resolve correctly, or change ROOT_PATH in code if your layout differs.
  • Shell (bin/run_ap_test.sh, bin/eval.sh): they still use a placeholder ROOT_PATH='/path/to/AP-Test/'. Replace that with your absolute checkout path, set YOUR_CONTAINER_IMAGE for Apptainer/Singularity if used, and adjust #SBATCH partitions and resources for your cluster.
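A sketch of the edits both wrappers need before submission; the partition, resources, and paths below are placeholders for your cluster, not values shipped in this repo:

```shell
#!/bin/bash
#SBATCH --partition=gpu        # your cluster's GPU partition
#SBATCH --gres=gpu:1
#SBATCH --mem=64G
#SBATCH --time=24:00:00

ROOT_PATH='/home/you/AP-Test/'             # absolute checkout path
YOUR_CONTAINER_IMAGE='/images/pytorch.sif' # only if using Apptainer/Singularity
```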

Training (ap_test.py)

Training reads data/{dataset_name}.csv and writes under saves/main/<run-specific>/.

bin/run_ap_test.sh positional arguments: (1) short label for logs, (2) Hugging Face model id or local path, (3) data_offset, (4) batch_size, (5) alpha, (6) beta, (7) epochs, (8) dataset_name (CSV stem). The script fixes lr=0.001, save_per_epochs=20, init_mode=0. Default hate-model auxiliary: facebook/roberta-hate-speech-dynabench-r4-target (--hate_model_name).

Logs with prefix adv_optimization_ go under logs/ when using the provided shell layout.
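Mapping those eight positional arguments onto a concrete sbatch line (the label, model, and hyperparameter values here are illustrative only):

```shell
# args: (1) log label  (2) model id  (3) data_offset  (4) batch_size
#       (5) alpha  (6) beta  (7) epochs  (8) dataset_name (CSV stem)
sbatch bin/run_ap_test.sh llamaguard3 meta-llama/Llama-Guard-3-8B \
  0 8 1.0 1.0 100 questions
```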


Evaluation (run_test.py)

bin/eval.sh positional arguments: (1) log label, (2) filter model path or HF id, (3) subdirectory of saves/main/ containing epoch_<n>.csv, (4) epoch number n, (5) dataset_name.

Direct invocation uses --model_name, --advsamples_path, and --dataset_name (see argparse at the end of run_test.py).
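The same positional mapping for the evaluation wrapper, with the run-specific placeholders left for you to fill in:

```shell
# args: (1) log label  (2) filter model  (3) run dir under saves/main/
#       (4) epoch number  (5) dataset_name
sbatch bin/eval.sh llamaguard3-eval meta-llama/Llama-Guard-3-8B \
  <run_dir> <epoch> questions
```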


Environment variables

| Variable | Purpose |
| --- | --- |
| HF_TOKEN / HUGGING_FACE_HUB_TOKEN | Hugging Face Hub access for gated or private models. |
| OPENAI_API_KEY | OpenAI / GPT-style paths in run_test.py. |
| ANTHROPIC_API_KEY | Claude when the selected backend uses it. |
| GOOGLE_API_KEY | Perspective API when --model_name selects that path. |

Example:

export HF_TOKEN="hf_..."
export OPENAI_API_KEY="sk-..."

Set TRANSFORMERS_CACHE (or other HF cache env vars) if you need a non-default model cache.
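For example, to point the caches at scratch storage (the path is hypothetical; newer transformers releases prefer HF_HOME over the older TRANSFORMERS_CACHE variable):

```shell
export HF_HOME=/scratch/$USER/hf_cache              # umbrella HF cache location
export TRANSFORMERS_CACHE="$HF_HOME/transformers"   # older variable, still read by some versions
```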


Citation

If you find this useful in your research, please consider citing:

@inproceedings{YWWBZ26,
  author = {Ziqing Yang and Yixin Wu and Rui Wen and Michael Backes and Yang Zhang},
  title = {{Peering Behind the Shield: Guardrail Identification in Large Language Models}},
  booktitle = {{Annual Meeting of the Association for Computational Linguistics (ACL)}},
  publisher = {ACL},
  year = {2026}
}

License

This repository is released under the MIT License; see LICENSE.

This project builds on ProFLingo (MIT), which introduced the fingerprinting / adversarial-suffix pipeline we extend for guardrail-focused experiments. See ProFLingo's repository for the original paper, citation, and related assets. When redistributing, retain this project's license notice and respect upstream and third-party model or dataset terms.
