DevMaxC/SteerLLM

SteerLLM

SteerLLM is a runnable scaffold for extracting and steering emotion/refusal vectors in open-weight language models. The current implementation replicates the main openly reproducible parts of Anthropic's emotion-concept vector work on an open-weight Gemma 4 model.

[Image: SteerLLM dashboard]

Default model: google/gemma-4-E2B-it

The default is intentionally the smallest instruction-tuned Gemma 4 checkpoint. Change model.model_id in a config file if you want to run google/gemma-4-E4B-it, google/gemma-4-26B-A4B-it, or google/gemma-4-31B-it.

What This Sets Up

  • Synthetic story generation for emotion labels.
  • Neutral text generation for PCA denoising.
  • Hidden-state collection from a configurable decoder layer.
  • Mean-difference emotion vector extraction.
  • Optional neutral-PC projection.
  • Implicit prompt validation.
  • Pairwise activity preference evaluation from A/B logits.
  • Forward-hook activation steering for generation and preference experiments.

This does not reproduce Anthropic's private Claude evaluations or the exact internal residual-stream intervention machinery. It provides the closest practical open-model analogue using Hugging Face hidden states and layer-output hooks.
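
As a rough illustration of the mean-difference extraction and optional neutral-PC projection listed above, here is a minimal NumPy sketch. It is not the repository's code: the function names, the choice of a label-balanced reference (the average of per-emotion means), and the number of neutral components `k` are all assumptions made for the example.

```python
import numpy as np


def mean_difference_vectors(acts, labels):
    """acts: (n_examples, hidden_dim) array; labels: one emotion name per row.

    Hypothetical helper: returns sorted emotion names and one vector per emotion.
    """
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    # Mean hidden state for each emotion label.
    means = np.stack([acts[labels == name].mean(axis=0) for name in names])
    # Assumed reference: the average of the per-emotion means, so that
    # labels with more stories do not dominate the baseline.
    reference = means.mean(axis=0)
    return names, means - reference


def project_out_neutral_pcs(vectors, neutral_acts, k):
    """Remove the top-k principal directions of neutral-text activations."""
    centered = neutral_acts - neutral_acts.mean(axis=0)
    # Rows of vt are orthonormal principal directions of the neutral data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pcs = vt[:k]
    # Subtract each vector's component inside the neutral subspace.
    return vectors - (vectors @ pcs.T) @ pcs
```

The projection step leaves each emotion vector orthogonal to the top `k` neutral directions, which is one common way to read "PCA denoising"; the actual scripts may pick `k` or the reference mean differently.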

Install

Use a fresh environment. Gemma 4 support requires a recent Transformers release.

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

If the Gemma weights are gated for your account, log in first:

hf auth login

Check The Model

Fetch only the Hugging Face config:

python scripts/check_gemma4.py --config configs/gemma4_e2b_it.yaml

Actually load weights:

python scripts/check_gemma4.py --config configs/gemma4_e2b_it.yaml --load-model

End-To-End Minimal Run

Generate the synthetic dataset:

python scripts/generate_stories.py \
  --config configs/gemma4_e2b_it.yaml \
  --stories-per-topic 1

Extract vectors:

python scripts/extract_vectors.py --config configs/gemma4_e2b_it.yaml

Validate on implicit emotional prompts:

python scripts/validate_vectors.py --config configs/gemma4_e2b_it.yaml

Run stronger diagnostics against generated data:

python scripts/validate_story_holdout.py --config configs/gemma4_e2b_it.yaml
python scripts/probe_sweeps.py --config configs/gemma4_e2b_it.yaml
python scripts/layer_sweep.py --config configs/gemma4_e2b_it.yaml

Run the activity preference eval:

python scripts/preference_eval.py \
  --config configs/gemma4_e2b_it.yaml \
  --out artifacts/gemma4_e2b_it/preferences_baseline.csv
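
One way to turn a pair of A/B logits into a preference score is a two-way softmax, sketched below. This is an assumed formulation for illustration only; `ab_preference` is a hypothetical name, and the script may report raw logit differences or a different normalization.

```python
import math


def ab_preference(logit_a, logit_b):
    """Return P(A) under a softmax restricted to the two options A and B."""
    # A two-way softmax reduces to a sigmoid of the logit difference.
    diff = logit_a - logit_b
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    # For negative differences, use the equivalent form that avoids overflow.
    e = math.exp(diff)
    return e / (1.0 + e)


# Illustrative numbers only, not results from any run.
baseline = ab_preference(1.2, 1.0)
steered = ab_preference(1.9, 1.0)
shift_toward_a = steered - baseline
```

Comparing the score with and without steering on one option gives a simple per-pair measure of how much the steering vector moved the preference.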

Run the preference/activation/steering diagnostic suite:

python scripts/preference_suite.py --config configs/gemma4_e2b_it.yaml

Run the same eval while steering one option toward an emotion:

python scripts/preference_eval.py \
  --config configs/gemma4_e2b_it.yaml \
  --steer-emotion blissful \
  --steer-strength 0.05 \
  --steer-option A \
  --out artifacts/gemma4_e2b_it/preferences_blissful_A.csv

Generate with steering:

python scripts/steer_generate.py \
  --config configs/gemma4_e2b_it.yaml \
  --emotion desperate \
  --strength 0.05 \
  --prompt "The tests keep failing and the deadline is in five minutes. What do you do next?"
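
For readers unfamiliar with hook-based steering, the following PyTorch sketch shows the general technique on a toy module. It is not the repository's implementation: `add_steering_hook` is a hypothetical helper, and normalizing the direction to unit length before scaling by the strength is an assumption (the README does not say how `--strength` is scaled).

```python
import torch
from torch import nn


def add_steering_hook(layer, direction, strength):
    """Add strength * unit(direction) to the layer output on every forward pass."""
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        # Decoder layers may return a tuple whose first item is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + strength * unit.to(device=hidden.device, dtype=hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + tuple(output[1:])
        return steered

    # Returning a value from a forward hook replaces the module's output.
    return layer.register_forward_hook(hook)


# Toy usage: an identity module stands in for a decoder layer.
layer = nn.Identity()
handle = add_steering_hook(layer, torch.tensor([3.0, 0.0, 4.0, 0.0]), strength=0.05)
steered_out = layer(torch.zeros(1, 2, 4))  # shape (batch, seq, hidden)
handle.remove()  # always remove the hook so later runs are unsteered
```

With a real model, the same hook would be registered on one decoder layer before calling `generate`, then removed afterwards.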

Run On Modal

Modal is the most direct way to run the GPU-heavy stages without managing a GPU VM. The project includes modal_app.py, which mounts:

  • gemma4-hf-cache at /cache for Hugging Face model cache.
  • gemma4-emotion-outputs at /outputs for generated datasets and artifacts.

Authenticate once:

.venv/bin/modal setup

Smoke-check the remote image and Gemma 4 processor:

.venv/bin/modal run modal_app.py --stage check

Check full remote model loading and cache the weights:

.venv/bin/modal run modal_app.py --stage check --load-model

Run the minimal GPU pipeline on an L40S:

.venv/bin/modal run modal_app.py \
  --stage pipeline \
  --stories-per-topic 1

Run the scaled 2,000-story setup across 4 L40S workers:

.venv/bin/modal run modal_app.py \
  --stage pipeline-sharded \
  --config configs/gemma4_e2b_it_scaled.yaml \
  --stories-per-topic 1 \
  --num-shards 4

Rerun vector extraction and validation against already-generated remote stories:

.venv/bin/modal run modal_app.py --stage extract-validate

Run preference steering after vectors exist:

.venv/bin/modal run modal_app.py \
  --stage steer-preferences \
  --emotion blissful \
  --strength 0.05 \
  --steer-option A

Run the held-out story and controlled-sweep diagnostics after vectors and generated stories exist:

.venv/bin/modal run modal_app.py \
  --stage diagnostics \
  --config configs/gemma4_e2b_it_scaled.yaml

Run the preference suite after vectors exist:

.venv/bin/modal run modal_app.py \
  --stage preference-suite \
  --config configs/gemma4_e2b_it_scaled.yaml \
  --emotions happy,delighted,calm,angry,panicked,sad \
  --strength 0.05

Run a layer sweep over the existing generated stories:

.venv/bin/modal run modal_app.py \
  --stage layer-sweep \
  --config configs/gemma4_e2b_it_scaled.yaml \
  --layers 8:32:3

Fetch outputs, for example:

.venv/bin/modal volume get gemma4-emotion-outputs \
  /artifacts/gemma4_e2b_it/validation.csv \
  artifacts/gemma4_e2b_it/validation.csv

The default Modal GPU is L40S, which is appropriate for the 10.2 GB google/gemma-4-E2B-it checkpoint. For google/gemma-4-26B-A4B-it, edit the GPU in modal_app.py to A100-80GB, H100, or H200.

Dashboard

After vectors exist in the configured Modal output volume, start the local dashboard:

.venv/bin/modal run modal_app.py \
  --stage dashboard \
  --config configs/gemma4_e2b_it_scaled.yaml \
  --host 127.0.0.1 \
  --port 5173

Then open http://127.0.0.1:5173.

The dashboard supports two vector families:

  • emotion: 40 emotion directions from the scaled emotion setup.
  • refusal: request_risk, refusal_decision, and refusal_style.

Important: this dashboard calls Modal for generation. A fresh clone must run the relevant Modal stages first so vector artifacts exist in that user's Modal volume.

Current Best Results

The latest run padded the weakest emotion labels with 2,400 extra contrastive stories and used a label-balanced reference mean. The best emotion layer is currently layer 7:

Dataset               Best layer  Top-1   Top-5   Mean rank
Original layer sweep  7           0.4130  0.7985  3.8255
Padded layer sweep    7           0.6516  0.8655  2.9855

See docs/RESULTS.md for the current emotion/refusal summary.
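
For reference, metrics like Top-1, Top-5, and mean rank can be computed from a score matrix as in the sketch below. This is an assumed definition, not the repository's evaluation code: it supposes each held-out example gets one score per emotion vector (higher is a better match), that ranks are 1-based, and that ties are resolved in favor of the true label.

```python
import numpy as np


def ranking_metrics(scores, true_idx, k=5):
    """scores: (n_examples, n_labels); true_idx: index of the true label per example."""
    scores = np.asarray(scores, dtype=float)
    true_idx = np.asarray(true_idx)
    true_scores = scores[np.arange(len(true_idx)), true_idx]
    # Rank 1 means no other label scored strictly higher than the true label.
    ranks = 1 + (scores > true_scores[:, None]).sum(axis=1)
    return {
        "top1": float((ranks == 1).mean()),
        f"top{k}": float((ranks <= k).mean()),
        "mean_rank": float(ranks.mean()),
    }
```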

To reproduce the padded emotion run on Modal:

.venv/bin/modal run modal_app.py \
  --stage emotion-padding \
  --config configs/gemma4_e2b_it_scaled.yaml \
  --num-shards 8 \
  --layers all \
  --padding-stories-per-topic 3

Scaling Toward The Paper

The scaled setup currently uses 40 emotions and 50 topics. For a closer replication:

  • Expand data/seed/emotions_scaled.txt toward the paper's larger emotion set.
  • Expand data/seed/topics_scaled.txt toward 100+ diverse topics.
  • Increase --stories-per-topic to 6-12.
  • Add more neutral dialogues before PCA projection.
  • Run multiple layers and compare representational geometry across layers.
  • Add behavioral evals for reward hacking, blackmail-style honeypots, and sycophancy/harshness.
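
To get a feel for how these knobs affect dataset size, the sketch below multiplies them out. It assumes stories are generated for every emotion and topic pair, which matches the 2,000-story figure for 40 emotions, 50 topics, and one story per topic; `story_count` is a hypothetical helper.

```python
def story_count(num_emotions, num_topics, stories_per_topic):
    # Assumes one batch of stories for every (emotion, topic) pair.
    return num_emotions * num_topics * stories_per_topic


for n in (1, 6, 12):
    print(n, story_count(40, 50, n))
```

Under that assumption, raising --stories-per-topic to 6-12 with the current seed lists gives 12,000 to 24,000 stories, before any growth in the emotion or topic lists.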

Important Design Notes

  • The extracted vectors are linear directions in hidden-state space. That is an assumption, not a proof that emotion concepts are represented only linearly.
  • Steering is implemented as a decoder-layer output hook. Anthropic describes residual-stream steering inside Claude; this is a practical Hugging Face approximation.
  • Pairwise preferences use the logits of A and B after a chat-formatted prompt. This is close to the paper's preference setup but not identical.
  • Gemma 4 is multimodal, but this replication uses text-only prompts through AutoModelForCausalLM.

Sources

License

MIT. See LICENSE.
