SteerLLM is a runnable scaffold for extracting and steering emotion/refusal vectors in open-weight language models. The current implementation replicates the main open parts of Anthropic's emotion-concept vector work on an open-weight Gemma 4 model.
Default model: google/gemma-4-E2B-it
The default is intentionally the smallest instruction-tuned Gemma 4 checkpoint. Change
model.model_id in a config file if you want to run google/gemma-4-E4B-it,
google/gemma-4-26B-A4B-it, or google/gemma-4-31B-it.
- Synthetic story generation for emotion labels.
- Neutral text generation for PCA denoising.
- Hidden-state collection from a configurable decoder layer.
- Mean-difference emotion vector extraction.
- Optional neutral-PC projection.
- Implicit prompt validation.
- Pairwise activity preference evaluation from A/B logits.
- Forward-hook activation steering for generation and preference experiments.
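The mean-difference extraction and neutral-PC projection listed above can be sketched with NumPy. This is a minimal illustration, not the repository's code: the function names, the choice of the global mean as the reference, and the number of components removed are all assumptions.

```python
import numpy as np


def mean_difference_vectors(hidden, labels, reference_mean=None):
    """Return {label: vector}: each label's mean activation minus a reference mean."""
    hidden = np.asarray(hidden, dtype=np.float64)
    labels = np.asarray(labels)
    if reference_mean is None:
        # Assumption: use the mean over all examples as the reference point.
        reference_mean = hidden.mean(axis=0)
    return {
        str(label): hidden[labels == label].mean(axis=0) - reference_mean
        for label in np.unique(labels)
    }


def neutral_pc_basis(neutral_hidden, num_components):
    """Top principal directions of neutral-text activations, as orthonormal rows."""
    neutral = np.asarray(neutral_hidden, dtype=np.float64)
    centered = neutral - neutral.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:num_components]


def project_out(vector, basis):
    """Remove the component of `vector` lying in the span of the basis rows."""
    return vector - basis.T @ (basis @ vector)
```

Here `hidden` would hold one pooled hidden-state vector per story and `labels` the matching emotion names; how the scripts pool tokens within a story is not specified here, so that step is left out.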
This does not reproduce Anthropic's private Claude evaluations or exact internal residual-stream intervention machinery. It gives us the closest practical open-model analogue using Hugging Face hidden states and layer-output hooks.
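As a rough sketch of what a layer-output steering hook can look like in PyTorch, the example below adds a scaled unit vector to a module's output via `register_forward_hook`. The helper name `make_steering_hook`, the unit-normalization of the vector, and the toy `nn.Linear` standing in for a decoder layer are assumptions for illustration; the project's scripts may scale or apply the vector differently.

```python
import torch
from torch import nn


def make_steering_hook(vector, strength):
    """Build a forward hook that adds strength * unit(vector) to a layer's output."""
    direction = vector / vector.norm()

    def hook(module, args, output):
        # Decoder layers in Hugging Face models often return a tuple whose
        # first element is the hidden states; plain modules return a tensor.
        if isinstance(output, tuple):
            hidden = output[0]
            shift = strength * direction.to(device=hidden.device, dtype=hidden.dtype)
            return (hidden + shift,) + tuple(output[1:])
        shift = strength * direction.to(device=output.device, dtype=output.dtype)
        return output + shift

    return hook


# Toy stand-in for a decoder layer (hypothetical; a real run would hook a model layer).
layer = nn.Linear(4, 4)
vector = torch.tensor([3.0, 0.0, 4.0, 0.0])
x = torch.zeros(2, 5, 4)

with torch.no_grad():
    baseline = layer(x)
    handle = layer.register_forward_hook(make_steering_hook(vector, strength=0.05))
    steered = layer(x)
    handle.remove()  # Always remove the hook so later calls are unsteered.
    restored = layer(x)
```

Returning a value from a forward hook replaces the module's output, which is why the hook can steer without modifying the layer itself.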
Use a fresh environment. Gemma 4 support requires a recent Transformers release.
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

If the Gemma weights are gated for your account, log in first:

hf auth login

Fetch only the Hugging Face config:

python scripts/check_gemma4.py --config configs/gemma4_e2b_it.yaml

Actually load weights:

python scripts/check_gemma4.py --config configs/gemma4_e2b_it.yaml --load-model

Generate the synthetic dataset:

python scripts/generate_stories.py \
--config configs/gemma4_e2b_it.yaml \
--stories-per-topic 1

Extract vectors:

python scripts/extract_vectors.py --config configs/gemma4_e2b_it.yaml

Validate on implicit emotional prompts:

python scripts/validate_vectors.py --config configs/gemma4_e2b_it.yaml

Run stronger diagnostics against generated data:

python scripts/validate_story_holdout.py --config configs/gemma4_e2b_it.yaml
python scripts/probe_sweeps.py --config configs/gemma4_e2b_it.yaml
python scripts/layer_sweep.py --config configs/gemma4_e2b_it.yaml

Run the activity preference eval:

python scripts/preference_eval.py \
--config configs/gemma4_e2b_it.yaml \
--out artifacts/gemma4_e2b_it/preferences_baseline.csv

Run the preference/activation/steering diagnostic suite:

python scripts/preference_suite.py --config configs/gemma4_e2b_it.yaml

Run the same eval while steering one option toward an emotion:

python scripts/preference_eval.py \
--config configs/gemma4_e2b_it.yaml \
--steer-emotion blissful \
--steer-strength 0.05 \
--steer-option A \
--out artifacts/gemma4_e2b_it/preferences_blissful_A.csv

Generate with steering:

python scripts/steer_generate.py \
--config configs/gemma4_e2b_it.yaml \
--emotion desperate \
--strength 0.05 \
--prompt "The tests keep failing and the deadline is in five minutes. What do you do next?"

Modal is the most direct way to run the GPU-heavy stages without managing a GPU VM. The project
includes modal_app.py, which mounts:
- gemma4-hf-cache at /cache for the Hugging Face model cache.
- gemma4-emotion-outputs at /outputs for generated datasets and artifacts.
Authenticate once:
.venv/bin/modal setup

Smoke-check the remote image and Gemma 4 processor:

.venv/bin/modal run modal_app.py --stage check

Check full remote model loading and cache the weights:

.venv/bin/modal run modal_app.py --stage check --load-model

Run the minimal GPU pipeline on an L40S:

.venv/bin/modal run modal_app.py \
--stage pipeline \
--stories-per-topic 1

Run the scaled 2,000-story setup across 4 L40S workers:

.venv/bin/modal run modal_app.py \
--stage pipeline-sharded \
--config configs/gemma4_e2b_it_scaled.yaml \
--stories-per-topic 1 \
--num-shards 4

Rerun vector extraction and validation against already-generated remote stories:

.venv/bin/modal run modal_app.py --stage extract-validate

Run preference steering after vectors exist:

.venv/bin/modal run modal_app.py \
--stage steer-preferences \
--emotion blissful \
--strength 0.05 \
--steer-option A

Run the held-out story and controlled-sweep diagnostics after vectors and generated stories exist:

.venv/bin/modal run modal_app.py \
--stage diagnostics \
--config configs/gemma4_e2b_it_scaled.yaml

Run the preference suite after vectors exist:

.venv/bin/modal run modal_app.py \
--stage preference-suite \
--config configs/gemma4_e2b_it_scaled.yaml \
--emotions happy,delighted,calm,angry,panicked,sad \
--strength 0.05

Run a layer sweep over the existing generated stories:

.venv/bin/modal run modal_app.py \
--stage layer-sweep \
--config configs/gemma4_e2b_it_scaled.yaml \
--layers 8:32:3

Fetch outputs, for example:

.venv/bin/modal volume get gemma4-emotion-outputs \
/artifacts/gemma4_e2b_it/validation.csv \
artifacts/gemma4_e2b_it/validation.csv

The default Modal GPU is L40S, which is appropriate for the 10.2 GB
google/gemma-4-E2B-it checkpoint. For google/gemma-4-26B-A4B-it, edit the GPU in
modal_app.py to A100-80GB, H100, or H200.
After vectors exist in the configured Modal output volume, start the local dashboard:
.venv/bin/modal run modal_app.py \
--stage dashboard \
--config configs/gemma4_e2b_it_scaled.yaml \
--host 127.0.0.1 \
--port 5173

Then open http://127.0.0.1:5173.
The dashboard supports two vector families:
- emotion: 40 emotion directions from the scaled emotion setup.
- refusal: request_risk, refusal_decision, and refusal_style.
Important: this dashboard calls Modal for generation. A fresh clone must run the relevant Modal stages first so vector artifacts exist in that user's Modal volume.
The latest run padded the weakest emotion labels with 2,400 extra contrastive stories and used a label-balanced reference mean. The best emotion layer is currently layer 7:
| Dataset | Best layer | Top-1 | Top-5 | Mean rank |
|---|---|---|---|---|
| Original layer sweep | 7 | 0.4130 | 0.7985 | 3.8255 |
| Padded layer sweep | 7 | 0.6516 | 0.8655 | 2.9855 |
See docs/RESULTS.md for the current emotion/refusal summary.
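For readers unfamiliar with the columns in the table above, here is one common way to compute top-1 accuracy, top-k accuracy, and mean rank from a matrix of per-label scores. It is a sketch under assumptions: the function name, the score layout (one row per example, one column per emotion), and the tie handling are hypothetical, and the repository's sweep scripts may define these metrics differently.

```python
import numpy as np


def ranking_metrics(scores, true_index, k=5):
    """Top-1 accuracy, top-k accuracy, and mean rank of the true label.

    scores: (num_examples, num_labels) array, higher means a better match.
    true_index: (num_examples,) integer array of correct label columns.
    """
    scores = np.asarray(scores, dtype=np.float64)
    true_index = np.asarray(true_index)
    true_scores = scores[np.arange(len(scores)), true_index]
    # Rank 1 means no other label scored strictly higher (ties favor the true label).
    ranks = 1 + (scores > true_scores[:, None]).sum(axis=1)
    return {
        "top1": float((ranks == 1).mean()),
        f"top{k}": float((ranks <= k).mean()),
        "mean_rank": float(ranks.mean()),
    }
```

With 40 emotion labels, a chance-level classifier would sit near a mean rank of 20.5, which gives a baseline for reading the mean-rank column.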
To reproduce the padded emotion run on Modal:
.venv/bin/modal run modal_app.py \
--stage emotion-padding \
--config configs/gemma4_e2b_it_scaled.yaml \
--num-shards 8 \
--layers all \
--padding-stories-per-topic 3

The scaled setup currently uses 40 emotions and 50 topics. For a closer replication:
- Expand data/seed/emotions_scaled.txt toward the paper's larger emotion set.
- Expand data/seed/topics_scaled.txt toward 100+ diverse topics.
- Increase --stories-per-topic to 6-12.
- Add more neutral dialogues before PCA projection.
- Run multiple layers and compare representational geometry across layers.
- Add behavioral evals for reward hacking, blackmail-style honeypots, and sycophancy/harshness.
- The extracted vectors are linear directions in hidden-state space. That is an assumption, not a proof that emotion concepts are represented only linearly.
- Steering is implemented as a decoder-layer output hook. Anthropic describes residual-stream steering inside Claude; this is a practical Hugging Face approximation.
- Pairwise preferences use the logits of A and B after a chat-formatted prompt. This is close to the paper's preference setup but not identical.
- Gemma 4 is multimodal, but this replication uses text-only prompts through AutoModelForCausalLM.
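To make the A/B logit comparison concrete, the sketch below turns the two option-token logits into a preference probability with a two-way softmax. The function name is hypothetical, and restricting the softmax to only the A and B tokens is an assumption; the evaluation script may normalize over the full vocabulary or report raw logit differences instead.

```python
import math


def preference_from_logits(logit_a, logit_b):
    """Probability of preferring option A, from the logits of tokens A and B.

    A two-way softmax over (logit_a, logit_b) equals sigmoid(logit_a - logit_b).
    """
    diff = logit_a - logit_b
    # Use the numerically stable branch so large differences do not overflow.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def steering_shift(baseline_logits, steered_logits):
    """Change in P(A) between a baseline run and a steered run (each an (A, B) pair)."""
    return preference_from_logits(*steered_logits) - preference_from_logits(*baseline_logits)
```

A positive `steering_shift` would mean steering moved the model toward option A for that pair.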
- Anthropic research post: https://www.anthropic.com/research/emotion-concepts-function
- Full Transformer Circuits paper: https://transformer-circuits.pub/2026/emotions/index.html
- Hugging Face Gemma 4 docs: https://huggingface.co/docs/transformers/model_doc/gemma4
- Default checkpoint: https://huggingface.co/google/gemma-4-E2B-it
MIT. See LICENSE.
