
LLM Activation Steering From Scratch

I trained a 126M-parameter TinyStories language model from scratch, then built activation steering tools that let me shift its generations toward a playful style from inside the model rather than by changing the prompt.

The idea was to see if I could find a direction in the model's hidden states that controlled tone, then turn that direction up during generation and watch the same prompt produce a different kind of story. I plan to use this as a starting point for alignment research because it gives me a way to inspect where behavior lives inside a model. If I can find directions for playful vs serious, the next step is asking whether similar internal directions exist for honesty, refusal, uncertainty, sycophancy, or harmful intent, and whether we can steer those behaviors in a controlled way during generation.
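The extraction itself is a difference of means over contrastive prompt sets. The sketch below is a minimal PyTorch version of that idea, not this repo's actual API: the model and tokenizer interfaces, and the assumption that model.blocks[layer] outputs the residual stream, are all illustrative.

import torch

@torch.no_grad()
def layer_activation(model, tokenizer, prompt, layer):
    # Capture one block's output with a forward hook. Assumes the block
    # returns the residual stream as a single tensor of shape
    # (1, seq_len, hidden_size); the real module names may differ.
    captured = {}
    def hook(module, inputs, output):
        captured["h"] = output
    handle = model.blocks[layer].register_forward_hook(hook)
    model(torch.tensor([tokenizer.encode(prompt)]))
    handle.remove()
    return captured["h"][0].mean(dim=0)  # average over token positions

@torch.no_grad()
def build_steering_vector(model, tokenizer, playful, serious, layer):
    # Difference of mean activations between the two prompt sets.
    pos = torch.stack([layer_activation(model, tokenizer, p, layer) for p in playful])
    neg = torch.stack([layer_activation(model, tokenizer, p, layer) for p in serious])
    return pos.mean(dim=0) - neg.mean(dim=0)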

Open the Colab demo

What This Is

This repo contains the code for an end-to-end LLM training and activation steering system:

  • byte-level BPE tokenizer
  • decoder-only Transformer blocks
  • multi-head self-attention
  • RoPE positional embeddings
  • SwiGLU MLP (sketched after this list)
  • RMSNorm (sketched after this list)
  • training and checkpointing loop
  • text generation loop
  • activation vector extraction
  • activation steering sweeps and reports
  • SLURM scripts for GPU/HPC runs
  • Colab notebook for an interactive steering demo
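Two of the pieces above are less standard than the rest. For reference, here is a minimal generic PyTorch sketch of RMSNorm and the SwiGLU MLP under their usual definitions; the code in src/ may differ in details.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    # Rescale by the root-mean-square of the features: no mean subtraction,
    # no bias, just a learned per-dimension gain.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    # Gated MLP: silu(gate(x)) * up(x), projected back down to the model dim.
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))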

The current demo model is a TinyStories Transformer with roughly 126M parameters.

Setting                  Value
Layers                   12
Attention heads          12
Hidden size              768
MLP size                 3072
Vocabulary               8192 tokens
Context length           512 tokens
Parameters               125,848,320
Demo checkpoint          checkpoints/tinystories_125m_full_ctx512_continue/ckpt.pt
Checkpoint iteration     129,999
Stable validation loss   1.0659 over 300 eval batches
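The parameter count is exactly reproducible from the table if you assume bias-free linear layers, untied input and output embeddings, a three-matrix SwiGLU MLP, and two RMSNorms per block plus one final norm:

d, mlp, vocab, layers = 768, 3072, 8192, 12

attn = 4 * d * d        # Q, K, V, O projections
swiglu = 3 * d * mlp    # gate, up, and down matrices
norms = 2 * d           # pre-attention and pre-MLP RMSNorm gains
per_layer = attn + swiglu + norms               # 9,438,720

total = layers * per_layer + 2 * vocab * d + d  # embeddings + lm_head + final norm
print(f"{total:,}")                             # 125,848,320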

The model was trained from random initialization, then continued from the best 125M checkpoint with a lower learning rate. The Colab demo loads an exported inference bundle from Google Drive and lets you change the prompt, steering layer, alpha, and steering position.

Current Demo

The current default steering settings are below.

emotion = playful
layer = 3
alpha = 10
position = all
prompt = Once upon a time there was a little robot

Baseline generation stays closer to the original TinyStories continuation. The steered generation shifts toward a more playful continuation while the prompt stays the same. Higher-alpha settings can force more playful keywords, but this default is a cleaner tradeoff than the repetitive settings at the top of the raw keyword sweep.

The steering vector is added directly inside the model during generation.
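Concretely, "added directly inside the model" can be done with a forward hook that offsets one block's output by alpha times the steering vector. The sketch below is illustrative rather than the repo's actual wiring (scripts/06_steer_generation.py handles this for real), and again assumes model.blocks[layer] returns the residual stream.

def add_steering_hook(model, vector, layer, alpha, position="all"):
    # position="all" steers every token position; position="last" steers
    # only the newest token, which matters during incremental decoding.
    def hook(module, inputs, output):
        if position == "all":
            return output + alpha * vector
        out = output.clone()
        out[:, -1, :] += alpha * vector
        return out
    return model.blocks[layer].register_forward_hook(hook)

# Usage sketch: install the hook, generate, then remove it.
# handle = add_steering_hook(model, playful_vec, layer=3, alpha=10.0)
# steered = generate(model, tokenizer, "Once upon a time there was a little robot")
# handle.remove()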

What To Expect From The Demo

The default playful setting keeps the same prompt and shifts the continuation toward a more playful story.

[Image: playful steering demo output]

Google Colab Demo

GitHub stores the code and notebook. Google Drive stores the exported model bundle.

Expected Drive layout

MyDrive/llm-activation-colab/playful_125m_continue_direct_ctx512/
  model.pt
  tokenizer.json
  vectors.pt
  manifest.json

Open the notebook

notebooks/playful_steering_colab.ipynb

Use a GPU runtime in Colab

Runtime -> Change runtime type -> T4 GPU

The notebook clones this repo from

https://github.com/devinnicholson/llm-activation.git

Then it loads the Drive bundle and runs baseline and steered generation.
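The notebook handles the Drive side for you; for reference, it is just the standard Colab mount, with the bundle path matching the expected layout above.

from google.colab import drive

# Make the exported bundle visible to the notebook.
drive.mount("/content/drive")

bundle = "/content/drive/MyDrive/llm-activation-colab/playful_125m_continue_direct_ctx512"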

Reproducing The Main Pipeline

Install locally

python3.11 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Run the tiny smoke path

python scripts/00_train_tokenizer.py --config configs/tiny.yaml
python scripts/01_prepare_dataset.py --config configs/tiny.yaml
python scripts/02_train.py --config configs/tiny.yaml
python scripts/03_generate.py --config configs/tiny.yaml --prompt "Once upon a time"

Cluster scripts live in slurm/. The main ctx512 training configs are below.

configs/tinystories_125m_full_ctx512.yaml
configs/tinystories_125m_full_ctx512_continue.yaml

Build playful and serious steering vectors

python scripts/05_build_emotion_vectors.py \
  --config configs/tinystories_125m_full_ctx512_continue.yaml \
  --checkpoint checkpoints/tinystories_125m_full_ctx512_continue/ckpt.pt \
  --prompt-bank prompt_banks/playful_vs_serious_direct.yaml \
  --output benchmarks/results/playful_direct_vectors_125m_continue_ctx512.pt
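The internal layout of the vectors file isn't documented here. If it is an ordinary torch-saved object, a quick inspection along these lines will show what scripts/05_build_emotion_vectors.py actually writes (the dict assumption is a guess):

import torch

vectors = torch.load(
    "benchmarks/results/playful_direct_vectors_125m_continue_ctx512.pt",
    map_location="cpu",
)
# A dict of per-emotion or per-layer tensors is typical for this kind of
# artifact, but the exact keys depend on the export script.
if isinstance(vectors, dict):
    for key, value in vectors.items():
        print(key, tuple(value.shape) if torch.is_tensor(value) else type(value))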

Run steering

python scripts/06_steer_generation.py \
  --config configs/tinystories_125m_full_ctx512_continue.yaml \
  --checkpoint checkpoints/tinystories_125m_full_ctx512_continue/ckpt.pt \
  --vectors benchmarks/results/playful_direct_vectors_125m_continue_ctx512.pt \
  --emotion playful \
  --layer 3 \
  --alpha 10 \
  --position all \
  --prompt "Once upon a time there was a little robot"

Run the refined steering sweep used to avoid repetitive keyword-gaming outputs:

sbatch slurm/playful_refined_pipeline_125m_continue_ctx512.sbatch
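Without SLURM, the shape of a sweep is just a grid over layers and alphas driven through the same CLI as above; the grid values here are illustrative, not the refined pipeline's actual ones.

import itertools
import subprocess

layers = [2, 3, 4, 6]        # illustrative grid, not the pipeline's
alphas = [4, 6, 8, 10, 12]

for layer, alpha in itertools.product(layers, alphas):
    # Reuse the CLI above so the sweep exercises the same code path.
    subprocess.run([
        "python", "scripts/06_steer_generation.py",
        "--config", "configs/tinystories_125m_full_ctx512_continue.yaml",
        "--checkpoint", "checkpoints/tinystories_125m_full_ctx512_continue/ckpt.pt",
        "--vectors", "benchmarks/results/playful_direct_vectors_125m_continue_ctx512.pt",
        "--emotion", "playful",
        "--layer", str(layer),
        "--alpha", str(alpha),
        "--position", "all",
        "--prompt", "Once upon a time there was a little robot",
    ], check=True)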

Repository Layout

configs/        model and training configs
scripts/        tokenizer, data prep, training, generation, steering, export
src/            project-owned Python package
native/         Rust/PyO3 tokenizer backend
prompt_banks/   contrastive prompts for activation vectors
slurm/          cluster job scripts
notebooks/      Colab demo
tests/          smoke and correctness tests
