Tiny Think
Reasoning-first post-training for tiny language models (140M) on a single GPU.
Tiny Think is the official research codebase for:
Tiny Think: Reasoning-First Post-Training for Tiny Math and STEM Language Models
This repository studies a simple question:
What does post-training actually do to reasoning in very small language models under strict hardware constraints?
The project is intentionally:
- minimal
- reproducible
- runnable on a single consumer GPU
Tiny models are attractive because they are cheap, fast, and practical to run locally. But once you push reasoning-style post-training into the 140M regime, the behavior becomes less obvious:
- supervised fine-tuning can improve math behavior,
- preference optimization can further improve task-specific math accuracy,
- but those gains can come with regressions in broader reasoning and instruction following.
This repo exists to make that tradeoff concrete, reproducible, and easy to inspect.
The main result is not just that post-training helps math, but that it can also introduce a measurable general reasoning tax.
| Checkpoint | GSM8K | BBH | IFEval | Interpretation |
|---|---|---|---|---|
| SFT (NLL; epoch 2) | 8.04 | 23.84 | 21.63 | Strongest balanced SFT checkpoint |
| DPO (beta=1, lr=3e-6) | 9.40 | 13.18 | 16.45 | Best GSM8K, but broad reasoning regresses |
| APO-zero (beta=0.5, lr=3e-6) | 8.26 | 12.01 | 16.08 | Similar tradeoff pattern under APO |
In other words:
- full fine-tuning at `140M` is enough to produce non-trivial math reasoning,
- preference optimization behaves more like calibration than free capability gain,
- math-only evaluation would miss important regressions.
- Model collection: Hugging Face Collection
- Main SFT config: `configs/sft/math_stem_nll_bf16.yaml`
- Main DPO config: `configs/dpo/math_stem_dpo_beta1_lr3e_6_e1_bs8.yaml`
- Main APO config: `configs/dpo/math_stem_apo_zero_beta0_5_lr3e_6_e1_bs8.yaml`
- Evaluation entrypoint: `eval/run_eval_vllm_multi.sh`
These constraints are deliberate:
- Single machine only
- Single GPU (`RTX 5060 Ti`, `16 GB` VRAM)
- No distributed training
- No DeepSpeed / FSDP
- No LoRA / PEFT
- Full fine-tuning only
- Base model fixed to `facebook/MobileLLM-R1-140M-base`
This is not a generic training framework. It is a controlled research repo for studying tiny-model post-training under tight resource limits.
Tiny Think uses a simple two-stage post-training recipe:
Stage A - Supervised Fine-Tuning (SFT)
- math + STEM data with explicit `<think>` traces
- about `60M` tokens
- `NLL` or `DFT` objectives
Stage B - Preference Optimization
- math/STEM preference pairs
- about `10M` tokens
- `DPO` or `APO-zero`
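For intuition, the Stage A objective is plain token-level NLL over the completion, with prompt tokens masked out of the loss. The sketch below is illustrative only (toy probabilities, stdlib Python), not the repo's actual training code in `train/sft.py`:

```python
import math

def masked_nll(token_logps, loss_mask):
    """Mean negative log-likelihood over completion tokens only.

    token_logps: per-position log p(token | context) for the whole sequence.
    loss_mask:   1 for completion tokens (the <think> trace + answer),
                 0 for prompt tokens, which are excluded from the loss.
    """
    total = sum(-lp * m for lp, m in zip(token_logps, loss_mask))
    return total / sum(loss_mask)

# Toy example: 3 prompt tokens (masked out) + 3 completion tokens.
logps = [math.log(p) for p in (0.9, 0.8, 0.7, 0.5, 0.25, 0.5)]
mask = [0, 0, 0, 1, 1, 1]
loss = masked_nll(logps, mask)  # averages -log p over the last three tokens
```

Masking matters at this scale: with only ~60M tokens, spending gradient signal on prompt tokens would dilute the reasoning-trace supervision.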
Stage B improves solution selection, but it can also narrow behavior and hurt broader reasoning.
| Stage | Goal | Data Budget | Objective |
|---|---|---|---|
| Stage A: SFT | Teach structured reasoning traces in math/STEM | 60M tokens | NLL or DFT |
| Stage B: Preference | Calibrate solution selection | 10M tokens | DPO or APO-zero |
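The Stage B objective for the DPO runs is the standard preference loss: push the policy's chosen-vs-rejected log-probability margin (relative to a frozen reference model) through a sigmoid. A minimal scalar sketch, illustrative only and not the repo's training entrypoint (`train/dpo.py`):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=1.0):
    """Standard DPO loss: -log sigmoid(beta * margin), where margin is the
    difference between the chosen and rejected policy/reference log-ratios."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log(1.0 + math.exp(-margin))  # = -log sigmoid(margin)

# If the policy already separates chosen from rejected more than the
# reference does, the loss falls below log(2), its value at zero margin.
loss = dpo_loss(policy_chosen_logp=-10.0, policy_rejected_logp=-14.0,
                ref_chosen_logp=-12.0, ref_rejected_logp=-12.0, beta=1.0)
```

Note that nothing in this loss rewards breadth: it only reshapes relative preferences on the math/STEM pairs, which is consistent with the observed BBH/IFEval regressions.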
```
assets/    # logo
configs/   # experiment configs
  sft/
  dpo/
data/      # dataset download / preparation utilities
  sources/
train/     # SFT and preference-optimization training entrypoints
eval/      # vLLM + lm-eval evaluation entrypoints
```
This repository uses Python 3.12 with `uv` and expects a local `.venv`.
```bash
if [ -d ".venv" ]; then
  source .venv/bin/activate
else
  uv venv .venv --python=3.12 --seed
  source .venv/bin/activate
fi
```

Then install the dependencies:

```bash
uv pip install "lm-eval[api]"
uv pip install langdetect immutabledict
uv pip install sympy math_verify antlr4-python3-runtime==4.11
uv pip install -U vllm --torch-backend=cu128
uv pip install trl
uv pip install liger-kernel
uv pip install kernels
uv pip install wandb
```

The datasets are derived from allenai/Dolci-Think-SFT-7B and allenai/Dolci-Think-DPO-7B. Repository utilities for source inspection and dataset preparation live under `data/`.
Examples:
```bash
python data/download_dolci_think_sft.py
python data/download_dolci_think_dpo.py
```

Run SFT training:

```bash
python train/sft.py --config-path configs/sft/math_stem_nll_bf16.yaml
```

Run preference optimization (DPO):

```bash
python train/dpo.py --config-path configs/dpo/math_stem_dpo_beta1_lr3e_6_e1_bs8.yaml
```

Full evaluation sweep:

```bash
./eval/run_eval_vllm_multi.sh
```

Paper-style math evaluation:

```bash
MODE=math_eval MODEL_ID=Shekswess/tiny-think-dpo-math-stem-dpo-beta1-lr3e-6-e1-bs8 ./eval/run_eval_vllm_multi.sh
```

Evaluation uses:
- vLLM for inference
- lm-eval for benchmark execution
Benchmarks used include:
- `GSM8K`
- `MATH500`
- `BBH`
- `IFEval`
- STEM-oriented tasks such as `MMLU-STEM`, `ARC-Challenge`, `OpenBookQA`, `GPQA`, and `PIQA`
The evaluation scripts are designed to follow the same reasoning-oriented setup used in MobileLLM-R1.
This repo is:
- research code for controlled tiny-model post-training experiments
- optimized for local reproducibility on one consumer GPU
- focused on understanding tradeoffs, not just maximizing a single benchmark
This repo is not:
- a production system
- a general-purpose chatbot stack
- a distributed training framework
- a PEFT / LoRA benchmark suite
If you use this repository, please cite:
```bibtex
@article{jakimovski2026tinythink,
  title={Tiny Think: Reasoning-First Post-Training for Tiny Math and STEM Language Models},
  author={Jakimovski, Bojan and Ilijoski, Bojan},
  year={2026}
}
```

Apache-2.0. See LICENSE.
