rmsnorm
Here are 19 public repositories matching this topic...
Efficient kernel for RMS normalization with fused operations. Includes both forward and backward passes and is compatible with PyTorch.
Updated Jun 5, 2024 - Python
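For readers new to the topic, the forward pass that these repositories implement can be sketched in a few lines. This is a minimal pure-Python reference, not code from any repository listed here; the function name `rms_norm`, the `eps` default, and the sample values are assumptions chosen for illustration.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale a vector by the inverse of its root mean square, then apply a per-element gain."""
    mean_sq = sum(v * v for v in x) / len(x)   # mean of squares (no mean subtraction, unlike LayerNorm)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)   # eps guards against division by zero
    return [v * inv_rms * w for v, w in zip(x, weight)]

x = [1.0, 2.0, 3.0, 4.0]
y = rms_norm(x, weight=[1.0] * 4)
```

With a unit gain, the output vector has a root mean square of approximately 1, which is the property the fused GPU kernels preserve while avoiding extra memory round trips.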
Simple, easy-to-understand PyTorch implementation of the GPT and LLaMA large language models (LLMs) from scratch, with detailed steps. Implemented: Byte-Pair Tokenizer, Rotary Positional Embedding (RoPE), SwiGLU, RMSNorm, Mixture of Experts (MoE). Tested on a Taylor Swift song lyrics dataset.
Updated Nov 18, 2024 - Python
Fused Triton kernels for Transformer inference: RMSNorm+RoPE, Gated MLP, FP8 GEMM. Includes CPU-testable references, autotuning, and benchmarking.
Updated May 25, 2026 - Python
249M-parameter MoE transformer built from scratch in PyTorch. GQA, RoPE, SwiGLU, sparse MoE with 3 auxiliary losses, and an AMP training loop with no Trainer abstractions. The architecture mirrors LLaMA/Mistral/Mixtral design decisions and is fully inspectable.
Updated May 22, 2026 - Jupyter Notebook
A 36M-parameter goldfish language model with a 10-second memory + pixel-art PWA desk pet. Runs in your browser, fully offline. Adopt it at den-sec.github.io/glublm/desk-pet/
Updated May 29, 2026 - JavaScript
LLM pretraining from scratch on the FineWeb dataset (architecture and all components explained), plus efficient GPU use on a SLURM cluster.
Updated May 12, 2026 - Python
Simple character level Transformer
Updated May 27, 2024 - Jupyter Notebook
Nano versions of generative models, for fun. No SOTA here, nano first.
Updated Jul 27, 2025 - Jupyter Notebook
Optimized fused RMSNorm implementation in CUDA. Features vectorized memory access (float4), warp-level reductions, and an efficient backward pass for LLM training.
Updated Dec 24, 2025 - Python
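The backward pass mentioned in several of these descriptions follows from differentiating the normalization. The sketch below is a pure-Python reference derived by hand, not the CUDA implementation from the repository above; `rms_norm_backward` and its argument names are hypothetical, and it assumes the forward pass computes `y[i] = weight[i] * x[i] / sqrt(mean(x**2) + eps)`.

```python
import math

def rms_norm_backward(x, weight, grad_out, eps=1e-6):
    """Return (grad_x, grad_weight) for one vector, given the upstream gradient grad_out."""
    n = len(x)
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / n + eps)
    gw = [g * w for g, w in zip(grad_out, weight)]   # upstream gradient scaled by the gain
    dot = sum(a * v for a, v in zip(gw, x))          # the single reduction the backward pass needs
    grad_x = [inv_rms * a - v * inv_rms ** 3 * dot / n for a, v in zip(gw, x)]
    grad_w = [g * v * inv_rms for g, v in zip(grad_out, x)]
    return grad_x, grad_w

gx, gw_ = rms_norm_backward([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [1.0, 0.0, 0.0])
```

The `dot` term is a sum over the whole vector, which is why GPU implementations typically rely on a reduction primitive for it. For a batch, `grad_w` would additionally be summed over all rows.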
An unofficial implementation of Qwen 3.5. I couldn't find a paper or any code for it, so I decided to implement it just for fun.
Updated Mar 11, 2026 - Python
Production-grade Triton kernel fusing residual add + RMSNorm + packed QKV projection into a single GPU launch for decoder-only transformer inference (Llama-3, Mistral, Qwen2). +2.4% tok/s, -1.5 GB VRAM on A10G.
Updated Apr 22, 2026 - Python
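As a rough illustration of what "residual add + RMSNorm" computes before any kernel fusion, here is an unfused pure-Python reference for a single hidden vector. It is only a sketch: `add_rms_norm` is a hypothetical name, the packed QKV projection is omitted, and nothing here is taken from the repository above.

```python
import math

def add_rms_norm(x, residual, weight, eps=1e-6):
    """Add the residual stream to x, then RMS-normalize the sum.

    Returns (normalized, new_residual); the un-normalized sum is carried
    forward as the residual for the next block.
    """
    h = [a + b for a, b in zip(x, residual)]
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in h) / len(h) + eps)
    normed = [v * inv_rms * w for v, w in zip(h, weight)]
    return normed, h

normed, new_residual = add_rms_norm([1.0, -1.0], [2.0, 3.0], [1.0, 1.0])
```

A fused kernel performs these steps in one launch so the intermediate sum `h` does not need an extra round trip through GPU memory between the add and the normalization.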
A from-scratch PyTorch LLM implementing Sparse Mixture-of-Experts (MoE) with Top-2 gating. Integrates modern Llama-3 components (RMSNorm, SwiGLU, RoPE, GQA) and a custom-coded Byte-Level BPE tokenizer. Pre-trained on a curated corpus of existential & dark philosophical literature.
Updated Jan 7, 2026 - Python
Build an LLM in PyTorch: BPE tokenizer, GPT-1/2 + LLaMA, end-to-end train/infer
Updated Mar 15, 2026 - Python
A transformer language model built from scratch, from byte-level BPE tokenization through pretraining and QA fine-tuning.
Updated May 25, 2026 - Python
CUDA kernels for LLM decode-stage inference, built as a PyTorch extension with correctness tests and latency benchmarks.
Updated May 21, 2026 - Python
🚀 Build your own LLM easily with OpenLabLM, a lightweight, hackable codebase tailored for hobbyists using a single consumer GPU.
Updated Jun 1, 2026 - Python