Day 1 of the workshop. June 1st, 2026.
The slides are in Slides/ (PDF and PPTX). Answer keys for each exercise are in Exercises/answers/. Data files used in the exercises are in Exercises/data/.
You need a terminal.
- Mac: just open Terminal.
- Linux: same, open your terminal.
- Windows: install WSL Ubuntu and open it. From PowerShell:
wsl --install -d Ubuntu
wsl -l -v
wsl -d Ubuntu
Then run the setup script from the repo:
bash setup.sh
That installs wget, sra-toolkit, and the small text utilities we use (awk, grep, sed, tar, gzip, curl). It creates ~/workshop/data and prints a check at the end. Safe to re-run.
If it fails on your machine, do not panic. Come a few minutes early and we will sort it out.
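If you want to see what is missing before asking for help, a quick manual check along these lines may be useful. It is only a sketch: `check_tools` is a hypothetical helper written for this note, not the check that setup.sh prints, and the tool list is taken from the description above.

```shell
# Hypothetical helper: report which of the named tools are on PATH.
# Returns the number of missing tools (0 means everything was found).
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

# Tools named in the setup description, plus fastq-dump from sra-toolkit.
check_tools wget curl awk grep sed tar gzip fastq-dump \
  || echo "Some tools are missing; try re-running: bash setup.sh"

# The data directory that setup.sh is described as creating.
if [ -d "$HOME/workshop/data" ]; then
  echo "ok: ~/workshop/data"
else
  echo "missing: ~/workshop/data"
fi
```

Any line starting with "missing:" is worth mentioning when you come early for troubleshooting.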
- Setup and troubleshooting.
- UNIX basics: navigation, inspection, pipes and redirects, text search.
- GEO basics: accession types (GSE, GSM, SRR), common files, downloading with wget and fastq-dump.
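As a small illustration of the accession types in the list above, the prefix alone tells you what kind of record you are looking at: GSE is a GEO series, GSM is a GEO sample, and SRR is an SRA run. The function name `accession_type` is made up for this sketch, and it only covers those three prefixes.

```shell
# Hypothetical helper: classify an accession by its prefix.
# Only the three prefixes covered in the workshop are handled.
accession_type() {
  case "$1" in
    GSE[0-9]*) echo "GEO series" ;;
    GSM[0-9]*) echo "GEO sample" ;;
    SRR[0-9]*) echo "SRA run" ;;
    *)         echo "unknown" ;;
  esac
}

# The two accessions used in the workshop.
for acc in GSE251845 SRR390728; do
  printf '%s\t%s\n' "$acc" "$(accession_type "$acc")"
done
```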
- Exercises/answers/Exercise1_unix.sh: directory tree, file creation, seq, mv, concatenation.
- Exercises/answers/Exercise_grep.sh: grep on happiness.csv (plain match, -w, -v, -n).
- Exercises/answers/Exercise_wordle.sh: mini-capstone, solving today's Wordle with cat | tr | egrep over /usr/share/dict/words.
- Exercises/answers/Exercise_awk_fastq.sh: awk on the paired-end FASTQ subset (read count, average length, GC%, N filtering, top 5' hexamers).
- Exercises/answers/Exercise2_geo_download.sh: wget for a GEO supplementary file and an ENA FASTQ, then fastq-dump -X 10000 for a capped SRA pull.
- Exercises/answers/Exercise3_1_counts_csv.sh: inspecting the gzipped count matrix from GSE251845.
- Exercises/answers/Exercise3_2_fastq.sh: counting reads, hexamer bias, motif search, and safely subsetting a FASTQ.
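To give a flavour of the awk exercise, here is a minimal sketch of read count, average length, and GC% on a tiny made-up FASTQ. It is not the answer key: `fastq_stats` and the three toy reads are invented for illustration, and the sketch assumes plain four-line FASTQ records, so the sequence is always line 2 of each record.

```shell
# Toy FASTQ with three reads (made-up data, not from the workshop files).
fq=$(mktemp)
cat > "$fq" <<'EOF'
@read1
ACGTAC
+
IIIIII
@read2
GGCCNA
+
IIIIII
@read3
ATAT
+
IIII
EOF

# Hypothetical helper: read count, mean read length, and GC percentage.
fastq_stats() {
  awk 'NR % 4 == 2 {                    # sequence lines only
         n++
         len += length($0)              # record length before gsub edits $0
         gc  += gsub(/[GCgc]/, "")      # gsub returns the number of G/C bases
       }
       END {
         if (n == 0) { print "reads=0"; exit }
         printf "reads=%d mean_len=%.2f gc_pct=%.2f\n", n, len / n, 100 * gc / len
       }' "$1"
}

fastq_stats "$fq"   # prints: reads=3 mean_len=5.33 gc_pct=43.75
rm -f "$fq"
```

For a gzipped FASTQ, the same awk program can read from `gzip -dc file.fastq.gz` through a pipe instead of taking a filename.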
The demo accession is SRR390728 (small, public, finishes fast). The count matrix is from GSE251845.
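A capped pull of the demo accession might look like the sketch below. The -X 10000 cap comes from the exercise description; `--split-files`, `--gzip`, and the `~/workshop/data` output directory are assumptions added here, and `capped_pull` is a hypothetical wrapper. By default it only prints the command so you can read it first; set RUN=1 to actually download.

```shell
# Hypothetical wrapper: build a capped fastq-dump command.
# Prints the command unless RUN=1 is set, in which case it runs it.
capped_pull() {
  acc="$1"
  spots="${2:-10000}"                     # -X: stop after this many spots
  outdir="${3:-$HOME/workshop/data}"      # assumed output directory
  # --split-files and --gzip are choices made for this sketch.
  set -- fastq-dump -X "$spots" --split-files --gzip -O "$outdir" "$acc"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

capped_pull SRR390728          # dry run: shows the command
# RUN=1 capped_pull SRR390728  # uncomment to download for real
```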
- Count matrix: GSE251845.
- Paired-end FASTQ subsets: statOmics SGA2019 airway data.