Introduction

This repository contains three historical versions of curated maize TE libraries derived from the Maize TE Consortium (MTEC). I combined the three and further cleaned the combined library with the commands and manual curation steps described below. If you are looking for a comprehensive, high-quality maize TE library, this is the one (named like "maizeTE02052020" in the root directory).

Files

maizeTE10102014 was downloaded from the official MTEC website (http://maizesequence.org). The website is gone, but I managed to get a mirror of the first page. Please refer to the file history/Maizedatabase_mirror.pdf for more information about the MTEC project.
TE_12-Feb-2015_15-35.fa was shared by Nicolas Blavet from https://github.com/mcstitzer/maize_v4_TE_annotation/issues/9.
Wessler-Bennetzen_2.fasta was shared by Kapeel Chougule and was used to annotate the initial B73 genome (Schnable et al. 2009). I believe this is an earlier version of the MTEC library.
nonTE.repeat.fa contains 5 non-TE repeats (knob180, knob TR-1, rDNA spacer, subtelomere 4-12-1, and CentC) in maize, which was shared by Jianing Liu.

Combine the three MTEC libraries + nonTE repeats

1. Reformat sequence IDs

for i in history/Wessler-Bennetzen_2.fasta history/maizeTE10102014 history/TE_12-Feb-2015_15-35.fa; do perl -nle 's/\s+$//g; $_=(split)[0]; s/\-/_/g; print $_' $i > $i.mod; done
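
For illustration, here is a minimal Python sketch of what the Perl one-liner above appears to do to each line: strip trailing whitespace, keep only the first whitespace-delimited field, and replace hyphens with underscores. The function name `reformat_line` and the toy header are hypothetical and not part of the repository.

```python
def reformat_line(line: str) -> str:
    """Keep the first whitespace-delimited field and turn '-' into '_'."""
    fields = line.split()                 # also drops trailing whitespace
    first = fields[0] if fields else ""   # blank lines stay blank
    return first.replace("-", "_")


# Toy input (hypothetical header with a description and a hyphen).
toy = [">DTA_toy-1 extra text  ", "ACGTTGCA"]
for raw in toy:
    print(reformat_line(raw))
```

Note that the substitution runs on every line, not only on headers, so a hyphen on a sequence line would also become an underscore.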

2. Combine sequences with unique IDs

perl bin/output_by_list.pl 1 <(cat history/*.mod) 1 <(cat history/*.mod|grep \>|sort -u) -FA > history/maizeTE11212019.ori
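
The command above selects records from the concatenated libraries using a sorted, unique list of headers. How bin/output_by_list.pl treats an ID that occurs in more than one library is not shown here, so the Python sketch below simply assumes that the first record seen for each ID is kept. `read_fasta` and `unique_by_id` are hypothetical helper names.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header without '>', sequence) pairs from FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unique_by_id(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first record for each ID (an assumption, see above)."""
    seen = {}
    for header, seq in records:
        seen.setdefault(header, seq)
    return list(seen.items())


# Toy input: the ID "a" appears twice with different sequences.
toy = [">a", "AC", "GT", ">b", "TT", ">a", "GG"]
print(unique_by_id(read_fasta(toy)))
```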

3. Split the library into consensus and others

perl bin/output_by_list.pl 1 history/maizeTE11212019.ori 1 <(grep consen history/maizeTE11212019.ori) -FA > history/maizeTE11212019.ori.consensus

perl bin/output_by_list.pl 1 history/maizeTE11212019.ori 1 <(grep consen history/maizeTE11212019.ori) -FA -ex > history/maizeTE11212019.ori.others
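
The two commands in this step differ only in the trailing -ex, which is assumed here to mean "output everything except the listed entries". Under that assumption, the split is a simple partition on whether the ID contains the string "consen", as in this Python sketch (`split_by_keyword` is a hypothetical name; the example IDs are taken from the removal list further down).

```python
from typing import Iterable, List, Tuple


def split_by_keyword(ids: Iterable[str], keyword: str = "consen") -> Tuple[List[str], List[str]]:
    """Partition IDs into those containing the keyword and the rest."""
    consensus, others = [], []
    for seq_id in ids:
        (consensus if keyword in seq_id else others).append(seq_id)
    return consensus, others


ids = ["DTA_ZM00171_consensus", "RLC_iwim_AC203300_7761"]
consensus, others = split_by_keyword(ids)
print(consensus, others)
```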

4. Remove TEs in others that are represented by consensus TEs

RepeatMasker -pa 36 -div 40 -lib history/maizeTE11212019.ori.consensus -cutoff 225 history/maizeTE11212019.ori.others

perl bin/make_masked.pl -rmout history/maizeTE11212019.ori.others.out -genome history/maizeTE11212019.ori.others -maxdiv 20 -minscore 200 -minlen 80 -t 30

perl bin/cleanup_tandem.pl -nc 1000 -nr 0.5 -minlen 80 -cleanN 1 -cleanT 1 -trf 0 -f history/maizeTE11212019.ori.others.new.masked > history/maizeTE11212019.ori.others.new.masked.cln

cat history/maizeTE11212019.ori.consensus history/maizeTE11212019.ori.others.new.masked.cln > history/maizeTE11212019.ori2

5. Remove redundant sequences

perl bin/cleanup_nested.pl -in history/maizeTE11212019.ori2 -cov 0.98 -minlen 80 -miniden 95 -iter 2 -t 36

6. Remove nonTE repeats and tandem repeats

RepeatMasker -pa 36 -div 40 -no_is -norna -nolow -lib history/nonTE.repeat.fa -cutoff 225 history/maizeTE11212019.ori2.cln

perl bin/cleanup_tandem.pl -nc 1000 -nr 0.5 -minlen 80 -cleanN 1 -cleanT 1 -trf 1 -f history/maizeTE11212019.ori2.cln.masked > history/maizeTE11212019.ori2.cln2

cat history/nonTE.repeat.fa history/maizeTE11212019.ori2.cln2 > history/maizeTE11212019.ori3

Improve the combined library

1. Reclassify unknown TEs

python2 TEsorter.py history/maizeTE11212019.ori3 -p 36

2. Find misclassified entries

The file history/maizeTE11212019.ori3.rexdb.cls.tsv contains the new classifications of the library. Most of them are consistent with the old classifications. The main improvement is in entries previously classified as LTR/unknown.

perl -nle '($info, $cla)=(split)[0,2]; my $oldcla=$1 if $info=~/^([A-Z]+)_/; $cla=~s/EnSpm_CACTA/DTC/; $cla=~s/hAT/DTA/; $cla=~s/PIF_Harbinger/DTH/; $cla=~s/MuDR_Mutator/DTM/; $cla=~s/Tc1_Mariner/DTT/; $cla=~s/Gypsy/RLG/; $cla=~s/Copia/RLC/; print "$oldcla\t$cla\t$info" if $cla ne $oldcla' history/maizeTE11212019.ori3.rexdb.cls.tsv |less
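
The comparison logic of the one-liner above can be sketched in Python as follows. It assumes, as the one-liner does, that the sequence ID is in the first column of the TSV and the superfamily in the third, and that the old classification is the run of capital letters before the first underscore of the ID. The function names and toy rows are hypothetical.

```python
import re

# Substitutions copied from the one-liner, applied in the same order.
SUPERFAMILY_TO_CODE = [
    ("EnSpm_CACTA", "DTC"),
    ("hAT", "DTA"),
    ("PIF_Harbinger", "DTH"),
    ("MuDR_Mutator", "DTM"),
    ("Tc1_Mariner", "DTT"),
    ("Gypsy", "RLG"),
    ("Copia", "RLC"),
]


def to_code(superfamily: str) -> str:
    """Translate a superfamily name into its three-letter code."""
    for name, code in SUPERFAMILY_TO_CODE:
        superfamily = superfamily.replace(name, code, 1)
    return superfamily


def find_mismatches(tsv_lines):
    """Return (old_code, new_code, seq_id) for rows whose codes disagree."""
    hits = []
    for line in tsv_lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        seq_id, new_code = fields[0], to_code(fields[2])
        match = re.match(r"([A-Z]+)_", seq_id)
        old_code = match.group(1) if match else ""
        if new_code != old_code:
            hits.append((old_code, new_code, seq_id))
    return hits


toy_rows = ["RLX_toy_1\tLTR\tGypsy\tTekay", "RLC_toy_2\tLTR\tCopia\tAle"]
for hit in find_mismatches(toy_rows):
    print("\t".join(hit))
```

Only the first toy row is reported, because its old code (RLX) differs from the new one (RLG).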

3. Some LTRs appear to have the same name but different classifications (RLG/RLC/RLX)

4. Manually check misclassified sequences (`list.fa`).

These entries are put in the history/removal.list and removed:

Seq_ID	Removal reason
RLC_chr3_D_28761151	rDNA-contained
DTM_Zm08959_AC199876_1	LTRcoding-contained
DTM_Zm22805IC_AC207689_1	LINE-contained
RIX_nugimu_AC203843_0	Duplicated_with_RIL_nugimu_AC203843_0
RLX_fageri_AC204875_8470	misclassified_as_LINE
DTA_ZM00171_consensus	misclassified_as_CACTA
DTA_ZM00205_consensus	misclassified_as_CACTA
DTA_ZM00284_consensus	misclassified_as_CACTA
RLX_teki_AC202867-7492	rDNA-contained
RLG_ajajog_AC191578_3186	A_RLG_nested_in_RLC_ajajog_AC191578_3186
RLC_iwim_AC203300_7761	misclassified_RLG_duplicated
RLC_kupu_AC216069_13264	misclassified_RLG_duplicated
RLX_pute_AC197188_5467	duplicated_RLC_pute_AC197188_5467
RLX_votaed_AC215881_13209	duplicated_RLC_votaed_AC215881_13209
RLC_votaed_AC215881_13209	5-6_LTR_nested_together
RLX_bobeg_AC193485_3670	5_LTR_nested_together

perl bin/output_by_list.pl 1 history/maizeTE11212019.ori3 1 history/removal.list -FA -ex > history/maizeTE11212019.ori3.cln

5. PPP_PPO_AC185414 is changed to DTH_PPO_AC185414 manually

6. Update LTR classifications

perl -nle '($info, $cla)=(split)[0,2]; my $oldcla=$1 if $info=~/^([A-Z]+)_/; $cla=~s/EnSpm_CACTA/DTC/; $cla=~s/hAT/DTA/; $cla=~s/PIF_Harbinger/DTH/; $cla=~s/MuDR_Mutator/DTM/; $cla=~s/Tc1_Mariner/DTT/; $cla=~s/Gypsy/RLG/; $cla=~s/Copia/RLC/; next unless /LTR/; my $info_new=$info; $info_new=~s/$oldcla/$cla/; print "$info|$info_new" if $cla ne $oldcla' history/maizeTE11212019.ori3.rexdb.cls.tsv > history/maizeTE11212019.ori3.rexdb.cls.tsv.LTR

for i in $(cat history/maizeTE11212019.ori3.rexdb.cls.tsv.LTR); do perl -i -slane 'my ($old, $new)=(split /\|/, $info); s/$old/$new/; print $_' -- -info="$i" history/maizeTE11212019.ori3.cln; done
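
As a cross-check of the renaming step, the "old|new" pairs can also be applied in a single pass. The Python sketch below is a stricter variant, not the method used above: it replaces a header only when the whole ID equals the old name, whereas the Perl substitution matches the old name as a pattern anywhere on a line. It assumes headers contain only the ID, as they do after step 1 of the combining stage. `apply_renames` and the toy IDs are hypothetical.

```python
def apply_renames(fasta_lines, pairs):
    """Rename FASTA headers using 'old|new' strings, exact ID match only."""
    mapping = dict(pair.split("|", 1) for pair in pairs)
    renamed = []
    for line in fasta_lines:
        if line.startswith(">"):
            seq_id = line[1:].strip()
            line = ">" + mapping.get(seq_id, seq_id)
        renamed.append(line)
    return renamed


toy_fasta = [">RLX_toy_1", "ACGT", ">RLX_toy_10", "TTTT"]
print(apply_renames(toy_fasta, ["RLX_toy_1|RLG_toy_1"]))
```

With exact matching, RLX_toy_10 is left alone; a substring match on RLX_toy_1 would have renamed it as well.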

7. Convert sequence names to RepeatMasker format

perl -nle 'my $id=(split)[0]; $id=~s/RLC_(.*)/$1#LTR\/Copia/; $id=~s/RLG_(.*)/$1#LTR\/Gypsy/; $id=~s/RLX_(.*)/$1#LTR\/unknown/; $id=~s/DHH_(.*)/$1#DNA\/Helitron/; $id=~s/DTA_(.*)/$1#DNA\/DTA/; $id=~s/DTC_(.*)/$1#DNA\/DTC/; $id=~s/DTH_(.*)/$1#DNA\/DTH/; $id=~s/DTM_(.*)/$1#DNA\/DTM/; $id=~s/DTT_(.*)/$1#DNA\/DTT/; $id=~s/(RIT_.*)/$1#LINE\/RTE/; $id=~s/(RIL_.*)/$1#LINE\/L1/; $id=~s/(RIX_.*)/$1#LINE\/unknown/; $id=~s/(ZM_CACTA_noncoding.*)/$1#MITE\/DTC/; $id=~s/(ZM_Stowaway.*)/$1#DNA\/DTT/; $id=~s/(ZM_Tourist.*)/$1#DNA\/DTH/; $id=~s/(ZM_hAT_noncoding.*)/$1#MITE\/DTA/; $id=~s/(RST_.*)/$1#SINE\/tRNA/; print $id' history/maizeTE11212019.ori3.cln > history/maizeTE11222019.ori
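
The renaming rules in the one-liner above fall into two groups: for the LTR, Helitron and TIR codes the three-letter prefix is removed from the name, while for LINE, SINE and the ZM_* names the name is kept whole and the class is appended. The Python sketch below transcribes those rules. It is a simplified variant that anchors each rule at the start of the ID and stops at the first match, which is an assumption about the intended behavior. `to_repeatmasker` is a hypothetical name.

```python
# Prefix is stripped from the name: RLC_xxx -> xxx#LTR/Copia
STRIP_PREFIX = {
    "RLC": "LTR/Copia", "RLG": "LTR/Gypsy", "RLX": "LTR/unknown",
    "DHH": "DNA/Helitron", "DTA": "DNA/DTA", "DTC": "DNA/DTC",
    "DTH": "DNA/DTH", "DTM": "DNA/DTM", "DTT": "DNA/DTT",
}
# Name is kept whole: RIL_xxx -> RIL_xxx#LINE/L1
KEEP_PREFIX = {
    "RIT_": "LINE/RTE", "RIL_": "LINE/L1", "RIX_": "LINE/unknown",
    "ZM_CACTA_noncoding": "MITE/DTC", "ZM_Stowaway": "DNA/DTT",
    "ZM_Tourist": "DNA/DTH", "ZM_hAT_noncoding": "MITE/DTA",
    "RST_": "SINE/tRNA",
}


def to_repeatmasker(seq_id: str) -> str:
    """Convert an ID (without '>') to the name#class/superfamily form."""
    for code, cls in STRIP_PREFIX.items():
        if seq_id.startswith(code + "_"):
            return f"{seq_id[len(code) + 1:]}#{cls}"
    for prefix, cls in KEEP_PREFIX.items():
        if seq_id.startswith(prefix):
            return f"{seq_id}#{cls}"
    return seq_id  # no rule applies, e.g. the non-TE repeats


for example in ["RLC_iwim_AC203300_7761", "RIL_nugimu_AC203843_0"]:
    print(to_repeatmasker(example))
```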

8. Rename short TIR (<= 600bp) to MITE

perl bin/rename_MITE.pl history/maizeTE11222019.ori > history/maizeTE11222019.ori.rename
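
The internals of bin/rename_MITE.pl are not shown here. As a rough Python sketch of the rule stated in the heading, and assuming that "TIR" means the DNA/DTA, DNA/DTC, DNA/DTH, DNA/DTM and DNA/DTT classes (Helitrons excluded) and that the new class is written as MITE/DTx, the renaming might look like this. `rename_short_tir` and the toy headers are hypothetical.

```python
def rename_short_tir(header: str, seq: str, max_len: int = 600) -> str:
    """Relabel a TIR element as MITE when its sequence is <= max_len bp."""
    name, _, cls = header.partition("#")
    if cls.startswith("DNA/DT") and len(seq) <= max_len:
        cls = "MITE/" + cls.split("/", 1)[1]
    return f"{name}#{cls}" if cls else name


print(rename_short_tir("toy1#DNA/DTT", "A" * 600))   # at the threshold
print(rename_short_tir("toy2#DNA/DTT", "A" * 601))   # just above it
```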

Remove gene sequences

1. Mask gene cds

RepeatMasker -pa 36 -q -no_is -norna -nolow -div 40 -lib history/Zea_mays.AGPv4.cds.all.noTE.fa.mod.cln -cutoff 500 history/maizeTE11222019.ori.rename

perl bin/cleanup_tandem.pl -misschar n -Nscreen 1 -nc 1000 -nr 0.3 -minlen 80 -maxlen 5000000 -cleanN 1 -cleanT 1 -trf 0 -f history/maizeTE11222019.ori.rename.masked > history/maizeTE11222019.ori.rename.nogene

2. Finalize

Manually add the sequence CL569186.1#subtelomere/4-12-1 back to history/maizeTE11222019.ori.rename.nogene

cp history/maizeTE11222019.ori.rename.nogene maizeTE11222019

Updates

01/30/2020

Added four CRM sequences (CRM1-4) contributed by Na Wang from Gernot and Presting (2008).

02/03/2020

Added three-letter class codes before all consensus sequence IDs. IDs like "ZM00034_consensus" were inherited from the 2014 version of the MTEC library, so the original ID is kept unchanged after the new prefix.

perl -nle 's/>(.*)#(.*)\/(.*)/>$3_$1#$2\/$3/; print $_' maizeTE01302020 > maizeTE02032020

02/05/2020

Fixed naming.

perl -nle 's/>(.*)#(.*)\/(.*)/>$3_$1#$2\/$3/ if /consensus/; s/>(.*)/>$1#LTR\/CRM/ if /CRM/; print $_' history/maizeTE01302020 > maizeTE02052020
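
To make the effect of the command above easier to follow, here is a Python sketch of the same two substitutions: headers containing "consensus" get the superfamily code copied in front of the name, and lines containing "CRM" get #LTR/CRM appended. It assumes the CRM headers carry no class yet, which is not stated above. `fix_header` and the example headers are for illustration only.

```python
import re


def fix_header(line: str) -> str:
    """Apply the two header fixes to one line of the library."""
    if "consensus" in line:
        # >name#class/superfamily -> >superfamily_name#class/superfamily
        line = re.sub(r">(.*)#(.*)/(.*)", r">\3_\1#\2/\3", line, count=1)
    if "CRM" in line:
        # >name -> >name#LTR/CRM
        line = re.sub(r">(.*)", r">\1#LTR/CRM", line, count=1)
    return line


for header in [">ZM00034_consensus#DNA/DTA", ">CRM1", ">iwim_AC203300_7761#LTR/Copia"]:
    print(fix_header(header))
```

Restricting the first substitution to consensus entries leaves names such as iwim_AC203300_7761#LTR/Copia untouched.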

04/09/2026

The following 40 new transposable element (TE) sequences were added and the previous library maizeTE02052020 was moved to history/. Additionally, the TEnaming_guideline.txt file was added to the repository to document TE naming conventions.

28 LTR (Gypsy 14, Copia 9, unknown 5)
10 TIR DNA transposons (DTA 7, DTH 2, DTC 1)
2 LINEs (L1 1, RTE 1)

Curation Notes: The 40 newly added transposable element (TE) sequences are highly accurate consensus sequences curated and validated through the following pipeline:

Consensus Generation: Initial candidate sequences were identified and their consensus boundaries were curated using TEtrimmer.
Structural Validation (LTRs): The 28 LTR elements underwent rigorous manual validation. Complete and intact structures (including 5' and 3' long terminal repeats) were confirmed using coverage plots and BLAST analysis.
Boundary & TSD Integrity Verification: Target Site Duplications (TSDs) for all 40 curated TEs were initially identified using TSD-searcher and subsequently inspected manually. During manual curation, each sequence was checked for:
- Gaps between the designated TE boundaries and the TSDs.
- Mismatches within the TSD sequences themselves.
This combination of computational searching and manual inspection confirmed the presence and structural integrity of the TSDs across the sequences, supporting the placement of the consensus boundaries.
Novel Lineage Isolation: The RIL_YJ_0#LINE/L1 entry is a newly built de novo consensus. Due to its low sequence similarity to existing LINE/L1 elements in maizeTE02052020, it has been established as a novel consensus family.
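
The two TSD checks listed above were done by manual inspection. Purely as an illustration of what they measure, and not as the procedure or tool that was used, they can be written as two small Python functions. The coordinate convention (0-based, half-open intervals) and both function names are assumptions made for this sketch.

```python
from typing import Tuple

Interval = Tuple[int, int]  # 0-based, half-open (start, end)


def tsd_mismatches(left_tsd: str, right_tsd: str) -> int:
    """Count positions where the two TSD copies differ (case-insensitive)."""
    if len(left_tsd) != len(right_tsd):
        raise ValueError("TSD copies differ in length")
    return sum(a != b for a, b in zip(left_tsd.upper(), right_tsd.upper()))


def boundary_gaps(left_tsd: Interval, te: Interval, right_tsd: Interval) -> Tuple[int, int]:
    """Return the number of bases between each TSD copy and the TE boundary."""
    return te[0] - left_tsd[1], right_tsd[0] - te[1]


print(tsd_mismatches("ATGAC", "ATGTC"))
print(boundary_gaps((2, 7), (7, 15), (15, 20)))
```

A result of zero mismatches and gaps of (0, 0) corresponds to identical TSD copies that directly flank the TE.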
