5. Building transcript families

Install khmer, screed, and BLAST. (See 1. Quality Trimming and Filtering Your Sequences and 4. BLASTing your assembled data). I would suggest using an m1.large or m1.xlarge machine.

You’ll also need to setup a personal program binary directory:

mkdir -p ${HOME}/bin
export PATH=${PATH}:${HOME}/bin
echo 'export PATH=${PATH}:${HOME}/bin' >> ${HOME}/.bashrc

Then install a script:

cd ${HOME}/bin
wget https://raw.githubusercontent.com/ctb/eel-pond/protocols-v0.8.3/rename-with-partitions.py
chmod u+x rename-with-partitions.py

Copy in your data

You need your assembled transcriptome (from e.g. 3. Running the Actual Assembly). Put it in the project directory as ‘trinity-nematostella-raw.fa.gz’:

cd ${HOME}/projects/eelpond
gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz

For the purposes of your first run through, I suggest just grabbing my copy of the Nematostella assembly:

cd ${HOME}/projects/eelpond/
curl -O https://s3.amazonaws.com/public.ged.msu.edu/trinity-nematostella-raw.fa.gz

Run khmer partitioning

Partitioning runs a de Bruijn graph-based clustering algorithm that will cluster your transcripts by transitive sequence overlap. That is, it will group transcripts into transcript families based on shared sequence.

cd ${HOME}/projects/eelpond
mkdir partitions
cd partitions
do-partition.py -x 1e9 -N 4 --threads ${THREADS:-1} nema \

This should take about 15 minutes, and outputs a file ending in ‘.part’ that contains the partition assignments. Now, group and rename the sequences:

cd ${HOME}/projects/eelpond/partitions
rename-with-partitions.py nema trinity-nematostella-raw.fa.gz.part
mv trinity-nematostella-raw.fa.gz.part.renamed.fasta.gz \

Looking at the renamed sequences

Let’s look at the renamed sequences:

cd ${HOME}/projects/eelpond/partitions
gunzip -c trinity-nematostella.renamed.fa.gz | head

You’ll see that each sequence name looks like this:

>nema.id1.tr16001 1_of_1_in_tr16001 len=261 id=1 tr=16001

Some explanation:

  • nema is the prefix that you gave the rename script, above; modify accordingly for your own organism. It’s best to change it each time you do an assembly, just to keep things straight.

  • idN is the unique ID for this sequence; it will never be repeated in this


  • trN is the transcript family, which may contain one or more transcripts.

  • 1_of_1_in_tr16001 tells you that this transcript family has only one transcript in it (this one!) Other transcript families may (will) have more.

  • len is the sequence length.

Next: 6. Annotating transcript families

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github.
comments powered by Disqus