5. Building transcript families¶

Install khmer, screed, and BLAST. (See 1. Quality Trimming and Filtering Your Sequences and 4. BLASTing your assembled data). I would suggest using an m1.large or m1.xlarge machine.

You’ll also need to install some eel-pond scripts:

cd /usr/local/share
git clone https://github.com/ctb/eel-pond.git
cd eel-pond
git checkout protocols-v0.8.3

Copy in your data¶

You need your assembled transcriptome (from e.g. 3. Running the Actual Assembly). Put it in /mnt as ‘trinity-nematostella-raw.fa.gz’:

cd /mnt
gzip -c work/trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz

For the purposes of your first run through, I suggest just grabbing my copy of the Nematostella assembly:

cd /mnt
curl -O https://s3.amazonaws.com/public.ged.msu.edu/trinity-nematostella-raw.fa.gz

Run khmer partitioning¶

Partitioning runs a de Bruijn graph-based clustering algorithm that will cluster your transcripts by transitive sequence overlap. That is, it will group transcripts into transcript families based on shared sequence.

/usr/local/share/khmer/scripts/do-partition.py -x 1e9 -N 4 --threads 4 nema trinity-nematostella-raw.fa.gz

This should take about 15 minutes, and outputs a file ending in ‘.part’ that contains the partition assignments. Now, group and rename the sequences:

python /usr/local/share/eel-pond/rename-with-partitions.py nema trinity-nematostella-raw.fa.gz.part
mv trinity-nematostella-raw.fa.gz.part.renamed.fasta.gz trinity-nematostella.renamed.fa.gz

Looking at the renamed sequences¶

Let’s look at the renamed sequences:

gunzip -c trinity-nematostella.renamed.fa.gz | head

You’ll see that each sequence name looks like this:

>nema.id1.tr16001 1_of_1_in_tr16001 len=261 id=1 tr=16001

Some explanation:

‘nema’ is the prefix that you gave the rename script, above; modify accordingly for your own organism. It’s best to change it each time you do an assembly, just to keep things straight.
‘idN’ is the unique ID for this sequence; it will never be repeated in this

file.
‘trN’ is the transcript family, which may contain one or more transcripts.
‘1_of_1_in_tr16001’ tells you that this transcript family has only one transcript in it (this one!) Other transcript families may (will) have more.
‘len’ is the sequence length.

Next: 6. Annotating transcript families

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github.

comments powered by Disqus

5. Building transcript families¶

Copy in your data¶

Run khmer partitioning¶

Looking at the renamed sequences¶

Table Of Contents

This Page

Navigation

5. Building transcript families¶

Copy in your data¶

Run khmer partitioning¶

Looking at the renamed sequences¶

Table Of Contents

This Page

Quick search

Navigation