Install khmer, screed, and BLAST. (See 1. Quality Trimming and Filtering Your Sequences and 4. BLASTing your assembled data). I would suggest using an m1.large or m1.xlarge machine.
You’ll also need to install some eel-pond scripts:
cd /usr/local/share
git clone https://github.com/ctb/eel-pond.git
cd eel-pond
git checkout protocols-v0.8.3
You need your assembled transcriptome (from e.g. 3. Running the Actual Assembly). Put it in /mnt as ‘trinity-nematostella-raw.fa.gz’:
cd /mnt
gzip -c work/trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
For the purposes of your first run through, I suggest just grabbing my copy of the Nematostella assembly:
cd /mnt
curl -O https://s3.amazonaws.com/public.ged.msu.edu/trinity-nematostella-raw.fa.gz
Partitioning runs a de Bruijn graph-based clustering algorithm that will cluster your transcripts by transitive sequence overlap. That is, it will group transcripts into transcript families based on shared sequence.
/usr/local/share/khmer/scripts/do-partition.py -x 1e9 -N 4 --threads 4 nema trinity-nematostella-raw.fa.gz
This should take about 15 minutes, and outputs a file ending in ‘.part’ that contains the partition assignments. Now, group and rename the sequences:
python /usr/local/share/eel-pond/rename-with-partitions.py nema trinity-nematostella-raw.fa.gz.part
mv trinity-nematostella-raw.fa.gz.part.renamed.fasta.gz trinity-nematostella.renamed.fa.gz
Let’s look at the renamed sequences:
gunzip -c trinity-nematostella.renamed.fa.gz | head
You’ll see that each sequence name looks like this:
>nema.id1.tr16001 1_of_1_in_tr16001 len=261 id=1 tr=16001
Some explanation:
‘nema’ is the prefix that you gave the rename script, above; modify accordingly for your own organism. It’s best to change it each time you do an assembly, just to keep things straight.
file.
‘trN’ is the transcript family, which may contain one or more transcripts.
‘1_of_1_in_tr16001’ tells you that this transcript family has only one transcript in it (this one!) Other transcript families may (will) have more.
‘len’ is the sequence length.