6. Annotating transcript families

You can start with the ‘trinity-nematostella.renamed.fa.gz’ file from the previous page (5. Building transcript families) _or_ download a precomputed one:

cd /mnt
curl -O http://public.ged.msu.edu.s3.amazonaws.com/trinity-nematostella.renamed.fa.gz

Note

The BLASTs below will take a long time, like 24-36 hours. If you want to work with canned BLASTs, do:

cd /mnt
curl -O http://public.ged.msu.edu.s3.amazonaws.com/nema.x.mouse.gz
curl -O http://public.ged.msu.edu.s3.amazonaws.com/mouse.x.nema.gz
gunzip nema.x.mouse.gz
gunzip mouse.x.nema.gz

However, if you built your own transcript families, you’ll need to rerun these BLASTs.

Install BLAST

Make sure you’ve updated BLAST, as in 4. BLASTing your assembled data:

cd /root

curl -O ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.24/blast-2.2.24-x64-linux.tar.gz
tar xzf blast-2.2.24-x64-linux.tar.gz
cp blast-2.2.24/bin/* /usr/local/bin
cp -r blast-2.2.24/data /usr/local/blast-data

Doing a preliminary annotation against mouse

Now let’s assign putative homology & orthology to these transcripts, by doing BLASTs & reciprocal best hit analysis. First, uncompress your transcripts file:

cd /mnt
gunzip trinity-nematostella.renamed.fa.gz

Now, grab the latest mouse RefSeq:

curl -O ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.protein.faa.gz
gunzip mouse.protein.faa.gz

Format both as BLAST databases:

formatdb -i mouse.protein.faa -o T -p T
formatdb -i trinity-nematostella.renamed.fa -o T -p F

And, now, if you haven’t downloaded the canned BLAST data above, run BLAST in both directions. Note, this may take ~24 hours or longer; you probably want to run it in screen:

blastall -i trinity-nematostella.renamed.fa -d mouse.protein.faa -e 1e-3 -p blastx -o nema.x.mouse -a 8 -v 4 -b 4
blastall -i mouse.protein.faa -d trinity-nematostella.renamed.fa -e 1e-3 -p tblastn -o mouse.x.nema -a 8 -v 4 -b 4

Assigning names to sequences

Now, calculate putative homology (best BLAST hit) and orthology (reciprocal best hits):

python /usr/local/share/eel-pond/make-uni-best-hits.py nema.x.mouse nema.x.mouse.homol
python /usr/local/share/eel-pond/make-reciprocal-best-hits.py nema.x.mouse mouse.x.nema nema.x.mouse.ortho

Prepare some of the mouse info:

python /usr/local/share/eel-pond/make-namedb.py mouse.protein.faa mouse.namedb
python -m screed.fadbm mouse.protein.faa

And, finally, annotate the sequences:

python /usr/local/share/eel-pond/annotate-seqs.py trinity-nematostella.renamed.fa nema.x.mouse.ortho nema.x.mouse.homol

After this last, you should see:

207533 sequences total
10471 annotated / ortho
95726 annotated / homol
17215 annotated / tr
123412 total annotated

If any of these numbers are zero on the nematostella data, then you probably need to redo the BLAST.

This will produce a file ‘trinity-nematostella.renamed.fa.annot’, which will have sequences that look like this:

>nematostella.id1.tr115222 h=43% => suppressor of tumorigenicity 7 protein isoform 2 [Mus musculus] 1_of_7_in_tr115222 len=1635 id=1 tr=115222 1_of_7_in_tr115222 len=1635 id=1 tr=115222

I suggest renaming this file to ‘nematostella.fa’ and using it for BLASTs (see 4. BLASTing your assembled data).

cp trinity-nematostella.renamed.fa.annot nematostella.fa

The annotate-seqs command will also produce two CSV files. The first, trinity-nematostella.renamed.fa.annot.csv, is small, and contains sequence names linked to orthology and homology information. The secnod, trinity-nematostella.renamed.fa.annot.large.csv, is large, and contains all of the same information as in the first but also contains all of the actual DNA sequence in the last column. (Some spreadsheet programs may not be able to open it.) You can do:

cp *.csv /root/Dropbox

to copy them locally, if you have set up Dropbox (see: Installing Dropbox on your EC2 machine).

Next: 7. Expression analysis (with RSEM).


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github.
comments powered by Disqus

Table Of Contents

This Page