===============================
5. Building transcript families
===============================

.. shell start

Install khmer, screed, and BLAST.  (See :doc:`1-quality` and
:doc:`installing-blastkit`).  I would suggest using an m1.large or
m1.xlarge machine.

.. ::

   set -x
   set -e
   echo 5-building-transcript-families install `date` >> ${HOME}/times.out

You'll also need to setup a personal program binary directory::

   mkdir -p ${HOME}/bin
   export PATH=${PATH}:${HOME}/bin
   echo 'export PATH=${PATH}:${HOME}/bin' >> ${HOME}/.bashrc

Then install a script::

   cd ${HOME}/bin
   wget https://raw.githubusercontent.com/ctb/eel-pond/protocols-v0.8.3/rename-with-partitions.py
   chmod u+x rename-with-partitions.py

Copy in your data
=================

You need your assembled transcriptome (from
e.g. :doc:`3-big-assembly`).  Put it in the project directory as 
'trinity-nematostella-raw.fa.gz'::

   cd ${HOME}/projects/eelpond
   gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz

For the purposes of your first run through, I suggest just grabbing my copy
of the Nematostella assembly::

   cd ${HOME}/projects/eelpond/
   curl -O https://s3.amazonaws.com/public.ged.msu.edu/trinity-nematostella-raw.fa.gz

Run khmer partitioning
======================

.. ::

   echo 5-building-transcript-families partition `date` >> ${HOME}/times.out

Partitioning runs a de Bruijn graph-based clustering algorithm that will
cluster your transcripts by transitive sequence overlap.  That is, it will
group transcripts into transcript families based on shared sequence. ::

   cd ${HOME}/projects/eelpond
   mkdir partitions
   cd partitions
   do-partition.py -x 1e9 -N 4 --threads ${THREADS:-1} nema \
     ../trinity-nematostella-raw.fa.gz

.. ::

   echo 5-building-transcript-families rename `date` >> ${HOME}/times.out

This should take about 15 minutes, and outputs a file ending in '.part'
that contains the partition assignments.  Now, group and rename the
sequences::

   cd ${HOME}/projects/eelpond/partitions
   rename-with-partitions.py nema trinity-nematostella-raw.fa.gz.part
   mv trinity-nematostella-raw.fa.gz.part.renamed.fasta.gz \
     trinity-nematostella.renamed.fa.gz

Looking at the renamed sequences
================================

Let's look at the renamed sequences::

   cd ${HOME}/projects/eelpond/partitions
   gunzip -c trinity-nematostella.renamed.fa.gz | head

You'll see that each sequence name looks like this::

   >nema.id1.tr16001 1_of_1_in_tr16001 len=261 id=1 tr=16001

Some explanation:

* ``nema`` is the prefix that you gave the rename script, above; modify
  accordingly for your own organism.  It's best to change it each time
  you do an assembly, just to keep things straight.

* ``idN`` is the unique ID for this sequence; it will never be repeated in this
   file.

* ``trN`` is the transcript family, which may contain one or more transcripts.

* ``1_of_1_in_tr16001`` tells you that this transcript family has only
  one transcript in it (this one!) Other transcript families may
  (will) have more.

* ``len`` is the sequence length.

.. ::

   echo 5-building-transcript-families DONE `date` >> ${HOME}/times.out

.. shell stop

Next: :doc:`6-annotating-transcript-families`