===============
3. Partitioning
===============

.. note::

   Partitioning may not be needed or useful for data sets with low or
   medium richness.  You can proceed to :doc:`4-assemble` and use the
   pe.qc.fq.gz and se.fq.fq.gz files from :doc:`2-diginorm` in place of
   XXX. @@

.. note::

   Make sure you're running in screen!

Start with the QC'ed files from :doc:`2-diginorm` or copy them into a
working directory.

.. shell start

Simple Partitioning
-------------------

Partitioning is a rather complex process -- nowhere near as nice and
simple as digital normalization.  However, we do have a simple script
to run the basic stuff; if this script is too slow, or doesn't work
well for big chunks of data, we might have remedies, so please
`contact us <mailto:protocols@lists.idyll.org>`__.  But in the
meantime, here is a simple procedure.

First, eliminate highly repetitive k-mers that could join multiple species
and rename the files appropriately

::
 
   cd /mnt/work
   python /usr/local/share/khmer/sandbox/filter-below-abund.py normC5k20.kh *.kak.*.fq.gz 
   for i in *.below
   do
      mv $i $i.fq
   done

.. note::

   You will need the normC5k20.kh file from :doc:`2-diginorm` for this
   step.  If you don't have it, you can regenerate it like so::

      /usr/local/share/khmer/scripts/load-into-counting.py -k 20 -N 4 -x 5e8 normC5k20.kh *.qc.fq.gz

Next, run partitioning

::

   /usr/local/share/khmer/scripts/do-partition.py -k 32 -x 1e9 --threads 4 kak *.kak.qc.fq.gz.below.fq
   
This should take about 15 minutes, and will produce '.part' files.  These
are now FASTQ files that contain partition annotations.  For example, check
out::
   head *.pe.kak.qc.fq.gz.below.fq.part

Extracting the Partitions Into Groups
--------------------------------------

Generally there are *lots* of partitions, and for convenience sake we
group them into group files that can be assembled in small chunks.
To do this, run

::

   /usr/local/share/khmer/scripts/extract-partitions.py -X 100000 kak *.part

This will leave you with a bunch of ``kak.group*.fq``, as well as a
``kak.dist`` file containing the distribution of partition sizes (how
many sequences are in a given partition).

Here, the '-X' sets the number of sequences stuck into a group file.
By default the -X parameter is 1 million, which would put all of the
sequences into a single file for this data set.

Separating Groups Into PE and SE
--------------------------------

We still want to track paired and single-ended reads, so let's go ahead and
extract the PE reads as before

::

    for i in kak*.fq
    do
       /usr/local/share/khmer/scripts/extract-paired-reads.py $i
       name=$(basename $i .fq)
       mv ${name}.fq.pe ${name}.pe.fq
       mv ${name}.fq.se ${name}.se.fq                                           
    done

And, finally, compress them

::

   gzip *.pe.fq *.se.fq

Reinflating Partitions (Optional)
---------------------------------

At this point it's worth noting that the partitions are *normalized*,
that is, diginormed.  That makes it hard to use them for abundance
calculations, and some assemblers prefer to have the original
abundances in there.

So, can you recover the abundances?  Of course you can! 

::
   
    python /usr/local/share/khmer/sandbox/sweep-files.py -x 3e8 --db kak.group*.fq --query *.?e.qc.fq.gz

Finally, split the resulting ``.sweep`` files into PE/SE

::

   for i in *kak*.sweep
   do
       sweep=$(basename $i .fq)
       mv $i $sweep.fq 
       /usr/local/share/khmer/scripts/extract-paired-reads.py ${sweep}.fq 
       mv $sweep.fq.pe ${sweep}.nodn.pe.fq
       mv $sweep.fq.se ${sweep}.nodn.se.fq
   done

and compress

::

   gzip *.nodn.se.fq *.nodn.pe.fq

Cleaning Up
------------

At this point, you have quite a few intermediate files, all of
which can be removed
::

   rm *.part kak.group*.fq *.kh *.below.fq *.sweep.fq

You'll be left with 18 pairs of files named ``kak.group00XX.pe.fq.gz``
and ``kak.group00xx.se.fq.gz``, and (if you reinflated the
partitions), another 18 pairs of files named
``kak.group0000.nodn.pe.fq.gz`` and ``kak.group0000.nodn.se.fq.gz``.

----

Next: :doc:`4-assemble`