================================
2. Running digital normalization
================================

.. note::

   Make sure you're running in screen!

Start with the QC'ed files from :doc:`1-quality` or copy them into a
working directory; you should start this in /mnt/assembly.

.. note::

   You can start from this point by taking the following files from
   the data on snapshot snap-8f89b092::

      SRR492065.pe.qc.fq.gz
      SRR492065.se.qc.fq.gz
      SRR492066.pe.qc.fq.gz
      SRR492066.se.qc.fq.gz

   Just do::

      mkdir /mnt/assembly
      cd /mnt/assembly
      cp /data/SRR49206?.?e.qc.fq.gz /mnt/assembly

Run a first round of digital normalization
==========================================

Normalize everything to a coverage of 20, starting with the (more valuable)
PE reads; keep pairs using '-p'::

   /usr/local/share/khmer/scripts/normalize-by-median.py -k 20 -C 20 -N 4 -x 5e8 -p --savehash normC20k20.kh *.pe.qc.fq.gz

...and continuing into the (less valuable but maybe still useful) SE reads::

   /usr/local/share/khmer/scripts/normalize-by-median.py -C 20 --savehash normC20k20.kh --loadhash normC20k20.kh *.se.qc.fq.gz

This produces a set of '.keep' files, as well as a normC20k20.kh
database file.

Error-trim your data
====================

Use 'filter-abund' to trim off any k-mers that are abundance-1 in
high-coverage reads.  The -V option is used to make this work better
for variable coverage data sets::

   /usr/local/share/khmer/scripts/filter-abund.py -V normC20k20.kh *.keep

This produces .abundfilt files containing the trimmed sequences.

The process of error trimming could have orphaned reads, so split the
PE file into still-interleaved and non-interleaved reads::

   for i in *.pe.qc.fq.gz.keep.abundfilt
   do
      /usr/local/share/khmer/scripts/extract-paired-reads.py $i
   done

This leaves you with PE files (.pe.qc.fq.gz.keep.abundfilt.pe) and
two sets of SE files (.se.qc.fq.gz.keep.abundfilt and
.pe.qc.fq.gz.keep.abundfilt.se).  (Yes, the naming scheme does make
sense.  Trust me.)

Normalize down to C=5
=====================

Now that we've eliminated many more erroneous k-mers, let's ditch some more
high-coverage data.  First, normalize the paired-end reads::

   /usr/local/share/khmer/scripts/normalize-by-median.py -C 5 -k 20 -N 4 -x 5e8 --savehash normC5k20.kh -p *.pe.qc.fq.gz.keep.abundfilt.pe

and then do the remaining single-ended reads::

   /usr/local/share/khmer/scripts/normalize-by-median.py -C 5 --savehash normC5k20.kh --loadhash normC5k20.kh *.pe.qc.fq.gz.keep.abundfilt.se *.se.qc.fq.gz.keep.abundfilt

Compress and combine the files
==============================

Now let's tidy things up.  Here are the paired files (kak =
keep/abundfilt/keep)::

   gzip -9c SRR492065.pe.qc.fq.gz.keep.abundfilt.pe.keep > SRR492065.pe.kak.qc.fq.gz
   gzip -9c SRR492066.pe.qc.fq.gz.keep.abundfilt.pe.keep > SRR492066.pe.kak.qc.fq.gz

and the single-ended files::

   gzip -9c SRR492066.pe.qc.fq.gz.keep.abundfilt.se.keep SRR492066.se.qc.fq.gz.keep.abundfilt.keep > SRR492066.se.kak.qc.fq.gz
   gzip -9c SRR492065.pe.qc.fq.gz.keep.abundfilt.se.keep SRR492065.se.qc.fq.gz.keep.abundfilt.keep > SRR492065.se.kak.qc.fq.gz

You can now remove all of these various files::

   SRR492066.pe.qc.fq.gz.keep
   SRR492066.pe.qc.fq.gz.keep.abundfilt
   SRR492066.pe.qc.fq.gz.keep.abundfilt.pe
   SRR492066.pe.qc.fq.gz.keep.abundfilt.pe.keep
   SRR492066.pe.qc.fq.gz.keep.abundfilt.se
   SRR492066.pe.qc.fq.gz.keep.abundfilt.se.keep

by typing::

   rm *.keep *.abundfilt *.pe *.se

If you are *not* doing partitioning (see :doc:`3-partition`), you may
also want to remove the k-mer hash tables::

   rm *.kh

If you *are* running partitioning, you can remove the ``normC20k20.kh`` file::

   rm normC20k20.kh

but you will need the ``normC5k20.kh`` file.

Read stats
==========

Try running::

   /usr/local/share/khmer/sandbox/readstats.py *.kak.qc.fq.gz *.?e.qc.fq.gz

after a long wait, you'll see ::

   ---------------
   861769600 bp / 8617696 seqs; 100.0 average length -- SRR492065.pe.qc.fq.gz
   79586148 bp / 802158 seqs; 99.2 average length -- SRR492065.se.qc.fq.gz
   531691400 bp / 5316914 seqs; 100.0 average length -- SRR492066.pe.qc.fq.gz
   89903689 bp / 904157 seqs; 99.4 average length -- SRR492066.se.qc.fq.gz

   173748898 bp / 1830478 seqs; 94.9 average length -- SRR492065.pe.kak.qc.fq.gz
   8825611 bp / 92997 seqs; 94.9 average length -- SRR492065.se.kak.qc.fq.gz
   52345833 bp / 550900 seqs; 95.0 average length -- SRR492066.pe.kak.qc.fq.gz
   10280721 bp / 105478 seqs; 97.5 average length -- SRR492066.se.kak.qc.fq.gz
   
   ---------------

This shows you how many sequences were in the original QC files, and
how many are left in the 'kak' files.  Not bad -- considerably more
than 80% of the reads were eliminated in the kak!

----

Next: :doc:`3-partition`