=================================
2. Applying Digital Normalization
=================================

In this section, we'll apply digital normalization and variable-coverage
k-mer abundance trimming to the reads prior to assembly. This reduces the
computational cost of assembly without negatively affecting the quality of
the assembly.

.. shell start

.. ::

   set -x
   set -e
   source /home/ubuntu/work/bin/activate

.. note::

   You'll need ~15 GB of RAM for this, or more if you have a LOT of data.

Link in your data
-----------------

Make sure your data is in ``/mnt/work``::

   ls /mnt/work

Run digital normalization
-------------------------

.. ::

   echo 2-diginorm normalize1-pe `date` >> ${HOME}/times.out

Apply digital normalization to the paired-end reads::

   cd /mnt/work
   normalize-by-median.py -p -k 20 -C 20 -M 4e9 \
     --savegraph normC20k20.ct -u orphans.fq.gz \
     *.pe.qc.fq.gz

Note the ``-p`` in the normalize-by-median command -- when run on PE data,
it ensures that no paired ends are orphaned. The ``-u`` tells it that
the following filename contains unpaired reads.

Also note the ``-M`` parameter. This specifies how much memory diginorm
should use, and should be less than the total memory on the computer
you're using. (See the khmer documentation on choosing hash sizes for more
information.)

Trim off likely erroneous k-mers
--------------------------------

.. ::

   echo 2-diginorm filter-abund `date` >> ${HOME}/times.out

Now, run through all the reads and trim off low-abundance parts of
high-coverage reads::

   filter-abund.py -V -Z 18 normC20k20.ct *.keep && \
     rm *.keep normC20k20.ct

This will turn some reads into orphans when their partner read is
removed by the trimming.

Rename files
~~~~~~~~~~~~

You'll have a bunch of ``keep.abundfilt`` files -- let's make things prettier.

.. ::

   echo 2-diginorm extract `date` >> ${HOME}/times.out

First, let's break out the orphaned and still-paired reads::

   for file in *.pe.*.abundfilt
   do
      extract-paired-reads.py ${file} && \
         rm ${file}
   done

We can combine all of the orphaned reads into a single file::

   gzip -9c orphans.fq.gz.keep.abundfilt > orphans.keep.abundfilt.fq.gz && \
      rm orphans.fq.gz.keep.abundfilt
   for file in *.pe.*.abundfilt.se
   do
      gzip -9c ${file} >> orphans.keep.abundfilt.fq.gz && \
         rm ${file}
   done

We can also rename the remaining PE reads and compress those files::

   for file in *.abundfilt.pe
   do
      newfile=${file%%.fq.gz.keep.abundfilt.pe}.keep.abundfilt.fq
      mv ${file} ${newfile}
      gzip ${newfile}
   done

This leaves you with a bunch of files named ``*.keep.abundfilt.fq.gz``,
which contain the paired-end/interleaved reads that remain after both
digital normalization and error trimming, together with
``orphans.keep.abundfilt.fq.gz``.

Save all these files to a new volume, and get ready to assemble!

.. ::

   echo 2-diginorm DONE `date` >> ${HOME}/times.out

.. shell stop

Next: :doc:`3-big-assembly`.
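
Optionally, before moving on, you may want to see how much data digital
normalization and trimming actually removed. The sketch below is not part of
the protocol: ``count_fastq_records`` and ``report_read_counts`` are
hypothetical helper names, and the count assumes plain four-line FASTQ
records (no wrapped sequence or quality lines).

```shell
# Hypothetical helpers (not part of khmer): count FASTQ records in gzipped
# files, assuming each record occupies exactly four lines.
count_fastq_records() {
    # gzip -dc also reads files built by appending several gzip streams,
    # as was done for the combined orphans file.
    gzip -dc "$1" | awk 'END { print int(NR / 4) }'
}

# Print "<filename><TAB><record count>" for each existing file given.
report_read_counts() {
    for f in "$@"; do
        [ -e "$f" ] || continue   # skip names or patterns that matched nothing
        printf '%s\t%s\n' "$f" "$(count_fastq_records "$f")"
    done
}
```

For example, ``report_read_counts *.pe.qc.fq.gz *.keep.abundfilt.fq.gz``
lists the record counts of the input files next to those of the normalized,
trimmed files, so you can compare them by eye.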