================================
2. Running digital normalization
================================

.. |memuse| replace:: 4e9

.. shell start

.. ::

   set -x
   set -e
   source /home/ubuntu/work/bin/activate

.. note:: Make sure you're running in screen!

Start with the QC'ed files from :doc:`1-quality`, or copy them into a
working directory.

Run a First Round of Digital Normalization
------------------------------------------

Normalize everything to a coverage of 20, starting with the (more
valuable) PE reads; keep pairs using ``-p``, and include the orphans
with ``-u``::

   cd /mnt/work
   normalize-by-median.py -p -k 20 -C 20 -M 4e9 \
      --savegraph normC20k20.ct -u orphans.fq.gz *.pe.qc.fq.gz

This produces a set of '.keep' files, as well as a normC20k20.ct file
containing the k-mer counts that we will use in the next step.

Note the ``-M`` parameter. This specifies how much memory diginorm
should use, and it should be less than the memory size of the machine
you selected. (See the khmer documentation on choosing hash sizes for
more information.)
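If you're not sure how much memory the machine actually has to spare,
it's worth checking before settling on a value. Here's a minimal
sketch; ``free`` is a standard Linux utility, and none of this is
khmer-specific or part of the protocol itself::

   # Show total and available memory in human-readable units; leave
   # headroom for the OS and other processes when choosing -M.
   free -h

   # -M takes a value in bytes, and accepts scientific notation as
   # used in this tutorial: 4e9 bytes is roughly 4 GB, a comfortable
   # fit on an 8 GB machine.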
Error-trim Our Data
-------------------

Use 'filter-abund' to trim off any k-mers that are abundance-1 in
high-coverage reads. The ``-V`` option is used to ignore low-coverage
reads, which are prevalent in variable-abundance data sets::

   filter-abund.py -V -Z 18 normC20k20.ct *.keep && \
     rm *.keep normC20k20.ct

This produces .abundfilt files containing the trimmed sequences.

The process of error trimming could have orphaned reads, so split the
PE files into still-interleaved and non-interleaved reads::

   for file in *.pe.*.abundfilt
   do
      extract-paired-reads.py ${file} && \
        rm ${file}
   done

This leaves you with PE files (``*.pe``) and SE files (``*.se``).
Next, concatenate all of the ``*.se`` files, together with the
error-trimmed orphans, into a single compressed orphans file::

   gzip -9c orphans.fq.gz.keep.abundfilt *.se > orphans.keep.abundfilt.fq.gz && \
     rm orphans.fq.gz.keep.abundfilt *.se

Normalize Down to C=5
---------------------

Now that we've eliminated many more erroneous k-mers, let's ditch some
more high-coverage data. Normalize the paired-end reads and the
orphans down to C=5::

   normalize-by-median.py -C 5 -k 20 -M 4e8 \
      --savegraph normC5k20.ct -p *.abundfilt.pe \
      -u orphans.keep.abundfilt.fq.gz && \
      rm *.abundfilt.pe orphans.keep.abundfilt.fq.gz

Compress and Combine the Files
------------------------------

Now let's tidy things up. Here are the paired files (kak =
keep/abundfilt/keep)::

   for file in *.keep.abundfilt.pe.keep
   do
      newfile=${file/fq.gz.keep.abundfilt.pe.keep/kak.fq}
      mv ${file} ${newfile}
      gzip -9 ${newfile}
   done

and here are the orphaned reads::

   mv orphans.keep.abundfilt.fq.gz.keep orphans.qc.kak.fq && \
     gzip orphans.qc.kak.fq

----

If you are *not* doing partitioning (see :doc:`3-partition`), you may
want to remove the k-mer hash tables::

   rm *.ct

Read Stats
----------

Try running::

   readstats.py *.qc.fq.gz *.kak.fq.gz

After a long wait, you'll see::

   ---------------
   861769600 bp / 8617696 seqs; 100.0 average length -- SRR606249.pe.qc.fq.gz
   79586148 bp / 802158 seqs; 99.2 average length -- SRR606249.se.qc.fq.gz
   531691400 bp / 5316914 seqs; 100.0 average length -- SRR606249.pe.qc.fq.gz
   89903689 bp / 904157 seqs; 99.4 average length -- SRR606249.se.qc.fq.gz
   173748898 bp / 1830478 seqs; 94.9 average length -- SRR606249.pe.kak.qc.fq.gz
   8825611 bp / 92997 seqs; 94.9 average length -- SRR606249.se.kak.qc.fq.gz
   52345833 bp / 550900 seqs; 95.0 average length -- SRR606249.pe.kak.qc.fq.gz
   10280721 bp / 105478 seqs; 97.5 average length -- SRR606249.se.kak.qc.fq.gz
   ---------------

This shows you how many sequences were in the original QC files, and
how many are left in the 'kak' files. Not bad: considerably more than
80% of the reads were eliminated by the normalization and trimming.

----

Next: :doc:`3-partition`
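----

As an optional aside, you can compute the overall reduction from the
readstats output above instead of eyeballing it. This is a generic
sketch (plain awk, not part of khmer); it assumes you saved the
readstats output to a file, e.g. with
``readstats.py *.qc.fq.gz *.kak.fq.gz | tee readstats.out``::

   # Sum the read counts (field 4) separately for the 'kak' files and
   # the original QC files, then report the percentage removed.
   awk '/average length/ {
            if ($NF ~ /kak/) { kak += $4 } else { qc += $4 }
        }
        END { printf "QC reads: %d  kak reads: %d  removed: %.1f%%\n",
              qc, kak, 100 * (1 - kak / qc) }' readstats.out

With the numbers shown above, this works out to about 83.5% of the
reads removed.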