================================ 2. Running digital normalization ================================ .. shell start .. note:: Make sure you're running in screen! Start with the QC'ed files from :doc:`1-quality` or copy them into a working directory; Run a First Round of Digital Normalization ------------------------------------------ Normalize everything to a coverage of 20, starting with the (more valuable) PE reads; keep pairs using '-p' :: cd /mnt/work normalize-by-median.py -p -k 20 -C 20 -N 4 -x 1e9 --savetable normC20k20.kh *.pe.qc.fq.gz and continuing into the (less valuable but maybe still useful) SE reads :: normalize-by-median.py -C 20 --loadtable normC20k20.kh --savetable normC20k20.kh *.se.qc.fq.gz This produces a set of '.keep' files, as well as a normC20k20.kh database file. Error-trim Our Data -------------------- Use 'filter-abund' to trim off any k-mers that are abundance-1 in high-coverage reads. The -V option is used to make this work better for variable coverage data sets: :: filter-abund.py -V normC20k20.kh *.keep This produces .abundfilt files containing the trimmed sequences. The process of error trimming could have orphaned reads, so split the PE file into still-interleaved and non-interleaved reads :: for i in *.pe.qc.fq.gz.keep.abundfilt do extract-paired-reads.py $i done This leaves you with PE files (.pe.qc.fq.gz.keep.abundfilt.pe) and two sets of SE files (.se.qc.fq.gz.keep.abundfilt and .pe.qc.fq.gz.keep.abundfilt.se). (Yes, the naming scheme does make sense. Trust me.) You can now remove the ``normC20k20.kh`` file, too :: rm normC20k20.kh Normalize Down to C=5 --------------------- Now that we've eliminated many more erroneous k-mers, let's ditch some more high-coverage data. First, normalize the paired-end reads :: normalize-by-median.py -C 5 -k 20 -N 4 -x 1e9 --savetable normC5k20.kh -p *.pe.qc.fq.gz.keep.abundfilt.pe and then do the remaining single-ended reads :: normalize-by-median.py -C 5 --savetable normC5k20.kh --loadtable normC5k20.kh *.pe.qc.fq.gz.keep.abundfilt.se *.se.qc.fq.gz.keep.abundfilt Compress and Combine the Files ------------------------------ Now let's tidy things up. Here are the paired files (kak = keep/abundfilt/keep) :: for i in *.abundfilt.pe do newfile=$(basename $i .pe.qc.fq.gz.keep.abundfilt.pe) echo newfile is $newfile gzip -c $i > $newfile.pe.kak.qc.fq.gz done and for the single-ended files :: for i in *.se.qc.fq.gz.keep.abundfilt do pe_orphans=$(basename $i .se.qc.fq.gz.keep.abundfilt).pe.qc.fq.gz.keep.abundfilt.se newfile=$(basename $i .se.qc.fq.gz.keep.abundfilt).se.kak.qc.fq.gz cat $i $pe_orphans | gzip -c > $newfile done You can now remove all of these various files:: *.pe.qc.fq.gz.keep *.pe.qc.fq.gz.keep.abundfilt *.pe.qc.fq.gz.keep.abundfilt.pe *.pe.qc.fq.gz.keep.abundfilt.pe.keep *.pe.qc.fq.gz.keep.abundfilt.se *.pe.qc.fq.gz.keep.abundfilt.se.keep by typing :: rm *.keep *.abundfilt *.pe *.se If you are *not* doing partitioning (see :doc:`3-partition`), you may also want to remove the k-mer hash tables:: rm *.kh If you *are* running partitioning, you can remove the ``normC20k20.kh`` file:: rm normC20k20.kh but you will need the ``normC5k20.kh`` file. Read Stats ---------- Try running :: /usr/local/share/khmer/sandbox/readstats.py *.kak.qc.fq.gz *.?e.qc.fq.gz after a long wait, you'll see:: --------------- 861769600 bp / 8617696 seqs; 100.0 average length -- SRR606249.pe.qc.fq.gz 79586148 bp / 802158 seqs; 99.2 average length -- SRR606249.se.qc.fq.gz 531691400 bp / 5316914 seqs; 100.0 average length -- SRR606249.pe.qc.fq.gz 89903689 bp / 904157 seqs; 99.4 average length -- SRR606249.se.qc.fq.gz 173748898 bp / 1830478 seqs; 94.9 average length -- SRR606249.pe.kak.qc.fq.gz 8825611 bp / 92997 seqs; 94.9 average length -- SRR606249.se.kak.qc.fq.gz 52345833 bp / 550900 seqs; 95.0 average length -- SRR606249.pe.kak.qc.fq.gz 10280721 bp / 105478 seqs; 97.5 average length -- SRR606249.se.kak.qc.fq.gz --------------- This shows you how many sequences were in the original QC files, and how many are left in the 'kak' files. Not bad -- considerably more than 80% of the reads were eliminated in the kak! ---- Next: :doc:`3-partition`