2. Running digital normalization


Make sure you’re running in screen!

Start with the QC’ed files from 1. Quality Trimming and Filtering Your Sequences or copy them into a working directory;

Run a First Round of Digital Normalization

Normalize everything to a coverage of 20, starting with the (more valuable) PE reads; keep pairs using ‘-p’

cd /mnt/work
normalize-by-median.py -p -k 20 -C 20 -N 4 -x 1e9 --savetable normC20k20.kh  *.pe.qc.fq.gz

and continuing into the (less valuable but maybe still useful) SE reads

normalize-by-median.py -C 20 --loadtable normC20k20.kh --savetable normC20k20.kh *.se.qc.fq.gz

This produces a set of ‘.keep’ files, as well as a normC20k20.kh database file.

Error-trim Our Data

Use ‘filter-abund’ to trim off any k-mers that are abundance-1 in high-coverage reads. The -V option is used to make this work better for variable coverage data sets:

filter-abund.py -V normC20k20.kh *.keep

This produces .abundfilt files containing the trimmed sequences.

The process of error trimming could have orphaned reads, so split the PE file into still-interleaved and non-interleaved reads

for i in *.pe.*.keep*
   extract-paired-reads.py $i

This leaves you with PE files (.pe.qc.fq.gz.keep.abundfilt.pe) and two sets of SE files (.se.qc.fq.gz.keep.abundfilt and .pe.qc.fq.gz.keep.abundfilt.se). (Yes, the naming scheme does make sense. Trust me.)

Normalize Down to C=5

Now that we’ve eliminated many more erroneous k-mers, let’s ditch some more high-coverage data. First, normalize the paired-end reads

normalize-by-median.py -C 5 -k 20 -N 4 -x 1e9 --savetable normC5k20.kh -p *.pe.qc.fq.gz.keep.abundfilt.pe

and then do the remaining single-ended reads

normalize-by-median.py -C 5 --savetable normC5k20.kh --loadtable normC5k20.kh *.pe.qc.fq.gz.keep.abundfilt.se *.se.qc.fq.gz.keep.abundfilt

Compress and Combine the Files

Now let’s tidy things up. Here are the paired files (kak = keep/abundfilt/keep)

for pe in *.pe.qc.fq.gz.keep.abundfilt.pe.keep
    cat $pe $se |gzip -c > $newfile

and for the single-ended files

for se in *.se.qc.fq.gz.keep.abundfilt.keep
    gzip -c  $se >$newfile

You can now remove all of these various files:


by typing

rm *.keep *.abundfilt *.pe *.se

If you are not doing partitioning (see 3. Partitioning), you may also want to remove the k-mer hash tables:

rm *.kh

If you are running partitioning, you can remove the normC20k20.kh file:

rm normC20k20.kh

but you will need the normC5k20.kh file.

Read Stats

Try running

/usr/local/share/khmer/sandbox/readstats.py *.kak.qc.fq.gz *.?e.qc.fq.gz

after a long wait, you’ll see:

861769600 bp / 8617696 seqs; 100.0 average length -- SRR606249.pe.qc.fq.gz
79586148 bp / 802158 seqs; 99.2 average length -- SRR606249.se.qc.fq.gz
531691400 bp / 5316914 seqs; 100.0 average length -- SRR606249.pe.qc.fq.gz
89903689 bp / 904157 seqs; 99.4 average length -- SRR606249.se.qc.fq.gz

173748898 bp / 1830478 seqs; 94.9 average length -- SRR606249.pe.kak.qc.fq.gz
8825611 bp / 92997 seqs; 94.9 average length -- SRR606249.se.kak.qc.fq.gz
52345833 bp / 550900 seqs; 95.0 average length -- SRR606249.pe.kak.qc.fq.gz
10280721 bp / 105478 seqs; 97.5 average length -- SRR606249.se.kak.qc.fq.gz


This shows you how many sequences were in the original QC files, and how many are left in the ‘kak’ files. Not bad – considerably more than 80% of the reads were eliminated in the kak!

Next: 3. Partitioning

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github.
comments powered by Disqus