2. Applying Digital Normalization

In this section, we’ll apply digital normalization and variable-coverage k-mer abundance trimming to the reads prior to assembly. This reduces the computational cost of assembly, generally without degrading the quality of the assembly.

Note

You’ll need ~15 GB of RAM for this, or more if you have a LOT of data.

Run digital normalization

Apply digital normalization to the paired-end reads:

cd /mnt/work
normalize-by-median.py -p -k 20 -C 20 -M 4e9 \
  --savegraph normC20k20.ct -u orphans.fq.gz \
  *.pe.qc.fq.gz

Note the -p in the normalize-by-median.py command: when run on paired-end (PE) data, it ensures that no paired ends are orphaned. The -u option tells it that the file name that follows contains unpaired reads.

Also note the -M parameter. This specifies how much memory diginorm should use, and should be less than the total memory on the computer you’re using. (See choosing hash sizes for khmer for more information.)
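One way to choose a value for -M is to read the machine's available memory and leave some headroom. The sketch below assumes a Linux system that exposes /proc/meminfo with a MemAvailable field. The function name suggest_mem_bytes and the 75% default are illustrative choices, not khmer recommendations.

```shell
# Hypothetical helper: print a byte budget equal to a percentage of MemAvailable.
# Arguments: path to a meminfo-style file, percentage to use (default 75).
suggest_mem_bytes() {
    meminfo=${1:-/proc/meminfo}
    percent=${2:-75}
    # MemAvailable is reported in kB; convert to bytes, then scale.
    awk -v pct="$percent" \
        '/^MemAvailable:/ { printf "%.0f\n", $2 * 1024 * pct / 100 }' "$meminfo"
}

# Example: show a suggested budget on this machine (Linux only).
if [ -r /proc/meminfo ]; then
    echo "Suggested -M value: $(suggest_mem_bytes /proc/meminfo 75)"
fi
```

The printed number is a byte count and could be passed to -M in place of 4e9, provided it still leaves room for the operating system and other running jobs.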

Trim off likely erroneous k-mers

Now, run through all the reads and trim off low-abundance parts of high-coverage reads:

filter-abund.py -V -Z 18 normC20k20.ct *.keep && \
   rm *.keep normC20k20.ct

This will turn some reads into orphans when their partner read is removed by the trimming.

Rename files

You’ll have a bunch of keep.abundfilt files – let’s make things prettier.

First, let’s break out the orphaned and still-paired reads

for file in *.pe.*.abundfilt
do
   extract-paired-reads.py ${file} && \
         rm ${file}
done

We can combine all of the orphaned reads into a single file

gzip -9c orphans.fq.gz.keep.abundfilt > orphans.keep.abundfilt.fq.gz && \
    rm orphans.fq.gz.keep.abundfilt
for file in *.pe.*.abundfilt.se
do
   gzip -9c ${file} >> orphans.keep.abundfilt.fq.gz && \
        rm ${file}
done
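The loop above appends to an existing .gz file with >>. This works because the gzip format allows several compressed members to be concatenated, and decompressing the result yields the original data in order. A small self-contained check, using made-up records in a temporary directory, might look like this:

```shell
# Demonstrate that appending gzip streams yields one valid gzip file.
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$demo_dir/a.fq"
printf '@r2\nTTGA\n+\nIIII\n' > "$demo_dir/b.fq"

gzip -9c "$demo_dir/a.fq" >  "$demo_dir/combined.fq.gz"   # create the file
gzip -9c "$demo_dir/b.fq" >> "$demo_dir/combined.fq.gz"   # append a second member

# Decompress and count lines: both four-line records should be present.
combined_lines=$(gzip -dc "$demo_dir/combined.fq.gz" | wc -l | tr -d ' ')
echo "lines after decompression: $combined_lines"

rm -r "$demo_dir"
```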

We can also rename the remaining PE reads & compress those files

for file in *.abundfilt.pe
do
   newfile=${file%%.fq.gz.keep.abundfilt.pe}.keep.abundfilt.fq
   mv ${file} ${newfile}
   gzip ${newfile}
done
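The rename relies on the shell's ${var%%pattern} expansion, which strips a matching suffix from the value. To preview the resulting name without moving anything, a sketch like the following can help. The sample file name is hypothetical.

```shell
# Preview the rename for one hypothetical file name; nothing is moved.
file=sample1.pe.qc.fq.gz.keep.abundfilt.pe
newfile=${file%%.fq.gz.keep.abundfilt.pe}.keep.abundfilt.fq
# gzip appends .gz to the renamed file, so show the final name.
echo "${file} -> ${newfile}.gz"
```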

This leaves you with a set of files named *.keep.abundfilt.fq.gz, which hold the paired-end (interleaved) reads that remain after both digital normalization and error trimming, together with orphans.keep.abundfilt.fq.gz, which holds the orphaned reads.
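Before moving on, it can be worth checking how many reads survived. The sketch below assumes standard four-line FASTQ records in the compressed output files. The helper name count_fastq_records is made up for this example and is not part of khmer.

```shell
# Hypothetical helper: count FASTQ records in a gzipped file,
# assuming the usual four lines per record.
count_fastq_records() {
    lines=$(gzip -dc "$1" | wc -l | tr -d ' ')
    echo $((lines / 4))
}

# Report counts for any matching files in the current directory.
for f in *.keep.abundfilt.fq.gz
do
    [ -e "$f" ] || continue   # skip when the glob matches nothing
    echo "$f: $(count_fastq_records "$f") reads"
done
```

For interleaved paired-end files, an odd count would suggest that a pair was broken somewhere along the way.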

Save all these files to a new volume, and get ready to assemble!

Next: 3. Running the Actual Assembly.


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github.