2. Applying Digital Normalization¶
In this section, we’ll apply digital normalization and variable-coverage k-mer abundance trimming to the reads prior to assembly. This has the effect of reducing the computational cost of assembly without negatively affecting the quality of the assembly.
Note
You’ll need ~15 GB of RAM for this, or more if you have a LOT of data.
Run digital normalization¶
Apply digital normalization to the paired-end reads
cd /mnt/work
normalize-by-median.py -p -k 20 -C 20 -M 4e9 \
--savegraph normC20k20.ct -u orphans.fq.gz \
*.pe.qc.fq.gz
Note the -p
in the normalize-by-median command – when run on
PE data, that ensures that no paired ends are orphaned. The -u
tells
it that the following filename is unpaired.
Also note the -M
parameter. This specifies how much memory diginorm
should use, and should be less than the total memory on the computer
you’re using. (See choosing hash
sizes for khmer
for more information.)
Trim off likely erroneous k-mers¶
Now, run through all the reads and trim off low-abundance parts of high-coverage reads
filter-abund.py -V -Z 18 normC20k20.ct *.keep && \
rm *.keep normC20k20.ct
This will turn some reads into orphans when their partner read is removed by the trimming.
Rename files¶
You’ll have a bunch of keep.abundfilt
files – let’s make things prettier.
First, let’s break out the orphaned and still-paired reads
for file in *.pe.*.abundfilt
do
extract-paired-reads.py ${file} && \
rm ${file}
done
We can combine all of the orphaned reads into a single file
gzip -9c orphans.fq.gz.keep.abundfilt > orphans.keep.abundfilt.fq.gz && \
rm orphans.fq.gz.keep.abundfilt
for file in *.pe.*.abundfilt.se
do
gzip -9c ${file} >> orphans.keep.abundfilt.fq.gz && \
rm ${file}
done
We can also rename the remaining PE reads & compress those files
for file in *.abundfilt.pe
do
newfile=${file%%.fq.gz.keep.abundfilt.pe}.keep.abundfilt.fq
mv ${file} ${newfile}
gzip ${newfile}
done
This leaves you with a bunch of files named *.keep.abundfilt.fq
,
which represent the paired-end/interleaved reads that remain after
both digital normalization and error trimming, together with
orphans.keep.fq.gz
Save all these files to a new volume, and get ready to assemble!
LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github.
comments powered by Disqus