2. Applying Digital Normalization

Note

You can start this tutorial with the contents of EC2/EBS snapshot snap-126cc847.

Note

You’ll need ~15 GB of RAM for this, or more if you have a LOT of data.

Run digital normalization

Apply digital normalization to the paired-end reads:

/usr/local/share/khmer/scripts/normalize-by-median.py -p -k 20 -C 20 -N 4 -x 3e9 --savehash normC20k20.kh *.pe.qc.fq.gz

and then to the single-end reads:

/usr/local/share/khmer/scripts/normalize-by-median.py -C 20 --loadhash normC20k20.kh --savehash normC20k20.kh *.se.qc.fq.gz

Note the ‘-p’ in the first normalize-by-median command – when run on PE data, that ensures that no paired ends are orphaned. However, it will complain on single-ended data, so you have to give the data to it separately.

Also note the ‘-N’ and ‘-x’ parameters. These specify how much memory diginorm should use. The product of these should be less than the memory size of the machine you selected. The maximum needed for any transcriptome should be in the ~60 GB range, e.g. -N 4 -x 15e9; for only a few hundred million reads, 16 GB should be plenty. (See choosing hash sizes for khmer for more information.)

Trim off likely erroneous k-mers

Now, run through all the reads and trim off low-abundance parts of high-coverage reads:

/usr/local/share/khmer/scripts/filter-abund.py -V normC20k20.kh *.keep

This will turn some reads into orphans, but that’s ok – their partner read was bad.

Rename files

You’ll have a bunch of ‘keep.abundfilt’ files – let’s make things prettier.

First, let’s break out the orphaned and still-paired reads:

for i in *.pe.*.abundfilt;
do
   /usr/local/share/khmer/scripts/extract-paired-reads.py $i
done

We can combine the orphaned reads into a single file:

for i in *.se.qc.fq.gz.keep.abundfilt
do
   pe_orphans=$(basename $i .se.qc.fq.gz.keep.abundfilt).pe.qc.fq.gz.keep.abundfilt.se
   newfile=$(basename $i .se.qc.fq.gz.keep.abundfilt).se.qc.keep.abundfilt.fq.gz
   cat $i $pe_orphans | gzip -c > $newfile
done

We can also rename the remaining PE reads & compress those files:

for i in *.abundfilt.pe
do
   newfile=$(basename $i .fq.gz.keep.abundfilt.pe).keep.abundfilt.fq
   mv $i $newfile
   gzip $newfile
done

This leaves you with a whole passel o’ files, most of which you want to go away!

6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz
6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz.keep
6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz.keep.abundfilt
6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz.keep.abundfilt.se
6Hour_CGATGT_L002_R1_005.se.qc.fq.gz
6Hour_CGATGT_L002_R1_005.se.qc.fq.gz.keep
6Hour_CGATGT_L002_R1_005.se.qc.fq.gz.keep.abundfilt

So, finally, let’s get rid of a lot of the old files

rm *.se.qc.fq.gz.keep.abundfilt
rm *.pe.qc.fq.gz.keep.abundfilt.se
rm *.keep
rm *.abundfilt
rm *.qc.fq.gz

Gut check

You should now have:

6Hour_CGATGT_L002_R1_005.pe.qc.keep.abundfilt.fq.gz
6Hour_CGATGT_L002_R1_005.se.qc.keep.abundfilt.fq.gz

These files are, respectively, the paired (pe) quality-filtered (qc) digitally normalized (keep) abundance-trimmed (abundfilt) FASTQ (fq) gzipped (gz) sequences, and the orphaned (se) quality-filtered (qc) digitally normalized (keep) abundance-trimmed (abundfilt) FASTQ (fq) gzipped (gz) sequences.

Save all these files to a new volume, and get ready to assemble!

Next: 3. Running the Actual Assembly.


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github.
comments powered by Disqus



Edit this document!

This file can be edited directly through the Web. Anyone can update and fix errors in this document with few clicks -- no downloads needed.

  1. Go to 2. Applying Digital Normalization on GitHub.
  2. Edit files using GitHub's text editor in your web browser (see the 'Edit' tab on the top right of the file)
  3. Fill in the Commit message text box at the bottom of the page describing why you made the changes. Press the Propose file change button next to it when done.
  4. Then click Send a pull request.
  5. Your changes are now queued for review under the project's Pull requests tab on GitHub!

For an introduction to the documentation format please see the reST primer.