1. Quality Trimming and Filtering Your Sequences

Boot up an m1.xlarge machine from Amazon Web Services running Ubuntu 12.04 LTS (ami-59a4a230); this has about 15 GB of RAM and 2 CPUs, which will be enough to complete the assembly of the example data set.

On the new machine, run the following commands to update the base software and reboot the machine:

apt-get update
apt-get -y install screen git curl gcc make g++ python-dev unzip default-jre \
          pkg-config libncurses5-dev r-base-core r-cran-gplots python-matplotlib \
          sysstat && shutdown -r now


Some of these commands may take a very long time. Please see Using ‘screen’.

Install software

Install khmer:

cd /usr/local/share
git clone https://github.com/ged-lab/khmer.git
cd khmer
git checkout v1.1
make install

Install Trimmomatic:

cd /root
curl -O http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.30.zip
unzip Trimmomatic-0.30.zip
cd Trimmomatic-0.30/
cp trimmomatic-0.30.jar /usr/local/bin
cp -r adapters /usr/local/share/adapters

Install libgtextutils and fastx:

cd /root
curl -O http://hannonlab.cshl.edu/fastx_toolkit/libgtextutils-0.6.1.tar.bz2
tar xjf libgtextutils-0.6.1.tar.bz2
cd libgtextutils-0.6.1/
./configure && make && make install
cd /root
curl -O http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit-
tar xjf fastx_toolkit-
cd fastx_toolkit-
./configure && make && make install

In each of these cases, we’re downloading the software – you can use Google to find out what each package is and does if we don’t discuss it below. We then unpack it, sometimes compile it (which we can discuss later), and install it for general use.

Trim Your Data

cd /mnt/work
python /usr/local/share/khmer/sandbox/write-trimmomatic.py > trim.sh
more trim.sh

If it looks like it contains the right commands, you can run it by doing

bash trim.sh


This is a prime example of scripting to make your life easier and less error-prone. Take a look at the script sometime – ‘more /usr/local/share/khmer/sandbox/write-trimmomatic.py’ – to get some idea of how it works.

Quality Trim Each Pair of Files

After you run this, you should have a bunch of ‘.pe.fq.gz’ files and a bunch of ‘.se.fq.gz’ files. The former are files that contain paired, interleaved sequences; the latter contain single-ended, non-interleaved sequences.

Next, for each of these files, run:

gunzip -c <filename> | fastq_quality_filter -Q33 -q 30 -p 50 | gzip -9c > <filename>.qc.fq.gz

This uncompresses each file, removes poor-quality sequences, and then recompresses it. Note that (following Short-read quality evaluation) you can also trim reads to a specific length by adding ‘fastx_trimmer -Q33 -l 70 |‘ to the pipeline.
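Putting that together, a length-trimming variant of the command would look like the following. This is just the command above with the extra stage spliced in; <filename> is a placeholder as before, and the length of 70 is only an example — pick one suited to your read set:

```
gunzip -c <filename> | fastx_trimmer -Q33 -l 70 | fastq_quality_filter -Q33 -q 30 -p 50 | gzip -9c > <filename>.qc.fq.gz
```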

If fastq_quality_filter complains about invalid quality scores, try removing the -Q33 in the command; Illumina has blessed us with multiple quality score encodings.
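If you would rather check before running anything, here is one rough heuristic (this check and the file example.fq are illustrative, not part of the protocol): quality characters with ASCII codes below 59 (‘;’) can only occur in Phred+33 data, so finding any such character tells you -Q33 is right.

```shell
# Synthetic Phred+33 record; '!' is ASCII 33, impossible in Phred+64 data.
printf '@read1\nACGT\n+\n!!II\n' > example.fq

# Find the smallest ASCII code used on any quality line (every 4th line).
minq=$(awk 'NR % 4 == 0 { printf "%s", $0 }' example.fq \
       | od -An -tu1 | tr -s ' ' '\n' | sort -n | grep -v '^$' | head -1)

if [ "$minq" -lt 59 ]; then
    echo "looks like Phred+33 (keep -Q33)"
else
    echo "may be Phred+64 (try dropping -Q33)"
fi
```

The two encodings overlap in the middle of the ASCII range, so a file with no low characters is ambiguous; this only gives a definite answer in the Phred+33 direction.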

Automating This Step

This step can be automated with a ‘for’ loop at the shell prompt. Try:

for i in *.pe.fq.gz *.se.fq.gz
do
    echo working with $i
    newfile="$(basename $i .fq.gz)"
    gunzip -c $i | fastq_quality_filter -Q33 -q 30 -p 50 | gzip -9c > "${newfile}.qc.fq.gz"
done

What this loop does is:

  • for every file ending in .pe.fq.gz or .se.fq.gz,
  • print a message with the filename,
  • construct a name ‘newfile’ that omits the trailing .fq.gz,
  • uncompress the original file, pass it through fastq_quality_filter, recompress it, and save it as ‘newfile’.qc.fq.gz.
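As an aside, the basename call in that loop can also be written with plain shell parameter expansion, which avoids a subshell. This sketch shows only the name manipulation, using a hypothetical filename:

```shell
i="SRR606249.pe.fq.gz"         # hypothetical input filename
newfile="${i%.fq.gz}"          # strip the trailing .fq.gz, like basename does
echo "${newfile}.qc.fq.gz"     # the output name the loop would construct
```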

Extracting Paired Ends From The Interleaved Files

The fastx utilities that we’re using to do quality trimming aren’t paired-end aware; they’re removing individual sequences. Because the pe files are interleaved, this means that there may now be some orphaned sequences in there. Downstream, we will want to pay special attention to the remaining paired sequences, so we want to separate out the pe and se files. How do we go about that? Another script, of course!

The khmer script ‘extract-paired-reads.py’ does exactly that. You run it on an interleaved file that may have some orphans, and it produces .pe and .se files afterwards, containing pairs and orphans respectively.
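This is not the khmer script itself, but the underlying idea can be sketched in a few lines of shell and awk: in an interleaved file, consecutive records whose names share a base and end in /1 and /2 form a pair; anything else is an orphan. The file toy.fq and the /1–/2 naming convention here are illustrative (real data may use other conventions, which the khmer script handles), and for brevity each output line is one tab-joined record rather than reconstructed FASTQ:

```shell
# Three records: a/1 and a/2 are a pair, b/1 is an orphan.
cat > toy.fq <<'EOF'
@a/1
ACGT
+
IIII
@a/2
TGCA
+
IIII
@b/1
GGGG
+
IIII
EOF

# paste joins each 4-line record onto one tab-separated line;
# awk then compares adjacent record names with the /1 or /2 stripped.
paste - - - - < toy.fq | awk -F'\t' '
    function base(n) { sub(/\/[12]$/, "", n); return n }
    { rec[NR] = $0; name[NR] = $1 }
    END {
        for (i = 1; i <= NR; i++) {
            if (i < NR && base(name[i]) == base(name[i+1])) {
                print rec[i] > "toy.pe"; print rec[i+1] > "toy.pe"; i++
            } else {
                print rec[i] > "toy.se"
            }
        }
    }'
```

Afterwards toy.pe holds the two paired records and toy.se the single orphan.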

To run it on all of the pe qc files, do:

for i in *.pe.qc.fq.gz
do
    extract-paired-reads.py $i
done

Renaming Files

I’m a fan of keeping the files named somewhat sensibly, and keeping them compressed. Let’s do some mass renaming:

for i in *.pe.qc.fq.gz.pe
do
    echo working on PE file $i
    newfile="$(basename $i .pe.qc.fq.gz.pe).pe.qc.fq"
    rm $(basename $i .pe)    # remove the unseparated original
    mv $i $newfile
    gzip $newfile
done

and also some mass combining:

for i in *.pe.qc.fq.gz.se
do
    echo working on SE file $i
    otherfile="$(basename $i .pe.qc.fq.gz.se).se.qc.fq.gz"
    gunzip -c $otherfile > combine    # the existing orphans
    cat $i >> combine                 # plus the newly extracted ones
    rm -f $otherfile
    gzip -c combine > $otherfile
    rm $i combine
done
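Before locking the files down, a quick sanity check can catch truncated output: a valid FASTQ file always contains a multiple of four lines. A sketch on a synthetic file (demo.fq.gz is a stand-in for your own .qc.fq.gz files):

```shell
# One-record stand-in file; substitute your real .qc.fq.gz files.
printf '@r1\nACGT\n+\nIIII\n' | gzip -c > demo.fq.gz

lines=$(gunzip -c demo.fq.gz | wc -l)
if [ $((lines % 4)) -eq 0 ]; then
    echo "demo.fq.gz looks intact: $((lines / 4)) record(s)"
else
    echo "demo.fq.gz is truncated!"
fi
```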

Then make it hard to accidentally delete the files you just created:

chmod u-w *.qc.fq.gz
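Note that ‘chmod u-w’ only clears the owner’s write bit: a plain ‘rm’ will now prompt before deleting, but ‘rm -f’ can still remove the file, because deletion is governed by write permission on the directory, not the file. A quick demonstration on a throwaway file (demo.qc.fq.gz here is hypothetical, not one of your data files):

```shell
touch demo.qc.fq.gz
chmod u-w demo.qc.fq.gz
perms=$(ls -l demo.qc.fq.gz | cut -c1-4)   # "-r--": owner write bit is gone
echo "$perms"
rm -f demo.qc.fq.gz                        # -f bypasses the prompt and deletes it anyway
```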

Done! Now you have two files: SRR606249-extract.pe.qc.fq.gz and SRR606249-extract.se.qc.fq.gz.

The ‘.pe’ file contains interleaved paired-end reads.

The ‘.se’ file contains single-ended reads, whose mates were orphaned when we discarded poor-quality sequences.

Both files are in FASTQ format.

Next: 2. Running digital normalization

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github.