Boot up an m1.xlarge machine from Amazon Web Services; this has about 15 GB of RAM and 4 CPUs, which is enough to complete the assembly of the example data set.
Note: This follows the NGS 2013 tutorial, Short-read quality evaluation, but for multiple files.
Note: The end results of this tutorial are available as public snapshot XXX on EC2/EBS.
Also see: Using ‘screen’.
Install screed:
pip install git+https://github.com/ged-lab/screed.git
Install the bleeding-edge version of khmer:
cd /usr/local/share
git clone https://github.com/ged-lab/khmer.git -b bleeding-edge
cd khmer
make
echo 'export PYTHONPATH=/usr/local/share/khmer/python' >> ~/.bashrc
source ~/.bashrc
Install Trimmomatic:
cd /root
curl -O http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.27.zip
unzip Trimmomatic-0.27.zip
cp Trimmomatic-0.27/trimmomatic-0.27.jar /usr/local/bin
Install libgtextutils and fastx:
cd /root
curl -O http://hannonlab.cshl.edu/fastx_toolkit/libgtextutils-0.6.1.tar.bz2
tar xjf libgtextutils-0.6.1.tar.bz2
cd libgtextutils-0.6.1/
./configure && make && make install
cd /root
curl -O http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit-0.0.13.2.tar.bz2
tar xjf fastx_toolkit-0.0.13.2.tar.bz2
cd fastx_toolkit-0.0.13.2/
./configure && make && make install
In each of these cases, we download the software, unpack it, sometimes compile it (which we can discuss later), and then install it for general use. If we don't discuss a package below, a quick web search will tell you what it is and does.
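Before moving on, it can help to confirm that the installed tools are actually reachable. The following is a minimal sketch and not part of the original protocol: `missing_tools` is a hypothetical helper name, and the list of commands it checks is an assumption based on the installs above.

```shell
# Hypothetical helper: print every command name given as an argument
# that cannot be found on the current PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Commands the later steps rely on; adjust the list to your setup.
missing=$(missing_tools java fastq_quality_filter gzip)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All tools found."
fi

# Trimmomatic is a jar rather than a command, so check the file itself.
[ -f /usr/local/bin/trimmomatic-0.27.jar ] || echo "Trimmomatic jar not found"
```

If anything is reported as missing, rerun the corresponding install step before starting the trimming.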
Grab some Illumina adapters:
curl -O https://s3.amazonaws.com/public.ged.msu.edu/illuminaClipping.fa
Trim the first data set (~20 minutes):
mkdir trim
cd trim
java -jar /usr/local/bin/trimmomatic-0.27.jar PE ../SRR492065_?.fastq.gz s1_pe s1_se s2_pe s2_se ILLUMINACLIP:../illuminaClipping.fa:2:30:10
/usr/local/share/khmer/scripts/interleave-reads.py s?_pe > combined.fq
fastq_quality_filter -Q33 -q 30 -p 50 -i combined.fq > combined-trim.fq
fastq_quality_filter -Q33 -q 30 -p 50 -i s1_se > s1_se.trim
/usr/local/share/khmer/scripts/extract-paired-reads.py combined-trim.fq
gzip -9c combined-trim.fq.pe > ../SRR492065.pe.qc.fq.gz
gzip -9c combined-trim.fq.se s1_se.trim > ../SRR492065.se.qc.fq.gz
cd ../
rm -fr trim
Trim the second data set (~20 minutes):
mkdir trim
cd trim
java -jar /usr/local/bin/trimmomatic-0.27.jar PE ../SRR492066_?.fastq.gz s1_pe s1_se s2_pe s2_se ILLUMINACLIP:../illuminaClipping.fa:2:30:10
/usr/local/share/khmer/scripts/interleave-reads.py s?_pe > combined.fq
fastq_quality_filter -Q33 -q 30 -p 50 -i combined.fq > combined-trim.fq
fastq_quality_filter -Q33 -q 30 -p 50 -i s1_se > s1_se.trim
/usr/local/share/khmer/scripts/extract-paired-reads.py combined-trim.fq
gzip -9c combined-trim.fq.pe > ../SRR492066.pe.qc.fq.gz
gzip -9c combined-trim.fq.se s1_se.trim > ../SRR492066.se.qc.fq.gz
cd ../
rm -fr trim
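The two trimming runs differ only in the sample name, so they could also be wrapped in a function and looped over. This is a sketch under stated assumptions, not the protocol's own method: `qc_sample` is a hypothetical helper name, the input files are assumed to be named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`, and the single-end output is built from the quality-filtered orphans (`s1_se.trim`).

```shell
# Hypothetical helper: run the trimming steps above for one sample.
# The body runs in a subshell, so the directory change and 'set -e'
# do not leak into the calling shell.
qc_sample() (
    set -e                      # stop at the first failing step
    sample=$1
    scripts=/usr/local/share/khmer/scripts

    mkdir trim
    cd trim

    # Adapter trimming; outputs are paired/orphaned reads for each end.
    java -jar /usr/local/bin/trimmomatic-0.27.jar PE ../"$sample"_?.fastq.gz \
        s1_pe s1_se s2_pe s2_se ILLUMINACLIP:../illuminaClipping.fa:2:30:10

    # Interleave the surviving pairs, then quality-filter pairs and orphans.
    "$scripts"/interleave-reads.py s?_pe > combined.fq
    fastq_quality_filter -Q33 -q 30 -p 50 -i combined.fq > combined-trim.fq
    fastq_quality_filter -Q33 -q 30 -p 50 -i s1_se > s1_se.trim

    # Split the filtered reads into still-paired (.pe) and orphaned (.se).
    "$scripts"/extract-paired-reads.py combined-trim.fq

    gzip -9c combined-trim.fq.pe > ../"$sample".pe.qc.fq.gz
    gzip -9c combined-trim.fq.se s1_se.trim > ../"$sample".se.qc.fq.gz

    cd ..
    rm -fr trim
)

# Process each sample, skipping any whose input files are absent.
for sample in SRR492065 SRR492066; do
    if [ -e "${sample}_1.fastq.gz" ] && [ -e "${sample}_2.fastq.gz" ]; then
        qc_sample "$sample"
    else
        echo "Skipping $sample: input files not found"
    fi
done
```

Running the steps in a subshell means a failure partway through leaves you in the original directory, with the `trim` directory kept for inspection.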
Done! Now you have four files: SRR492065.pe.qc.fq.gz, SRR492065.se.qc.fq.gz, SRR492066.pe.qc.fq.gz, and SRR492066.se.qc.fq.gz.
The ‘.pe’ files are interleaved paired-end; you can take a look at them like so:
gunzip -c SRR492065.pe.qc.fq.gz | head
The other two are single-ended files, containing reads that were orphaned because their mates were discarded during trimming and quality filtering.
All four files are in FASTQ format.
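As a quick sanity check, you can count the reads in each output file. This is a minimal sketch: `count_fastq_reads` is a hypothetical helper name, and it assumes every FASTQ record occupies exactly four lines (no wrapped sequences).

```shell
# Hypothetical helper: count records in a gzipped FASTQ file,
# assuming exactly four lines per record.
count_fastq_reads() {
    gunzip -c "$1" | awk 'END { print NR / 4 }'
}

# Report the read count for each QC output file that exists.
for f in SRR49206[56].*.qc.fq.gz; do
    [ -e "$f" ] || continue   # nothing to do if the glob matched no files
    echo "$f: $(count_fastq_reads "$f") reads"
done
```

For the '.pe' files the count should be an even number, since the reads are interleaved pairs; a fractional count suggests a truncated or malformed file.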