1. Quality Trimming and Filtering Your Sequences

Boot up an m1.xlarge machine from Amazon Web Services running Ubuntu 12.04 LTS (ami-59a4a230); this has about 15 GB of RAM, and 2 CPUs, and will be enough to complete the assembly of the example data set.

On the new machine, run the following commands to update the base software and install new necessary software

sudo apt-get update && sudo apt-get -y install screen git curl gcc make \
        g++ python-dev unzip default-jre pkg-config libncurses5-dev \
        r-base-core r-cran-gplots python-matplotlib sysstat \
        python-virtualenv fastqc trimmomatic

Note

Some of these commands may take a very long time. Please see Using ‘screen’.

You may also want to make sure that the default working directory on Amazon is writeable:

sudo chmod a+rwxt /mnt

Install software

Install khmer:

cd ~/
python2.7 -m virtualenv work
source work/bin/activate
pip install -U setuptools
git clone --branch v2.0-rc4 https://github.com/dib-lab/khmer.git
cd khmer
make install

The use of virtualenv allows us to install Python software without having root access. If you come back to this protocol in a different terminal session you will need to run:

source ~/work/bin/activate

Run FastQC on all your files

We can use FastQC to look at the quality of your sequences:

fastqc *.fastq.gz

Find the right Illumina adapters

You’ll need to know which Illumina sequencing adapters were used for your library in order to trim them off. Below, we will use the TruSeq3-PE.fa adapters

cd /mnt/work
wget https://sources.debian.net/data/main/t/trimmomatic/0.33+dfsg-1/adapters/TruSeq3-PE.fa

Note

You’ll need to make sure these are the right adapters for your data. If they are the right adapters, you should see that some of the reads are trimmed; if they’re not, you won’t see anything get trimmed.

Trim Your Data

(From this point on, you may want to be running things inside of screen, so that you can leave it running while you go do something else; see Using ‘screen’ for more information.)

Run

rm -f orphans.fq.gz

for filename in *_R1_*.fastq.gz
do
     # first, make the base by removing fastq.gz
     base=$(basename $filename .fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.fastq.gz ${baseR2}.fastq.gz \
        ${base}.qc.fq.gz s1_se \
        ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq3-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25

     # save the orphans
     gzip -9c s1_se s2_se >> orphans.fq.gz
     rm -f s1_se s2_se
done

Each file with an R1 in its name should have a matching file with an R2 – these are the paired ends.

The paired sequences output by this set of commands will be in the files ending in qc.fq.gz, with any orphaned sequences all together in orphans.fq.gz.


Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we’ll use scripts from the khmer package, which we installed above.

Now let’s use a for loop again - you might notice this is only a minor modification of the previous for loop...

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     (interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | \
         gzip > $output) && rm ${base}.qc.fq.gz ${baseR2}.qc.fq.gz
done

The final product of this is now a set of files named *.pe.qc.fq.gz that are paired-end / interleaved and quality filtered sequences, together with the file orphans.fq.gz that contains orphaned sequences.

Finishing up

Make the end product files read-only:

chmod u-w *.pe.qc.fq.gz orphans.fq.gz

to make sure you don’t accidentally delete them.

If you linked your original data files into /mnt/work, you can now do

rm *.fastq.gz

to remove them from this location; you don’t need them any more.

Things to think about

Note that the filenames, while ugly, are conveniently structured with the history of what you’ve done to them. This is a good strategy to keep in mind.

Evaluate the quality of your files with FastQC again

We can once again use FastQC to look at the quality of your newly-trimmed sequences:

fastqc *.pe.qc.fq.gz

Next: 2. Running digital normalization


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github.
comments powered by Disqus