7. Expression analysis (with RSEM)

In addition to screed, khmer, and eel-pond, you’ll also need to install bowtie (see 3. Running the Actual Assembly).


You can grab the partitioned and renamed data for nematostella here:

cd /mnt
curl -O http://athyra.idyll.org/~t/trinity-nematostella.renamed.fa.gz
gunzip -c trinity-nematostella.renamed.fa.gz > nematostella.fa

Installing rsem

We’ll be using the RSEM package to do some expression analysis, and EBSeq to do differential expression. To install these packages, do:

cd /root
curl -O http://deweylab.biostat.wisc.edu/rsem/src/rsem-1.2.8.tar.gz
tar xzf rsem-1.2.8.tar.gz
cd rsem-1.2.8
cd EBSeq

And now add this directory into your PATH, which is where Unix looks for things to run:

echo 'export PATH=$PATH:/root/rsem-1.2.8' >> ~/.bashrc
export PATH=$PATH:/root/rsem-1.2.8

Installing bowtie

If you didn’t install bowtie on this machine already (e.g. as part of 3. Running the Actual Assembly), RSEM needs it; do:

cd /root
curl -O -L http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.7/bowtie-0.12.7-linux-x86_64.zip
unzip bowtie-0.12.7-linux-x86_64.zip
cd bowtie-0.12.7
cp bowtie bowtie-build bowtie-inspect /usr/local/bin

Prepare the reference

Go to a working directory on /mnt:

cd /mnt
mkdir rsem
cd rsem

Link in the nematostella file:

ln -fs ../nematostella.fa .

Make a transcript-to-gene-map file:

python /usr/local/share/eel-pond/make-transcript-to-gene-map-file.py nematostella.fa nematostella.fa.tr_to_genes

and ask RSEM to prepare the reference against which to map the reads:

rsem-prepare-reference --transcript-to-gene-map nematostella.fa.tr_to_genes nematostella.fa nema

(Here, the ‘nema’ at the end is what to call the reference; the other two are just file names.)

This last step will take about half an hour or more.

Find and list the reads

Find the QC reads, and link them in; e.g. if using the Nematostella reads, make a volume from snap-126cc847, mount it as /data, and do:

ln -fs /data/*.pe.qc.fq.gz .

Now, make a list of the data files:

ls -1 *.pe.qc.fq.gz > list.txt

Note, the order of the files in this list is going to determine the order in the final RSEM output matrix. You might consider rearranging it so that your controls are first, etc.


Now, for each one of the files in ‘list.txt’, run RSEM. This will take a long time for lots of data, so definitely run this step in screen! :

for filename in $(cat list.txt)
    echo mapping $filename
    gunzip -c $filename > ${n}.fq
    /usr/local/share/khmer/scripts/split-paired-reads.py ${n}.fq
    rsem-calculate-expression --paired-end ${n}.fq.1 ${n}.fq.2 nema -p 4 ${n}.fq
    rm ${n}.fq ${n}.fq.[12] ${n}.fq.transcript.bam ${n}.fq.transcript.sorted.bam
    n=$(($n + 1))

Gather results:

rsem-generate-data-matrix [0-9].fq.genes.results 10.fq.genes.results > 0-vs-6-hour.matrix

...and voila, you now have a file, ‘0-vs-6-hour.matrix’, which is a tab-separated file (that Excel can load) containing a matrix of gene expression levels in FPKM (rows) vs condition (columns). The ‘1’ condition will be the first file in list.txt, the ‘2’ condition will be the second file, etc. If you want the conditions in a specific order, you can specify the files in the order you want – e.g. :

rsem-generate-data-matrix 1.fq.genes.results 3.fq.genes.results > results.matrix


Our current protocol only supports pairwise differential expression analysis, i.e. comparing two conditions, which is why we only create the 0-vs-6 hour matrix, above.

Next: 8. Differential expression (with EBSeq)

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github.
comments powered by Disqus