0. Downloading and Saving Your Initial Data

We’re going to run transcriptome assembly completely in the cloud, because that way (a) you don’t need to buy a big computer, and (b) I don’t have to figure out all the special details of your own computer system.

This does mean that the first thing you need to do is get your data over to the cloud. I tend to just store it there in the first place, because...

The basics

... Amazon is happy to rent disk space to you, in addition to compute time. They’ll rent you disk space in a few different ways, but the way that’s most useful for us is through what’s called Elastic Block Store. This is essentially a hard-disk rental service.

There are two basic concepts – “volume” and “snapshot”. A “volume” can be thought of as a pluggable-in hard drive: you create an empty volume of a given size, attach it to a running instance, and voila! You have extra hard disk space. Volume-based hard disks have two problems, however: first, they cannot be used outside of the “availability zone” they’ve been created in, which means that you need to be careful to put them in the same zone that your instance is running in; and they can’t be shared amongst people.

Snapshots, the second concept, are the solution to transporting and sharing the data on volumes. A “snapshot” is essentially a frozen copy of your volume; you can copy a volume into a snapshot, and a snapshot into a volume.

Getting started

Run through Getting started with Amazon EC2 once, to get the hang of the mechanics. Essentially you create a disk; attach it; format it; copy things to and from it.

Downloading and saving your data to a volume

There are many different ways of getting big sequence files to and from Amazon. The two that I mostly use are curl, which downloads files from a Web site URL; and ncftp, which is a robust FTP client that let’s you get files from an FTP site. Sequencing centers almost always make their data available in one of these two ways.


To use ncftp on your Amazon instance, you may need to install it:

apt-get -y install ncftp

For example, to retrieve a file from an FTP site, you would do something like:

cd /mnt
ncftp -u <username> ftp://path/to/FTP/site

use cd to find the right directory, and then:

>> mget *

to download the files. Then type ‘quit’.

You can also use curl to download files one at a time from Web or FTP sites. For example, to save a file from a website, you could use:

cd /mnt
curl -O http://path/to/file/on/website

Once you have the files, figure out their size using du -sh (e.g. after the above, du -sh /mnt will tell you how much data you have saved under /mnt), and go create and attach a volume (see Getting started with Amazon EC2).

Any files in the ‘/mnt’ directory will be lost when the instance is stopped or rebooted. However, files stored in the root, ‘/’, directory will remain available. Thus, it’s a good rule of thumb to do “savepoints” – whenever you complete a big chunk of work, think about saving the data at that point. I’ve broken the mRNAseq tutorial down into chunks of work whereyou can do this – after each Web page, basically. To sync a folder to attached volume simply type:

rsync -av folder_to_keep /path_to_volume

Some test data

To get started with multfile analysis and assembly, I’ve provided some test mRNAseq data from embryonic stages of Nematostella vectensis; the source is this excellent paper by Tulin et al., “A quantitative reference transcriptome for Nematostella vectensis”. The data is on snapshot ‘snap-f5a9dea7’, so go create a volume from that and mount it as ‘/data’ to get started; to mount it read-only, do:

mount -o ro /dev/xvdf /data

after attaching the volume as ‘sdf’.

Additional information

Throughout this protocol we will be using commandline interfaces. There is a short document explaining the notations used here. (see Commandline conventions)

Next: 1. Quality Trimming and Filtering Your Sequences

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github.
comments powered by Disqus