4. Assembling

At last! All that filtering and diginorming is done, and we can get down to the serious business of assembling. Huzzah!

Let’s start with some collection of group files in /mnt/assembly.

Install some assemblers

Let’s try assembling the sequences with three different assemblers: Velvet, IDBA, and SPADes.

First, let’s install Velvet:

cd /root
curl -O http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz
tar xzf velvet_1.2.10.tgz
cd velvet_1.2.10
cp velvet? /usr/local/bin

And also IDBA:

cd /root
curl -O https://hku-idba.googlecode.com/files/idba-1.1.1.tar.gz
tar xzf idba-1.1.1.tar.gz
cd idba-1.1.1
./configure && make
cp bin/idba_ud /usr/local/bin

And also SPADes. First:

apt-get -y install cmake


wget http://spades.bioinf.spbau.ru/release2.5.1/SPAdes-2.5.1.tar.gz
tar -xzf SPAdes-2.5.1.tar.gz
cd SPAdes-2.5.1
PREFIX=/usr/local ./spades_compile.sh

Preparing the data

Start in the assembly work directory:

cd /mnt/assembly

Each assembler takes in data in a slightly different format. Velvet is the most flexible: it can take in paired-end and orphaned reads, in FASTA or FASTQ format, gzipped or not. However, SPADes requires paired-end FASTQ, while IDBA requires paired-end FASTA. So we need to do some conversions.

First, let’s gunzip all of the pe.fq files (for SPADes):

for i in *.pe.fq.gz
   gunzip -c $i > $(basename $i .gz)

and convert them all into FASTA (for IDBA):

for i in *.pe.fq
   name=$(basename $i .fq).fa
   python /usr/local/share/khmer/sandbox/fastq-to-fasta.py $i > $name

Setting up the assemblies

We’re going to set up a bunch of assemblies in a script called ‘do-assembly.sh’. This will let us execute them in parallel if we want to, and just keep track of things if we don’t want to.

Let’s make sure the file is empty first:

rm -f do-assembly.sh

And install GNU parallel:

cd /root curl -O http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2 tar xjf parallel-latest.tar.bz2 cd parallel-* ./configure && make && make install cd /mnt/assembly

Setting up Velvet runs

I personally really like the Velvet assembler as a first cut assembler, since it yields pretty good results in a wide variety of situations. It’s also rather fast. The downside is that you have to specify a ‘k’ parameter, which gets annoying because it gives you different results and you have to run a lot of assemblies. Typically, I just run a bunch of ‘k’s, and evaluate the results.

So, let’s just do that:

for i in *.pe.fq.gz; do
   name=$(basename $i .pe.fq.gz);
   for k in {19..51..2}; do
      echo "velveth $name.velvet.$k.d $k -fastq.gz -shortPaired $pefile -short $sefile && \
           velvetg $name.velvet.$k.d -exp_cov auto -cov_cutoff auto"
done >> do-assembly.sh

Setting up IDBA runs

Set up a bunch of IDBA runs:

for i in *.pe.fa
   name=$(basename $i .pe.fa);
   echo "idba_ud --pre_correction -r $i -o $name.idba.d"
done >> do-assembly.sh

Setting up SPADes runs

And also set up a bunch of SPADes runs:

for i in *.pe.fq
   name=$(basename $i .pe.fq);
   echo "spades.py --sc --pe1-12 $name.pe.fq.gz -o $name.spades.d"
done >> do-assembly.sh

Running the assemblies

Now, run them all in parallel:

parallel -j 4 < do-assembly.sh

or, if you’re memory limited, one by one:

bash do-assembly.sh

Getting stats for the assemblies

To get some basic stats for the assemblies, run:

python /usr/local/share/khmer/sandbox/assemstats3.py 500 *.velvet.*.d/contigs.fa *idba.d/scaffold.fa *.spades.d/contigs.fasta

This will yield something like:

N       sum     max     filename
38      671957  83467   dn.21/contigs.fa
32      668918  83568   dn.23/contigs.fa
35      668509  83401   dn.25/contigs.fa
31      671843  83817   dn.27/contigs.fa
32      669104  83721   dn.29/contigs.fa
32      672735  84066   dn.31/contigs.fa
32      673102  83774   dn.33/contigs.fa
31      674629  83912   dn.35/contigs.fa
31      677446  84200   dn.37/contigs.fa
33      681099  84554   dn.39/contigs.fa
35      685245  84852   dn.41/contigs.fa
40      686733  85276   dn.43/contigs.fa
41      649574  62719   nodn.21/contigs.fa
39      639388  62155   nodn.23/contigs.fa
49      646132  62145   nodn.25/contigs.fa
39      647100  83798   nodn.27/contigs.fa
38      650487  83750   nodn.29/contigs.fa
33      649863  83770   nodn.31/contigs.fa
31      636979  83822   nodn.33/contigs.fa
35      645536  83856   nodn.35/contigs.fa
36      647848  83800   nodn.37/contigs.fa
33      654660  83934   nodn.39/contigs.fa
36      645126  83897   nodn.41/contigs.fa
34      660289  83231   idba.dn.d/scaffold.fa
45      666147  41120   idba.nodn.d/scaffold.fa

Generating a final set of contigs

Next, for each partition of input reads, let’s pick one assembly to represent the partition. (An alternative would be to use something like minimus2 to merge assemblies, but we think assembling assemblies is a bad idea.) To pick the best assembly, we’re going to choose the one that foo

for i in {0..1000};
     groupid=$(printf kak.group%04d $i);
     if [ -e ${groupid}.pe.fq.gz ]; then
        echo $groupid
done > grouplist.txt

for group in $(cat grouplist.txt)
   python /usr/local/share/khmer/sandbox/calc-best-assembly.py -q $group.{*velvet.*.d/contigs.fa,*idba.d/scaffold.fa,*spades.d/contigs.fasta} -o $group.best.fa
done > best-assemblies.txt

python /usr/local/share/khmer/sandbox/multi-rename.py testasm *.best.fa > final-assembly.fa

After this, ‘final-assembly.fa’ will contain the final set of contigs, with each contig renamed to ‘testasm.N’. To get stats:

python /usr/local/share/khmer/sandbox/assemstats3.py 500 final-assembly.fa

Note, ‘best-assemblies.txt’ contains the list of “best” assembly file names, if you want to know which ones were picked.

Next: 5. Mapping and abundance quantitation.

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github.
comments powered by Disqus