4. Assembling

At last! All that filtering and diginorming is done, and we can get down to the serious business of assembling. Huzzah!


In our experience, the MEGAHIT assembler (Li et al., 2015) is a good, fast, low-memory assembler for metagenomes (and transcriptomes), and we haven’t run into downsides for short-read assembly. If you want to combine short reads with long reads, you might look at the SPAdes assembler instead.

MEGAHIT is primarily distributed via GitHub (https://github.com/voutcn/megahit), where you can find the latest release. We’ll be using v1.0.2:

cd ~/
curl -L https://github.com/voutcn/megahit/archive/v1.0.2.tar.gz > megahit.tar.gz
tar xzf megahit.tar.gz
cd megahit*
make -j 4
export PATH=$PATH:${PWD}
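
Note that the export line only changes PATH for the current shell session, so a new login will not see megahit until you run it again from the build directory. A minimal sketch of a check you could run before starting the assembly; have_cmd is a hypothetical helper name, not something MEGAHIT provides:

```shell
# have_cmd NAME: succeed if NAME resolves to something runnable in this shell.
# (have_cmd is a hypothetical helper name, not part of MEGAHIT.)
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd megahit; then
    echo "megahit found at: $(command -v megahit)"
else
    echo "megahit is not on PATH; re-run the export line from the build directory"
fi
```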

Install QUAST

We also want to use the QUAST tool to get statistics for the assemblies; let’s install that:

cd ~/
curl -L http://sourceforge.net/projects/quast/files/quast-3.0.tar.gz/download > quast-3.0.tar.gz
tar xvf quast-3.0.tar.gz


To run MEGAHIT, we need to give it a comma-separated list of the interleaved paired-end files (--12), together with the file full of orphans (-r):

cd /mnt/work
PE_FILES=$(ls -1 *.pe.qc.kak.fq.gz | tr '\n' ',')
megahit --12 ${PE_FILES%,} -r orphans.qc.kak.fq.gz
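
The PE_FILES line turns one file name per line into a comma-separated string, and ${PE_FILES%,} then strips the trailing comma that tr leaves behind. If you would rather not parse the output of ls, here is a sketch of an alternative that joins the glob matches directly; join_by_comma is a hypothetical helper name and the sample file names are made up:

```shell
# join_by_comma ARGS...: print the arguments joined with commas, no trailing comma.
# The body runs in a subshell (note the parentheses), so the IFS change
# does not leak into the calling shell.
join_by_comma() (
    IFS=,
    printf '%s\n' "$*"
)

# Example with made-up file names; in /mnt/work you would pass *.pe.qc.kak.fq.gz
join_by_comma sampleA.pe.qc.kak.fq.gz sampleB.pe.qc.kak.fq.gz
```

With this in place, the assembly command above could equivalently be written as megahit --12 "$(join_by_comma *.pe.qc.kak.fq.gz)" -r orphans.qc.kak.fq.gz.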

If everything works, the output should end with ALL DONE. along with some other information. If this command then lists the file instead of reporting an error:

ls megahit_out/done

then your assembly completed, and your final contigs are in megahit_out/final.contigs.fa.
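
If you are running this from a script, the ls check can be made a little stricter by also requiring that the contigs file is non-empty. A minimal sketch, assuming the default megahit_out output directory used above; assembly_ok is a hypothetical helper name:

```shell
# assembly_ok DIR: succeed only if DIR holds the 'done' marker file and a
# non-empty final.contigs.fa. (assembly_ok is a hypothetical helper name.)
assembly_ok() {
    [ -e "$1/done" ] && [ -s "$1/final.contigs.fa" ]
}

if assembly_ok megahit_out; then
    echo "assembly finished: megahit_out/final.contigs.fa"
else
    echo "assembly incomplete or missing in megahit_out"
fi
```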

Getting statistics for the assembly

To get some basic stats for the assemblies, run

~/quast-3.0/quast.py megahit_out/final.contigs.fa -o report

and then look at report/report.txt:

less report/report.txt

This will give you all of your basic assembly statistics, should you care :).
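
If you just want a quick sanity check without opening the report, the contig count, total length, and N50 can be computed straight from the FASTA file. This is only a rough sketch, not a replacement for QUAST; contig_stats is a hypothetical helper name, and because QUAST applies its own minimum contig length when reporting, its numbers may differ from these:

```shell
# contig_stats FILE: print contig count, total bases and N50 for a FASTA file.
# (contig_stats is a hypothetical helper name, not part of QUAST or MEGAHIT.)
contig_stats() {
    awk '
        /^>/ { if (len) print len; len = 0; next }  # header: emit previous contig length
        { len += length($0) }                       # sequence line: accumulate bases
        END { if (len) print len }                  # emit the last contig
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }              # lengths arrive longest first
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                # N50: length of the contig at which the running sum reaches half the total
                if (run >= half) { n50 = lens[i]; break }
            }
            printf "contigs=%d total_bp=%d n50=%d\n", NR, total, n50
        }
    '
}

# Only run against the assembly if it is actually there.
if [ -f megahit_out/final.contigs.fa ]; then
    contig_stats megahit_out/final.contigs.fa
fi
```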

Next: 5. Mapping and abundance quantitation.

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0).