3. Partitioning

Note

Make sure you’re running in screen!

Start with the QC’ed files from 2. Running digital normalization or copy them into a working directory.

Simple partitioning

Partitioning is a rather complex process – nowhere near as nice and simple as digital normalization. However, we do have a simple script to run the basic stuff; if this script is too slow, or doesn’t work well for big chunks of data, we might have remedies. But for the meantime, try the simple script:

/usr/local/share/khmer/scripts/do-partition.py -k 32 -x 1e9 --threads 4 kak *.kak.qc.fq.gz

This should take about 15 minutes, and will produce ‘.part’ files. These are now FASTA files that contain partition annotations. For example, check out:

head SRR492065.pe.kak.qc.fq.gz.part

Extracting the partitions into groups

Generally there are lots of partitions, and for convenience sake we group them into group files that can be assembled in small chunks. To do this,

/usr/local/share/khmer/scripts/extract-partitions.py -X 100000 kak *.part

This will leave you with a bunch of ‘kak.group*.fa’, as well as a ‘.dist’ file containing the distribution of partition sizes (how many sequences are in a given partition).

Here, the ‘-X’ sets the number of sequences stuck into a group file. By default the -X parameter is 1 million, which would put all of the sequences into a single file for this data set.

Occasionally (OK, rather frequently) you’ll find that almost all of your sequences coalesce into one partition, which we unaffectionately call the ‘lump’. There are many possible reasons for this, and we have a series of increasingly large hammers that can be used on the lump.

For now, simply observe that:

tail kak.dist

reports that about 2/3 of the sequences are in a single partition:

1674746 1 112164 2539252

Reinflating partitions (optional)

At this point it’s worth noting that the partitions are normalized, that is, diginormed. That makes it hard to use them for abundance calculations, and some assemblers prefer to have the original abundances in there.

So, ran you recover the abundances? Of course you can! However, you do have to combine all of the raw (unpartitioned) reads into a single file, because the script to reinflate the partitions takes only single file. Sorry :(.

gunzip -c /class/cbrown/data/SRR49206?.?e.qc.fq.gz > all.fq
python /usr/local/share/khmer/sandbox/sweep-reads3.py -x 1e8 kak.group*.fa all.fq

Next: 4. Assembling


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github.
comments powered by Disqus



Edit this document!

This file can be edited directly through the Web. Anyone can update and fix errors in this document with few clicks -- no downloads needed.

  1. Go to 3. Partitioning on GitHub.
  2. Edit files using GitHub's text editor in your web browser (see the 'Edit' tab on the top right of the file)
  3. Fill in the Commit message text box at the bottom of the page describing why you made the changes. Press the Propose file change button next to it when done.
  4. Then click Send a pull request.
  5. Your changes are now queued for review under the project's Pull requests tab on GitHub!

For an introduction to the documentation format please see the reST primer.