3. Partitioning

Note

Partitioning may not be needed or useful for data sets with low or medium richness. In that case, you can skip ahead to 4. Assembling and use the pe.qc.fq.gz and se.qc.fq.gz files from 2. Running digital normalization in place of the partitioned group files produced below.

Note

Make sure you’re running in screen!
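If you're not already inside one, start a named screen session (the name 'partition' here is just an example):

screen -S partition

If you get disconnected, reattach with:

screen -r partition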

Start with the QC’ed files from 2. Running digital normalization or copy them into a working directory; you should be in /mnt/assembly below.
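For example (the source path below is just a placeholder; use wherever your files from 2. Running digital normalization actually live; the glob picks up both the diginormed .kak.qc.fq.gz files and the pre-diginorm .qc.fq.gz files):

cd /mnt/assembly
cp /path/to/diginorm/*.qc.fq.gz .
cp /path/to/diginorm/normC5k20.kh .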

Simple partitioning

Partitioning is a rather complex process – nowhere near as nice and simple as digital normalization. However, we do have a simple script that runs the basic steps; if it is too slow, or doesn't work well on big chunks of data, we may have remedies, so please contact us. In the meantime, here is a simple procedure.

First, eliminate highly repetitive k-mers, which can falsely connect sequences from multiple species, and rename the resulting files appropriately:

python /usr/local/share/khmer/sandbox/filter-below-abund.py normC5k20.kh *.fq.gz

# give the .below output files a .fq extension
for i in *.below
do
   mv "$i" "$i.fq"
done

Note

You will need the normC5k20.kh file from 2. Running digital normalization for this step. If you don’t have it, you can regenerate it like so:

/usr/local/share/khmer/scripts/load-into-counting.py -k 20 -N 4 -x 5e8 normC5k20.kh *.qc.fq.gz

Next, run partitioning:

/usr/local/share/khmer/scripts/do-partition.py -k 32 -x 1e9 --threads 4 kak *.kak.qc.fq.gz.below.fq

This should take about 15 minutes, and will produce ‘.part’ files. These are now FASTQ files that contain partition annotations. For example, check out:

head SRR492065.pe.kak.qc.fq.gz.below.part
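Each record looks like a normal FASTQ record, except that the read name is followed by a tab and the ID of the partition the read was assigned to; schematically (actual names, sequences, and IDs will vary):

@<read name>	<partition ID>
<sequence>
+
<quality string>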

Extracting the partitions into groups

Generally there are lots of partitions, so for convenience's sake we collect them into group files that can be assembled in small chunks. To do this, run:

/usr/local/share/khmer/scripts/extract-partitions.py -X 100000 kak *.part

This will leave you with a bunch of kak.group*.fq files, as well as a kak.dist file containing the distribution of partition sizes (how many sequences are in a given partition).
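To take a quick look at that distribution:

head kak.dist

Each row reports (roughly) a partition size, how many partitions have that size, and running totals of partitions and sequences; the exact columns may vary between khmer versions.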

Here, the '-X' sets the maximum number of sequences placed in each group file. By default the -X parameter is 1 million, which would put all of the sequences into a single group file for this data set.

Separating groups into PE and SE

We still want to track paired-end and single-end reads, so let's go ahead and extract the PE reads as before:

for i in kak*.fq
do
   # split each group into paired (PE) and orphaned (SE) reads
   /usr/local/share/khmer/scripts/extract-paired-reads.py "$i"
   name=$(basename "$i" .fq)
   mv "${name}.fq.pe" "${name}.pe.fq"
   mv "${name}.fq.se" "${name}.se.fq"
done

And, finally, compress them:

gzip *.pe.fq *.se.fq
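As a quick sanity check, you can count the reads in each compressed file (a FASTQ record is four lines long):

for i in kak*.pe.fq.gz kak*.se.fq.gz
do
   echo "$i: $(( $(gzip -dc "$i" | wc -l) / 4 )) reads"
done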

Reinflating partitions (optional)

At this point it's worth noting that the partitioned reads are still normalized, that is, diginormed. That makes them hard to use for abundance calculations, and some assemblers prefer to have the original abundances in there.

So, can you recover the abundances? Of course you can! However, you do have to combine all of the QC'ed, pre-diginorm reads into a single file, because the script that reinflates the partitions takes only a single file. Sorry :(.

gunzip -c SRR49206?.?e.qc.fq.gz > all.fq

Now, run the command to pull all of the reads for each group from ‘all.fq’ into reinflated files:

python /usr/local/share/khmer/sandbox/sweep-reads3.py -x 3e8 kak.group*.fq all.fq

Finally, split the resulting .sweep3 files into PE/SE:

for i in kak*.sweep3
do
   # split the reinflated group into PE and SE reads, as above
   /usr/local/share/khmer/scripts/extract-paired-reads.py "$i"
   name=$(basename "$i" .fq.sweep3)
   mv "$i.pe" "${name}.nodn.pe.fq"
   mv "$i.se" "${name}.nodn.se.fq"
done

and compress:

gzip *.nodn.se.fq *.nodn.pe.fq
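To double-check the reinflation, each group's .nodn files should together contain at least as many reads as their diginormed counterparts, since they restore the reads that digital normalization discarded. A quick comparison:

for i in kak.group*.nodn.pe.fq.gz
do
   name=$(basename "$i" .nodn.pe.fq.gz)
   dn=$(gzip -dc ${name}.?e.fq.gz | wc -l)
   nodn=$(gzip -dc ${name}.nodn.?e.fq.gz | wc -l)
   echo "$name: $(( dn / 4 )) diginormed, $(( nodn / 4 )) reinflated reads"
done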

Cleaning up

At this point, you have quite a few intermediate files, all of which can be removed:

rm *.part
rm kak.group*.fq
rm all.fq
rm *.sweep3
rm *.kh
rm *.below.fq
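You can confirm what's left with:

ls kak.group*.fq.gz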

You'll be left with 18 pairs of files named kak.group00XX.pe.fq.gz and kak.group00XX.se.fq.gz, and (if you reinflated the partitions) another 18 pairs of files named kak.group00XX.nodn.pe.fq.gz and kak.group00XX.nodn.se.fq.gz.


Next: 4. Assembling

