=================================
2. Applying Digital Normalization
=================================

.. shell start

.. ::

    set -x
    set -e

.. note::

   You can start this tutorial with the contents of EC2/EBS snapshot
   snap-126cc847.

.. note::

   You'll need ~15 GB of RAM for this, or more if you have a LOT of data.

Link in your data
-----------------

Make sure your data is in ``${HOME}/projects/eelpond/filtered-pairs``::

    ls ${HOME}/projects/eelpond/filtered-pairs

If you've loaded it onto ``${HOME}/data/``, you can do::

    mkdir -p ${HOME}/projects/eelpond/filtered-pairs
    ln -fs ${HOME}/data/*.qc.fq.gz ${HOME}/projects/eelpond/filtered-pairs/

Run digital normalization
-------------------------

.. ::

    echo 2-diginorm normalize1-pe `date` >> ${HOME}/times.out

Apply digital normalization to the paired-end reads::

    cd ${HOME}/projects/eelpond/
    mkdir diginorm
    cd diginorm
    normalize-by-median.py --paired --ksize 20 --cutoff 20 --n_tables 4 \
        --min-tablesize 3e8 --savetable normC20k20.ct \
        ../filtered-pairs/*.pe.qc.fq.gz

.. ::

    echo 2-diginorm normalize1-se `date` >> ${HOME}/times.out

and then to the single-end reads::

    normalize-by-median.py --cutoff 20 --loadtable normC20k20.ct \
        --savetable normC20k20.ct ../filtered-pairs/*.se.qc.fq.gz

Note the ``--paired`` flag in the first ``normalize-by-median.py``
command -- when run on PE data, it ensures that no paired ends are
orphaned. However, it will complain on single-ended data, so you have
to give the single-ended data to it separately.

Also note the ``--n_tables`` and ``--min-tablesize`` parameters. These
specify how much memory diginorm should use: their product is the
total table size in bytes (here, 4 x 3e8 = 1.2e9 bytes, or about 1.2
GB), and it should be less than the memory size of the machine you
selected. The maximum needed for *any* transcriptome should be in the
~60 GB range, e.g. ``--n_tables 4 --min-tablesize 15e9``; for only a
few hundred million reads, 16 GB should be plenty. (See the khmer
documentation on choosing table sizes for more information.)

Trim off likely erroneous k-mers
--------------------------------

.. ::

    echo 2-diginorm filter-abund `date` >> ${HOME}/times.out

Now, run through all the reads and trim off low-abundance parts of
high-coverage reads::

    cd ${HOME}/projects/eelpond
    mkdir abundfilt
    cd abundfilt
    filter-abund.py --variable-coverage ../diginorm/normC20k20.ct \
        --threads ${THREADS:-1} ../diginorm/*.keep

This will turn some reads into orphans, but that's OK -- their partner
read was bad.

Rename files
~~~~~~~~~~~~

You'll have a bunch of ``.keep.abundfilt`` files -- let's make things
prettier.

.. ::

    echo 2-diginorm extract `date` >> ${HOME}/times.out

First, let's break out the orphaned and still-paired reads::

    cd ${HOME}/projects/eelpond
    mkdir digiresult
    cd digiresult
    for file in ../abundfilt/*.pe.*.abundfilt
    do
        extract-paired-reads.py ${file}
    done

..

    # parallel version
    cd ${HOME}/projects/eelpond
    mkdir digiresult
    cd digiresult
    ls ../abundfilt/*.pe* | parallel extract-paired-reads.py

We can combine the orphaned reads into a single file::

    cd ${HOME}/projects/eelpond/abundfilt
    for file in *.se.qc.fq.gz.keep.abundfilt
    do
        pe_orphans=${file%%.se.qc.fq.gz.keep.abundfilt}.pe.qc.fq.gz.keep.abundfilt.se
        newfile=${file%%.se.qc.fq.gz.keep.abundfilt}.se.qc.keep.abundfilt.fq.gz
        cat ${file} ../digiresult/${pe_orphans} | gzip -c > ../digiresult/${newfile}
        rm ../digiresult/${pe_orphans}
    done
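The ``${file%%pattern}`` expansions in the loop above use bash's
suffix-stripping parameter expansion to build the new file names. A
quick illustration, using one of the file names from this dataset::

    file=6Hour_CGATGT_L002_R1_005.se.qc.fq.gz.keep.abundfilt
    # strip the long suffix, leaving just the sample prefix
    echo ${file%%.se.qc.fq.gz.keep.abundfilt}
    # prints: 6Hour_CGATGT_L002_R1_005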
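At this point a quick sanity check can't hurt. The loop below is just
illustrative, not part of the protocol: FASTQ records are four lines
long, so dividing each file's line count by four gives its read count,
and since ``extract-paired-reads.py`` writes its paired output
interleaved, each ``.pe`` file should report an even number of reads::

    cd ${HOME}/projects/eelpond/digiresult
    for file in *
    do
        # decompress on the fly if needed; FASTQ = 4 lines per record
        case ${file} in
            *.gz) lines=$(gunzip -c ${file} | wc -l) ;;
            *)    lines=$(wc -l < ${file}) ;;
        esac
        echo "${file}: $((lines / 4)) reads"
    done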
..

    # parallel version
    cd ${HOME}/projects/eelpond/abundfilt
    ls *.se.qc.fq.gz.keep.abundfilt | parallel \
        'file={};
         pe_orphans=${file%%.se.qc.fq.gz.keep.abundfilt}.pe.qc.fq.gz.keep.abundfilt.se;
         newfile=${file%%.se.qc.fq.gz.keep.abundfilt}.se.qc.keep.abundfilt.fq.gz;
         cat ${file} ../digiresult/${pe_orphans} | gzip -c > ../digiresult/${newfile};
         rm ../digiresult/${pe_orphans}'

We can also rename the remaining PE reads & compress those files::

    cd ${HOME}/projects/eelpond/digiresult
    for file in *.abundfilt.pe
    do
        newfile=${file%%.fq.gz.keep.abundfilt.pe}.keep.abundfilt.fq
        mv ${file} ${newfile}
        gzip ${newfile}
    done

..

    # parallel version
    cd ${HOME}/projects/eelpond/digiresult
    ls *.abundfilt.pe | parallel \
        'file={};
         newfile=${file%%.fq.gz.keep.abundfilt.pe}.keep.abundfilt.fq;
         mv ${file} ${newfile};
         gzip ${newfile}'

This leaves you with a whole passel o' files, most of which you want
to go away! ::

    filtered-pairs/6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz
    filtered-pairs/6Hour_CGATGT_L002_R1_005.se.qc.fq.gz
    diginorm/6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz.keep
    diginorm/6Hour_CGATGT_L002_R1_005.se.qc.fq.gz.keep
    diginorm/normC20k20.ct
    abundfilt/6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz.keep.abundfilt
    abundfilt/6Hour_CGATGT_L002_R1_005.se.qc.fq.gz.keep.abundfilt
    digiresult/6Hour_CGATGT_L002_R1_005.pe.qc.keep.abundfilt.fq.gz
    digiresult/6Hour_CGATGT_L002_R1_005.se.qc.keep.abundfilt.fq.gz

.. ::

    echo 2-diginorm DONE `date` >> ${HOME}/times.out

So, finally, let's get rid of a lot of the old files::

    cd ${HOME}/projects/eelpond
    rm filtered-pairs/*
    rm diginorm/*
    rm abundfilt/*
    rmdir filtered-pairs diginorm abundfilt

Gut check
~~~~~~~~~

You should now have the following files in your project directory
(after typing ``ls digiresult``)::

    digiresult/6Hour_CGATGT_L002_R1_005.pe.qc.keep.abundfilt.fq.gz
    digiresult/6Hour_CGATGT_L002_R1_005.se.qc.keep.abundfilt.fq.gz

These files are, respectively, the paired (pe) quality-filtered (qc)
digitally normalized (keep) abundance-trimmed (abundfilt) FASTQ (fq)
gzipped (gz) sequences, and the orphaned (se) quality-filtered (qc)
digitally normalized (keep) abundance-trimmed (abundfilt) FASTQ (fq)
gzipped (gz) sequences.

Save all these files to a new volume, and get ready to assemble!

.. shell stop

Next: :doc:`3-big-assembly`.