{"id":403,"date":"2015-11-18T02:46:44","date_gmt":"2015-11-18T02:46:44","guid":{"rendered":"http:\/\/www.abyteofcommonsense.com\/?p=403"},"modified":"2015-11-18T02:47:15","modified_gmt":"2015-11-18T02:47:15","slug":"how-to-deal-with-blast-output-and-fasta-to-get-what-you-need-for-megan","status":"publish","type":"post","link":"http:\/\/www.abyteofcommonsense.com\/?p=403","title":{"rendered":"How to deal with Blast output and Fasta to get what you need for MEGAN (using Qiime)"},"content":{"rendered":"<p>So&#8230; You have some sequence data from Illumina sequencing.\u00a0These are the two sets of files I have, one is forward reads, and the other is reverse.<\/p>\n<p class=\"p1\"><strong><span class=\"s1\">PP2_S1_L001_R2_001.fastq.gz<\/span><\/strong><\/p>\n<p class=\"p1\"><strong><span class=\"s1\">PP2_S1_L001_R1_001.fastq.gz<\/span><\/strong><\/p>\n<p>&#8230; it&#8217;s in fastaq.gz format. You want to upzip it, simple right? You want to use this command:<\/p>\n<p><strong>gunzip &#8211;keep PP2_S1_L001_R2_001.fastq.gz<\/strong><\/p>\n<p class=\"p1\">But it turns out, I didn&#8217;t know how to deal with that format yet. So then I went and hunted some more software down, and used Qiime.<\/p>\n<p class=\"p1\">My sequences are paired reads, and they are already demultiplexed. So what that means is that you just need to pair the two sets, no need to do anything else fancy. Qiime needs to be loaded within your terminal window, from where ever you installed it. I made the mistake of following the instructions without first moving my file out of the Downloads folder. I advise putting it in your root directory first. In order to open Qiime, use the code<\/p>\n<p class=\"p1\"><strong><span class=\"s1\">\/&#8217;pathtoyourinstallation&#8217;\/MacQIIME_1.9.1-20150604_OS10.7\/scripts\/macqiime<\/span><\/strong><\/p>\n<p class=\"p1\">within the folder that you want to do the pairing within. Then the code is simply:<\/p>\n<p class=\"p1\"><strong><span class=\"s1\">multiple_join_paired_ends.py -i . -o .<\/span><\/strong><\/p>\n<p class=\"p1\">Those &#8216;.&#8217; are important, because it tells it to just used the two files that are already in the directory [-i .]. And the [-o .] tells it that here is as good a place as any to create a new folder with its output, which consists of three files:<\/p>\n<p class=\"p1\"><strong><span class=\"s1\">fastqjoin.join.fastq<\/span><\/strong><\/p>\n<p class=\"p1\"><strong><span class=\"s1\">fastqjoin.un1.fastq<\/span><\/strong><\/p>\n<p class=\"p1\"><strong><span class=\"s1\">fastqjoin.un2.fastq<\/span><\/strong><\/p>\n<p class=\"p1\">Now, tempting as it is to just continue onto the step where you produce your OTUs. But then you&#8217;d be missing the quality control step. What made it complicated for me was that I had no idea what a &#8216;barcode&#8217; was, and whether my [&#8211;barcode_type] was &#8216;non-barcoded&#8217;, in which case I needed to provide\u00a0[&#8211;sample_ids]. These are the two errors I got, when I confused it.<\/p>\n<p class=\"p1\"><strong><span class=\"s1\">Error in split_libraries_fastq.py: If not providing barcode reads (because your data is not multiplexed), must provide &#8211;sample_ids.<\/span><\/strong><\/p>\n<p class=\"p1\"><strong><span class=\"s1\">Error in split_libraries_fastq.py: Must provide &#8211;barcode_read_fps if &#8211;barcode_type is not &#8216;not-barcoded&#8217;<\/span><\/strong><\/p>\n<p class=\"p1\">So, the way the command\u00a0needs to be written is that even if you have no barcodes, you can specify pretty much whatever you want for the [<span class=\"s1\">&#8211;sample_ids], and it will just be chucked at the beginning of each of the lines that specify the name of the particular read. I chose to use one that reflects which sample it came from originally (but it doesn&#8217;t actually matter, you could just use &#8216;xyz&#8217; if you wanted).<\/span><\/p>\n<p class=\"p1\"><strong><span class=\"s1\">split_libraries_fastq.py -i fastqjoin.join.fastq -o . -q 19 &#8211;sample_ids PP2 &#8211;barcode_type &#8216;not-barcoded&#8217;<\/span><\/strong><\/p>\n<p class=\"p1\"><span class=\"s1\">-q 19 means that the quality of the sequences must be &gt;=20. It seemed to be a standard value for the other work done in this area and discussed on the Qiime site. The default is 3, which seems low, so I specified it. You then end up with a file that is:<\/span><\/p>\n<p class=\"p1\"><strong><span class=\"s1\">seqs.fna<\/span><\/strong><\/p>\n<p class=\"p1\">Don&#8217;t let the .fna deceive you, it&#8217;s actually a .fasta file, and so I renamed mine to reflect that, knowing perfectly well that some programs just have&#8230; issues with other file formats. Label it what it is! Then you want to be able to pick OTUs, which basically narrow down the number of sequences you are dealing with by removing ones that are the same. That is ok for my data, because it&#8217;s not quantitative (or I&#8217;d be worried about losing sequences). I did the default (97% similarity) using UCLUST.<\/p>\n<p class=\"p1\"><strong><span class=\"s1\">pick_otus.py -i seqs.fasta <\/span><\/strong><\/p>\n<p class=\"p1\">I then spent quite a long painful while trying to sort out my OTUs using different programs offered by Qiime, but eventually gave up, because I am going to blast all of the sequences, which will make it more powerful. I DID NOT remove chimeric sequences, as I found the reference databases offered were too small for my interests. I am assuming that I will get mainly viruses back (this is not a metagenomic sample, so it&#8217;s ok to have some uncategorised reads). I ended up with 3 new files in their own directory:<\/p>\n<p class=\"p1\"><strong>seqs_clusters.uc<\/strong><\/p>\n<p class=\"p1\"><strong><span class=\"s1\">seqs_otus.txt<\/span><\/strong><\/p>\n<p class=\"p1\"><strong><span class=\"s1\">seqs_otus.log<\/span><\/strong><\/p>\n<p class=\"p1\">The next command I used was:<\/p>\n<p class=\"p1\"><strong><span class=\"s1\">pick_rep_set.py -i seqs_otus.txt -f ..\/seqs.fasta -o rep_s<\/span><span class=\"s1\">et.fasta<\/span><\/strong><\/p>\n<p class=\"p1\">So if there was more than one OTU that was the same as the others, it was removed. Finally, I called it done, and set about blasting my results from:<\/p>\n<p class=\"p1\"><strong><span class=\"s1\">rep_s<\/span><span class=\"s1\">et.fasta<\/span><\/strong><\/p>\n<p class=\"p1\">NB:<\/p>\n<p class=\"p1\">So this is a handy awk script for parsing .fasta data so that you only use .fasta sequences that are longer than a certain length when you blastn them against the nt database. In my case, I&#8217;ve looked at the FastaQ toolkit output from basespace.illumina and the graphs of sequence length tell me that there are several hot-spots of sequence interest. There are some sequences that are around &lt;50bp, and some that are &gt;140. Theoretically I should have better hits from the ones that are 140+ bp. Also, this can speed up the blastn search you do.\u00a0I got this handy bit of script <a href=\"https:\/\/www.biostars.org\/p\/62678\/\">from here<\/a>, and then modified it for what I needed.<\/p>\n<p class=\"p1\"><strong><span class=\"s1\">cat rawfile.fasta\u00a0| awk &#8216;{y= i++ % 2 ; L[y]=$0; if(y==1 &amp;&amp; length(L[1])&gt;=80) {printf(&#8220;%s\\n%s\\n&#8221;,L[0],L[1]);}}&#8217; &gt; less.seqs.fasta<\/span><\/strong><\/p>\n<p class=\"p1\">y = 2, because we are working with fasta format text, so line 1 is the name and line 2 is the sequence.<\/p>\n<p class=\"p1\">length(L[1])&gt;=80 gives me only sequences that are greater than 80 bp long.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>So&#8230; You have some sequence data from Illumina sequencing.\u00a0These are the two sets of files I have, one is forward reads, and the other is&#8230;<\/p>\n<div class=\"more-link-wrapper\"><a class=\"more-link\" href=\"http:\/\/www.abyteofcommonsense.com\/?p=403\">Continue reading<span class=\"screen-reader-text\">How to deal with Blast output and Fasta to get what you need for MEGAN (using Qiime)<\/span><\/a><\/div>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","jetpack_publicize_message":"","jetpack_is_tweetstorm":false},"categories":[69],"tags":[70,72,74,73],"jetpack_featured_media_url":"","jetpack_publicize_connections":[],"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p6nzXS-6v","jetpack-related-posts":[{"id":400,"url":"http:\/\/www.abyteofcommonsense.com\/?p=400","url_meta":{"origin":403,"position":0},"title":"Local NCBI BLAST problems? Some solutions for the unexperienced bioinformatician","date":"November 16, 2015","format":false,"excerpt":"Is local blast driving you nuts? Blast is a super powerful tool if you download it onto your own computer because you can blast more than one sequence at once, and not have to worry about the server dying on you. Having set it up both with the experience of\u2026","rel":"","context":"In &quot;Bioinformatics&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":465,"url":"http:\/\/www.abyteofcommonsense.com\/?p=465","url_meta":{"origin":403,"position":1},"title":"Blast(ed) MEGAN Round 2 - or what to do when you're running yet another blast","date":"April 5, 2016","format":false,"excerpt":"So MEGAN can be a bit annoying at times. As can Blast outputs required by MEGAN. Because I'm particularly interested in the alignments produced, and want to create a consensus sequence for designing primers, I need to run blast again from what I did the other day. This time, I've\u2026","rel":"","context":"In &quot;Bioinformatics&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":410,"url":"http:\/\/www.abyteofcommonsense.com\/?p=410","url_meta":{"origin":403,"position":2},"title":"Creating a custom local Blast database from .fasta","date":"November 23, 2015","format":false,"excerpt":"All these commands are pretty simple to use, but I couldn't find really straight forward answers for why the hell I was getting errors, so here is my quick guide. I hope you find it useful. When you are creating a custom NCBI blast database to use, there's a couple\u2026","rel":"","context":"In &quot;Bioinformatics&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":412,"url":"http:\/\/www.abyteofcommonsense.com\/?p=412","url_meta":{"origin":403,"position":3},"title":"Using the mac OSX command line","date":"November 30, 2015","format":false,"excerpt":"The most sensible way to set up your working environment in the Mac command line when you want to do the same thing in multiple folders is to make sure you have labelled everything in the same way.\u00a0A handy hint here is that if you have terminal open, and you\u2026","rel":"","context":"In &quot;Bioinformatics&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":15,"url":"http:\/\/www.abyteofcommonsense.com\/?p=15","url_meta":{"origin":403,"position":4},"title":"LEGO Obsessions","date":"October 6, 2014","format":false,"excerpt":"I've admitted before that I'm a bit obsessed with Lego. It only started last Christmas, when I fished out my old Lego and bought two new 'City' sets to build with KK. And suddenly, before I knew what was happening, I was buying all of the collection. In fact, with\u2026","rel":"","context":"In &quot;LEGO&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":5,"url":"http:\/\/www.abyteofcommonsense.com\/?p=5","url_meta":{"origin":403,"position":5},"title":"50 Shades of Grey - 200% better with LEGO","date":"February 10, 2015","format":false,"excerpt":"Anyone else seen the trailer for 50 Shades of Grey? I personally haven't - instead I watched the LEGO version! Some enterprising guy on YouTube decided to do a stop motion video of a LEGO Grey and Steel romance. Here's the link. Doesn't it just prove that LEGO can make\u2026","rel":"","context":"In &quot;LEGO&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"http:\/\/www.abyteofcommonsense.com\/index.php?rest_route=\/wp\/v2\/posts\/403"}],"collection":[{"href":"http:\/\/www.abyteofcommonsense.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.abyteofcommonsense.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.abyteofcommonsense.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/www.abyteofcommonsense.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=403"}],"version-history":[{"count":6,"href":"http:\/\/www.abyteofcommonsense.com\/index.php?rest_route=\/wp\/v2\/posts\/403\/revisions"}],"predecessor-version":[{"id":409,"href":"http:\/\/www.abyteofcommonsense.com\/index.php?rest_route=\/wp\/v2\/posts\/403\/revisions\/409"}],"wp:attachment":[{"href":"http:\/\/www.abyteofcommonsense.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=403"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.abyteofcommonsense.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=403"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.abyteofcommonsense.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=403"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}