Skip to content

How to deal with Blast output and Fasta to get what you need for MEGAN (using Qiime)

So… You have some sequence data from Illumina sequencing. These are the two sets of files I have, one is forward reads, and the other is reverse.



… it’s in fastaq.gz format. You want to upzip it, simple right? You want to use this command:

gunzip –keep PP2_S1_L001_R2_001.fastq.gz

But it turns out, I didn’t know how to deal with that format yet. So then I went and hunted some more software down, and used Qiime.

My sequences are paired reads, and they are already demultiplexed. So what that means is that you just need to pair the two sets, no need to do anything else fancy. Qiime needs to be loaded within your terminal window, from where ever you installed it. I made the mistake of following the instructions without first moving my file out of the Downloads folder. I advise putting it in your root directory first. In order to open Qiime, use the code


within the folder that you want to do the pairing within. Then the code is simply: -i . -o .

Those ‘.’ are important, because it tells it to just used the two files that are already in the directory [-i .]. And the [-o .] tells it that here is as good a place as any to create a new folder with its output, which consists of three files:




Now, tempting as it is to just continue onto the step where you produce your OTUs. But then you’d be missing the quality control step. What made it complicated for me was that I had no idea what a ‘barcode’ was, and whether my [–barcode_type] was ‘non-barcoded’, in which case I needed to provide [–sample_ids]. These are the two errors I got, when I confused it.

Error in If not providing barcode reads (because your data is not multiplexed), must provide –sample_ids.

Error in Must provide –barcode_read_fps if –barcode_type is not ‘not-barcoded’

So, the way the command needs to be written is that even if you have no barcodes, you can specify pretty much whatever you want for the [–sample_ids], and it will just be chucked at the beginning of each of the lines that specify the name of the particular read. I chose to use one that reflects which sample it came from originally (but it doesn’t actually matter, you could just use ‘xyz’ if you wanted). -i fastqjoin.join.fastq -o . -q 19 –sample_ids PP2 –barcode_type ‘not-barcoded’

-q 19 means that the quality of the sequences must be >=20. It seemed to be a standard value for the other work done in this area and discussed on the Qiime site. The default is 3, which seems low, so I specified it. You then end up with a file that is:


Don’t let the .fna deceive you, it’s actually a .fasta file, and so I renamed mine to reflect that, knowing perfectly well that some programs just have… issues with other file formats. Label it what it is! Then you want to be able to pick OTUs, which basically narrow down the number of sequences you are dealing with by removing ones that are the same. That is ok for my data, because it’s not quantitative (or I’d be worried about losing sequences). I did the default (97% similarity) using UCLUST. -i seqs.fasta

I then spent quite a long painful while trying to sort out my OTUs using different programs offered by Qiime, but eventually gave up, because I am going to blast all of the sequences, which will make it more powerful. I DID NOT remove chimeric sequences, as I found the reference databases offered were too small for my interests. I am assuming that I will get mainly viruses back (this is not a metagenomic sample, so it’s ok to have some uncategorised reads). I ended up with 3 new files in their own directory:




The next command I used was: -i seqs_otus.txt -f ../seqs.fasta -o rep_set.fasta

So if there was more than one OTU that was the same as the others, it was removed. Finally, I called it done, and set about blasting my results from:



So this is a handy awk script for parsing .fasta data so that you only use .fasta sequences that are longer than a certain length when you blastn them against the nt database. In my case, I’ve looked at the FastaQ toolkit output from basespace.illumina and the graphs of sequence length tell me that there are several hot-spots of sequence interest. There are some sequences that are around <50bp, and some that are >140. Theoretically I should have better hits from the ones that are 140+ bp. Also, this can speed up the blastn search you do. I got this handy bit of script from here, and then modified it for what I needed.

cat rawfile.fasta | awk ‘{y= i++ % 2 ; L[y]=$0; if(y==1 && length(L[1])>=80) {printf(“%s\n%s\n”,L[0],L[1]);}}’ > less.seqs.fasta

y = 2, because we are working with fasta format text, so line 1 is the name and line 2 is the sequence.

length(L[1])>=80 gives me only sequences that are greater than 80 bp long.

Published inBioinformatics

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.