Skip to content

Creating a custom local Blast database from .fasta

All these commands are pretty simple to use, but I couldn’t find really straight forward answers for why the hell I was getting errors, so here is my quick guide. I hope you find it useful.

When you are creating a custom NCBI blast database to use, there’s a couple of things you need to keep in mind. It will depend in part what kind of sequences you want, nt or aa. Remember that for blastn you need nt sequences (nucleotide sequences from that section of NCBI) and for blastp/blastx you need protein sequences (protein sequences as above). Simply choose which organisms you want it for, go to ‘send to’, then ‘File’, then ‘file type .fasta’ After that has downloaded, you’re going to want to copy it into your ncbi/db . I put mine in a new file. Then, use the command makeblastdb. I hard code all of the things I am dealing with, because sometimes my computer just doesn’t play nice.

 

ncbi-blast-2.2.31+/bin/makeblastdb -in ncbi-blast-2.2.31+/db/flavivirus-nt-custom-db/flavivirus.nogaps.noempty.nt -outncbi-blast-2.2.31+/db/flavivirus-nt-custom-db/flavivirus.nogaps.noempty.db -dbtype nucl

If you want to do protein sequences, then you need to use -dbtype prot

 

BLAST Database creation error: FASTA-Reader: No residues given

Blast can give this ugly, rather uninformative error for two reasons that I have run into so far. If there are gaps between your sequences in your fasta file, then it will fail. And if there are any blank sequences, that will also fail. The commands you want to use are:

grep -v ‘^$’ flavivirus.nt > flavivirus.nogaps.nt

awk -v RS=”>” -v FS=”\n” -v ORS=”” ‘ { if ($2) print “>”$0 } ‘ flavivirus.nogaps.nt > flavivirus.nogaps.noempty.nt

Now that that has hopefully completed successfully, you can use your new database as you like. In the file where you have created the database, there should now be 4 files –

flavivirus.nogaps.noempty.db

flavivirus.nogaps.noempty.db.nsq

flavivirus.nogaps.noempty.db.nin

flavivirus.nogaps.noempty.db.nhr

or, if you are using protein sequences, the files will end with .pin, .psq and .phr

To call upon your database, you need to use the original file that you created the database from. The command I used for a nt search was:

ncbi-blast-2.2.31+/bin/blastn -query ./seqs.fasta -db ncbi-blast-2.2.31+/db/flavivirus-nt-custom-db/flavivirus.nogaps.noempty.nt -out seqs.blast

At this point you might get another annoying error message:

Error: Too many positional arguments (1), the offending value: db

Error: Too many positional arguments (1), the offending value: /ncbi-blast-2.2.31+/db/flavivirus-nt-custom-db/flavivirus.nogaps.noempty.nt

These errors will come up for a number of reasons:

  1. make sure that you are blasting against the right database (eg. blastn = nucleotide)
  2. blast may be throwing a hissy fit that you have accidentally called the wrong file (eg. flavivirus.nogaps.noempty.nt is not the same as flavivirus.nogaps.noempty). You must use exactly the same naming scheme for your original .fasta file as you do for the output from makeblastdb . It is best not to put a .fasta on the end of your original file, as this seems to confuse blast sometimes also.

Remember, don’t worry if this error comes up

Warning: [blastn] yoursequencenamewillbehere Warning: Could not calculate ungapped Karlin-Altschul parameters due to an invalid query sequence or its translation. Please verify the query sequence(s) and/or filtering options

This just means that that particular record is empty (or is occupied with a string of NNNNNNNNNN, quite common in large scale miseq datasets). You could have also deleted it using the above command that you used on empty sequences in your custom blast database. You can check whether it is empty or not by using the grep command

grep yoursequencenamewithoutgapswillbehere seqs.fasta -C 2

the [-C 2] tells grep to show you the 2 lines before and after where the query appears.

Published inBioinformatics

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.