BLAST, or Basic Local Alignment Search Tool, is a collection of tools that are used to search for and find regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases, and calculates the statistical significance of the matches. This software suite has been released free to the public by the National Centre for Biotechnology Information.
BLAST can be used for protein-protein comparisons or nucleotide-nucleotide comparisons. Before an example of the usage is presented, we must first define some environmental variables.
$BLASTDB - This is the variable which points to the Blast Database. This is set to $HOME/bio/ncbi/db/. This directory should contain the databases that you would want to search. BLAST, by default, checks this location and the current working directory for the presence of the databases. This variable is set during login by system login scripts , and may be changed by the user to point to her preferred location in her startup scripts.
$BLASTMAT - This variable points to the location where the BLAST scoring matricies are present. It is set to /opt/bio/ncbi/data. Again, they may be changed to point to a desired location on a per-user basis.
BLAST requires the presence of 2 datasets. One dataset is the input sequence that you want to search for, and the other dataset is the database that you want to search against.
Use the following procedure to run blast
Download a BLAST database that you want to run the comparison against. The databases can be obtained from the NCBI ftp site at ftp://ftp.ncbi.nlm.nih.gov/blast/db/.
The databases available on the site mentioned above are pre-formatted. It is recommended that the blast databases be stored at the $BLASTDB location. |
Visit ftp://ftp.ncbi.nlm.nih.gov/blast/db/ in your browser to see a list of available preformatted databases.
Download one of these on to your cluster using wget.
[nostromo@xxx ~]$ wget -q ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.08.tar.gz [nostromo@xxx ~]$ gunzip -c nt.08.tar.gz | ( cd $BLASTDB/ && tar -xf -) |
The above method downloads a formatted database, and untars it into $BLASTDB.
Unformatted databases can be obtained in FASTA format at ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/.
Visit ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ in your web browser
If you've downloaded the databases from ftp://ftp.ncbi.nlm.nih.gov/blast/db/, then DO NOT run formatdb. |
Run the formatdb command to format the database to the BLAST format. For this example, we'll use the Drosophila Melanogaster (fruitfly) nucleotide database
[nostromo@xxx ~]$ cd $BLASTDB [nostromo@xxx ~]$ wget -q ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/drosoph.nt.gz [nostromo@xxx ~]$ gunzip drosoph.nt.gz [nostromo@xxx ~]$ formatdb -p F -V T -i drosoph.nt [nostromo@xxx ~]$ ls drosoph.nt* drosoph.nt drosoph.nt.nhr drosoph.nt.nin drosoph.nt.nsq [nostromo@xxx ~]$ cd $HOME |
After the database is formatted, create a test input file.
[nostromo@xxx ~]$ cat > test.txt >Test AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT |
Run the blastall program on the test input against the formatted database.
[nostromo@xxx ~]$ blastall --help |
gives a list of all the options that you can use to run the blastall program.
[nostromo@xxx ~]$ blastall -d drosoph.nt -p blastn -i test.txt BLASTN 2.2.18 [Mar-02-2008] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query= Test (560 letters) Database: drosoph.nt 1170 sequences; 122,655,632 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value gi|10729531|gb|AE002936.2|AE002936 Drosophila melanogaster genom... 36 0.86 gi|10728232|gb|AE003493.2|AE003493 Drosophila melanogaster genom... 36 0.86 gi|10726497|gb|AE003698.2|AE003698 Drosophila melanogaster genom... 36 0.86 gi|10726398|gb|AE003681.2|AE003681 Drosophila melanogaster genom... 36 0.86 gi|10729308|gb|AE002665.2|AE002665 Drosophila melanogaster genom... 34 3.4 gi|10729264|gb|AE002615.2|AE002615 Drosophila melanogaster genom... 34 3.4 gi|7298233|gb|AE003648.1|AE003648 Drosophila melanogaster genomi... 34 3.4 gi|7297628|gb|AE003628.1|AE003628 Drosophila melanogaster genomi... 34 3.4 gi|10728546|gb|AE003447.2|AE003447 Drosophila melanogaster genom... 34 3.4 gi|7290819|gb|AE003441.1|AE003441 Drosophila melanogaster genomi... 34 3.4 gi|10728461|gb|AE003431.2|AE003431 Drosophila melanogaster genom... 34 3.4 gi|10728241|gb|AE003495.2|AE003495 Drosophila melanogaster genom... 34 3.4 gi|7292554|gb|AE003484.1|AE003484 Drosophila melanogaster genomi... 34 3.4 gi|10727872|gb|AE003525.2|AE003525 Drosophila melanogaster genom... 34 3.4 gi|10727399|gb|AE003587.2|AE003587 Drosophila melanogaster genom... 34 3.4 gi|10727114|gb|AE003673.2|AE003673 Drosophila melanogaster genom... 34 3.4 gi|10726705|gb|AE003740.2|AE003740 Drosophila melanogaster genom... 34 3.4 |
The above example shows how to search for the test input in a drosophila nucleotide database, and a snippet of the output file.
This section gives a very simple example of running BLAST through the provided batch system SGE.
Create a simple submission script called blast_sge.sh containing the following -
#!/bin/bash # #$ -cwd #$ -S /bin/bash #$ -j y export BLASTDB=$HOME/bio/ncbi/db/ export BLASTMAT=/opt/bio/ncbi/data/ /opt/bio/ncbi/bin/blastall -d drosoph.nt \ -p blastn -i $HOME/test.txt \ -o $HOME/result.txt |
Run
[nostromo@xxx ~]$ qsub blast_sge.sh Your job 10 ("blast_sge.sh") has been submitted |
The output of the Blast job is similar to the one given above and will be stored in $HOME/result.txt
For further information about BLAST and its usage, please refer to the following sources
THE NCBI Blast website - http://www.ncbi.nlm.nih.gov/BLAST/
BLAST Help page on your cluster BLAST Help Page
BLAST Program selection Guide - http://www.ncbi.nlm.nih.gov/blast/BLAST_guide.pdf