BLAST, the Basic Local Alignment Search Tool, is a collection of tools for finding regions of local similarity between sequences. It compares nucleotide or protein sequences against sequence databases and calculates the statistical significance of the matches. The suite is released free to the public by the National Center for Biotechnology Information (NCBI).
BLAST can be used for protein-protein or nucleotide-nucleotide comparisons. Before presenting a usage example, we must first define some environment variables.
$BLASTDB - This variable points to the BLAST database directory. It is set to /share/bio/ncbi/db/, which should contain the databases that you want to search. By default, BLAST checks this location and the current working directory for databases. The variable is set at login by the system login scripts, and a user may change it in her startup scripts to point to her preferred location.
$BLASTMAT - This variable points to the location of the BLAST scoring matrices. It is set to /opt/Bio/ncbi/data. Again, it may be changed on a per-user basis to point to a desired location.
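As a minimal sketch of the per-user override described above, the lines below could be placed in a startup script. The choice of ~/.bashrc is an assumption; the paths are the site defaults quoted above and are used only when the variables are not already set.

```shell
# Sketch for a startup script (assumed here to be ~/.bashrc).
# Keep any value already set at login; otherwise fall back to the
# site defaults quoted in this document.
export BLASTDB="${BLASTDB:-/share/bio/ncbi/db/}"
export BLASTMAT="${BLASTMAT:-/opt/Bio/ncbi/data}"

# Show the values that BLAST will see.
echo "BLASTDB=$BLASTDB"
echo "BLASTMAT=$BLASTMAT"
```

To point at a personal database directory instead, replace the fallback path with your own location.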
BLAST requires two datasets: the input sequence that you want to search for, and the database that you want to search against.
Use the following procedure to run BLAST:
Download the BLAST database that you want to search against. The databases can be obtained from the NCBI FTP site at ftp://ftp.ncbi.nlm.nih.gov/blast/db/. Note that the databases available here are preformatted. Unformatted databases can be obtained in FASTA format at ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/. The databases may also be obtained by running the /opt/Bio/ncbi/doc/blast/update_blastdb.pl script. Run the script without any parameters to view its usage.
Note that it is recommended that the BLAST databases be downloaded to the $BLASTDB location. As not everybody has write access to this location, a separate user called biouser has been created who can write to it. Users of the system may switch to this user and download a database with the following commands:
[nostromo@xxx ~]$ sudo su - biouser
-bash-3.00$ cd $BLASTDB
-bash-3.00$ /opt/Bio/ncbi/doc/blast/update_blastdb.pl --showall
Connected to NCBI
env_nr env_nt est est_human est_mouse est_others gss htgs human_genomic nr nt other_genomic pataa patnt pdbaa pdbnt refseq_genomic refseq_protein refseq_rna sts swissprot taxdb wgs
-bash-3.00$ /opt/Bio/ncbi/doc/blast/update_blastdb.pl patnt
Connected to NCBI
Downloading patnt.tar.gz... done.
-bash-3.1$ tar xzf patnt.tar.gz
The above method downloads preformatted databases. You can also download unformatted databases from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/. If you used the update_blastdb.pl script or downloaded the databases from ftp://ftp.ncbi.nlm.nih.gov/blast/db/, then DO NOT run formatdb.
For an unformatted database, run the formatdb command to convert it to the BLAST format. For this example, we'll use the Drosophila melanogaster (fruit fly) nucleotide database:
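Before downloading into $BLASTDB, it may help to check whether your own account can write there or whether you need to switch to biouser first. The sketch below is one way to do that; can_write_dir is a hypothetical helper name, not part of BLAST, and the fallback path is the site default quoted earlier.

```shell
# can_write_dir is a hypothetical helper, not part of BLAST.
# It succeeds when the argument is an existing directory that the
# current user can write to.
can_write_dir() {
    [ -d "$1" ] && [ -w "$1" ]
}

# Fall back to the site default if BLASTDB is not set.
db_dir="${BLASTDB:-/share/bio/ncbi/db/}"

if can_write_dir "$db_dir"; then
    echo "You can download databases into $db_dir directly."
else
    echo "No write access to $db_dir; switch to biouser first."
fi
```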
-bash-3.1$ wget -q ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/drosoph.nt.gz
-bash-3.1$ gunzip drosoph.nt.gz
-bash-3.1$ formatdb -p F -V T -i drosoph.nt
-bash-3.1$ ls drosoph.nt*
drosoph.nt  drosoph.nt.nhr  drosoph.nt.nin  drosoph.nt.nsq
-bash-3.1$
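Since formatdb must not be run on a database that is already formatted, a quick check for the index files can be useful. The sketch below looks for the three files that formatdb produced in the listing above (.nhr, .nin, .nsq). has_nucl_index is a hypothetical helper name, and the check assumes a single-volume nucleotide database; larger preformatted databases may be split into volumes with different file names.

```shell
# has_nucl_index is a hypothetical helper, not a BLAST command.
# It succeeds when the three nucleotide index files shown in the
# listing above (.nhr, .nin, .nsq) all exist for the given database.
has_nucl_index() {
    for ext in nhr nin nsq; do
        [ -f "$1.$ext" ] || return 1
    done
    return 0
}

db="drosoph.nt"
if has_nucl_index "$db"; then
    echo "$db is already formatted; skip formatdb."
else
    echo "$db has no index files; run: formatdb -p F -V T -i $db"
fi
```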
After the database is formatted, create a test input file in FASTA format. Type or paste the sequence below, then press Ctrl-D to end the input.
[nostromo@xxx ~]$ cat > test.txt
>Test
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
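A quick sanity check on the input file is to count its sequence letters, since BLAST reports that count for the query (560 letters for the test sequence in this document). The sketch below does this with awk; fasta_letters is a hypothetical helper name, and it simply ignores header lines that start with ">" and any whitespace inside the sequence.

```shell
# fasta_letters is a hypothetical helper, not part of BLAST.
# It prints the number of sequence letters in a FASTA file,
# skipping header lines (those starting with ">") and whitespace.
fasta_letters() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Report on test.txt only if it exists in the current directory.
if [ -f test.txt ]; then
    echo "test.txt contains $(fasta_letters test.txt) letters"
fi
```

If the count differs from what you expect, the file was probably truncated or pasted incompletely.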
Run the blastall program on the test input against the formatted database. To list all the options that blastall accepts, run:
[nostromo@xxx ~]$ blastall --help
Then search the database with the test input:
[nostromo@xxx ~]$ blastall -d drosoph.nt -p blastn -i test.txt
BLASTN 2.2.18 [Mar-02-2008]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query= Test
         (560 letters)

Database: drosoph.nt
           1170 sequences; 122,655,632 total letters

Searching..................................................done

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

gi|10729531|gb|AE002936.2|AE002936 Drosophila melanogaster genom...    36   0.86
gi|10728232|gb|AE003493.2|AE003493 Drosophila melanogaster genom...    36   0.86
gi|10726497|gb|AE003698.2|AE003698 Drosophila melanogaster genom...    36   0.86
gi|10726398|gb|AE003681.2|AE003681 Drosophila melanogaster genom...    36   0.86
gi|10729308|gb|AE002665.2|AE002665 Drosophila melanogaster genom...    34   3.4
gi|10729264|gb|AE002615.2|AE002615 Drosophila melanogaster genom...    34   3.4
gi|7298233|gb|AE003648.1|AE003648 Drosophila melanogaster genomi...    34   3.4
gi|7297628|gb|AE003628.1|AE003628 Drosophila melanogaster genomi...    34   3.4
gi|10728546|gb|AE003447.2|AE003447 Drosophila melanogaster genom...    34   3.4
gi|7290819|gb|AE003441.1|AE003441 Drosophila melanogaster genomi...    34   3.4
gi|10728461|gb|AE003431.2|AE003431 Drosophila melanogaster genom...    34   3.4
gi|10728241|gb|AE003495.2|AE003495 Drosophila melanogaster genom...    34   3.4
gi|7292554|gb|AE003484.1|AE003484 Drosophila melanogaster genomi...    34   3.4
gi|10727872|gb|AE003525.2|AE003525 Drosophila melanogaster genom...    34   3.4
gi|10727399|gb|AE003587.2|AE003587 Drosophila melanogaster genom...    34   3.4
gi|10727114|gb|AE003673.2|AE003673 Drosophila melanogaster genom...    34   3.4
gi|10726705|gb|AE003740.2|AE003740 Drosophila melanogaster genom...    34   3.4
The above example shows how to search for the test input in a Drosophila nucleotide database, along with a snippet of the output.
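The hit summary in a report like this one lists a bit score and an E value at the end of each line, so it can be filtered with a short awk script. The sketch below keeps only hits at or below a chosen E value. significant_hits and the file name blast_report.txt are hypothetical, and the script assumes hit lines start with "gi|" as in the output shown here.

```shell
# significant_hits is a hypothetical helper, not a BLAST option.
# Usage: significant_hits MAX_EVALUE < saved_report
# Prints "identifier score evalue" for each summary line that starts
# with "gi|" and whose E value (the last field) is <= MAX_EVALUE.
significant_hits() {
    awk -v max="$1" '
        /^gi[|]/ {
            evalue = $NF
            # Some reports may abbreviate tiny values as "e-105";
            # prefix a 1 so awk can read them as numbers.
            if (evalue ~ /^e/) evalue = "1" evalue
            if (evalue + 0 <= max + 0) print $1, $(NF - 1), $NF
        }'
}

# blast_report.txt is a hypothetical file holding saved blastall output.
report="blast_report.txt"
if [ -f "$report" ]; then
    significant_hits 1.0 < "$report"
fi
```

With a threshold of 1.0, only the hits with an E value of 0.86 in the report above would be kept.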
This section gives a very simple example of running BLAST through the provided batch system, SGE.
Create a simple submission script called blast_sge.sh containing the following:
#!/bin/bash
#
#$ -cwd
#$ -S /bin/bash
#$ -j y
export BLASTDB=/share/bio/ncbi/db/
export BLASTMAT=/opt/Bio/ncbi/data/
export PATH=$PATH:/opt/Bio/ncbi/bin
blastall -d drosoph.nt -p blastn -i $HOME/test.txt -o $HOME/result.txt
Submit the script with qsub:
[nostromo@xxx ~]$ qsub blast_sge.sh
Your job 10 ("blast_sge.sh") has been submitted
The output of the BLAST job is similar to the one given above and will be stored in $HOME/result.txt.
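If you submit many searches, it can be convenient to generate the submission script rather than edit it by hand. The sketch below writes a script with the same SGE directives and environment settings as blast_sge.sh, taking the database, query file and output file as arguments. write_blast_job is a hypothetical helper name; the blastall options are the ones used in this document.

```shell
# write_blast_job is a hypothetical helper, not part of BLAST or SGE.
# Usage: write_blast_job SCRIPT DATABASE QUERY OUTPUT
# It writes a submission script modelled on blast_sge.sh above.
write_blast_job() {
    cat > "$1" <<EOF
#!/bin/bash
#
#\$ -cwd
#\$ -S /bin/bash
#\$ -j y
export BLASTDB=/share/bio/ncbi/db/
export BLASTMAT=/opt/Bio/ncbi/data/
export PATH=\$PATH:/opt/Bio/ncbi/bin
blastall -d "$2" -p blastn -i "$3" -o "$4"
EOF
}

# Example (not run here):
#   write_blast_job blast_sge.sh drosoph.nt "$HOME/test.txt" "$HOME/result.txt"
#   qsub blast_sge.sh
```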
For further information about BLAST and its usage, please refer to the following sources:
The NCBI BLAST website - http://www.ncbi.nlm.nih.gov/BLAST/
The BLAST Help Page on your cluster
BLAST Program selection Guide - http://www.ncbi.nlm.nih.gov/blast/BLAST_guide.pdf