BLAST, the Basic Local Alignment Search Tool, is a collection of programs for finding regions of local similarity between sequences. It compares nucleotide or protein sequences against sequence databases and calculates the statistical significance of the matches. The suite is released free to the public by the National Center for Biotechnology Information (NCBI).
BLAST can be used for protein-protein or nucleotide-nucleotide comparisons. Before presenting a usage example, we must first define some environment variables.
$BLASTDB - This variable points to the BLAST database directory. It is set to /share/bio/ncbi/db/. This directory should contain the databases that you want to search. By default, BLAST checks this location and the current working directory for the databases. The variable is set at login by the system login scripts, and a user may change it in her startup scripts to point to her preferred location.
$BLASTMAT - This variable points to the location of the BLAST scoring matrices. It is set to /opt/Bio/ncbi/data. Again, it may be changed to point to a desired location on a per-user basis.
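Before running a search it can help to confirm where these two variables actually point. The following is a minimal sketch under stated assumptions: check_blast_env is a hypothetical helper name, not something shipped with BLAST, and it only reports whether each variable is set and refers to an existing directory.

```shell
# check_blast_env is a hypothetical helper (not part of BLAST).
# For each variable it prints whether it is set and whether the
# directory it names exists.
check_blast_env() {
    for name in BLASTDB BLASTMAT; do
        # Read the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
        elif [ -d "$value" ]; then
            echo "$name=$value (directory exists)"
        else
            echo "$name=$value (directory missing)"
        fi
    done
}

check_blast_env
```

A per-user override would then be a line such as `export BLASTDB="$HOME/blastdb"` in a startup script, where `$HOME/blastdb` is only an illustrative path.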
BLAST requires two datasets: the input sequence that you want to search for, and the database that you want to search against.
Use the following procedure to run BLAST:
Download the BLAST database that you want to search against. The databases can be obtained from the NCBI FTP site at ftp://ftp.ncbi.nlm.nih.gov/blast/db/. Note that the databases available there are preformatted. Unformatted databases can be obtained in FASTA format at ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/. The databases may also be obtained by running the /opt/Bio/ncbi/doc/blast/update_blastdb.pl script. Run the script without any parameters to view its usage.
Note that it is recommended that the BLAST databases be downloaded to the $BLASTDB location. As not everybody has write access to this location, a separate user called biouser has been created who can write to it. Users of the system may switch to this user and download a database using the following commands.
[nostromo@rocks-168 ~]$ sudo su - biouser
-bash-3.00$ cd $BLASTDB
-bash-3.00$ /opt/Bio/ncbi/doc/blast/update_blastdb.pl --showall
Connected to NCBI
env_nr
env_nt
est
est_human
est_mouse
est_others
gss
htgs
human_genomic
nr
nt
other_genomic
pataa
patnt
pdbaa
pdbnt
refseq_genomic
refseq_protein
refseq_rna
sts
swissprot
taxdb
wgs
-bash-3.00$ /opt/Bio/ncbi/doc/blast/update_blastdb.pl patnt
Connected to NCBI
Downloading patnt.tar.gz... done.
-bash-3.00$ tar xzf patnt.tar.gz
This step is to be followed ONLY if you have downloaded unformatted databases. Run the formatdb command to format the database to the BLAST format.
-bash-3.00$ formatdb --help
gives you a list of all the available options for formatdb. Make sure you choose the right set of options depending on whether you are formatting a nucleotide database or a protein database.
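As a rough illustration of that choice, the sketch below prints the formatdb command line for a given FASTA file instead of running it. It assumes the legacy formatdb flags -i (input file), -p (T for protein, F for nucleotide) and -o T (parse sequence IDs); check them against the formatdb help output on your installation. formatdb_cmd and mydb.fasta are hypothetical names used only for this example.

```shell
# formatdb_cmd is a hypothetical helper: it only PRINTS the command,
# so it can be inspected before anything is run.
formatdb_cmd() {
    fasta=$1
    case $2 in
        nucl) is_protein=F ;;   # nucleotide database
        prot) is_protein=T ;;   # protein database
        *) echo "usage: formatdb_cmd FASTA nucl|prot" >&2; return 1 ;;
    esac
    echo "formatdb -i $fasta -p $is_protein -o T"
}

# mydb.fasta is a placeholder file name.
formatdb_cmd mydb.fasta nucl    # prints: formatdb -i mydb.fasta -p F -o T
```

Printing the command first is a deliberate choice here: formatting a database with the wrong sequence type produces files that the matching BLAST program cannot use, so it is worth reading the command before running it.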
Create a test input file.
[nostromo@rocks-168 ~]$ cat > test.txt
>Test
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
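A quick sanity check on the input file is to count its sequence letters, since BLAST reports the query length at the top of its output (560 letters for the Test sequence in the sample run in this document). The following is a minimal sketch; fasta_letters is a hypothetical helper, and the small stand-in file is made up for illustration.

```shell
# fasta_letters is a hypothetical helper (not part of BLAST).
# It sums the lengths of all non-header lines, ignoring whitespace.
# Note: for a file with several records it returns the combined total.
fasta_letters() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Stand-in input for illustration; in practice pass test.txt instead.
sample=$(mktemp)
printf '>Example\nAGCTTTTCAT\nTCTGACTGCA\n' > "$sample"
fasta_letters "$sample"    # prints 20
rm -f "$sample"
```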
Run the blastall program on the test input against the downloaded database.
[nostromo@rocks-168 ~]$ blastall --help
gives a list of all the options that you can use to run the blastall program.
[nostromo@rocks-168 ~]$ blastall -d patnt -p blastn -i test.txt -o result.txt
[nostromo@rocks-168 ~]$ cat result.txt
BLASTN 2.2.13 [Nov-27-2005]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query= Test
         (560 letters)

Database: Nucleotide sequences derived from the Patent division of GenBank
           2,738,673 sequences; 1,647,221,730 total letters

Searching..................................................done

                                                                  Score    E
Sequences producing significant alignments:                       (bits) Value

emb|CS104136.1| Sequence 1 from Patent WO2005049808                  589   e-166
dbj|DD171864.1| Method of producing amino acid by fermentation       589   e-166
dbj|BD179435.1| Method and apparatus for recording sequential da...  589   e-166
dbj|BD179434.1| Method and apparatus for recording sequential da...  589   e-166
dbj|BD131253.1| Recording method and apparatus of sequence infor...  589   e-166
dbj|BD131254.1| Recording method and apparatus of sequence infor...  589   e-166
dbj|BD103218.1| Method and apparatus for recording information o...  589   e-166
dbj|BD103217.1| Method and apparatus for recording information o...  589   e-166
gb|AR384840.1| Sequence 1569 from patent US 6610836                  262   1e-67
gb|AR384989.1| Sequence 1718 from patent US 6610836                  196   5e-48
dbj|E38337.1| Process for producing L-methionine by fermentation     121   2e-25
The example above searches the patent nucleotide database (patnt) for the test input and shows a snippet of the output file.
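The hit summary in this kind of report can be pulled out with standard tools. The sketch below assumes the default pairwise report layout (a "Sequences producing significant alignments" header followed by one hit per line, with the alignments section starting at the first line beginning with ">"). list_hits is a hypothetical helper, not a BLAST program. It also rewrites E values that BLAST abbreviates as "e-166" into "1e-166" so that other tools can read them as numbers.

```shell
# list_hits is a hypothetical helper (not part of BLAST).
# It prints "identifier score evalue" for each line of the hit summary.
list_hits() {
    awk '
        /^Sequences producing significant alignments/ { in_hits = 1; next }
        in_hits && /^>/ { exit }        # the alignments section begins here
        in_hits && NF >= 3 {
            e = $NF
            if (e ~ /^e/) e = "1" e     # "e-166" becomes "1e-166"
            print $1, $(NF - 1), e
        }
    ' "$1"
}

# Only runs if a report named result.txt is present in the current directory.
if [ -f result.txt ]; then
    list_hits result.txt
fi
```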
For further information about BLAST and its usage, please refer to the following sources:
The NCBI BLAST website - http://www.ncbi.nlm.nih.gov/BLAST/
BLAST Help page on your cluster - BLAST Help Page
BLAST Program Selection Guide - http://www.ncbi.nlm.nih.gov/blast/BLAST_guide.pdf