NCBI-PSI-BLAST and EBI CLUSTALW

Web sites:

Summary:

Procedure for obtaining a multiple sequence alignment from a query sequence in FASTA format, using only web-based tools.

Procedure:

Use NCBI Entrez/Protein, RCSB, or another protein database of your choice to find the desired sequence.

Display it in FASTA format.

Use it as the query on the PSI-BLAST page.  Continue iterating until no "New" sequences are detected (that is, until convergence).  In the formatting output options, specify "gi number output".

Copy/paste the headers for the desired sequences into a new file: filename.gin

Use the following command to extract the gi identifier numbers from that file:

gi_parse.pl filename.gin | grep -v codes > filename.gib

Note that the suffixes mean:  gin = "gi numbers with full headers" and gib = "gi numbers isolated for batch query".  The original Perl script prints the number of lines in the file below the generated list of numbers, hence the pipe to "grep -v codes", which filters out that summary line; the remaining output (a column of unique gi numbers) is redirected to filename.gib.
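The extraction step can be sketched in Python for readers who do not have gi_parse.pl.  This is not the original Perl script, only a minimal sketch that assumes the copied headers carry identifiers in the classic "gi|<number>|" form; the function names extract_gi_numbers and write_gib are hypothetical.

```python
import re

# Assumption: identifiers appear as "gi|<digits>" in each copied header line.
GI_PATTERN = re.compile(r"\bgi\|(\d+)")


def extract_gi_numbers(lines):
    """Return the unique gi numbers found in lines, in order of first appearance."""
    seen = set()
    ordered = []
    for line in lines:
        for gi in GI_PATTERN.findall(line):
            if gi not in seen:
                seen.add(gi)
                ordered.append(gi)
    return ordered


def write_gib(gin_path, gib_path):
    """Read headers from gin_path and write one gi number per line to gib_path."""
    with open(gin_path) as src:
        gis = extract_gi_numbers(src)
    with open(gib_path, "w") as dst:
        for gi in gis:
            dst.write(gi + "\n")
    return len(gis)
```

Calling write_gib("filename.gin", "filename.gib") would produce the same kind of single-column list described above, with no summary line to filter out.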

Now go to http://www.ncbi.nlm.nih.gov:80/Entrez/batch.html and upload filename.gib as the query to retrieve all of the sequences in FASTA (or another specified) format.  Save the result as filename.fst (FASTA-format list).
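Before moving on, it can be worth checking that every requested gi number actually came back in the download.  The following is a minimal sketch, assuming the FASTA headers in filename.fst contain the identifiers as "gi|<number>|"; the helper names fasta_headers and missing_gis are hypothetical.

```python
import re


def fasta_headers(lines):
    """Return the FASTA header lines, without the leading '>'."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def missing_gis(requested, headers):
    """Return the requested gi numbers that appear in none of the headers."""
    found = set()
    for header in headers:
        # Assumption: headers embed identifiers as "gi|<digits>".
        found.update(re.findall(r"\bgi\|(\d+)", header))
    return [gi for gi in requested if gi not in found]
```

For example, reading the lines of filename.gib as the requested list and passing the lines of filename.fst through fasta_headers gives the two arguments; an empty result means every sequence was retrieved.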

Consider alternate output formats.  For example, select Format: GenBank/GenPept and deselect the Html checkbox.  Submit the query, and save the output as filename.gbp.  This file can be searched with grep for interesting results.  For example, to get an alphabetically sorted list of the different organisms found in the query, use:

grep ORGANISM filename.gbp | sort -k 2 > filename.org
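The same list can be built in Python, which also makes it easy to drop repeated organisms.  This is a minimal sketch, assuming each record in filename.gbp has a line of the usual flat-file form "  ORGANISM  <name>"; like the grep command, it reads only that line and ignores the taxonomy lines that follow.  The function name organisms is hypothetical.

```python
def organisms(lines):
    """Return the distinct organism names from GenPept ORGANISM lines, sorted alphabetically."""
    names = set()
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("ORGANISM"):
            name = stripped[len("ORGANISM"):].strip()
            if name:
                names.add(name)
    # Case-insensitive sort so that capitalization does not affect the order.
    return sorted(names, key=str.lower)
```

Passing the lines of filename.gbp to organisms and writing the result one name per line would give a de-duplicated version of filename.org.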

The FASTA-format sequence file (filename.fst) can be uploaded to the EBI ClustalW server, which also interfaces with a nice sequence viewer/editor (Jalview).  Another useful site is the secondary structure prediction server JPred2 (from the Barton Group), which takes a single query sequence, performs a PSI-BLAST search, and generates a consensus secondary structure prediction after comparing predictions from multiple algorithms.