(Please refer to your own copy of Introduction to Bioinformatics, by Arthur M. Lesk. All the following examples can be found in his book.)
Example 1.1
Retrieve the amino acid sequence of horse pancreatic ribonuclease.
Use the ExPASy server at the Swiss Institute of Bioinformatics. The URL is: http://www.expasy.ch/cgi-bin/sprotsearch-ful
Type in the keywords horse pancreatic ribonuclease followed by the ENTER key. Select RNP_HORSE and then FASTA format (see Box: FASTA format). This will produce the following (the first line has been truncated):
>sp|P00674|RNP_HORSE RIBONUCLEASE PANCREATIC (EC 3.1.27.5) (RNASE 1) ...
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEP
LADVQAICLQKNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTS
QKERHIIVACEGNPYVPVHFDASVEVST
which can be cut and pasted into other programs.
For example, we could retrieve several sequences and align them (see Box: Sequence Alignment). Patterns of similarity among aligned sequences are useful in assessing closeness of relationships.
FASTA format
A very common format for sequence data is derived from conventions of FASTA, a program for FAST
Alignment by W.R. Pearson. Many programs use FASTA format for reading sequences, or for reporting
results.
A sequence in FASTA format:
Begins with a single-line description. A > must appear in the first column. The rest of the title line is
arbitrary but should be informative.
Subsequent lines contain the sequence, one character per residue.
Use one-letter codes for nucleotides or amino acids specified by the International Union of
Biochemistry and International Union of Pure and Applied Chemistry (IUB/IUPAC).
See http://www.chem.qmw.ac.uk/iupac/misc/naabb.html
and http://www.chem.qmw.ac.uk/iupac/AminoAcid/
Use Sec and U as the three-letter and one-letter codes for selenocysteine:
http://www.chem.qmw.ac.uk/iubmb/newsletter/1999/item3.html
Lines can have different lengths; that is, 'ragged right' margins.
Most programs will accept lower case letters as amino acid codes.
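The rules above can be sketched as a small reader. This is only an illustration under stated assumptions: the function name read_fasta and the choice to convert sequences to upper case are ours, not part of any FASTA specification, and the record used is the truncated horse ribonuclease entry from Example 1.1.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Record copied from Example 1.1 (the title line is truncated in the text).
EXAMPLE = """>sp|P00674|RNP_HORSE RIBONUCLEASE PANCREATIC (EC 3.1.27.5) (RNASE 1) ...
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEP
LADVQAICLQKNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTS
QKERHIIVACEGNPYVPVHFDASVEVST
"""


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (title, sequence) pairs from FASTA-formatted lines.

    Hypothetical helper: a '>' in column 1 starts a record, and all
    following lines (of any length) are joined into one sequence.
    """
    title: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):          # '>' must be in the first column
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:].strip(), []
        elif title is not None and line.strip():
            # Remove internal whitespace; accept lower-case residue codes.
            chunks.append("".join(line.split()).upper())
    if title is not None:
        yield title, "".join(chunks)


if __name__ == "__main__":
    for title, sequence in read_fasta(EXAMPLE.splitlines()):
        print(title)
        print(len(sequence), "residues")
```

Joining the sequence lines before use is what makes the 'ragged right' margins harmless: the line breaks carry no information.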
An example of FASTA format: Bovine glutathione peroxidase
>gi|121664|sp|P00435|GSHC_BOVIN GLUTATHIONE PEROXIDASE
MCAAQRSAAALAAAAPRTVYAFSARPLAGGEPFNLSSLRGKVLLIENVASLUGTTVRDYTQMNDLQRRLG
PRGLVVLGFPCNQFGHQENAKNEEILNCLKYVRPGGGFEPNFMLFEKCEVNGEKAHPLFAFLREVLPTPS
DDATALMTDPKFITWSPVCRNDVSWNFEKFLVGPDGVPVRRYSRRFLTIDIEPDIETLLSQGASA
The title line contains the following fields:
> is obligatory in column 1
gi|121664 is the geninfo number, an identifier assigned by the US National Center for Biotechnology
Information (NCBI) to every sequence in its ENTREZ databank. The NCBI collects sequences from a variety of sources, including primary archival data collections and patent applications. Its gi numbers provide a common and consistent 'umbrella' identifier, superimposed on different conventions of source databases.
When a source database updates an entry, the NCBI creates a new entry with a new gi number if the
changes affect the sequence, but updates and retains its entry if the changes affect only non-sequence
information, such as a literature citation.
sp|P00435 indicates that the source database was SWISS-PROT, and that the accession number of the
entry in SWISS-PROT was P00435.
GSHC_BOVIN GLUTATHIONE PEROXIDASE is the SWISS-PROT identifier of sequence and species,
(GSHC_BOVIN), followed by the name of the molecule.
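As a rough sketch, the fields of such a title line can be pulled apart by splitting on the | character. The function name parse_title and the dictionary keys are hypothetical, and the code assumes only the two tags discussed here (gi and sp); title lines from other databases use other tags and would need more cases.

```python
from typing import Dict


def parse_title(title: str) -> Dict[str, str]:
    """Split a FASTA title line into the fields described in the text.

    Assumption: the identifier is a run of tag|value pairs (only 'gi'
    and 'sp' are handled), optionally followed by an entry name, and
    the first blank separates it from the free-text description.
    """
    if not title.startswith(">"):
        raise ValueError("a FASTA title line must begin with '>'")
    identifier, _, description = title[1:].partition(" ")
    parts = identifier.split("|")
    fields: Dict[str, str] = {"description": description.strip()}
    i = 0
    while i + 1 < len(parts) and parts[i] in ("gi", "sp"):
        fields[parts[i]] = parts[i + 1]   # e.g. fields["sp"] = "P00435"
        i += 2
    if i < len(parts):
        fields["entry_name"] = parts[i]   # e.g. GSHC_BOVIN
    return fields


if __name__ == "__main__":
    line = ">gi|121664|sp|P00435|GSHC_BOVIN GLUTATHIONE PEROXIDASE"
    for key, value in parse_title(line).items():
        print(key, "=", value)
```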
Sequence alignment
Sequence alignment is the assignment of residue-residue correspondences. We may wish to find:
a Global match: align all of one sequence with all of the other.
And.--so,.from.hour.to.hour,.we.ripe.and.ripe
||||    |||||||||||||||||||||||||   ||||||
And.then,.from.hour.to.hour,.we.rot-.and.rot-
This illustrates mismatches, insertions and deletions.
a Local match: find a region in one sequence that matches a region of the other.
  My.care.is.loss.of.care,.by.old.care.done,
    |||||||||    |||||||||||||   |||||| ||
Your.care.is.gain.of.care,.by.new.care.won
For local matching, overhangs at the ends are not treated as gaps. In addition to mismatches, seen in
this example, insertions and deletions within the matched region are also possible.
a Motif match: find matches of a short sequence in one or more regions internal to a long one. In this case one mismatching character is allowed. Alternatively one could demand perfect matches, or allow more mismatches or even gaps.
        match
         ||||
for the watch to babble and to talk is most tolerable
or:
                               match
                                ||||
Any thing that's mended is but patched: virtue that transgresses is
    match                                        match
     ||||                                         ||||
but patched with sin; and sin that amends is but patched with virtue
a Multiple alignment: a mutual alignment of many sequences.
no.sooner.---met.---------but.they.-look'd
no.sooner.look'd.---------but.they.-lo-v'd
no.sooner.lo-v'd.---------but.they.-sigh'd
no.sooner.sigh'd.---------but.they.--asked.one.another.the.reason
no.sooner.knew.the.reason.but.they.-------------sought.the.remedy
no.sooner.               .but.they.
The last line shows characters conserved in all sequences in the alignment.
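Of the four kinds of match, the motif match is the simplest to express in code. The sketch below slides the motif along the text and reports every start position where at most a given number of characters differ. The function name find_motif and its max_mismatches parameter are illustrative choices, and gaps are not handled.

```python
from typing import List


def find_motif(text: str, motif: str, max_mismatches: int = 1) -> List[int]:
    """Return 0-based start positions where motif matches text with at
    most max_mismatches mismatching characters (no gaps allowed)."""
    hits: List[int] = []
    width = len(motif)
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


if __name__ == "__main__":
    line1 = "for the watch to babble and to talk is most tolerable"
    line2 = "but patched with sin; and sin that amends is but patched with virtue"
    print(find_motif(line1, "match"))                     # 'watch'
    print(find_motif(line2, "match"))                     # both 'patch'
    print(find_motif(line1, "match", max_mismatches=0))   # perfect matches only
```

Setting max_mismatches to 0 demands perfect matches, and larger values allow more mismatches, as described above. Allowing gaps requires a dynamic-programming alignment rather than a sliding window.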
Example 1.2
Determine, from the sequences of pancreatic ribonuclease from horse (Equus caballus), minke whale (Balaenoptera acutorostrata) and red kangaroo (Macropus rufus), which two of these species are most closely related.
Knowing that horse and whale are placental mammals and kangaroo is a marsupial, we expect horse and whale to be the closest pair. Retrieving the three sequences as in the previous example and pasting the following:
>RNP_HORSE
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEP
LADVQAICLQKNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTS
QKERHIIVACEGNPYVPVHFDASVEVST
>RNP_BALAC
RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHES
LEDVKAVCSQKNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTS
QKEKHIIVACEGNPYVPVHFDNSV
>RNP_MACRU
ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPK
SVVDAVCHQENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSN
LNKQIIVACEGQYVPVHFDAYV
into the multiple-sequence alignment program CLUSTAL-W http://www.ebi.ac.uk/clustalw/
(or alternatively, T-coffee: http://www.ch.embnet.org/software/TCoffee.html)
produces the following:
CLUSTAL W (1.8) multiple sequence alignment
RNP_HORSE KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ 60
RNP_BALAC RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ 60
RNP_MACRU -ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ 59
           *:** **:*****:  :......*** **  *.**.* ***:***:**.   *.*:* *
RNP_HORSE KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF 120
RNP_BALAC KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF 120
RNP_MACRU ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF 118
          :*: ****::***:*.* : **:** *..****** *:**: :::*******  ******
RNP_HORSE DASVEVST 128
RNP_BALAC DNSV---- 124
RNP_MACRU DAYV---- 122
          *  *
In this table, an * under the sequences indicates a position that is conserved (the same in all sequences), and : and . indicate positions at which all sequences contain residues of very similar physicochemical character (:), or somewhat similar physicochemical character (.).
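The marker line can be recomputed column by column. The sketch below is not CLUSTAL-W's own code: the "strong" and "weak" residue groups are the ones listed, as far as we recall, in the Clustal documentation, and should be checked against the version of the program you use. The names column_symbol and conservation_line are ours.

```python
from typing import Sequence

# Residue groups treated as 'very similar' (:) and 'somewhat similar' (.).
# Assumption: these are the groups documented for Clustal W; verify them
# against your own copy of the program's documentation.
STRONG = ("STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW")
WEAK = ("CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
        "NDEQHK", "NEQHRK", "FVLIM", "HFY")


def column_symbol(column: str) -> str:
    """Return '*', ':', '.' or ' ' for one column of an alignment."""
    residues = set(column)
    if "-" in residues:                       # any gap: no marker
        return " "
    if len(residues) == 1:                    # identical in all sequences
        return "*"
    if any(residues <= set(group) for group in STRONG):
        return ":"
    if any(residues <= set(group) for group in WEAK):
        return "."
    return " "


def conservation_line(rows: Sequence[str]) -> str:
    """Build the marker line for equal-length aligned rows."""
    return "".join(column_symbol("".join(col)) for col in zip(*rows))


if __name__ == "__main__":
    # Last block of the ribonuclease alignment above.
    block = ["DASVEVST", "DNSV----", "DAYV----"]
    for row in block:
        print(row)
    print(conservation_line(block))
```

For the last block this marks only the D and the V as conserved, in agreement with the CLUSTAL-W output.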
Large patches of the sequences are identical. There are numerous substitutions but only one internal deletion.
Comparing the sequences in pairs, the number of identical residues shared by each pair in this alignment (not the same as counting *s) is:
Number of identical residues in aligned Ribonuclease A sequences (out of a total of 122–128 residues)
Horse and minke whale           95
Minke whale and red kangaroo    82
Horse and red kangaroo          75
Horse and whale share the most identical residues. The result appears significant, and therefore confirms our expectations.
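The pairwise counts can be reproduced directly from the aligned rows. In the sketch below the three strings are the rows of the CLUSTAL-W alignment above, joined across blocks with their gap characters kept. The function name count_identities is ours, and a column in which both sequences have a gap is not counted as a match.

```python
from itertools import combinations

# Aligned rows from the CLUSTAL-W output above, one string per species.
ALIGNED = {
    "horse": ("KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ"
              "KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF"
              "DASVEVST"),
    "minke whale": ("RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ"
                    "KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF"
                    "DNSV----"),
    "red kangaroo": ("-ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ"
                     "ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF"
                     "DAYV----"),
}


def count_identities(a: str, b: str) -> int:
    """Count aligned columns where both rows carry the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


if __name__ == "__main__":
    for (name1, row1), (name2, row2) in combinations(ALIGNED.items(), 2):
        print(f"{name1} / {name2}: {count_identities(row1, row2)}")
```

Counting by hand over the same rows gives 95, 75 and 82 for horse/whale, horse/kangaroo and whale/kangaroo, matching the table. Note that such raw counts depend on the alignment supplied: a different placement of gaps would change them.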
Warning: Or is the logic really the other way round?
Let's try a harder one:
Example 1.3
The two living genera of elephant are represented by the African elephant (Loxodonta africana) and the Indian elephant (Elephas maximus). It has been possible to sequence the mitochondrial cytochrome b from a specimen of the Siberian woolly mammoth (Mammuthus primigenius) preserved in the Arctic permafrost. To which modern elephant is this mammoth more closely related?
Retrieving the sequences and running CLUSTAL-W:
African elephant MTHIRKSHPLLKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
Siberian mammoth MTHIRKSHPLLKILNKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
Indian elephant  MTHTRKSHPLFKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
                 *** ******:**:**********************************************
African elephant TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
Siberian mammoth TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
Indian elephant  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
                 ************************************************************
African elephant LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPCIGTNLVEWIWGGFSVDKATLNRFFA 180
Siberian mammoth LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFA 180
Indian elephant  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
                 ********************************** ***:*********************
African elephant LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEYPYIIIGQMASILYFS 360
Siberian mammoth LGIMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEYPYIIIGQMASILYFS 360
Indian elephant  LGLMPFLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFS 360
                 **:**:**********************: *** ************* ************
African elephant IILAFLPIAGMIENYLIK 378
Siberian mammoth IILAFLPIAGMIENYLIK 378
Indian elephant  IILAFLPIAGMIENYLIK 378
                 **********:*******
The mammoth and African elephant sequences have 10 mismatches, and the mammoth and Indian elephant sequences have 14 mismatches. It appears that mammoth is more closely related to African elephants.
However, this result is less satisfying than the previous one. There are fewer differences. Are they significant? (It is harder to decide whether the differences are significant because we have no preconceived idea of what the answer should be.)
This example raises a number of questions:
1. We 'know' that African and Indian elephants and mammoths must be close relatives - just look
at them. But could we tell from these sequences alone that they are from closely related
species?
2. Given that the differences are small, do they represent evolutionary divergence arising from
selection, or merely random noise or drift? We need sensitive statistical criteria for judging the
significance of the similarities and differences.
As background to such questions, let us emphasize the distinction between similarity and homology. Similarity is the observation or measurement of resemblance and difference, independent of the source of the resemblance. Homology means, specifically, that the sequences and the organisms in which they occur are descended from a common ancestor, with the implication that the similarities are shared ancestral characteristics. Similarity of sequences (or of macroscopic biological characters) is observable in data collectable now, and involves no historical hypotheses. In contrast, assertions of homology are statements of historical events that are almost always unobservable. Homology must be an inference from observations of similarity. Only in a few special cases is homology directly observable; for instance in family pedigrees showing unusual phenotypes such as the Hapsburg lip, or in laboratory populations, or in clinical studies that follow the course of viral infections at the sequence level in individual patients.
The assertion that the cytochromes b from African and Indian elephants and mammoths are homologous means that there was a common ancestor, presumably containing a unique cytochrome b, that by alternative mutations gave rise to the proteins of mammoths and modern elephants. Does the very high degree of similarity of the sequences justify the conclusion that they are homologous, or are there other explanations?
It might be that a functional cytochrome b requires so many conserved residues that cytochromes b
from all animals are as similar to one another as the elephant and mammoth proteins are. We
can test this by looking at cytochrome b sequences from other species. The result is that
cytochromes b from other animals differ substantially from those of elephants and mammoths.
A second possibility is that there are special requirements for a cytochrome b to function well in an
elephant-like animal, that the three cytochrome b sequences started out from independent
ancestors, and that common selective pressures forced them to become similar. (Remember that
we are asking what can be deduced from cytochrome b sequences alone.)
The mammoth may be more closely related to the African elephant, but since the time of the last
common ancestor the cytochrome b sequence of the Indian elephant has evolved faster than that
of the African elephant or the mammoth, accumulating more mutations.
Still a fourth hypothesis is that all common ancestors of elephants and mammoths had very
dissimilar cytochromes b, but that living elephants and mammoths gained a common gene by
transfer from an unrelated organism via a virus.
Suppose, however, we conclude that the similarity of the elephant and mammoth sequences is high enough to be evidence of homology. What then about the ribonuclease sequences in the previous example?
Are the larger differences among the pancreatic ribonucleases of horse, whale and kangaroo evidence that they are not homologues?
How can we answer these questions? Specialists have undertaken careful calibrations of sequence similarities and divergences, among many proteins from many species for which the taxonomic relationships have been worked out by classical methods. In the example of pancreatic ribonucleases, the reasoning from similarity to homology is justified. The question of whether mammoths are closer to African or Indian elephants is still too close to call, even using all available anatomical and sequence evidence. Analyses of sequence similarities are now sufficiently well established that they are considered the most reliable methods for establishing phylogenetic relationships, even though sometimes - as in the elephant example - the results may not be significant, while in other cases they even give incorrect answers. There are a lot of data available, effective tools for retrieving what is necessary to bring to bear on a specific question, and powerful analytic tools. None of this replaces the need for thoughtful scientific judgement.