Multiple sequence alignments

2. An example of a multiple alignment

Let us use the cellulose-binding domain of cellobiohydrolase I (CBD-CBH1) as an example of what one may do with a multiple sequence alignment. This is a small (about 30-35 residues) disulfide-bonded domain of known 3D structure (PDB code 1CBH). Homologous domains can be found in a number of other cellulose-degrading enzymes. One proposed function of the domain is to bind with high affinity to the cellulose fiber, allowing the adjacent enzymatic domain to hydrolyse the cellulose. Another possibility is that the CBD wedges itself in between cellulose chains, making it easier for the enzymatic domain to attack the fiber.

The multiple alignment of these sequences is taken from the Pfam database (entry identifier CBM_1, formerly CBD_1; accession code PF00734). Shown below is the so-called seed alignment, containing the sequences the Pfam curators have used to define the family. This is just a part of the complete alignment file; some comments have been removed. For each sequence, the SWISS-PROT identifier and the position in the parent protein are given on the left. The top two lines show the position numbers using the numbering scheme of the 1CBH 3D structure. The bottom line shows the consensus, which we define here as the same amino-acid residue type in 14 or more sequences (out of 18). Please note that this definition of consensus is just one of many possible.

                                   1            2         3
                             45678901...234567890123456789012

GUX1_TRIRE/481-509           HYGQCGGI...GYSGPTVCASGTTCQVLNPYY
GUN1_TRIRE/427-455           HWGQCGGI...GYSGCKTCTSGTTCQYSNDYY
GUX1_PHACH/484-512           QWGQCGGI...GYTGSTTCASPYTCHVLNPYY
GUN2_TRIRE/25-53             VWGQCGGI...GWSGPTNCAPGSACSTLNPYY
GUX2_TRIRE/30-58             VWGQCGGQ...NWSGPTCCASGSTCVYSNDYY
GUN5_TRIRE/209-237           LYGQCGGA...GWTGPTTCQAPGTCKVQNQWY
GUNF_FUSOX/21-49             IWGQCGGN...GWTGATTCASGLKCEKINDWY
GUX3_AGABI/24-52             VWGQCGGN...GWTGPTTCASGSTCVKQNDFY
GUX1_PENJA/505-533           DWAQCGGN...GWTGPTTCVSPYTCTKQNDWY
GUXC_FUSOX/482-510           QWGQCGGQ...NYSGPTTCKSPFTCKKINDFY
GUX1_HUMGR/493-521           RWQQCGGI...GFTGPTQCEEPYICTKLNDWY
GUX1_NEUCR/484-512           HWAQCGGI...GFSGPTTCPEPYTCAKDHDIY
PSBP_PORPU/26-54             LYEQCGGI...GFDGVTCCSEGLMCMKMGPYY
GUNB_FUSOX/29-57             VWAQCGGQ...NWSGTPCCTSGNKCVKLNDFY
PSBP_PORPU/69-97             PYGQCGGM...NYSGKTMCSPGFKCVELNEFF
GUNK_FUSOX/339-370           AYYQCGGSKSAYPNGNLACATGSKCVKQNEYY
PSBP_PORPU/172-200           RYAQCGGM...GYMGSTMCVGGYKCMAISEGS
PSBP_PORPU/128-156           EYAACGGE...MFMGAKCCKFGLVCYETSGKW

consensus                    ...QCGG.......G.T.C.....C...N..Y
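The consensus definition given above (same residue type in at least a chosen number of sequences) can be sketched in a few lines of Python. This is only an illustration: the function name consensus and the parameter min_count are made up for this example, and gap characters are assumed never to count towards the consensus.

```python
from collections import Counter

# Aligned rows copied from the seed alignment above (identifiers omitted).
ALIGNMENT = """
HYGQCGGI...GYSGPTVCASGTTCQVLNPYY
HWGQCGGI...GYSGCKTCTSGTTCQYSNDYY
QWGQCGGI...GYTGSTTCASPYTCHVLNPYY
VWGQCGGI...GWSGPTNCAPGSACSTLNPYY
VWGQCGGQ...NWSGPTCCASGSTCVYSNDYY
LYGQCGGA...GWTGPTTCQAPGTCKVQNQWY
IWGQCGGN...GWTGATTCASGLKCEKINDWY
VWGQCGGN...GWTGPTTCASGSTCVKQNDFY
DWAQCGGN...GWTGPTTCVSPYTCTKQNDWY
QWGQCGGQ...NYSGPTTCKSPFTCKKINDFY
RWQQCGGI...GFTGPTQCEEPYICTKLNDWY
HWAQCGGI...GFSGPTTCPEPYTCAKDHDIY
LYEQCGGI...GFDGVTCCSEGLMCMKMGPYY
VWAQCGGQ...NWSGTPCCTSGNKCVKLNDFY
PYGQCGGM...NYSGKTMCSPGFKCVELNEFF
AYYQCGGSKSAYPNGNLACATGSKCVKQNEYY
RYAQCGGM...GYMGSTMCVGGYKCMAISEGS
EYAACGGE...MFMGAKCCKFGLVCYETSGKW
""".split()

GAP = "."


def consensus(rows, min_count):
    """Return one character per column: the most common residue if it
    occurs in at least min_count rows, otherwise '.'.
    Gap characters are ignored when counting (an assumption of this sketch)."""
    out = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != GAP)
        residue, n = counts.most_common(1)[0] if counts else (GAP, 0)
        out.append(residue if n >= min_count else GAP)
    return "".join(out)


print(consensus(ALIGNMENT, 14))  # the "14 or more out of 18" definition
print(consensus(ALIGNMENT, 17))  # a stricter definition, for comparison
```

Because the threshold is a parameter, it is easy to see how much the consensus depends on the definition chosen: raising min_count to 17 leaves only the Gln, Cys and Gly columns marked.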

It is somewhat difficult to see the patterns of conservation in this table. The positions that are nearly completely conserved are easy to spot, but with more varied patterns, it becomes more difficult. For example, in position 24 there is a threonine in 9 sequences, and lysine in 5.

There is usually a problem with the numbering scheme in multiple alignments: the column numbers of a multiple alignment are usually different from the residue numbering of any of the single sequences. (The terms "absolute numbers" versus "relative numbers" have been used to describe the difference.) It is therefore necessary to be very careful when using sequence numbers taken from a multiple alignment. For a few protein families (e.g. serine proteases of the trypsin family), a general numbering scheme has been adopted that most scientists in those fields use.
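The difference between the two kinds of numbers can be made concrete with a small Python sketch. It converts an alignment column into the residue number in the parent protein by counting the non-gap characters that precede it. The function name residue_number is hypothetical; the row and the start position 481 are taken from the GUX1_TRIRE/481-509 entry of the seed alignment.

```python
GAP = "."


def residue_number(aligned_row, start, column):
    """Map a 1-based alignment column to the residue number in the parent
    protein. Returns None if the row has a gap in that column."""
    if aligned_row[column - 1] == GAP:
        return None
    residues_before = sum(1 for c in aligned_row[:column - 1] if c != GAP)
    return start + residues_before


row = "HYGQCGGI...GYSGPTVCASGTTCQVLNPYY"  # GUX1_TRIRE/481-509

for column in (1, 9, 12, 32):
    print(column, row[column - 1], residue_number(row, 481, column))
```

Column 12 of the alignment is residue 489 of GUX1_TRIRE, not residue 492, because the three gap columns do not correspond to any residue in this sequence.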

It is common to use shaded boxes or coloured background to highlight residues or segments of a multiple alignment where the residues are strongly conserved. Commercial and some academic software can be used to add such features, but there is no common standard for exactly how this should look. It is necessary to check with the program documentation to figure out exactly how it works.

There are several other observations one might make about multiple alignments. For example, the fact that residues are aligned in a column does not necessarily mean that they are actually aligned structurally or in any other way. There is no common, accepted way of showing a 'frayed' alignment.

The correlation between residues far apart in the sequence in a protein family is usually difficult to spot in a multiple alignment. Other methods must be used to visualize this.
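One measure that can be used for this purpose is the mutual information between two alignment columns. The sketch below is an illustration chosen for this example, not a method taken from the text above; the function name mutual_information is made up, and the short columns are toy data. With only 18 sequences, estimates of this kind are very noisy.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns,
    given as equally long sequences of residue characters."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (n_ab / n) * math.log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Toy columns: perfectly coupled, independent, and one fully conserved.
print(mutual_information("AAGG", "TTCC"))
print(mutual_information("AAGG", "TCTC"))
print(mutual_information("AAAA", "TCTC"))
```

Note that a completely conserved column carries no mutual information with any other column, so this measure highlights co-variation rather than conservation.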

The combination of a known 3D structure and a multiple alignment can be very powerful for understanding a protein domain. Of course, knowledge of the biology and chemistry of the proteins increases the understanding. Often, a multiple alignment can help tie together many different observations into a coherent view of the structure and function of a protein domain.

Here is a schematic image of the 3D structure (PDB code 1CBH), as determined by protein NMR (Kraulis et al., 1989). A few residues are labelled, just to show some features of the structure, and to help with comparing with the multiple alignment.

Here is a table containing the counts per amino acid type for each position (alignment column) in the 18 sequences. The column headed "." counts gaps, and a "." in a cell stands for zero. Although it contains a lot of information, it is not particularly nice to look at, don't you agree?

	.	A	C	D	E	F	G	H	I	K	L	M	N	P	Q	R	S	T	V	W	Y
1	.	1	.	1	1	.	.	3	1	.	2	.	.	1	2	2	.	.	4	.	.
2	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	11	7
3	.	5	.	.	1	.	10	.	.	.	.	.	.	.	1	.	.	.	.	.	1
4	.	1	.	.	.	.	.	.	.	.	.	.	.	.	17	.	.	.	.	.	.
5	.	.	18	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.
6	.	.	.	.	.	.	18	.	.	.	.	.	.	.	.	.	.	.	.	.	.
7	.	.	.	.	.	.	18	.	.	.	.	.	.	.	.	.	.	.	.	.	.
8	.	1	.	.	1	.	.	.	7	.	.	2	3	.	3	.	1	.	.	.	.
9	17	.	.	.	.	.	.	.	.	1	.	.	.	.	.	.	.	.	.	.	.
10	17	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	1	.	.	.	.
11	17	1	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.
12	.	.	.	.	.	.	12	.	.	.	.	1	4	.	.	.	.	.	.	.	1
13	.	.	.	.	.	4	.	.	.	.	.	.	.	1	.	.	.	.	.	7	6
14	.	.	.	1	.	.	.	.	.	.	.	2	1	.	.	.	8	6	.	.	.
15	.	.	.	.	.	.	18	.	.	.	.	.	.	.	.	.	.	.	.	.	.
16	.	2	1	.	.	.	.	.	.	1	.	.	1	9	.	.	2	1	1	.	.
17	.	.	.	.	.	.	.	.	.	2	1	.	.	1	.	.	.	14	.	.	.
18	.	1	4	.	.	.	.	.	.	.	.	2	1	.	1	.	.	8	1	.	.
19	.	.	18	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.
20	.	7	.	.	1	.	.	.	.	2	.	.	.	1	1	.	2	2	2	.	.
21	.	1	.	.	3	1	1	.	.	.	.	.	.	2	.	.	9	1	.	.	.
22	.	.	.	.	.	.	12	.	.	.	.	.	.	6	.	.	.	.	.	.	.
23	.	.	.	.	.	2	1	.	.	.	3	.	1	.	.	.	4	2	.	.	5
24	.	1	.	.	.	.	.	.	1	5	.	1	.	.	.	.	.	9	1	.	.
25	.	.	18	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.
26	.	1	.	.	1	.	.	1	.	2	.	2	.	.	2	.	1	2	5	.	1
27	.	1	.	.	2	.	.	.	.	9	.	.	.	.	.	.	.	1	3	.	2
28	.	.	.	1	.	.	.	.	3	.	6	1	.	.	4	.	2	1	.	.	.
29	.	.	.	.	.	.	1	1	.	.	.	.	14	.	.	.	2	.	.	.	.
30	.	.	.	9	3	.	1	.	.	.	.	.	.	4	1	.	.	.	.	.	.
31	.	.	.	.	.	4	1	.	1	1	.	.	.	.	.	.	.	.	.	4	7
32	.	.	.	.	.	1	.	.	.	.	.	.	.	.	.	.	1	.	.	1	15
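A table of this kind can be tabulated directly from the alignment rows. The following Python sketch shows one way to do it; the names count_table and format_table are made up for this example, and the layout (tab-separated, "." for zero, gaps counted in the first column) simply imitates the table above.

```python
from collections import Counter

# Aligned rows copied from the seed alignment (identifiers omitted).
ALIGNMENT = """
HYGQCGGI...GYSGPTVCASGTTCQVLNPYY
HWGQCGGI...GYSGCKTCTSGTTCQYSNDYY
QWGQCGGI...GYTGSTTCASPYTCHVLNPYY
VWGQCGGI...GWSGPTNCAPGSACSTLNPYY
VWGQCGGQ...NWSGPTCCASGSTCVYSNDYY
LYGQCGGA...GWTGPTTCQAPGTCKVQNQWY
IWGQCGGN...GWTGATTCASGLKCEKINDWY
VWGQCGGN...GWTGPTTCASGSTCVKQNDFY
DWAQCGGN...GWTGPTTCVSPYTCTKQNDWY
QWGQCGGQ...NYSGPTTCKSPFTCKKINDFY
RWQQCGGI...GFTGPTQCEEPYICTKLNDWY
HWAQCGGI...GFSGPTTCPEPYTCAKDHDIY
LYEQCGGI...GFDGVTCCSEGLMCMKMGPYY
VWAQCGGQ...NWSGTPCCTSGNKCVKLNDFY
PYGQCGGM...NYSGKTMCSPGFKCVELNEFF
AYYQCGGSKSAYPNGNLACATGSKCVKQNEYY
RYAQCGGM...GYMGSTMCVGGYKCMAISEGS
EYAACGGE...MFMGAKCCKFGLVCYETSGKW
""".split()

# Gap symbol first, then the 20 amino acids in alphabetical one-letter order.
SYMBOLS = ".ACDEFGHIKLMNPQRSTVWY"


def count_table(rows):
    """Return one Counter per alignment column (gaps are counted as '.')."""
    return [Counter(column) for column in zip(*rows)]


def format_table(table):
    """Format the counts as tab-separated text, writing '.' for zero."""
    lines = ["\t" + "\t".join(SYMBOLS)]
    for position, counts in enumerate(table, start=1):
        cells = [str(counts[s]) if counts[s] else "." for s in SYMBOLS]
        lines.append("\t".join([str(position)] + cells))
    return "\n".join(lines)


table = count_table(ALIGNMENT)
print(format_table(table))
```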

A very useful representation of the conservation patterns is the so-called sequence logo. This shows the conserved residues as larger characters, where the total height of a column is proportional to how conserved that position is. Technically, the height is proportional to the information content of the position. Here is a web site for generating a sequence logo in PostScript format from an alignment.
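The information content of a position can be worked out from the counts of each residue type. The Python sketch below assumes the simplest form of the calculation: log2(20) minus the Shannon entropy of the observed frequencies, with gaps left out and without the small-sample correction that logo programs may apply. The function names are made up for this example; the counts are taken from positions 1, 2 and 5 of the table above.

```python
import math


def information_content(counts, alphabet_size=20):
    """Information content (bits) of one column: log2(alphabet_size)
    minus the Shannon entropy of the observed residue frequencies."""
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in counts.values() if n)
    return math.log2(alphabet_size) - entropy


def letter_heights(counts, alphabet_size=20):
    """Height of each letter in the logo column: frequency times the
    information content of the column."""
    total = sum(counts.values())
    r = information_content(counts, alphabet_size)
    return {aa: (n / total) * r for aa, n in counts.items() if n}


# Residue counts taken from the table above (gap counts left out).
position_5 = {"C": 18}
position_2 = {"W": 11, "Y": 7}
position_1 = {"A": 1, "D": 1, "E": 1, "H": 3, "I": 1,
              "L": 2, "P": 1, "Q": 2, "R": 2, "V": 4}

for name, counts in [("5", position_5), ("2", position_2), ("1", position_1)]:
    print(name, round(information_content(counts), 2), letter_heights(counts))
```

The completely conserved Cys at position 5 reaches the maximum of log2(20) bits, position 2 (Trp or Tyr only) comes fairly close to it, and the highly variable position 1 gives a short column of small letters.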

On closer examination of the CBD_1 multiple alignment (together with the known 3D structure), a number of features stand out. For example:

One of the sequences (GUNK_FUSOX) contains an insertion. The location of this insertion corresponds to a turn in the 3D structure, so it can easily be accommodated without large rearrangements of any other parts of the domain.
There is a problem close to the insertion in the sequence GUNK_FUSOX. The Tyr residue immediately to the left of it should probably be in column 13, which has a conserved aromatic residue (see above). Otherwise, the entire loop structure in this region would have to be very different. However, having gaps on both sides of a single residue is something most multiple alignment programs do not like.
There are three strongly conserved Gly residues, at positions 6, 7 and 15. There could be several (different) reasons for this: the phi-psi angles may have values forbidden or highly unfavourable for other residues, or the positions in the structure may not allow for any sidechain without destroying the function.
Inspection of the conserved residues shows that several of them (2 Trp/Tyr, 4 Gln, 31 Tyr/Phe/Trp, 32 Tyr) are located on the same side of the 3D structure. There seems to be no particular architectural reason for the conservation of these residues (they do not form a hydrophobic core, for instance), so maybe it has to do with the function of binding cellulose?
Residue 13 Tyr/Phe/Trp has a very special reason for being conserved as an aromatic residue, as shown by the structure. There is a hydrogen bond between the amide H of residue 14 and the pi-system of the aromatic ring.
There are three strictly conserved Cys residues in the alignment. This is strange. These proteins are secreted, so one would assume that the cysteines form disulfide bridges. This means that the number of conserved Cys residues should be even. If one looks closer at the individual sequences in SWISS-PROT, one can spot another conserved Cys three residues further on from the end of the Pfam alignment. Apparently, the Pfam definition ought really to contain a few more residues at the C-terminus.
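Some of the observations in this list can be checked mechanically against the seed alignment. The Python sketch below lists the strictly conserved columns, tests whether the number of strictly conserved Cys columns is even, and reports which sequences lack an aromatic residue in column 13. The function and variable names are made up for this example; column numbers are alignment columns, as in the count table.

```python
GAP = "."

# Identifier and aligned sequence for each row of the seed alignment.
SEED = """
GUX1_TRIRE/481-509 HYGQCGGI...GYSGPTVCASGTTCQVLNPYY
GUN1_TRIRE/427-455 HWGQCGGI...GYSGCKTCTSGTTCQYSNDYY
GUX1_PHACH/484-512 QWGQCGGI...GYTGSTTCASPYTCHVLNPYY
GUN2_TRIRE/25-53 VWGQCGGI...GWSGPTNCAPGSACSTLNPYY
GUX2_TRIRE/30-58 VWGQCGGQ...NWSGPTCCASGSTCVYSNDYY
GUN5_TRIRE/209-237 LYGQCGGA...GWTGPTTCQAPGTCKVQNQWY
GUNF_FUSOX/21-49 IWGQCGGN...GWTGATTCASGLKCEKINDWY
GUX3_AGABI/24-52 VWGQCGGN...GWTGPTTCASGSTCVKQNDFY
GUX1_PENJA/505-533 DWAQCGGN...GWTGPTTCVSPYTCTKQNDWY
GUXC_FUSOX/482-510 QWGQCGGQ...NYSGPTTCKSPFTCKKINDFY
GUX1_HUMGR/493-521 RWQQCGGI...GFTGPTQCEEPYICTKLNDWY
GUX1_NEUCR/484-512 HWAQCGGI...GFSGPTTCPEPYTCAKDHDIY
PSBP_PORPU/26-54 LYEQCGGI...GFDGVTCCSEGLMCMKMGPYY
GUNB_FUSOX/29-57 VWAQCGGQ...NWSGTPCCTSGNKCVKLNDFY
PSBP_PORPU/69-97 PYGQCGGM...NYSGKTMCSPGFKCVELNEFF
GUNK_FUSOX/339-370 AYYQCGGSKSAYPNGNLACATGSKCVKQNEYY
PSBP_PORPU/172-200 RYAQCGGM...GYMGSTMCVGGYKCMAISEGS
PSBP_PORPU/128-156 EYAACGGE...MFMGAKCCKFGLVCYETSGKW
"""
ROWS = [line.split() for line in SEED.strip().splitlines()]
names = [name for name, _ in ROWS]
seqs = [seq for _, seq in ROWS]


def strictly_conserved(rows):
    """Return {column number: residue} for the columns in which every
    row has the same (non-gap) residue. Columns are numbered from 1."""
    result = {}
    for number, column in enumerate(zip(*rows), start=1):
        if len(set(column)) == 1 and column[0] != GAP:
            result[number] = column[0]
    return result


conserved = strictly_conserved(seqs)
cys_columns = [n for n, aa in conserved.items() if aa == "C"]
gly_columns = [n for n, aa in conserved.items() if aa == "G"]
print("strictly conserved Gly columns:", gly_columns)
print("strictly conserved Cys columns:", cys_columns)
print("even number of conserved Cys:", len(cys_columns) % 2 == 0)

# Sequences without an aromatic residue (Phe, Trp, Tyr) in column 13.
non_aromatic = [name for name, seq in ROWS if seq[13 - 1] not in "FWY"]
print("no aromatic residue in column 13:", non_aromatic)
```

The odd Cys count and the single sequence without an aromatic residue in column 13 are exactly the two oddities discussed in the list.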
