The Nobel Prize in Chemistry 1980

Walter Gilbert

contributions concerning the determination of base sequences in nucleic acids

1. INTRODUCTION

With the introduction of methods of rapid nucleic acid sequence determination, synthesis of mixed oligonucleotide probes and computer-assisted analysis of nucleic acid sequences, the use of a single symbol to designate a variety of possible nucleotides at a single position has become widespread over the last few years. Whereas the use of, for example, the symbols R and Y to designate purine (A or G) and pyrimidine (C or T) ribonucleotides respectively [1] is generally accepted, no agreed symbols exist for the other possible combinations. Indeed, a plethora of diverse systems has proliferated in the last few years [2-11]. It is striking that, in one extreme case, the combination (C or G) has been represented by at least five different symbols [2-4, 8, 11]. A standardized set of symbols is thus required to prevent confusion. The symbols are intended to be applicable to both deoxyribonucleic and ribonucleic acids. Thus it is important to note from the outset that the recommended symbols will not discriminate between DNA and RNA, and the symbol T will be employed at all positions where U might appear in the RNA. Similarly, no distinction will be made in the symbols between base, nucleoside and nucleotide. Sequences may be assumed to have a deoxyribose backbone (DNA) unless specified otherwise. These changes from earlier recommendations [1] reflect great advances in techniques for sequencing DNA, so that RNA sequences are now commonly deduced from the corresponding DNA sequences. Since the standard representation of a DNA sequence may be converted to the corresponding RNA sequence by the simple expedient of substituting T by U, it is not envisaged that data banks based on computer storage facilities will inevitably contain entries for both DNA and its RNA equivalent. 
Authors should always, however, make it clear which strand of DNA or RNA a given sequence refers to, and in circumstances where confusion between DNA and RNA is likely the sequence may be prefixed with the lower-case letter d or r, as in the previous recommendations [1]. As the present recommendations present unique alphabetic symbols for each nucleotide combination, the use of upper- and lower-case letters as equivalent does not lead to confusion. However, such use may cause confusion between r (ribo-) and R (purine), and care must be taken in those rare cases where the various symbols are used in combination. In general, it should be emphasised (i) that upper-case symbols are advocated, and (ii) that the present recommendations are not intended to prejudice any possible future use of contrasting upper- and lower-case letters for specific purposes. It was previously [1] recommended that hyphens should be used to represent phosphodiester linkages in known nucleotide sequences. As there is now little danger of confusion between codon triplets and nucleotide sequences this recommendation is no longer considered necessary. Hyphens may therefore be omitted from sequences, and are omitted from all sequences in this document. In addition it may be assumed that all sequences are presented 5' to 3' unless otherwise specified, although specific mention of this fact is not discouraged.

Although several diverse systems of symbols for incompletely specified bases already exist in the literature, this presentation makes no systematic review. Details of the previous recommendations may be found in [1], and of systems that have been used in the literature in [2-11].

2. APPLICATIONS OF A STANDARD NOMENCLATURE

2.1. Recognition sequences in DNA for restriction enzymes

Most restriction enzymes and their corresponding methylases recognise simple unique nucleotide sequences in DNA. For example, EcoRI and BamHI recognise the sequences 5'-GAATTC-3' and 5'-GGATCC-3' respectively. Nevertheless, a growing class of enzymes includes those that recognise series of derivative sequences, where two or more bases may be present at a particular position in the recognition sequence (for a complete listing see [12]). For instance, the enzyme AvaI recognises four different sequences 5'-CCCGGG-3', 5'-CCCGAG-3', 5'-CTCGGG-3' and 5'-CTCGAG-3'. The recognition sequence for AvaI may thus be represented as 5'-CYCGRG-3', where Y represents a pyrimidine and R represents a purine, as recommended previously [1]. However, several newer enzymes recognise combinations that are not covered by the existing symbols. For instance, AccI recognises the sequence 5'-GT(A or C)(G or T)AC-3' [13]. SduI recognises the sequence 5'-G(A or G or T)GC(A or C or T)C-3' [14]. The present symbols are intended to cover these possibilities.
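As an illustration of how a single ambiguous recognition sequence stands for a family of concrete ones, the following minimal Python sketch expands a site into every fully specified sequence it covers. The symbol table follows the allocations made in section 3; the names IUPAC and expand are chosen here for illustration only and are not part of the recommendations.

```python
from itertools import product

# Symbol -> the bases it may stand for (allocations from section 3).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}

def expand(site: str) -> list[str]:
    """Return every fully specified sequence covered by an ambiguous site."""
    choices = [IUPAC[symbol] for symbol in site.upper()]
    return ["".join(combo) for combo in product(*choices)]

print(expand("CYCGRG"))       # AvaI: ['CCCGAG', 'CCCGGG', 'CTCGAG', 'CTCGGG']
print(expand("GTMKAC"))       # AccI: ['GTAGAC', 'GTATAC', 'GTCGAC', 'GTCTAC']
print(len(expand("GDGCHC")))  # SduI: 9
```

Written with the single-letter symbols, the AccI and SduI sites quoted above become GTMKAC and GDGCHC respectively.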

2.2. Recognition sequences in DNA for other enzymes

Restriction enzymes are highly specific for particular nucleic acid sequences. For many other enzymes the specificity is rather more lax, however, and the symbols are intended to meet in part the need for presenting a schematic summary of the sequence features. For instance, sequences recognised by the RNA polymerase of Escherichia coli may be presented as the juxtaposition of two sequences 5'-AA(A or T)NTNNN(C or G)TTGACA-3' and 5'-(T or G)NNTATAAT-3' separated by 13 to 16 nucleotides (adapted from [15, 16]), where N represents any nucleotide. A similar treatment may be applied to the recognition sequences for other DNA binding proteins such as repressor molecules.
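A schematic summary of this kind can be searched for mechanically. The sketch below, in Python, translates the two elements quoted above into a regular expression with a spacer of 13 to 16 nucleotides. The names UPSTREAM, DOWNSTREAM and to_regex are hypothetical, the test sequence is constructed purely for the example, and a pattern match says nothing about whether a sequence is actually recognised by the polymerase.

```python
import re

# Symbol -> the bases it may stand for (allocations from section 3).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}

def to_regex(pattern: str) -> str:
    """Translate an ambiguous sequence into a regular-expression fragment."""
    parts = []
    for symbol in pattern:
        bases = IUPAC[symbol]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)

UPSTREAM = "AAWNTNNNSTTGACA"   # 5'-AA(A or T)NTNNN(C or G)TTGACA-3'
DOWNSTREAM = "KNNTATAAT"       # 5'-(T or G)NNTATAAT-3'
SITE = re.compile(to_regex(UPSTREAM) + "[ACGT]{13,16}" + to_regex(DOWNSTREAM))

# Artificial test sequences: both elements separated by 14 or by 10 nucleotides.
good = "AAATTAAACTTGACA" + "G" * 14 + "TAATATAAT"
bad = "AAATTAAACTTGACA" + "G" * 10 + "TAATATAAT"
print(SITE.search(good) is not None)  # True
print(SITE.search(bad) is not None)   # False
```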

2.3. Recognition sequences in RNA for enzymes involved in translation

RNA sequences are, as mentioned above, most conveniently represented as their DNA counterparts. Thus the basic elements of a translation initiation site in Escherichia coli may be represented by 5'-(G or A)(G or A)GGGNNNNAN(C or T)ATGNN(A or T)NNNNN(C, T or G) (adapted from [17]). Similarly, translation initiation sites in eukaryotic mRNAs tend to conform to the sequence 5'-ANNATG(G or A)-3' [18].

2.4. Codon degeneracy

Although there are 64 possible triplet codons, there are only 20 different amino acids coded by them. Thus most amino acids are inserted into a growing polypeptide chain in response to two or more different triplets in the mRNA ([19] for a general review). For example, proline is coded by 5'-CCN-3' and alanine by 5'-GCN-3'. In other cases the pattern may be more complex, such as for isoleucine, which is coded by 5'-AT(T, C or A)-3'. Note that certain amino acids (e.g. serine) may be coded by two distinct groups of triplets [here 5'-TCN-3' and 5'-AG(T or C)-3'], which cannot be adequately represented as 5'-(T or A)(C or G)N-3' (see Table 4). It is to be noted that synthetic oligonucleotide probes for detecting protein-coding sequences often involve the preparation of 'mixed probes'. Here a mixture of two (or more) nucleotides is incorporated at a single position in the oligonucleotide to take account of the redundancy of the genetic code (for instance [20]). It is anticipated that a single-letter code might be used to designate such mixtures.
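The serine case can be checked directly. The following Python sketch collapses a set of codons position by position into one ambiguous triplet and then tests whether that triplet covers exactly the original codons and no others. The function names degenerate and is_exact are illustrative, and the codon lists are those of the standard genetic code for the three amino acids mentioned above.

```python
from itertools import product

# Symbol -> the bases it may stand for (allocations from section 3).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}
# Reverse lookup: set of bases -> symbol.
SYMBOL_FOR = {frozenset(bases): symbol for symbol, bases in IUPAC.items()}

def degenerate(codons: list[str]) -> str:
    """Collapse codons position by position into one ambiguous triplet."""
    return "".join(SYMBOL_FOR[frozenset(column)] for column in zip(*codons))

def is_exact(codons: list[str]) -> bool:
    """True if the ambiguous triplet covers the given codons and no others."""
    triplet = degenerate(codons)
    covered = {"".join(c) for c in product(*(IUPAC[s] for s in triplet))}
    return covered == set(codons)

proline = ["CCA", "CCC", "CCG", "CCT"]
isoleucine = ["ATT", "ATC", "ATA"]
serine = ["TCA", "TCC", "TCG", "TCT", "AGT", "AGC"]

print(degenerate(proline), is_exact(proline))        # CCN True
print(degenerate(isoleucine), is_exact(isoleucine))  # ATH True
print(degenerate(serine), is_exact(serine))          # WSN False
```

For serine the collapsed triplet WSN covers 16 codons rather than 6, which is why the two groups TCN and AGY have to be written separately.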

2.5. Construction of ancestral sequences by parsimony procedures [21]

Where two descendants differ in nucleic acid sequence at a particular position (for instance A in one and G in the other), the putative ancestral sequence can be represented [10] using a single-letter code, in this case R.
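The single step described here, merging the possibilities at each aligned position, can be sketched in Python by encoding each symbol as a four-bit mask and taking the bitwise OR. This shows only the union step for two aligned sequences of equal length; it is not a full parsimony procedure, and the names BIT_SYMBOLS and merge are chosen for this example.

```python
# Each base is one bit: A=1, C=2, G=4, T=8. The symbol for any combination
# sits at the index equal to the OR of its bits; index 0 (the empty set)
# is a placeholder that merge() never produces.
BIT_SYMBOLS = ".ACMGRSVTWYHKDBN"

def merge(seq_a: str, seq_b: str) -> str:
    """Return the sequence whose symbols cover both inputs at every position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        BIT_SYMBOLS[BIT_SYMBOLS.index(a) | BIT_SYMBOLS.index(b)]
        for a, b in zip(seq_a, seq_b)
    )

print(merge("A", "G"))              # R
print(merge("GATTACA", "GGTCACA"))  # GRTYACA
print(merge("R", "T"))              # D
```

Because the inputs may themselves contain ambiguous symbols, the same function can be applied repeatedly when more than two descendants are compared.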

2.6. Other uses

The symbols are intended to be useful for all purposes in which the exact identity of a nucleotide may vary. Thus uncertainties encountered with primary nucleic acid sequence data may, in some cases, be represented using standard symbols.

3. ALLOCATION OF SYMBOLS

In the choice of symbols the following considerations have been taken into account: (i) conformity to previous IUPAC-IUB nomenclature [1]; (ii) logical derivation; (iii) ease of memorisation; (iv) availability of symbols on a standard typewriter keyboard; (v) historical precedence.

3.1. Guanine, adenine, thymine, cytosine: G, A, T, C

These one-letter symbols have previously been established [1] and are generally used. There is, however, a problem of discriminating between the upper-case letters G and C on poorly copied sequences. Nevertheless, the use of alternative symbols for G (such as a barred G) is not recommended. Discrimination between the lower-case letters is much clearer. Note that T and U may, in general, be considered as being synonyms, though care should be taken to avoid ambiguity in circumstances where it is likely, e.g. in discussing artificial hybrids of DNA and RNA and in cases where specific distinction between T and U is advisable.

3.2. Purine (adenine or guanine): R

R is the symbol previously recommended [1].

3.3. Pyrimidine (thymine or cytosine): Y

Y is the symbol previously recommended [1].

3.4. Adenine or thymine: W

Although several diverse symbols have been used for this pair (and for the reciprocal pair G+C), only two symbols have a rational basis, L and W. L derives from DNA density (light; G+C, being heavy, would thus be H). W derives from the strength of the hydrogen bonding interaction between the base pairs (weak for A+T; G+C, being strong, would thus be S). However, the system recommended for the three-base series (not-A = B, etc., see below, section 3.8) rules out H as this would be not-G. W is thus recommended.

3.5. Guanine or cytosine: S

The choice of this symbol is discussed above in section 3.4.

3.6. Adenine or cytosine: M

There are few common features between A and C. The presence of an NH2 group in similar positions on both bases (Fig. 1) makes possible a logically derived symbol. A and N being ruled out, M (from aMino) is recommended.

Fig. 1. Origin of the symbols M and K. The four bases are drawn so as to show the relationship between adenine and cytosine on the one hand, which both have aMino groups at the ring position most distant from the point of attachment to the sugar, and between guanine and thymine on the other, which both have Keto groups at the corresponding position. The ring atoms are numbered as recommended [24-26], although for the present purpose this has the disadvantage of giving discordant numbers to the corresponding positions.

3.7. Guanine or thymine: K

By analogy with A and C (section 3.6), both G and T have Keto groups in similar positions (Fig. 1).

3.8. Adenine or thymine or cytosine: H

Not-G is the simplest means of memorising this combination, and symbols logically related to G were examined. F and H would both be suitable, as the letters before and after G in the alphabet, but A would have no equivalent to F. The use of H has historical precedence [2].

3.9. Guanine or cytosine or thymine: B

Not-A as above (section 3.8).

3.10. Guanine or adenine or cytosine: V

Not-T by analogy with not-G (section 3.8) would be U, but this is ruled out to eliminate confusion with uracil. V is the next logical choice. Note that T and U may in some cases be considered to be synonyms.

3.11. Guanine or adenine or thymine: D

Not-C as above (section 3.8).

3.12. Guanine or adenine or thymine or cytosine: N

This symbol is suggested by the sound of the word 'aNy'. The use of X to represent an unknown base is acknowledged, but this is not recommended as the symbol refers to xanthine [1]. Occasionally it may be desirable to distinguish between unspecified (N) and unknown (X), but if X is used for this purpose it should be explicitly defined.

4. OTHER ACCESSORY SYMBOLS

There are a number of instances in which additional symbols may be required for routine work. Although this section provides a number of suggestions, these do not form part of the present recommendations. First, we consider the uncertainty as to whether a base exists at a certain position or not. A symbol denoting 'G or A or T or C or no nucleotide', for example '?' or '+', might be used to define regions of uncertainty of limited variable size in a recognition sequence (see for instance [22]). Alternatively, one of these symbols might be used as a modifier to denote uncertainty: '?A' might, for instance, denote 'A or no nucleotide at this position'. Second, the unambiguous absence of a nucleotide introduced into a sequence for alignment or comparison purposes alone could be represented by ':', though a simple space has much to recommend itself. Third, a specified number of unknown nucleotides might be represented by a symbol such as '=' in conjunction with numerals, so that, for example, '=300=' might denote the presence of 300 unknown nucleotides. Fourth, the symbol 'N' (unknown or unspecified) may be replaced by the hyphen '-' in circumstances where rapid visual discrimination between 'known' (essential) and 'unknown' (non-essential) sequences is desirable. The value of this may be judged by comparing 'NNNNNCNNGNTNN' with '-----C--G-T--', for example. Note that the use of the lower-case letter n may avoid the necessity for an additional symbol, as in 'nnnnnCnnGnTnn'. In addition, the use of the oblique or slash '/' may present advantages in the definition of the precise cleavage sites of restriction endonucleases. For instance, the cleavage specificity of the common enzyme EcoRI might be represented by G/AATTC, where cleavage occurs in both strands of the self-symmetrical sequence between the G and A residues.
It is emphasised that the symbols appearing in this section do not form an integral part of the recommendations and must therefore be defined explicitly in the context in which they are used.
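Since these accessory symbols have to be defined explicitly wherever they are used, any program handling them must state its own conventions. The Python sketch below adopts three of the suggestions above as working assumptions: '/' marks a cleavage position, '=n=' stands for n unknown nucleotides, and a filler character may replace N for display. All three function names are hypothetical.

```python
def parse_cleavage(site: str) -> tuple[str, int]:
    """Split a site such as 'G/AATTC' into the bare sequence and the
    number of residues that precede the cut on the written strand."""
    cut = site.index("/")  # raises ValueError if no '/' is present
    return site.replace("/", ""), cut

def expand_run(token: str) -> str:
    """Turn a token such as '=300=' into that many N symbols."""
    return "N" * int(token.strip("="))

def mask_unspecified(seq: str, filler: str = "-") -> str:
    """Replace every N with a filler so the specified positions stand out."""
    return seq.replace("N", filler)

print(parse_cleavage("G/AATTC"))               # ('GAATTC', 1)
print(len(expand_run("=300=")))                # 300
print(mask_unspecified("NNNNNCNNGNTNN"))       # -----C--G-T--
print(mask_unspecified("NNNNNCNNGNTNN", "n"))  # nnnnnCnnGnTnn
```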

5. DISCUSSION

The present nomenclature, summarised in Table 1, has been formulated to deal with incomplete specification of bases in nucleic acid sequences. In cases where two or more bases are permitted at a particular position the nomenclature permits the allocation of a single-letter symbol. The nomenclature may also be applied where uncertainty exists as to extent and/or identity. For double-stranded nucleic acids Table 2 permits the allocation of symbols to the complementary strand. Examples are given whereby the nomenclature is applied to sequences recognised by certain type II restriction endonucleases (Table 3) and to uncertainties in deriving a nucleic acid sequence from the corresponding amino acid sequence (Table 4). Two applications fall outside the scope of the nomenclature and these are considered separately below.
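The allocation of symbols to the complementary strand can be sketched in Python as follows. The pairs used here are derived by applying Watson-Crick pairing (A with T, G with C) to every base a symbol covers, which is assumed to agree with Table 2; the names COMPLEMENT and reverse_complement are illustrative.

```python
# Complement of each symbol, obtained by complementing every base it covers:
# R (A or G) pairs with Y (T or C), M (A or C) with K (T or G),
# B (not-A) with V (not-T), D (not-C) with H (not-G);
# W (A or T), S (G or C) and N are their own complements.
COMPLEMENT = str.maketrans("ACGTRYMKWSBVDHN", "TGCAYRKMWSVBHDN")

def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("CYCGRG"))  # CYCGRG
print(reverse_complement("GDGCHC"))  # GDGCHC
print(reverse_complement("CCN"))     # NGG
```

The first two results show that the ambiguous AvaI and SduI sites remain self-symmetrical when written with the single-letter symbols.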

I was born on March 21, 1932 in Boston, Massachusetts. My father, Richard V. Gilbert, an economist, was at that time at Harvard University. He worked for the Office of Price Administration during the Second World War and later headed up a planning group advising the Pakistani government. My mother, Emma Cohen, was a child psychologist, who practiced giving intelligence tests to me and my younger sister. She educated us at home for the first few years, to keep my sister and me amused. We loved reading and raided the adult section of the public library. In 1939 my family moved to Washington D.C.; I was educated there in public schools, later at the Sidwell Friends high school.

I always had an interest in science, in those years mineralogy and astronomy (I was a member of a mineralogical society and an astronomical society as a child). I became interested in inorganic chemistry at high school. In my last year in high school, 1949, I was fascinated by nuclear physics and would skip school for long periods to go down to the Library of Congress to read about Van de Graaff generators and simple atom smashers. I went to Harvard and majored in chemistry and physics. I became interested in theoretical physics and, as a graduate student, worked in the theory of elementary particles, the quantum theory of fields. I spent my first graduate year at Harvard, then went to the University of Cambridge for two years, where I received my doctorate degree in 1957. My thesis supervisor was Abdus Salam; I worked on dispersion relations for elementary particle scattering: an effort to use a notion of causality, formulated as a mathematical property of analyticity of the scattering amplitude, to predict some aspects of the interaction of elementary particles. I met Jim Watson during this period. I returned to Harvard and, after a postdoctoral year and a year as Julian Schwinger's assistant, became an assistant professor of Physics. During the late fifties and early sixties, I taught a wide range of courses in theoretical physics and worked with graduate students on problems in theory. However, after a few years my interests shifted from the mathematical formulations of theoretical physics to an experimental field.

In the summer of 1960, Jim Watson told me about an experiment that he and Francois Gros and his students were working on. I found the ideas exciting and joined in for the summer. We were trying to identify messenger RNA, a short-lived RNA copy of a DNA gene, which serves as a carrier of information from the genome to the ribosomes, the factories that make proteins. After each messenger is used a few times to dictate the structure of a protein, it is broken down and recycled to make other messenger RNA molecules. The experiments sought a fleeting new component that we finally managed to pin down. I found the experimental work exciting and have continued research in molecular biology ever since.

After a year of work on messenger RNA, I returned briefly to physics, then came back to biology to study how proteins are synthesized. I showed that a single messenger molecule can service many ribosomes at once and that the growing polypeptide chain always remains attached to a transfer RNA molecule. This last discovery illuminated the mechanism of protein synthesis: the protein chain is transferred in turn from one amino-acid-bearing transfer RNA to another as it grows, their order dictated by messenger RNA and ultimately by the genetic code on the DNA. In the middle sixties, Benno Müller-Hill and I isolated the lactose repressor: the first example of a genetic control element. A repressor is a protein product made by one gene in the bacterium in order to control a second gene by turning it off when its product is not wanted. This control function had been defined genetically by the work of Jacob and Monod, but a repressor is made in such small amounts that it was an extraordinarily elusive biochemical entity. We identified, characterized, and purified one. We developed bacterial strains which made several thousandfold more protein, and showed how that repressor functioned. In the late sixties, David Dressler and I invented the rolling circle model, which describes one of the two ways DNA molecules duplicate themselves. In the early seventies I isolated the DNA fragment to which the lac repressor bound and studied the interaction of the bacterial RNA polymerase and the lac repressor with DNA. In the middle seventies, Allan Maxam and I developed rapid chemical DNA sequencing. At this time, I also became interested in and developed some of the recombinant DNA techniques, specifically showing that blunt end ligation was efficient in putting DNA fragments together. In the late seventies, with Lydia Villa-Komaroff and Argiris Efstratiadis, I worked on bacterial strains that expressed a mammalian gene product, insulin.
Currently I am interested in, on one hand, the making of useful proteins in bacteria and, on the other hand, the structure of genes and the evolution of DNA sequences.

After my change from physics to molecular biology, I was promoted at Harvard in Biophysics and later in Biochemistry and Molecular Biology. Since 1974 I have been an American Cancer Society Professor of Molecular Biology.

I am married to Celia Gilbert, a poet, and have two children, John Richard and Kate.