DNA Sequencing

1. Introduction

DNA sequencing is a laboratory method used to determine the sequence of a DNA molecule. Sequencing DNA means determining the order of the four chemical building blocks – called “bases” – that make up the DNA molecule. The sequence tells scientists the kind of genetic information that is carried in a particular DNA segment. For example, scientists can use sequence information to determine which stretches of DNA contain genes and which stretches carry regulatory instructions, turning genes on or off. In addition, and importantly, sequence data can highlight changes in a gene that may cause disease. In the DNA double helix, the four chemical bases always bond with the same partner to form “base pairs.” Adenine (A) always pairs with thymine (T); cytosine (C) always pairs with guanine (G). This pairing is the basis for the mechanism by which DNA molecules are copied when cells divide, and the pairing also underlies the methods by which most DNA sequencing experiments are done. The human genome contains about 3 billion base pairs that spell out the instructions for making and maintaining a human being. In 1980 Frederick Sanger was awarded the Nobel Prize in chemistry for his contributions to understanding DNA sequences.
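
The pairing rule described above can be written as a small lookup table. The following is a minimal Python sketch, not part of any sequencing software; the names PAIRS, complement and reverse_complement are chosen here purely for illustration.

```python
# Watson-Crick pairing as described in the text: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base-by-base partner of each base in `strand`."""
    return "".join(PAIRS[base] for base in strand)


def reverse_complement(strand):
    """Return the partner strand written in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


print(complement("ATGC"))
print(reverse_complement("ATGC"))
```

Because the two strands of the helix run in opposite directions, the partner strand is usually written reversed, which is why both functions are shown.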

In the early 1990s Sanger sequencing was used for sequencing a large fraction of genomes currently used in modern databases, including the human genome. In Sanger sequencing, the DNA to be sequenced serves as a template for DNA synthesis. A DNA primer is designed to be a starting point for DNA synthesis on the strand of DNA to be sequenced. Four individual DNA synthesis reactions are performed. The four reactions include normal A, G, C, and T deoxynucleotide triphosphates (dNTPs), and each contains a low level of one of four dideoxynucleotide triphosphates (ddNTPs): ddATP, ddGTP, ddCTP, or ddTTP. The four reactions can be named A, G, C and T, according to which of the four ddNTPs was included. When a ddNTP is incorporated into a chain of nucleotides, synthesis terminates. This is because the ddNTP molecule lacks a 3′ hydroxyl group, which is required to form a link with the next nucleotide in the chain. Since the ddNTPs are randomly incorporated, synthesis terminates at many different positions for each reaction. Following synthesis, the products of the A, G, C, and T reactions are individually loaded into four lanes of a single gel and separated using gel electrophoresis, a method that separates DNA fragments by their sizes. The bands of the gel are detected, and then the sequence is read from the bottom of the gel to the top, including bands in all four lanes. For instance, if the lowest band across all four lanes appears in the A reaction lane, then the first nucleotide in the sequence is A. Then if the next band from bottom to top appears in the T lane, the second nucleotide in the sequence is T, and so on. Due to the use of dideoxynucleotides in the reactions, Sanger sequencing is also called “dideoxy” sequencing.
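
The way a four-lane dideoxy gel is read can be sketched in a few lines of Python. This is an idealized illustration under stated assumptions: every possible termination product is present, fragment length stands in for position on the gel, and the function names dideoxy_lanes and read_gel are invented for this example.

```python
BASES = "AGCT"


def dideoxy_lanes(new_strand):
    """Fragment lengths expected in each of the four reactions.

    In the reaction containing ddX, chains can terminate at every
    position where the newly synthesized strand carries X.
    """
    lanes = {base: [] for base in BASES}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read bands from the bottom of the gel (shortest) to the top."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = dideoxy_lanes("ATGGC")
print(lanes)
print(read_gel(lanes))
```

Reading the sorted bands recovers the synthesized strand, which is the complement of the template being sequenced.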

The modern rapid methods for DNA sequencing fall into two broad groups depending on the procedures used to generate and relate sets of labelled oligonucleotides which, after resolution by gel electrophoresis, permit the DNA sequence to be deduced. The first group of methods employs a primed synthesis approach in which a single-stranded template, containing or comprising the sequence of interest, is copied to produce a radioactively labelled complementary strand. Sets of partially elongated molecules are produced by using chain-termination inhibitors. These molecules can then be fractionated on denaturing gels, and the patterns of labelled bands obtained are used to deduce the sequence. This process is very adaptable and was originally devised for sequencing naturally occurring single-stranded DNAs. Highly efficient procedures have now been developed for generating single-stranded templates from any duplex DNA. The forerunner of the primed synthesis method was the plus and minus method developed by Sanger and Coulson in 1975. By itself the method has distinct limitations, and additional procedures need to be employed to confirm regions of the sequences deduced. One of these procedures is the depurination method, which was originally introduced by Burton and Petersen. Another development, which made the method more flexible, was the single-site ribosubstitution reaction. Further developments in primed synthesis methods were made possible by the introduction of the chain-terminating dideoxynucleotides as specific inhibitors.

2. DNA Sequencing by primed synthesis methods

This method was developed by Sanger et al. in 1973. It was used to determine two particular nucleotide sequences in bacteriophage DNA using DNA polymerase primed by synthetic oligonucleotides. The synthetic oligonucleotide is hybridized to a specific complementary region on the single-stranded DNA. Deoxynucleotides are added sequentially by DNA polymerase to the 3′-OH end of the primer. Using [32P]-labelled deoxyribonucleoside 5′-triphosphates, a radioactive complementary copy of a defined region of the template is obtained. The next stage in the development of this method came two years later. Analysis of the complementary DNA is done by fractionating the DNA fragments by electrophoresis on high-resolution polyacrylamide gels. It constitutes a relatively rapid and simple method for sequence analysis and illustrates the principle on which the modern chain-termination procedure is based.

3. Principle of the ‘plus and minus’ method

3.1 Primed synthesis

A DNA primer (commonly a restriction fragment or synthetic oligonucleotide) is hybridized to the single-stranded DNA template, and the primer is extended to a limited degree with DNA polymerase I in the presence of all four deoxynucleoside 5′-triphosphates, one of which is labelled. Samples of the reaction mixture are taken at different times so that primers extended to different degrees are obtained. The reaction is terminated by addition of EDTA and the various samples are then combined. DNA polymerase is removed by phenol extraction. The extended polynucleotide chain, still hybridized to the template, is separated from the excess deoxynucleoside triphosphates by gel filtration on Sephadex or agarose. The product is a mixture of partially elongated fragments of variable chain length; ideally, all possible chain lengths of the complementary strand are present, corresponding to an elongation of the primer from 0 to 200 nucleotides.

3.2 The ‘minus’ reaction

In 1968 Wu and Kaiser showed that in the absence of one of the four deoxynucleoside triphosphates, DNA polymerase would accurately catalyse chain extension up to the point where the missing nucleotide should have been incorporated. The ‘minus’ reaction utilizes the same principle. The random mixture of oligonucleotides, still annealed to the DNA template, is re-incubated with DNA polymerase in the presence of only three deoxynucleoside triphosphates. Synthesis then proceeds as far as the missing triphosphate on each chain. Four separate reactions are set up, each with one of the triphosphates missing. After incubation the reaction mixtures are denatured to separate the nascent strands from the template, the four samples are analysed simultaneously by electrophoresis on a high-resolution polyacrylamide gel in the presence of 8 M urea, and the separated oligonucleotides are visualized by autoradiography. On the acrylamide gel, mobility is determined essentially by size, so that the various oligonucleotide products, all of which should have a common 5′-end, will be arranged in a ladder according to size. Each oligonucleotide should be resolved from its neighbour, which contains one more residue, by a distinct space. As would be expected, the resolution falls off with increasing chain length, and it is this factor which usually determines how far the autoradiograph may be read. The separation between two fragments of, say, 150 and 151 nucleotides will be much less than that between two fragments of 20 and 21 nucleotides. The autoradiograph of the -A channel will consist of a set of bands, each of which corresponds to an extension product up to, but not including, the next A residue in the sequence. Thus the positions of the A residues are located. In a similar way the positions of the other residues are located from the other sequencing channels and, in principle, the sequence of the DNA is read off from the autoradiograph.
Usually, however, this system alone is not sufficient to establish the sequence and a second line of attack, the ‘plus’ system is used to confirm and complement the data from the ‘minus’ system.
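
The band pattern expected in a ‘minus’ channel can be sketched as follows. This is an idealized Python illustration, not a description of any published software: the function name minus_bands is hypothetical, the starting mixture is assumed to contain every chain length, and, as a modelling assumption taken from the interpretation section later in this text, a run of identical residues is assumed to give a band only before its first residue.

```python
def minus_bands(new_strand, missing):
    """Chain lengths expected as bands in the channel lacking `missing`.

    `new_strand` is the strand synthesized beyond the primer, 5' to 3'.
    A chain of length k stops when residue k+1 would be the missing base.
    Assumption: inside a run of that base only the run start gives a band.
    """
    bands = []
    for k, base in enumerate(new_strand):
        starts_run = k == 0 or new_strand[k - 1] != missing
        if base == missing and starts_run:
            bands.append(k)
    return bands


strand = "TAAC"
for base in "ACGT":
    print("-" + base, minus_bands(strand, base))
```

For the strand T-A-A-C the -A channel shows a single band of length 1, meaning that the residue after the first one is an A, while the length of the A run is not visible from this channel alone.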

3.3 The ‘plus’ system

In 1971 and 1972 Englund observed that, in the presence of a single deoxynucleoside triphosphate, DNA polymerase from phage T4-infected E. coli (T4 polymerase) will degrade double-stranded DNA from its 3′-ends, but that this 3′-exonuclease activity will stop at residues corresponding to the single deoxynucleoside triphosphate present in the reaction mixture. Since T4 polymerase lacks the 5′-exonuclease activity found in E. coli polymerase, incubation with this enzyme serves to trim back each elongated product to the residue corresponding to the added deoxynucleotide. The polymerizing activity of the enzyme catalyses the turnover of this residue but effectively halts the progress of the 3′-exonuclease activity. In the ‘plus’ reaction this method is applied to four further samples of the primer-template complex isolated above. Samples are incubated with T4 polymerase and a single triphosphate and, after denaturation, the products are analysed directly by gel electrophoresis. Thus, in the +A system only dATP is added and all the chains will consequently terminate with an A residue. The bands observed on the gel will therefore correspond to a set of fragments representing all the extension products which terminate with A. Each product from the +A reaction will be one residue longer than the corresponding band in the -A reaction.
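
The ‘plus’ channel can be sketched in the same idealized way. In this Python illustration the function name plus_bands is hypothetical, every chain length is assumed to be present before trimming, and a run of identical residues is assumed to give a band only at its last residue, as stated in the interpretation section of this text.

```python
def plus_bands(new_strand, added):
    """Chain lengths expected as bands in the channel containing only `added`.

    `new_strand` is the strand synthesized beyond the primer, 5' to 3'.
    Each chain is trimmed back until its 3'-terminal residue is `added`.
    Assumption: inside a run of that base only the run end gives a band.
    """
    bands = []
    n = len(new_strand)
    for k in range(1, n + 1):
        ends_run = k == n or new_strand[k] != added
        if new_strand[k - 1] == added and ends_run:
            bands.append(k)
    return bands


strand = "TAAC"
for base in "ACGT":
    print("+" + base, plus_bands(strand, base))
```

For a single isolated residue the plus band is one residue longer than the matching minus band, in line with the statement above; for the A run in T-A-A-C the +A band has length 3, the end of the run.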

4. The polyacrylamide gel: Interpretation of results

The analysis of the products from the ‘plus’ and ‘minus’ reactions demands an acrylamide gel capable of resolving oligonucleotides which differ in length by only one residue. A 12% polyacrylamide slab gel in a Tris-borate buffer containing 8 M urea is usually employed for this purpose. Since the adequate resolution of products is the main limiting factor in this type of analysis, and since this in turn depends on obtaining a sharp autoradiograph, it is important to use extremely thin gels to avoid the blurring which would unavoidably result from the high-energy β-emission of 32P embedded in a thicker gel. We define the smallest oligonucleotide in the -T channel as band 1. This means that the next residue after the 3′-terminus of the oligonucleotide corresponding to band 1 will be a T. This is equivalent to, and is confirmed by, the presence of a band in the +T channel corresponding to the next longest oligonucleotide. Band 2 occurs in the +T channel and the -A channel, showing that its 3′-terminus is T and the adjacent nucleotide is an A, thus defining the dinucleotide sequence T-A. The next longest oligonucleotide occurs in the +A and -C channels. This defines the dinucleotide A-C and so extends the sequence to T-A-C. Typically a sequence of 60-100 nucleotides, starting about 10-20 nucleotides from the 3′-end of the primer sequence, can be obtained from a single gel. In the above example each nucleotide in the sequence is represented by bands in both the + and - channels. However, if a run of two or more identical nucleotides occurs, only the first one will be seen in the minus reaction and only the last one of the run will be seen in the equivalent plus reaction. If a synthetic oligonucleotide or a small restriction fragment (<100 nucleotides) is used as a primer for the initial extension, the products of the plus and minus reactions can be analysed directly.
Clearly, the smaller the primer that can be used, consistent with its ability to yield a unique primer-template complex in the annealing reaction, the greater the amount of sequence information that can be deduced, since the extension reactions can be pushed further and still yield resolvable fragments. When using primers of this sort it is important to maintain the integrity of the 5′-terminus so that the difference in length of the fragments depends only on differences at their 3′-termini. This is achieved by using a DNA polymerase lacking the normal 5′-exonuclease activity of DNA polymerase I. If longer restriction fragments are used as primers, it becomes necessary to cleave the primer from the extended product before a sequence can be deduced for the synthesized oligonucleotide.

The distance separating the different bands, in either the plus or minus patterns, will define how many residues there are in a run. This can lead to problems in the precise determination of the lengths of longer ‘runs’ and, partly for that reason, it is advantageous to include a ‘zero’ channel on the sequencing gel. This is simply an aliquot of the initial reaction which ideally will contain labelled oligonucleotides of all possible chain lengths and will therefore yield distinct bands corresponding to each residue of a run. Where a longer restriction fragment has been used as the primer, cleavage of the primer from the extended product is conveniently done by digestion with the restriction enzyme originally used to prepare the primer. The products for analysis in this case represent only the de novo synthesized sequences and in consequence are theoretically capable of yielding relatively more sequence data than in experiments where the primer remains attached to the analysed products. Some restriction enzymes, however, are inhibited by the single-stranded DNA present in uncopied regions of the template, and these enzymes cannot therefore be reliably used to cleave the primer from the extended product. One way round this problem is to use the single-site ribosubstitution method, in which a single ribonucleotide is incorporated at the priming site, thus allowing the primer to be cleaved from the extended product with ribonuclease or alkali. This method is also useful if the restriction endonuclease used to generate the primer has a second cleavage site within the region to be sequenced.
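
The interpretation rules above can be turned into a short reading procedure. The sketch below is a hedged Python illustration only: read_plus_minus is a hypothetical name, band positions are assumed to be known exactly as chain lengths (in practice this is what the band spacing and the ‘zero’ channel provide), and each minus band is paired with the plus band that closes the same run.

```python
def read_plus_minus(minus, plus):
    """Deduce a sequence from idealized plus and minus band lengths.

    `minus` and `plus` map each base to a list of chain lengths.
    A minus band of length k in channel X means residue k+1 is X
    (start of a run); a plus band of length k in channel X means
    residue k is X (end of a run).
    """
    starts = sorted((k + 1, b) for b, ks in minus.items() for k in ks)
    ends = sorted((k, b) for b, ks in plus.items() for k in ks)
    if len(starts) != len(ends):
        raise ValueError("plus and minus patterns do not match")
    sequence = []
    for (start, base), (end, end_base) in zip(starts, ends):
        if base != end_base or end < start:
            raise ValueError("inconsistent bands near residue %d" % start)
        sequence.append(base * (end - start + 1))
    return "".join(sequence)


# The T-A-C example from the text, with lengths counted from the first band.
minus = {"T": [0], "A": [1], "C": [2], "G": []}
plus = {"T": [1], "A": [2], "C": [3], "G": []}
print(read_plus_minus(minus, plus))
```

With a run, for example minus = {"T": [0], "A": [1], "C": [4], "G": []} and plus = {"T": [1], "A": [4], "C": [5], "G": []}, the gap between the -A and +A bands gives the run length and the function returns TAAAC.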

5. Additional methods useful in conjunction with plus and minus method

5.1 Ribosubstitution method

This method was developed by Brown in the year 1978. In the protocol described above, restriction fragments are used as primers for the synthesis of cDNA, and the same endonuclease is subsequently used to remove the primer and generate a unique 5′-terminus on the cDNA. However, as pointed out earlier, some restriction enzymes are strongly inhibited by the single stranded regions present in the template and cannot be used to cleave at the restriction site. An additional problem arises if a second cleavage site for the same enzyme is present within the sequence copied into the radioactive DNA. Two sets of fragments would be generated and a unique sequence would not be obtained. In the ribosubstitution method these problems are circumvented by the addition of one or more ribonucleotides between the DNA primer and the radioactive cDNA. This site is susceptible to cleavage with ribonuclease or alkali. In the presence of Mn++ ions, E. coli DNA polymerase I will incorporate ribonucleotides into DNA. In the single site ribosubstitution reaction a ribonucleotide is incorporated at the 3’-end of a DNA primer in the presence of Mn2+, and with no other triphosphates present. Further ribonucleotide incorporation is effectively suppressed in the subsequent elongation reaction by the addition of deoxyribonucleoside triphosphates.

5.2 Procedure

5.2.1 Step 1 : Annealing reaction

Mix: 5 µl of primer (approx. 1-2 pmol in H2O), 1 µl DNA template (single strand, approx. 0.4 pmol in H2O), 1.25 µl M NaCl, 1.25 µl 10 x Pol mix and 6.5 µl H2O. Seal in a glass capillary tube (approx. 10 cm long x 1 mm internal diameter). Denature by heating to 100°C for 3 min, then anneal by incubating at 67°C for 45 min.

5.2.2 Step 2: Primed synthesis of [32P]-cDNA

  1. Dry down 20 µCi of [32P]-dATP (specific activity about 300 Ci/mmol) in a siliconized tube in vacuo. (This is conveniently done as soon as the annealing reaction is started.)
  2. To the dry dATP add:
     • 5 µl dCTP (0.5 mM)
     • 5 µl dTTP (0.5 mM)
     • 5 µl dGTP (0.5 mM)
     • 5 µl 10 x Pol mix
     • the annealed reaction mixture from step 1
  3. Mix contents by sucking up and down in the capillary from the siliconized tube at 0°C.
  4. Start the reaction by mixing in 1 µl DNA polymerase I.
  5. Hold at 0°C.
  6. Remove an aliquot (approx. 15 µl) after 1 min and eject into 25 µl 0.1 M EDTA, pH 7.6, to stop the reaction. After 3 min eject the remainder of the reaction mixture into the same EDTA.

5.2.3 Step 3: Removal of polymerase and triphosphates

  1. To the extension mixture from Step 2 add 25 µl phenol.
  2. Vortex for 0.5-1 min.
  3. Extract 5 times with 1 ml portions of ether to remove phenol.
  4. Remove last traces of ether with a stream of air or nitrogen.
  5. Load the sample onto a column of G-100 Sephadex (3 mm x 200 mm) equilibrated with a degassed buffer containing 10^-4 M EDTA, 5 x 10^-3 M Tris-HCl, pH 7.5.
  6. The polynucleotide is eluted with the break-through volume of the column (flow rate approx. 50 µl/min). Fractions of 2-3 drops may be collected and the position of the eluted DNA located with a hand-held mini-monitor. Appropriate fractions are combined and freeze-dried.
  7. At this stage the product should register >300 c.p.s. on the mini-monitor, equivalent to an incorporation of about 5%-10% of the radioactive nucleotide. Under these conditions the extension of the primer ranges from zero to 150-200 nucleotides.

5.2.4 Step 4: Plus and minus reactions

  1. Dissolve the [32P]-labelled extended polynucleotide in 20 µl H2O.
  2. Set up 8 capillaries with drawn-out tips, resting tip down in siliconized tubes, on ice. Introduce into the tip of each capillary 2 µl of the polynucleotide solution and 2 µl of the appropriate plus or minus mix.
  3. Add 1 µl T4 polymerase to capillaries 1-4, mix as before and incubate at 37°C for 45 min.
  4. Add 1 µl ‘Klenow’ polymerase to capillaries 5-8, mix and incubate at 0°C for 45 min.
  5. The next step depends on whether a short (<100 nucleotides) or longer polynucleotide was used as a primer. In the former case the reaction is terminated by blowing each reaction mixture into 10 µl formamide-dye mix, and the remaining radioactive polynucleotide from step 3 (approx. 4 µl) is added to 10 µl formamide-dye mix. This is the ‘zero’ sample. The nine samples are now ready to proceed to step 5. Where a longer primer was used, this needs to be cleaved from the product using the appropriate restriction endonuclease. Add 1 µl (0.5 to 1.0 unit) of the datum restriction endonuclease to each sample, mix, and incubate at 37°C for 30 min. (The datum endonuclease is that which defines the 5′-end of the sequence under investigation, which will usually be the endonuclease used to prepare the primer fragment.) Stop the reaction by blowing into 10 µl formamide-dye mix.
  6. In addition, to provide the reference pattern of oligonucleotides (the zero channel), a sample of the radioactive polynucleotide prepared in Step 3 is also digested with the restriction endonuclease.
  7. Mix: 2 µl radioactive polynucleotide from step 3, 1.5 µl H2O, 0.5 µl 10 x restriction buffer, 1 µl restriction endonuclease (0.5 to 1.0 unit). Incubate at 37°C for 30 min. Stop the reaction by blowing into 10 µl formamide-dye mix.

5.3 Step 5 : Gel electrophoresis

5.3.1 (i) Preparation of the acrylamide gel

This is conveniently done while the extension product is drying (step 3). The gel is cast in a cell made from two tempered glass plates, 40 cm x 20 cm, separated by two ‘perspex’ (polymethylmethacrylate sheet) spacers, 1.0-1.5 mm thick, running the length of the gel compartment. The design is essentially that of Studier (1973). The cell is sealed along the bottom and both sides with waterproof tape. Immediately after pouring the gel, a close-fitting well former giving 12 wells (1.1 cm wide) is inserted and the gel is allowed to set in a near-horizontal position. Immediately before the gel is required, the tape is peeled off the bottom of the cell and the well former carefully removed (care is required not to break the wells). The cell is clamped vertically in the electrophoresis apparatus and the buffer compartments are filled with 1 x TBE buffer.

  1. Dissolve 63 g urea (AnalaR) in 15 ml 10 x TBE plus 60 ml 30%
  2. Add 5 ml 1.6% freshly prepared ammonium persulphate and
  3. Degas on water pump.
  4. Add 75 µl TEMED (N,N,N′,N′-tetramethylethylenediamine). Mix gently.
  5. Pour gel immediately.
  6. The gel usually sets within 20-30 min and can be used after 1 hour. Alternatively the gel can be left overnight, with the well former in place, before use.
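
The stock volumes in the recipe can be checked against the 12% gel and 1 x TBE buffer mentioned elsewhere in this text using the usual dilution relation (stock concentration x stock volume = final concentration x final volume). The Python sketch below rests on two assumptions that the recipe does not state: that the ‘30%’ stock is the 30% acrylamide solution listed under enzymes and chemicals, and that the final gel volume is 150 ml. The function name final_concentration is hypothetical.

```python
def final_concentration(stock_conc, stock_volume_ml, final_volume_ml):
    """Concentration after diluting a stock to the final volume."""
    return stock_conc * stock_volume_ml / final_volume_ml


# Assumption: final gel volume, not stated in the recipe.
FINAL_VOLUME_ML = 150.0

# Assumption: the 60 ml of '30%' stock is the 30% acrylamide solution.
acrylamide_percent = final_concentration(30.0, 60.0, FINAL_VOLUME_ML)
tbe_strength = final_concentration(10.0, 15.0, FINAL_VOLUME_ML)

print("acrylamide: %.1f%%" % acrylamide_percent)
print("TBE: %.1f x" % tbe_strength)
```

Under these assumptions both figures agree with the 12% gel and the 1 x TBE running buffer described in the text; a different final volume would change both values proportionally.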

5.3.2  (ii) Running the gel

Heat the nine samples from step 4 at 90°C for 3 min. Using a Pasteur pipette, blow fresh 1 x TBE into the sample wells in order to remove urea which has diffused out of the gel.

Load the samples into the gel wells using a drawn-out capillary tube.

A suitable order is:

  1. Run the gel at about 600 V until the fast-migrating dye (bromophenol blue) is at the bottom of the gel (approx. 4 hr). The gel gets quite hot during electrophoresis, ensuring the DNA remains fully denatured.
  2. Remove one glass plate from the gel and cover the exposed surface with cellophane film. Label with radioactive ink (preferably 35S) and autoradiograph for 1-2 days at -20°C. Alternatively the gel may be fixed by immersion in 10% acetic acid for 15-20 min, washed for 1-2 min in distilled water, blotted dry with absorbent paper, covered with cellophane and autoradiographed at room temperature.

5.3.3 (iii) Enzymes and chemicals

Formamide-dye mix: 0.03% xylene cyanol FF, 0.03% bromophenol blue, 25 mM-EDTA in 90% formamide.

30% acrylamide: 29% (w/v) acrylamide, 1% (w/v) bis-acrylamide. Deionized by stirring with Amberlite MB-1 (5 g/100 ml) for 1 hr and filtering.

Stop mix: 0.03% bromophenol blue, 40% sucrose, 25mM EDTA in H2O.

Depurination analysis of defined fragments generated by primed synthesis

Depurination and depyrimidination are chemical procedures in which the purine or pyrimidine bases, respectively, are removed from the DNA molecule such that the runs of consecutive pyrimidines or purines are left intact. The method devised by Burton and Petersen in 1960 involves incubating the DNA with 2% diphenylamine in 60% formic acid. This results in quantitative elimination of the purine nucleosides, leaving pyrimidine tracts as the 3′,5′-diphosphates. Analysis of these pyrimidine clusters was one of the most successful earlier procedures for obtaining sequence information and, in conjunction with T4 endonuclease IV digestion, was used to determine the sequence of a 48-nucleotide fragment from ΦX174 DNA. Nowadays, depurination is not much used as a sequencing procedure per se but is valuable in confirming the length and distribution of pyrimidine tracts in defined fragments generated by primed synthesis methods.

6. Discussion and Summary

The plus and minus method was a major breakthrough in the development of primed synthesis methods for the rapid determination of DNA sequences. However, it has certain limitations, a major one being that it cannot be applied directly to double-stranded DNA, so a strand separation of either the template or the primer must first be carried out. With the advent of cloning techniques, the problem of obtaining single-stranded DNAs from naturally double-stranded molecules has now been solved. As pointed out by Sanger et al. in 1977, the plus and minus method cannot be regarded as completely reliable in the absence of confirmatory data. The additional methods, ribosubstitution and depurination, have been useful in overcoming two of the problems that arise.

The field of DNA sequencing has a diverse history. In the 1970s and for almost another decade, DNA sequencing was barely automated and was therefore a very tedious process which allowed only a few hundred nucleotides to be determined in an experiment. In the late 1980s, semi-automated sequencers with higher throughput became available, still only able to determine a few sequences at a time. One breakthrough in the early 1990s was the development of capillary array electrophoresis and appropriate detection systems. Only within the last five years have alternative sequencing strategies such as pyrosequencing, reversible terminator chemistry, sequencing-by-ligation, virtual terminator chemistry and real-time sequencing been developed or converged into new instruments. These new instruments require us to completely redefine the term “high-throughput sequencing”, as they outperform the older Sanger-sequencing technologies by a factor of 100 to 1,000 in daily throughput and reduce the cost of sequencing one million nucleotides (1 Mb) to between 4% and 0.1% of that associated with Sanger sequencing. Because these technologies are parallelizable, they create far more sequence output than conventional Sanger sequencing. They are collectively called next-generation sequencing (NGS) approaches.

7. Next generation sequencing

A DNA polymerase-based method was developed by Sanger and Coulson in the 1970s for DNA sequencing. It was the first enzyme-based approach and is known as Sanger sequencing. This method was used to sequence bacteriophage ΦX174 for the first time. From the 5’ end up to 80 nucleotides were accurately identified and the following 50 nucleotides were identified with lower accuracy. In the following years enzyme based methods were optimized. The primary limitation was low throughput due to the template preparation and the enzymatic reactions. Sanger sequencing based on the automated sequencers was able to sequence up to 384 reads in parallel and about 1,000 bases per read per hour. The sequencing cost dropped rapidly due to advances in the sequencing methods.

Three major providers of next generation DNA sequencing are Illumina/Solexa, Roche/454 and Life/APG. Their methods include template preparation and sequencing followed by imaging and image data analysis. Specific changes in the protocols separate these methods. The two common approaches used during the template preparation step are either using clonally amplified templates or single molecule templates. Although these approaches differ in biochemistry they all follow the principle of cyclic-array sequencing. These technologies have been released as commercial products, e.g., the Solexa Genome Analyzer, the SOLiD platform, 454 Genome Sequencers, and the HeliScope Single Molecule Sequencer technology. These technologies create reads of length 25 – 250 bps and with up to 40 million reads per run.

8. Sanger capillary sequencing

Current Sanger capillary array electrophoresis (CAE) systems are based on the same general scheme applied in 1977 for the ΦX174 genome. First, millions of copies of the sequence to be determined are purified or amplified, depending on the source of the sequence. Reverse strand synthesis is performed on these copies using a known priming sequence upstream of the sequence to be determined and a mixture of deoxynucleotides (dNTPs, the standard building blocks of DNA) and dideoxynucleotides (ddNTPs, modified nucleotides missing a hydroxyl group at the third carbon atom of the sugar). The dNTP/ddNTP mixture causes random, non-reversible termination of the extension reaction. Denaturation and clean-up of free nucleotides, primers and the enzyme are then performed. The resulting molecules are sorted by their molecular weight (corresponding to the point of termination), and the label attached to the terminating ddNTPs is read out sequentially in the order created by the sorting step. Sorting by molecular weight was originally performed using gel electrophoresis but is nowadays carried out by capillary electrophoresis. Originally, radioactive or optical labels were applied in four different terminator reactions, whereas today, with the advent of more sensitive detection techniques, four different fluorophores, one per nucleotide (A, C, G and T), are used in a single reaction. In modern sequencing reactions, several rounds of primer extension (equivalent to a linear amplification) and more sensitive detection systems permit smaller amounts of starting DNA to be used. Unfortunately, there is still little automation for the creation of the high-copy input DNA with known priming sites, which is done by cloning. In the cloning procedure the target sequence is introduced into a known vector sequence by restriction and ligation.
Then a bacterial strain is used to amplify the target sequence in vivo – thereby exploiting the low amplification error due to inherent proof-reading and repair mechanisms. However this process is very tedious. Later integrated microfluidic devices have been developed which aim to automate the DNA extraction, in vitro amplification and sequencing on the same chip. Using current Sanger sequencing technology, it is technically possible for up to 384 sequences of between 600 and 1,000 nucleotides (nt) in length to be sequenced in parallel. The sequencing error observed for Sanger sequencing is mainly due to errors in the amplification.

9. 454/Roche Genome Sequencer

The 454 Genome Sequencer (GS) platform, released in 2005, was the first of the new high-throughput sequencing platforms. It is based on the pyrosequencing approach developed by Pål Nyrén and Mostafa Ronaghi at the Royal Institute of Technology, Stockholm, in 1996. In contrast to the Sanger technology, pyrosequencing is based on iteratively complementing single strands and simultaneously reading out the signal emitted from the nucleotide being incorporated (also called “sequencing by synthesis” or “sequencing during extension”). Because the read-out is done simultaneously with the sequence extension, electrophoresis is no longer required to generate an ordered read-out of the nucleotides. In the pyrosequencing process, one nucleotide at a time is washed over several copies of the sequence to be determined, causing polymerases to incorporate the nucleotide if it is complementary to the template strand. The incorporation stops when the longest possible stretch of complementary nucleotides has been synthesized by the polymerase. In the process of incorporation, one pyrophosphate per nucleotide is released and converted to adenosine triphosphate (ATP) by an ATP sulfurylase. The ATP drives the light reaction of luciferases present and the emitted light signal is measured. To prevent the deoxyadenosine triphosphate (dATP) provided in a typical sequencing reaction from being used directly in the light reaction, deoxyadenosine-5′-(alpha-thio)-triphosphate (dATPαS), which is not a substrate of the luciferase, is used for the base incorporation reaction of adenine. After capturing the light intensity, the remaining unincorporated nucleotides are washed away and the next nucleotide is provided. In the array-based format, free nucleotides are washed over the sequencing plate and the light emitted during the incorporation is captured for all wells in parallel using a high-resolution CCD camera, exploiting the light-transporting features of the plate used.
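
The flow-by-flow logic of pyrosequencing can be sketched in Python. This is an idealized illustration, not the instrument's base caller: the function names flowgram and call_bases are hypothetical, the flow order "TACG" is an assumption made for the example, the template is given in the order in which the polymerase reads it, and the light signal is taken to be exactly the number of nucleotides incorporated in a flow.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flowgram(template, flow_order="TACG", max_flows=400):
    """Return (nucleotide, signal) for each flow over one template.

    In each flow a single nucleotide is offered; the polymerase adds
    it for as long as it is complementary to the next template base,
    so a run of identical bases gives one proportionally larger signal.
    """
    position = 0
    flows = []
    for i in range(max_flows):
        if position >= len(template):
            break
        nucleotide = flow_order[i % len(flow_order)]
        count = 0
        while position < len(template) and PAIRS[template[position]] == nucleotide:
            count += 1
            position += 1
        flows.append((nucleotide, count))
    return flows


def call_bases(flows):
    """Convert idealized integer signals back into a sequence."""
    return "".join(nucleotide * count for nucleotide, count in flows)


flows = flowgram("AACGT")
print(flows)
print(call_bases(flows))
```

The first flow in this example gives a signal of 2 because two T residues are incorporated opposite the A-A run; in a real instrument the run length has to be estimated from the light intensity, which becomes harder for longer runs.
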
One of the main prerequisites for applying the array-based pyrosequencing approach is covering individual beads with multiple copies of the same molecule. This is done by first creating sequencing libraries in which every individual molecule gets two different adapter sequences, one at the 5′ end and one at the 3′ end of the molecule. In the case of the 454/Roche sequencing library preparation, this is done by sequential ligation of two pre-synthesized oligos. One of the adapters added is complementary to oligonucleotides on the sequencing beads and thus allows molecules to be bound to the beads by hybridization. Low molecule-to-bead ratios and amplification from the hybridized double-stranded sequence on the beads (kept separate using polymerase chain reaction in a water-in-oil emulsion, i.e. emulsion PCR) make it possible to grow beads with thousands of bound copies of a single starting molecule. Using the second adapter, beads covered with molecules can be separated from empty beads (using capture beads with oligonucleotides complementary to the second adapter) and are then used in the sequencing reaction as described above. The average substitution error rate is in the range of 10^-3 to 10^-4, which is higher than that of Sanger sequencing but is the lowest among the new sequencing technologies. As mentioned earlier for Sanger sequencing, in vitro amplifications performed for the sequencing preparation cause a higher background error rate, that is, the error introduced into the sample before it enters the sequencing process. In addition, in bead preparation (i.e. the emulsion PCR step) a fraction of the beads end up carrying copies of multiple different sequences. These “mixed beads” will participate in a high number of incorporations per flow cycle, resulting in sequencing reads that do not reflect real molecules. Most of these reads are automatically filtered during the software post-processing of the data.
The filtering of mixed beads may, however, cause a depletion of real sequences with a high fraction of incorporations per flow cycle. Most of these problems can be resolved by higher coverage. Strong light signals in one well of the picotiter plate may also result in insertions in sequences in neighbouring wells. If the neighbouring well is empty, this can generate so-called ghost wells, i.e. wells for which a signal is recorded even though they contain no sequence template, so that the intensities measured are caused entirely by bleed-over signal from the neighbouring wells. In 454 sequencing, the error rate rises over the course of a run because of (1) a reduction in enzyme efficiency, (2) some molecules on the beads no longer being elongated and (3) an increasing, so-called phasing effect. Phasing is observed when a population of DNA molecules amplified from the same starting molecule (an ensemble) is sequenced, and describes the process whereby not all molecules in the ensemble are extended in every cycle. This causes the molecules in the ensemble to lose synchrony (phase), and results in an echo of the preceding cycles being added to the signal as noise. The current 454/Roche GS FLX Titanium platform makes it possible to sequence about 1.5 million such beads in a single experiment and to determine sequences of between 300 and 500nt in length. The read length is limited largely by the efficiency of polymerases and luciferases, which drops over the sequencing run and results in decreased base qualities.
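Why a low molecule-to-bead ratio matters for emulsion PCR can be made concrete with a rough calculation. The sketch below assumes that the number of template molecules captured per bead follows a Poisson distribution with a mean equal to the molecule-to-bead ratio. That distribution is a modelling assumption that does not come from the text, and the ratios used are purely illustrative.

```python
import math


def bead_loading_fractions(molecules_per_bead):
    """Fractions of empty, single-template and mixed beads (Poisson model)."""
    lam = molecules_per_bead
    empty = math.exp(-lam)           # P(0 templates on a bead)
    single = lam * math.exp(-lam)    # P(exactly 1 template)
    mixed = 1.0 - empty - single     # P(2 or more templates)
    return empty, single, mixed


def mixed_share_of_loaded(molecules_per_bead):
    """Share of template-carrying beads that carry more than one template."""
    empty, single, mixed = bead_loading_fractions(molecules_per_bead)
    return mixed / (single + mixed)


for ratio in (0.05, 0.1, 0.5, 1.0):
    empty, single, mixed = bead_loading_fractions(ratio)
    print(f"ratio={ratio:4.2f}  empty={empty:.3f}  single={single:.3f}  "
          f"mixed={mixed:.4f}  mixed among loaded={mixed_share_of_loaded(ratio):.3f}")
```

Under this model, lowering the ratio reduces the share of mixed beads among the loaded beads, at the cost of leaving most beads empty. This is consistent with the need to separate covered beads from empty beads using the second adapter.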

10. Illumina Genome Analyzer

The reversible terminator technology used by the Illumina Genome Analyzer employs a sequencing-by-synthesis concept similar to that used in Sanger sequencing. The incorporation reaction is stopped after each base, the fluorescent label of the incorporated base is read out, and the sequencing reaction is then continued with the incorporation of the next base. Similar to 454/Roche, the Illumina sequencing protocol also requires that the sequences to be determined are converted into a sequencing library, which allows them to be amplified and immobilized for sequencing. For this purpose two different adapters are added to the 5’ and 3’ ends of all molecules by ligation of so-called forked adapters. The library is then amplified using longer primer sequences which extend and further diversify the adapters, i.e. add further unique nucleotides at both adapter ends, to create the final sequence needed in subsequent steps. This double-stranded library is melted using sodium hydroxide to obtain single-stranded DNA molecules, which are then pumped at a very low concentration through the channels of a flow cell. This flow cell has on its surface two populations of immobilized oligonucleotides complementary to the two different single-stranded adapter ends of the sequencing library. These oligonucleotides hybridize to the single-stranded library molecules. By reverse strand synthesis starting from the hybridized (double-stranded) part, the new strand being created is covalently bound to the flow cell. If this new strand bends over and attaches to another oligonucleotide complementary to the second adapter sequence on the free end of the strand, it can be used to synthesize a second covalently bound reverse strand. This process of bending and reverse strand synthesis, called bridge amplification, is repeated several times and creates what are termed clusters: the accumulation of several thousand copies of the original sequence in very close proximity to each other on the flow cell.
These randomly distributed clusters contain molecules that represent the forward as well as the reverse strands of the original sequences. Before determining the sequence, one of the strands has to be removed to prevent it from hindering the extension reaction sterically or by complementary base pairing. Selective strand removal targets base modifications of the oligonucleotide populations on the flow cell. Following strand removal, each cluster on the flow cell consists of single-stranded, identically oriented copies of the same sequence, which can be sequenced by hybridizing the sequencing primer onto the adapter sequences and starting the reversible terminator chemistry. “Solexa sequencing”, as it was introduced in early 2007, initially allowed for the simultaneous sequencing of several million very short sequences (at most 26nt) in a single experiment. In recent years, flow cell cluster densities on the Illumina Genome Analyzer (GA) have increased (to more than 300 million clusters per run), a wider range of the flow cell is imaged, and sequence reads of up to 125nt can be generated. In addition, a second read can be obtained from the opposite end of each molecule. This is achieved by chemically melting and washing away the synthesized sequence, repeating a few bridge amplification cycles for reverse strand synthesis and then selectively removing the starting strand (again using base modifications of the flow cell oligonucleotide populations), before blocking 3’ ends and annealing another sequencing primer for the second read. Using this “paired end sequencing” approach, approximately twice the amount of data can be generated. The Illumina library and flow cell preparation includes several in vitro amplification steps, which cause a high background error rate and contribute to the average error rate of about 10^-2 to 10^-3. Further, the flow cell preparation creates a fraction of ordinary-looking clusters which are initiated from more than one individual sequence. These result in mixed signals and mostly low quality sequences for these clusters.
In an effect similar to the 454 ghost wells, the Illumina image analysis software may identify reagent crystals, dust and lint particles as clusters and call sequences from these. Similar to the other platforms, the error rate increases with increasing position in the determined sequence. This is mainly due to phasing, which increases the background noise as sequencing progresses. While the ensemble sequencing process for pyrosequencing creates unidirectional phasing from lagging, non-extended molecules, reversible terminator sequencing creates bi-directional phasing, as some incorporated nucleotides may also fail to be correctly terminated, allowing the extension of the sequence by another nucleotide in the same cycle. The Genome Analyzer uses four fluorescent dyes to distinguish the four nucleotides A, C, G and T. Of these, two pairs (A/C and G/T) are each excited using the same laser, are similar in their emission spectra, and show only limited separation using optical filters. Therefore, the most frequent substitution errors observed are between A and C and between G and T. Even though Illumina Genome Analyzer reads show a higher average error rate and are considerably shorter than 454/Roche reads, this instrument determines more than 10,000Mb per day at a price of about $0.50/Mb. This is a daily throughput more than ten times higher than that of 454/Roche, at a considerably lower price per megabase.
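The difference between unidirectional and bi-directional phasing can be sketched with a simple ensemble model. The model assumes that in every cycle each molecule independently fails to extend with probability `p_lag`, extends by one extra base with probability `p_lead`, and otherwise extends correctly. The function name, the independence assumption and the probabilities used below are illustrative and are not measured values for any instrument.

```python
def phase_distribution(cycles, p_lag, p_lead=0.0):
    """Distribution of molecule offsets relative to the expected position.

    Offset 0 means in phase, negative offsets are lagging molecules and
    positive offsets are leading molecules. Returns {offset: fraction}.
    """
    p_ok = 1.0 - p_lag - p_lead
    if p_ok < 0.0:
        raise ValueError("p_lag + p_lead must not exceed 1")
    dist = {0: 1.0}
    for _ in range(cycles):
        nxt = {}
        for offset, frac in dist.items():
            for step, prob in ((-1, p_lag), (0, p_ok), (1, p_lead)):
                if prob > 0.0:
                    nxt[offset + step] = nxt.get(offset + step, 0.0) + frac * prob
        dist = nxt
    return dist


for cycles in (25, 50, 100):
    lag_only = phase_distribution(cycles, p_lag=0.01)
    both = phase_distribution(cycles, p_lag=0.01, p_lead=0.005)
    print(f"cycle {cycles:3d}: in phase {lag_only[0]:.3f} (lagging only), "
          f"{both[0]:.3f} (lagging and leading)")
```

With `p_lead` set to zero the model only produces lagging molecules, corresponding to the unidirectional case described for pyrosequencing; a non-zero `p_lead` spreads the ensemble in both directions. In both cases the in-phase fraction shrinks with every cycle, which is one way to picture why error rates rise towards the end of a read.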

11. Life Technologies SOLiD

The prototype of SOLiD was developed by George Church’s laboratory at Harvard Medical School and the Howard Hughes Medical Institute and was published in 2005. The principle used in SOLiD is sequencing-by-ligation, which is very different from the approaches discussed thus far: the sequence extension reaction is carried out by ligases rather than polymerases. In the sequencing-by-ligation process, a sequencing primer is hybridized to single-stranded copies of the library molecules to be sequenced. A mixture of 8mer probes carrying four distinct fluorescent labels competes for ligation to the sequencing primer. The fluorophore encoding, which is based on the two 3’-most nucleotides of the probe, is read. Three bases including the dye are then cleaved from the 5’ end of the probe, leaving a free 5’ phosphate on the primer (now extended by five nucleotides), which is then available for further ligation. After multiple ligations (typically up to 10 cycles), the synthesized strands are melted and the ligation product is washed away before a new sequencing primer (shifted by one nucleotide) is annealed. Starting from the new sequencing primer, the ligation reaction is repeated. The same process is followed for three other primers, facilitating the read out of the dinucleotide encoding for each start position in the sequence. Using a specific fluorescent label encoding, the dye read outs (i.e. colors) can be converted to a sequence. This conversion from color space to sequence requires a known first base, which is the last base of the library adapter sequence. Given a reference sequence, this encoding system allows for the detection of machine errors and the application of an error correction to reduce the average error rate. In the absence of a reference sequence, however, a single error in the dye read out makes the color conversion fail: the sequence downstream of the error is decoded incorrectly.
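The conversion between color space and sequence, and the way a single color error propagates, can be sketched as follows. The text does not spell out the encoding table; the sketch assumes the commonly described two-base code in which the bases A, C, G and T are numbered 0 to 3 and the color of a dinucleotide is the bitwise XOR of its two base numbers. The function names are hypothetical.

```python
BASES = "ACGT"
BASE_TO_BITS = {base: index for index, base in enumerate(BASES)}


def encode_colors(sequence):
    """Encode a base sequence (starting with the known first base) as colors 0-3."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(sequence, sequence[1:])]


def decode_colors(first_base, colors):
    """Decode colors into bases, starting from the known first base."""
    bases = [first_base]
    for color in colors:
        # Each base is determined by the previous base and the current color.
        bases.append(BASES[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


sequence = "TACGGT"              # the first base T plays the role of the last adapter base
colors = encode_colors(sequence)
print(colors, decode_colors("T", colors))

corrupted = list(colors)
corrupted[1] = 2                 # a single misread dye (true color is 1)
print(corrupted, decode_colors("T", corrupted))
```

Because every decoded base depends on the base before it, one wrong color changes all bases downstream of the error. When the reads are compared to a reference in color space instead, an isolated color mismatch can be recognised as a likely measurement error.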
For parallelization, the sequencing process uses beads covered with multiple copies of the sequence to be determined. These beads are created in a similar fashion to that described earlier for the 454/Roche platform. In contrast to the 454/Roche technology, the SOLiD system does not use a picotiter plate for fixation of the beads in the sequencing process. Instead, the 3’ ends of the sequences on the beads are modified in a way that allows them to be covalently bound onto a glass slide. As for the Illumina system, this creates a random dispersion of the beads in the sequencing chamber and allows for higher loading densities. However, random dispersion complicates the identification of bead positions from images, and results in the possibility that chemical crystals, dust and lint particles can be misidentified as clusters. Further, dispersal of the beads results in a wide range of inter-bead distances, and beads therefore differ in their susceptibility to signals from neighbouring beads. Types and causes of sequence errors are diverse. First, the in vitro amplification steps cause a higher background error rate than the in vivo amplification used in the Sanger cloning approach. Second, beads carrying a mixture of sequences and beads in close proximity to one another create false reads and low quality bases. Further, signal decline and incomplete dye removal result in increasing error as the ligation cycles progress. Phasing, as described earlier, is a minor issue on this platform, as sequences not extended in the last cycle are non-reversibly terminated using phosphatases. Since hybridization is a stochastic process and probes do not necessarily hybridize adjacent to the (extended) sequencing primer, this termination causes a considerable reduction in the number of molecules participating in subsequent ligation reactions, and therefore substantial signal decline. Given the high efficiency of phosphatases, the remaining phasing effect can be considered very low.
However, incomplete cleavage of the dyes may allow cleavage to occur in the next ligation reaction instead, which then allows for extension in the next but one cycle. This causes a different phasing effect and additional noise from the previous cycle’s dyes in the dye identification process. The SOLiD system currently allows sequencing of more than 300 million beads in parallel, with a typical read length of between 25 and 75nt.

12. Helicos HeliScope

The HeliScope is able to sequence individual molecules instead of molecule ensembles created by an amplification process. The advantage of single-molecule sequencing is that it is not affected by biases or errors introduced in a library preparation or amplification step. A further advantage is that it may facilitate sequencing of minimal amounts of input DNA, which is a requirement in many research projects. Using methods able to detect non-standard nucleotides, it could also allow for the identification of DNA modifications, which are commonly lost in the in vitro amplification process. Only a small number of HeliScope instruments were installed, which might be surprising given the advantages of single-molecule sequencing. This reflects the specific limitations of this platform, its price (about one million dollars), and a relatively small market. The technology applied by the HeliScope could be termed asynchronous virtual terminator chemistry. Input DNA is fragmented and melted before a poly-A tail is synthesized onto each single-stranded molecule using a polyadenylate polymerase. In the last step of polyadenylation, a fluorescently labelled adenine is added. The library, i.e. the polyadenylated single-stranded DNA, is washed over a flow cell where the poly-A tails bind to poly-T oligonucleotides. The coordinates of the bound molecules on the flow cell are determined using a fluorescence-based read out of the flow cell. Once these coordinates have been identified, the fluorescent label of the 3’ adenine is removed and the sequencing reaction is started. Polymerases are washed through the flow cell with one type of fluorescently labelled nucleotide (A, C, G or T) at a time, and the polymerases extend the reverse strand of the sequences starting from the poly-T oligonucleotides. Nucleotide incorporation by the polymerases is slowed down by the fluorescent labelling, which allows for at most one incorporation before the polymerase is washed away together with the non-incorporated nucleotides.
The flow cell is then imaged again, the fluorescent dyes are removed and the reaction is continued with another nucleotide. In this process not every molecule is extended in every cycle, which is why it is an asynchronous sequencing process resulting in sequences of different lengths. Since single molecules are sequenced, the signals being measured are weak, and misincorporation errors cannot be corrected by an ensemble effect. Because molecules are attached to the flow cell by hybridization only, template molecules can be lost in the wash steps. In addition, molecules may be irreversibly terminated by the incorporation of incorrectly synthesized nucleotides. Overall, reads of 24 to 70nt are shorter than those of the other platforms. Due to the higher number of sequences determined in parallel, the total throughput per day is in a similar range to that of the SOLiD system. The average error rate is slightly higher than for the other instruments and is biased towards insertions and deletions rather than substitutions.
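The asynchronous character of this chemistry can be illustrated with a toy simulation of individual molecules. The sketch assumes that each flow offers one nucleotide type, that a molecule incorporates at most one base per flow, and that a possible incorporation succeeds only with a fixed probability `efficiency`. The flow order "CTAG", the efficiency value and the function name are illustrative assumptions, and effects such as template loss or undetected incorporations are left out.

```python
import random

# Nucleotide that pairs with each template base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_single_molecule(template, n_cycles, efficiency=1.0,
                             flow_order="CTAG", rng=None):
    """Simulate one molecule; return the bases read after n_cycles.

    One cycle is one pass through flow_order. At most one base is added
    per flow, so a homopolymer of length n needs at least n cycles.
    """
    rng = rng if rng is not None else random.Random()
    read = []
    pos = 0
    for _ in range(n_cycles):
        for nucleotide in flow_order:
            if pos >= len(template):
                return "".join(read)
            if COMPLEMENT[template[pos]] == nucleotide and rng.random() < efficiency:
                read.append(nucleotide)
                pos += 1  # single incorporation, then the polymerase is washed away
    return "".join(read)


template = "AACGTTAGCCATGGA"
rng = random.Random(1)  # fixed seed so the run is repeatable
reads = [sequence_single_molecule(template, 6, efficiency=0.8, rng=rng)
         for _ in range(5)]
for read in reads:
    print(len(read), read)
```

Molecules that miss an incorporation simply fall behind and catch up in a later flow of the same nucleotide, so identical templates yield reads of different lengths after the same number of cycles. This is the asynchronous behaviour described above.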

13. Single Molecule Real Time (SMRT)

Another technology for sequencing individual molecules is Single Molecule Real Time (SMRT) sequencing, developed by Pacific Biosciences. This technology performs the sequencing reaction on silicon dioxide chips. The chips carry a 100nm metal film containing thousands of holes tens of nanometers in diameter, so-called zero-mode waveguides (ZMWs). Each ZMW is used as a nano visualization chamber, providing a detection volume of about 20 zeptoliters (20 × 10^-21 liters). At this volume, a single molecule can be illuminated while excluding other labeled nucleotides in the background, saving time and sequencing chemistry by omitting wash steps. For SMRT sequencing, a single DNA polymerase is fixed to the bottom surface within the detection volume of each ZMW. Nucleotides with different fluorescent dyes attached to the phosphate chain are used in concentrations allowing normal enzyme processivity. As the polymerase incorporates a complementary nucleotide, the nucleotide is held within the detection volume for tens of milliseconds, orders of magnitude longer than for unspecific diffusion events. This way the fluorescent dye of the incorporated nucleotide can be identified during normal speed reverse strand synthesis [57]. Further, because the fluorescent dyes are attached to the phosphate chain of the deoxynucleotides, the dye is released with the cleaved pyrophosphate, generating an unmodified complementary DNA strand. In pilot experiments, Pacific Biosciences showed that their technology allows for direct sequencing of a few thousand bases before the polymerase is denatured due to optical and thermal stress from the laser read-out of the dyes. They were also able to show that they can measure differences in polymerase kinetics to such an extent that modified nucleotides may be detected. The SMRT technology was intended for release in the fourth quarter of 2010.
Due to this recent release, the amount of information on the actual instrument is very limited, and it is likely that further development will be needed over the coming years to create a robust system.
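The role of the very small ZMW detection volume can be illustrated with a short calculation of how many freely diffusing labelled nucleotides are expected inside it at any moment. The 20 zeptoliter volume is taken from the text; the nucleotide concentrations are illustrative assumptions, since the text only states that concentrations allow normal enzyme processivity.

```python
AVOGADRO = 6.02214076e23   # molecules per mole
ZMW_VOLUME_L = 20e-21      # 20 zeptoliters, as stated in the text


def mean_molecules_in_volume(concentration_molar, volume_liters):
    """Expected number of molecules in a volume at a given molar concentration."""
    return concentration_molar * AVOGADRO * volume_liters


for micromolar in (0.1, 1.0, 10.0):   # assumed concentrations, for illustration only
    n = mean_molecules_in_volume(micromolar * 1e-6, ZMW_VOLUME_L)
    print(f"{micromolar:5.1f} uM -> {n:.4f} labelled nucleotides on average")
```

At micromolar concentrations the expected number of background nucleotides in the detection volume is far below one, which is consistent with the statement that a nucleotide held by the polymerase can be observed without washing away the labelled nucleotides in solution.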

14. Summary and conclusions

The technologies discussed make it possible for even single research groups to generate large amounts of sequence data very rapidly and at substantially lower cost than traditional Sanger sequencing. The error profiles and limitations observed for the new platforms differ significantly from those of Sanger sequencing and between approaches. Because different research problems have different requirements, vendors have recently started to offer budget versions of their instruments with lower sequencing capacity. Often the choice of an appropriate sequencing platform is project-specific, and sometimes even combinations can be advantageous. This may open the market further to companies and sequencing centres providing sequencing-on-demand services. In the future, laboratories will need to invest considerable time, expertise and money in the design of experiments and the analysis of the vast quantities of data that will be generated. The infrastructure required for storing, handling and analyzing gigabytes of raw sequence data and the many intermediate files generated by these instruments is a heavy burden for smaller research groups. Even for larger groups and experienced genome centres this aspect remains an ever-increasing challenge. Thus financial considerations in particular, the number of projects requiring high-throughput data, and the interest in implementing one’s own improvements to the instruments and protocols are important factors in instrument acquisition.