The sequencing strategy we have successfully implemented couples highly random, nebulizer-based physical shearing with forward and reverse universal primer sequencing off double-stranded shotgun sub-clones, using fluorescent-labeled Taq terminators with ABI377-based data collection and analysis, followed by semi-automated sequence assembly with a variety of programs, namely Phred/Phrap, FakII and CAP2. In accord with the Bermuda Meeting agreement, the resulting contigs greater than 2 kb are submitted to GenBank as Level 1 data within 24 hours of being generated.
Subsequently, the initial closure and proofreading begins by employing semi-automated scripts to aid in picking the clones for the long reads needed for contig ordering and to cover regions where the sequence redundancy is less than the "Rule of 3". Once the contigs are ordered, the sequences are submitted to GenBank as Level 2 data. Then, final closure and proofreading is performed by custom synthetic primer walking, generating PCR-based templates for sequencing either directly or via an intermediate shotgun-based approach to close physical gaps, and/or direct sequencing off the target BAC, PAC, cosmid or fosmid clone. The resulting unique, unambiguous sequence then is analyzed via a semi-automated procedure that employs XGrail, Powerblast and BlastX to pinpoint regions of structural and biological interest (both those predicted algorithmically and those based on known sequence homology). Finally, the resulting sequence and annotation are written to an ASN.1 format file for viewing and proofreading via Musk and/or Sequin and then submitted directly to GenBank as Level 3 data.
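The check for regions whose sequence redundancy falls below the "Rule of 3" can be sketched in a few lines of Python. This is only a minimal illustration and not the semi-automated scripts used in our laboratory. It assumes the rule is read as "at least three reads covering every base", and the function name low_depth_regions, the (start, end) read intervals and the min_depth parameter are all hypothetical.

```python
from typing import List, Tuple


def low_depth_regions(contig_length: int,
                      reads: List[Tuple[int, int]],
                      min_depth: int = 3) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals whose read depth is below min_depth.

    Each read is a half-open (start, end) interval in 0-based contig coordinates.
    The min_depth default of 3 is an assumed reading of the "Rule of 3".
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    delta = [0] * (contig_length + 1)
    for start, end in reads:
        start = max(0, start)
        end = min(contig_length, end)
        if start < end:
            delta[start] += 1
            delta[end] -= 1

    regions = []
    depth = 0
    region_start = None
    for pos in range(contig_length):
        depth += delta[pos]  # running sum gives the depth at this base
        if depth < min_depth:
            if region_start is None:
                region_start = pos
        elif region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    if region_start is not None:
        regions.append((region_start, contig_length))
    return regions
```

The intervals returned are the candidate regions for additional long reads or custom primer reactions.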
Several novel but effective procedures have been developed in our laboratory
and elsewhere to facilitate the above sequencing strategy as outlined in Table
1. The first step in this strategy is to obtain sufficient quantities of
target BAC, PAC, fosmid or cosmid DNA free of host genomic DNA. We earlier had
developed and published a modified alkaline lysis procedure coupled with a
subsequent diatomaceous earth-based purification that yields high quantities
(ca. 1 mg/l of culture) of large insert-containing clones with less than 5% E.
coli genomic DNA contamination. The detailed protocols for this, and
subsequently discussed procedures, are available on our Web Site
(http://www.genome.ou.edu/proto.html).
However, at present we have shifted to a "double acetate" protocol which
was developed at the Washington University Genome Sequencing Center in
St. Louis, MO.
A copy of this protocol is available elsewhere on our web site at URL:
(http://www.genome.ou.edu/DblAcetateProcV3.html)
All target clones are archived and characterized as described in the following figure.
The second step in our shotgun-based sequencing strategy is to obtain a highly random population of physically sheared target large insert clone fragments. To this end we have developed and implemented a nebulizer-based method for physically shearing the target clones under conditions where GC bias is reduced (in 50% glycerol at -20 degrees C) and the resulting fragment size range is quite narrow. This procedure has been published and the most recent version again is available on our Web site (http://www.genome.ou.edu/proto.html).
After nebulization, the resulting fragments are made blunt-ended by Klenow and T7 DNA polymerase and phosphorylated by treatment with T4 kinase and rATP immediately prior to fractionation on a low-melting agarose gel. Fragments in the 1-3 kb and 3-6 kb size ranges are eluted, concentrated and used separately for ligation into SmaI/BAP-treated pUC sequencing vector. The host bacteria then are transformed via electroporation, a procedure that gives both more optimal and more random transformants than are obtained when competent cells are produced by calcium chloride treatment. The resulting transformants are picked and grown in 2.5 ml of Terrific Broth in a 96-well microtiter deep-well block overnight, typically for 16-18 hours rather than 20-22 hours, with shaking at 37 degrees C. Typically 2-3 times more colonies are picked from transformants resulting from ligation of the larger (3-6 kb) inserts than from the smaller (1-3 kb) inserts. The cells harboring the shotgun sub-clones are harvested by centrifugation and the sequencing templates are isolated via the alkaline-lysis method, adapted for simultaneous isolation from four 96-well blocks in less than 4 hours on a single Biomek 2000. This procedure and the Biomek 2000 isolation program are available for downloading from our Web site (http://www.genome.ou.edu/proto.html) and FTP Site (ftp://ftp.genome.ou.edu). Following ethanol precipitation, drying, and re-suspension of the shotgun sequencing templates in 50 ul of sterile, double-distilled water, the DNAs are analyzed by agarose gel electrophoresis. Then, 1-2 ul of sequencing template is robotically dispensed into separate wells of a 384-well Robbins microtube plate via a Robbins Hydra 96 pipetting robot equipped with a 384-well rotating stage.
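For sample tracking, the transfer from 96-well blocks into a 384-well plate can be described by a simple well-mapping function. The sketch below assumes the common interleaved-quadrant layout, in which each of four 96-well sources fills every second row and column of the 384-well plate. The layout actually produced by the Hydra 96 rotating stage is not stated here, so the quadrant numbering and the function name well_96_to_384 are hypothetical.

```python
import string

ROWS_96 = "ABCDEFGH"
ROWS_384 = string.ascii_uppercase[:16]  # rows A through P


def well_96_to_384(quadrant: int, well: str) -> str:
    """Map a 96-well position (e.g. "A1") to a 384-well position.

    quadrant is 0-3 and selects which interleaved quarter of the 384-well
    plate receives the 96-well source (assumed layout, see text above).
    """
    if quadrant not in (0, 1, 2, 3):
        raise ValueError("quadrant must be 0, 1, 2 or 3")
    row = ROWS_96.index(well[0].upper())  # raises ValueError on a bad row
    col = int(well[1:]) - 1
    if not 0 <= col < 12:
        raise ValueError("column must be 1-12")
    row_384 = 2 * row + quadrant // 2
    col_384 = 2 * col + quadrant % 2
    return f"{ROWS_384[row_384]}{col_384 + 1}"
```

Under this assumed layout, well A1 of the four sources lands in A1, A2, B1 and B2 of the 384-well plate, and every 384-well position is used exactly once.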
Subsequently, the 3-4 ul sequencing reaction mix, containing dNTPs, fluorescent-labeled ddNTPs, buffer, 5-10% DMSO and enzyme (Taq FS), is robotically pipetted by the Robbins Hydra 96 and the plate is incubated in a thermocycler as described using the Big Dyes. Once the cycling reactions are complete, the unincorporated dye terminators are removed by chromatography on G-50 columns made in a microtiter filter plate, again with the Robbins Hydra 96 performing all required pipetting steps. The reaction products are collected by centrifugation through the G-50 column, dried in vacuo and stored at -20 degrees C prior to addition of EDTA/formamide/blue dextran and loading onto the ABI (4.25% PAGE, 6 M urea) gels made from FMC Long Ranger pre-mix in a 64 sample wells/gel configuration. At present we are beginning to integrate the 96 sample wells/gel configuration using a 0.2 mm 96-well comb, the standard 48 cm WTR plates and a Klone 8-channel variable spacing syringe. The detailed protocols for the sequencing reactions also are available on our Web site (http://www.genome.ou.edu/proto.html).
After electrophoresis and data collection, the resulting data are re-tracked, analyzed with the ABI-supplied software, and then transferred via Fetch to a Sparc computer for sequence assembly via Phred/Phrap, CAP2 or FakII. Once the first 100 sequence reactions have been analyzed, the data are assembled and examined via powblast versus the E. coli database for bacterial host contamination. If the host genomic contamination is < 5%, then an additional 400 sequencing reactions (200 forward and 200 reverse) are completed per 50 kb of clone. Then, the vector removal and data clipping needed for FakII and CAP2 are performed by the autoted program and scripts to produce an editable XGAP database, and/or the entire raw sequence is analyzed by Phred and Phrap to produce a Consed-editable database. This Level 1 shotgun data then is submitted to GenBank prior to beginning the closure and proofreading process, which uses long reads from Long Ranger gels run for 8-10 hours, custom synthetic primers and dye terminators and, if necessary, PCR amplification of physical gap regions and subsequent sequencing of the PCR products for final closure and proofreading. Once the individual contigs are aligned, although not necessarily contiguous nor finally proofread, they are submitted to GenBank as Level 2 data. Then, once sufficient sequence data is collected to join all contigs and to fulfill the "Rule of 3", and the final, unique, unambiguous sequence is obtained, it is annotated via the semi-automated XGrail, Powerblast and BlastX programs to produce an ASN.1 format file that can be viewed and edited via Musk and Sequin and then submitted to GenBank as Level 3 data.
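The pilot-phase decision described above (check the first reactions for host contamination, then scale up) amounts to simple arithmetic. The following Python sketch shows one way it might be written. The 5% limit and the 400 reactions per 50 kb are taken from the text, while the function names, the dictionary layout and the idea of counting E. coli hits per read are assumptions made for illustration.

```python
def host_contamination_fraction(num_pilot_reads: int, num_host_hits: int) -> float:
    """Fraction of pilot reads that matched the E. coli database."""
    if num_pilot_reads <= 0:
        raise ValueError("num_pilot_reads must be positive")
    return num_host_hits / num_pilot_reads


def plan_shotgun(num_pilot_reads: int,
                 num_host_hits: int,
                 clone_size_kb: float,
                 max_contamination: float = 0.05,
                 reactions_per_50kb: int = 400) -> dict:
    """Decide whether to continue and how many further reactions to run.

    Continues only if contamination is below max_contamination, then
    splits the additional reactions evenly between forward and reverse.
    """
    fraction = host_contamination_fraction(num_pilot_reads, num_host_hits)
    if fraction >= max_contamination:
        return {"proceed": False, "contamination": fraction,
                "forward": 0, "reverse": 0}
    total = round(reactions_per_50kb * clone_size_kb / 50)
    forward = total // 2
    return {"proceed": True, "contamination": fraction,
            "forward": forward, "reverse": total - forward}
```

For a 50 kb clone with 3 host hits among the first 100 reads, this sketch returns 200 forward and 200 reverse reactions.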
B. Closure, proofreading, finishing and polishing
The major underlying principle that we are striving to maintain is to obtain a final sequence which is based solely on accurate sequence data with minimal, if any, human intervention and interpretation. Because our tendency is to want to view the data at various stages in a project, the Consed program is an extremely useful tool for viewing Phred/Phrap assembled data. However, although it now is possible to do some editing in Consed, it is more advisable to use Consed and the Phred/Phrap output data mainly as tools to design the appropriate experiments needed to obtain the actual sequence data that will result in a more accurate sequence. It is tempting for individuals to believe that they know more than the computer does, and of course that is true. However, if the appropriate underlying data is collected, it is possible that Phred/Phrap can come up with a consensus sequence that would agree with that obtained by intuitive reasoning by a well trained, highly experienced individual.
Over the years we have investigated many strategies for obtaining the final, complete, unambiguous sequence of cloned DNAs. Our present schema for our data flow is shown in the following figure.
Recently, however, with the availability of new and more efficient primer picking
programs (such as Primer from the Whitehead Institute, Primo from
South West Medical Center and our modification of it called PrimOU),
and the availability of robust, inexpensive oligonucleotide synthesis
instruments (such as the MerMade developed at South West Medical
Center in Dallas), a more effective use of Custom Synthetic Primers
that can facilitate the closure, proofreading, finishing and polishing
phase of a "target clone" now is a reality.
Our present strategy therefore takes advantage of these inexpensive,
readily synthesized primers that allow for a more directed closure
and proof reading approach.
This strategy entails first determining which shotgun clones will
extend the resulting contigs and then using them as templates for:
1. confirming this observation by re-sequencing each of these templates
with both forward and reverse universal primers with the BigDye terminators,
2. and then re-sequencing each of these templates with the PrimOU chosen custom
synthetic primers with the BigDye terminators.
3. One or more additional rounds of PrimOU chosen custom synthetic primers
may be needed to close any remaining gaps.
4. However, regions that are GC rich or contain polyA or polyT
tracts often may not be satisfactorily extended by the above approach. In those
instances, the reactions are repeated with the custom synthetic primers
using either the dRhodamine terminators, to extend polyA or polyT tracts, or
the newly available ABI BigDye mix that has the dITP replaced by dGTP,
which often is very effective in reading through GC rich regions.
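The first step of this strategy, finding the shotgun clones that will extend a contig, can be illustrated with a short Python sketch. It is not the script used in our laboratory. It assumes each placed read carries its contig coordinates, its strand and an estimated sub-clone insert size, and that a clone extends the contig when its insert is predicted to run past a contig end. The PlacedRead class and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PlacedRead:
    clone: str        # sub-clone name (hypothetical identifier)
    start: int        # 0-based start of the read on the contig
    end: int          # end of the read on the contig (exclusive)
    strand: str       # "+" reads left to right, "-" reads right to left
    insert_size: int  # estimated insert size of the sub-clone in bp


def extending_clones(reads: List[PlacedRead],
                     contig_length: int) -> Dict[str, List[str]]:
    """Return clones whose inserts are predicted to overhang a contig end."""
    result: Dict[str, List[str]] = {"left": [], "right": []}
    for read in reads:
        if read.strand == "+" and read.start + read.insert_size > contig_length:
            # Insert begins at the read start and runs off the right end.
            result["right"].append(read.clone)
        elif read.strand == "-" and read.end - read.insert_size < 0:
            # Insert ends at the read end and runs off the left end.
            result["left"].append(read.clone)
    return result
```

Clones reported for an end are candidates for re-sequencing with the universal primers and, subsequently, with custom synthetic primers.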
In those instances where a shotgun clone template is not available for gap closure over a given region, the development of an improved "large insert target" clone isolation protocol, i.e. BAC, PAC, Fosmid and Cosmid "double acetate" protocol has facilitated obtaining "large insert target" clone templates that can be used directly for sequencing without any prior sub-cloning or PCR amplification. Here again, the strategy entails first using the BigDye standard dITP containing dye terminators for sequencing directly off the "large insert target" clone templates and then if necessary, because of the presence of either a GC rich or polyA/polyT rich region, employing the dGTP containing BigDye terminators or the dRhodamine terminators, respectively, with PrimOU picked custom synthetic primers produced on the MerMade oligonucleotide synthesizer.
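The choice between the standard dITP BigDye terminators, the dGTP BigDye mix and the dRhodamine terminators could in principle be anticipated from the sequence flanking a gap. The Python sketch below is only an illustration of that idea. The GC cutoff of 0.65, the minimum tract length of 10 bases, the precedence given to polyA/polyT tracts and the function names are all assumptions, not values used in our laboratory.

```python
import re


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def has_poly_at_tract(seq: str, min_run: int = 10) -> bool:
    """True if seq contains a run of at least min_run A's or T's."""
    pattern = "A{%d,}|T{%d,}" % (min_run, min_run)
    return re.search(pattern, seq.upper()) is not None


def fallback_chemistry(seq: str,
                       gc_cutoff: float = 0.65,
                       min_run: int = 10) -> str:
    """Suggest a terminator chemistry for the region described by seq.

    gc_cutoff and min_run are hypothetical thresholds chosen for illustration.
    """
    if has_poly_at_tract(seq, min_run):
        return "dRhodamine"   # polyA/polyT tracts
    if gc_fraction(seq) >= gc_cutoff:
        return "BigDye dGTP"  # GC rich regions
    return "BigDye dITP"      # standard chemistry
```

In practice the standard chemistry is tried first, and a function such as this would only suggest which alternative to try when that reaction fails.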
Only in rare instances is it necessary to PCR a gap region for which no shotgun subclones or appropriate sequencing reagents or reaction conditions are available, and then to shear these PCR products into small fragments (approximately 200 to 500 bp) and sub-clone them into pUC sequencing vector for subsequent sequencing using the BigDye terminators and universal forward and reverse primers, followed by separately assembling this "shatter" library data to obtain a consensus sequence across this difficult region. A detailed protocol for producing these "shatter" clones is described in our protocol book in the section entitled "DNA sequencing based on PCR and random shotgun cloning".
Therefore, at present our strategy is to employ shotgun sequencing to 4-6 fold coverage and then employ Custom Synthetic Primers to obtain the final, unique, unambiguous sequence as described above.
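The relationship between the number of shotgun reactions and fold coverage is simple arithmetic: coverage equals the number of reads times the mean read length, divided by the clone size. The Python sketch below works through it. The 500 reactions per 50 kb (100 initial plus 400 additional) come from the procedure described earlier, but the 500 bp mean usable read length is an assumed figure for illustration only, as are the function names.

```python
import math


def fold_coverage(num_reads: int, mean_read_length_bp: float,
                  clone_size_bp: int) -> float:
    """Average number of times each base of the clone is sequenced."""
    return num_reads * mean_read_length_bp / clone_size_bp


def reads_needed(target_fold: float, mean_read_length_bp: float,
                 clone_size_bp: int) -> int:
    """Smallest number of reads that reaches target_fold coverage."""
    return math.ceil(target_fold * clone_size_bp / mean_read_length_bp)


# Assumed 500 bp usable read length; 500 reads on a 50 kb clone.
print(fold_coverage(500, 500, 50_000))
```

Under that assumed read length, 500 reactions on a 50 kb clone give 5-fold coverage, which falls within the 4-6 fold range stated above.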
This strategy has several main advantages:
1. It is readily automated,
2. It takes advantage of Phrap quality scores for determining
those regions for which additional sequence coverage, via new
Custom Synthetic Primer directed DNA sequencing should be performed.
3. It reduces the possibility of human decision making which may
result in unintentional sequencing errors,
4. The progress and accuracy of a project can be automatically
tracked and evaluated using computer-based approaches.
5. The final sequence is based on underlying data rather than
intuitive reasoning.
6. Ambiguous regions, i.e. regions that are more difficult to
obtain a sequence with a high degree of confidence, can be completed
before the entire "target" clone sequence is completed and thus the
analysis of "ordered and oriented" contigs can provide scientifically
useful information prior to entirely finishing the entire sequence.
7. Because many of the above processes reduce human intervention
by automating much of the decision making process, and because the
total cost of a typical 20-mer used as Custom Synthetic Primer is
less than $2, fewer highly trained and experienced "finishers" are
needed thereby easily recovering the minimally higher cost of primers.
8. Because the final fold coverage is approximately 6 to 7-fold (including
the sequencing reactions involving Custom Synthetic Primers), the same
amount of final sequence data can be obtained on fewer DNA sequencing
instruments, thus requiring fewer gels and a lower expense at the data
collection phase as well.
At present we are investigating the feasibility of switching from the shotgun phase to the Custom Synthetic Primer-based sequencing phase at a fold coverage less than 4-fold (as calculated by Phred/Phrap), but our preliminary evidence indicates that although this might be feasible, anything less than 3-fold shotgun coverage severely limits the ability to sequence off pUC sub-clones for closure and proofreading, as there is insufficient "target" clone coverage. Thus, the final closure and finishing must be done by sequencing directly off the "large insert target" clones (BAC, PAC, Fosmid or Cosmid). These reactions require 1/2x or 1/4x Big Dye reaction conditions rather than the 1/12x or 1/16x Big Dye reaction conditions used for pUC sub-clones and therefore the cost/reaction is increased almost 6-fold.
With the recently announced NHGRI "Working Draft" proposal, whose overall scheme is presented pictorially below,
we have been investigating rapid and efficient final closure and proofreading approaches. One approach that looks quite promising is a modified Multiplex PCR strategy which is shown here.
The details of this procedure can be obtained from our Multiplex PCR page.
C. Quality Control Statements and Perspective on Finishing
All of the data submitted by our laboratory to GenBank as Level 3 data has been sequenced following the "Rule of 3". Furthermore, we believe it is imperative that the initial sequence data as part of the early phase of the human genomic sequencing be as accurate as possible with the available funds. It is our opinion that there are three levels at which accuracy must be maintained. First, target large insert clones corresponding correctly to the genome must be chosen. Here, we presently are only sequencing representative clones that have been mapped and correspond to a region where there is at least three-fold depth of clone coverage. We also now are beginning to implement a restriction digestion strategy to ensure that the clone chosen is truly a representative of the region. Second, to ensure correct sequence assembly, we have developed scripts that allow us to obtain the final sequence from several assembly programs, namely Phred/Phrap, CAP2 and FakII. Thus, since all three programs assemble the data using slightly different algorithms, sequence read lengths and confidence levels, it is only when all three assemble the data correctly that we have renewed confidence in our final sequence. For this, and the other reasons stated above, we are requesting a multiprocessor UNIX computer and additional disk storage capabilities to facilitate this computer-based sequence validation process. Third, once we are confident that the final sequence is as accurate as possible and has been assembled correctly via the above methods, we then experimentally confirm the sequence. This is done easily via PCR using selected primer pairs that were used earlier in the closure phase. This interesting twist to sequence validation has been extremely useful and takes minimal work.
Specifically, two of the cosmids we recently thought were correctly assembled were shown not to be so via this PCR-based approach and additional sequencing reactions were needed to finally and unequivocally generate the final, unambiguous sequence.
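The idea of trusting a final sequence only when several assemblers agree can be expressed as a small comparison routine. The Python sketch below is not the set of scripts developed in our laboratory. It assumes each program's consensus is available as a plain string of A, C, G, T and N, treats a sequence and its reverse complement as the same result, and uses hypothetical function names.

```python
from typing import Dict

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string containing A, C, G, T and N."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def same_sequence(a: str, b: str) -> bool:
    """True if a and b are identical, allowing for opposite orientation."""
    a, b = a.upper(), b.upper()
    return a == b or a == reverse_complement(b)


def assemblies_agree(consensus_by_program: Dict[str, str]) -> bool:
    """True only if every program produced the same consensus sequence."""
    sequences = list(consensus_by_program.values())
    if not sequences:
        return False
    first = sequences[0]
    return all(same_sequence(first, other) for other in sequences[1:])
```

A disagreement between programs does not say which assembly is wrong; it only flags the clone for closer examination, for instance by the PCR-based confirmation described above.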
D. Policy on Annotation
Our present policy on sequence annotation is:
Minimal annotation to include title, submitter, contact, authors and collaborators, organism and molecular source
An update to Level 2 data that includes, in ASN.1 format, the results from Blast and Grail analysis as judged by Musk and XGrail. This information includes predicted features such as exons, introns, GC islands, Alu and other repeats, ESTs, and STSs.
Since most of the sequencing in our laboratory eventually will result in publication, as we obtain additional annotation, we submit updates directly to GenBank.
Data submission initially was an interesting experience, but with Steve Kenton here working closely with Jim Ostell, et al. at the NCBI/NLM, the process now has been partially automated and continues to improve daily.
E. Human Subject/IRB Approval Status of Clone Libraries
All of the target large insert clones we are sequencing have been and/or will be from laboratories which have obtained the appropriate approval status and mapping information.
Maps showing the location of the clones sequenced or in progress are available along with our protocols on our web site:
Bruce Roe, broe@ou.edu