Fold        P0 = e^-c                                        P0 x 100 =         %
coverage                                                     % not sequenced    sequenced
0.25        P0 = e^-0.25 = 1/e^0.25              = 0.78       78%                22%
0.50        P0 = e^-0.50 = 1/e^0.50              = 0.61       61%                39%
0.75        P0 = e^-0.75 = 1/e^0.75              = 0.47       47%                53%
1           P0 = e^-1  = 1/e^1  = 1/2.718        = 0.37       37%                63%
2           P0 = e^-2  = 1/e^2  = 1/7.389        = 0.135      13.5%              86.5%
3           P0 = e^-3  = 1/e^3  = 1/20.086       = 0.05       5%                 95%
4           P0 = e^-4  = 1/e^4  = 1/54.598       = 0.018      1.8%               98.2%
5           P0 = e^-5  = 1/e^5  = 1/148.4        = 0.0067     0.67%              99.33%
6           P0 = e^-6  = 1/e^6  = 1/403.4        = 0.0025     0.25%              99.75%
7           P0 = e^-7  = 1/e^7  = 1/1096.6       = 0.0009     0.09%              99.91%
8           P0 = e^-8  = 1/e^8  = 1/2980.95      = 0.0003     0.03%              99.97%
9           P0 = e^-9  = 1/e^9  = 1/8103.08      = 0.0001     0.01%              99.99%
10          P0 = e^-10 = 1/e^10 = 1/22026.5      = 0.000045   0.005%             99.995%

Note: for a given fold coverage, the % sequenced is independent of read length.
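
The first table can be recomputed directly from P0 = e^-c. The following is a minimal Python sketch; the function names are illustrative and not part of the original handout. It uses the unrounded exponential, so its values differ slightly from table entries that were rounded by hand.

```python
import math

def fraction_unsequenced(c):
    """Fraction of bases covered by no read at fold coverage c: P0 = e^-c."""
    return math.exp(-c)

def percent_sequenced(c):
    """Percent of the genome covered by at least one read."""
    return 100.0 * (1.0 - fraction_unsequenced(c))

# Print one row per fold coverage used in the table.
for c in (0.25, 0.5, 0.75, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10):
    p0 = fraction_unsequenced(c)
    print(f"{c:>5}  P0={p0:.6f}  "
          f"not sequenced={100 * p0:.4f}%  "
          f"sequenced={percent_sequenced(c):.4f}%")
```

Neither function takes a read length, which is the point of the note under the table: at a fixed fold coverage the fraction sequenced depends on c alone.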
genome size =         50kb       150kb       300kb         2Mb         4Mb         500Mb
Fold coverage       G e^-c      G e^-c      G e^-c      G e^-c      G e^-c        G e^-c
1                   18,500      55,500     111,000     740,000   1,480,000   185,000,000
2                    6,750      20,250      40,500     270,000     540,000    67,500,000
3                    2,500       7,500      15,000     100,000     200,000    25,000,000
4                      900       2,700       5,400      36,000      72,000     9,000,000
5                      335       1,005       2,010      13,400      26,800     3,350,000
6                      125         375         750       5,000      10,000     1,250,000
7                       45         135         270       1,800       3,600       450,000
8                       15          45          90         600       1,200       150,000
9                        5          15          30         200         400        50,000
10                       2           6          12          90         180        20,000
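
The total gap length G e^-c is the genome size multiplied by the unsequenced fraction. A short Python sketch under that reading is given here; the names GENOME_SIZES and total_gap_length are assumptions made for illustration. Because the table multiplies by a rounded e^-c (0.37 at 1-fold, for example), the exact values printed here are somewhat different.

```python
import math

# Genome sizes from the table header, in bases.
GENOME_SIZES = {
    "50kb": 50_000, "150kb": 150_000, "300kb": 300_000,
    "2Mb": 2_000_000, "4Mb": 4_000_000, "500Mb": 500_000_000,
}

def total_gap_length(genome_size, c):
    """Expected number of unsequenced bases: G * e^-c."""
    return genome_size * math.exp(-c)

# One row per fold coverage, one column per genome size.
for c in range(1, 11):
    row = [f"{total_gap_length(g, c):>13,.0f}" for g in GENOME_SIZES.values()]
    print(f"{c:>2}", *row)
```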
genome size =         50kb                 150kb                 300kb
Read Length =    400    500    600    400    500    600    400    500    600
Fold cov.
1                125    100     83    375    300    250    750    600    500
2                250    200    167    750    600    500   1500   1200   1000
3                375    300    250   1125    900    750   2250   1800   1500
4                500    400    333   1500   1200   1000   3000   2400   2000
5                625    500    417   1875   1500   1250   3750   3000   2500
6                750    600    500   2250   1800   1500   4500   3600   3000
7                875    700    583   2625   2100   1750   5250   4200   3500
8               1000    800    667   3000   2400   2000   6000   4800   4000
9               1125    900    750   3375   2700   2250   6750   5400   4500
10              1250   1000    833   3750   3000   2500   7500   6000   5000
genome size =              2Mb                           4Mb                          500Mb
Read Length =       400       500       600       400       500       600       400       500       600
Fold cov.
1                  5000      4000      3333     10000      8000      6667   1250000   1000000    833333
2                 10000      8000      6667     20000     16000     13333   2500000   2000000   1666667
3                 15000     12000     10000     30000     24000     20000   3750000   3000000   2500000
4                 20000     16000     13333     40000     32000     26667   5000000   4000000   3333333
5                 25000     20000     16667     50000     40000     33333   6250000   5000000   4166667
6                 30000     24000     20000     60000     48000     40000   7500000   6000000   5000000
7                 35000     28000     23333     70000     56000     46667   8750000   7000000   5833333
8                 40000     32000     26667     80000     64000     53333  10000000   8000000   6666667
9                 45000     36000     30000     90000     72000     60000  11250000   9000000   7500000
10                50000     40000     33333    100000     80000     66667  12500000  10000000   8333333
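
The read counts in the two tables above follow from the definition of fold coverage, c = N x L / G, rearranged to N = c x G / L. A minimal Python sketch of that arithmetic is shown here; reads_needed is an illustrative name, and rounding to the nearest whole read is an assumption about how the table was filled in.

```python
def reads_needed(genome_size, read_length, c):
    """Number of reads N giving fold coverage c: N = c * G / L."""
    return round(c * genome_size / read_length)

# Reproduce a few table entries.
print(reads_needed(50_000, 400, 1))        # 50 kb genome, 400-base reads, 1-fold
print(reads_needed(2_000_000, 500, 8))     # 2 Mb genome, 500-base reads, 8-fold
print(reads_needed(500_000_000, 600, 5))   # 500 Mb genome, 600-base reads, 5-fold
```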
genome size = 50kb
Read Length =             400                             500                             600
Fold Cov.          N      e^-c  #Gaps=Ne^-c        N      e^-c  #Gaps=Ne^-c        N      e^-c  #Gaps=Ne^-c
1                125      0.37           46      100      0.37           37       83      0.37           31
2                250     0.135           34      200     0.135           27      167     0.135           23
3                375      0.05           19      300      0.05           15      250      0.05           12
4                500     0.018            9      400     0.018            7      333     0.018            6
5                625    0.0067            4      500    0.0067            3      417    0.0067            3
6                750    0.0025            2      600    0.0025            2      500    0.0025            1
7                875    0.0009            1      700    0.0009            1      583    0.0009            1
8               1000    0.0003            0      800    0.0003            0      667    0.0003            0
9               1125    0.0001            0      900    0.0001            0      750    0.0001            0
10              1250  0.000045            0     1000  0.000045            0      833  0.000045            0
genome size = 150kb
Read Length =             400                             500                             600
Fold Cov.          N      e^-c  #Gaps=Ne^-c        N      e^-c  #Gaps=Ne^-c        N      e^-c  #Gaps=Ne^-c
1                375      0.37          139      300      0.37          111      250      0.37           93
2                750     0.135          101      600     0.135           81      500     0.135           68
3               1125      0.05           56      900      0.05           45      750      0.05           38
4               1500     0.018           27     1200     0.018           22     1000     0.018           18
5               1875    0.0067           13     1500    0.0067           10     1250    0.0067            8
6               2250    0.0025            6     1800    0.0025            5     1500    0.0025            4
7               2625    0.0009            2     2100    0.0009            2     1750    0.0009            2
8               3000    0.0003            1     2400    0.0003            1     2000    0.0003            1
9               3375    0.0001            0     2700    0.0001            0     2250    0.0001            0
10              3750  0.000045            0     3000  0.000045            0     2500  0.000045            0
genome size = 300kb
Read Length =             400                             500                             600
Fold Cov.          N      e^-c  #Gaps=Ne^-c        N      e^-c  #Gaps=Ne^-c        N      e^-c  #Gaps=Ne^-c
1                750      0.37          278      600      0.37          222      500      0.37          185
2               1500     0.135          203     1200     0.135          162     1000     0.135          135
3               2250      0.05          113     1800      0.05           90     1500      0.05           75
4               3000     0.018           54     2400     0.018           43     2000     0.018           36
5               3750    0.0067           25     3000    0.0067           20     2500    0.0067           17
6               4500    0.0025           11     3600    0.0025            9     3000    0.0025            8
7               5250    0.0009            5     4200    0.0009            4     3500    0.0009            3
8               6000    0.0003            2     4800    0.0003            2     4000    0.0003            1
9               6750    0.0001            1     5400    0.0001            1     4500    0.0001            0
10              7500  0.000045            0     6000  0.000045            0     5000  0.000045            0
genome size = 2Mb
Read Length =             400                             500                             600
Fold Cov.          N      e^-c  #Gaps=Ne^-c        N      e^-c  #Gaps=Ne^-c        N      e^-c  #Gaps=Ne^-c
1               5000      0.37         1850     4000      0.37         1480     3333      0.37         1233
2              10000     0.135         1350     8000     0.135         1080     6667     0.135          900
3              15000      0.05          750    12000      0.05          600    10000      0.05          500
4              20000     0.018          360    16000     0.018          288    13333     0.018          240
5              25000    0.0067          168    20000    0.0067          134    16667    0.0067          112
6              30000    0.0025           75    24000    0.0025           60    20000    0.0025           50
7              35000    0.0009           32    28000    0.0009           25    23333    0.0009           21
8              40000    0.0003           12    32000    0.0003           10    26667    0.0003            8
9              45000    0.0001            5    36000    0.0001            4    30000    0.0001            3
10             50000  0.000045            2    40000  0.000045            2    33333  0.000045            1
genome size = 4Mb
Read Length =             400                             500                             600
Fold Cov.          N      e^-c  #Gaps=Ne^-c        N      e^-c  #Gaps=Ne^-c        N      e^-c  #Gaps=Ne^-c
1              10000      0.37         3700     8000      0.37         2960     6667      0.37         2467
2              20000     0.135         2700    16000     0.135         2160    13333     0.135         1800
3              30000      0.05         1500    24000      0.05         1200    20000      0.05         1000
4              40000     0.018          720    32000     0.018          576    26667     0.018          480
5              50000    0.0067          335    40000    0.0067          268    33333    0.0067          223
6              60000    0.0025          150    48000    0.0025          120    40000    0.0025          100
7              70000    0.0009           63    56000    0.0009           50    46667    0.0009           42
8              80000    0.0003           24    64000    0.0003           19    53333    0.0003           16
9              90000    0.0001            9    72000    0.0001            7    60000    0.0001            6
10            100000  0.000045            5    80000  0.000045            4    66667  0.000045            3
genome size = 500Mb
Read Length =               400                               500                               600
Fold Cov.            N      e^-c  #Gaps=Ne^-c          N      e^-c  #Gaps=Ne^-c          N      e^-c  #Gaps=Ne^-c
1              1250000      0.37       462500    1000000      0.37       370000     833333      0.37       308375
2              2500000     0.135       337500    2000000     0.135       270000    1666667     0.135       225000
3              3750000      0.05       187500    3000000      0.05       150000    2500000      0.05       125000
4              5000000     0.018        90000    4000000     0.018        72000    3333333     0.018        60000
5              6250000    0.0067        41875    5000000    0.0067        33500    4166667    0.0067        27875
6              7500000    0.0025        18750    6000000    0.0025        15000    5000000    0.0025        12500
7              8750000    0.0009         8125    7000000    0.0009         6250    5833333    0.0009         5250
8             10000000    0.0003         3000    8000000    0.0003         2375    6666667    0.0003         2000
9             11250000    0.0001         1125    9000000    0.0001          875    7500000    0.0001          750
10            12500000  0.000045          625   10000000  0.000045          500    8333333  0.000045          375
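
The gap counts in the tables above combine the two earlier quantities: the number of reads N = c x G / L and the factor e^-c. A small Python sketch of #Gaps = N e^-c follows; expected_gaps is an illustrative name, and the function returns the unrounded expectation, whereas the tables use rounded values of N and e^-c.

```python
import math

def expected_gaps(genome_size, read_length, c):
    """Expected number of gaps: N * e^-c, with N = c * G / L reads."""
    n_reads = c * genome_size / read_length
    return n_reads * math.exp(-c)

# Gaps for a 2 Mb genome sequenced with 500-base reads.
for c in range(1, 11):
    print(f"{c:>2}-fold: {expected_gaps(2_000_000, 500, c):10.1f} gaps")
```

For a fixed fold coverage, longer reads mean fewer reads and therefore fewer gaps, which is the pattern seen across the 400, 500 and 600 columns.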
genome size = 50kb
                                    From Table 2:        From Table 4:
fold       Total bases              total gap length     Number of        Gap Length/# gaps =   %
coverage   sequenced     e^-c       in bases = G e^-c    Gaps = N e^-c    # bases per gap       complete
1          50000         0.37       18,500               37               500                   63
2          100000        0.135      6,750                27               250                   86.5
3          150000        0.05       2,500                15               167                   95
4          200000        0.018      900                  7                129                   98.2
5          250000        0.0067     335                  3                112                   99.33
6          300000        0.0025     125                  2                63                    99.75
7          350000        0.0009     45                   1                45                    99.91
8          400000        0.0003     15                   1                15                    99.97
9          450000        0.0001     5                    1                5                     99.99
10         500000        0.000045   2                    1                2                     99.995
genome size = 150kb
                                    From Table 2:        From Table 4:
fold       Total bases              total gap length     Number of        Gap Length/# gaps =   %
coverage   sequenced     e^-c       in bases = G e^-c    Gaps = N e^-c    # bases per gap       complete
1          150000        0.37       55,500               111              500                   63
2          300000        0.135      20,250               81               250                   86.5
3          450000        0.05       7,500                45               167                   95
4          600000        0.018      2,700                22               123                   98.2
5          750000        0.0067     1,005                10               101                   99.33
6          900000        0.0025     375                  5                75                    99.75
7          1050000       0.0009     135                  2                68                    99.91
8          1200000       0.0003     45                   1                45                    99.97
9          1350000       0.0001     15                   1                15                    99.99
10         1500000       0.000045   6                    1                6                     99.995
genome size = 300kb
                                    From Table 2:        From Table 4:
fold       Total bases              total gap length     Number of        Gap Length/# gaps =   %
coverage   sequenced     e^-c       in bases = G e^-c    Gaps = N e^-c    # bases per gap       complete
1          300000        0.37       111,000              222              500                   63
2          600000        0.135      40,500               162              250                   86.5
3          900000        0.05       15,000               90               167                   95
4          1200000       0.018      5,400                43               126                   98.2
5          1500000       0.0067     2,010                20               101                   99.33
6          1800000       0.0025     750                  9                83                    99.75
7          2100000       0.0009     270                  4                68                    99.91
8          2400000       0.0003     90                   2                45                    99.97
9          2700000       0.0001     30                   1                30                    99.99
10         3000000       0.000045   12                   1                12                    99.995
genome size = 2Mb
                                    From Table 2:        From Table 4:
fold       Total bases              total gap length     Number of        Gap Length/# gaps =   %
coverage   sequenced     e^-c       in bases = G e^-c    Gaps = N e^-c    # bases per gap       complete
1          2000000       0.37       740,000              1480             500                   63
2          4000000       0.135      270,000              1080             250                   86.5
3          6000000       0.05       100,000              600              167                   95
4          8000000       0.018      36,000               288              125                   98.2
5          10000000      0.0067     13,400               134              100                   99.33
6          12000000      0.0025     5,000                60               83                    99.75
7          14000000      0.0009     1,800                25               72                    99.91
8          16000000      0.0003     600                  10               60                    99.97
9          18000000      0.0001     200                  4                50                    99.99
10         20000000      0.000045   90                   2                45                    99.995
genome size = 4Mb
                                    From Table 2:        From Table 4:
fold       Total bases              total gap length     Number of        Gap Length/# gaps =   %
coverage   sequenced     e^-c       in bases = G e^-c    Gaps = N e^-c    # bases per gap       complete
1          4000000       0.37       1,480,000            2960             500                   63
2          8000000       0.135      540,000              2160             250                   86.5
3          12000000      0.05       200,000              1200             167                   95
4          16000000      0.018      72,000               576              125                   98.2
5          20000000      0.0067     26,800               268              100                   99.33
6          24000000      0.0025     10,000               120              83                    99.75
7          28000000      0.0009     3,600                50               72                    99.91
8          32000000      0.0003     1,200                19               63                    99.97
9          36000000      0.0001     400                  7                57                    99.99
10         40000000      0.000045   180                  4                45                    99.995
genome size = 500Mb
                                    From Table 2:        From Table 4:    Gap Length /     (genome - gap length) /
fold       Total bases              total gap length     Number of        # gaps           # contigs                 %
coverage   sequenced     e^-c       in bases = G e^-c    Gaps = N e^-c    = # bases/gap    = contig length           complete
1          500000000     0.37       185,000,000          370000           500              851                       63
2          1000000000    0.135      67,500,000           270000           250              1602                      86.5
3          1500000000    0.05       25,000,000           150000           167              3167                      95
4          2000000000    0.018      9,000,000            72000            125              6819                      98.2
5          2500000000    0.0067     3,350,000            33500            100              14825                     99.33
6          3000000000    0.0025     1,250,000            15000            83               33250                     99.75
7          3500000000    0.0009     450,000              6250             72               79928                     99.91
8          4000000000    0.0003     150,000              2375             63               210463                    99.97
9          4500000000    0.0001     50,000               875              57               571371                    99.99
10         5000000000    0.000045   20,000               500              40               999960                    99.995
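
Dividing the two table formulas gives a useful shortcut. The average gap length is G e^-c / (N e^-c) = G / N = L / c, so it depends only on read length and fold coverage, not on genome size. If the number of contigs is taken to equal the number of gaps, as the last table does, the average contig length is (G - G e^-c) / (N e^-c) = (L / c)(e^c - 1). The Python sketch below works through both; the function names are illustrative, and the results are exact expectations, so they differ from table entries built from rounded gap counts.

```python
import math

def mean_gap_length(read_length, c):
    """Average bases per gap: G e^-c / (N e^-c) = L / c."""
    return read_length / c

def mean_contig_length(read_length, c):
    """Average contig length, assuming # contigs = # gaps:
    (G - G e^-c) / (N e^-c) = (L / c) * (e^c - 1)."""
    return (read_length / c) * math.expm1(c)

# 500-base reads, matching the gap counts used in the summary tables.
for c in range(1, 11):
    print(f"{c:>2}-fold: gap ~ {mean_gap_length(500, c):6.1f} bases, "
          f"contig ~ {mean_contig_length(500, c):12.0f} bases")
```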

Results from physical mapping projects have recently been reported for the genomes of Escherichia coli,
Saccharomyces cerevisiae, and Caenorhabditis elegans, and similar projects are currently being planned for
other organisms. In such projects, the physical map is assembled by first "fingerprinting" a large number of
clones chosen at random from a recombinant library and then inferring overlaps between clones with
sufficiently similar fingerprints. Although the basic approach is the same, there are many possible choices for
the fingerprint used to characterize the clones and the rules for declaring overlap. In this paper, we derive
simple formulas showing how the progress of a physical mapping project is affected by the nature of the
fingerprinting scheme. Using these formulas, we discuss the analytic considerations involved in selecting an
appropriate fingerprinting scheme for a particular project.
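
As an illustration of the kind of formula this abstract refers to, the Lander-Waterman analysis gives the expected number of islands as N e^(-c(1 - theta)), where theta is the fraction of a clone's length that must overlap for the overlap to be detected, and the expected number of islands containing at least two clones as N e^(-c(1 - theta)) - N e^(-2c(1 - theta)). The Python sketch below evaluates both; the function names are assumptions made for illustration. With theta = 0 the first expression reduces to the N e^-c used for the gap counts in the tables above.

```python
import math

def expected_islands(n_clones, c, theta):
    """Expected number of islands: N * e^(-c * (1 - theta))."""
    return n_clones * math.exp(-c * (1.0 - theta))

def expected_multi_clone_islands(n_clones, c, theta):
    """Expected number of islands with at least two clones:
    N * e^(-c * (1 - theta)) - N * e^(-2 * c * (1 - theta))."""
    sigma = 1.0 - theta
    return n_clones * (math.exp(-c * sigma) - math.exp(-2.0 * c * sigma))

# Effect of the required overlap fraction at 5-fold coverage, N = 20000.
for theta in (0.0, 0.25, 0.5):
    print(theta,
          round(expected_islands(20_000, 5, theta)),
          round(expected_multi_clone_islands(20_000, 5, theta)))
```

A larger theta (a less sensitive fingerprint) leaves more islands at the same coverage, which is why the choice of fingerprinting scheme affects how quickly a map closes.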

A complete physical map of the DNA of an organism, consisting of overlapping clones spanning the genome,
is an extremely useful tool for genomic analysis. Various methods for the construction of such physical maps
are available. One approach is to assemble the physical map by "fingerprinting" a large number of random
clones and inferring overlap between clones with sufficiently similar fingerprints. E.S. Lander and M.S.
Waterman (1988, Genomics 2:231-239) have recently provided a mathematical analysis of such physical
mapping schemes, useful for planning such a project. Another approach is to assemble the physical map by
"anchoring" a large number of random clones--that is, by taking random short regions called anchors and
identifying the clones containing each anchor. Here, we provide a mathematical analysis of such a physical
mapping scheme.

Physical maps can be constructed by "fingerprinting" a large number of random clones and inferring overlap
between clones when the fingerprints are sufficiently similar. E. Lander and M. Waterman (Genomics 2:
231-239, 1988) gave a mathematical analysis of such mapping strategies. The analysis is useful for comparing
various fingerprinting methods. Recently it has been proposed that ends of clones rather than the entire clone be
fingerprinted or characterized. Such fingerprints, which include sequenced clone ends, require a mathematical
analysis deeper than that of Lander-Waterman. This paper studies clone islands, which can include
uncharacterized regions, and also the islands that are formed entirely from the ends of clones.

Several recent mapping efforts have used so-called "directed" approaches to construct their maps. However,
most, but not all, published methods for modeling the progress in physical mapping projects have been focused
on random approaches, such as bottom-up fingerprinting and STS-content mapping. In addition, those few
efforts that did model directed approaches used methods that required assuming that all insert lengths were the
same. This assumption is unnecessary. Using properties of stationary processes, one can derive simple
asymptotic formulas that apply equally to constant and variable clone lengths. Also, in the case of constant
clone lengths, these results are equivalent to, and extend, those published results for directed mapping derived
by other methods. Simulations show that these methods provide estimates well within the limits of uncertainty
inherent in any mapping project.
Bruce Roe, broe@ou.edu